Automated

Introduction

This guide covers setting up your Composabl training cluster using Pulumi, an Infrastructure as Code tool.

This example uses Azure Kubernetes Service (AKS), but it can be adapted to other supported providers.

Prerequisites

  1. An Azure subscription with sufficient permissions to create and update various resources

  2. If you're following along in TypeScript, a working installation of Node.js

  3. A new Pulumi project, as per the Pulumi documentation. You can find the documentation for Azure here

Overview

We will be deploying the following resources to your Azure subscription:

  1. Resource group, containing all resources

  2. A container registry, to hold simulator images

  3. An AKS cluster

Resource group

The resource group will contain all resources. It is also what determines in what Azure location the resources will be deployed.

import * as pulumi from "@pulumi/pulumi";
import * as resources from "@pulumi/azure-native/resources/index.js";

const resourceGroup = new resources.ResourceGroup('my-resource-group-', {
  location: 'eastus'
});

export const rgName = pulumi.interpolate`${resourceGroup.name}`;

At the end, we export the name of the resource group (Pulumi appends a random suffix to it) for further use in our definition.
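If you prefer not to hard-code the location, you can read it from Pulumi stack configuration instead. A minimal sketch (the `location` config key is an assumption, not part of the original definition):

```typescript
import * as pulumi from "@pulumi/pulumi";

// Read the location from stack configuration, falling back to eastus.
// Set it per stack with: pulumi config set location westeurope
const config = new pulumi.Config();
const location = config.get("location") ?? "eastus";
```

You can then pass `location` to the ResourceGroup above instead of the `'eastus'` literal, letting each stack (dev, prod) deploy to its own region.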

Container registry

The container registry is where you will be able to privately store your simulator docker images, if any.

import * as containerregistry from "@pulumi/azure-native/containerregistry/index.js";

const registry = new containerregistry.Registry("registry", {
  resourceGroupName: resourceGroup.name,
  sku: {
    name: "Basic",
  },
  adminUserEnabled: true, // enables docker login with the registry's admin username and password
});

export const registryName = pulumi.interpolate`${registry.name}`;
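Since `adminUserEnabled` is set, you can also export the admin credentials for use with `docker login`. A hedged sketch using the azure-native `listRegistryCredentialsOutput` function (verify the exact shape against your @pulumi/azure-native version):

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as containerregistry from "@pulumi/azure-native/containerregistry/index.js";

// Look up the admin credentials of the registry defined above
const credentials = containerregistry.listRegistryCredentialsOutput({
  resourceGroupName: resourceGroup.name,
  registryName: registry.name,
});

export const adminUsername = credentials.username;
// Mark the password as a secret so Pulumi encrypts it in the stack state
export const adminPassword = pulumi.secret(
  credentials.passwords.apply((pw) => pw![0].value!)
);
```

Retrieve them later with `pulumi stack output adminPassword --show-secrets`.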

Kubernetes Cluster

The cluster is where both the Composabl components and your training will be running. This configuration is more complex, so additional information is provided as comments in the TypeScript definition:
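The cluster definition references `appMiAKS`, a user-assigned managed identity that the cluster runs as. A minimal sketch of creating one (the resource name `aks-identity` is an assumption):

```typescript
import * as managedidentity from "@pulumi/azure-native/managedidentity/index.js";

// User-assigned managed identity the AKS cluster will be assigned
const appMiAKS = new managedidentity.UserAssignedIdentity("aks-identity", {
  resourceGroupName: resourceGroup.name,
});
```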

import * as containerservice from "@pulumi/azure-native/containerservice/index.js";

const k8sCluster = new containerservice.ManagedCluster("aks", {
  resourceGroupName: resourceGroup.name, // Here, we reference the resourceGroup we created earlier
  location: resourceGroup.location,

  dnsPrefix: "composabl-aks",
  kubernetesVersion: "1.31.1", // you can get supported versions using the Azure CLI: az aks get-versions -l <location> -o table - replace <location> with the location you set in your resourcegroup.
  enableRBAC: true,

  // Assign a managed identity to the cluster
  identity: {
    type: "UserAssigned",
    userAssignedIdentities: [appMiAKS.id],
  },

  // Configure 3 pools
  // 1. Main (the kubernetes control plane nodes)
  // 2. Train (Composabl system components and training workers)
  // 3. Sims (Composabl simulators)
  agentPoolProfiles: [
    // The Main pool has 3 small nodes to act as a control plane
    {
      name: "main",
      count: 3,
      vmSize: "Standard_B2s", // (2 core, 4GB RAM, 0.041/hour)
      osType: "Linux",
      osSKU: "Ubuntu",
      mode: "System",
    }
  ],
  sku: {
    name: "Base",
    tier: "Standard"
  },
  // This is an optional part, unless using very large clusters with several 100s of nodes.
  networkProfile: {
    networkPlugin: "azure",
    networkPolicy: "calico",
  }
});

// the "Composabl" agent pool will run the composabl system components (Controller, Historian)
const composablPool = new containerservice.AgentPool("composabl",
  {
    resourceGroupName: resourceGroup.name,
    resourceName: k8sCluster.name,
    agentPoolName: "composabl",
    count: 1,
    vmSize: "Standard_D4s_v3", // (4 core, 16GB RAM)
    osType: "Linux",
    osSKU: "Ubuntu",
    mode: "System",
    osDiskSizeGB: 100,
    osDiskType: "Premium_LRS",
  },
  { replaceOnChanges: ["vmSize"] }
);

// The env runners will run the part of the SDK that gathers data from the simulators.
// If GPU training is disabled, all training will happen on these nodes as well.
const envrunnersPool = new containerservice.AgentPool("envrunners",
  {
    resourceGroupName: resourceGroup.name,
    resourceName: k8sCluster.name,
    agentPoolName: "envrunners",
    vmSize: "Standard_D8d_v4",
    count: 1,
    minCount: 1,
    maxCount: 10,
    enableAutoScaling: true,
    osType: "Linux",
    osSKU: "Ubuntu",
  },
  { replaceOnChanges: ["vmSize"] }
);

// The Sims-CPU pool will run all simulator instances
const simsCpuPool = new containerservice.AgentPool("simscpu",
  {
    resourceGroupName: resourceGroup.name,
    resourceName: k8sCluster.name,
    agentPoolName: "simscpu",
    vmSize: "Standard_D8d_v4",
    count: 2,
    minCount: 2,
    maxCount: 1000,
    enableAutoScaling: true,
    osType: "Linux",
    osSKU: "Ubuntu",
  },
  { replaceOnChanges: ["vmSize"] }
);
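To interact with the cluster using kubectl, you can export its kubeconfig from the stack. A sketch using azure-native's `listManagedClusterUserCredentialsOutput` (the credential comes back base64-encoded and is worth marking as a secret):

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as containerservice from "@pulumi/azure-native/containerservice/index.js";

// Fetch user credentials for the cluster defined above
const creds = containerservice.listManagedClusterUserCredentialsOutput({
  resourceGroupName: resourceGroup.name,
  resourceName: k8sCluster.name,
});

// Decode the base64 kubeconfig and keep it encrypted in the stack state
export const kubeconfig = pulumi.secret(
  creds.kubeconfigs[0].value.apply((v) =>
    Buffer.from(v, "base64").toString("utf-8")
  )
);
```

Write it to a file with `pulumi stack output kubeconfig --show-secrets > kubeconfig.yaml` and point `KUBECONFIG` at it.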

GPU Training and simulators

If you want to enable GPU training and GPU-enhanced simulators, you will also need to add the following pools.

In addition, you will also need to install the NVIDIA GPU Operator (nvidia-gpu-operator) on the cluster, following the instructions on the project website.

Finally, GPU_ENABLED must be set to true on the Composabl controller deployment, if it hasn't been already.

// The learners will run the learning part of the training, accelerated by GPU
const learnersPool = new containerservice.AgentPool("learners",
  {
    resourceGroupName: resourceGroup.name,
    resourceName: k8sCluster.name,
    agentPoolName: "learners",
    vmSize: "Standard_NC4as_T4_v3", // (4vCPU, 28GB RAM, 1GPU (Nvidia Tesla T4), 0.0570$/hour)
    count: 1,
    minCount: 1,
    maxCount: 10,
    enableAutoScaling: true,
    osType: "Linux",
    osSKU: "Ubuntu",
    osDiskSizeGB: 100,
    osDiskType: "Premium_LRS",
  },
  { replaceOnChanges: ["vmSize"] }
);

// Optional - if you also want to run Simulators on machines with GPUs, provision this pool as well:
const simsGpuPool = new containerservice.AgentPool("simsgpu",
  {
    resourceGroupName: resourceGroup.name,
    resourceName: k8sCluster.name,
    agentPoolName: "simsgpu",
    vmSize: "Standard_NC4as_T4_v3", // (4vCPU, 28GB RAM, 1GPU (Nvidia Tesla T4), 0.0570$/hour)
    count: 1,
    minCount: 1,
    maxCount: 10,
    enableAutoScaling: true,
    osType: "Linux",
    osSKU: "Ubuntu",
  },
  { replaceOnChanges: ["vmSize"] }
);

Notes:

  1. Autoscaling:

    • This template enables autoscaling to have the cluster automatically scale to the required size and back down afterward to reduce costs.

    • You can disable autoscaling by removing the minCount, maxCount and enableAutoScaling properties, but you'll have to set the count value accordingly.

  2. vmSize: The vmSizes used above can be adjusted to instance types that better match your needs.
