The cluster is where both the Composabl components and your training workloads will run. This configuration is more complex, so additional information is provided as comments in the TypeScript definition:
import * as containerservice from "@pulumi/azure-native/containerservice/index.js";

const k8sCluster = new containerservice.ManagedCluster("aks", {
  resourceGroupName: resourceGroup.name, // Here, we reference the resourceGroup we created earlier
  location: resourceGroup.location,
  dnsPrefix: "composabl-aks",
  // You can get supported versions using the Azure CLI: az aks get-versions -l <location> -o table
  // (replace <location> with the location you set in your resource group)
  kubernetesVersion: "1.31.1",
  enableRBAC: true,
  // Assign a managed identity to the cluster
  identity: {
    type: "UserAssigned",
    userAssignedIdentities: [appMiAKS.id],
  },
  // Configure 3 pools:
  // 1. Main (the Kubernetes system node pool)
  // 2. Train (Composabl system components and training workers)
  // 3. Sims (Composabl simulators)
  agentPoolProfiles: [
    // The main pool has 3 small nodes that act as the system node pool
    {
      name: "main",
      count: 3,
      vmSize: "Standard_B2s", // (2 cores, 4GB RAM, $0.041/hour)
      osType: "Linux",
      osSKU: "Ubuntu",
      mode: "System",
    },
  ],
  sku: {
    name: "Base",
    tier: "Standard",
  },
  // This part is optional unless you are running very large clusters with several hundred nodes.
  networkProfile: {
    networkPlugin: "azure",
    networkPolicy: "calico",
  },
});

// The "composabl" agent pool will run the Composabl system components (Controller, Historian)
const composablPool = new containerservice.AgentPool("composabl", {
  resourceGroupName: resourceGroup.name,
  resourceName: k8sCluster.name,
  agentPoolName: "composabl",
  count: 1,
  vmSize: "Standard_D4s_v3", // (4 cores, 16GB RAM)
  osType: "Linux",
  osSKU: "Ubuntu",
  mode: "System",
  osDiskSizeGB: 100,
  osDiskType: "Premium_LRS",
}, { replaceOnChanges: ["vmSize"] });

// The env runners run the part of the SDK that handles data gathering from the simulators.
// If GPU training is disabled, all training will happen on these nodes as well.
const envrunnersPool = new containerservice.AgentPool("envrunners", {
  resourceGroupName: resourceGroup.name,
  resourceName: k8sCluster.name,
  agentPoolName: "envrunners",
  vmSize: "Standard_D8d_v4",
  count: 1,
  minCount: 1,
  maxCount: 10,
  enableAutoScaling: true,
  osType: "Linux",
  osSKU: "Ubuntu",
}, { replaceOnChanges: ["vmSize"] });

// The sims-cpu pool will run all simulator instances
const simsCpuPool = new containerservice.AgentPool("simscpu", {
  resourceGroupName: resourceGroup.name,
  resourceName: k8sCluster.name,
  agentPoolName: "simscpu",
  vmSize: "Standard_D8d_v4",
  count: 2,
  minCount: 2,
  maxCount: 1000,
  enableAutoScaling: true,
  osType: "Linux",
  osSKU: "Ubuntu",
}, { replaceOnChanges: ["vmSize"] });
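If you want to interact with the cluster directly (for example with kubectl, or to install add-ons from the same Pulumi program), you can expose the cluster's kubeconfig as a stack output. The snippet below is an optional sketch that is not part of the original template; it assumes the resourceGroup and k8sCluster resources defined above and the @pulumi/pulumi package.

import * as pulumi from "@pulumi/pulumi";

// Optional sketch: fetch the cluster's user credentials and expose the kubeconfig
// as a secret stack output (e.g. for kubectl or a Pulumi Kubernetes provider).
const creds = containerservice.listManagedClusterUserCredentialsOutput({
  resourceGroupName: resourceGroup.name,
  resourceName: k8sCluster.name,
});

// The credentials are returned base64-encoded; decode them before use.
export const kubeconfig = pulumi.secret(
  creds.kubeconfigs[0].value.apply((encoded) => Buffer.from(encoded, "base64").toString("utf-8"))
);

You can then retrieve it locally with pulumi stack output kubeconfig --show-secrets.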
GPU Training and simulators
If you want to enable GPU training and GPU-enhanced simulators, you will need to add the following node pools as well.
You will also need to install the NVIDIA GPU Operator (nvidia-gpu-operator) on the cluster. This can be done according to the instructions on the project website; a Pulumi-based sketch is also shown after the pool definitions below.
Finally, GPU_ENABLED must be set to true on the Composabl controller deployment, if it hasn't been already.
// The learners will run the learning part of the training, accelerated by GPU
const learnersPool = new containerservice.AgentPool("learners", {
  resourceGroupName: resourceGroup.name,
  resourceName: k8sCluster.name,
  agentPoolName: "learners",
  vmSize: "Standard_NC4as_T4_v3", // (4 vCPUs, 28GB RAM, 1 GPU (NVIDIA Tesla T4), $0.0570/hour)
  count: 1,
  minCount: 1,
  maxCount: 10,
  enableAutoScaling: true,
  osType: "Linux",
  osSKU: "Ubuntu",
  osDiskSizeGB: 100,
  osDiskType: "Premium_LRS",
}, { replaceOnChanges: ["vmSize"] });

// Optional - if you also want to run simulators on machines with GPUs, provision this pool as well:
const simsGpuPool = new containerservice.AgentPool("simsgpu", {
  resourceGroupName: resourceGroup.name,
  resourceName: k8sCluster.name,
  agentPoolName: "simsgpu",
  vmSize: "Standard_NC4as_T4_v3", // (4 vCPUs, 28GB RAM, 1 GPU (NVIDIA Tesla T4), $0.0570/hour)
  count: 1,
  minCount: 1,
  maxCount: 10,
  enableAutoScaling: true,
  osType: "Linux",
  osSKU: "Ubuntu",
}, { replaceOnChanges: ["vmSize"] });
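The NVIDIA GPU Operator is usually installed through its Helm chart. Below is a minimal sketch of driving that installation from the same Pulumi program; it assumes the @pulumi/kubernetes package and the decoded kubeconfig output sketched earlier. The chart name, repository URL and namespace follow the GPU Operator's public Helm instructions; check the project website for the current values.

import * as k8s from "@pulumi/kubernetes";

// Sketch (assumption): a Kubernetes provider pointed at the AKS cluster,
// reusing the decoded kubeconfig from the earlier example.
const k8sProvider = new k8s.Provider("aks", { kubeconfig });

// Install the NVIDIA GPU Operator via its Helm chart into a dedicated namespace.
const gpuOperator = new k8s.helm.v3.Release("gpu-operator", {
  chart: "gpu-operator",
  repositoryOpts: { repo: "https://helm.ngc.nvidia.com/nvidia" },
  namespace: "gpu-operator",
  createNamespace: true,
}, { provider: k8sProvider });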
Notes:
Autoscaling:
This template enables autoscaling so that the cluster automatically scales up to the required size and back down afterwards to reduce costs.
You can disable autoscaling by removing the minCount, maxCount and enableAutoScaling properties, but you will then have to set the count value accordingly (a fixed-size variant is sketched after these notes).
vmSize: The vmSize values used above can be adjusted to instance types that better match your needs.
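As an illustrative sketch (not part of the original template), this is what the simscpu pool would look like with autoscaling disabled; it replaces the earlier simsCpuPool definition rather than being added alongside it:

// Illustrative sketch: the simscpu pool with a fixed size.
// minCount, maxCount and enableAutoScaling are removed; count sets the permanent pool size.
const simsCpuPool = new containerservice.AgentPool("simscpu", {
  resourceGroupName: resourceGroup.name,
  resourceName: k8sCluster.name,
  agentPoolName: "simscpu",
  vmSize: "Standard_D8d_v4",
  count: 10, // pick a value large enough for your expected simulator load
  osType: "Linux",
  osSKU: "Ubuntu",
}, { replaceOnChanges: ["vmSize"] });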