Set up Kubernetes
You can use W&B Launch to execute W&B Launch Jobs as a Kubernetes Job or Custom workload resource in a Kubernetes cluster. This is particularly useful if you want to use Kubernetes to manage your compute cluster and want a simple interface to execute training, transformation, or ML workloads in your cluster.
W&B maintains an official launch agent image that you can deploy to your cluster with a Helm chart managed by W&B.
It is possible to start a launch agent outside of a Kubernetes cluster. However, we recommend that you deploy the launch agent directly into the Kubernetes cluster that is responsible for running the Kubernetes Job or Custom workload.
The launch agent submits workloads to the cluster specified by the current Kubernetes cluster context.
W&B uses the Kaniko builder to enable the launch agent to build Docker images in a Kubernetes cluster. To learn how to set up Kaniko for the launch agent, see Advanced agent set up.
Configure a queue for Kubernetes
The launch queue configuration for a Kubernetes target resource will resemble either a Kubernetes Job spec or a Kubernetes Custom Resource spec. You can control any aspect of the Kubernetes workload resource spec when you create a launch queue.
- Kubernetes Job Spec
- Custom Resource Spec
spec:
  template:
    spec:
      containers:
        - env:
            - name: MY_ENV_VAR
              value: some-value
          resources:
            requests:
              cpu: 1000m
              memory: 1Gi
metadata:
  labels:
    queue: k8s-test
namespace: wandb
In some use cases, you might want to use CustomResource definitions. CustomResource definitions are useful if, for example, you want to perform multi-node distributed training. See the tutorial for using Launch with multinode jobs using Volcano for an example application. Another use case might be that you want to use W&B Launch with Kubeflow.
The following YAML snippet shows a sample launch queue config that uses Kubeflow:
kubernetes:
  kind: PyTorchJob
  spec:
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        template:
          spec:
            containers:
              - name: pytorch
                image: '${image_uri}'
                imagePullPolicy: Always
        restartPolicy: Never
      Worker:
        replicas: 2
        template:
          spec:
            containers:
              - name: pytorch
                image: '${image_uri}'
                imagePullPolicy: Always
        restartPolicy: Never
    ttlSecondsAfterFinished: 600
  metadata:
    name: '${run_id}-pytorch-job'
  apiVersion: kubeflow.org/v1
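The `${image_uri}` and `${run_id}` values in the config above are placeholders that are filled in when a job runs. The following Python sketch only illustrates the idea of that substitution with the standard library's `string.Template`; it is not W&B's implementation, and the helper name `expand_macros`, the config fragment, and the example values are all hypothetical.

```python
from string import Template

# Hypothetical fragment of a queue config that uses the two
# placeholders from the Kubeflow example.
queue_config_fragment = {
    "metadata": {"name": "${run_id}-pytorch-job"},
    "image": "${image_uri}",
}


def expand_macros(value, variables):
    """Recursively substitute ${name} placeholders in all strings."""
    if isinstance(value, str):
        # safe_substitute leaves unknown placeholders untouched
        # instead of raising an error.
        return Template(value).safe_substitute(variables)
    if isinstance(value, dict):
        return {k: expand_macros(v, variables) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_macros(v, variables) for v in value]
    return value


# Example values, made up for illustration.
variables = {
    "run_id": "abc123",
    "image_uri": "registry.example.com/train:latest",
}
expanded = expand_macros(queue_config_fragment, variables)
print(expanded["metadata"]["name"])
```

Because the placeholder is part of the resource name, each run would get a distinct name in this sketch, which is one reason to template `metadata.name` rather than hard-code it.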
For security reasons, W&B will inject the following resources into your launch queue if they are not specified:
securityContext
backOffLimit
ttlSecondsAfterFinished
The following YAML snippet demonstrates how these values will appear in your launch queue:
spec:
  template:
    backOffLimit: 0
    ttlSecondsAfterFinished: 60
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
      seccompProfile:
        type: "RuntimeDefault"
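The "only if not specified" behavior can be pictured as a merge in which user-provided values always win. The following Python sketch is an illustration under assumptions, not W&B's code: the function name `inject_defaults` is hypothetical, and the nesting of the defaults simply mirrors the snippet above.

```python
import copy

# Defaults taken from the snippet above. Placing them under
# spec.template follows that snippet and is an assumption.
DEFAULTS = {
    "backOffLimit": 0,
    "ttlSecondsAfterFinished": 60,
    "securityContext": {
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"},
    },
}


def inject_defaults(queue_config):
    """Return a copy of queue_config with missing defaults filled in."""
    result = copy.deepcopy(queue_config)
    template = result.setdefault("spec", {}).setdefault("template", {})
    for key, value in DEFAULTS.items():
        # setdefault writes only when the key is absent,
        # so values set by the user are never overwritten.
        template.setdefault(key, copy.deepcopy(value))
    return result


# Hypothetical queue config that overrides one of the defaults.
user_config = {"spec": {"template": {"ttlSecondsAfterFinished": 600}}}
merged = inject_defaults(user_config)
print(merged["spec"]["template"]["ttlSecondsAfterFinished"])
print(merged["spec"]["template"]["backOffLimit"])
```

In this sketch the user's `ttlSecondsAfterFinished` is kept, while the two keys the user left out are added with the default values.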
Create a queue
Create a queue in the W&B App that uses Kubernetes as its compute resource:
- Navigate to the Launch page.
- Click on the Create Queue button.
- Select the Entity you would like to create the queue in.
- Provide a name for your queue in the Name field.
- Select Kubernetes as the Resource.
- Within the Configuration field, provide the Kubernetes Job workflow spec or Custom Resource spec you configured in the previous section.
Configure a launch agent with helm
Use the helm chart provided by W&B to deploy the launch agent into your Kubernetes cluster. Control the behavior of the launch agent with the values.yaml file.
Specify the contents that would normally be defined in your launch agent config file (~/.config/wandb/launch-config.yaml) within the launchConfig key in the values.yaml file.
For example, suppose you have a launch agent config that lets you run a launch agent in EKS with the Kaniko Docker image builder:
queues:
  - <queue name>
max_jobs: <n concurrent jobs>
environment:
  type: aws
  region: us-east-1
registry:
  type: ecr
  uri: <my-registry-uri>
builder:
  type: kaniko
  build-context-store: <s3-bucket-uri>
Within your values.yaml file, this might look like:
agent:
  labels: {}
  # W&B API key.
  apiKey: ""
  # Container image to use for the agent.
  image: wandb/launch-agent-dev:latest
  # Image pull policy for agent image.
  imagePullPolicy: Always
  # Resources block for the agent spec.
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
# Namespace to deploy launch agent into
namespace: wandb
# W&B api url (Set yours here)
baseUrl: https://api.wandb.ai
# Additional target namespaces that the launch agent can deploy into
additionalTargetNamespaces:
  - default
  - wandb
# This should be set to the literal contents of your launch agent config.
launchConfig: |
  queues:
    - <queue name>
  max_jobs: <n concurrent jobs>
  environment:
    type: aws
    region: <aws-region>
  registry:
    type: ecr
    uri: <my-registry-uri>
  builder:
    type: kaniko
    build-context-store: <s3-bucket-uri>
# Set to false to disable volcano install.
volcano: true
# The contents of a git credentials file. This will be stored in a k8s secret
# and mounted into the agent container. Set this if you want to clone private
# repos.
gitCreds: |
# Annotations for the wandb service account. Useful when setting up workload identity on gcp.
serviceAccount:
  annotations:
    iam.gke.io/gcp-service-account:
    azure.workload.identity/client-id:
# Set to access key for azure storage if using kaniko with azure.
azureStorageAccessKey: ""
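Because launchConfig holds the literal text of the agent config, every line of that text has to be indented under the key as a YAML literal block. The following Python sketch shows one way to produce that fragment from an existing config string; the function name `render_launch_config_value` and the sample config values are hypothetical, and the sketch does not validate the YAML it embeds.

```python
import textwrap

# Hypothetical launch agent config, as it might appear in
# ~/.config/wandb/launch-config.yaml.
launch_config = """\
queues:
  - my-queue
max_jobs: 2
builder:
  type: kaniko
"""


def render_launch_config_value(config_text):
    """Render config_text as a YAML literal block under launchConfig."""
    # Lines of a literal block must be indented deeper than the key,
    # so prefix every line with two spaces.
    body = textwrap.indent(config_text.rstrip("\n"), "  ")
    return "launchConfig: |\n" + body + "\n"


values_fragment = render_launch_config_value(launch_config)
print(values_fragment)
```

The printed fragment can be pasted into values.yaml in place of the launchConfig block; the relative indentation inside the agent config is preserved.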
You can control whether the Volcano scheduler is installed into your cluster: set volcano to false in values.yaml to skip the install.
For more information on registries, environments, and required agent permissions, see Advanced agent set up.
Follow the instructions in the helm chart repo to deploy your agent.