This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Set up Launch

1: Configure launch queue
2: Set up launch agent
3: Tutorial: Set up W&B Launch on Kubernetes
4: Tutorial: Set up W&B Launch on SageMaker
5: Tutorial: Set up W&B Launch on Vertex AI
6: Tutorial: Set up W&B Launch with Docker

This page describes the high-level steps required to set up W&B Launch:

Set up a queue: Queues are FIFO and possess a queue configuration. A queue’s configuration controls where and how jobs are executed on a target resource.
Set up an agent: Agents run on your machine/infrastructure and poll one or more queues for launch jobs. When a job is pulled, the agent ensures that the image is built and available. The agent then submits the job to the target resource.

Set up a queue

Launch queues must be configured to point to a specific target resource along with any additional configuration specific to that resource. For example, a launch queue that points to a Kubernetes cluster might include environment variables or set a custom namespace its launch queue configuration. When you create a queue, you will specify both the target resource you want to use and the configuration for that resource to use.

When an agent receives a job from a queue, it also receives the queue configuration. When the agent submits the job to the target resource, it includes the queue configuration along with any overrides from the job itself. For example, you can use a job configuration to specify the Amazon SageMaker instance type for that job instance only. In this case, it is common to use queue config templates as the end user interface.

Create a queue

Navigate to Launch App at wandb.ai/launch.
Click the create queue button on the top right of the screen.

From the Entity dropdown menu, select the entity the queue will belong to.
Provide a name for your queue in the Queue field.
From the Resource dropdown, select the compute resource you want jobs added to this queue to use.
Choose whether to allow Prioritization for this queue. If prioritization is enabled, a user on your team can define a priority for their launch job when they enqueue them. Higher priority jobs are executed before lower priority jobs.
Provide a resource configuration in either JSON or YAML format in the Configuration field. The structure and semantics of your configuration document will depend on the resource type that the queue is pointing to. For more details, see the dedicated set up page for your target resource.

Set up a launch agent

Launch agents are long running processes that poll one or more launch queues for jobs. Launch agents dequeue jobs in first in, first out (FIFO) order or in priority order depending on the queues they pull from. When an agent dequeues a job from a queue, it optionally builds an image for that job. The agent then submits the job to the target resource along with configuration options specified in the queue configuration.

Agents are highly flexible and can be configured to support a wide variety of use cases. The required configuration for your agent will depend on your specific use case. See the dedicated page for Docker, Amazon SageMaker, Kubernetes, or Vertex AI.

W&B recommends you start agents with a service account’s API key, rather than a specific user’s API key. There are two benefits to using a service account’s API key:

The agent isn’t dependent on an individual user.
The author associated with a run created through Launch is viewed by Launch as the user who submitted the launch job, rather than the user associated with the agent.

Agent configuration

Configure the launch agent with a YAML file named launch-config.yaml. By default, W&B checks for the config file in ~/.config/wandb/launch-config.yaml. You can optionally specify a different directory when you activate the launch agent.

The contents of your launch agent’s configuration file will depend on your launch agent’s environment, the launch queue’s target resource, Docker builder requirements, cloud registry requirements, and so forth.

Independent of your use case, there are core configurable options for the launch agent:

max_jobs: maximum number of jobs the agent can execute in parallel
entity: the entity that the queue belongs to
queues: the name of one or more queues for the agent to watch

You can use the W&B CLI to specify universal configurable options for the launch agent (instead of the config YAML file): maximum number of jobs, W&B entity, and launch queues. See the wandb launch-agent command for more information.

The following YAML snippet shows how to specify core launch agent config keys:

# Max number of concurrent runs to perform. -1 = no limit
max_jobs: -1

entity: <entity-name>

# List of queues to poll.
queues:
  - <queue-name>

Configure a container builder

The launch agent can be configured to build images. You must configure the agent to use a container builder if you intend to use launch jobs created from git repositories or code artifacts. See the Create a launch job for more information on how to create a launch job.

W&B Launch supports three builder options:

Docker: The Docker builder uses a local Docker daemon to build images.
Kaniko: Kaniko is a Google project that enables image building in environments where a Docker daemon is unavailable.
Noop: The agent will not try to build jobs, and instead only pull pre-built images.

Use the Kaniko builder if your agent is polling in an environment where a Docker daemon is unavailable (for example, a Kubernetes cluster).

See the Set up Kubernetes for details about the Kaniko builder.

To specify an image builder, include the builder key in your agent configuration. For example, the following code snippet shows a portion of the launch config (launch-config.yaml) that specifies to use Docker or Kaniko:

builder:
  type: docker | kaniko | noop

Configure a container registry

In some cases, you might want to connect a launch agent to a cloud registry. Common scenarios where you might want to connect a launch agent to a cloud registry include:

You want to run a job in an envirnoment other than where you built it, such as a powerful workstation or cluster.
You want to use the agent to build images and run these images on Amazon SageMaker or VertexAI.
You want the launch agent to provide credentials to pull from an image repository.

To learn more about how to configure the agent to interact with a container registry, see the Advanced agent set up page.

Activate the launch agent

Activate the launch agent with the launch-agent W&B CLI command:

wandb launch-agent -q <queue-1> -q <queue-2> --max-jobs 5

In some use cases, you might want to have a launch agent polling queues from within a Kubernetes cluster. See the Advanced queue set up page for more information.

1 - Configure launch queue

The following page describes how to configure launch queue options.

Set up queue config templates

Administer and manage guardrails on compute consumption with Queue Config Templates. Set defaults, minimums, and maximum values for fields such as memory consumption, GPU, and runtime duration.

After you configure a queue with config templates, members of your team can alter fields you defined only within the specified range you defined.

Configure queue template

You can configure a queue template on an existing queue or create a new queue.

Navigate to the W&B Launch App.
Select View queue next to the name of the queue you want to add a template to.
Select the Config tab. This will show information about your queue such as when the queue was created, the queue config, and existing launch-time overrides.
Navigate to the Queue config section.
Identify the config key-values you want to create a template for.
Replace the value in the config with a template field. Template fields take the form of {{variable-name}}.
Click on the Parse configuration button. When you parse your configuration, W&B will automatically create tiles below the queue config for each template you created.
For each tile generated, you must first specify the data type (string, integer, or float) the queue config can allow. To do this, select the data type from the Type dropdown menu.
Based on your data type, complete the fields that appear within each tile.
Click on Save config.

For example, suppose you want to create a template that limits which AWS instances your team can use. Before you add a template field, your queue config might look something similar to:

RoleArn: arn:aws:iam:region:account-id:resource-type/resource-id
ResourceConfig:
  InstanceType: ml.m4.xlarge
  InstanceCount: 1
  VolumeSizeInGB: 2
OutputDataConfig:
  S3OutputPath: s3://bucketname
StoppingCondition:
  MaxRuntimeInSeconds: 3600

When you add a template field for the InstanceType, your config will look like:

RoleArn: arn:aws:iam:region:account-id:resource-type/resource-id
ResourceConfig:
  InstanceType: "{{aws_instance}}"
  InstanceCount: 1
  VolumeSizeInGB: 2
OutputDataConfig:
  S3OutputPath: s3://bucketname
StoppingCondition:
  MaxRuntimeInSeconds: 3600

Next, you click on Parse configuration. A new tile labeled aws-instance will appear underneath the Queue config.

From there, you select String as the datatype from the Type dropdown. This will populate fields where you can specify values a user can choose from. For example, in the following image the admin of the team configured two different AWS instance types that users can choose from (ml.m4.xlarge and ml.p3.xlarge):

Dynamically configure launch jobs

Queue configs can be dynamically configured using macros that are evaluated when the agent dequeues a job from the queue. You can set the following macros:

Macro	Description
`${project_name}`	The name of the project the run is being launched to.
`${entity_name}`	The owner of the project the run being launched to.
`${run_id}`	The id of the run being launched.
`${run_name}`	The name of the run that is launching.
`${image_uri}`	The URI of the container image for this run.

Any custom macro not listed in the preceding table (for example ${MY_ENV_VAR}), is substituted with an environment variable from the agent’s environment.

Use the launch agent to build images that execute on accelerators (GPUs)

You might need to specify an accelerator base image if you use launch to build images that are executed in an accelerator environment.

This accelerator base image must satisfy the following requirements:

Debian compatibility (the Launch Dockerfile uses apt-get to fetch python)
Compatibility CPU & GPU hardware instruction set (Make sure your CUDA version is supported by the GPU you intend on using)
Compatibility between the accelerator version you provide and the packages installed in your ML algorithm
Packages installed that require extra steps for setting up compatibility with hardware

How to use GPUs with TensorFlow

Ensure TensorFlow properly utilizes your GPU. To accomplish this, specify a Docker image and its image tag for the builder.accelerator.base_image key in the queue resource configuration.

For example, the tensorflow/tensorflow:latest-gpu base image ensures TensorFlow properly uses your GPU. This can be configured using the resource configuration in the queue.

The following JSON snippet demonstrates how to specify the TensorFlow base image in your queue config:

{
    "builder": {
        "accelerator": {
            "base_image": "tensorflow/tensorflow:latest-gpu"
        }
    }
}

2 - Set up launch agent

Advanced agent setup

This guide provides information on how to set up the W&B Launch agent to build container images in different environments.

Build is only required for git and code artifact jobs. Image jobs do not require build.

See Create a launch job for more information on job types.

Builders

The Launch agent can build images using Docker or Kaniko.

Kaniko: builds a container image in Kubernetes without running the build as a privileged container.
Docker: builds a container image by executing a docker build command locally.

The builder type can be controlled by the builder.type key in the launch agent config to either docker, kaniko, or noop to turn off build. By default, the agent helm chart sets the builder.type to noop. Additional keys in the builder section will be used to configure the build process.

If no builder is specified in the agent config and a working docker CLI is found, the agent will default to using Docker. If Docker is not available the agent will default to noop.

Use Kaniko for building images in a Kubernetes cluster. Use Docker for all other cases.

Pushing to a container registry

The launch agent tags all images it builds with a unique source hash. The agent pushes the image to the registry specified in the builder.destination key.

For example, if the builder.destination key is set to my-registry.example.com/my-repository, the agent will tag and push the image to my-registry.example.com/my-repository:<source-hash>. If the image exists in the registry, the build is skipped.

Agent configuration

If you are deploying the agent via our Helm chart, the agent config should be provided in the agentConfig key in the values.yaml file.

If you are invoking the agent yourself with wandb launch-agent, you can provide the agent config as a path to a YAML file with the --config flag. By default, the config will be loaded from ~/.config/wandb/launch-config.yaml.

Within your launch agent config (launch-config.yaml), provide the name of the target resource environment and the container registry for the environment and registry keys, respectively.

The following tabs demonstrates how to configure the launch agent based on your environment and registry.

The AWS environment configuration requires the region key. The region should be the AWS region that the agent runs in.

environment:
  type: aws
  region: <aws-region>
builder:
  type: <kaniko|docker>
  # URI of the ECR repository where the agent will store images.
  # Make sure the region matches what you have configured in your
  # environment.
  destination: <account-id>.ecr.<aws-region>.amazonaws.com/<repository-name>
  # If using Kaniko, specify the S3 bucket where the agent will store the
  # build context.
  build-context-store: s3://<bucket-name>/<path>

The agent uses boto3 to load the default AWS credentials. See the boto3 documentation for more information on how to configure default AWS credentials.

The Google Cloud environment requires region and project keys. Set region to the region that the agent runs in. Set project to the Google Cloud project that the agent runs in. The agent uses google.auth.default() in Python to load the default credentials.

environment:
  type: gcp
  region: <gcp-region>
  project: <gcp-project-id>
builder:
  type: <kaniko|docker>
  # URI of the Artifact Registry repository and image name where the agent
  # will store images. Make sure the region and project match what you have
  # configured in your environment.
  uri: <region>-docker.pkg.dev/<project-id>/<repository-name>/<image-name>
  # If using Kaniko, specify the GCS bucket where the agent will store the
  # build context.
  build-context-store: gs://<bucket-name>/<path>

See the google-auth documentation for more information on how to configure default GCP credentials so they are available to the agent.

The Azure environment does not require any additional keys. When the agent starts, it use azure.identity.DefaultAzureCredential() to load the default Azure credentials.

environment:
  type: azure
builder:
  type: <kaniko|docker>
  # URI of the Azure Container Registry repository where the agent will store images.
  destination: https://<registry-name>.azurecr.io/<repository-name>
  # If using Kaniko, specify the Azure Blob Storage container where the agent
  # will store the build context.
  build-context-store: https://<storage-account-name>.blob.core.windows.net/<container-name>

See the azure-identity documentation for more information on how to configure default Azure credentials.

Agent permissions

The agent permissions required vary by use case.

Cloud registry permissions

Below are the permissions that are generally required by launch agents to interact with cloud registries.

{
  'Version': '2012-10-17',
  'Statement':
    [
      {
        'Effect': 'Allow',
        'Action':
          [
            'ecr:CreateRepository',
            'ecr:UploadLayerPart',
            'ecr:PutImage',
            'ecr:CompleteLayerUpload',
            'ecr:InitiateLayerUpload',
            'ecr:DescribeRepositories',
            'ecr:DescribeImages',
            'ecr:BatchCheckLayerAvailability',
            'ecr:BatchDeleteImage',
          ],
        'Resource': 'arn:aws:ecr:<region>:<account-id>:repository/<repository>',
      },
      {
        'Effect': 'Allow',
        'Action': 'ecr:GetAuthorizationToken',
        'Resource': '*',
      },
    ],
}

artifactregistry.dockerimages.list;
artifactregistry.repositories.downloadArtifacts;
artifactregistry.repositories.list;
artifactregistry.repositories.uploadArtifacts;

Add the AcrPush role if you use the Kaniko builder.

Storage permissions for Kaniko

The launch agent requires permission to push to cloud storage if the agent uses the Kaniko builder. Kaniko uses a context store outside of the pod running the build job.

The recommended context store for the Kaniko builder on AWS is Amazon S3. The following policy can be used to give the agent access to an S3 bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListObjectsInBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<BUCKET-NAME>"]
    },
    {
      "Sid": "AllObjectActions",
      "Effect": "Allow",
      "Action": "s3:*Object",
      "Resource": ["arn:aws:s3:::<BUCKET-NAME>/*"]
    }
  ]
}

On GCP, the following IAM permissions are required for the agent to upload build contexts to GCS:

storage.buckets.get;
storage.objects.create;
storage.objects.delete;
storage.objects.get;

The Storage Blob Data Contributor role is required in order for the agent to upload build contexts to Azure Blob Storage.

Customizing the Kaniko build

Specify the Kubernetes Job spec that the Kaniko job uses in the builder.kaniko-config key of the agent configuration. For example:

builder:
  type: kaniko
  build-context-store: <my-build-context-store>
  destination: <my-image-destination>
  build-job-name: wandb-image-build
  kaniko-config:
    spec:
      template:
        spec:
          containers:
          - args:
            - "--cache=false" # Args must be in the format "key=value"
            env:
            - name: "MY_ENV_VAR"
              value: "my-env-var-value"

Deploy Launch agent into CoreWeave

Optionally deploy the W&B Launch agent to CoreWeave Cloud infrastructure. CoreWeave is a cloud infrastructure that is purpose built for GPU-accelerated workloads.

For information on how to deploy the Launch agent to CoreWeave, see the CoreWeave documentation.

You will need to create a CoreWeave account in order to deploy the Launch agent into a CoreWeave infrastructure.

3 - Tutorial: Set up W&B Launch on Kubernetes

You can use W&B Launch to push ML workloads to a Kubernetes cluster, giving ML engineers a simple interface right in W&B to use the resources you already manage with Kubernetes.

W&B maintains an official Launch agent image that can be deployed to your cluster with a Helm chart that W&B maintains.

W&B uses the Kaniko builder to enable the Launch agent to build Docker images in a Kubernetes cluster. To learn more on how to set up Kaniko for the Launch agent, or how to turn off job building and only use prebuilt Docker images, see Advanced agent set up.

To install Helm and apply or upgrade W&B’s Launch agent Helm chart, you need kubectl access to the cluster with sufficient permissions to create, update, and delete Kubernetes resources. Typically, a user with cluster-admin or a custom role with equivalent permissions is required.

Configure a queue for Kubernetes

The Launch queue configuration for a Kubernetes target resource will resemble either a Kubernetes Job spec or a Kubernetes Custom Resource spec.

You can control any aspect of the Kubernetes workload resource spec when you create a Launch queue.

spec:
  template:
    spec:
      containers:
        - env:
            - name: MY_ENV_VAR
              value: some-value
          resources:
            requests:
              cpu: 1000m
              memory: 1Gi
metadata:
  labels:
    queue: k8s-test
namespace: wandb

In some use cases, you might want to use CustomResource definitions. CustomResource definitions are useful if, for example, you want to perform multi-node distributed training. See the tutorial for using Launch with multi-node jobs using Volcano for an example application. Another use case might be that you want to use W&B Launch with Kubeflow.

The following YAML snippet shows a sample Launch queue config that uses Kubeflow:

kubernetes:
  kind: PyTorchJob
  spec:
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        template:
          spec:
            containers:
              - name: pytorch
                image: '${image_uri}'
                imagePullPolicy: Always
        restartPolicy: Never
      Worker:
        replicas: 2
        template:
          spec:
            containers:
              - name: pytorch
                image: '${image_uri}'
                imagePullPolicy: Always
        restartPolicy: Never
    ttlSecondsAfterFinished: 600
  metadata:
    name: '${run_id}-pytorch-job'
  apiVersion: kubeflow.org/v1

For security reasons, W&B will inject the following resources into your Launch queue if they are not specified:

securityContext
backOffLimit
ttlSecondsAfterFinished

The following YAML snippet demonstrates how these values will appear in your launch queue:

spec:
  template:
    `backOffLimit`: 0
    ttlSecondsAfterFinished: 60
    securityContext:
      allowPrivilegeEscalation: False,
      capabilities:
        drop:
          - ALL,
      seccompProfile:
        type: "RuntimeDefault"

Create a queue

Create a queue in the W&B App that uses Kubernetes as its compute resource:

Navigate to the Launch page.
Click on the Create Queue button.
Select the Entity you would like to create the queue in.
Provide a name for your queue in the Name field.
Select Kubernetes as the Resource.
Within the Configuration field, provide the Kubernetes Job workflow spec or Custom Resource spec you configured in the previous section.

Configure a Launch agent with Helm

Use the Helm chart provided by W&B to deploy the Launch agent into your Kubernetes cluster. Control the behavior of the launch agent with the values.yaml file.

Specify the contents that would normally by defined in your launch agent config file (~/.config/wandb/launch-config.yaml) within the launchConfig key in thevalues.yaml file.

For example, suppose you have Launch agent config that enables you to run a Launch agent in EKS that uses the Kaniko Docker image builder:

queues:
	- <queue name>
max_jobs: <n concurrent jobs>
environment:
	type: aws
	region: us-east-1
registry:
	type: ecr
	uri: <my-registry-uri>
builder:
	type: kaniko
	build-context-store: <s3-bucket-uri>

Within your values.yaml file, this might look like:

agent:
  labels: {}
  # W&B API key.
  apiKey: ''
  # Container image to use for the agent.
  image: wandb/launch-agent:latest
  # Image pull policy for agent image.
  imagePullPolicy: Always
  # Resources block for the agent spec.
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi

# Namespace to deploy launch agent into
namespace: wandb

# W&B api url (Set yours here)
baseUrl: https://api.wandb.ai

# Additional target namespaces that the launch agent can deploy into
additionalTargetNamespaces:
  - default
  - wandb

# This should be set to the literal contents of your launch agent config.
launchConfig: |
  queues:
    - <queue name>
  max_jobs: <n concurrent jobs>
  environment:
    type: aws
    region: <aws-region>
  registry:
    type: ecr
    uri: <my-registry-uri>
  builder:
    type: kaniko
    build-context-store: <s3-bucket-uri>

# The contents of a git credentials file. This will be stored in a k8s secret
# and mounted into the agent container. Set this if you want to clone private
# repos.
gitCreds: |

# Annotations for the wandb service account. Useful when setting up workload identity on gcp.
serviceAccount:
  annotations:
    iam.gke.io/gcp-service-account:
    azure.workload.identity/client-id:

# Set to access key for azure storage if using kaniko with azure.
azureStorageAccessKey: ''

For more information on registries, environments, and required agent permissions see Advanced agent set up.

4 - Tutorial: Set up W&B Launch on SageMaker

You can use W&B Launch to submit launch jobs to Amazon SageMaker to train machine learning models using provided or custom algorithms on the SageMaker platform. SageMaker takes care of spinning up and releasing compute resources, so it can be a good choice for teams without an EKS cluster.

Launch jobs sent to a W&B Launch queue connected to Amazon SageMaker are executed as SageMaker Training Jobs with the CreateTrainingJob API. Use the launch queue configuration to control arguments sent to the CreateTrainingJob API.

Amazon SageMaker uses Docker images to execute training jobs. Images pulled by SageMaker must be stored in the Amazon Elastic Container Registry (ECR). This means that the image you use for training must be stored on ECR.

This guide shows how to execute SageMaker Training Jobs. For information on how to deploy to models for inference on Amazon SageMaker, see this example Launch job.

Prerequisites

Before you get started, ensure you satisfy the following prerequisites:

Decide if you want the Launch agent to build a Docker image for you.
Set up AWS resources and gather information about S3, ECR, and Sagemaker IAM roles.
Create an IAM role for the Launch agent.

Decide if you want the Launch agent to build a Docker images

Decide if you want the W&B Launch agent to build a Docker image for you. There are two options you can choose from:

Permit the launch agent build a Docker image, push the image to Amazon ECR, and submit SageMaker Training jobs for you. This option can offer some simplicity to ML Engineers rapidly iterating over training code.
The launch agent uses an existing Docker image that contains your training or inference scripts. This option works well with existing CI systems. If you choose this option, you will need to manually upload your Docker image to your container registry on Amazon ECR.

Set up AWS resources

Ensure you have the following AWS resources configured in your preferred AWS region:

An ECR repository to store container images.
One or more S3 buckets to store inputs and outputs for your SageMaker Training jobs.
An IAM role for Amazon SageMaker that permits SageMaker to run training jobs and interact with Amazon ECR and Amazon S3.

Make a note of the ARNs for these resources. You will need the ARNs when you define the Launch queue configuration.

Create a IAM Policy for Launch agent

From the IAM screen in AWS, create a new policy.
Toggle to the JSON policy editor, then paste the following policy based on your use case. Substitute values enclosed with <> with your own values:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogStreams",
        "SageMaker:AddTags",
        "SageMaker:CreateTrainingJob",
        "SageMaker:DescribeTrainingJob"
      ],
      "Resource": "arn:aws:sagemaker:<region>:<account-id>:*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<account-id>:role/<RoleArn-from-queue-config>"
    },
  {
      "Effect": "Allow",
      "Action": "kms:CreateGrant",
      "Resource": "<ARN-OF-KMS-KEY>",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "SageMaker.<region>.amazonaws.com",
          "kms:GrantIsForAWSResource": "true"
        }
      }
    }
  ]
}

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogStreams",
        "SageMaker:AddTags",
        "SageMaker:CreateTrainingJob",
        "SageMaker:DescribeTrainingJob"
      ],
      "Resource": "arn:aws:sagemaker:<region>:<account-id>:*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<account-id>:role/<RoleArn-from-queue-config>"
    },
     {
    "Effect": "Allow",
    "Action": [
      "ecr:CreateRepository",
      "ecr:UploadLayerPart",
      "ecr:PutImage",
      "ecr:CompleteLayerUpload",
      "ecr:InitiateLayerUpload",
      "ecr:DescribeRepositories",
      "ecr:DescribeImages",
      "ecr:BatchCheckLayerAvailability",
      "ecr:BatchDeleteImage"
    ],
    "Resource": "arn:aws:ecr:<region>:<account-id>:repository/<repository>"
  },
  {
    "Effect": "Allow",
    "Action": "ecr:GetAuthorizationToken",
    "Resource": "*"
  },
  {
      "Effect": "Allow",
      "Action": "kms:CreateGrant",
      "Resource": "<ARN-OF-KMS-KEY>",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "SageMaker.<region>.amazonaws.com",
          "kms:GrantIsForAWSResource": "true"
        }
      }
    }
  ]
}

Click Next.
Give the policy a name and description.
Click Create policy.

Create an IAM role for Launch agent

The Launch agent needs permission to create Amazon SageMaker training jobs. Follow the procedure below to create an IAM role:

From the IAM screen in AWS, create a new role.
For Trusted Entity, select AWS Account (or another option that suits your organization’s policies).
Scroll through the permissions screen and select the policy name you just created above.
Give the role a name and description.
Select Create role.
Note the ARN for the role. You will specify the ARN when you set up the launch agent.

To create IAM roles, see the AWS Identity and Access Management Documentation.

If you want the launch agent to build images, see the Advanced agent set up for additional permissions required.
The kms:CreateGrant permission for SageMaker queues is required only if the associated ResourceConfig has a specified VolumeKmsKeyId and the associated role does not have a policy that permits this action.

Configure launch queue for SageMaker

Next, create a queue in the W&B App that uses SageMaker as its compute resource:

Navigate to the Launch App.
Click on the Create Queue button.
Select the Entity you would like to create the queue in.
Provide a name for your queue in the Name field.
Select SageMaker as the Resource.
Within the Configuration field, provide information about your SageMaker job. By default, W&B will populate a YAML and JSON CreateTrainingJob request body:

{
  "RoleArn": "<REQUIRED>", 
  "ResourceConfig": {
      "InstanceType": "ml.m4.xlarge",
      "InstanceCount": 1,
      "VolumeSizeInGB": 2
  },
  "OutputDataConfig": {
      "S3OutputPath": "<REQUIRED>"
  },
  "StoppingCondition": {
      "MaxRuntimeInSeconds": 3600
  }
}

You must at minimum specify:

RoleArn : ARN of the SageMaker execution IAM role (see prerequisites). Not to be confused with the launch agent IAM role.
OutputDataConfig.S3OutputPath : An Amazon S3 URI specifying where SageMaker outputs will be stored.
ResourceConfig: Required specification of a resource config. Options for resource config are outlined here.
StoppingCondition: Required specification of the stopping conditions for the training job. Options outlined here.

Click on the Create Queue button.

Set up the launch agent

The following section describes where you can deploy your agent and how to configure your agent based on where it is deployed.

There are several options for how the Launch agent is deployed for a Amazon SageMaker queue: on a local machine, on an EC2 instance, or in an EKS cluster. Configure your launch agent appropriately based on the where you deploy your agent.

Decide where to run the Launch agent

For production workloads and for customers who already have an EKS cluster, W&B recommends deploying the Launch agent to the EKS cluster using this Helm chart.

For production workloads without an current EKS cluster, an EC2 instance is a good option. Though the launch agent instance will keep running all the time, the agent doesn’t need more than a t2.micro sized EC2 instance which is relatively affordable.

For experimental or solo use cases, running the Launch agent on your local machine can be a fast way to get started.

Based on your use case, follow the instructions provided in the following tabs to properly configure up your launch agent:

W&B strongly encourages that you use the W&B managed helm chart to install the agent in an EKS cluster.

Navigate to the Amazon EC2 Dashboard and complete the following steps:

Click Launch instance.
Provide a name for the Name field. Optionally add a tag.
From the Instance type, select an instance type for your EC2 container. You do not need more than 1vCPU and 1GiB of memory (for example a t2.micro).
Create a key pair for your organization within the Key pair (login) field. You will use this key pair to connect to your EC2 instance with SSH client at a later step.
Within Network settings, select an appropriate security group for your organization.
Expand Advanced details. For IAM instance profile, select the launch agent IAM role you created above.
Review the Summary field. If correct, select Launch instance.

Navigate to Instances within the left panel of the EC2 Dashboard on AWS. Ensure that the EC2 instance you created is running (see the Instance state column). Once you confirm your EC2 instance is running, navigate to your local machine’s terminal and complete the following:

Select Connect.
Select the SSH client tab and following the instructions outlined to connect to your EC2 instance.
Within your EC2 instance, install the following packages:

sudo yum install python311 -y && python3 -m ensurepip --upgrade && pip3 install wandb && pip3 install wandb[launch]

Next, install and start Docker within your EC2 instance:

sudo yum update -y && sudo yum install -y docker python3 && sudo systemctl start docker && sudo systemctl enable docker && sudo usermod -a -G docker ec2-user

newgrp docker

Now you can proceed to setting up the Launch agent config.

Use the AWS config files located at ~/.aws/config and ~/.aws/credentials to associate a role with an agent that is polling on a local machine. Provide the IAM role ARN that you created for the launch agent in the previous step.

[profile SageMaker-agent]
role_arn = arn:aws:iam::<account-id>:role/<agent-role-name>
source_profile = default

[default]
aws_access_key_id=<access-key-id>
aws_secret_access_key=<secret-access-key>
aws_session_token=<session-token>

Note that session tokens have a max length of 1 hour or 3 days depending on the principal they are associated with.

Configure a launch agent

Configure the launch agent with a YAML config file named launch-config.yaml.

By default, W&B will check for the config file in ~/.config/wandb/launch-config.yaml. You can optionally specify a different directory when you activate the launch agent with the -c flag.

The following YAML snippet demonstrates how to specify the core config agent options:

max_jobs: -1
queues:
  - <queue-name>
environment:
  type: aws
  region: <your-region>
registry:
  type: ecr
  uri: <ecr-repo-arn>
builder: 
  type: docker

Now start the agent with wandb launch-agent

(Optional) Push your launch job Docker image to Amazon ECR

This section applies only if your launch agent uses existing Docker images that contain your training or inference logic. There are two options on how your launch agent behaves.

Upload your Docker image that contains your launch job to your Amazon ECR repo. Your Docker image needs to be in your ECR registry before you submit new launch jobs if you are using image-based jobs.

5 - Tutorial: Set up W&B Launch on Vertex AI

You can use W&B Launch to submit jobs for execution as Vertex AI training jobs. With Vertex AI training jobs, you can train machine learning models using either provided, or custom algorithms on the Vertex AI platform. Once a launch job is initiated, Vertex AI manages the underlying infrastructure, scaling, and orchestration.

W&B Launch works with Vertex AI through the CustomJob class in the google-cloud-aiplatform SDK. The parameters of a CustomJob can be controlled with the launch queue configuration. Vertex AI cannot be configured to pull images from a private registry outside of GCP. This means that you must store container images in GCP or in a public registry if you want to use Vertex AI with W&B Launch. See the Vertex AI documentation for more information on making container images accessible to Vertex jobs.

Prerequisites

Create or access a GCP project with the Vertex AI API enabled. See the GCP API Console docs for more information on enabling an API.
Create a GCP Artifact Registry repository to store images you want to execute on Vertex. See the GCP Artifact Registry documentation for more information.
Create a staging GCS bucket for Vertex AI to store its metadata. Note that this bucket must be in the same region as your Vertex AI workloads in order to be used as a staging bucket. The same bucket can be used for staging and build contexts.
Create a service account with the necessary permissions to spin up Vertex AI jobs. See the GCP IAM documentation for more information on assigning permissions to service accounts.
Grant your service account permission to manage Vertex jobs

Permission	Resource Scope	Description
`aiplatform.customJobs.create`	Specified GCP Project	Allows creation of new machine learning jobs within the project.
`aiplatform.customJobs.list`	Specified GCP Project	Allows listing of machine learning jobs within the project.
`aiplatform.customJobs.get`	Specified GCP Project	Allows retrieval of information about specific machine learning jobs within the project.

If you want your Vertex AI workloads to assume the identity of a non-standard service account, refer to the Vertex AI documentation for instructions on service account creation and necessary permissions. The spec.service_account field of the launch queue configuration can be used to select a custom service account for your W&B runs.

Configure a queue for Vertex AI

The queue configuration for Vertex AI resources specify inputs to the CustomJob constructor in the Vertex AI Python SDK, and the run method of the CustomJob. Resource configurations are stored under the spec and run keys:

The spec key contains values for the named arguments of the CustomJob constructor in the Vertex AI Python SDK.
The run key contains values for the named arguments of the run method of the CustomJob class in the Vertex AI Python SDK.

Customizations of the execution environment happens primarily in the spec.worker_pool_specs list. A worker pool spec defines a group of workers that will run your job. The worker spec in the default config asks for a single n1-standard-4 machine with no accelerators. You can change the machine type, accelerator type and count to suit your needs.

For more information on available machine types and accelerator types, see the Vertex AI documentation.

Create a queue

Create a queue in the W&B App that uses Vertex AI as its compute resource:

Navigate to the Launch page.
Click on the Create Queue button.
Select the Entity you would like to create the queue in.
Provide a name for your queue in the Name field.
Select GCP Vertex as the Resource.
Within the Configuration field, provide information about your Vertex AI CustomJob you defined in the previous section. By default, W&B will populate a YAML and JSON request body similar to the following:

spec:
  worker_pool_specs:
    - machine_spec:
        machine_type: n1-standard-4
        accelerator_type: ACCELERATOR_TYPE_UNSPECIFIED
        accelerator_count: 0
      replica_count: 1
      container_spec:
        image_uri: ${image_uri}
  staging_bucket: <REQUIRED>
run:
  restart_job_on_worker_restart: false

After you configure your queue, click on the Create Queue button.

You must at minimum specify:

spec.worker_pool_specs : non-empty list of worker pool specifications.
spec.staging_bucket : GCS bucket to be used for staging Vertex AI assets and metadata.

Some of the Vertex AI docs show worker pool specifications with all keys in camel case,for example, workerPoolSpecs. The Vertex AI Python SDK uses snake case for these keys, for example worker_pool_specs.

Every key in the launch queue configuration should use snake case.

Configure a launch agent

The launch agent is configurable through a config file that is, by default, located at ~/.config/wandb/launch-config.yaml.

max_jobs: <n-concurrent-jobs>
queues:
  - <queue-name>

If you want the launch agent to build images for you that are executed in Vertex AI, see Advanced agent set up.

Set up agent permissions

There are multiple methods to authenticate as this service account. This can be achieved through Workload Identity, a downloaded service account JSON, environment variables, the Google Cloud Platform command-line tool, or a combination of these methods.

6 - Tutorial: Set up W&B Launch with Docker

The following guide describes how to configure W&B Launch to use Docker on a local machine for both the launch agent environment and for the queue’s target resource.

Using Docker to execute jobs and as the launch agent’s environment on the same local machine is particularly useful if your compute is installed on a machine that does not have a cluster management system (such as Kubernetes).

You can also use Docker queues to run workloads on powerful workstations.

This set up is common for users who perform experiments on their local machine, or that have a remote machine that they SSH in to, to submit launch jobs.

When you use Docker with W&B Launch, W&B will first build an image, and then build and run a container from that image. The image is built with the Docker docker run <image-uri> command. The queue configuration is interpreted as additional arguments that are passed to the docker run command.

Configure a Docker queue

The launch queue configuration (for a Docker target resource) accepts the same options defined in the docker run CLI command.

The agent receives options defined in the queue configuration. The agent then merges the received options with any overrides from the launch job’s configuration to produce a final docker run command that is executed on the target resource (in this case, a local machine).

There are two syntax transformations that take place:

Repeated options are defined in the queue configuration as a list.
Flag options are defined in the queue configuration as a Boolean with the value true.

For example, the following queue configuration:

{
  "env": ["MY_ENV_VAR=value", "MY_EXISTING_ENV_VAR"],
  "volume": "/mnt/datasets:/mnt/datasets",
  "rm": true,
  "gpus": "all"
}

Results in the following docker run command:

docker run \
  --env MY_ENV_VAR=value \
  --env MY_EXISTING_ENV_VAR \
  --volume "/mnt/datasets:/mnt/datasets" \
  --rm <image-uri> \
  --gpus all

Volumes can be specified either as a list of strings, or a single string. Use a list if you specify multiple volumes.

Docker automatically passes environment variables, that are not assigned a value, from the launch agent environment. This means that, if the launch agent has an environment variable MY_EXISTING_ENV_VAR, that environment variable is available in the container. This is useful if you want to use other config keys without publishing them in the queue configuration.

The --gpus flag of the docker run command allows you to specify GPUs that are available to a Docker container. For more information on how to use the gpus flag, see the Docker documentation.

Install the NVIDIA Container Toolkit to use GPUs within a Docker container.
If you build images from a code or artifact-sourced job, you can override the base image used by the agent to include the NVIDIA Container Toolkit. For example, within your launch queue, you can override the base image to tensorflow/tensorflow:latest-gpu:
```
{
  "builder": {
    "accelerator": {
      "base_image": "tensorflow/tensorflow:latest-gpu"
    }
  }
}
```

Create a queue

Create a queue that uses Docker as compute resource with the W&B CLI:

Navigate to the Launch page.
Click on the Create Queue button.
Select the Entity you would like to create the queue in.
Enter a name for your queue in the Name field.
Select Docker as the Resource.
Define your Docker queue configuration in the Configuration field.
Click on the Create Queue button to create the queue.

Configure a launch agent on a local machine

Configure the launch agent with a YAML config file named launch-config.yaml. By default, W&B will check for the config file in ~/.config/wandb/launch-config.yaml. You can optionally specify a different directory when you activate the launch agent.

You can use the W&B CLI to specify core configurable options for the launch agent (instead of the config YAML file): maximum number of jobs, W&B entity, and launch queues. See the wandb launch-agent command for more information.

Core agent config options

The following tabs demonstrate how to specify the core config agent options with the W&B CLI and with a YAML config file:

wandb launch-agent -q <queue-name> --max-jobs <n>

max_jobs: <n concurrent jobs>
queues:
	- <queue-name>

Docker image builders

The launch agent on your machine can be configured to build Docker images. By default, these images are stored on your machine’s local image repository. To enable your launch agent to build Docker images, set the builder key in the launch agent config to docker:

builder:
	type: docker

If you don’t want the agent to build Docker images, and instead use prebuilt images from a registry, set the builder key in the launch agent config to noop

builder:
  type: noop

Container registries

Launch uses external container registeries such as Dockerhub, Google Container Registry, Azure Container Registry, and Amazon ECR.
If you want to run a job on a different environment from where you built it, configure your agent to be able to pull from a container registry.

To learn more about how connect the launch agent with a cloud registry, see the Advanced agent setup page.