This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Launch

Easily scale and manage ML jobs using W&B Launch.

1: Tutorial: W&B Launch basics
2: Launch terms and concepts
3: Set up Launch

3.1: Configure launch queue
3.2: Set up launch agent
3.3: Tutorial: Set up W&B Launch on Kubernetes
3.4: Tutorial: Set up W&B Launch on SageMaker
3.5: Tutorial: Set up W&B Launch on Vertex AI
3.6: Tutorial: Set up W&B Launch with Docker

4: Create and deploy jobs

4.1: Add job to queue
4.2: Create a launch job
4.3: Manage job inputs
4.4: Monitor launch queue
4.5: View launch jobs

5: Create sweeps with W&B Launch
6: Launch FAQ

6.1: Are there best practices for using Launch effectively?
6.2: Can I specify a Dockerfile and let W&B build a Docker image for me?
6.3: Can Launch automatically provision (and spin down) compute resources for me in the target environment?
6.4: Can you specify secrets for jobs/automations? For instance, an API key which you do not wish to be directly visible to users?
6.5: Does Launch support parallelization? How can I limit the resources consumed by a job?
6.6: How can admins restrict which users have modify access?
6.7: How do I control who can push to a queue?
6.8: How do I fix a "permission denied" error in Launch?
6.9: How do I make W&B Launch work with Tensorflow on GPU?
6.10: How does W&B Launch build images?
6.11: I do not like clicking- can I use Launch without going through the UI?
6.12: I do not want W&B to build a container for me, can I still use Launch?
6.13: Is `wandb launch -d` or `wandb job create image` uploading a whole docker artifact and not pulling from a registry?
6.14: What permissions does the agent require in Kubernetes?
6.15: What requirements does the accelerator base image have?
6.16: When multiple jobs in a Docker queue download the same artifact, is any caching used, or is it re-downloaded every run?

7: Launch integration guides

7.1: Dagster
7.2: Launch multinode jobs with Volcano
7.3: NVIDIA NeMo Inference Microservice Deploy Job
7.4: Spin up a single node GPU cluster with Minikube

1 - Tutorial: W&B Launch basics

Getting started guide for W&B Launch.

What is Launch?

Try in Colab

Easily scale training runs from your desktop to a compute resource like Amazon SageMaker, Kubernetes and more with W&B Launch. Once W&B Launch is configured, you can quickly run training scripts, model evaluation suites, prepare models for production inference, and more with a few clicks and commands.

How it works

Launch is composed of three fundamental components: launch jobs, queues, and agents.

A launch job is a blueprint for configuring and running tasks in your ML workflow. Once you have a launch job, you can add it to a launch queue. A launch queue is a first-in, first-out (FIFO) queue where you can configure and submit your jobs to a particular compute target resource, such as Amazon SageMaker or a Kubernetes cluster.

As jobs are added to the queue, launch agents poll that queue and execute the job on the system targeted by the queue.

Based on your use case, you (or someone on your team) will configure the launch queue according to your chosen compute resource target (for example Amazon SageMaker) and deploy a launch agent on your own infrastructure.

See the Terms and concepts page for more information about Launch.

How to get started

Depending on your use case, explore the following resources to get started with W&B Launch:

If this is your first time using W&B Launch, we recommend you go through the Launch walkthrough guide.
Learn how to set up W&B Launch.
Create a launch job.
Check out the W&B Launch public jobs GitHub repository for templates of common tasks like deploying to Triton, evaluating an LLM, or more.
- View launch jobs created from this repository in this public wandb/jobs project W&B project.

Walkthrough

This page walks through the basics of the W&B Launch workflow.

W&B Launch runs machine learning workloads in containers. Familiarity with containers is not required but may be helpful for this walkthrough. See the Docker documentation for a primer on containers.

Prerequisites

Before you get started, ensure you have satisfied the following prerequisites:

Sign up for an account at https://wandb.ai/site and then log in to your W&B account.
This walkthrough requires terminal access to a machine with a working Docker CLI and engine. See the Docker installation guide for more information.
Install W&B Python SDK version 0.17.1 or higher:

pip install wandb>=0.17.1

Within your terminal, execute wandb login or set the WANDB_API_KEY environment variable to authenticate with W&B.

Within your terminal execute:

```bash
wandb login
```

```bash
WANDB_API_KEY=<your-api-key>
```

Replace `<your-api-key>` with your W&B API key.

Create a launch job

Create a launch job in one of three ways: with a Docker image, from a git repository or from local source code:

To run a pre-made container that logs a message to W&B, open a terminal and run the following command:

wandb launch --docker-image wandb/job_hello_world:main --project launch-quickstart

The preceding command downloads and runs the container image wandb/job_hello_world:main.

Launch configures the container to report everything logged with wandb to the launch-quickstart project. The container logs a message to W&B and displays a link to the newly created run in W&B. Click the link to view the run in the W&B UI.

To launch the same hello-world job from its source code in the W&B Launch jobs repository, run the following command:

wandb launch --uri https://github.com/wandb/launch-jobs.git \\
--job-name hello-world-git --project launch-quickstart \\ 
--build-context jobs/hello_world --dockerfile Dockerfile.wandb \\ 
--entry-point "python job.py"

The command does the following:

Clone the W&B Launch jobs repository to a temporary directory.
Create a job named hello-world-git in the hello project. This job tracks the exact source code and configuration used to run execute the code.
Build a container image from the jobs/hello_world directory and the Dockerfile.wandb.
Start the container and run the job.py python script.

The console output shows the image build and execution. The output of the container should be nearly identical to the previous example.

Code not versioned in a git repository can be launched by specifying a local directory path to the --uri argument.

Create an empty directory and add a Python script named train.py with the following content:

import wandb

with wandb.init() as run:
    run.log({"hello": "world"})

Add a file requirements.txt with the following content:

wandb>=0.17.1

From within the directory, run the following command:

wandb launch --uri . --job-name hello-world-code --project launch-quickstart --entry-point "python train.py"

The command does the following:

Log the contents of the current directory to W&B as a Code Artifact.
Create a job named hello-world-code in the launch-quickstart project.
Build a container image by copying train.py and requirements.txt into a base image and pip install the requirements.
Start the container and run python train.py.

Create a queue

Launch is designed to help teams build workflows around shared compute. In the examples so far, the wandb launch command has executed a container synchronously on the local machine. Launch queues and agents enable asynchronous execution of jobs on shared resources and advanced features like prioritization and hyperparameter optimization. To create a basic queue, follow these steps:

Navigate to wandb.ai/launch and click the Create a queue button.
Select an Entity to associate the queue with.
Enter a Queue name.
Select Docker as the Resource.
Leave Configuration blank, for now.
Click Create queue :rocket:

After clicking the button, the browser will redirect to the Agents tab of the queue view. The queue remains in the Not active state until an agent starts polling.

For advanced queue configuration options, see the advanced queue setup page.

Connect an agent to the queue

The queue view displays an Add an agent button in a red banner at the top of the screen if the queue has no polling agents. Click the button to view copy the command to run an agent. The command should look like the following:

wandb launch-agent --queue <queue-name> --entity <entity-name>

Run the command in a terminal to start the agent. The agent polls the specified queue for jobs to run. Once received, the agent downloads or builds and then executes a container image for the job, as if the wandb launch command was run locally.

Navigate back to the Launch page and verify that the queue now shows as Active.

Submit a job to the queue

Navigate to your new launch-quickstart project in your W&B account and open the jobs tab from the navigation on the left side of the screen.

The Jobs page displays a list of W&B Jobs that were created from previously executed runs. Click on your launch job to view source code, dependencies, and any runs created from the job. After completing this walkthrough there should be three jobs in the list.

Pick one of the new jobs and follow these instructions to submit it to the queue:

Click the Launch button to submit the job to a queue. The Launch drawer will appear.
Select the Queue you created earlier and click Launch.

This submits the job to the queue. The agent polling this queue picks up and executes the job. The progress of the job can be monitored from the W&B UI or by inspecting the output of the agent in the terminal.

The wandb launch command can push jobs to the queue directly by specifying the --queue argument. For example, to submit the hello-world container job to the queue, run the following command:

wandb launch --docker-image wandb/job_hello_world:main --project launch-quickstart --queue <queue-name>

2 - Launch terms and concepts

With W&B Launch, you enqueue jobs onto queues to create runs. Jobs are python scripts instrumented with W&B. Queues hold a list of jobs to execute on a target resource. Agents pull jobs from queues and execute the jobs on target resources. W&B tracks launch jobs similarly to how W&B tracks runs.

Launch job

A launch job is a specific type of W&B Artifact that represents a task to complete. For example, common launch jobs include training a model or triggering a model evaluation. Job definitions include:

Python code and other file assets, including at least one runnable entrypoint.
Information about the input (config parameter) and output (metrics logged).
Information about the environment. (for example, requirements.txt, base Dockerfile).

There are three main kinds of job definitions:

Job types	Definition	How to run this job type
Artifact-based (or code-based) jobs	Code and other assets are saved as a W&B artifact.	To run artifact-based jobs, Launch agent must be configured with a builder.
Git-based jobs	Code and other assets are cloned from a certain commit, branch, or tag in a git repository.	To run git-based jobs, Launch agent must be configured with a builder and git repository credentials.
Image-based jobs	Code and other assets are baked into a Docker image.	To run image-based jobs, Launch agent might need to be configured with image repository credentials.

While Launch jobs can perform activities not related to model training–for example, deploy a model to a Triton inference server–all jobs must call wandb.init to complete successfully. This creates a run for tracking purposes in a W&B workspace.

Find jobs you created in the W&B App under the Jobs tab of your project workspace. From there, jobs can be configured and sent to a launch queue to be executed on a variety of target resources.

Launch queue

Launch queues are ordered lists of jobs to execute on a specific target resource. Launch queues are first-in, first-out. (FIFO). There is no practical limit to the number of queues you can have, but a good guideline is one queue per target resource. Jobs can be enqueued with the W&B App UI, W&B CLI or Python SDK. You can then configure one or more Launch agents to pull items from the queue and execute them on the queue’s target resource.

Target resources

The compute environment that a Launch queue is configured to execute jobs on is called the target resource.

W&B Launch supports the following target resources:

Each target resource accepts a different set of configuration parameters called resource configurations. Resource configurations take on default values defined by each Launch queue, but can be overridden independently by each job. See the documentation for each target resource for more details.

Launch agent

Launch agents are lightweight, persistent programs that periodically check Launch queues for jobs to execute. When a launch agent receives a job, it first builds or pulls the image from the job definition then runs it on the target resource.

One agent may poll multiple queues, however the agent must be configured properly to support all of the backing target resources for each queue it is polling.

Launch agent environment

The agent environment is the environment where a launch agent is running, polling for jobs.

The agent’s runtime environment is independent of a queue’s target resource. In other words, agents can be deployed anywhere as long as they are configured sufficiently to access the required target resources.

3 - Set up Launch

This page describes the high-level steps required to set up W&B Launch:

Set up a queue: Queues are FIFO and possess a queue configuration. A queue’s configuration controls where and how jobs are executed on a target resource.
Set up an agent: Agents run on your machine/infrastructure and poll one or more queues for launch jobs. When a job is pulled, the agent ensures that the image is built and available. The agent then submits the job to the target resource.

Set up a queue

Launch queues must be configured to point to a specific target resource along with any additional configuration specific to that resource. For example, a launch queue that points to a Kubernetes cluster might include environment variables or set a custom namespace its launch queue configuration. When you create a queue, you will specify both the target resource you want to use and the configuration for that resource to use.

When an agent receives a job from a queue, it also receives the queue configuration. When the agent submits the job to the target resource, it includes the queue configuration along with any overrides from the job itself. For example, you can use a job configuration to specify the Amazon SageMaker instance type for that job instance only. In this case, it is common to use queue config templates as the end user interface.

Create a queue

Navigate to Launch App at wandb.ai/launch.
Click the create queue button on the top right of the screen.

From the Entity dropdown menu, select the entity the queue will belong to.
Provide a name for your queue in the Queue field.
From the Resource dropdown, select the compute resource you want jobs added to this queue to use.
Choose whether to allow Prioritization for this queue. If prioritization is enabled, a user on your team can define a priority for their launch job when they enqueue them. Higher priority jobs are executed before lower priority jobs.
Provide a resource configuration in either JSON or YAML format in the Configuration field. The structure and semantics of your configuration document will depend on the resource type that the queue is pointing to. For more details, see the dedicated set up page for your target resource.

Set up a launch agent

Launch agents are long running processes that poll one or more launch queues for jobs. Launch agents dequeue jobs in first in, first out (FIFO) order or in priority order depending on the queues they pull from. When an agent dequeues a job from a queue, it optionally builds an image for that job. The agent then submits the job to the target resource along with configuration options specified in the queue configuration.

Agents are highly flexible and can be configured to support a wide variety of use cases. The required configuration for your agent will depend on your specific use case. See the dedicated page for Docker, Amazon SageMaker, Kubernetes, or Vertex AI.

W&B recommends you start agents with a service account’s API key, rather than a specific user’s API key. There are two benefits to using a service account’s API key:

The agent isn’t dependent on an individual user.
The author associated with a run created through Launch is viewed by Launch as the user who submitted the launch job, rather than the user associated with the agent.

Agent configuration

Configure the launch agent with a YAML file named launch-config.yaml. By default, W&B checks for the config file in ~/.config/wandb/launch-config.yaml. You can optionally specify a different directory when you activate the launch agent.

The contents of your launch agent’s configuration file will depend on your launch agent’s environment, the launch queue’s target resource, Docker builder requirements, cloud registry requirements, and so forth.

Independent of your use case, there are core configurable options for the launch agent:

max_jobs: maximum number of jobs the agent can execute in parallel
entity: the entity that the queue belongs to
queues: the name of one or more queues for the agent to watch

You can use the W&B CLI to specify universal configurable options for the launch agent (instead of the config YAML file): maximum number of jobs, W&B entity, and launch queues. See the wandb launch-agent command for more information.

The following YAML snippet shows how to specify core launch agent config keys:

# Max number of concurrent runs to perform. -1 = no limit
max_jobs: -1

entity: <entity-name>

# List of queues to poll.
queues:
  - <queue-name>

Configure a container builder

The launch agent can be configured to build images. You must configure the agent to use a container builder if you intend to use launch jobs created from git repositories or code artifacts. See the Create a launch job for more information on how to create a launch job.

W&B Launch supports three builder options:

Docker: The Docker builder uses a local Docker daemon to build images.
Kaniko: Kaniko is a Google project that enables image building in environments where a Docker daemon is unavailable.
Noop: The agent will not try to build jobs, and instead only pull pre-built images.

Use the Kaniko builder if your agent is polling in an environment where a Docker daemon is unavailable (for example, a Kubernetes cluster).

See the Set up Kubernetes for details about the Kaniko builder.

To specify an image builder, include the builder key in your agent configuration. For example, the following code snippet shows a portion of the launch config (launch-config.yaml) that specifies to use Docker or Kaniko:

builder:
  type: docker | kaniko | noop

Configure a container registry

In some cases, you might want to connect a launch agent to a cloud registry. Common scenarios where you might want to connect a launch agent to a cloud registry include:

You want to run a job in an envirnoment other than where you built it, such as a powerful workstation or cluster.
You want to use the agent to build images and run these images on Amazon SageMaker or VertexAI.
You want the launch agent to provide credentials to pull from an image repository.

To learn more about how to configure the agent to interact with a container registry, see the Advanced agent set up page.

Activate the launch agent

Activate the launch agent with the launch-agent W&B CLI command:

wandb launch-agent -q <queue-1> -q <queue-2> --max-jobs 5

In some use cases, you might want to have a launch agent polling queues from within a Kubernetes cluster. See the Advanced queue set up page for more information.

3.1 - Configure launch queue

The following page describes how to configure launch queue options.

Set up queue config templates

Administer and manage guardrails on compute consumption with Queue Config Templates. Set defaults, minimums, and maximum values for fields such as memory consumption, GPU, and runtime duration.

After you configure a queue with config templates, members of your team can alter fields you defined only within the specified range you defined.

Configure queue template

You can configure a queue template on an existing queue or create a new queue.

Navigate to the W&B Launch App.
Select View queue next to the name of the queue you want to add a template to.
Select the Config tab. This will show information about your queue such as when the queue was created, the queue config, and existing launch-time overrides.
Navigate to the Queue config section.
Identify the config key-values you want to create a template for.
Replace the value in the config with a template field. Template fields take the form of {{variable-name}}.
Click on the Parse configuration button. When you parse your configuration, W&B will automatically create tiles below the queue config for each template you created.
For each tile generated, you must first specify the data type (string, integer, or float) the queue config can allow. To do this, select the data type from the Type dropdown menu.
Based on your data type, complete the fields that appear within each tile.
Click on Save config.

For example, suppose you want to create a template that limits which AWS instances your team can use. Before you add a template field, your queue config might look something similar to:

RoleArn: arn:aws:iam:region:account-id:resource-type/resource-id
ResourceConfig:
  InstanceType: ml.m4.xlarge
  InstanceCount: 1
  VolumeSizeInGB: 2
OutputDataConfig:
  S3OutputPath: s3://bucketname
StoppingCondition:
  MaxRuntimeInSeconds: 3600

When you add a template field for the InstanceType, your config will look like:

RoleArn: arn:aws:iam:region:account-id:resource-type/resource-id
ResourceConfig:
  InstanceType: "{{aws_instance}}"
  InstanceCount: 1
  VolumeSizeInGB: 2
OutputDataConfig:
  S3OutputPath: s3://bucketname
StoppingCondition:
  MaxRuntimeInSeconds: 3600

Next, you click on Parse configuration. A new tile labeled aws-instance will appear underneath the Queue config.

From there, you select String as the datatype from the Type dropdown. This will populate fields where you can specify values a user can choose from. For example, in the following image the admin of the team configured two different AWS instance types that users can choose from (ml.m4.xlarge and ml.p3.xlarge):

Dynamically configure launch jobs

Queue configs can be dynamically configured using macros that are evaluated when the agent dequeues a job from the queue. You can set the following macros:

Macro	Description
`${project_name}`	The name of the project the run is being launched to.
`${entity_name}`	The owner of the project the run being launched to.
`${run_id}`	The id of the run being launched.
`${run_name}`	The name of the run that is launching.
`${image_uri}`	The URI of the container image for this run.

Any custom macro not listed in the preceding table (for example ${MY_ENV_VAR}), is substituted with an environment variable from the agent’s environment.

Use the launch agent to build images that execute on accelerators (GPUs)

You might need to specify an accelerator base image if you use launch to build images that are executed in an accelerator environment.

This accelerator base image must satisfy the following requirements:

Debian compatibility (the Launch Dockerfile uses apt-get to fetch python)
Compatibility CPU & GPU hardware instruction set (Make sure your CUDA version is supported by the GPU you intend on using)
Compatibility between the accelerator version you provide and the packages installed in your ML algorithm
Packages installed that require extra steps for setting up compatibility with hardware

How to use GPUs with TensorFlow

Ensure TensorFlow properly utilizes your GPU. To accomplish this, specify a Docker image and its image tag for the builder.accelerator.base_image key in the queue resource configuration.

For example, the tensorflow/tensorflow:latest-gpu base image ensures TensorFlow properly uses your GPU. This can be configured using the resource configuration in the queue.

The following JSON snippet demonstrates how to specify the TensorFlow base image in your queue config:

{
    "builder": {
        "accelerator": {
            "base_image": "tensorflow/tensorflow:latest-gpu"
        }
    }
}

3.2 - Set up launch agent

Advanced agent setup

This guide provides information on how to set up the W&B Launch agent to build container images in different environments.

Build is only required for git and code artifact jobs. Image jobs do not require build.

See Create a launch job for more information on job types.

Builders

The Launch agent can build images using Docker or Kaniko.

Kaniko: builds a container image in Kubernetes without running the build as a privileged container.
Docker: builds a container image by executing a docker build command locally.

The builder type can be controlled by the builder.type key in the launch agent config to either docker, kaniko, or noop to turn off build. By default, the agent helm chart sets the builder.type to noop. Additional keys in the builder section will be used to configure the build process.

If no builder is specified in the agent config and a working docker CLI is found, the agent will default to using Docker. If Docker is not available the agent will default to noop.

Use Kaniko for building images in a Kubernetes cluster. Use Docker for all other cases.

Pushing to a container registry

The launch agent tags all images it builds with a unique source hash. The agent pushes the image to the registry specified in the builder.destination key.

For example, if the builder.destination key is set to my-registry.example.com/my-repository, the agent will tag and push the image to my-registry.example.com/my-repository:<source-hash>. If the image exists in the registry, the build is skipped.

Agent configuration

If you are deploying the agent via our Helm chart, the agent config should be provided in the agentConfig key in the values.yaml file.

If you are invoking the agent yourself with wandb launch-agent, you can provide the agent config as a path to a YAML file with the --config flag. By default, the config will be loaded from ~/.config/wandb/launch-config.yaml.

Within your launch agent config (launch-config.yaml), provide the name of the target resource environment and the container registry for the environment and registry keys, respectively.

The following tabs demonstrates how to configure the launch agent based on your environment and registry.

The AWS environment configuration requires the region key. The region should be the AWS region that the agent runs in.

environment:
  type: aws
  region: <aws-region>
builder:
  type: <kaniko|docker>
  # URI of the ECR repository where the agent will store images.
  # Make sure the region matches what you have configured in your
  # environment.
  destination: <account-id>.ecr.<aws-region>.amazonaws.com/<repository-name>
  # If using Kaniko, specify the S3 bucket where the agent will store the
  # build context.
  build-context-store: s3://<bucket-name>/<path>

The agent uses boto3 to load the default AWS credentials. See the boto3 documentation for more information on how to configure default AWS credentials.

The Google Cloud environment requires region and project keys. Set region to the region that the agent runs in. Set project to the Google Cloud project that the agent runs in. The agent uses google.auth.default() in Python to load the default credentials.

environment:
  type: gcp
  region: <gcp-region>
  project: <gcp-project-id>
builder:
  type: <kaniko|docker>
  # URI of the Artifact Registry repository and image name where the agent
  # will store images. Make sure the region and project match what you have
  # configured in your environment.
  uri: <region>-docker.pkg.dev/<project-id>/<repository-name>/<image-name>
  # If using Kaniko, specify the GCS bucket where the agent will store the
  # build context.
  build-context-store: gs://<bucket-name>/<path>

See the google-auth documentation for more information on how to configure default GCP credentials so they are available to the agent.

The Azure environment does not require any additional keys. When the agent starts, it use azure.identity.DefaultAzureCredential() to load the default Azure credentials.

environment:
  type: azure
builder:
  type: <kaniko|docker>
  # URI of the Azure Container Registry repository where the agent will store images.
  destination: https://<registry-name>.azurecr.io/<repository-name>
  # If using Kaniko, specify the Azure Blob Storage container where the agent
  # will store the build context.
  build-context-store: https://<storage-account-name>.blob.core.windows.net/<container-name>

See the azure-identity documentation for more information on how to configure default Azure credentials.

Agent permissions

The agent permissions required vary by use case.

Cloud registry permissions

Below are the permissions that are generally required by launch agents to interact with cloud registries.

{
  'Version': '2012-10-17',
  'Statement':
    [
      {
        'Effect': 'Allow',
        'Action':
          [
            'ecr:CreateRepository',
            'ecr:UploadLayerPart',
            'ecr:PutImage',
            'ecr:CompleteLayerUpload',
            'ecr:InitiateLayerUpload',
            'ecr:DescribeRepositories',
            'ecr:DescribeImages',
            'ecr:BatchCheckLayerAvailability',
            'ecr:BatchDeleteImage',
          ],
        'Resource': 'arn:aws:ecr:<region>:<account-id>:repository/<repository>',
      },
      {
        'Effect': 'Allow',
        'Action': 'ecr:GetAuthorizationToken',
        'Resource': '*',
      },
    ],
}

artifactregistry.dockerimages.list;
artifactregistry.repositories.downloadArtifacts;
artifactregistry.repositories.list;
artifactregistry.repositories.uploadArtifacts;

Add the AcrPush role if you use the Kaniko builder.

Storage permissions for Kaniko

The launch agent requires permission to push to cloud storage if the agent uses the Kaniko builder. Kaniko uses a context store outside of the pod running the build job.

The recommended context store for the Kaniko builder on AWS is Amazon S3. The following policy can be used to give the agent access to an S3 bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListObjectsInBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<BUCKET-NAME>"]
    },
    {
      "Sid": "AllObjectActions",
      "Effect": "Allow",
      "Action": "s3:*Object",
      "Resource": ["arn:aws:s3:::<BUCKET-NAME>/*"]
    }
  ]
}

On GCP, the following IAM permissions are required for the agent to upload build contexts to GCS:

storage.buckets.get;
storage.objects.create;
storage.objects.delete;
storage.objects.get;

The Storage Blob Data Contributor role is required in order for the agent to upload build contexts to Azure Blob Storage.

Customizing the Kaniko build

Specify the Kubernetes Job spec that the Kaniko job uses in the builder.kaniko-config key of the agent configuration. For example:

builder:
  type: kaniko
  build-context-store: <my-build-context-store>
  destination: <my-image-destination>
  build-job-name: wandb-image-build
  kaniko-config:
    spec:
      template:
        spec:
          containers:
          - args:
            - "--cache=false" # Args must be in the format "key=value"
            env:
            - name: "MY_ENV_VAR"
              value: "my-env-var-value"

Deploy Launch agent into CoreWeave

Optionally deploy the W&B Launch agent to CoreWeave Cloud infrastructure. CoreWeave is a cloud infrastructure that is purpose built for GPU-accelerated workloads.

For information on how to deploy the Launch agent to CoreWeave, see the CoreWeave documentation.

You will need to create a CoreWeave account in order to deploy the Launch agent into a CoreWeave infrastructure.

3.3 - Tutorial: Set up W&B Launch on Kubernetes

You can use W&B Launch to push ML workloads to a Kubernetes cluster, giving ML engineers a simple interface right in W&B to use the resources you already manage with Kubernetes.

W&B maintains an official Launch agent image that can be deployed to your cluster with a Helm chart that W&B maintains.

W&B uses the Kaniko builder to enable the Launch agent to build Docker images in a Kubernetes cluster. To learn more on how to set up Kaniko for the Launch agent, or how to turn off job building and only use prebuilt Docker images, see Advanced agent set up.

To install Helm and apply or upgrade W&B’s Launch agent Helm chart, you need kubectl access to the cluster with sufficient permissions to create, update, and delete Kubernetes resources. Typically, a user with cluster-admin or a custom role with equivalent permissions is required.

Configure a queue for Kubernetes

The Launch queue configuration for a Kubernetes target resource will resemble either a Kubernetes Job spec or a Kubernetes Custom Resource spec.

You can control any aspect of the Kubernetes workload resource spec when you create a Launch queue.

spec:
  template:
    spec:
      containers:
        - env:
            - name: MY_ENV_VAR
              value: some-value
          resources:
            requests:
              cpu: 1000m
              memory: 1Gi
metadata:
  labels:
    queue: k8s-test
namespace: wandb

In some use cases, you might want to use CustomResource definitions. CustomResource definitions are useful if, for example, you want to perform multi-node distributed training. See the tutorial for using Launch with multi-node jobs using Volcano for an example application. Another use case might be that you want to use W&B Launch with Kubeflow.

The following YAML snippet shows a sample Launch queue config that uses Kubeflow:

kubernetes:
  kind: PyTorchJob
  spec:
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        template:
          spec:
            containers:
              - name: pytorch
                image: '${image_uri}'
                imagePullPolicy: Always
        restartPolicy: Never
      Worker:
        replicas: 2
        template:
          spec:
            containers:
              - name: pytorch
                image: '${image_uri}'
                imagePullPolicy: Always
        restartPolicy: Never
    ttlSecondsAfterFinished: 600
  metadata:
    name: '${run_id}-pytorch-job'
  apiVersion: kubeflow.org/v1

For security reasons, W&B will inject the following resources into your Launch queue if they are not specified:

securityContext
backOffLimit
ttlSecondsAfterFinished

The following YAML snippet demonstrates how these values will appear in your launch queue:

spec:
  template:
    `backOffLimit`: 0
    ttlSecondsAfterFinished: 60
    securityContext:
      allowPrivilegeEscalation: False,
      capabilities:
        drop:
          - ALL,
      seccompProfile:
        type: "RuntimeDefault"

Create a queue

Create a queue in the W&B App that uses Kubernetes as its compute resource:

Navigate to the Launch page.
Click on the Create Queue button.
Select the Entity you would like to create the queue in.
Provide a name for your queue in the Name field.
Select Kubernetes as the Resource.
Within the Configuration field, provide the Kubernetes Job workflow spec or Custom Resource spec you configured in the previous section.

Configure a Launch agent with Helm

Use the Helm chart provided by W&B to deploy the Launch agent into your Kubernetes cluster. Control the behavior of the launch agent with the values.yaml file.

Specify the contents that would normally by defined in your launch agent config file (~/.config/wandb/launch-config.yaml) within the launchConfig key in thevalues.yaml file.

For example, suppose you have Launch agent config that enables you to run a Launch agent in EKS that uses the Kaniko Docker image builder:

queues:
	- <queue name>
max_jobs: <n concurrent jobs>
environment:
	type: aws
	region: us-east-1
registry:
	type: ecr
	uri: <my-registry-uri>
builder:
	type: kaniko
	build-context-store: <s3-bucket-uri>

Within your values.yaml file, this might look like:

agent:
  labels: {}
  # W&B API key.
  apiKey: ''
  # Container image to use for the agent.
  image: wandb/launch-agent:latest
  # Image pull policy for agent image.
  imagePullPolicy: Always
  # Resources block for the agent spec.
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi

# Namespace to deploy launch agent into
namespace: wandb

# W&B api url (Set yours here)
baseUrl: https://api.wandb.ai

# Additional target namespaces that the launch agent can deploy into
additionalTargetNamespaces:
  - default
  - wandb

# This should be set to the literal contents of your launch agent config.
launchConfig: |
  queues:
    - <queue name>
  max_jobs: <n concurrent jobs>
  environment:
    type: aws
    region: <aws-region>
  registry:
    type: ecr
    uri: <my-registry-uri>
  builder:
    type: kaniko
    build-context-store: <s3-bucket-uri>

# The contents of a git credentials file. This will be stored in a k8s secret
# and mounted into the agent container. Set this if you want to clone private
# repos.
gitCreds: |

# Annotations for the wandb service account. Useful when setting up workload identity on gcp.
serviceAccount:
  annotations:
    iam.gke.io/gcp-service-account:
    azure.workload.identity/client-id:

# Set to access key for azure storage if using kaniko with azure.
azureStorageAccessKey: ''

For more information on registries, environments, and required agent permissions see Advanced agent set up.

3.4 - Tutorial: Set up W&B Launch on SageMaker

You can use W&B Launch to submit launch jobs to Amazon SageMaker to train machine learning models using provided or custom algorithms on the SageMaker platform. SageMaker takes care of spinning up and releasing compute resources, so it can be a good choice for teams without an EKS cluster.

Launch jobs sent to a W&B Launch queue connected to Amazon SageMaker are executed as SageMaker Training Jobs with the CreateTrainingJob API. Use the launch queue configuration to control arguments sent to the CreateTrainingJob API.

Amazon SageMaker uses Docker images to execute training jobs. Images pulled by SageMaker must be stored in the Amazon Elastic Container Registry (ECR). This means that the image you use for training must be stored on ECR.

This guide shows how to execute SageMaker Training Jobs. For information on how to deploy to models for inference on Amazon SageMaker, see this example Launch job.

Prerequisites

Before you get started, ensure you satisfy the following prerequisites:

Decide if you want the Launch agent to build a Docker image for you.
Set up AWS resources and gather information about S3, ECR, and Sagemaker IAM roles.
Create an IAM role for the Launch agent.

Decide if you want the Launch agent to build a Docker images

Decide if you want the W&B Launch agent to build a Docker image for you. There are two options you can choose from:

Permit the launch agent build a Docker image, push the image to Amazon ECR, and submit SageMaker Training jobs for you. This option can offer some simplicity to ML Engineers rapidly iterating over training code.
The launch agent uses an existing Docker image that contains your training or inference scripts. This option works well with existing CI systems. If you choose this option, you will need to manually upload your Docker image to your container registry on Amazon ECR.

Set up AWS resources

Ensure you have the following AWS resources configured in your preferred AWS region:

An ECR repository to store container images.
One or more S3 buckets to store inputs and outputs for your SageMaker Training jobs.
An IAM role for Amazon SageMaker that permits SageMaker to run training jobs and interact with Amazon ECR and Amazon S3.

Make a note of the ARNs for these resources. You will need the ARNs when you define the Launch queue configuration.

Create a IAM Policy for Launch agent

From the IAM screen in AWS, create a new policy.
Toggle to the JSON policy editor, then paste the following policy based on your use case. Substitute values enclosed with <> with your own values:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogStreams",
        "SageMaker:AddTags",
        "SageMaker:CreateTrainingJob",
        "SageMaker:DescribeTrainingJob"
      ],
      "Resource": "arn:aws:sagemaker:<region>:<account-id>:*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<account-id>:role/<RoleArn-from-queue-config>"
    },
  {
      "Effect": "Allow",
      "Action": "kms:CreateGrant",
      "Resource": "<ARN-OF-KMS-KEY>",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "SageMaker.<region>.amazonaws.com",
          "kms:GrantIsForAWSResource": "true"
        }
      }
    }
  ]
}

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogStreams",
        "SageMaker:AddTags",
        "SageMaker:CreateTrainingJob",
        "SageMaker:DescribeTrainingJob"
      ],
      "Resource": "arn:aws:sagemaker:<region>:<account-id>:*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<account-id>:role/<RoleArn-from-queue-config>"
    },
     {
    "Effect": "Allow",
    "Action": [
      "ecr:CreateRepository",
      "ecr:UploadLayerPart",
      "ecr:PutImage",
      "ecr:CompleteLayerUpload",
      "ecr:InitiateLayerUpload",
      "ecr:DescribeRepositories",
      "ecr:DescribeImages",
      "ecr:BatchCheckLayerAvailability",
      "ecr:BatchDeleteImage"
    ],
    "Resource": "arn:aws:ecr:<region>:<account-id>:repository/<repository>"
  },
  {
    "Effect": "Allow",
    "Action": "ecr:GetAuthorizationToken",
    "Resource": "*"
  },
  {
      "Effect": "Allow",
      "Action": "kms:CreateGrant",
      "Resource": "<ARN-OF-KMS-KEY>",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "SageMaker.<region>.amazonaws.com",
          "kms:GrantIsForAWSResource": "true"
        }
      }
    }
  ]
}

Click Next.
Give the policy a name and description.
Click Create policy.

Create an IAM role for Launch agent

The Launch agent needs permission to create Amazon SageMaker training jobs. Follow the procedure below to create an IAM role:

From the IAM screen in AWS, create a new role.
For Trusted Entity, select AWS Account (or another option that suits your organization’s policies).
Scroll through the permissions screen and select the policy name you just created above.
Give the role a name and description.
Select Create role.
Note the ARN for the role. You will specify the ARN when you set up the launch agent.

To create IAM roles, see the AWS Identity and Access Management Documentation.

If you want the launch agent to build images, see the Advanced agent set up for additional permissions required.
The kms:CreateGrant permission for SageMaker queues is required only if the associated ResourceConfig has a specified VolumeKmsKeyId and the associated role does not have a policy that permits this action.

Configure launch queue for SageMaker

Next, create a queue in the W&B App that uses SageMaker as its compute resource:

Navigate to the Launch App.
Click on the Create Queue button.
Select the Entity you would like to create the queue in.
Provide a name for your queue in the Name field.
Select SageMaker as the Resource.
Within the Configuration field, provide information about your SageMaker job. By default, W&B will populate a YAML and JSON CreateTrainingJob request body:

{
  "RoleArn": "<REQUIRED>", 
  "ResourceConfig": {
      "InstanceType": "ml.m4.xlarge",
      "InstanceCount": 1,
      "VolumeSizeInGB": 2
  },
  "OutputDataConfig": {
      "S3OutputPath": "<REQUIRED>"
  },
  "StoppingCondition": {
      "MaxRuntimeInSeconds": 3600
  }
}

You must at minimum specify:

RoleArn : ARN of the SageMaker execution IAM role (see prerequisites). Not to be confused with the launch agent IAM role.
OutputDataConfig.S3OutputPath : An Amazon S3 URI specifying where SageMaker outputs will be stored.
ResourceConfig: Required specification of a resource config. Options for resource config are outlined here.
StoppingCondition: Required specification of the stopping conditions for the training job. Options outlined here.

Click on the Create Queue button.

Set up the launch agent

The following section describes where you can deploy your agent and how to configure your agent based on where it is deployed.

There are several options for how the Launch agent is deployed for a Amazon SageMaker queue: on a local machine, on an EC2 instance, or in an EKS cluster. Configure your launch agent appropriately based on the where you deploy your agent.

Decide where to run the Launch agent

For production workloads and for customers who already have an EKS cluster, W&B recommends deploying the Launch agent to the EKS cluster using this Helm chart.

For production workloads without an current EKS cluster, an EC2 instance is a good option. Though the launch agent instance will keep running all the time, the agent doesn’t need more than a t2.micro sized EC2 instance which is relatively affordable.

For experimental or solo use cases, running the Launch agent on your local machine can be a fast way to get started.

Based on your use case, follow the instructions provided in the following tabs to properly configure up your launch agent:

W&B strongly encourages that you use the W&B managed helm chart to install the agent in an EKS cluster.

Navigate to the Amazon EC2 Dashboard and complete the following steps:

Click Launch instance.
Provide a name for the Name field. Optionally add a tag.
From the Instance type, select an instance type for your EC2 container. You do not need more than 1vCPU and 1GiB of memory (for example a t2.micro).
Create a key pair for your organization within the Key pair (login) field. You will use this key pair to connect to your EC2 instance with SSH client at a later step.
Within Network settings, select an appropriate security group for your organization.
Expand Advanced details. For IAM instance profile, select the launch agent IAM role you created above.
Review the Summary field. If correct, select Launch instance.

Navigate to Instances within the left panel of the EC2 Dashboard on AWS. Ensure that the EC2 instance you created is running (see the Instance state column). Once you confirm your EC2 instance is running, navigate to your local machine’s terminal and complete the following:

Select Connect.
Select the SSH client tab and following the instructions outlined to connect to your EC2 instance.
Within your EC2 instance, install the following packages:

sudo yum install python311 -y && python3 -m ensurepip --upgrade && pip3 install wandb && pip3 install wandb[launch]

Next, install and start Docker within your EC2 instance:

sudo yum update -y && sudo yum install -y docker python3 && sudo systemctl start docker && sudo systemctl enable docker && sudo usermod -a -G docker ec2-user

newgrp docker

Now you can proceed to setting up the Launch agent config.

Use the AWS config files located at ~/.aws/config and ~/.aws/credentials to associate a role with an agent that is polling on a local machine. Provide the IAM role ARN that you created for the launch agent in the previous step.

[profile SageMaker-agent]
role_arn = arn:aws:iam::<account-id>:role/<agent-role-name>
source_profile = default

[default]
aws_access_key_id=<access-key-id>
aws_secret_access_key=<secret-access-key>
aws_session_token=<session-token>

Note that session tokens have a max length of 1 hour or 3 days depending on the principal they are associated with.

Configure a launch agent

Configure the launch agent with a YAML config file named launch-config.yaml.

By default, W&B will check for the config file in ~/.config/wandb/launch-config.yaml. You can optionally specify a different directory when you activate the launch agent with the -c flag.

The following YAML snippet demonstrates how to specify the core config agent options:

max_jobs: -1
queues:
  - <queue-name>
environment:
  type: aws
  region: <your-region>
registry:
  type: ecr
  uri: <ecr-repo-arn>
builder: 
  type: docker

Now start the agent with wandb launch-agent

(Optional) Push your launch job Docker image to Amazon ECR

This section applies only if your launch agent uses existing Docker images that contain your training or inference logic. There are two options on how your launch agent behaves.

Upload your Docker image that contains your launch job to your Amazon ECR repo. Your Docker image needs to be in your ECR registry before you submit new launch jobs if you are using image-based jobs.

3.5 - Tutorial: Set up W&B Launch on Vertex AI

You can use W&B Launch to submit jobs for execution as Vertex AI training jobs. With Vertex AI training jobs, you can train machine learning models using either provided, or custom algorithms on the Vertex AI platform. Once a launch job is initiated, Vertex AI manages the underlying infrastructure, scaling, and orchestration.

W&B Launch works with Vertex AI through the CustomJob class in the google-cloud-aiplatform SDK. The parameters of a CustomJob can be controlled with the launch queue configuration. Vertex AI cannot be configured to pull images from a private registry outside of GCP. This means that you must store container images in GCP or in a public registry if you want to use Vertex AI with W&B Launch. See the Vertex AI documentation for more information on making container images accessible to Vertex jobs.

Prerequisites

Create or access a GCP project with the Vertex AI API enabled. See the GCP API Console docs for more information on enabling an API.
Create a GCP Artifact Registry repository to store images you want to execute on Vertex. See the GCP Artifact Registry documentation for more information.
Create a staging GCS bucket for Vertex AI to store its metadata. Note that this bucket must be in the same region as your Vertex AI workloads in order to be used as a staging bucket. The same bucket can be used for staging and build contexts.
Create a service account with the necessary permissions to spin up Vertex AI jobs. See the GCP IAM documentation for more information on assigning permissions to service accounts.
Grant your service account permission to manage Vertex jobs

Permission	Resource Scope	Description
`aiplatform.customJobs.create`	Specified GCP Project	Allows creation of new machine learning jobs within the project.
`aiplatform.customJobs.list`	Specified GCP Project	Allows listing of machine learning jobs within the project.
`aiplatform.customJobs.get`	Specified GCP Project	Allows retrieval of information about specific machine learning jobs within the project.

If you want your Vertex AI workloads to assume the identity of a non-standard service account, refer to the Vertex AI documentation for instructions on service account creation and necessary permissions. The spec.service_account field of the launch queue configuration can be used to select a custom service account for your W&B runs.

Configure a queue for Vertex AI

The queue configuration for Vertex AI resources specify inputs to the CustomJob constructor in the Vertex AI Python SDK, and the run method of the CustomJob. Resource configurations are stored under the spec and run keys:

The spec key contains values for the named arguments of the CustomJob constructor in the Vertex AI Python SDK.
The run key contains values for the named arguments of the run method of the CustomJob class in the Vertex AI Python SDK.

Customizations of the execution environment happens primarily in the spec.worker_pool_specs list. A worker pool spec defines a group of workers that will run your job. The worker spec in the default config asks for a single n1-standard-4 machine with no accelerators. You can change the machine type, accelerator type and count to suit your needs.

For more information on available machine types and accelerator types, see the Vertex AI documentation.

Create a queue

Create a queue in the W&B App that uses Vertex AI as its compute resource:

Navigate to the Launch page.
Click on the Create Queue button.
Select the Entity you would like to create the queue in.
Provide a name for your queue in the Name field.
Select GCP Vertex as the Resource.
Within the Configuration field, provide information about your Vertex AI CustomJob you defined in the previous section. By default, W&B will populate a YAML and JSON request body similar to the following:

spec:
  worker_pool_specs:
    - machine_spec:
        machine_type: n1-standard-4
        accelerator_type: ACCELERATOR_TYPE_UNSPECIFIED
        accelerator_count: 0
      replica_count: 1
      container_spec:
        image_uri: ${image_uri}
  staging_bucket: <REQUIRED>
run:
  restart_job_on_worker_restart: false

After you configure your queue, click on the Create Queue button.

You must at minimum specify:

spec.worker_pool_specs : non-empty list of worker pool specifications.
spec.staging_bucket : GCS bucket to be used for staging Vertex AI assets and metadata.

Some of the Vertex AI docs show worker pool specifications with all keys in camel case,for example, workerPoolSpecs. The Vertex AI Python SDK uses snake case for these keys, for example worker_pool_specs.

Every key in the launch queue configuration should use snake case.

Configure a launch agent

The launch agent is configurable through a config file that is, by default, located at ~/.config/wandb/launch-config.yaml.

max_jobs: <n-concurrent-jobs>
queues:
  - <queue-name>

If you want the launch agent to build images for you that are executed in Vertex AI, see Advanced agent set up.

Set up agent permissions

There are multiple methods to authenticate as this service account. This can be achieved through Workload Identity, a downloaded service account JSON, environment variables, the Google Cloud Platform command-line tool, or a combination of these methods.

3.6 - Tutorial: Set up W&B Launch with Docker

The following guide describes how to configure W&B Launch to use Docker on a local machine for both the launch agent environment and for the queue’s target resource.

Using Docker to execute jobs and as the launch agent’s environment on the same local machine is particularly useful if your compute is installed on a machine that does not have a cluster management system (such as Kubernetes).

You can also use Docker queues to run workloads on powerful workstations.

This set up is common for users who perform experiments on their local machine, or that have a remote machine that they SSH in to, to submit launch jobs.

When you use Docker with W&B Launch, W&B will first build an image, and then build and run a container from that image. The image is built with the Docker docker run <image-uri> command. The queue configuration is interpreted as additional arguments that are passed to the docker run command.

Configure a Docker queue

The launch queue configuration (for a Docker target resource) accepts the same options defined in the docker run CLI command.

The agent receives options defined in the queue configuration. The agent then merges the received options with any overrides from the launch job’s configuration to produce a final docker run command that is executed on the target resource (in this case, a local machine).

There are two syntax transformations that take place:

Repeated options are defined in the queue configuration as a list.
Flag options are defined in the queue configuration as a Boolean with the value true.

For example, the following queue configuration:

{
  "env": ["MY_ENV_VAR=value", "MY_EXISTING_ENV_VAR"],
  "volume": "/mnt/datasets:/mnt/datasets",
  "rm": true,
  "gpus": "all"
}

Results in the following docker run command:

docker run \
  --env MY_ENV_VAR=value \
  --env MY_EXISTING_ENV_VAR \
  --volume "/mnt/datasets:/mnt/datasets" \
  --rm <image-uri> \
  --gpus all

Volumes can be specified either as a list of strings, or a single string. Use a list if you specify multiple volumes.

Docker automatically passes environment variables, that are not assigned a value, from the launch agent environment. This means that, if the launch agent has an environment variable MY_EXISTING_ENV_VAR, that environment variable is available in the container. This is useful if you want to use other config keys without publishing them in the queue configuration.

The --gpus flag of the docker run command allows you to specify GPUs that are available to a Docker container. For more information on how to use the gpus flag, see the Docker documentation.

Install the NVIDIA Container Toolkit to use GPUs within a Docker container.
If you build images from a code or artifact-sourced job, you can override the base image used by the agent to include the NVIDIA Container Toolkit. For example, within your launch queue, you can override the base image to tensorflow/tensorflow:latest-gpu:
```
{
  "builder": {
    "accelerator": {
      "base_image": "tensorflow/tensorflow:latest-gpu"
    }
  }
}
```

Create a queue

Create a queue that uses Docker as compute resource with the W&B CLI:

Navigate to the Launch page.
Click on the Create Queue button.
Select the Entity you would like to create the queue in.
Enter a name for your queue in the Name field.
Select Docker as the Resource.
Define your Docker queue configuration in the Configuration field.
Click on the Create Queue button to create the queue.

Configure a launch agent on a local machine

Configure the launch agent with a YAML config file named launch-config.yaml. By default, W&B will check for the config file in ~/.config/wandb/launch-config.yaml. You can optionally specify a different directory when you activate the launch agent.

You can use the W&B CLI to specify core configurable options for the launch agent (instead of the config YAML file): maximum number of jobs, W&B entity, and launch queues. See the wandb launch-agent command for more information.

Core agent config options

The following tabs demonstrate how to specify the core config agent options with the W&B CLI and with a YAML config file:

wandb launch-agent -q <queue-name> --max-jobs <n>

max_jobs: <n concurrent jobs>
queues:
	- <queue-name>

Docker image builders

The launch agent on your machine can be configured to build Docker images. By default, these images are stored on your machine’s local image repository. To enable your launch agent to build Docker images, set the builder key in the launch agent config to docker:

builder:
	type: docker

If you don’t want the agent to build Docker images, and instead use prebuilt images from a registry, set the builder key in the launch agent config to noop

builder:
  type: noop

Container registries

Launch uses external container registeries such as Dockerhub, Google Container Registry, Azure Container Registry, and Amazon ECR.
If you want to run a job on a different environment from where you built it, configure your agent to be able to pull from a container registry.

To learn more about how connect the launch agent with a cloud registry, see the Advanced agent setup page.

4 - Create and deploy jobs

4.1 - Add job to queue

The following page describes how to add launch jobs to a launch queue.

Ensure that you, or someone on your team, has already configured a launch queue. For more information, see the Set up Launch page.

Add jobs to your queue

Add jobs to your queue interactively with the W&B App or programmatically with the W&B CLI.

Add a job to your queue programmatically with the W&B App.

Navigate to your W&B Project Page.
Select the Jobs icon on the left panel:
The Jobs page displays a list of W&B launch jobs that were created from previously executed W&B runs.
Select the Launch button next to the name of the Job name. A modal will appear on the right side of the page.
From the Job version dropdown, select the version of the launch job you want to use. Launch jobs are versioned like any other W&B Artifact. Different versions of the same launch job will be created if you make modifications to the software dependencies or source code used to run the job.
Within the Overrides section, provide new values for any inputs that are configured for your launch job. Common overrides include a new entrypoint command, arguments, or values in the wandb.Run.config of your new W&B run.
You can copy and paste values from other W&B runs that used your launch job by clicking on the Paste from… button.
From the Queue dropdown, select the name of the launch queue you want to add your launch job to.
Use the Job Priority dropdown to specify the priority of your launch job. A launch job’s priority is set to “Medium” if the launch queue does not support prioritization.
(Optional) Follow this step only if a queue config template was created by your team admin
Within the Queue Configurations field, provide values for configuration options that were created by the admin of your team.
For example, in the following example, the team admin configured AWS instance types that can be used by the team. In this case, team members can pick either the ml.m4.xlarge or ml.p3.xlarge compute instance type to train their model.
Select the Destination project, where the resulting run will appear. This project needs to belong to the same entity as the queue.
Select the Launch now button.

Use the wandb launch command to add jobs to a queue. Create a JSON configuration with hyperparameter overrides. For example, using the script from the Quickstart guide, we create a JSON file with the following overrides:

{
  "overrides": {
      "args": [],
      "run_config": {
          "learning_rate": 0,
          "epochs": 0
      },   
      "entry_point": []
  }
}

W&B Launch will use the default parameters if you do not provide a JSON configuration file.

If you want to override the queue configuration, or if your launch queue does not have a configuration resource defined, you can specify the resource_args key in your config.json file. For example, following continuing the example above, your config.json file might look similar to the following:

{
  "overrides": {
      "args": [],
      "run_config": {
          "learning_rate": 0,
          "epochs": 0
      },
      "entry_point": []
  },
  "resource_args": {
        "<resource-type>" : {
            "<key>": "<value>"
        }
  }
}

Replace values within the <> with your own values.

Provide the name of the queue for the queue(-q) flag, the name of the job for the job(-j) flag, and the path to the configuration file for the config(-c) flag.

wandb launch -j <job> -q <queue-name> \ 
-e <entity-name> -c path/to/config.json

If you work within a W&B Team, we suggest you specify the entity flag (-e) to indicate which entity the queue will use.

4.2 - Create a launch job

Try in Colab

Launch jobs are blueprints for reproducing W&B runs. Jobs are W&B Artifacts that capture the source code, dependencies, and inputs required to execute a workload.

Create and run jobs with the wandb launch command.

To create a job without submitting it for execution, use the wandb job create command. See the command reference docs for more information.

Git jobs

You can create a Git-based job where code and other tracked assets are cloned from a certain commit, branch, or tag in a remote git repository with W&B Launch. Use the --uri or -u flag to specify the URI containing the code, along with optionally a --build-context flag to specify a subdirectory.

Run a “hello world” job from a git repository with the following command:

wandb launch --uri "https://github.com/wandb/launch-jobs.git" --build-context jobs/hello_world --dockerfile Dockerfile.wandb --project "hello-world" --job-name "hello-world" --entry-point "python job.py"

The command does the following:

Clones the W&B Launch jobs repository to a temporary directory.
Creates a job named hello-world-git in the hello project. The job is associated with the commit at the head of the default branch of the repository.
Builds a container image from the jobs/hello_world directory and the Dockerfile.wandb.
Starts the container and runs python job.py.

To build a job from a specific branch or commit hash, append the -g, --git-hash argument. For a full list of arguments, run wandb launch --help.

Remote URL format

The git remote associated with a Launch job can be either an HTTPS or an SSH URL. The URL type determines the protocol used to fetch job source code.

Remote URL Type	URL Format	Requirements for access and authentication
https	`https://github.com/organization/repository.git`	username and password to authenticate with the git remote
ssh	`git@github.com:organization/repository.git`	ssh key to authenticate with the git remote

Note that the exact URL format varies by hosting provider. Jobs created with wandb launch --uri will use the transfer protocol specified in the provided --uri.

Code artifact jobs

Jobs can be created from any source code stored in a W&B Artifact. Use a local directory with the --uri or -u argument to create a new code artifact and job.

To get started, create an empty directory and add a Python script named main.py with the following content:

import wandb

with wandb.init() as run:
    run.log({"metric": 0.5})

Add a file requirements.txt with the following content:

wandb>=0.17.1

Log the directory as a code artifact and launch a job with the following command:

wandb launch --uri . --job-name hello-world-code --project launch-quickstart --entry-point "python main.py"

The preceding command does the following:

Logs the current directory as a code artifact named hello-world-code.
Creates a job named hello-world-code in the launch-quickstart project.
Builds a container image from the current directory and Launch’s default Dockerfile. The default Dockerfile will install the requirements.txt file and set the entry point to python main.py.

Image jobs

Alternatively, you can build jobs off of pre-made Docker images. This is useful when you already have an established build system for your ML code, or when you don’t expect to adjust the code or requirements for the job but do want to experiment with hyperparameters or different infrastructure scales.

The image is pulled from a Docker registry and run with the specified entry point, or the default entry point if none is specified. Pass a full image tag to the --docker-image option to create and run a job from a Docker image.

To run a simple job from a pre-made image, use the following command:

wandb launch --docker-image "wandb/job_hello_world:main" --project "hello-world"

Automatic job creation

W&B will automatically create and track a job for any run with tracked source code, even if that run was not created with Launch. Runs are considered to have tracked source code if any of the three following conditions are met:

The run has an associated git remote and commit hash.
The run logged a code artifact. See Run.log_code.
The run was executed in a Docker container with the WANDB_DOCKER environment variable set to an image tag

The Git remote URL is inferred from the local git repository if your Launch job is created automatically by a W&B run.

Launch job names

By default, W&B automatically generates a job name for you. The name is generated depending on how the job is created (GitHub, code artifact, or Docker image). Alternatively, you can define a Launch job’s name with environment variables or with the W&B Python SDK.

The following table describes the job naming convention used by default based on job source:

Source	Naming convention
GitHub	`job-<git-remote-url>-<path-to-script>`
Code artifact	`job-<code-artifact-name>`
Docker image	`job-<image-name>`

Name your job with a W&B environment variable or with the W&B Python SDK

Set the WANDB_JOB_NAME environment variable to your preferred job name. For example:

WANDB_JOB_NAME=awesome-job-name

Define the name of your job with wandb.Settings. Then pass this object when you initialize W&B with wandb.init. For example:

settings = wandb.Settings(job_name="my-job-name")
wandb.init(settings=settings)

For docker image jobs, the version alias is automatically added as an alias to the job.

Containerization

Jobs are executed in a container. Image jobs use a pre-built Docker image, while Git and code artifact jobs require a container build step.

Job containerization can be customized with arguments to wandb launch and files within the job source code.

Build context

The term build context refers to the tree of files and directories that are sent to the Docker daemon to build a container image. By default, Launch uses the root of the job source code as the build context. To specify a subdirectory as the build context, use the --build-context argument of wandb launch when creating and launching a job.

The --build-context argument is particularly useful for working with Git jobs that refer to a monorepo with multiple projects. By specifying a subdirectory as the build context, you can build a container image for a specific project within the monorepo.

See the example above for a demonstration of how to use the --build-context argument with the official W&B Launch jobs repository.

Dockerfile

The Dockerfile is a text file that contains instructions for building a Docker image. By default, Launch uses a default Dockerfile that installs the requirements.txt file. To use a custom Dockerfile, specify the path to the file with the --dockerfile argument of wandb launch.

The Dockerfile path is specified relative to the build context. For example, if the build context is jobs/hello_world, and the Dockerfile is located in the jobs/hello_world directory, the --dockerfile argument should be set to Dockerfile.wandb. See the example above for a demonstration of how to use the --dockerfile argument with the official W&B Launch jobs repository.

Requirements file

If no custom Dockerfile is provided, Launch will look in the build context for Python dependencies to install. If a requirements.txt file is found at the root of the build context, Launch will install the dependencies listed in the file. Otherwise, if a pyproject.toml file is found, Launch will install dependencies from the project.dependencies section.

4.3 - Manage job inputs

The core experience of Launch is easily experimenting with different job inputs like hyperparameters and datasets, and routing these jobs to appropriate hardware. Once a job is created, users beyond the original author can adjust these inputs via the W&B GUI or CLI. For information on how job inputs can be set when launching from the CLI or UI, see the Enqueue jobs guide.

This section describes how to programmatically control the inputs that can be tweaked for a job.

By default, W&B jobs capture the entire Run.config as the inputs to a job, but the Launch SDK provides a function to control select keys in the run config or to specify JSON or YAML files as inputs.

Launch SDK functions require wandb-core. See the wandb-core README for more information.

Reconfigure the `Run` object

The Run object returned by wandb.init in a job can be reconfigured, by default. The Launch SDK provides a way to customize what parts of the Run.config object can be reconfigured when launching the job.

import wandb
from wandb.sdk import launch

# Required for launch sdk use.
wandb.require("core")

config = {
    "trainer": {
        "learning_rate": 0.01,
        "batch_size": 32,
        "model": "resnet",
        "dataset": "cifar10",
        "private": {
            "key": "value",
        },
    },
    "seed": 42,
}


with wandb.init(config=config):
    launch.manage_wandb_config(
        include=["trainer"], 
        exclude=["trainer.private"],
    )
    # Etc.

The function launch.manage_wandb_config configures the job to accept input values for the Run.config object. The optional include and exclude options take path prefixes within the nested config object. This can be useful if, for example, a job uses a library whose options you don’t want to expose to end users.

If include prefixes are provided, only paths within the config that match an include prefix will accept input values. If exclude prefixes are provided, no paths that match the exclude list will be filtered out of the input values. If a path matches both an include and an exclude prefix, the exclude prefix will take precedence.

In the preceding example, the path ["trainer.private"] will filter out the private key from the trainer object, and the path ["trainer"] will filter out all keys not under the trainer object.

Use a \-escaped . to filter out keys with a . in their name.

For example, r"trainer\.private" filters out the trainer.private key rather than the private key under the trainer object.

Note that the r prefix above denotes a raw string.

If the code above is packaged and run as a job, the input types of the job will be:

{
    "trainer": {
        "learning_rate": "float",
        "batch_size": "int",
        "model": "str",
        "dataset": "str",
    },
}

When launching the job from the W&B CLI or UI, the user will be able to override only the four trainer parameters.

Access run config inputs

Jobs launched with run config inputs can access the input values through the Run.config. The Run returned by wandb.init in the job code will have the input values automatically set. Use

from wandb.sdk import launch

run_config_overrides = launch.load_wandb_config()

to load the run config input values anywhere in the job code.

Reconfigure a file

The Launch SDK also provides a way to manage input values stored in config files in the job code. This is a common pattern in many deep learning and large language model use cases, like this torchtune example or this Axolotl config).

Sweeps on Launch does not support the use of config file inputs as sweep parameters. Sweep parameters must be controlled through the Run.config object.

The launch.manage_config_file function can be used to add a config file as an input to the Launch job, giving you access to edit values within the config file when launching the job.

By default, no run config inputs will be captured if launch.manage_config_file is used. Calling launch.manage_wandb_config overrides this behavior.

Consider the following example:

import yaml
import wandb
from wandb.sdk import launch

# Required for launch sdk use.
wandb.require("core")

launch.manage_config_file("config.yaml")

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

with wandb.init(config=config):
    # Etc.
    pass

Imagine the code is run with an adjacent file config.yaml:

learning_rate: 0.01
batch_size: 32
model: resnet
dataset: cifar10

The call to launch.manage_config_file will add the config.yaml file as an input to the job, making it reconfigurable when launching from the W&B CLI or UI.

The include and exclude keyword arugments may be used to filter the acceptable input keys for the config file in the same way as launch.manage_wandb_config.

Access config file inputs

When launch.manage_config_file is called in a run created by Launch, launch patches the contents of the config file with the input values. The patched config file is available in the job environment.

Call launch.manage_config_file before reading the config file in the job code to ensure input values are used.

Customize a job’s launch drawer UI

Defining a schema for a job’s inputs allows you to create a custom UI for launching the job. To define a job’s schema, include it in the call to launch.manage_wandb_config or launch.manage_config_file. The schema can either be a python dict in the form of a JSON Schema or a Pydantic model class.

Job input schemas are not used to validate inputs. They are only used to define the UI in the launch drawer.

The following example shows a schema with these properties:

seed, an integer
trainer, a dictionary with some keys specified:
- trainer.learning_rate, a float that must be greater than zero
- trainer.batch_size, an integer that must be either 16, 64, or 256
- trainer.dataset, a string that must be either cifar10 or cifar100

schema = {
    "type": "object",
    "properties": {
        "seed": {
          "type": "integer"
        }
        "trainer": {
            "type": "object",
            "properties": {
                "learning_rate": {
                    "type": "number",
                    "description": "Learning rate of the model",
                    "exclusiveMinimum": 0,
                },
                "batch_size": {
                    "type": "integer",
                    "description": "Number of samples per batch",
                    "enum": [16, 64, 256]
                },
                "dataset": {
                    "type": "string",
                    "description": "Name of the dataset to use",
                    "enum": ["cifar10", "cifar100"]
                }
            }
        }
    }
}

launch.manage_wandb_config(
    include=["seed", "trainer"], 
    exclude=["trainer.private"],
    schema=schema,
)

In general, the following JSON Schema attributes are supported:

Attribute	Required	Notes
`type`	Yes	Must be one of `number`, `integer`, `string`, or `object`
`title`	No	Overrides the property’s display name
`description`	No	Gives the property helper text
`enum`	No	Creates a dropdown select instead of a freeform text entry
`minimum`	No	Allowed only if `type` is `number` or `integer`
`maximum`	No	Allowed only if `type` is `number` or `integer`
`exclusiveMinimum`	No	Allowed only if `type` is `number` or `integer`
`exclusiveMaximum`	No	Allowed only if `type` is `number` or `integer`
`properties`	No	If `type` is `object`, used to define nested configurations

The following example shows a schema with these properties:

seed, an integer
trainer, a schema with some sub-attributes specified:
- trainer.learning_rate, a float that must be greater than zero
- trainer.batch_size, an integer that must be between 1 and 256, inclusive
- trainer.dataset, a string that must be either cifar10 or cifar100

class DatasetEnum(str, Enum):
    cifar10 = "cifar10"
    cifar100 = "cifar100"

class Trainer(BaseModel):
    learning_rate: float = Field(gt=0, description="Learning rate of the model")
    batch_size: int = Field(ge=1, le=256, description="Number of samples per batch")
    dataset: DatasetEnum = Field(title="Dataset", description="Name of the dataset to use")

class Schema(BaseModel):
    seed: int
    trainer: Trainer

launch.manage_wandb_config(
    include=["seed", "trainer"],
    exclude=["trainer.private"],
    schema=Schema,
)

You can also use an instance of the class:

t = Trainer(learning_rate=0.01, batch_size=32, dataset=DatasetEnum.cifar10)
s = Schema(seed=42, trainer=t)
launch.manage_wandb_config(
    include=["seed", "trainer"],
    exclude=["trainer.private"],
    input_schema=s,
)

Adding a job input schema will create a structured form in the launch drawer, making it easier to launch the job.

4.4 - Monitor launch queue

Use the interactive Queue monitoring dashboard to view when a launch queue is in heavy use or idle, visualize workloads that are running, and spot inefficient jobs. The launch queue dashboard is especially useful for deciding whether or not you are effectively using your compute hardware or cloud resources.

For deeper analysis, the page links to the W&B experiment tracking workspace and to external infrastructure monitoring providers like Datadog, NVIDIA Base Command, or cloud consoles.

Queue monitoring dashboards are currently available only in the W&B Multi-tenant Cloud deployment option.

Dashboard and plots

Use the Monitor tab to view the activity of a queue that occurred during the last seven days. Use the left panel to control time ranges, grouping, and filters.

The dashboard contains a number of plots answering common questions about performance and efficiency. The proceeding sections describe UI elements of queue dashboards.

Job status

The Job status plot shows how many jobs are running, pending, queued, or completed in each time interval. Use the Job status plot for identifying periods of idleness in the queue.

For example, suppose you have a fixed resource (such as DGX BasePod). If you observe an idle queue with the fixed resource, this might suggest an opportunity to run lower-priority pre-emptible launch jobs such as sweeps.

On the other hand, suppose you use a cloud resource and you see periodic bursts of activity. Periodic bursts of activity might suggest an opportunity to save money by reserving resources for particular times.

To the right of the plot is a key that shows which colors represent the status of a launch job.

Queued items might indicate opportunities to shift workloads to other queues. A spike in failures can identify users who might need help with their launch job setup.

Queued time

The Queued time plots shows the amount of time (in seconds) that a launch job was on a queue for a given date or time range.

The x-axis shows a time frame that you specify and the y-axis shows the time (in seconds) a launch job was on a launch queue. For example, suppose on a given day there are 10 launch jobs queued. The Queue time plot shows 600 seconds if those 10 launch jobs wait an average of 60 seconds each.

Use the Queued time plot to identify users affected by long queue times.

Customize the color of each job with the Grouping control in the left bar.

which can be particularly helpful for identifying which users and jobs are feeling the pain of scarce queue capacity.

Job runs

This plot shows the start and end of every job executed in a time period, with distinct colors for each run. This makes it easy to see at a glance what workloads the queue was processing at a given time.

Use the Select tool in the bottom right of the panel to brush over jobs to populate details in the table below.

CPU and GPU usage

Use the GPU use by a job, CPU use by a job, GPU memory by job, and System memory by job to view the efficiency of your launch jobs.

For example, you can use the GPU memory by job to view if a W&B run took a long time to complete and whether or not it used a low percentage of its CPU cores.

The x-axis of each plot shows the duration of a W&B run (created by a launch job) in seconds. Hover your mouse over a data point to view information about a W&B run such as the run ID, the project the run belongs to, the launch job that created the W&B run and more.

Errors

The Errors panel shows errors that occurred on a given launch queue. More specifically, the Errors panel shows a timestamp of when the error occurred, the name of the launch job where the error comes from, and the error message that was created. By default, errors are ordered from latest to oldest.

Use the Errors panel to identify and unblock users.

External links

The queue observability dashboard’s view is consistent across all queue types, but in many cases, it can be useful to jump directly into environment-specific monitors. To accomplish this, add a link from the console directly from the queue observability dashboard.

At the bottom of the page, click Manage Links to open a panel. Add the full URL of the page you want. Next, add a label. Links that you add appear in the External Links section.

4.5 - View launch jobs

The following page describes how to view information about launch jobs added to queues.

View jobs

View jobs added to a queue with the W&B App.

Navigate to the W&B App at https://wandb.ai/home.
Select Launch within the Applications section of the left sidebar.
Select the All entities dropdown and select the entity the launch job belongs to.
Expand the collapsible UI from the Launch Application page to view a list of jobs added to that specific queue.

A run is created when the launch agent executes a launch job. In other words, each run listed corresponds to a specific job that was added to that queue.

For example, the following image shows two runs that were created from a job called job-source-launch_demo-canonical. The job was added to a queue called Start queue. The first run listed in the queue called resilient-snowball and the second run listed is called earthy-energy-165.

Within the W&B App UI you can find additional information about runs created from launch jobs such as the:

Run: The name of the W&B run assigned to that job.
Job ID: The name of the job.
Project: The name of the project the run belongs to.
Status: The status of the queued run.
Author: The W&B entity that created the run.
Creation date: The timestamp when the queue was created.
Start time: The timestamp when the job started.
Duration: Time, in seconds, it took to complete the job’s run.

List jobs

View a list of jobs that exist within a project with the W&B CLI. Use the W&B job list command and provide the name of the project and entity the launch job belongs to the --project and --entity flags, respectively.

 wandb job list --entity your-entity --project project-name

Check the status of a job

The following table defines the status a queued run can have:

Status	Description
Idle	The run is in a queue with no active agents.
Queued	The run is in a queue waiting for an agent to process it.
Pending	The run has been picked up by an agent but has not yet started. This could be due to resources being unavailable on the cluster.
Running	The run is currently executing.
Killed	The job was killed by the user.
Crashed	The run stopped sending data or did not successfully start.
Failed	The run ended with a non-zero exit code or the run failed to start.
Finished	The job completed successfully.

5 - Create sweeps with W&B Launch

Discover how to automate hyperparamter sweeps on launch.

Try in Colab

Create a hyperparameter tuning job (sweeps) with W&B Launch. With sweeps on launch, a sweep scheduler is pushed to a Launch Queue with the specified hyperparameters to sweep over. The sweep scheduler starts as it is picked up by the agent, launching sweep runs onto the same queue with chosen hyperparameters. This continues until the sweep finishes or is stopped.

You can use the default W&B Sweep scheduling engine or implement your own custom scheduler:

Standard sweep scheduler: Use the default W&B Sweep scheduling engine that controls W&B Sweeps. The familiar bayes, grid, and random methods are available.
Custom sweep scheduler: Configure the sweep scheduler to run as a job. This option enables full customization. An example of how to extend the standard sweep scheduler to include more logging can be found in the section below.

This guide assumes that W&B Launch has been previously configured. If W&B Launch has is not configured, see the how to get started section of the launch documentation.

We recommend you create a sweep on launch using the ‘basic’ method if you are a first time users of sweeps on launch. Use a custom sweeps on launch scheduler when the standard W&B scheduling engine does not meet your needs.

Create a sweep with a W&B standard scheduler

Create W&B Sweeps with Launch. You can create a sweep interactively with the W&B App or programmatically with the W&B CLI. For advanced configurations of Launch sweeps, including the ability to customize the scheduler, use the CLI.

Before you create a sweep with W&B Launch, ensure that you first create a job to sweep over. See the Create a Job page for more information.

Create a sweep interactively with the W&B App.

Navigate to your W&B project on the W&B App.
Select the sweeps icon on the left panel (broom image).
Next, select the Create Sweep button.
Click the Configure Launch 🚀 button.
From the Job dropdown menu, select the name of your job and the job version you want to create a sweep from.
Select a queue to run the sweep on using the Queue dropdown menu.
Use the Job Priority dropdown to specify the priority of your launch job. A launch job’s priority is set to “Medium” if the launch queue does not support prioritization.
(Optional) Configure override args for the run or sweep scheduler. For example, using the scheduler overrides, configure the number of concurrent runs the scheduler manages using num_workers.
(Optional) Select a project to save the sweep to using the Destination Project dropdown menu.
Click Save
Select Launch Sweep.

Programmatically create a W&B Sweep with Launch with the W&B CLI.

Create a Sweep configuration
Specify the full job name within your sweep configuration
Initialize a sweep agent.

Steps 1 and 3 are the same steps you normally take when you create a W&B Sweep.

For example, in the following code snippet, we specify 'wandb/jobs/Hello World 2:latest' for the job value:

# launch-sweep-config.yaml

job: 'wandb/jobs/Hello World 2:latest'
description: sweep examples using launch jobs

method: bayes
metric:
  goal: minimize
  name: loss_metric
parameters:
  learning_rate:
    max: 0.02
    min: 0
    distribution: uniform
  epochs:
    max: 20
    min: 0
    distribution: int_uniform

# Optional scheduler parameters:

# scheduler:
#   num_workers: 1  # concurrent sweep runs
#   docker_image: <base image for the scheduler>
#   resource: <ie. local-container...>
#   resource_args:  # resource arguments passed to runs
#     env: 
#         - WANDB_API_KEY

# Optional Launch Params
# launch: 
#    registry: <registry for image pulling>

For information on how to create a sweep configuration, see the Define sweep configuration page.

Next, initialize a sweep. Provide the path to your config file, the name of your job queue, your W&B entity, and the name of the project.

wandb launch-sweep <path/to/yaml/file> --queue <queue_name> --entity <your_entity>  --project <project_name>

For more information on W&B Sweeps, see the Tune Hyperparameters chapter.

Create a custom sweep scheduler

Create a custom sweep scheduler either with the W&B scheduler or a custom scheduler.

Using scheduler jobs requires wandb cli version >= 0.15.4

Create a launch sweep using the W&B sweep scheduling logic as a job.

Identify the Wandb scheduler job in the public wandb/sweep-jobs project, or use the job name: 'wandb/sweep-jobs/job-wandb-sweep-scheduler:latest'
Construct a configuration yaml with an additional scheduler block that includes a job key pointing to this name, example below.
Use the wandb launch-sweep command with the new config.

Example config:

# launch-sweep-config.yaml  
description: Launch sweep config using a scheduler job
scheduler:
  job: wandb/sweep-jobs/job-wandb-sweep-scheduler:latest
  num_workers: 8  # allows 8 concurrent sweep runs

# training/tuning job that the sweep runs will execute
job: wandb/sweep-jobs/job-fashion-MNIST-train:latest
method: grid
parameters:
  learning_rate:
    min: 0.0001
    max: 0.1

Custom schedulers can be created by creating a scheduler-job. For the purposes of this guide we will be modifying the WandbScheduler to provide more logging.

Clone the wandb/launch-jobs repo (specifically: wandb/launch-jobs/jobs/sweep_schedulers)
Now, we can modify the wandb_scheduler.py to achieve our desired increased logging. Example: Add logging to the function _poll. This is called once every polling cycle (configurable timing), before we launch new sweep runs.
Run the modified file to create a job, with: python wandb_scheduler.py --project <project> --entity <entity> --name CustomWandbScheduler
Identify the name of the job created, either in the UI or in the output of the previous call, which will be a code-artifact job (unless otherwise specified).
Now create a sweep configuration where the scheduler points to your new job.

...
scheduler:
  job: '<entity>/<project>/job-CustomWandbScheduler:latest'
...

Optuna is a hyperparameter optimization framework that uses a variety of algorithms to find the best hyperparameters for a given model (similar to W&B). In addition to the sampling algorithms, Optuna also provides a variety of pruning algorithms that can be used to terminate poorly performing runs early. This is especially useful when running a large number of runs, as it can save time and resources. The classes are highly configurable, just pass in the expected parameters in the scheduler.settings.pruner/sampler.args block of the config file.

Create a launch sweep using Optuna’s scheduling logic with a job.

First, create your own job or use a pre-built Optuna scheduler image job.
- See the wandb/launch-jobs repo for examples on how to create your own job.
- To use a pre-built Optuna image, you can either navigate to job-optuna-sweep-scheduler in the wandb/sweep-jobs project or use can use the job name: wandb/sweep-jobs/job-optuna-sweep-scheduler:latest.
After you create a job, you can now create a sweep. Construct a sweep config that includes a scheduler block with a job key pointing to the Optuna scheduler job (example below).

  # optuna_config_basic.yaml
  description: A basic Optuna scheduler
  job: wandb/sweep-jobs/job-fashion-MNIST-train:latest
  run_cap: 5
  metric:
    name: epoch/val_loss
    goal: minimize

  scheduler:
    job: wandb/sweep-jobs/job-optuna-sweep-scheduler:latest
    resource: local-container  # required for scheduler jobs sourced from images
    num_workers: 2

    # optuna specific settings
    settings:
      pruner:
        type: PercentilePruner
        args:
          percentile: 25.0  # kill 75% of runs
          n_warmup_steps: 10  # pruning turned off for first x steps

  parameters:
    learning_rate:
      min: 0.0001
      max: 0.1

Lastly, launch the sweep to an active queue with the launch-sweep command:

wandb launch-sweep <config.yaml> -q <queue> -p <project> -e <entity>

For the exact implementation of the Optuna sweep scheduler job, see wandb/launch-jobs. For more examples of what is possible with the Optuna scheduler, check out wandb/examples.

Examples of what is possible with custom sweep scheduler jobs are available in the wandb/launch-jobs repo under jobs/sweep_schedulers. This guide shows how to use the publicly available Wandb Scheduler Job, as well demonstrates a process for creating custom sweep scheduler jobs.

How to resume sweeps on launch

It is also possible to resume a launch-sweep from a previously launched sweep. Although hyperparameters and the training job cannot be changed, scheduler-specific parameters can be, as well as the queue it is pushed to.

If the initial sweep used a training job with an alias like ’latest’, resuming can lead to different results if the latest job version has been changed since the last run.

Identify the sweep name/ID for a previously run launch sweep. The sweep ID is an eight character string (for example, hhd16935) that you can find in your project on the W&B App.
If you change the scheduler parameters, construct an updated config file.
In your terminal, execute the following command. Replace content wrapped in < and > with your information:

wandb launch-sweep <optional config.yaml> --resume_id <sweep id> --queue <queue_name>

6 - Launch FAQ

6.1 - Are there best practices for using Launch effectively?

Create the queue before starting the agent to enable easy configuration. Failure to do this results in errors that prevent the agent from functioning until a queue is added.
Create a W&B service account to initiate the agent, ensuring it is not linked to an individual user account.
Use wandb.Run.config to manage hyperparameters, allowing for overwriting during job re-runs. Refer to the configuration with argparse guide for details on using argparse.

6.2 - Can I specify a Dockerfile and let W&B build a Docker image for me?

This feature suits projects with stable requirements but frequently changing codebases.

Format your Dockerfile to use mounts. For further details, visit the Mounts documentation on the Docker Docs website.

After configuring the Dockerfile, specify it in one of three ways to W&B:

Use Dockerfile.wandb
Use W&B CLI
Use W&B App

Include a Dockerfile.wandb file in the same directory as the W&B run’s entrypoint. W&B utilizes this file instead of the built-in Dockerfile.

Use the --dockerfile flag with the wandb launch command to queue a job:

wandb launch --dockerfile path/to/Dockerfile

When adding a job to a queue in the W&B App, provide the Dockerfile path in the Overrides section. Enter it as a key-value pair with "dockerfile" as the key and the path to the Dockerfile as the value.

The following JSON demonstrates how to include a Dockerfile in a local directory:

{
  "args": [],
  "run_config": {
    "lr": 0,
    "batch_size": 0,
    "epochs": 0
  },
  "entrypoint": [],
  "dockerfile": "./Dockerfile"
}

6.3 - Can Launch automatically provision (and spin down) compute resources for me in the target environment?

This process depends on the environment. Resources provision in Amazon SageMaker and Vertex. In Kubernetes, autoscalers automatically adjust resources based on demand. Solution Architects at W&B assist in configuring Kubernetes infrastructure to enable retries, autoscaling, and the use of spot instance node pools. For support, contact support@wandb.com or use your shared Slack channel.

6.4 - Can you specify secrets for jobs/automations? For instance, an API key which you do not wish to be directly visible to users?

Yes. Follow these steps:

Create a Kubernetes secret in the designated namespace for the runs using the command:
kubectl create secret -n <namespace> generic <secret_name> <secret_value>
After creating the secret, configure the queue to inject the secret when runs start. Only cluster administrators can view the secret; end users cannot see it.

6.5 - Does Launch support parallelization? How can I limit the resources consumed by a job?

Launch supports scaling jobs across multiple GPUs and nodes. Refer to the Volcano integration guide for details.

Each launch agent is configured with a max_jobs parameter, which determines the maximum number of simultaneous jobs it can run. Multiple agents can point to a single queue as long as they connect to an appropriate launching infrastructure.

You can set limits on CPU, GPU, memory, and other resources at the queue or job run level in the resource configuration. For information on setting up queues with resource limits on Kubernetes, see the Kubernetes setup guide.

For sweeps, include the following block in the queue configuration to limit the number of concurrent runs:

  scheduler:
    num_workers: 4

6.6 - How can admins restrict which users have modify access?

Control access to certain queue fields for users who are not team administrators through queue config templates. Team administrators define which fields non-admin users can view, and set the editing limits. Only team administrators have the ability to create or edit queues.

6.7 - How do I control who can push to a queue?

Queues are specific to a user team. Define the owning entity during queue creation. To restrict access, modify team membership.

6.8 - How do I fix a "permission denied" error in Launch?

If you encounter the error message Launch Error: Permission denied, it indicates insufficient permissions to log to the desired project. Possible causes include:

You are not logged in on this machine. Run wandb login in the command line.
The specified entity does not exist. The entity must be your username or an existing team’s name. Create a team if necessary with the Subscriptions page.
You lack project permissions. Request the project creator to change the privacy setting to Open to allow logging runs to the project.

6.9 - How do I make W&B Launch work with Tensorflow on GPU?

For TensorFlow jobs using GPUs, specify a custom base image for the container build. This ensures proper GPU utilization during runs. Add an image tag under the builder.accelerator.base_image key in the resource configuration. For example:

{
    "gpus": "all",
    "builder": {
        "accelerator": {
            "base_image": "tensorflow/tensorflow:latest-gpu"
        }
    }
}

In versions prior to W&B 0.15.6, use cuda instead of accelerator as the parent key for base_image.

6.10 - How does W&B Launch build images?

The steps for building an image depend on the job source and the specified accelerator base image in the resource configuration.

When configuring a queue or submitting a job, include a base accelerator image in the queue or job resource configuration:

{
    "builder": {
        "accelerator": {
            "base_image": "image-name"
        }
    }
}

The build process includes the following actions based on the job type and provided accelerator base image:

6.11 - I do not like clicking- can I use Launch without going through the UI?

Yes. The standard wandb CLI includes a launch subcommand to launch jobs. For more information, run:

wandb launch --help

6.12 - I do not want W&B to build a container for me, can I still use Launch?

To launch a pre-built Docker image, execute the following command. Replace the placeholders in the <> with your specific information:

wandb launch -d <docker-image-uri> -q <queue-name> -E <entrypoint>

This command creates a job and starts a run.

To create a job from an image, use the following command:

wandb job create image <image-name> -p <project> -e <entity>

6.13 - Is `wandb launch -d` or `wandb job create image` uploading a whole docker artifact and not pulling from a registry?

No, the wandb launch -d command does not upload images to a registry. Upload images to a registry separately. Follow these steps:

Build an image.
Push the image to a registry.

The workflow is as follows:

docker build -t <repo-url>:<tag> .
docker push <repo-url>:<tag>
wandb launch -d <repo-url>:<tag>

The launch agent then spins up a job pointing to the specified container. See Advanced agent setup for examples on configuring agent access to pull images from a container registry.

For Kubernetes, ensure that the Kubernetes cluster pods have access to the registry where the image is pushed.

6.14 - What permissions does the agent require in Kubernetes?

The following Kubernetes manifest creates a role named wandb-launch-agent in the wandb namespace. This role allows the agent to create pods, configmaps, secrets, and access pod logs in the wandb namespace. The wandb-cluster-role enables the agent to create pods, access pod logs, create secrets, jobs, and check job status across any specified namespace.

6.15 - What requirements does the accelerator base image have?

For jobs utilizing an accelerator, provide a base image that includes the necessary accelerator components. Ensure the following requirements for the accelerator image:

Compatibility with Debian (the Launch Dockerfile uses apt-get to install Python)
Supported CPU and GPU hardware instruction set (confirm the CUDA version compatibility with the intended GPU)
Compatibility between the supplied accelerator version and the packages in the machine learning algorithm
Installation of packages that require additional steps for hardware compatibility

6.16 - When multiple jobs in a Docker queue download the same artifact, is any caching used, or is it re-downloaded every run?

No caching exists. Each launch job operates independently. Configure the queue or agent to mount a shared cache using Docker arguments in the queue configuration.

Additionally, mount the W&B artifacts cache as a persistent volume for specific use cases.

7 - Launch integration guides

7.1 - Dagster

Guide on how to integrate W&B with Dagster.

Use Dagster and W&B (W&B) to orchestrate your MLOps pipelines and maintain ML assets. The integration with W&B makes it easy within Dagster to:

Create and use a W&B Artifact.
Use and create Registered Models in W&B Registry.
Run training jobs on dedicated compute using W&B Launch.
Use the wandb client in ops and assets.

The W&B Dagster integration provides a W&B-specific Dagster resource and IO Manager:

wandb_resource: a Dagster resource used to authenticate and communicate to the W&B API.
wandb_artifacts_io_manager: a Dagster IO Manager used to consume W&B Artifacts.

The following guide demonstrates how to satisfy prerequisites to use W&B in Dagster, how to create and use W&B Artifacts in ops and assets, how to use W&B Launch and recommended best practices.

Before you get started

You will need the following resources to use Dagster within Weights and Biases:

W&B API Key.
W&B entity (user or team): An entity is a username or team name where you send W&B Runs and Artifacts. Make sure to create your account or team entity in the W&B App UI before you log runs. If you do not specify ain entity, the run will be sent to your default entity, which is usually your username. Change your default entity in your settings under Project Defaults.
W&B project: The name of the project where W&B Runs are stored.

Find your W&B entity by checking the profile page for that user or team in the W&B App. You can use a pre-existing W&B project or create a new one. New projects can be created on the W&B App homepage or on user/team profile page. If a project does not exist it will be automatically created when you first use it. The proceeding instructions demonstrate how to get an API key:

How to get an API key

Log in to W&B. Note: if you are using W&B Server ask your admin for the instance host name.
Collect your API key by navigating to the authorize page or in your user/team settings. For a production environment we recommend using a service account to own that key.
Set an environment variable for that API key export WANDB_API_KEY=YOUR_KEY.

The proceeding examples demonstrate where to specify your API key in your Dagster code. Make sure to specify your entity and project name within the wandb_config nested dictionary. You can pass different wandb_config values to different ops/assets if you want to use a different W&B Project. For more information about possible keys you can pass, see the Configuration section below.

Example: configuration for @job

# add this to your config.yaml
# alternatively you can set the config in Dagit's Launchpad or JobDefinition.execute_in_process
# Reference: https://docs.dagster.io/concepts/configuration/config-schema#specifying-runtime-configuration
resources:
 wandb_config:
   config:
     entity: my_entity # replace this with your W&B entity
     project: my_project # replace this with your W&B project


@job(
   resource_defs={
       "wandb_config": make_values_resource(
           entity=str,
           project=str,
       ),
       "wandb_resource": wandb_resource.configured(
           {"api_key": {"env": "WANDB_API_KEY"}}
       ),
       "io_manager": wandb_artifacts_io_manager,
   }
)
def simple_job_example():
   my_op()

Example: configuration for @repository using assets

from dagster_wandb import wandb_artifacts_io_manager, wandb_resource
from dagster import (
   load_assets_from_package_module,
   make_values_resource,
   repository,
   with_resources,
)

from . import assets

@repository
def my_repository():
   return [
       *with_resources(
           load_assets_from_package_module(assets),
           resource_defs={
               "wandb_config": make_values_resource(
                   entity=str,
                   project=str,
               ),
               "wandb_resource": wandb_resource.configured(
                   {"api_key": {"env": "WANDB_API_KEY"}}
               ),
               "wandb_artifacts_manager": wandb_artifacts_io_manager.configured(
                   {"cache_duration_in_minutes": 60} # only cache files for one hour
               ),
           },
           resource_config_by_key={
               "wandb_config": {
                   "config": {
                       "entity": "my_entity", # replace this with your W&B entity
                       "project": "my_project", # replace this with your W&B project
                   }
               }
           },
       ),
   ]

Note that we are configuring the IO Manager cache duration in this example contrary to the example for @job.

Configuration

The proceeding configuration options are used as settings on the W&B-specific Dagster resource and IO Manager provided by the integration.

wandb_resource: Dagster resource used to communicate with the W&B API. It automatically authenticates using the provided API key. Properties:
- api_key: (str, required): a W&B API key necessary to communicate with the W&B API.
- host: (str, optional): the API host server you wish to use. Only required if you are using W&B Server. It defaults to the Public Cloud host, https://api.wandb.ai.
wandb_artifacts_io_manager: Dagster IO Manager to consume W&B Artifacts. Properties:
- base_dir: (int, optional) Base directory used for local storage and caching. W&B Artifacts and W&B Run logs will be written and read from that directory. By default, it’s using the DAGSTER_HOME directory.
- cache_duration_in_minutes: (int, optional) to define the amount of time W&B Artifacts and W&B Run logs should be kept in the local storage. Only files and directories that were not opened for that amount of time are removed from the cache. Cache purging happens at the end of an IO Manager execution. You can set it to 0, if you want to turn off caching completely. Caching improves speed when an Artifact is reused between jobs running on the same machine. It defaults to 30 days.
- run_id: (str, optional): A unique ID for this run, used for resuming. It must be unique in the project, and if you delete a run you can’t reuse the ID. Use the name field for a short descriptive name, or config for saving hyperparameters to compare across runs. The ID cannot contain the following special characters: /\#?%:.. You need to set the Run ID when you are doing experiment tracking inside Dagster to allow the IO Manager to resume the run. By default it’s set to the Dagster Run ID e.g 7e4df022-1bf2-44b5-a383-bb852df4077e.
- run_name: (str, optional) A short display name for this run to help you identify this run in the UI. By default, it is a string with the following format: dagster-run-[8 first characters of the Dagster Run ID]. For example, dagster-run-7e4df022.
- run_tags: (list[str], optional): A list of strings, which will populate the list of tags on this run in the UI. Tags are useful for organizing runs together, or applying temporary labels like baseline or production. It’s easy to add and remove tags in the UI, or filter down to just runs with a specific tag. Any W&B Run used by the integration will have the dagster_wandb tag.

Use W&B Artifacts

The integration with W&B Artifact relies on a Dagster IO Manager.

IO Managers are user-provided objects that are responsible for storing the output of an asset or op and loading it as input to downstream assets or ops. For example, an IO Manager might store and load objects from files on a filesystem.

The integration provides an IO Manager for W&B Artifacts. This allows any Dagster @op or @asset to create and consume W&B Artifacts natively. Here’s a simple example of an @asset producing a W&B Artifact of type dataset containing a Python list.

@asset(
    name="my_artifact",
    metadata={
        "wandb_artifact_arguments": {
            "type": "dataset",
        }
    },
    io_manager_key="wandb_artifacts_manager",
)
def create_dataset():
    return [1, 2, 3] # this will be stored in an Artifact

You can annotate your @op, @asset and @multi_asset with a metadata configuration in order to write Artifacts. Similarly you can also consume W&B Artifacts even if they were created outside Dagster.

Write W&B Artifacts

Before continuing, we recommend you to have a good understanding of how to use W&B Artifacts. Consider reading the Guide on Artifacts.

Return an object from a Python function to write a W&B Artifact. The following objects are supported by W&B:

Python objects (int, dict, list…)
W&B objects (Table, Image, Graph…)
W&B Artifact objects

The proceeding examples demonstrate how to write W&B Artifacts with Dagster assets (@asset):

Anything that can be serialized with the pickle module is pickled and added to an Artifact created by the integration. The content is unpickled when you read that Artifact inside Dagster (see Read artifacts for more details).

@asset(
    name="my_artifact",
    metadata={
        "wandb_artifact_arguments": {
            "type": "dataset",
        }
    },
    io_manager_key="wandb_artifacts_manager",
)
def create_dataset():
    return [1, 2, 3]

W&B supports multiple Pickle-based serialization modules (pickle, dill, cloudpickle, joblib). You can also use more advanced serialization like ONNX or PMML. Please refer to the Serialization section for more information.

Any W&B object, such as a Table or Image, is added to an Artifact created by the integration. This example adds a Table to an Artifact:

import wandb

@asset(
    name="my_artifact",
    metadata={
        "wandb_artifact_arguments": {
            "type": "dataset",
        }
    },
    io_manager_key="wandb_artifacts_manager",
)
def create_dataset_in_table():
    return wandb.Table(columns=["a", "b", "c"], data=[[1, 2, 3]])

For complex use cases, it might be necessary to build your own Artifact object. The integration still provides useful additional features like augmenting the metadata on both sides of the integration.

import wandb

MY_ASSET = "my_asset"

@asset(
    name=MY_ASSET,
    io_manager_key="wandb_artifacts_manager",
)
def create_artifact():
   artifact = wandb.Artifact(MY_ASSET, "dataset")
   table = wandb.Table(columns=["a", "b", "c"], data=[[1, 2, 3]])
   artifact.add(table, "my_table")
   return artifact

Configuration

A configuration dictionary called wandb_artifact_configuration can be set on an @op, @asset and @multi_asset. This dictionary must be passed in the decorator arguments as metadata. This configuration is required to control the IO Manager reads and writes of W&B Artifacts.

For @op, it’s located in the output metadata through the Out metadata argument. For @asset, it’s located in the metadata argument on the asset. For @multi_asset, it’s located in each output metadata through the AssetOut metadata arguments.

The proceeding code examples demonstrate how to configure a dictionary on an @op, @asset and @multi_asset computations:

Example for @op:

@op(
   out=Out(
       metadata={
           "wandb_artifact_configuration": {
               "name": "my_artifact",
               "type": "dataset",
           }
       }
   )
)
def create_dataset():
   return [1, 2, 3]

Example for @asset:

@asset(
   name="my_artifact",
   metadata={
       "wandb_artifact_configuration": {
           "type": "dataset",
       }
   },
   io_manager_key="wandb_artifacts_manager",
)
def create_dataset():
   return [1, 2, 3]

You do not need to pass a name through the configuration because the @asset already has a name. The integration sets the Artifact name as the asset name.

Example for @multi_asset:

@multi_asset(
   name="create_datasets",
   outs={
       "first_table": AssetOut(
           metadata={
               "wandb_artifact_configuration": {
                   "type": "training_dataset",
               }
           },
           io_manager_key="wandb_artifacts_manager",
       ),
       "second_table": AssetOut(
           metadata={
               "wandb_artifact_configuration": {
                   "type": "validation_dataset",
               }
           },
           io_manager_key="wandb_artifacts_manager",
       ),
   },
   group_name="my_multi_asset_group",
)
def create_datasets():
   first_table = wandb.Table(columns=["a", "b", "c"], data=[[1, 2, 3]])
   second_table = wandb.Table(columns=["d", "e"], data=[[4, 5]])

   return first_table, second_table

Supported properties:

name: (str) human-readable name for this artifact, which is how you can identify this artifact in the UI or reference it in use_artifact calls. Names can contain letters, numbers, underscores, hyphens, and dots. The name must be unique across a project. Required for @op.
type: (str) The type of the artifact, which is used to organize and differentiate artifacts. Common types include dataset or model, but you can use any string containing letters, numbers, underscores, hyphens, and dots. Required when the output is not already an Artifact.
description: (str) Free text that offers a description of the artifact. The description is markdown rendered in the UI, so this is a good place to place tables, links, etc.
aliases: (list[str]) An array containing one or more aliases you want to apply on the Artifact. The integration will also add the “latest” tag to that list whether it’s set or not. This is an effective way for you to manage versioning of models and datasets.
add_dirs: (list[dict[str, Any]]): An array containing configuration for each local directory to include in the Artifact.
add_files: (list[dict[str, Any]]): An array containing configuration for each local file to include in the Artifact.
add_references: (list[dict[str, Any]]): An array containing configuration for each external reference to include in the Artifact.
serialization_module: (dict) Configuration of the serialization module to be used. Refer to the Serialization section for more information.
- name: (str) Name of the serialization module. Accepted values: pickle, dill, cloudpickle, joblib. The module needs to be available locally.
- parameters: (dict[str, Any]) Optional arguments passed to the serialization function. It accepts the same parameters as the dump method for that module. For example, {"compress": 3, "protocol": 4}.

Advanced example:

@asset(
   name="my_advanced_artifact",
   metadata={
       "wandb_artifact_configuration": {
           "type": "dataset",
           "description": "My *Markdown* description",
           "aliases": ["my_first_alias", "my_second_alias"],
           "add_dirs": [
               {
                   "name": "My directory",
                   "local_path": "path/to/directory",
               }
           ],
           "add_files": [
               {
                   "name": "validation_dataset",
                   "local_path": "path/to/data.json",
               },
               {
                   "is_tmp": True,
                   "local_path": "path/to/temp",
               },
           ],
           "add_references": [
               {
                   "uri": "https://picsum.photos/200/300",
                   "name": "External HTTP reference to an image",
               },
               {
                   "uri": "s3://my-bucket/datasets/mnist",
                   "name": "External S3 reference",
               },
           ],
       }
   },
   io_manager_key="wandb_artifacts_manager",
)
def create_advanced_artifact():
   return [1, 2, 3]

The asset is materialized with useful metadata on both sides of the integration:

W&B side: the source integration name and version, the python version used, the pickle protocol version and more.
Dagster side:
- Dagster Run ID
- W&B Run: ID, name, path, URL
- W&B Artifact: ID, name, type, version, size, URL
- W&B Entity
- W&B Project

The proceeding image demonstrates the metadata from W&B that was added to the Dagster asset. This information would not be available without the integration.

The following image demonstrates how the provided configuration was enriched with useful metadata on the W&B Artifact. This information should help for reproducibility and maintenance. It would not be available without the integration.

If you use a static type checker like mypy, import the configuration type definition object using:

from dagster_wandb import WandbArtifactConfiguration

Using partitions

The integration natively supports Dagster partitions.

The following is an example with a partitioned using DailyPartitionsDefinition.

@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2023-01-01", end_date="2023-02-01"),
    name="my_daily_partitioned_asset",
    compute_kind="wandb",
    metadata={
        "wandb_artifact_configuration": {
            "type": "dataset",
        }
    },
)
def create_my_daily_partitioned_asset(context):
    partition_key = context.asset_partition_key_for_output()
    context.log.info(f"Creating partitioned asset for {partition_key}")
    return random.randint(0, 100)

This code will produce one W&B Artifact for each partition. View artifacts in the Artifact panel (UI) under the asset name, which has the partition key appended. For example, my_daily_partitioned_asset.2023-01-01, my_daily_partitioned_asset.2023-01-02, ormy_daily_partitioned_asset.2023-01-03. Assets that are partitioned across multiple dimensions shows each dimension in dot-delimited format. For example, my_asset.car.blue.

The integration does not allow for the materialization of multiple partitions within one run. You will need to carry out multiple runs to materialize your assets. This can be executed in Dagit when you’re materializing your assets.

Advanced usage

Read W&B Artifacts

Reading W&B Artifacts is similar to writing them. A configuration dictionary called wandb_artifact_configuration can be set on an @op or @asset. The only difference is that we must set the configuration on the input instead of the output.

For @op, it’s located in the input metadata through the In metadata argument. You need to explicitly pass the name of the Artifact.

For @asset, it’s located in the input metadata through the Asset In metadata argument. You should not pass an Artifact name because the name of the parent asset should match it.

If you want to have a dependency on an Artifact created outside the integration you will need to use SourceAsset. It will always read the latest version of that asset.

The following examples demonstrate how to read an Artifact from various ops.

Reading an artifact from an @op

@op(
   ins={
       "artifact": In(
           metadata={
               "wandb_artifact_configuration": {
                   "name": "my_artifact",
               }
           }
       )
   },
   io_manager_key="wandb_artifacts_manager"
)
def read_artifact(context, artifact):
   context.log.info(artifact)

Reading an artifact created by another @asset

@asset(
   name="my_asset",
   ins={
       "artifact": AssetIn(
           # if you don't want to rename the input argument you can remove 'key'
           key="parent_dagster_asset_name",
           input_manager_key="wandb_artifacts_manager",
       )
   },
)
def read_artifact(context, artifact):
   context.log.info(artifact)

Reading an Artifact created outside Dagster:

my_artifact = SourceAsset(
   key=AssetKey("my_artifact"),  # the name of the W&B Artifact
   description="Artifact created outside Dagster",
   io_manager_key="wandb_artifacts_manager",
)


@asset
def read_artifact(context, my_artifact):
   context.log.info(my_artifact)

Configuration

The proceeding configuration is used to indicate what the IO Manager should collect and provide as inputs to the decorated functions. The following read patterns are supported.

To get an named object contained within an Artifact use get:

@asset(
   ins={
       "table": AssetIn(
           key="my_artifact_with_table",
           metadata={
               "wandb_artifact_configuration": {
                   "get": "my_table",
               }
           },
           input_manager_key="wandb_artifacts_manager",
       )
   }
)
def get_table(context, table):
   context.log.info(table.get_column("a"))

To get the local path of a downloaded file contained within an Artifact use get_path:

@asset(
   ins={
       "path": AssetIn(
           key="my_artifact_with_file",
           metadata={
               "wandb_artifact_configuration": {
                   "get_path": "name_of_file",
               }
           },
           input_manager_key="wandb_artifacts_manager",
       )
   }
)
def get_path(context, path):
   context.log.info(path)

To get the entire Artifact object (with the content downloaded locally):

@asset(
   ins={
       "artifact": AssetIn(
           key="my_artifact",
           input_manager_key="wandb_artifacts_manager",
       )
   },
)
def get_artifact(context, artifact):
   context.log.info(artifact.name)

Supported properties

get: (str) Gets the W&B object located at the artifact relative name.
get_path: (str) Gets the path to the file located at the artifact relative name.

Serialization configuration

By default, the integration will use the standard pickle module, but some objects are not compatible with it. For example, functions with yield will raise an error if you try to pickle them.

We support more Pickle-based serialization modules (dill, cloudpickle, joblib). You can also use more advanced serialization like ONNX or PMML by returning a serialized string or creating an Artifact directly. The right choice will depend on your use case, please refer to the available literature on this subject.

Pickle-based serialization modules

Pickling is known to be insecure. If security is a concern please only use W&B objects. We recommend signing your data and storing the hash keys in your own systems. For more complex use cases don’t hesitate to contact us, we will be happy to help.

You can configure the serialization used through the serialization_module dictionary in the wandb_artifact_configuration. Please make sure the module is available on the machine running Dagster.

The integration will automatically know which serialization module to use when you read that Artifact.

The currently supported modules are pickle, dill, cloudpickle, and joblib.

Here’s a simplified example where we create a “model” serialized with joblib and then use it for inference.

@asset(
    name="my_joblib_serialized_model",
    compute_kind="Python",
    metadata={
        "wandb_artifact_configuration": {
            "type": "model",
            "serialization_module": {
                "name": "joblib"
            },
        }
    },
    io_manager_key="wandb_artifacts_manager",
)
def create_model_serialized_with_joblib():
    # This is not a real ML model but this would not be possible with the pickle module
    return lambda x, y: x + y

@asset(
    name="inference_result_from_joblib_serialized_model",
    compute_kind="Python",
    ins={
        "my_joblib_serialized_model": AssetIn(
            input_manager_key="wandb_artifacts_manager",
        )
    },
    metadata={
        "wandb_artifact_configuration": {
            "type": "results",
        }
    },
    io_manager_key="wandb_artifacts_manager",
)
def use_model_serialized_with_joblib(
    context: OpExecutionContext, my_joblib_serialized_model
):
    inference_result = my_joblib_serialized_model(1, 2)
    context.log.info(inference_result)  # Prints: 3
    return inference_result

Advanced serialization formats (ONNX, PMML)

It’s common to use interchange file formats like ONNX and PMML. The integration supports those formats but it requires a bit more work than for Pickle-based serialization.

There are two different methods to use those formats.

Convert your model to the selected format, then return the string representation of that format as if it were a normal Python objects. The integration will pickle that string. You can then rebuild your model using that string.
Create a new local file with your serialized model, then build a custom Artifact with that file using the add_file configuration.

Here’s an example of a Scikit-learn model being serialized using ONNX.

import numpy
import onnxruntime as rt
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from dagster import AssetIn, AssetOut, asset, multi_asset

@multi_asset(
    compute_kind="Python",
    outs={
        "my_onnx_model": AssetOut(
            metadata={
                "wandb_artifact_configuration": {
                    "type": "model",
                }
            },
            io_manager_key="wandb_artifacts_manager",
        ),
        "my_test_set": AssetOut(
            metadata={
                "wandb_artifact_configuration": {
                    "type": "test_set",
                }
            },
            io_manager_key="wandb_artifacts_manager",
        ),
    },
    group_name="onnx_example",
)
def create_onnx_model():
    # Inspired from https://onnx.ai/sklearn-onnx/

    # Train a model.
    iris = load_iris()
    X, y = iris.data, iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    clr = RandomForestClassifier()
    clr.fit(X_train, y_train)

    # Convert into ONNX format
    initial_type = [("float_input", FloatTensorType([None, 4]))]
    onx = convert_sklearn(clr, initial_types=initial_type)

    # Write artifacts (model + test_set)
    return onx.SerializeToString(), {"X_test": X_test, "y_test": y_test}

@asset(
    name="experiment_results",
    compute_kind="Python",
    ins={
        "my_onnx_model": AssetIn(
            input_manager_key="wandb_artifacts_manager",
        ),
        "my_test_set": AssetIn(
            input_manager_key="wandb_artifacts_manager",
        ),
    },
    group_name="onnx_example",
)
def use_onnx_model(context, my_onnx_model, my_test_set):
    # Inspired from https://onnx.ai/sklearn-onnx/

    # Compute the prediction with ONNX Runtime
    sess = rt.InferenceSession(my_onnx_model)
    input_name = sess.get_inputs()[0].name
    label_name = sess.get_outputs()[0].name
    pred_onx = sess.run(
        [label_name], {input_name: my_test_set["X_test"].astype(numpy.float32)}
    )[0]
    context.log.info(pred_onx)
    return pred_onx

Using partitions

The integration natively supports Dagster partitions.

You can selectively read one, multiple, or all partitions of an asset.

All partitions are provided in a dictionary, with the key and value representing the partition key and the Artifact content, respectively.

It reads all partitions of the upstream @asset, which are given as a dictionary. In this dictionary, the key and value correlate to the partition key and the Artifact content, respectively.

@asset(
    compute_kind="wandb",
    ins={"my_daily_partitioned_asset": AssetIn()},
    output_required=False,
)
def read_all_partitions(context, my_daily_partitioned_asset):
    for partition, content in my_daily_partitioned_asset.items():
        context.log.info(f"partition={partition}, content={content}")

The AssetIn’s partition_mapping configuration allows you to choose specific partitions. In this case, we are employing the TimeWindowPartitionMapping.

@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2023-01-01", end_date="2023-02-01"),
    compute_kind="wandb",
    ins={
        "my_daily_partitioned_asset": AssetIn(
            partition_mapping=TimeWindowPartitionMapping(start_offset=-1)
        )
    },
    output_required=False,
)
def read_specific_partitions(context, my_daily_partitioned_asset):
    for partition, content in my_daily_partitioned_asset.items():
        context.log.info(f"partition={partition}, content={content}")

The configuration object, metadata, is used to configure how Weights & Biases (wandb) interacts with different artifact partitions in your project.

The object metadata contains a key named wandb_artifact_configuration which further contains a nested object partitions.

The partitions object maps the name of each partition to its configuration. The configuration for each partition can specify how to retrieve data from it. These configurations can contain different keys, namely get, version, and alias, depending on the requirements of each partition.

Configuration keys

get: The get key specifies the name of the W&B Object (Table, Image…) where to fetch the data.
version: The version key is used when you want to fetch a specific version for the Artifact.
alias: The alias key allows you to get the Artifact by its alias.

Wildcard configuration

The wildcard "*" stands for all non-configured partitions. This provides a default configuration for partitions that are not explicitly mentioned in the partitions object.

For example,

"*": {
    "get": "default_table_name",
},

This configuration means that for all partitions not explicitly configured, data is fetched from the table named default_table_name.

Specific partition configuration

You can override the wildcard configuration for specific partitions by providing their specific configurations using their keys.

For example,

"yellow": {
    "get": "custom_table_name",
},

This configuration means that for the partition named yellow, data will be fetched from the table named custom_table_name, overriding the wildcard configuration.

Versioning and aliasing

For versioning and aliasing purposes, you can provide specific version and alias keys in your configuration.

For versions,

"orange": {
    "version": "v0",
},

This configuration will fetch data from the version v0 of the orange Artifact partition.

For aliases,

"blue": {
    "alias": "special_alias",
},

This configuration will fetch data from the table default_table_name of the Artifact partition with the alias special_alias (referred to as blue in the configuration).

Advanced usage

To view advanced usage of the integration please refer to the following full code examples:

Using W&B Launch

Beta product in active development Interested in Launch? Reach out to your account team to talk about joining the customer pilot program for W&B Launch. Pilot customers need to use AWS EKS or SageMaker to qualify for the beta program. We ultimately plan to support additional platforms.

Before continuing, we recommend you to have a good understanding of how to use W&B Launch. Consider reading the Guide on Launch.

The Dagster integration helps with:

Running one or multiple Launch agents in your Dagster instance.
Executing local Launch jobs within your Dagster instance.
Remote Launch jobs on-prem or in a cloud.

Launch agents

The integration provides an importable @op called run_launch_agent. It starts a Launch Agent and runs it as a long running process until stopped manually.

Agents are processes that poll launch queues and execute the jobs (or dispatch them to external services to be executed) in order.

Refer to the Launch page.

You can also view useful descriptions for all properties in Launchpad.

Simple example

# add this to your config.yaml
# alternatively you can set the config in Dagit's Launchpad or JobDefinition.execute_in_process
# Reference: https://docs.dagster.io/concepts/configuration/config-schema#specifying-runtime-configuration
resources:
 wandb_config:
   config:
     entity: my_entity # replace this with your W&B entity
     project: my_project # replace this with your W&B project
ops:
 run_launch_agent:
   config:
     max_jobs: -1
     queues: 
       - my_dagster_queue

from dagster_wandb.launch.ops import run_launch_agent
from dagster_wandb.resources import wandb_resource

from dagster import job, make_values_resource

@job(
   resource_defs={
       "wandb_config": make_values_resource(
           entity=str,
           project=str,
       ),
       "wandb_resource": wandb_resource.configured(
           {"api_key": {"env": "WANDB_API_KEY"}}
       ),
   },
)
def run_launch_agent_example():
   run_launch_agent()

Launch jobs

The integration provides an importable @op called run_launch_job. It executes your Launch job.

A Launch job is assigned to a queue in order to be executed. You can create a queue or use the default one. Make sure you have an active agent listening to that queue. You can run an agent inside your Dagster instance but can also consider using a deployable agent in Kubernetes.

Refer to the Launch page.

You can also view useful descriptions for all properties in Launchpad.

Simple example

# add this to your config.yaml
# alternatively you can set the config in Dagit's Launchpad or JobDefinition.execute_in_process
# Reference: https://docs.dagster.io/concepts/configuration/config-schema#specifying-runtime-configuration
resources:
 wandb_config:
   config:
     entity: my_entity # replace this with your W&B entity
     project: my_project # replace this with your W&B project
ops:
 my_launched_job:
   config:
     entry_point:
       - python
       - train.py
     queue: my_dagster_queue
     uri: https://github.com/wandb/example-dagster-integration-with-launch


from dagster_wandb.launch.ops import run_launch_job
from dagster_wandb.resources import wandb_resource

from dagster import job, make_values_resource


@job(resource_defs={
       "wandb_config": make_values_resource(
           entity=str,
           project=str,
       ),
       "wandb_resource": wandb_resource.configured(
           {"api_key": {"env": "WANDB_API_KEY"}}
       ),
   },
)
def run_launch_job_example():
   run_launch_job.alias("my_launched_job")() # we rename the job with an alias

Best practices

Use the IO Manager to read and write Artifacts. Avoid using Artifact.download() or Run.log_artifact() directly. Those methods are handled by integration. Instead, return the data you want to store in the Artifact and let the integration do the rest. This approach provides better lineage for the Artifact.
Only build an Artifact object yourself for complex use cases. Python objects and W&B objects should be returned from your ops/assets. The integration handles bundling the Artifact. For complex use cases, you can build an Artifact directly in a Dagster job. We recommend you pass an Artifact object to the integration for metadata enrichment such as the source integration name and version, the python version used, the pickle protocol version and more.
Add files, directories and external references to your Artifacts through the metadata. Use the integration wandb_artifact_configuration object to add any file, directory or external references (Amazon S3, GCS, HTTP…). See the advanced example in the Artifact configuration section for more information.
Use an @asset instead of an @op when an Artifact is produced. Artifacts are assets. It is recommended to use an asset when Dagster maintains that asset. This will provide better observability in the Dagit Asset Catalog.
Use a SourceAsset to consume an Artifact created outside Dagster. This allows you to take advantage of the integration to read externally created Artifacts. Otherwise, you can only use Artifacts created by the integration.
Use W&B Launch to orchestrate training on dedicated compute for large models. You can train small models inside your Dagster cluster and you can run Dagster in a Kubernetes cluster with GPU nodes. We recommend using W&B Launch for large model training. This will prevent overloading your instance and provide access to more adequate compute.
When experiment tracking within Dagster, set your W&B Run ID to the value of your Dagster Run ID. We recommend that you both: make the Run resumable and set the W&B Run ID to the Dagster Run ID or to a string of your choice. Following this recommendation ensures your W&B metrics and W&B Artifacts are stored in the same W&B Run when you train models inside of Dagster.

Either set the W&B Run ID to the Dagster Run ID.

wandb.init(
    id=context.run_id,
    resume="allow",
    ...
)

Or choose your own W&B Run ID and pass it to the IO Manager configuration.

wandb.init(
    id="my_resumable_run_id",
    resume="allow",
    ...
)

@job(
   resource_defs={
       "io_manager": wandb_artifacts_io_manager.configured(
           {"wandb_run_id": "my_resumable_run_id"}
       ),
   }
)

Only collect data you need with get or get_path for large W&B Artifacts. By default, the integration will download an entire Artifact. If you are using very large artifacts you might want to only collect the specific files or objects you need. This will improve speed and resource utilization.
For Python objects adapt the pickling module to your use case. By default, the W&B integration will use the standard pickle module. But some objects are not compatible with it. For example, functions with yield will raise an error if you try to pickle them. W&B supports other Pickle-based serialization modules (dill, cloudpickle, joblib).

You can also use more advanced serialization like ONNX or PMML by returning a serialized string or creating an Artifact directly. The right choice will depend on your use case, refer to the available literature on this subject.

7.2 - Launch multinode jobs with Volcano

This tutorial will guide you through the process of launching multinode training jobs with W&B and Volcano on Kubernetes.

Overview

In this tutorial, you will learn how to use W&B Launch to run multinode jobs on Kubernetes. The steps we will follow are:

Ensure that you have a Weights & Biases account and a Kubernetes cluster.
Create a launch queue for our volcano jobs.
Deploy a Launch agent into our kubernetes cluster.
Create a distributed training job.
Launch our distributed training.

Prerequisites

Before you get started, you will need:

A Weights & Biases account
A Kubernetes cluster

Create a launch queue

The first step is to create a launch queue. Head to wandb.ai/launch and in the top right corner of your screen, hit the blue Create a queue button. A queue creation drawer will slide out from the right side of your screen. Select an entity, enter a name, and select Kubernetes as the type for your queue.

In the configuration section, we will enter a volcano job template. Any runs launched from this queue will be created using this job specification, so you can modify this configuration as needed to customize your jobs.

This configuration block can accept a Kubernetes job specification, volcano job specification, or any other custom resource definition (CRD) that you are interested in launching. You can make use of macros in the configuration block to dynamically set the contents of this spec.

In this tutorial, we will use a configuration for multinode pytorch training that makes use of volcano’s pytorch plugin. You can copy and paste the following config as YAML or JSON:

kind: Job
spec:
  tasks:
    - name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      replicas: 1
      template:
        spec:
          containers:
            - name: master
              image: ${image_uri}
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    - name: worker
      replicas: 1
      template:
        spec:
          containers:
            - name: worker
              image: ${image_uri}
              workingDir: /home
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
  plugins:
    pytorch:
      - --master=master
      - --worker=worker
      - --port=23456
  minAvailable: 1
  schedulerName: volcano
metadata:
  name: wandb-job-${run_id}
  labels:
    wandb_entity: ${entity_name}
    wandb_project: ${project_name}
  namespace: wandb
apiVersion: batch.volcano.sh/v1alpha1

{
  "kind": "Job",
  "spec": {
    "tasks": [
      {
        "name": "master",
        "policies": [
          {
            "event": "TaskCompleted",
            "action": "CompleteJob"
          }
        ],
        "replicas": 1,
        "template": {
          "spec": {
            "containers": [
              {
                "name": "master",
                "image": "${image_uri}",
                "imagePullPolicy": "IfNotPresent"
              }
            ],
            "restartPolicy": "OnFailure"
          }
        }
      },
      {
        "name": "worker",
        "replicas": 1,
        "template": {
          "spec": {
            "containers": [
              {
                "name": "worker",
                "image": "${image_uri}",
                "workingDir": "/home",
                "imagePullPolicy": "IfNotPresent"
              }
            ],
            "restartPolicy": "OnFailure"
          }
        }
      }
    ],
    "plugins": {
      "pytorch": [
        "--master=master",
        "--worker=worker",
        "--port=23456"
      ]
    },
    "minAvailable": 1,
    "schedulerName": "volcano"
  },
  "metadata": {
    "name": "wandb-job-${run_id}",
    "labels": {
      "wandb_entity": "${entity_name}",
      "wandb_project": "${project_name}"
    },
    "namespace": "wandb"
  },
  "apiVersion": "batch.volcano.sh/v1alpha1"
}

Click the Create queue button at the bottom of the drawer to finish creating your queue.

Install Volcano

To install Volcano in your Kubernetes cluster, you can follow the official installation guide.

Deploy your launch agent

Now that you have created a queue, you will need to deploy a launch agent to pull and execute jobs from the queue. The easiest way to do this is with the launch-agent chart from W&B’s official helm-charts repository. Follow the instructions in the README to install the chart into your Kubernetes cluster, and be sure to configure the agent to poll the queue you created earlier.

Create a training job

Volcano’s pytorch plugin automatically configures the necessary environment variables for pytorch DPP to work, such as MASTER_ADDR, RANK, and WORLD_SIZE, as long as your pytorch code uses DDP correctly. Refer to pytorch’s documentation for more details on how to use DDP in your custom python code.

Volcano’s pytorch plugin is also compatible with multinode training via the PyTorch Lightning Trainer.

Launch 🚀

Now that our queue and cluster are set up, it’s time to launch some distributed training. To start off with we will use a job trains a simple multi-layer perceptron on random data using volcano’s pytorch plugin. You can find the source code for the job here.

To launch this job, head to the job’s page and click the Launch button in the top right corner of the screen. You will be prompted to select a queue to launch the job from.

Set the jobs parameters however you like,
Select the queue you created earlier.
Modify the volcano job in the Resource config section to modify the parameters of your job. For example, you can change the number of workers by changing the replicas field in the worker task.
Click Launch 🚀

You can monitor the progress and if necessary stop your job from the W&B UI.

7.3 - NVIDIA NeMo Inference Microservice Deploy Job

Deploy a model artifact from W&B to a NVIDIA NeMo Inference Microservice. To do this, use W&B Launch. W&B Launch converts model artifacts to NVIDIA NeMo Model and deploys to a running NIM/Triton server.

W&B Launch currently accepts the following compatible model types:

Llama2
StarCoder
NV-GPT (coming soon)

Deployment time varies by model and machine type. The base Llama2-7b config takes about 1 minute on GCP’s a2-ultragpu-1g.

Quickstart

Create a launch queue if you don’t have one already. See an example queue config below.

net: host
gpus: all # can be a specific set of GPUs or `all` to use everything
runtime: nvidia # also requires nvidia container runtime
volume:
  - model-store:/model-store/

Create this job in your project:

wandb job create -n "deploy-to-nvidia-nemo-inference-microservice" \
   -e $ENTITY \
   -p $PROJECT \
   -E jobs/deploy_to_nvidia_nemo_inference_microservice/job.py \
   -g andrew/nim-updates \
   git https://github.com/wandb/launch-jobs

Launch an agent on your GPU machine:

wandb launch-agent -e $ENTITY -p $PROJECT -q $QUEUE

Submit the deployment launch job with your desired configs from the Launch UI

You can also submit via the CLI:

wandb launch -d gcr.io/playground-111/deploy-to-nemo:latest \
  -e $ENTITY \
  -p $PROJECT \
  -q $QUEUE \
  -c $CONFIG_JSON_FNAME

You can track the deployment process in the Launch UI.

Once complete, you can immediately curl the endpoint to test the model. The model name is always ensemble.

 #!/bin/bash
 curl -X POST "http://0.0.0.0:9999/v1/completions" \
     -H "accept: application/json" \
     -H "Content-Type: application/json" \
     -d '{
         "model": "ensemble",
         "prompt": "Tell me a joke",
         "max_tokens": 256,
         "temperature": 0.5,
         "n": 1,
         "stream": false,
         "stop": "string",
         "frequency_penalty": 0.0
         }'

7.4 - Spin up a single node GPU cluster with Minikube

Set up W&B Launch on a Minikube cluster that can schedule and run GPU workloads.

This tutorial is intended to guide users with direct access to a machine that has multiple GPUs. This tutorial is not intended for users who rent a cloud machine.

W&B recommends you create a Kubernetes cluster with GPU support that uses your cloud provider, if you want to set up a minikube cluster on a cloud machine. For example, AWS, GCP, Azure, Coreweave, and other cloud providers have tools to create Kubernetes clusters with GPU support.

W&B recommends you use a Launch Docker queue if you want to set up a minikube cluster for scheduling GPUs on a machine that has a single GPU. You can still follow the tutorial for fun, but the GPU scheduling will not be very useful.

Background

The Nvidia container toolkit has made it easy to run GPU-enabled workflows on Docker. One limitation is a lack of native support for scheduling GPU by volume. If you want to use a GPU with the docker run command you must either request specific GPU by ID or all GPU present, which makes many distributed GPU enabled workloads impractical. Kubernetes offers support for scheduling by a volume request, but setting up a local Kubernetes cluster with GPU scheduling can take considerable time and effort, until recently. Minikube, one of the most popular tools for running single node Kubernetes clusters, recently released support for GPU scheduling 🎉 In this tutorial, we will create a Minikube cluster on a multi-GPU machine and launch concurrent stable diffusion inference jobs to the cluster using W&B Launch 🚀

Prerequisites

Before getting started, you will need:

A W&B account.
Linux machine with the following installed and running:
1. Docker runtime
2. Drivers for any GPU you want to use
3. Nvidia container toolkit

For testing and creating this tutorial, we used an n1-standard-16 Google Cloud Compute Engine instance with 4 NVIDIA Tesla T4 GPU connected.

Create a queue for launch jobs

First, create a launch queue for our launch jobs.

Navigate to wandb.ai/launch (or <your-wandb-url>/launch if you use a private W&B server).
In the top right corner of your screen, click the blue Create a queue button. A queue creation drawer will slide out from the right side of your screen.
Select an entity, enter a name, and select Kubernetes as the type for your queue.
The Config section of the drawer is where you will enter a Kubernetes job specification for the launch queue. Any runs launched from this queue will be created using this job specification, so you can modify this configuration as needed to customize your jobs. For this tutorial, you can copy and paste the sample config below in your queue config as YAML or JSON:

spec:
  template:
    spec:
      containers:
        - image: ${image_uri}
          resources:
            limits:
              cpu: 4
              memory: 12Gi
              nvidia.com/gpu: '{{gpus}}'
      restartPolicy: Never
  backoffLimit: 0

{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "image": "${image_uri}",
            "resources": {
              "limits": {
                "cpu": 4,
                "memory": "12Gi",
                "nvidia.com/gpu": "{{gpus}}"
              }
            }
          }
        ],
        "restartPolicy": "Never"
      }
    },
    "backoffLimit": 0
  }
}

For more information about queue configurations, see the Set up Launch on Kubernetes and the Advanced queue setup guide.

The ${image_uri} and {{gpus}} strings are examples of the two kinds of variable templates that you can use in your queue configuration. The ${image_uri} template will be replaced with the image URI of the job you are launching by the agent. The {{gpus}} template will be used to create a template variable that you can override from the launch UI, CLI, or SDK when submitting a job. These values are placed in the job specification so that they will modify the correct fields to control the image and GPU resources used by the job.

Click the Parse configuration button to begin customizing your gpus template variable.
Set the Type to Integer and the Default, Min, and Max to values of your choosing. Attempts to submit a run to this queue which violate the constraints of the template variable will be rejected.

Click Create queue to create your queue. You will be redirected to the queue page for your new queue.

In the next section, we will set up an agent that can pull and execute jobs from the queue you created.

Setup Docker + NVIDIA CTK

If you already have Docker and the Nvidia container toolkit setup on your machine, you can skip this section.

Refer to Docker’s documentation for instructions on setting up the Docker container engine on your system.

Once you have Docker installed, install the Nvidia container toolkit following the instructions in Nvidia’s documentation.

To validate that your container runtime has access to your GPU, you can run:

docker run --gpus all ubuntu nvidia-smi

You should see nvidia-smi output describing the GPU connected to your machine. For example, on our setup the output looks like this:

Wed Nov  8 23:25:53 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:00:06.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:00:07.0 Off |                    0 |
| N/A   39C    P8     9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Setup Minikube

Minikube’s GPU support requires version v1.32.0 or later. Refer to Minikube’s install documentation for up to date installation help. For this tutorial, we installed the latest Minikube release using the command:

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

The next step is to start a minikube cluster using your GPU. On your machine, run:

minikube start --gpus all

The output of the command above will indicate whether a cluster has been successfully created.

Start launch agent

The launch agent for your new cluster can either be started by invoking wandb launch-agent directly or by deploying the launch agent using a helm chart managed by W&B.

In this tutorial we will run the agent directly on our host machine.

Running the agent outside of a container also means we can use the local Docker host to build images for our cluster to run.

To run the agent locally, make sure your default Kubernetes API context refers to the Minikube cluster. Then, execute the following:

pip install "wandb[launch]"

to install the agent’s dependencies. To setup authentication for the agent, run wandb login or set the WANDB_API_KEY environment variable.

To start the agent, execute this command:

wandb launch-agent -j <max-number-concurrent-jobs> -q <queue-name> -e <queue-entity>

Within your terminal you should see the launch agent start to print polling message.

Congratulations, you have a launch agent polling your launch queue. When a job is added to your queue, your agent will pick it up and schedule it to run on your Minikube cluster.

Launch a job

Let’s send a job to our agent. You can launch a simple “hello world” from a terminal logged into your W&B account with:

wandb launch -d wandb/job_hello_world:main -p <target-wandb-project> -q <your-queue-name> -e <your-queue-entity>

You can test with any job or image you like, but make sure your cluster can pull your image. See Minikube’s documentation for additional guidance. You can also test using one of our public jobs.

(Optional) Model and data caching with NFS

For ML workloads we will often want multiple jobs to have access to the same data. For example, you might want to have a shared cache to avoid repeatedly downloading large assets like datasets or model weights. Kubernetes supports this through persistent volumes and persistent volume claims. Persistent volumes can be used to create volumeMounts in our Kubernetes workloads, providing direct filesystem access to the shared cache.

In this step, we will set up a network file system (NFS) server that can be used as a shared cache for model weights. The first step is to install and configure NFS. This process varies by operating system. Since our VM is running Ubuntu, we installed nfs-kernel-server and configured an export at /srv/nfs/kubedata:

sudo apt-get install nfs-kernel-server
sudo mkdir -p /srv/nfs/kubedata
sudo chown nobody:nogroup /srv/nfs/kubedata
sudo sh -c 'echo "/srv/nfs/kubedata *(rw,sync,no_subtree_check,no_root_squash,no_all_squash,insecure)" >> /etc/exports'
sudo exportfs -ra
sudo systemctl restart nfs-kernel-server

Keep note of the export location of the server in your host filesystem, as well as the local IP address of your NFS server. You need this information in the next step.

Next, you will need to create a persistent volume and persistent volume claim for this NFS. Persistent volumes are highly customizable, but we will use straightforward configuration here for the sake of simplicity.

Copy the yaml below into a file named nfs-persistent-volume.yaml , making sure to fill out your desired volume capacity and claim request. The PersistentVolume.spec.capcity.storage field controls the maximum size of the underlying volume. The PersistentVolumeClaim.spec.resources.requests.stroage can be used to limit the volume capacity allotted for a particular claim. For our use case, it makes sense to use the same value for each.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 100Gi # Set this to your desired capacity.
  accessModes:
    - ReadWriteMany
  nfs:
    server: <your-nfs-server-ip> # TODO: Fill this in.
    path: '/srv/nfs/kubedata' # Or your custom path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi # Set this to your desired capacity.
  storageClassName: ''
  volumeName: nfs-pv

Create the resources in your cluster with:

kubectl apply -f nfs-persistent-volume.yaml

In order for our runs to make use of this cache, we will need to add volumes and volumeMounts to our launch queue config. To edit the launch config, head back to wandb.ai/launch (or <your-wandb-url>/launch for users on wandb server), find your queue, click to the queue page, and then click the Edit config tab. The original config can be modified to:

spec:
  template:
    spec:
      containers:
        - image: ${image_uri}
          resources:
            limits:
              cpu: 4
              memory: 12Gi
              nvidia.com/gpu: "{{gpus}}"
					volumeMounts:
            - name: nfs-storage
              mountPath: /root/.cache
      restartPolicy: Never
			volumes:
        - name: nfs-storage
          persistentVolumeClaim:
            claimName: nfs-pvc
  backoffLimit: 0

{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "image": "${image_uri}",
            "resources": {
              "limits": {
                "cpu": 4,
                "memory": "12Gi",
                "nvidia.com/gpu": "{{gpus}}"
              },
              "volumeMounts": [
                {
                  "name": "nfs-storage",
                  "mountPath": "/root/.cache"
                }
              ]
            }
          }
        ],
        "restartPolicy": "Never",
        "volumes": [
          {
            "name": "nfs-storage",
            "persistentVolumeClaim": {
              "claimName": "nfs-pvc"
            }
          }
        ]
      }
    },
    "backoffLimit": 0
  }
}

Now, our NFS will be mounted at /root/.cache in the containers running our jobs. The mount path will require adjustment if your container runs as a user other than root. Huggingface’s libraries and W&B Artifacts both make use of $HOME/.cache/ by default, so downloads should only happen once.

Playing with stable diffusion

To test out our new system, we are going to experiment with stable diffusion’s inference parameters. To run a simple stable diffusion inference job with a default prompt and sane parameters, you can run:

wandb launch -d wandb/job_stable_diffusion_inference:main -p <target-wandb-project> -q <your-queue-name> -e <your-queue-entity>

The command above will submit the container image wandb/job_stable_diffusion_inference:main to your queue. Once your agent picks up the job and schedules it for execution on your cluster, it may take a while for the image to be pulled, depending on your connection. You can follow the status of the job on the queue page on wandb.ai/launch (or <your-wandb-url>/launch for users on wandb server).

Once the run has finished, you should have a job artifact in the project you specified. You can check your project’s job page (<project-url>/jobs) to find the job artifact. Its default name should be job-wandb_job_stable_diffusion_inference but you can change that to whatever you like on the job’s page by clicking the pencil icon next to the job name.

You can now use this job to run more stable diffusion inference on your cluster. From the job page, we can click the Launch button in the top right hand corner to configure a new inference job and submit it to our queue. The job configuration page will be pre-populated with the parameters from the original run, but you can change them to whatever you like by modifying their values in the Overrides section of the launch drawer.