Create and deploy jobs

1 - Add job to queue

The following page describes how to add launch jobs to a launch queue.

Add jobs to your queue

Add jobs to your queue interactively with the W&B App or programmatically with the W&B CLI.

Add a job to your queue interactively with the W&B App.

  1. Navigate to your W&B Project Page.
  2. Select the Jobs icon on the left panel:
  3. The Jobs page displays a list of W&B launch jobs that were created from previously executed W&B runs.
  4. Select the Launch button next to the name of the job. A modal will appear on the right side of the page.
  5. From the Job version dropdown, select the version of the launch job you want to use. Launch jobs are versioned like any other W&B Artifact. Different versions of the same launch job will be created if you make modifications to the software dependencies or source code used to run the job.
  6. Within the Overrides section, provide new values for any inputs that are configured for your launch job. Common overrides include a new entrypoint command, arguments, or values in the wandb.config of your new W&B run.
    You can copy and paste values from other W&B runs that used your launch job by clicking on the Paste from… button.
  7. From the Queue dropdown, select the name of the launch queue you want to add your launch job to.
  8. Use the Job Priority dropdown to specify the priority of your launch job. A launch job’s priority is set to “Medium” if the launch queue does not support prioritization.
  9. (Optional) Follow this step only if a queue config template was created by your team admin
    Within the Queue Configurations field, provide values for configuration options that were created by the admin of your team.
    For example, suppose the team admin configured the AWS instance types the team can use. In this case, team members can pick either the ml.m4.xlarge or ml.p3.xlarge compute instance type to train their model.
  10. Select the Destination project, where the resulting run will appear. This project needs to belong to the same entity as the queue.
  11. Select the Launch now button.

Use the wandb launch command to add jobs to a queue. Create a JSON configuration with hyperparameter overrides. For example, using the script from the Quickstart guide, we create a JSON file with the following overrides:

{
  "overrides": {
      "args": [],
      "run_config": {
          "learning_rate": 0,
          "epochs": 0
      },   
      "entry_point": []
  }
}

If you want to override the queue configuration, or if your launch queue does not have a configuration resource defined, you can specify the resource_args key in your config.json file. For example, continuing the example above, your config.json file might look similar to the following:

{
  "overrides": {
      "args": [],
      "run_config": {
          "learning_rate": 0,
          "epochs": 0
      },
      "entry_point": []
  },
  "resource_args": {
        "<resource-type>" : {
            "<key>": "<value>"
        }
  }
}

Replace values within the <> with your own values.
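
If you prefer to generate the configuration file programmatically, the following is a minimal Python sketch; the file name and override values are placeholders taken from the example above:

import json

# Placeholder overrides based on the example above; replace them with the
# inputs your launch job actually accepts.
config = {
    "overrides": {
        "args": [],
        "run_config": {"learning_rate": 0, "epochs": 0},
        "entry_point": [],
    },
    # Optional: only include resource_args if you want to override the
    # queue configuration for your resource type.
    "resource_args": {"<resource-type>": {"<key>": "<value>"}},
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)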

Provide the name of the queue with the --queue (-q) flag, the name of the job with the --job (-j) flag, and the path to the configuration file with the --config (-c) flag.

wandb launch -j <job> -q <queue-name> \
  -e <entity-name> -c path/to/config.json

If you work within a W&B Team, we suggest you specify the entity flag (-e) to indicate which entity the queue will use.

2 - Create a launch job

Launch jobs are blueprints for reproducing W&B runs. Jobs are W&B Artifacts that capture the source code, dependencies, and inputs required to execute a workload.

Create and run jobs with the wandb launch command.

Git jobs

With W&B Launch, you can create a Git-based job where code and other tracked assets are cloned from a given commit, branch, or tag in a remote git repository. Use the --uri or -u flag to specify the URI containing the code, and optionally the --build-context flag to specify a subdirectory as the build context.

Run a “hello world” job from a git repository with the following command:

wandb launch --uri "https://github.com/wandb/launch-jobs.git" \
  --build-context jobs/hello_world --dockerfile Dockerfile.wandb \
  --project "hello-world" --job-name "hello-world" \
  --entry-point "python job.py"

The command does the following:

  1. Clones the W&B Launch jobs repository to a temporary directory.
  2. Creates a job named hello-world in the hello-world project, as specified by the --job-name and --project flags. The job is associated with the commit at the head of the default branch of the repository.
  3. Builds a container image from the jobs/hello_world directory and the Dockerfile.wandb.
  4. Starts the container and runs python job.py.

To build a job from a specific branch or commit hash, pass the --git-hash (-g) argument. For a full list of arguments, run wandb launch --help.

Remote URL format

The git remote associated with a Launch job can be either an HTTPS or an SSH URL. The URL type determines the protocol used to fetch job source code.

| Remote URL Type | URL Format | Requirements for access and authentication |
| --- | --- | --- |
| https | https://github.com/organization/repository.git | username and password to authenticate with the git remote |
| ssh | git@github.com:organization/repository.git | ssh key to authenticate with the git remote |

Note that the exact URL format varies by hosting provider. Jobs created with wandb launch --uri will use the transfer protocol specified in the provided --uri.

Code artifact jobs

Jobs can be created from any source code stored in a W&B Artifact. Use a local directory with the --uri or -u argument to create a new code artifact and job.

To get started, create an empty directory and add a Python script named main.py with the following content:

import wandb

with wandb.init() as run:
    run.log({"metric": 0.5})

Add a file requirements.txt with the following content:

wandb>=0.17.1

Log the directory as a code artifact and launch a job with the following command:

wandb launch --uri . --job-name hello-world-code --project launch-quickstart --entry-point "python main.py"

The preceding command does the following:

  1. Logs the current directory as a code artifact named hello-world-code.
  2. Creates a job named hello-world-code in the launch-quickstart project.
  3. Builds a container image from the current directory and Launch’s default Dockerfile. The default Dockerfile will install the requirements.txt file and set the entry point to python main.py.

Image jobs

Alternatively, you can build jobs off of pre-made Docker images. This is useful when you already have an established build system for your ML code, or when you don’t expect to adjust the code or requirements for the job but do want to experiment with hyperparameters or different infrastructure scales.

The image is pulled from a Docker registry and run with the specified entry point, or the default entry point if none is specified. Pass a full image tag to the --docker-image option to create and run a job from a Docker image.

To run a simple job from a pre-made image, use the following command:

wandb launch --docker-image "wandb/job_hello_world:main" --project "hello-world"           

Automatic job creation

W&B will automatically create and track a job for any run with tracked source code, even if that run was not created with Launch. Runs are considered to have tracked source code if any of the following three conditions are met:

  • The run has an associated git remote and commit hash
  • The run logged a code artifact (see Run.log_code for more information)
  • The run was executed in a Docker container with the WANDB_DOCKER environment variable set to an image tag

The Git remote URL is inferred from the local git repository if your Launch job is created automatically by a W&B run.
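
For example, to satisfy the second condition you can log your source code explicitly with Run.log_code. The following is a minimal sketch; the project name is a placeholder:

import wandb

with wandb.init(project="my-project") as run:
    # Log the current directory as a code artifact so that W&B can
    # automatically create a job from this run.
    run.log_code(".")
    run.log({"metric": 0.5})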

Launch job names

By default, W&B automatically generates a job name for you. The name is generated depending on how the job is created (GitHub, code artifact, or Docker image). Alternatively, you can define a Launch job’s name with environment variables or with the W&B Python SDK.

The following table describes the job naming convention used by default based on job source:

| Source | Naming convention |
| --- | --- |
| GitHub | job-<git-remote-url>-<path-to-script> |
| Code artifact | job-<code-artifact-name> |
| Docker image | job-<image-name> |

Name your job with a W&B environment variable or with the W&B Python SDK

Set the WANDB_JOB_NAME environment variable to your preferred job name. For example:

WANDB_JOB_NAME=awesome-job-name
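
If you set the variable from within Python rather than your shell, a minimal sketch looks like the following; note that the variable must be set before wandb.init is called:

import os

import wandb

# Must be set before wandb.init so the job is created with this name.
os.environ["WANDB_JOB_NAME"] = "awesome-job-name"

with wandb.init() as run:
    run.log({"metric": 0.5})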

Define the name of your job with wandb.Settings. Then pass this object when you initialize W&B with wandb.init. For example:

settings = wandb.Settings(job_name="my-job-name")
wandb.init(settings=settings)

Containerization

Jobs are executed in a container. Image jobs use a pre-built Docker image, while Git and code artifact jobs require a container build step.

Job containerization can be customized with arguments to wandb launch and files within the job source code.

Build context

The term build context refers to the tree of files and directories that are sent to the Docker daemon to build a container image. By default, Launch uses the root of the job source code as the build context. To specify a subdirectory as the build context, use the --build-context argument of wandb launch when creating and launching a job.

Dockerfile

The Dockerfile is a text file that contains instructions for building a Docker image. By default, Launch uses a default Dockerfile that installs the requirements.txt file. To use a custom Dockerfile, specify the path to the file with the --dockerfile argument of wandb launch.

The Dockerfile path is specified relative to the build context. For example, if the build context is jobs/hello_world, and the Dockerfile is located in the jobs/hello_world directory, the --dockerfile argument should be set to Dockerfile.wandb. See the example above for a demonstration of how to use the --dockerfile argument with the official W&B Launch jobs repository.

Requirements file

If no custom Dockerfile is provided, Launch will look in the build context for Python dependencies to install. If a requirements.txt file is found at the root of the build context, Launch will install the dependencies listed in the file. Otherwise, if a pyproject.toml file is found, Launch will install dependencies from the project.dependencies section.

3 - Manage job inputs

The core experience of Launch is easily experimenting with different job inputs like hyperparameters and datasets, and routing these jobs to appropriate hardware. Once a job is created, users beyond the original author can adjust these inputs via the W&B GUI or CLI. For information on how job inputs can be set when launching from the CLI or UI, see the Enqueue jobs guide.

This section describes how to programmatically control the inputs that can be tweaked for a job.

By default, W&B jobs capture the entire Run.config as the inputs to a job, but the Launch SDK provides a function to control select keys in the run config or to specify JSON or YAML files as inputs.

Reconfigure the Run object

By default, the Run object returned by wandb.init in a job can be fully reconfigured. The Launch SDK provides a way to customize which parts of the Run.config object can be reconfigured when launching the job.

import wandb
from wandb.sdk import launch

# Required for launch sdk use.
wandb.require("core")

config = {
    "trainer": {
        "learning_rate": 0.01,
        "batch_size": 32,
        "model": "resnet",
        "dataset": "cifar10",
        "private": {
            "key": "value",
        },
    },
    "seed": 42,
}


with wandb.init(config=config):
    launch.manage_wandb_config(
        include=["trainer"], 
        exclude=["trainer.private"],
    )
    # Etc.

The function launch.manage_wandb_config configures the job to accept input values for the Run.config object. The optional include and exclude options take path prefixes within the nested config object. This can be useful if, for example, a job uses a library whose options you don’t want to expose to end users.

If include prefixes are provided, only paths within the config that match an include prefix will accept input values. If exclude prefixes are provided, paths that match an exclude prefix will be filtered out of the input values. If a path matches both an include and an exclude prefix, the exclude prefix takes precedence.

In the preceding example, the exclude path ["trainer.private"] filters the private key out of the trainer object, and the include path ["trainer"] filters out all keys that are not under the trainer object.

If the code above is packaged and run as a job, the input types of the job will be:

{
    "trainer": {
        "learning_rate": "float",
        "batch_size": "int",
        "model": "str",
        "dataset": "str",
    },
}

When launching the job from the W&B CLI or UI, the user will be able to override only the four trainer parameters.

Access run config inputs

Jobs launched with run config inputs can access the input values through the Run.config. The Run returned by wandb.init in the job code will have the input values automatically set. Use

from wandb.sdk import launch

run_config_overrides = launch.load_wandb_config()

to load the run config input values anywhere in the job code.
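
For example, the following sketch reads an override value before constructing a model. The trainer keys mirror the earlier example, and treating the return value as a plain dict is an assumption:

from wandb.sdk import launch

# Input values supplied when the job was launched.
overrides = launch.load_wandb_config()

# Read a single override; the "trainer" keys here follow the earlier
# example and are placeholders for your own config structure.
learning_rate = overrides.get("trainer", {}).get("learning_rate", 0.01)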

Reconfigure a file

The Launch SDK also provides a way to manage input values stored in config files in the job code. This is a common pattern in many deep learning and large language model use cases, like this torchtune example or this Axolotl config.

The launch.manage_config_file function can be used to add a config file as an input to the Launch job, giving you access to edit values within the config file when launching the job.

By default, no run config inputs will be captured if launch.manage_config_file is used. Calling launch.manage_wandb_config overrides this behavior.

Consider the following example:

import yaml
import wandb
from wandb.sdk import launch

# Required for launch sdk use.
wandb.require("core")

launch.manage_config_file("config.yaml")

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

with wandb.init(config=config):
    # Etc.
    pass

Imagine the code is run with an adjacent file config.yaml:

learning_rate: 0.01
batch_size: 32
model: resnet
dataset: cifar10

The call to launch.manage_config_file will add the config.yaml file as an input to the job, making it reconfigurable when launching from the W&B CLI or UI.

The include and exclude keyword arguments may be used to filter the acceptable input keys for the config file in the same way as launch.manage_wandb_config.
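
For example, a minimal sketch that exposes only two of the keys in config.yaml as launch inputs:

from wandb.sdk import launch

# Only learning_rate and batch_size from config.yaml can be overridden
# when the job is launched; the remaining keys stay fixed.
launch.manage_config_file(
    "config.yaml",
    include=["learning_rate", "batch_size"],
)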

Access config file inputs

When launch.manage_config_file is called in a run created by Launch, launch patches the contents of the config file with the input values. The patched config file is available in the job environment.

Customize a job’s launch drawer UI

Defining a schema for a job’s inputs allows you to create a custom UI for launching the job. To define a job’s schema, include it in the call to launch.manage_wandb_config or launch.manage_config_file. The schema can either be a Python dict in the form of a JSON Schema or a Pydantic model class.

The following example shows a schema with these properties:

  • seed, an integer
  • trainer, a dictionary with some keys specified:
    • trainer.learning_rate, a float that must be greater than zero
    • trainer.batch_size, an integer that must be either 16, 64, or 256
    • trainer.dataset, a string that must be either cifar10 or cifar100

schema = {
    "type": "object",
    "properties": {
        "seed": {
            "type": "integer"
        },
        "trainer": {
            "type": "object",
            "properties": {
                "learning_rate": {
                    "type": "number",
                    "description": "Learning rate of the model",
                    "exclusiveMinimum": 0,
                },
                "batch_size": {
                    "type": "integer",
                    "description": "Number of samples per batch",
                    "enum": [16, 64, 256]
                },
                "dataset": {
                    "type": "string",
                    "description": "Name of the dataset to use",
                    "enum": ["cifar10", "cifar100"]
                }
            }
        }
    }
}

launch.manage_wandb_config(
    include=["seed", "trainer"], 
    exclude=["trainer.private"],
    schema=schema,
)

In general, the following JSON Schema attributes are supported:

| Attribute | Required | Notes |
| --- | --- | --- |
| type | Yes | Must be one of number, integer, string, or object |
| title | No | Overrides the property’s display name |
| description | No | Gives the property helper text |
| enum | No | Creates a dropdown select instead of a freeform text entry |
| minimum | No | Allowed only if type is number or integer |
| maximum | No | Allowed only if type is number or integer |
| exclusiveMinimum | No | Allowed only if type is number or integer |
| exclusiveMaximum | No | Allowed only if type is number or integer |
| properties | No | If type is object, used to define nested configurations |

The following example shows a schema with these properties:

  • seed, an integer
  • trainer, a schema with some sub-attributes specified:
    • trainer.learning_rate, a float that must be greater than zero
    • trainer.batch_size, an integer that must be between 1 and 256, inclusive
    • trainer.dataset, a string that must be either cifar10 or cifar100

from enum import Enum

from pydantic import BaseModel, Field

class DatasetEnum(str, Enum):
    cifar10 = "cifar10"
    cifar100 = "cifar100"

class Trainer(BaseModel):
    learning_rate: float = Field(gt=0, description="Learning rate of the model")
    batch_size: int = Field(ge=1, le=256, description="Number of samples per batch")
    dataset: DatasetEnum = Field(title="Dataset", description="Name of the dataset to use")

class Schema(BaseModel):
    seed: int
    trainer: Trainer

launch.manage_wandb_config(
    include=["seed", "trainer"],
    exclude=["trainer.private"],
    schema=Schema,
)

You can also use an instance of the class:

t = Trainer(learning_rate=0.01, batch_size=32, dataset=DatasetEnum.cifar10)
s = Schema(seed=42, trainer=t)
launch.manage_wandb_config(
    include=["seed", "trainer"],
    exclude=["trainer.private"],
    schema=s,
)

Adding a job input schema will create a structured form in the launch drawer, making it easier to launch the job.
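
The same pattern applies to config file inputs. The following is a hedged sketch that reuses the Trainer model above; it assumes launch.manage_config_file accepts the schema through the same keyword argument as launch.manage_wandb_config:

from wandb.sdk import launch

# Constrain the editable fields of config.yaml with the Trainer model
# defined above; the schema keyword is assumed to mirror manage_wandb_config.
launch.manage_config_file(
    "config.yaml",
    schema=Trainer,
)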

4 - Monitor launch queue

Use the interactive Queue monitoring dashboard to view when a launch queue is in heavy use or idle, visualize workloads that are running, and spot inefficient jobs. The launch queue dashboard is especially useful for deciding whether or not you are effectively using your compute hardware or cloud resources.

For deeper analysis, the page links to the W&B experiment tracking workspace and to external infrastructure monitoring providers like Datadog, NVIDIA Base Command, or cloud consoles.

Dashboard and plots

Use the Monitor tab to view the activity of a queue that occurred during the last seven days. Use the left panel to control time ranges, grouping, and filters.

The dashboard contains a number of plots answering common questions about performance and efficiency. The following sections describe the UI elements of queue dashboards.

Job status

The Job status plot shows how many jobs are running, pending, queued, or completed in each time interval. Use the Job status plot to identify periods of idleness in the queue.

For example, suppose you have a fixed resource (such as DGX BasePod). If you observe an idle queue with the fixed resource, this might suggest an opportunity to run lower-priority pre-emptible launch jobs such as sweeps.

On the other hand, suppose you use a cloud resource and you see periodic bursts of activity. Periodic bursts of activity might suggest an opportunity to save money by reserving resources for particular times.

To the right of the plot is a key that shows which colors represent the status of a launch job.

Queued time

The Queued time plot shows the amount of time (in seconds) that launch jobs were on a queue for a given date or time range.

The x-axis shows a time frame that you specify and the y-axis shows the total time (in seconds) that launch jobs spent on the launch queue. For example, suppose 10 launch jobs are queued on a given day and each waits an average of 60 seconds. The Queued time plot shows 600 seconds for that day.

Customize the color of each job with the Grouping control in the left bar, which can be particularly helpful for identifying which users and jobs are feeling the pain of scarce queue capacity.

Job runs

This plot shows the start and end of every job executed in a time period, with distinct colors for each run. This makes it easy to see at a glance what workloads the queue was processing at a given time.

Use the Select tool in the bottom right of the panel to brush over jobs to populate details in the table below.

CPU and GPU usage

Use the GPU use by a job, CPU use by a job, GPU memory by job, and System memory by job plots to view the efficiency of your launch jobs.

For example, you can use the GPU memory by job plot to see whether a W&B run took a long time to complete while using only a low percentage of its GPU memory.

The x-axis of each plot shows the duration of a W&B run (created by a launch job) in seconds. Hover your mouse over a data point to view information about a W&B run such as the run ID, the project the run belongs to, the launch job that created the W&B run and more.

Errors

The Errors panel shows errors that occurred on a given launch queue. More specifically, the Errors panel shows a timestamp of when the error occurred, the name of the launch job where the error comes from, and the error message that was created. By default, errors are ordered from latest to oldest.

Use the Errors panel to identify and unblock users.

The queue observability dashboard’s view is consistent across all queue types, but in many cases it can be useful to jump directly into environment-specific monitors. To accomplish this, add links to external consoles directly from the queue observability dashboard.

At the bottom of the page, click Manage Links to open a panel. Add the full URL of the page you want. Next, add a label. Links that you add appear in the External Links section.

5 - View launch jobs

The following page describes how to view information about launch jobs added to queues.

View jobs

View jobs added to a queue with the W&B App.

  1. Navigate to the W&B App at https://wandb.ai/home.
  2. Select Launch within the Applications section of the left sidebar.
  3. Select the All entities dropdown and select the entity the launch job belongs to.
  4. Expand the collapsible UI from the Launch Application page to view a list of jobs added to that specific queue.

For example, the following image shows two runs that were created from a job called job-source-launch_demo-canonical. The job was added to a queue called Start queue. The first run listed in the queue is called resilient-snowball and the second run listed is called earthy-energy-165.

Within the W&B App UI you can find additional information about runs created from launch jobs, such as:

  • Run: The name of the W&B run assigned to that job.
  • Job ID: The name of the job.
  • Project: The name of the project the run belongs to.
  • Status: The status of the queued run.
  • Author: The W&B entity that created the run.
  • Creation date: The timestamp when the queue was created.
  • Start time: The timestamp when the job started.
  • Duration: Time, in seconds, it took to complete the job’s run.

List jobs

View a list of jobs that exist within a project with the W&B CLI. Use the wandb job list command and provide the name of the project and entity the launch job belongs to with the --project and --entity flags, respectively.

 wandb job list --entity your-entity --project project-name

Check the status of a job

The following table defines the status a queued run can have:

| Status | Description |
| --- | --- |
| Idle | The run is in a queue with no active agents. |
| Queued | The run is in a queue waiting for an agent to process it. |
| Pending | The run has been picked up by an agent but has not yet started. This could be due to resources being unavailable on the cluster. |
| Running | The run is currently executing. |
| Killed | The job was killed by the user. |
| Crashed | The run stopped sending data or did not successfully start. |
| Failed | The run ended with a non-zero exit code or the run failed to start. |
| Finished | The job completed successfully. |