Create and deploy jobs
- 1: Add job to queue
- 2: Create a launch job
- 3: Manage job inputs
- 4: Monitor launch queue
- 5: View launch jobs
1 - Add job to queue
The following page describes how to add launch jobs to a launch queue.
Add jobs to your queue
Add jobs to your queue interactively with the W&B App or programmatically with the W&B CLI.
Add a job to your queue interactively with the W&B App.
- Navigate to your W&B Project Page.
- Select the Jobs icon on the left panel:
- The Jobs page displays a list of W&B launch jobs that were created from previously executed W&B runs.
- Select the Launch button next to the name of the job. A modal will appear on the right side of the page.
- From the Job version dropdown, select the version of the launch job you want to use. Launch jobs are versioned like any other W&B Artifact. Different versions of the same launch job will be created if you make modifications to the software dependencies or source code used to run the job.
- Within the Overrides section, provide new values for any inputs that are configured for your launch job. Common overrides include a new entrypoint command, arguments, or values in the wandb.config of your new W&B run. You can copy and paste values from other W&B runs that used your launch job by clicking on the Paste from… button.
- From the Queue dropdown, select the name of the launch queue you want to add your launch job to.
- Use the Job Priority dropdown to specify the priority of your launch job. A launch job’s priority is set to “Medium” if the launch queue does not support prioritization.
- (Optional) Follow this step only if a queue config template was created by your team admin. Within the Queue Configurations field, provide values for configuration options that were created by the admin of your team. For example, if a team admin configured the AWS instance types the team can use, team members can pick either the ml.m4.xlarge or ml.p3.xlarge compute instance type to train their model.
- Select the Destination project, where the resulting run will appear. This project needs to belong to the same entity as the queue.
- Select the Launch now button.
Use the wandb launch command to add jobs to a queue. Create a JSON configuration with hyperparameter overrides. For example, using the script from the Quickstart guide, we create a JSON file with the following overrides:
{
"overrides": {
"args": [],
"run_config": {
"learning_rate": 0,
"epochs": 0
},
"entry_point": []
}
}
If you want to override the queue configuration, or if your launch queue does not have a configuration resource defined, you can specify the resource_args key in your config.json file. For example, continuing the example above, your config.json file might look similar to the following:
{
"overrides": {
"args": [],
"run_config": {
"learning_rate": 0,
"epochs": 0
},
"entry_point": []
},
"resource_args": {
"<resource-type>" : {
"<key>": "<value>"
}
}
}
Replace values within the <> with your own values.
Provide the name of the queue for the queue (-q) flag, the name of the job for the job (-j) flag, and the path to the configuration file for the config (-c) flag.
wandb launch -j <job> -q <queue-name> \
-e <entity-name> -c path/to/config.json
If you work within a W&B Team, we suggest you specify the entity flag (-e) to indicate which entity the queue will use.
2 - Create a launch job
Launch jobs are blueprints for reproducing W&B runs. Jobs are W&B Artifacts that capture the source code, dependencies, and inputs required to execute a workload.
Create and run jobs with the wandb launch command.
To create a job without running it, use the wandb job create command. See the command reference docs for more information.
Git jobs
You can create a Git-based job where code and other tracked assets are cloned from a certain commit, branch, or tag in a remote git repository with W&B Launch. Use the --uri or -u flag to specify the URI containing the code, and optionally the --build-context flag to specify a subdirectory.
Run a “hello world” job from a git repository with the following command:
wandb launch --uri "https://github.com/wandb/launch-jobs.git" --build-context jobs/hello_world --dockerfile Dockerfile.wandb --project "hello-world" --job-name "hello-world" --entry-point "python job.py"
The command does the following:
- Clones the W&B Launch jobs repository to a temporary directory.
- Creates a job named hello-world in the hello-world project. The job is associated with the commit at the head of the default branch of the repository.
- Builds a container image from the jobs/hello_world directory and the Dockerfile.wandb file.
- Starts the container and runs python job.py.
To build a job from a specific branch or commit hash, append the -g, --git-hash argument. For a full list of arguments, run wandb launch --help.
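For example, a sketch that pins the hello-world job above to a specific commit. The hash is a placeholder to replace with a real commit from the repository:
wandb launch --uri "https://github.com/wandb/launch-jobs.git" --build-context jobs/hello_world \
  --dockerfile Dockerfile.wandb --project "hello-world" --job-name "hello-world" \
  --entry-point "python job.py" --git-hash <commit-hash>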
Remote URL format
The git remote associated with a Launch job can be either an HTTPS or an SSH URL. The URL type determines the protocol used to fetch job source code.
Remote URL Type | URL Format | Requirements for access and authentication |
---|---|---|
https | https://github.com/organization/repository.git | username and password to authenticate with the git remote |
ssh | git@github.com:organization/repository.git | ssh key to authenticate with the git remote |
Note that the exact URL format varies by hosting provider. Jobs created with wandb launch --uri will use the transfer protocol specified in the provided --uri.
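For example, a sketch of the same hello world job created over SSH rather than HTTPS, assuming your SSH key is authorized for the git remote:
wandb launch --uri "git@github.com:wandb/launch-jobs.git" --build-context jobs/hello_world \
  --dockerfile Dockerfile.wandb --project "hello-world" --job-name "hello-world" \
  --entry-point "python job.py"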
Code artifact jobs
Jobs can be created from any source code stored in a W&B Artifact. Use a local directory with the --uri or -u argument to create a new code artifact and job.
To get started, create an empty directory and add a Python script named main.py with the following content:
import wandb
with wandb.init() as run:
run.log({"metric": 0.5})
Add a file requirements.txt with the following content:
wandb>=0.17.1
Log the directory as a code artifact and launch a job with the following command:
wandb launch --uri . --job-name hello-world-code --project launch-quickstart --entry-point "python main.py"
The preceding command does the following:
- Logs the current directory as a code artifact named hello-world-code.
- Creates a job named hello-world-code in the launch-quickstart project.
- Builds a container image from the current directory and Launch’s default Dockerfile. The default Dockerfile installs the requirements.txt file and sets the entry point to python main.py.
Image jobs
Alternatively, you can build jobs off of pre-made Docker images. This is useful when you already have an established build system for your ML code, or when you don’t expect to adjust the code or requirements for the job but do want to experiment with hyperparameters or different infrastructure scales.
The image is pulled from a Docker registry and run with the specified entry point, or the default entry point if none is specified. Pass a full image tag to the --docker-image option to create and run a job from a Docker image.
To run a simple job from a pre-made image, use the following command:
wandb launch --docker-image "wandb/job_hello_world:main" --project "hello-world"
Automatic job creation
W&B will automatically create and track a job for any run with tracked source code, even if that run was not created with Launch. Runs are considered to have tracked source code if any of the three following conditions are met:
- The run has an associated git remote and commit hash
- The run logged a code artifact (see Run.log_code for more information)
- The run was executed in a Docker container with the WANDB_DOCKER environment variable set to an image tag
The Git remote URL is inferred from the local git repository if your Launch job is created automatically by a W&B run.
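For example, a minimal sketch of the second condition: explicitly logging source code so that W&B can create a job automatically, even for a run started outside Launch. The project name is a placeholder:
import wandb

with wandb.init(project="launch-quickstart") as run:
    # Log the current directory as a code artifact so W&B can
    # automatically create a job from this run.
    run.log_code(".")
    run.log({"metric": 0.5})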
Launch job names
By default, W&B automatically generates a job name for you. The name is generated depending on how the job is created (GitHub, code artifact, or Docker image). Alternatively, you can define a Launch job’s name with environment variables or with the W&B Python SDK.
The following table describes the job naming convention used by default based on job source:
Source | Naming convention |
---|---|
GitHub | job-<git-remote-url>-<path-to-script> |
Code artifact | job-<code-artifact-name> |
Docker image | job-<image-name> |
Name your job with a W&B environment variable or with the W&B Python SDK
Set the WANDB_JOB_NAME environment variable to your preferred job name. For example:
WANDB_JOB_NAME=awesome-job-name
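The variable can also be set inline for a single invocation; a sketch assuming a hypothetical training script:
WANDB_JOB_NAME=awesome-job-name python train.py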
Define the name of your job with wandb.Settings. Then pass this object when you initialize W&B with wandb.init. For example:
import wandb

settings = wandb.Settings(job_name="my-job-name")
wandb.init(settings=settings)
Containerization
Jobs are executed in a container. Image jobs use a pre-built Docker image, while Git and code artifact jobs require a container build step.
Job containerization can be customized with arguments to wandb launch and files within the job source code.
Build context
The term build context refers to the tree of files and directories that are sent to the Docker daemon to build a container image. By default, Launch uses the root of the job source code as the build context. To specify a subdirectory as the build context, use the --build-context argument of wandb launch when creating and launching a job.
The --build-context argument is particularly useful for working with Git jobs that refer to a monorepo with multiple projects. By specifying a subdirectory as the build context, you can build a container image for a specific project within the monorepo.
See the example above for a demonstration of how to use the --build-context argument with the official W&B Launch jobs repository.
Dockerfile
The Dockerfile is a text file that contains instructions for building a Docker image. By default, Launch uses a default Dockerfile that installs the requirements.txt file. To use a custom Dockerfile, specify the path to the file with the --dockerfile argument of wandb launch.
The Dockerfile path is specified relative to the build context. For example, if the build context is jobs/hello_world, and the Dockerfile is located in the jobs/hello_world directory, the --dockerfile argument should be set to Dockerfile.wandb. See the example above for a demonstration of how to use the --dockerfile argument with the official W&B Launch jobs repository.
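For example, a minimal custom Dockerfile sketch that mirrors the default behavior described above. The base image and directory layout are assumptions, not documented defaults:
FROM python:3.10-slim

WORKDIR /app

# Install Python dependencies, then copy in the job source code.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# The command to run inside the container.
ENTRYPOINT ["python", "main.py"]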
Requirements file
If no custom Dockerfile is provided, Launch will look in the build context for Python dependencies to install. If a requirements.txt file is found at the root of the build context, Launch will install the dependencies listed in the file. Otherwise, if a pyproject.toml file is found, Launch will install dependencies from the project.dependencies section.
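For example, a minimal pyproject.toml sketch whose project.dependencies section Launch could install in place of a requirements.txt file. The project name and version are placeholders:
[project]
name = "hello-world"
version = "0.1.0"
dependencies = [
    "wandb>=0.17.1",
]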
3 - Manage job inputs
The core experience of Launch is easily experimenting with different job inputs like hyperparameters and datasets, and routing these jobs to appropriate hardware. Once a job is created, users beyond the original author can adjust these inputs via the W&B GUI or CLI. For information on how job inputs can be set when launching from the CLI or UI, see the Enqueue jobs guide.
This section describes how to programmatically control the inputs that can be tweaked for a job.
By default, W&B jobs capture the entire Run.config as the inputs to a job, but the Launch SDK provides a function to control select keys in the run config or to specify JSON or YAML files as inputs.
Managing job inputs with the Launch SDK requires wandb-core. See the wandb-core README for more information.
Reconfigure the Run object
The Run object returned by wandb.init in a job can be reconfigured by default. The Launch SDK provides a way to customize what parts of the Run.config object can be reconfigured when launching the job.
import wandb
from wandb.sdk import launch
# Required for launch sdk use.
wandb.require("core")
config = {
"trainer": {
"learning_rate": 0.01,
"batch_size": 32,
"model": "resnet",
"dataset": "cifar10",
"private": {
"key": "value",
},
},
"seed": 42,
}
with wandb.init(config=config):
launch.manage_wandb_config(
include=["trainer"],
exclude=["trainer.private"],
)
# Etc.
The function launch.manage_wandb_config configures the job to accept input values for the Run.config object. The optional include and exclude options take path prefixes within the nested config object. This can be useful if, for example, a job uses a library whose options you don’t want to expose to end users.
If include prefixes are provided, only paths within the config that match an include prefix will accept input values. If exclude prefixes are provided, paths that match an exclude prefix will be filtered out of the input values. If a path matches both an include and an exclude prefix, the exclude prefix takes precedence.
In the preceding example, the path ["trainer.private"] will filter out the private key from the trainer object, and the path ["trainer"] will filter out all keys not under the trainer object.
Use a backslash-escaped . to filter out keys with a . in their name. For example, r"trainer\.private" filters out the trainer.private key rather than the private key under the trainer object. Note that the r prefix denotes a raw string.
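For example, a minimal sketch (the key name is hypothetical) that excludes a top-level key literally named trainer.private instead of the nested private key:
from wandb.sdk import launch

launch.manage_wandb_config(
    # The raw string and escaped dot match a single key whose name contains
    # a literal ".", not the "private" key nested under the "trainer" object.
    exclude=[r"trainer\.private"],
)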
If the full example at the start of this section is packaged and run as a job, the input types of the job will be:
{
"trainer": {
"learning_rate": "float",
"batch_size": "int",
"model": "str",
"dataset": "str",
},
}
When launching the job from the W&B CLI or UI, the user will be able to override only the four trainer parameters.
Access run config inputs
Jobs launched with run config inputs can access the input values through Run.config. The Run object returned by wandb.init in the job code will have the input values automatically set. To load the run config input values anywhere in the job code, use:
from wandb.sdk import launch

run_config_overrides = launch.load_wandb_config()
Reconfigure a file
The Launch SDK also provides a way to manage input values stored in config files in the job code. This is a common pattern in many deep learning and large language model use cases, like this torchtune example or this Axolotl config.
The launch.manage_config_file function can be used to add a config file as an input to the Launch job, giving you access to edit values within the config file when launching the job.
By default, no run config inputs will be captured if launch.manage_config_file is used. Calling launch.manage_wandb_config overrides this behavior.
Consider the following example:
import yaml
import wandb
from wandb.sdk import launch
# Required for launch sdk use.
wandb.require("core")
launch.manage_config_file("config.yaml")
with open("config.yaml", "r") as f:
config = yaml.safe_load(f)
with wandb.init(config=config):
# Etc.
pass
Imagine the code is run with an adjacent file config.yaml:
learning_rate: 0.01
batch_size: 32
model: resnet
dataset: cifar10
The call to launch.manage_config_file will add the config.yaml file as an input to the job, making it reconfigurable when launching from the W&B CLI or UI.
The include and exclude keyword arguments may be used to filter the acceptable input keys for the config file in the same way as launch.manage_wandb_config.
Access config file inputs
When launch.manage_config_file is called in a run created by Launch, launch patches the contents of the config file with the input values. The patched config file is available in the job environment.
Call launch.manage_config_file before reading the config file in the job code to ensure input values are used.
Customize a job’s launch drawer UI
Defining a schema for a job’s inputs allows you to create a custom UI for launching the job. To define a job’s schema, include it in the call to launch.manage_wandb_config or launch.manage_config_file. The schema can either be a Python dict in the form of a JSON Schema or a Pydantic model class.
The following example shows a schema with these properties:
- seed, an integer
- trainer, a dictionary with some keys specified:
  - trainer.learning_rate, a float that must be greater than zero
  - trainer.batch_size, an integer that must be either 16, 64, or 256
  - trainer.dataset, a string that must be either cifar10 or cifar100
schema = {
"type": "object",
"properties": {
"seed": {
"type": "integer"
},
"trainer": {
"type": "object",
"properties": {
"learning_rate": {
"type": "number",
"description": "Learning rate of the model",
"exclusiveMinimum": 0,
},
"batch_size": {
"type": "integer",
"description": "Number of samples per batch",
"enum": [16, 64, 256]
},
"dataset": {
"type": "string",
"description": "Name of the dataset to use",
"enum": ["cifar10", "cifar100"]
}
}
}
}
}
launch.manage_wandb_config(
include=["seed", "trainer"],
exclude=["trainer.private"],
schema=schema,
)
In general, the following JSON Schema attributes are supported:
Attribute | Required | Notes |
---|---|---|
type | Yes | Must be one of number, integer, string, or object |
title | No | Overrides the property’s display name |
description | No | Gives the property helper text |
enum | No | Creates a dropdown select instead of a freeform text entry |
minimum | No | Allowed only if type is number or integer |
maximum | No | Allowed only if type is number or integer |
exclusiveMinimum | No | Allowed only if type is number or integer |
exclusiveMaximum | No | Allowed only if type is number or integer |
properties | No | If type is object, used to define nested configurations |
The following example shows a schema with these properties:
- seed, an integer
- trainer, a schema with some sub-attributes specified:
  - trainer.learning_rate, a float that must be greater than zero
  - trainer.batch_size, an integer that must be between 1 and 256, inclusive
  - trainer.dataset, a string that must be either cifar10 or cifar100
from enum import Enum

from pydantic import BaseModel, Field

class DatasetEnum(str, Enum):
cifar10 = "cifar10"
cifar100 = "cifar100"
class Trainer(BaseModel):
learning_rate: float = Field(gt=0, description="Learning rate of the model")
batch_size: int = Field(ge=1, le=256, description="Number of samples per batch")
dataset: DatasetEnum = Field(title="Dataset", description="Name of the dataset to use")
class Schema(BaseModel):
seed: int
trainer: Trainer
launch.manage_wandb_config(
include=["seed", "trainer"],
exclude=["trainer.private"],
schema=Schema,
)
You can also use an instance of the class:
t = Trainer(learning_rate=0.01, batch_size=32, dataset=DatasetEnum.cifar10)
s = Schema(seed=42, trainer=t)
launch.manage_wandb_config(
include=["seed", "trainer"],
exclude=["trainer.private"],
schema=s,
)
Adding a job input schema will create a structured form in the launch drawer, making it easier to launch the job.
4 - Monitor launch queue
Use the interactive Queue monitoring dashboard to view when a launch queue is in heavy use or idle, visualize workloads that are running, and spot inefficient jobs. The launch queue dashboard is especially useful for deciding whether or not you are effectively using your compute hardware or cloud resources.
For deeper analysis, the page links to the W&B experiment tracking workspace and to external infrastructure monitoring providers like Datadog, NVIDIA Base Command, or cloud consoles.
Dashboard and plots
Use the Monitor tab to view the activity of a queue that occurred during the last seven days. Use the left panel to control time ranges, grouping, and filters.
The dashboard contains a number of plots answering common questions about performance and efficiency. The following sections describe the UI elements of the queue dashboard.
Job status
The Job status plot shows how many jobs are running, pending, queued, or completed in each time interval. Use the Job status plot for identifying periods of idleness in the queue.
For example, suppose you have a fixed resource (such as DGX BasePod). If you observe an idle queue with the fixed resource, this might suggest an opportunity to run lower-priority pre-emptible launch jobs such as sweeps.
On the other hand, suppose you use a cloud resource and you see periodic bursts of activity. Periodic bursts of activity might suggest an opportunity to save money by reserving resources for particular times.
To the right of the plot is a key that shows which colors represent the status of a launch job.
Queued items might indicate opportunities to shift workloads to other queues. A spike in failures can identify users who might need help with their launch job setup.
Queued time
The Queued time plot shows the amount of time (in seconds) that a launch job was on a queue for a given date or time range.
The x-axis shows a time frame that you specify and the y-axis shows the time (in seconds) a launch job was on a launch queue. For example, suppose on a given day there are 10 launch jobs queued. The Queued time plot shows 600 seconds if those 10 launch jobs wait an average of 60 seconds each.
Customize the color of each job with the Grouping control in the left bar, which can be particularly helpful for identifying which users and jobs are feeling the pain of scarce queue capacity.
Job runs
This plot shows the start and end of every job executed in a time period, with distinct colors for each run. This makes it easy to see at a glance what workloads the queue was processing at a given time.
Use the Select tool in the bottom right of the panel to brush over jobs to populate details in the table below.
CPU and GPU usage
Use the GPU use by a job, CPU use by a job, GPU memory by job, and System memory by job plots to view the efficiency of your launch jobs.
For example, you can use the GPU memory by job plot to see whether a W&B run took a long time to complete and whether it used a low percentage of its CPU cores.
The x-axis of each plot shows the duration of a W&B run (created by a launch job) in seconds. Hover your mouse over a data point to view information about a W&B run such as the run ID, the project the run belongs to, the launch job that created the W&B run and more.
Errors
The Errors panel shows errors that occurred on a given launch queue. More specifically, the Errors panel shows a timestamp of when the error occurred, the name of the launch job where the error comes from, and the error message that was created. By default, errors are ordered from latest to oldest.
Use the Errors panel to identify and unblock users.
External links
The queue observability dashboard’s view is consistent across all queue types, but in many cases, it can be useful to jump directly into environment-specific monitors. To accomplish this, add links to external consoles directly from the queue observability dashboard.
At the bottom of the page, click Manage Links to open a panel. Add the full URL of the page you want. Next, add a label. Links that you add appear in the External Links section.
5 - View launch jobs
The following page describes how to view information about launch jobs added to queues.
View jobs
View jobs added to a queue with the W&B App.
- Navigate to the W&B App at https://wandb.ai/home.
- Select Launch within the Applications section of the left sidebar.
- Select the All entities dropdown and select the entity the launch job belongs to.
- Expand the collapsible UI from the Launch Application page to view a list of jobs added to that specific queue.
For example, the following image shows two runs that were created from a job called job-source-launch_demo-canonical. The job was added to a queue called Start queue. The first run listed in the queue is called resilient-snowball and the second run listed is called earthy-energy-165.
Within the W&B App UI you can find additional information about runs created from launch jobs such as the:
- Run: The name of the W&B run assigned to that job.
- Job ID: The name of the job.
- Project: The name of the project the run belongs to.
- Status: The status of the queued run.
- Author: The W&B entity that created the run.
- Creation date: The timestamp when the queue was created.
- Start time: The timestamp when the job started.
- Duration: Time, in seconds, it took to complete the job’s run.
List jobs
View a list of jobs that exist within a project with the W&B CLI. Use the W&B job list command and provide the name of the project and entity the launch job belongs to with the --project and --entity flags, respectively.
wandb job list --entity your-entity --project project-name
Check the status of a job
The following table defines the status a queued run can have:
Status | Description |
---|---|
Idle | The run is in a queue with no active agents. |
Queued | The run is in a queue waiting for an agent to process it. |
Pending | The run has been picked up by an agent but has not yet started. This could be due to resources being unavailable on the cluster. |
Running | The run is currently executing. |
Killed | The job was killed by the user. |
Crashed | The run stopped sending data or did not successfully start. |
Failed | The run ended with a non-zero exit code or the run failed to start. |
Finished | The job completed successfully. |