What are runs?

Learn about the basic building block of W&B, Runs.

A run is a single unit of computation logged by W&B. You can think of a W&B run as an atomic element of your whole project. In other words, each run is a record of a specific computation, such as training a model and logging the results, a trial in a hyperparameter sweep, and so forth.

Common patterns for initiating a run include, but are not limited to:

  • Training a model
  • Changing a hyperparameter and conducting a new experiment
  • Conducting a new machine learning experiment with a different model
  • Logging data or a model as a W&B Artifact
  • Downloading a W&B Artifact

W&B stores runs that you create into projects. You can view runs and their properties within the run’s project workspace on the W&B App UI. You can also programmatically access run properties with the wandb.Api.Run object.

Anything you log with run.log is recorded in that run. Consider the following code snippet.

import wandb

run = wandb.init(entity="nico", project="awesome-project")
run.log({"accuracy": 0.9, "loss": 0.1})

The first line imports the W&B Python SDK. The second line initializes a run in the project awesome-project under the entity nico. The third line logs the accuracy and loss of the model to that run.

Within the terminal, W&B returns:

wandb: Syncing run earnest-sunset-1
wandb: ⭐️ View project at https://wandb.ai/nico/awesome-project
wandb: 🚀 View run at https://wandb.ai/nico/awesome-project/runs/1jx1ud12
wandb:                                                                                
wandb: 
wandb: Run history:
wandb: accuracy ▁
wandb:     loss ▁
wandb: 
wandb: Run summary:
wandb: accuracy 0.9
wandb:     loss 0.1
wandb: 
wandb: 🚀 View run earnest-sunset-1 at: https://wandb.ai/nico/awesome-project/runs/1jx1ud12
wandb: ⭐️ View project at: https://wandb.ai/nico/awesome-project
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20241105_111006-1jx1ud12/logs

The URL W&B returns in the terminal redirects you to the run’s workspace in the W&B App UI. Note that the panels generated in the workspace correspond to the single logged point.

Logging metrics at a single point in time might not be that useful. A more realistic example, in the case of training discriminative models, is to log metrics at regular intervals. For example, consider the following code snippet:

import random
import wandb

epochs = 10
lr = 0.01

run = wandb.init(
    entity="nico",
    project="awesome-project",
    config={
        "learning_rate": lr,
        "epochs": epochs,
    },
)

offset = random.random() / 5

# simulating a training run
for epoch in range(epochs):
    acc = 1 - 2**-epoch - random.random() / (epoch + 1) - offset
    loss = 2**-epoch + random.random() / (epoch + 1) + offset
    print(f"epoch={epoch}, accuracy={acc}, loss={loss}")
    run.log({"accuracy": acc, "loss": loss})

This returns the following output:

wandb: Syncing run jolly-haze-4
wandb: ⭐️ View project at https://wandb.ai/nico/awesome-project
wandb: 🚀 View run at https://wandb.ai/nico/awesome-project/runs/pdo5110r
lr: 0.01
epoch=0, accuracy=-0.10070974957523078, loss=1.985328507123956
epoch=1, accuracy=0.2884687745057535, loss=0.7374362314407752
epoch=2, accuracy=0.7347387967382066, loss=0.4402409835486663
epoch=3, accuracy=0.7667969248039795, loss=0.26176963846423457
epoch=4, accuracy=0.7446848791003173, loss=0.24808611724405083
epoch=5, accuracy=0.8035095836268268, loss=0.16169791827329466
epoch=6, accuracy=0.861349032371624, loss=0.03432578493587426
epoch=7, accuracy=0.8794926436276016, loss=0.10331872172219471
epoch=8, accuracy=0.9424839917077272, loss=0.07767793473500445
epoch=9, accuracy=0.9584880427028566, loss=0.10531971149250456
wandb: 🚀 View run jolly-haze-4 at: https://wandb.ai/nico/awesome-project/runs/pdo5110r
wandb: Find logs at: wandb/run-20241105_111816-pdo5110r/logs

The training script calls run.log 10 times. Each time the script calls run.log, W&B logs the accuracy and loss for that epoch. Selecting the URL that W&B prints in the preceding output directs you to the run’s workspace in the W&B App UI.

Note that W&B captures the simulated training loop within a single run called jolly-haze-4. This is because the script calls the wandb.init method only once.

As another example, during a sweep, W&B explores a hyperparameter search space that you specify. W&B implements each new hyperparameter combination that the sweep creates as a unique run.

Initialize a run

Initialize a W&B run with wandb.init(). The following code snippet shows how to import the W&B Python SDK and initialize a run.

Replace values enclosed in angle brackets (< >) with your own values:

import wandb

run = wandb.init(entity="<entity>", project="<project>")

When you initialize a run, W&B logs your run to the project you specify for the project field (wandb.init(project="<project>")). W&B creates a new project if the project does not already exist. If the project already exists, W&B stores the run in that project.

Each run in W&B has a unique identifier known as a run ID. You can specify a unique ID or let W&B randomly generate one for you.

Each run also has a human-readable, non-unique identifier known as a run name. You can specify a name for your run or let W&B randomly generate one for you.
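To make the two identifiers concrete, here is a purely illustrative sketch of how a random 8-character run ID and an adjective-noun-number run name could be generated. This is not W&B's actual implementation; the word lists and the name scheme are assumptions based on the example names shown in this section (such as earnest-sunset-1).

```python
import random
import string


def generate_run_id(length=8):
    """Sketch of a random run ID: lowercase letters and digits."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def generate_run_name(counter):
    """Sketch of a human-readable, non-unique run name like 'earnest-sunset-1'."""
    adjectives = ["earnest", "jolly", "exalted", "legendary"]
    nouns = ["sunset", "haze", "darkness", "meadow"]
    return f"{random.choice(adjectives)}-{random.choice(nouns)}-{counter}"


run_id = generate_run_id()
run_name = generate_run_name(1)
```

The key distinction to take away: the ID must be unique within a project, while the name is only a readable label and may repeat.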

For example, consider the following code snippet:

import wandb

run = wandb.init(entity="nico", project="awesome-project")

The code snippet produces the following output:

🚀 View run exalted-darkness-6 at: 
https://wandb.ai/nico/awesome-project/runs/pgbn9y21
Find logs at: wandb/run-20241106_090747-pgbn9y21/logs

Since the preceding code did not specify an argument for the id parameter, W&B creates a unique run ID. In this output, nico is the entity that logged the run, awesome-project is the name of the project the run is logged to, exalted-darkness-6 is the name of the run, and pgbn9y21 is the run ID.

Each run has a state that describes the current status of the run. See Run states for a full list of possible run states.

Run states

The following table describes the possible states a run can be in:

State Description
Finished The run ended and fully synced its data, or wandb.finish() was called.
Failed The run ended with a non-zero exit status.
Crashed The run stopped sending heartbeats in the internal process, which can happen if the machine crashes.
Running The run is still running and has recently sent a heartbeat.

Unique run identifiers

Run IDs are unique identifiers for runs. By default, W&B generates a random and unique run ID for you when you initialize a new run. You can also specify your own unique run ID when you initialize a run.

Autogenerated run IDs

If you do not specify a run ID when you initialize a run, W&B generates a random run ID for you. You can find the unique ID of a run in the W&B App UI.

  1. Navigate to the W&B App UI at https://wandb.ai/home.
  2. Navigate to the W&B project you specified when you initialized the run.
  3. Within your project’s workspace, select the Runs tab.
  4. Select the Overview tab.

W&B displays the unique run ID in the Run path field. The run path consists of the name of your team, the name of the project, and the run ID. The unique ID is the last part of the run path.
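Because the run path always has three slash-separated parts, it can be split back into its components with a one-line helper. This is an illustrative sketch (the function name and return shape are not part of the W&B SDK):

```python
def parse_run_path(run_path):
    """Split an 'entity/project/run-id' run path into its three components."""
    entity, project, run_id = run_path.split("/")
    return {"entity": entity, "project": project, "run_id": run_id}


parts = parse_run_path("nico/awesome-project/1jx1ud12")
```

The last component, parts["run_id"], is the unique run ID described above.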

For example, in the run path nico/awesome-project/9mxi1arc, the unique run ID is 9mxi1arc.

Custom run IDs

You can specify your own run ID by passing the id parameter to the wandb.init method.

import wandb

run = wandb.init(entity="<entity>", project="<project>", id="<run-id>")

You can use a run’s unique ID to navigate directly to the run’s overview page in the W&B App UI. The following line shows the URL path for a specific run:

https://wandb.ai/<entity>/<project>/runs/<run-id>

Where values enclosed in angle brackets (< >) are placeholders for the actual values of the entity, project, and run ID.

Name your run

The name of a run is a human-readable, non-unique identifier.

By default, W&B generates a random run name when you initialize a new run. The name of a run appears within your project’s workspace and at the top of the run’s overview page.

You can specify a name for your run by passing the name parameter to the wandb.init method.

import wandb

run = wandb.init(entity="<entity>", project="<project>", name="<run-name>")

Add a note to a run

Notes that you add to a specific run appear on the run page in the Overview tab and in the table of runs on the project page.

  1. Navigate to your W&B project
  2. Select the Workspace tab from the project sidebar
  3. Select the run you want to add a note to from the run selector
  4. Choose the Overview tab
  5. Select the pencil icon next to the Description field and add your notes

Stop a run

Stop a run from the W&B App or programmatically.

  1. Navigate to the terminal or code editor where you initialized the run.
  2. Press Ctrl+C to stop the run.

For example, following the preceding instructions, your terminal might look similar to the following:

KeyboardInterrupt
wandb: 🚀 View run legendary-meadow-2 at: https://wandb.ai/nico/history-blaster-4/runs/o8sdbztv
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20241106_095857-o8sdbztv/logs

Navigate to the W&B App UI to confirm the run is no longer active:

  1. Navigate to the project that your run was logging to.
  2. Select the name of the run.
  3. Choose the Overview tab from the project sidebar.

Next to the State field, the run’s state changes from running to Killed.

  1. Navigate to the project that your run is logging to.
  2. Select the run you want to stop within the run selector.
  3. Choose the Overview tab from the project sidebar.
  4. Select the top button next to the State field.

Next to the State field, the run’s state changes from running to Killed.

See Run states for a full list of possible run states.

View logged runs

View information about a specific run, such as the state of the run, artifacts logged to the run, log files recorded during the run, and more.

To view a specific run:

  1. Navigate to the W&B App UI at https://wandb.ai/home.

  2. Navigate to the W&B project you specified when you initialized the run.

  3. Within the project sidebar, select the Workspace tab.

  4. Within the run selector, click the run you want to view, or enter a partial run name to filter for matching runs.

    By default, long run names are truncated in the middle for readability. To truncate run names at the beginning or end instead, click the action ... menu at the top of the list of runs, then set Run name cropping to crop the end, middle, or beginning.

Note that the URL path of a specific run has the following format:

https://wandb.ai/<team-name>/<project-name>/runs/<run-id>

Where values enclosed in angle brackets (< >) are placeholders for the actual values of the team name, project name, and run ID.
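Given those three values, assembling the URL is a straightforward string operation. A minimal helper, using illustrative names (only the URL shape above comes from the documentation):

```python
def run_url(team, project, run_id, base="https://wandb.ai"):
    """Assemble the W&B App URL for a run from its components."""
    return f"{base}/{team}/{project}/runs/{run_id}"


url = run_url("nico", "awesome-project", "1jx1ud12")
```

This can be handy for printing clickable links from your own tooling, for example in CI logs.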

Overview tab

Use the Overview tab to learn about specific run information in a project, such as:

  • Author: The W&B entity that creates the run.
  • Command: The command that initializes the run.
  • Description: A description of the run that you provided. This field is empty if you do not specify a description when you create the run. You can add a description to a run with the W&B App UI or programmatically with the Python SDK.
  • Duration: The amount of time the run is actively computing or logging data, excluding any pauses or waiting.
  • Git repository: The git repository associated with the run. You must enable git to view this field.
  • Host name: Where W&B computes the run. W&B displays the name of your machine if you initialize the run locally on your machine.
  • Name: The name of the run.
  • OS: Operating system that initializes the run.
  • Python executable: The command that starts the run.
  • Python version: Specifies the Python version that creates the run.
  • Run path: Identifies the unique run identifier in the form entity/project/run-ID.
  • Runtime: Measures the total time from the start to the end of the run. It’s the wall-clock time for the run. Runtime includes any time where the run is paused or waiting for resources, while duration does not.
  • Start time: The timestamp when you initialize the run.
  • State: The state of the run.
  • System hardware: The hardware W&B uses to compute the run.
  • Tags: A list of strings. Tags are useful for organizing related runs together or applying temporary labels like baseline or production.
  • W&B CLI version: The W&B CLI version installed on the machine that hosted the run command.

W&B stores the following information below the overview section:

  • Artifact Outputs: Artifact outputs produced by the run.
  • Config: List of config parameters saved with wandb.config.
  • Summary: List of summary parameters saved with wandb.log(). By default, W&B sets this value to the last value logged.
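The "last value logged" default for summary values can be simulated with plain Python, without the SDK. A sketch of the semantics only:

```python
# Simulated history: each dict is one run.log() call, in order.
logged = [{"accuracy": 0.6}, {"accuracy": 0.75}, {"accuracy": 0.9}]

# By default, the summary holds the most recent value logged for each key.
summary = {}
for row in logged:
    summary.update(row)
```

After three logged values for accuracy, the summary reflects only the final one, 0.9.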

View an example project overview here.

Workspace tab

Use the Workspace tab to view, search, group, and arrange visualizations such as autogenerated and custom plots, system metrics, and more.

View an example project workspace here

System tab

The System tab shows system metrics tracked for a specific run such as CPU utilization, system memory, disk I/O, network traffic, GPU utilization and more.

For a full list of system metrics W&B tracks, see System metrics.

View an example system tab here.

Logs tab

The Log tab shows output printed on the command line such as the standard output (stdout) and standard error (stderr).

Choose the Download button in the upper right hand corner to download the log file.

View an example logs tab here.

Files tab

Use the Files tab to view files associated with a specific run, such as model checkpoints, validation set examples, and more.

View an example files tab here.

Artifacts tab

The Artifacts tab lists the input and output artifacts for the specified run.

View an example artifacts tab here.

Delete runs

Delete one or more runs from a project with the W&B App.

  1. Navigate to the project that contains the runs you want to delete.
  2. Select the Runs tab from the project sidebar.
  3. Select the checkbox next to the runs you want to delete.
  4. Choose the Delete button (trash can icon) above the table.
  5. From the modal that appears, choose Delete.

1 - Add labels to runs with tags

Add tags to label runs with particular features that might not be obvious from the logged metrics or artifact data.

For example, you can add a tag to a run to indicate that the run’s model is in_production, that the run is preemptible, that the run represents the baseline, and so forth.

Add tags to one or more runs

Programmatically or interactively add tags to your runs.

Based on your use case, select the tab below that best fits your needs:

You can add tags to a run when it is created:

import wandb

run = wandb.init(
    entity="<entity>",
    project="<project-name>",
    tags=["tag1", "tag2"],
)

You can also update the tags after you initialize a run. For example, the following code snippet shows how to add a tag if a particular metric crosses a pre-defined threshold:

import wandb

run = wandb.init(
    entity="<entity>",
    project="capsules",
    tags=["debug"],
)

# python logic to train model

if current_loss < threshold:
    run.tags = run.tags + ("release_candidate",)

After you create a run, you can update tags using the Public API. For example:

run = wandb.Api().run("<entity>/<project>/<run-id>")
run.tags.append("tag1")  # you can choose tags based on run data here
run.update()

This method is best suited to tagging large numbers of runs with the same tag or tags.

  1. Navigate to your project workspace.
  2. Select Runs from the project sidebar.
  3. Select one or more runs from the table.
  4. Once you select one or more runs, select the Tag button above the table.
  5. Type the tag you want to add and select the Create new tag checkbox to add the tag.

This method is best suited to applying a tag or tags to a single run manually.

  1. Navigate to your project workspace.
  2. Select a run from the list of runs within your project’s workspace.
  3. Select Overview from the project sidebar.
  4. Select the gray plus icon (+) button next to Tags.
  5. Type a tag you want to add and select Add below the text box to add a new tag.

Remove tags from one or more runs

Tags can also be removed from runs with the W&B App UI.

This method is best suited to removing tags from a large number of runs.

  1. In the Run sidebar of the project, select the table icon in the upper-right. This will expand the sidebar into the full runs table.
  2. Hover over a run in the table to see a checkbox on the left or look in the header row for a checkbox to select all runs.
  3. Select the checkbox to enable bulk actions.
  4. Select the runs you want to remove tags from.
  5. Select the Tag button above the rows of runs.
  6. Select the checkbox next to a tag to remove it from the run.

This method is best suited to removing a tag or tags from a single run manually.

  1. In the left sidebar of the Run page, select the top Overview tab. The tags on the run are visible here.
  2. Hover over a tag and select the “x” to remove it from the run.

2 - Filter and search runs

How to use the sidebar and table on the project page

Use your project page to gain insights from runs logged to W&B.

Filter runs

Filter runs based on their status, tags, or other properties with the filter button.

Filter runs with tags

Filter runs based on their tags with the filter button.

Filter runs with regex

If regex doesn’t provide the desired results, you can use tags to filter runs in the Runs Table. Tags can be added either on run creation or after runs have finished. Once tags are added to a run, you can add a tag filter.

Search run names

Use regex to find runs whose names match a pattern you specify. When you type a query in the search box, it filters the visible runs in the workspace graphs as well as the rows of the table.
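The matching behavior can be sketched locally with Python's re module. The run names below are invented for illustration; this is not the App's implementation, only the idea of regex-based filtering:

```python
import re

run_names = ["earnest-sunset-1", "jolly-haze-4", "exalted-darkness-6", "baseline-2"]


def filter_runs(names, pattern):
    """Keep only the run names matching the regex, like the workspace search box."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


matches = filter_runs(run_names, r"haze|darkness")
```

A pattern like `^baseline` would instead match only names that start with "baseline".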

Sort runs by minimum and maximum values

Sort the runs table by the minimum or maximum value of a logged metric. This is particularly useful if you want to view the best (or worst) recorded value.

The following steps describe how to sort the run table by a specific metric based on the minimum or maximum recorded value:

  1. Hover your mouse over the column with the metric you want to sort with.
  2. Select the kebab menu (three vertical dots).
  3. From the dropdown, select either Show min or Show max.
  4. From the same dropdown, select Sort by asc or Sort by desc to sort in ascending or descending order, respectively.
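The sort the UI performs can be sketched over a local list of runs. The data here is invented; the point is how "Show min"/"Show max" combine with ascending or descending order:

```python
runs = [
    {"name": "run-a", "loss": [0.9, 0.4, 0.2]},
    {"name": "run-b", "loss": [0.8, 0.3, 0.5]},
    {"name": "run-c", "loss": [1.1, 0.7, 0.6]},
]

# "Show min" + "Sort by asc": order runs by their best (lowest) recorded loss.
by_min_loss = sorted(runs, key=lambda r: min(r["loss"]))

# "Show max" + "Sort by desc": order runs by their worst (highest) recorded loss.
by_max_loss = sorted(runs, key=lambda r: max(r["loss"]), reverse=True)
```

With a loss metric, sorting ascending by minimum surfaces the best run first; for a metric like accuracy you would typically sort descending by maximum instead.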

Search End Time for runs

We provide a column named End Time that logs the last heartbeat from the client process. The field is hidden by default.

Export runs table to CSV

Export the table of all your runs, hyperparameters, and summary metrics to a CSV with the download button.

3 - Fork a run

Forking a W&B run

Use fork_from when you initialize a run with wandb.init() to “fork” from an existing W&B run. When you fork from a run, W&B creates a new run using the run ID and step of the source run.

Forking a run enables you to explore different parameters or models from a specific point in an experiment without impacting the original run.
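As the examples in this section show, the fork_from value is a single string combining the source run ID and the step. A small helper to build and parse that format; the helper names are illustrative, and only the "<run-id>?_step=<step>" shape is taken from the examples here:

```python
def fork_spec(run_id, step):
    """Build a fork_from value like 'abc123?_step=200'."""
    return f"{run_id}?_step={step}"


def parse_fork_spec(spec):
    """Split a fork_from value back into (run_id, step)."""
    run_id, _, step = spec.partition("?_step=")
    return run_id, int(step)


spec = fork_spec("abc123", 200)
```

Keeping the spec in one string makes it easy to pass through configs or CLI arguments to the process that calls wandb.init().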

Start a forked run

To fork a run, use the fork_from argument in wandb.init() and specify the source run ID and the step from the source run to fork from:

import wandb

# Initialize a run to be forked later
original_run = wandb.init(project="your_project_name", entity="your_entity_name")
# ... perform training or logging ...
original_run.finish()

# Fork the run from a specific step
forked_run = wandb.init(
    project="your_project_name",
    entity="your_entity_name",
    fork_from=f"{original_run.id}?_step=200",
)

Using an immutable run ID

Use an immutable run ID to ensure you have a consistent and unchanging reference to a specific run. Follow these steps to obtain the immutable run ID from the user interface:

  1. Access the Overview Tab: Navigate to the Overview tab on the source run’s page.

  2. Copy the Immutable Run ID: Click on the ... menu (three dots) located in the top-right corner of the Overview tab. Select the Copy Immutable Run ID option from the dropdown menu.

By following these steps, you will have a stable and unchanging reference to the run, which can be used for forking a run.

Continue from a forked run

After initializing a forked run, you can continue logging to the new run. You can log the same metrics for continuity and introduce new metrics.

For example, the following code example shows how to first fork a run and then how to log metrics to the forked run starting from a training step of 200:

import wandb
import math

# Initialize the first run and log some metrics
run1 = wandb.init(project="your_project_name", entity="your_entity_name")
for i in range(300):
    run1.log({"metric": i})
run1.finish()

# Fork from the first run at a specific step and log the metric starting from step 200
run2 = wandb.init(
    project="your_project_name",
    entity="your_entity_name",
    fork_from=f"{run1.id}?_step=200",
)

# Continue logging in the new run
# For the first few steps, log the metric as is from run1
# After step 250, start logging the spikey pattern
for i in range(200, 300):
    if i < 250:
        run2.log({"metric": i})  # Continue logging from run1 without spikes
    else:
        # Introduce the spikey behavior starting from step 250
        subtle_spike = i + (2 * math.sin(i / 3.0))  # Apply a subtle spikey pattern
        run2.log({"metric": subtle_spike})
    # Additionally log the new metric at all steps
    run2.log({"additional_metric": i * 1.1})
run2.finish()

4 - Group runs into experiments

Group training and evaluation runs into larger experiments

Group individual jobs into experiments by passing a unique group name to wandb.init().

Use cases

  1. Distributed training: Use grouping if your experiments are split up into different pieces with separate training and evaluation scripts that should be viewed as parts of a larger whole.
  2. Multiple processes: Group multiple smaller processes together into an experiment.
  3. K-fold cross-validation: Group together runs with different random seeds to see a larger experiment. Here’s an example of k-fold cross-validation with sweeps and grouping.

There are three ways to set grouping:

1. Set group in your script

Pass an optional group and job_type to wandb.init(). This gives you a dedicated group page for each experiment, which contains the individual runs. For example: wandb.init(group="experiment_1", job_type="eval")

2. Set a group environment variable

Use the WANDB_RUN_GROUP environment variable to specify a group for your runs. For more on this, check our docs for Environment Variables. The group should be unique within your project and shared by all runs in the group. You can use wandb.util.generate_id() to generate a unique 8-character string to use in all your processes. For example, os.environ["WANDB_RUN_GROUP"] = "experiment-" + wandb.util.generate_id()
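Putting those pieces together, the following sketch sets the environment variable before any run starts. It uses a local stand-in for wandb.util.generate_id() so the example runs without the wandb package; the stand-in's alphabet is an assumption:

```python
import os
import random
import string


def generate_id(length=8):
    """Stand-in for wandb.util.generate_id(): a random 8-character string."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


# Every process that sets this variable before calling wandb.init()
# joins the same group.
os.environ["WANDB_RUN_GROUP"] = "experiment-" + generate_id()
```

In a distributed setup, generate the ID once in the launcher and export it to every worker so that all processes share the same group.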

3. Toggle grouping in the UI

You can dynamically group by any config column. For example, if you use wandb.config to log batch size or learning rate, you can then group by those hyperparameters dynamically in the web app.

Distributed training with grouping

If you set grouping in wandb.init(), W&B groups runs by default in the UI. You can toggle this on and off by clicking the Group button at the top of the table. Here’s an example project generated from sample code where we set grouping. You can click each Group row in the sidebar to get to a dedicated group page for that experiment.

From the project page above, you can click a Group in the left sidebar to get to a dedicated page like this one:

Grouping dynamically in the UI

You can group runs by any column, for example by hyperparameter. Here’s an example of what that looks like:

  • Sidebar: Runs are grouped by the number of epochs.
  • Graphs: Each line represents the group’s mean, and the shading indicates the variance. This behavior can be changed in the graph settings.

Turn off grouping

Click the grouping button and clear the group fields at any time to return the table and graphs to their ungrouped state.

Grouping graph settings

Click the edit button in the upper right corner of a graph and select the Advanced tab to change the line and shading. You can select the mean, minimum, or maximum value for the line in each group. For the shading, you can turn shading off, or show the min and max, the standard deviation, or the standard error.

5 - Move runs

Move runs between your projects or to a team you are a member of.

Move runs between your projects

To move runs from one project to another:

  1. Navigate to the project that contains the runs you want to move.
  2. Select the Runs tab from the project sidebar.
  3. Select the checkbox next to the runs you want to move.
  4. Choose the Move button above the table.
  5. Select the destination project from the dropdown.

Move runs to a team

Move runs to a team you are a member of:

  1. Navigate to the project that contains the runs you want to move.
  2. Select the Runs tab from the project sidebar.
  3. Select the checkbox next to the runs you want to move.
  4. Choose the Move button above the table.
  5. Select the destination team and project from the dropdown.

6 - Resume a run

Resume a paused or exited W&B Run

Specify how a run should behave in the event that it stops or crashes. To resume a run or enable it to automatically resume, specify the unique run ID associated with that run for the id parameter:

run = wandb.init(
    entity="<entity>", project="<project>", id="<run ID>", resume="<resume>"
)

Pass one of the following arguments to the resume parameter to determine how W&B should respond. In each case, W&B first checks if the run ID already exists.

Argument Description Run ID exists Run ID does not exist Use case
"must" W&B must resume the run specified by the run ID. W&B resumes the run with the same run ID. W&B raises an error. Resume a run that must use the same run ID.
"allow" Allow W&B to resume the run if the run ID exists. W&B resumes the run with the same run ID. W&B initializes a new run with the specified run ID. Resume a run without overriding an existing run.
"never" Never allow W&B to resume a run specified by the run ID. W&B raises an error. W&B initializes a new run with the specified run ID.
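The table can be condensed into a small decision function. This is a sketch of the documented behavior, not W&B's implementation:

```python
def resume_behavior(mode, run_id_exists):
    """Return what W&B does for a given resume mode, per the table above."""
    if mode == "must":
        return "resume existing run" if run_id_exists else "error"
    if mode == "allow":
        return "resume existing run" if run_id_exists else "new run with that ID"
    if mode == "never":
        return "error" if run_id_exists else "new run with that ID"
    raise ValueError(f"unknown resume mode: {mode}")
```

For example, resume_behavior("must", False) returns "error", matching the "must" row of the table.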

You can also specify resume="auto" to let W&B automatically try to restart the run on your behalf. However, you need to restart your run from the same directory. See the Enable runs to automatically resume section for more information.

For all the examples below, replace values enclosed within <> with your own.

Resume a run that must use the same run ID

If a run is stopped, crashes, or fails, you can resume it using the same run ID. To do so, initialize a run and specify the following:

  • Set the resume parameter to "must" (resume="must")
  • Provide the run ID of the run that stopped or crashed

The following code snippet shows how to accomplish this with the W&B Python SDK:

run = wandb.init(
    entity="<entity>", project="<project>", id="<run ID>", resume="must"
)

Resume a run without overriding the existing run

Resume a run that stopped or crashed without overriding the existing run. This is especially helpful if your process doesn’t exit successfully. The next time you start W&B, W&B will start logging from the last step.

Set the resume parameter to "allow" (resume="allow") when you initialize a run with W&B. Provide the run ID of the run that stopped or crashed. The following code snippet shows how to accomplish this with the W&B Python SDK:

import wandb

run = wandb.init(
    entity="<entity>", project="<project>", id="<run ID>", resume="allow"
)

Enable runs to automatically resume

The following code snippet shows how to enable runs to automatically resume with the Python SDK or with environment variables.

The following code snippet shows how to specify a W&B run ID with the Python SDK.

Replace values enclosed within <> with your own:

run = wandb.init(
    entity="<entity>", project="<project>", id="<run ID>", resume="<resume>"
)

The following example shows how to specify the W&B WANDB_RUN_ID variable in a bash script:

RUN_ID="$1"

WANDB_RESUME=allow WANDB_RUN_ID="$RUN_ID" python eval.py

Within your terminal, you could run the shell script along with the W&B run ID. The following code snippet passes the run ID akj172:

sh run_experiment.sh akj172

For example, suppose you execute a Python script called train.py in a directory called Users/AwesomeEmployee/Desktop/ImageClassify/training/. Within train.py, the script creates a run that enables automatic resuming. Suppose the training script is then stopped. To resume this run, restart the train.py script within Users/AwesomeEmployee/Desktop/ImageClassify/training/.

Resume preemptible Sweeps runs

Automatically requeue interrupted sweep runs. This is particularly useful if you run a sweep agent in a compute environment that is subject to preemption such as a SLURM job in a preemptible queue, an EC2 spot instance, or a Google Cloud preemptible VM.

Use the mark_preempting function to enable W&B to automatically requeue interrupted sweep runs. For example, the following code snippet marks a run as preemptible:

run = wandb.init()  # Initialize a run
run.mark_preempting()

The following table outlines how W&B handles runs based on the exit status of a sweep run.

Status Behavior
Status code 0 Run is considered to have terminated successfully and it will not be requeued.
Nonzero status W&B automatically appends the run to a run queue associated with the sweep.
No status Run is added to the sweep run queue. Sweep agents consume runs off the run queue until the queue is empty. Once the queue is empty, the sweep queue resumes generating new runs based on the sweep search algorithm.
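The requeue rule above reduces to a simple predicate: only a clean exit (status 0) keeps a run off the queue. A sketch of that decision, not the SDK's implementation:

```python
def should_requeue(exit_status):
    """Sketch of the sweep requeue rule from the table above."""
    if exit_status == 0:
        return False  # terminated successfully; not requeued
    return True  # nonzero status or no status: appended to the sweep's run queue
```

Passing None models the "no status" row, where the run is still added back to the sweep run queue.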

7 - Rewind a run

Rewind a run

Rewind a run to correct or modify its history without losing the original data. In addition, when you rewind a run, you can log new data from that point in time. W&B recomputes the summary metrics for the rewound run based on the newly logged history. This means the following behavior:

  • History truncation: W&B truncates the history to the rewind point, allowing new data logging.
  • Summary metrics: Recomputed based on the newly logged history.
  • Configuration preservation: W&B preserves the original configurations and you can merge new configurations.
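The history-truncation and summary-recomputation behavior can be simulated with plain Python lists. This is a sketch of the semantics only, not the SDK:

```python
def rewind(history, step):
    """Truncate logged history at `step`, as a rewind does."""
    return [row for row in history if row["_step"] < step]


# Simulated history: 300 logged steps of a decaying loss.
history = [{"_step": s, "loss": 1.0 / (s + 1)} for s in range(300)]

# Rewind to step 200: everything at or after step 200 is dropped.
truncated = rewind(history, 200)

# New data logged after the rewind extends the truncated history, and the
# summary is recomputed from what is now the last logged value.
truncated.append({"_step": 200, "loss": 0.004})
summary = {"loss": truncated[-1]["loss"]}
```

After the rewind, steps 200 through 299 of the original history no longer contribute to the summary; only the newly logged data from step 200 onward does.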

When you rewind a run, W&B resets the state of the run to the specified step, preserving the original data and maintaining a consistent run ID. This means that:

  • Run archiving: W&B archives the original runs. Runs are accessible from the Run Overview tab.
  • Artifact association: W&B associates artifacts with the runs that produced them.
  • Immutable run IDs: Introduced for consistent forking from a precise state.
  • Copy immutable run ID: A button to copy the immutable run ID for improved run management.

Rewind a run

Use resume_from with wandb.init() to “rewind” a run’s history to a specific step. Specify the name of the run and the step you want to rewind from:

import wandb
import math

# Initialize the first run and log some metrics
# Replace with your_project_name and your_entity_name!
run1 = wandb.init(project="your_project_name", entity="your_entity_name")
for i in range(300):
    run1.log({"metric": i})
run1.finish()

# Rewind from the first run at a specific step and log the metric starting from step 200
run2 = wandb.init(
    project="your_project_name",
    entity="your_entity_name",
    resume_from=f"{run1.id}?_step=200",
)

# Continue logging in the new run
# For the first few steps, log the metric as is from run1
# After step 250, start logging the spikey pattern
for i in range(200, 300):
    if i < 250:
        run2.log({"metric": i, "step": i})  # Continue logging from run1 without spikes
    else:
        # Introduce the spikey behavior starting from step 250
        subtle_spike = i + (2 * math.sin(i / 3.0))  # Apply a subtle spikey pattern
        run2.log({"metric": subtle_spike, "step": i})
    # Additionally log the new metric at all steps
    run2.log({"additional_metric": i * 1.1, "step": i})
run2.finish()

View an archived run

After you rewind a run, you can explore the archived run in the W&B App UI. Follow these steps to view archived runs:

  1. Access the Overview Tab: Navigate to the Overview tab on the run’s page. This tab provides a comprehensive view of the run’s details and history.
  2. Locate the Forked From field: Within the Overview tab, find the Forked From field. This field captures the history of the resumptions. The Forked From field includes a link to the source run, allowing you to trace back to the original run and understand the entire rewind history.

By using the Forked From field, you can effortlessly navigate the tree of archived resumptions and gain insights into the sequence and origin of each rewind.

Fork from a run that you rewind

To fork from a rewound run, use the fork_from argument in wandb.init() and specify the source run ID and the step from the source run to fork from:

import wandb

# Fork the run from a specific step
# `rewind_run` is the rewound run from the previous example (e.g., run2)
forked_run = wandb.init(
    project="your_project_name",
    entity="your_entity_name",
    fork_from=f"{rewind_run.id}?_step=500",
)

# Continue logging in the new run
for i in range(500, 1000):
    forked_run.log({"metric": i*3})
forked_run.finish()
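Both resume_from and fork_from take a string of the form `<run_id>?_step=<step>`, as the f-strings above show. A tiny helper (hypothetical, not part of the SDK) makes the format explicit:

```python
def run_spec(run_id: str, step: int) -> str:
    """Build the "<run_id>?_step=<step>" string accepted by the
    resume_from and fork_from arguments (illustrative helper,
    not part of the wandb SDK)."""
    return f"{run_id}?_step={step}"

print(run_spec("1jx1ud12", 500))  # → 1jx1ud12?_step=500
```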

8 - Send an alert

Send alerts, triggered from your Python code, to your Slack or email

Create alerts with Slack or email if your run crashes or with a custom trigger. For example, you can create an alert if the gradient of your training loop starts to blow up (reports NaN) or a step in your ML pipeline completes. Alerts apply to all projects where you initialize runs, including both personal and team projects.
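For the NaN-gradient case mentioned above, a minimal sketch of a check you might place inside a training loop; `loss_is_bad` is an illustrative helper (not part of the SDK), while `run.alert` is the real call shown later in this section:

```python
import math

def loss_is_bad(loss: float) -> bool:
    """Return True when the loss is NaN or infinite (illustrative helper)."""
    return math.isnan(loss) or math.isinf(loss)

# Inside a training loop, you might guard each step like this:
# if loss_is_bad(loss):
#     run.alert(title="Loss is NaN", text=f"Loss became {loss} at this step")

print(loss_is_bad(float("nan")), loss_is_bad(0.1))  # → True False
```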

After setup, W&B Alerts messages appear in Slack (or your email).

How to create an alert

There are three main steps to set up an alert:

  1. Turn on Alerts in your W&B User Settings
  2. Add run.alert() to your code
  3. Confirm alert is set up properly

1. Turn on alerts in your W&B User Settings

In your User Settings:

  • Scroll to the Alerts section
  • Turn on Scriptable run alerts to receive alerts from run.alert()
  • Use Connect Slack to pick a Slack channel to post alerts. We recommend the Slackbot channel because it keeps the alerts private.
  • Email will go to the email address you used when you signed up for W&B. We recommend setting up a filter in your email so all these alerts go into a folder and don’t fill up your inbox.

You will only have to do this the first time you set up W&B Alerts, or when you’d like to modify how you receive alerts.

Alerts settings in W&B User Settings

2. Add run.alert() to your code

Add run.alert() to your code (either in a notebook or Python script) wherever you’d like it to be triggered:

import wandb

run = wandb.init()
run.alert(title="High Loss", text="Loss is increasing rapidly")

3. Check your Slack or email

Check your Slack or email for the alert message. If you didn’t receive any, make sure email or Slack is turned on for Scriptable run alerts in your User Settings.

Example

This simple alert sends a warning when accuracy falls below a threshold. The wait_duration argument ensures alerts are sent at most once every 5 minutes (300 seconds).

import wandb
from wandb import AlertLevel

run = wandb.init()

# Example values; replace with your model's accuracy and your own threshold
acc, threshold = 0.3, 0.9

if acc < threshold:
    run.alert(
        title="Low accuracy",
        text=f"Accuracy {acc} is below the acceptable threshold {threshold}",
        level=AlertLevel.WARN,
        wait_duration=300,
    )

How to tag or mention users

Use the at sign @ followed by the Slack user ID to tag yourself or your colleagues in either the title or the text of the alert. You can find a Slack user ID from their Slack profile page.

run.alert(title="Loss is NaN", text=f"Hey <@U1234ABCD> loss has gone to NaN")

Team alerts

Team admins can set up alerts for the team on the team settings page: wandb.ai/teams/your-team.

Team alerts apply to everyone on your team. W&B recommends using the Slackbot channel because it keeps alerts private.

Change Slack channel to send alerts to

To change what channel alerts are sent to, click Disconnect Slack and then reconnect. After you reconnect, pick a different Slack channel.