Resume Runs

You can have wandb automatically resume runs by passing resume=True to wandb.init(). If your process doesn't exit successfully, the next time you run it wandb will start logging from the last step. Below is a simple example in Keras:

import keras
import numpy as np
import wandb
from wandb.keras import WandbCallback

wandb.init(project="preemptible", resume=True)

if wandb.run.resumed:
    # restore the best model
    model = keras.models.load_model(wandb.restore("model-best.h5").name)
else:
    a = keras.layers.Input(shape=(32,))
    b = keras.layers.Dense(10)(a)
    # Keras expects the plural keyword arguments `inputs` and `outputs`
    model = keras.models.Model(inputs=a, outputs=b)

model.compile("adam", loss="mse")
model.fit(
    np.random.rand(100, 32),
    np.random.rand(100, 10),
    # set the resumed epoch
    initial_epoch=wandb.run.step,
    epochs=300,
    # save the best model if it improved each epoch
    callbacks=[WandbCallback(save_model=True, monitor="loss")],
)

Automatic resuming only works if the process is restarted on the same filesystem as the failed process. If you can't share a filesystem, you can set WANDB_RUN_ID: a string, unique within the project, that corresponds to a single run of your script. It must be no longer than 64 characters, and all non-word characters will be converted to dashes.

import os
import wandb

# store this id to use it later when resuming
run_id = wandb.util.generate_id()
wandb.init(id=run_id, resume="allow")

# or via environment variables
os.environ["WANDB_RESUME"] = "allow"
os.environ["WANDB_RUN_ID"] = wandb.util.generate_id()
wandb.init()

If you set WANDB_RESUME equal to "allow", you can always set WANDB_RUN_ID to a unique string and restarts of the process will be handled automatically. If you set WANDB_RESUME equal to "must", wandb will throw an error if the run to be resumed does not exist yet instead of auto-creating a new run.
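For a restart to reuse the same run, the id has to survive the failed process. The following is a minimal sketch of one way to do that, assuming you have some location (a file path here) that outlives the process. The helper names `normalize_run_id`, `load_or_create_run_id`, and `init_resumable_run` are hypothetical and not part of wandb, and the id check is only a local approximation of the 64-character limit and dash conversion described above.

```python
import re
import uuid
from pathlib import Path

MAX_RUN_ID_LENGTH = 64  # limit stated in the text above


def normalize_run_id(raw: str) -> str:
    """Local approximation of the documented rule: non-word characters become dashes."""
    run_id = re.sub(r"\W", "-", raw)
    if not run_id or len(run_id) > MAX_RUN_ID_LENGTH:
        raise ValueError(f"run id must be 1-{MAX_RUN_ID_LENGTH} characters: {raw!r}")
    return run_id


def load_or_create_run_id(path, make_id=lambda: uuid.uuid4().hex[:8]):
    """Return the id stored at `path`, creating and saving one on first use.

    `make_id` is a hypothetical hook; you could pass wandb.util.generate_id instead.
    """
    id_file = Path(path)
    if id_file.exists():
        return normalize_run_id(id_file.read_text().strip())
    run_id = normalize_run_id(make_id())
    id_file.parent.mkdir(parents=True, exist_ok=True)
    id_file.write_text(run_id)
    return run_id


def init_resumable_run(id_path, project):
    """Start a new run, or resume the one whose id is stored at `id_path`."""
    import wandb  # imported lazily so the helpers above work without wandb installed

    return wandb.init(project=project, id=load_or_create_run_id(id_path), resume="allow")
```

Calling `init_resumable_run("checkpoints/run_id.txt", "preemptible")` at the top of a training script would then create the run on the first launch and pass the same id with resume="allow" on every restart. The file path is illustrative; any storage that the restarted process can read would serve the same purpose.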

| Method | Syntax | Never Resume (default) | Always Resume | Resume specifying run id | Resume from same directory |
| --- | --- | --- | --- | --- | --- |
| Environment Variable | WANDB_RESUME=value | "never" | "must" | "allow" (Requires WANDB_RUN_ID=RUN_ID) | (not available) |
| In Python Script | wandb.init(resume=value) | "never" | (not available) | resume=RUN_ID | resume=True |

If multiple processes use the same run_id concurrently, unexpected results will be recorded and rate limiting will occur.

If you resume a run and you have notes specified in wandb.init(), those notes will overwrite any notes that you have added in the UI.

Note that resuming a run which was executed as part of a Sweep is not supported.

Preemptible Sweeps

If you are running a sweep agent in a compute environment that is subject to preemption (e.g., a SLURM job in a preemptible queue, an EC2 spot instance, or a Google Cloud preemptible VM), you can automatically requeue your interrupted sweep runs, ensuring they will be retried until they run to completion.

When you learn your current run is about to be preempted, call

wandb.mark_preempting()

to immediately signal to the W&B backend that your run believes it is about to be preempted. If a run that is marked preempting exits with status code 0, W&B will consider the run to have terminated successfully and it will not be requeued. If a preempting run exits with a nonzero status, W&B will consider the run to have been preempted, and it will automatically append the run to a run queue associated with the sweep. If a run exits with no status, W&B will mark the run preempted 5 minutes after the run's final heartbeat, then add it to the sweep run queue. Sweep agents will consume runs off the run queue until the queue is exhausted, at which point they will resume generating new runs based on the standard sweep search algorithm.
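How you learn about an impending preemption depends on your environment. As a rough sketch, assuming the scheduler or cloud provider delivers SIGTERM to the process shortly before reclaiming the machine (an assumption; check what your platform actually sends), a signal handler can call wandb.mark_preempting() and then exit with a nonzero status so the run is requeued. The helper names `make_preemption_handler` and `install_preemption_handler` are hypothetical, and the exit code 143 is just the conventional 128 + SIGTERM value; any nonzero status would satisfy the rule described above.

```python
import signal
import sys


def make_preemption_handler(mark_preempting, exit_code=143):
    """Build a signal handler that reports preemption, then exits nonzero."""

    def handler(signum, frame):
        mark_preempting()    # tell the backend before the process goes away
        sys.exit(exit_code)  # nonzero status, so the run counts as preempted

    return handler


def install_preemption_handler(signum=signal.SIGTERM):
    """Register the handler; call once from the main thread, after wandb.init()."""
    import wandb  # imported lazily so the factory above has no hard dependency

    signal.signal(signum, make_preemption_handler(wandb.mark_preempting))
```

Passing the marking function in as an argument keeps the handler easy to exercise without a live run, for example by substituting a stub in a unit test.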

By default, requeued runs begin logging from their initial step. To instruct a run to resume logging at the step where it was interrupted, initialize the resumed run with wandb.init(resume=True).