Resume Runs
You can have wandb automatically resume runs by passing resume=True to wandb.init(). If your process doesn't exit successfully, the next time you run it wandb will start logging from the last step.
Keras

```python
import keras
import numpy as np
import wandb
from wandb.keras import WandbCallback

wandb.init(project="preemptible", resume=True)

if wandb.run.resumed:
    # restore the best model
    model = keras.models.load_model(wandb.restore("model-best.h5").name)
else:
    a = keras.layers.Input(shape=(32,))
    b = keras.layers.Dense(10)(a)
    model = keras.models.Model(inputs=a, outputs=b)

model.compile("adam", loss="mse")
model.fit(np.random.rand(100, 32), np.random.rand(100, 10),
          # resume training from the last logged step
          initial_epoch=wandb.run.step, epochs=300,
          # save the best model if it improves each epoch
          callbacks=[WandbCallback(save_model=True, monitor="loss")])
```
PyTorch

```python
import wandb
import torch
import torch.nn as nn
import torch.optim as optim

PROJECT_NAME = 'pytorch-resume-run'
CHECKPOINT_PATH = './checkpoint.tar'
N_EPOCHS = 100

# Dummy data
X = torch.randn(64, 8, requires_grad=True)
Y = torch.empty(64, 1).random_(2)
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid()
)
metric = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
epoch = 0
run = wandb.init(project=PROJECT_NAME, resume=True)
if wandb.run.resumed:
    checkpoint = torch.load(wandb.restore(CHECKPOINT_PATH))
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    loss = checkpoint['loss']

model.train()
while epoch < N_EPOCHS:
    optimizer.zero_grad()
    output = model(X)
    loss = metric(output, Y)
    wandb.log({'loss': loss.item()}, step=epoch)
    loss.backward()
    optimizer.step()

    # save a checkpoint locally, then sync it to W&B
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, CHECKPOINT_PATH)
    wandb.save(CHECKPOINT_PATH)
    epoch += 1
```

Resuming Guidance

There are a few different ways to resume runs with W&B, as detailed below:
  1. resume
    This is our recommended method for resuming runs with W&B.
    1. As described above, runs can be resumed by passing resume=True to wandb.init(). This can be thought of as auto-resuming, where we "automatically" pick up from where an aborted run left off. If your process doesn't exit successfully, the next time you run it wandb will start logging from the last step.
      • Note: this only works if you run your script from the same directory as the run that failed, because the resume state is stored at wandb/wandb-resume.json.
    2. The other form of resuming requires you to provide the actual run id: wandb.init(id=run_id). To be sure that the run is resuming, call wandb.init(id=run_id, resume="must").
      • You can also take full control over resuming by managing the run_id yourself. We provide a utility to generate run ids: wandb.util.generate_id(). As long as you set the id to one of these unique ids for each unique run, you can pass resume="allow" and wandb will automatically resume the run with that id.
    More context regarding automatic and controlled resuming can be found in the "Automatic and controlled resuming" section below.
  2. Checkpoint files with wandb.save and wandb.restore
    • This approach lets you log new historical values for your metrics starting from where you left off, but it does not re-establish the state of your code; you need to have written checkpoints that you can load.
    • You can use wandb.save() to record the state of your run via checkpoint files. Create a checkpoint file through wandb.save(), resume the run with wandb.init(resume=<run-id>), and reload the file with wandb.restore(). This report illustrates how to save and restore models with W&B.
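
The checkpointing requirement above can be sketched framework-agnostically: persist enough state each epoch that a fresh process can pick up where the old one stopped. The filename and dictionary keys below are illustrative, not part of the wandb API:

```python
import json
import os

CHECKPOINT_PATH = "checkpoint.json"  # illustrative filename

def save_checkpoint(epoch, loss, path=CHECKPOINT_PATH):
    # Persist the minimal state needed to resume training later.
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "loss": loss}, f)

def load_checkpoint(path=CHECKPOINT_PATH):
    # Return (epoch, loss) from an earlier run, or a fresh start
    # if no checkpoint file exists yet.
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        state = json.load(f)
    return state["epoch"], state["loss"]
```

In a real run you would call wandb.save(CHECKPOINT_PATH) after each save_checkpoint so the file is stored with the run, and wandb.restore(CHECKPOINT_PATH) before load_checkpoint when resuming.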

Automatic and controlled resuming

Automatic resuming only works if the process is restarted on top of the same filesystem as the failed process. If you can't share a filesystem, you can set WANDB_RUN_ID: a globally unique string (per project) corresponding to a single run of your script. It must be no longer than 64 characters. All non-word characters will be converted to dashes.
```python
import os
import wandb

# store this id to use it later when resuming
id = wandb.util.generate_id()
wandb.init(id=id, resume="allow")

# or via environment variables
os.environ["WANDB_RESUME"] = "allow"
os.environ["WANDB_RUN_ID"] = wandb.util.generate_id()
wandb.init()
```
If you set WANDB_RESUME equal to "allow", you can always set WANDB_RUN_ID to a unique string and restarts of the process will be handled automatically. If you set WANDB_RESUME equal to "must", wandb will throw an error if the run to be resumed does not exist yet instead of auto-creating a new run.
If multiple processes use the same run_id concurrently, unexpected results will be recorded and rate limiting will occur.
If you resume a run and you have notes specified in wandb.init(), those notes will overwrite any notes that you have added in the UI.
Note that resuming a run which was executed as part of a Sweep is not supported.
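
The run-id constraints above (64-character limit, non-word characters converted to dashes) can be mirrored client-side if you generate ids yourself. This is a pure-Python sketch; sanitize_run_id and generate_run_id are illustrative helpers, not part of the wandb API (use wandb.util.generate_id() in practice):

```python
import re
import secrets
import string

MAX_RUN_ID_LEN = 64  # W&B run ids must be at most 64 characters

def sanitize_run_id(raw):
    # Mirror the documented rule: non-word characters become dashes,
    # and the id is capped at 64 characters.
    return re.sub(r"\W", "-", raw)[:MAX_RUN_ID_LEN]

def generate_run_id(length=8):
    # Illustrative stand-in for wandb.util.generate_id():
    # a short random lowercase alphanumeric string.
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```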

Preemptible Sweeps

If you are running a sweep agent in a compute environment that is subject to preemption (e.g., a SLURM job in a preemptible queue, an EC2 spot instance, or a Google Cloud preemptible VM), you can automatically requeue your interrupted sweep runs, ensuring they will be retried until they run to completion.
When you learn your current run is about to be preempted, call
```python
wandb.mark_preempting()
```
to immediately signal to the W&B backend that your run believes it is about to be preempted. If a run that is marked preempting exits with status code 0, W&B will consider the run to have terminated successfully and it will not be requeued. If a preempting run exits with a nonzero status, W&B will consider the run to have been preempted, and it will automatically append the run to a run queue associated with the sweep. If a run exits with no status, W&B will mark the run preempted 5 minutes after the run's final heartbeat, then add it to the sweep run queue. Sweep agents will consume runs off the run queue until the queue is exhausted, at which point they will resume generating new runs based on the standard sweep search algorithm.
By default, requeued runs begin logging from their initial step. To instruct a run to resume logging at the step where it was interrupted, initialize the resumed run with wandb.init(resume=True).
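
How you learn that preemption is imminent depends on the scheduler: SLURM, for example, delivers SIGTERM to a job shortly before killing it. Below is a minimal sketch of wiring such a signal to mark_preempting, assuming a POSIX environment; the wandb call is left as a comment so the sketch stands alone:

```python
import signal

def handle_preemption(signum, frame):
    # Flag the run as preempting; in a live run you would call:
    # wandb.mark_preempting()
    handle_preemption.triggered = True

handle_preemption.triggered = False

# SLURM (and most cluster schedulers) send SIGTERM before preempting a job.
signal.signal(signal.SIGTERM, handle_preemption)
```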