You can have wandb automatically resume runs by passing resume=True to wandb.init(). If your process doesn't exit successfully, the next time you run it wandb will start logging from the last step. Below is a simple example in Keras:

    import keras
    import numpy as np
    import wandb
    from wandb.keras import WandbCallback

    wandb.init(project="preemptible", resume=True)

    if wandb.run.resumed:
        # restore the best model
        model = keras.models.load_model(wandb.restore("model-best.h5").name)
    else:
        a = keras.layers.Input(shape=(32,))
        b = keras.layers.Dense(10)(a)
        model = keras.models.Model(inputs=a, outputs=b)

    model.compile("adam", loss="mse")
    model.fit(
        np.random.rand(100, 32),
        np.random.rand(100, 10),
        # set the resumed epoch
        initial_epoch=wandb.run.step,
        epochs=300,
        # save the best model if it improved each epoch
        callbacks=[WandbCallback(save_model=True, monitor="loss")],
    )

Automatic resuming only works if the process is restarted on top of the same filesystem as the failed process. If you can't share a filesystem, you can set WANDB_RUN_ID: a globally unique string (per project) that identifies a single run of your script. It must be no longer than 64 characters, and all non-word characters are converted to dashes.

    import os

    import wandb

    # store this id to use it later when resuming
    run_id = wandb.util.generate_id()
    wandb.init(id=run_id, resume="allow")

    # or via environment variables
    os.environ["WANDB_RESUME"] = "allow"
    os.environ["WANDB_RUN_ID"] = wandb.util.generate_id()
    wandb.init()

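The comment about storing the id can be made concrete with a small helper. This is only a sketch: it assumes some directory (for example a mounted checkpoint folder) survives the restart, the function name load_or_create_run_id and the file path are hypothetical, and uuid is used as a stand-in for wandb.util.generate_id() so the snippet runs without wandb installed.

```python
import uuid
from pathlib import Path


def load_or_create_run_id(id_file):
    """Return the run id saved in id_file, creating and saving one if absent."""
    path = Path(id_file)
    if path.exists():
        saved = path.read_text().strip()
        if saved:
            # A previous attempt already picked an id: reuse it to resume.
            return saved
    # Stand-in for wandb.util.generate_id(); any string that is unique
    # within the project would do.
    run_id = uuid.uuid4().hex[:8]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(run_id)
    return run_id


# Hypothetical usage, with a path chosen for illustration only:
# run_id = load_or_create_run_id("checkpoints/wandb_run_id.txt")
# wandb.init(id=run_id, resume="allow")
```

The first call writes a fresh id, and every later call on the same file returns that id, so a restarted process passes the same value to wandb.init().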
If you set WANDB_RESUME equal to "allow", you can always set WANDB_RUN_ID to a unique string and restarts of the process will be handled automatically. If you set WANDB_RESUME equal to "must", wandb will throw an error if the run to be resumed does not exist yet instead of auto-creating a new run.
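When a scheduler already gives each job a stable name, one option is to derive WANDB_RUN_ID from that name instead of generating a random id. The sketch below is an illustration, not part of wandb: the helpers run_id_from_job_name and configure_resume are hypothetical, and the cleanup step only approximates the conversion described above (at most 64 characters, non-word characters turned into dashes) so that the resulting id is predictable before wandb sees it.

```python
import os
import re

MAX_RUN_ID_LENGTH = 64  # length limit stated in the text above


def run_id_from_job_name(job_name):
    """Turn a stable job name into a string that fits the run id rules."""
    # Replace every non-word character (anything other than letters,
    # digits, and underscore) with a dash, then enforce the length limit.
    cleaned = re.sub(r"\W", "-", job_name)
    return cleaned[:MAX_RUN_ID_LENGTH]


def configure_resume(job_name, mode="allow", env=os.environ):
    """Set WANDB_RUN_ID and WANDB_RESUME; call this before wandb.init()."""
    # Only the two modes discussed above are accepted in this sketch.
    if mode not in ("allow", "must"):
        raise ValueError(f"unsupported resume mode: {mode!r}")
    env["WANDB_RUN_ID"] = run_id_from_job_name(job_name)
    env["WANDB_RESUME"] = mode
    return env["WANDB_RUN_ID"]


# Hypothetical usage:
# configure_resume("resnet50 lr=0.1")  # WANDB_RUN_ID becomes "resnet50-lr-0-1"
# wandb.init()
```

Because the id depends only on the job name, every restart of the same job resolves to the same run without any shared filesystem.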
If you are running a sweep agent in a compute environment that is subject to preemption (e.g., a SLURM job in a preemptible queue, an EC2 spot instance, or a Google Cloud preemptible VM), you can automatically requeue your interrupted sweep runs, ensuring they will be retried until they run to completion.
When you learn that your current run is about to be preempted, call wandb.mark_preempting() to immediately signal to the W&B backend that your run believes it is about to be preempted. If a run that is marked preempting exits with status code 0, W&B will consider the run to have terminated successfully and it will not be requeued. If a preempting run exits with a nonzero status, W&B will consider the run to have been preempted, and it will automatically append the run to a run queue associated with the sweep. If a run exits with no status, W&B will mark the run preempted 5 minutes after the run's final heartbeat, then add it to the sweep run queue. Sweep agents will consume runs off the run queue until the queue is exhausted, at which point they will resume generating new runs based on the standard sweep search algorithm.
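One way to wire this up is a signal handler that marks the run and then exits with a nonzero status, so that W&B treats the run as preempted rather than finished. The sketch below rests on two assumptions: that your environment delivers SIGTERM shortly before preemption (this varies by platform, and some providers use a metadata notice instead), and that wandb.mark_preempting is the callback you pass in. The helper names are hypothetical, and the callback is a parameter so the snippet runs without wandb installed.

```python
import signal
import sys


def make_preemption_handler(mark_preempting, exit_code=1):
    """Build a signal handler that flags the run as preempting, then exits."""
    def handler(signum, frame):
        # Tell the backend that preemption is imminent.
        mark_preempting()
        # Exit with a nonzero status so the run is requeued
        # instead of being counted as a successful finish.
        sys.exit(exit_code)
    return handler


def install_preemption_handler(mark_preempting, signum=signal.SIGTERM):
    """Register the handler for the given signal (SIGTERM is an assumption)."""
    signal.signal(signum, make_preemption_handler(mark_preempting))


# Hypothetical usage inside a sweep run, after wandb.init():
# install_preemption_handler(wandb.mark_preempting)
```

Keeping the exit code nonzero matters here: per the rules above, a preempting run that exits with status 0 is considered complete and is not requeued.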
By default, requeued runs begin logging from their initial step. To instruct a run to resume logging at the step where it was interrupted, initialize the resumed run with wandb.init(resume=True).