run_id or use wandb.init(resume=...) while a sweep agent is running, W&B ignores the run ID and starts a fresh run instead. You will see the following warning:
- Checkpoint and reload within a single run: Save model checkpoints at regular intervals inside your training function. On restart, load the latest checkpoint at the beginning of train(). The sweep starts a new run, but training picks up from the saved state.
- Use --count 1 on SLURM with requeue: Submit each sweep agent job with wandb agent --count 1 SWEEP_ID. If the job is preempted, SLURM can requeue it and the sweep controller will assign a new configuration.
- Mark a run as failed and requeue manually: If a run crashes mid-way, the sweep controller will eventually mark it as failed and may assign the same configuration to a new agent depending on your sweep settings.
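The checkpoint-and-reload option can be sketched as follows. This is a minimal illustration using only the Python standard library, not W&B's own API: the helper names (save_checkpoint, load_latest_checkpoint), the JSON file layout, and the stop_after argument used to simulate a preemption are all hypothetical. It also assumes you pick a checkpoint directory that stays the same across restarts of the same job. In a real sweep, wandb.init() would be called at the top of train() and would start a new run each time.

```python
import json
import os
import tempfile


def save_checkpoint(checkpoint_dir, epoch, state):
    """Write the state for `epoch` to checkpoint_dir (hypothetical helper)."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, f"epoch_{epoch:05d}.json")
    # Write to a temp file and rename, so a job killed mid-write
    # never leaves a truncated checkpoint behind.
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(checkpoint_dir):
    """Return the newest checkpoint dict, or None if none exists."""
    if not os.path.isdir(checkpoint_dir):
        return None
    # Zero-padded epoch numbers make lexicographic order match epoch order.
    names = sorted(
        n for n in os.listdir(checkpoint_dir)
        if n.startswith("epoch_") and n.endswith(".json")
    )
    if not names:
        return None
    with open(os.path.join(checkpoint_dir, names[-1])) as f:
        return json.load(f)


def train(checkpoint_dir, total_epochs, save_every=1, stop_after=None):
    # Assumption: in a sweep, wandb.init() would go here and start a new run.
    checkpoint = load_latest_checkpoint(checkpoint_dir)
    start_epoch = checkpoint["epoch"] + 1 if checkpoint else 0
    state = checkpoint["state"] if checkpoint else {"steps": 0}

    for epoch in range(start_epoch, total_epochs):
        if stop_after is not None and epoch >= stop_after:
            return state  # simulate the job being preempted
        state["steps"] += 1  # placeholder for one epoch of real training
        if (epoch + 1) % save_every == 0:
            save_checkpoint(checkpoint_dir, epoch, state)
    return state
```

Calling train() a second time with the same checkpoint_dir continues from the epoch after the last saved one instead of starting from zero. With a real model, state would hold whatever your framework needs to resume (for example weights and optimizer state), and you would likely swap JSON for that framework's own save and load functions.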
wandb.init(resume="allow", id="YOUR_RUN_ID") in a standalone script instead. See Resume runs and Troubleshoot sweeps.