wandb.init(), without the process having called wandb.finish(). This happens when the training process is killed, exits unexpectedly, or loses connectivity before it can report a clean finish.
Common causes
- Out-of-memory (OOM) error: The process is killed by the OS or GPU driver when it exceeds available memory. Check `output.log` for `CUDA out of memory` or `Killed` messages.
- Uncaught exception: An unhandled Python exception causes the process to exit without calling `wandb.finish()`. The exception appears in `output.log`.
- Job scheduler preemption: On SLURM or other cluster schedulers, jobs can be preempted and killed without warning. The run never gets a chance to finish cleanly.
- Network loss: In rare cases, a long network outage causes the W&B backend to time out waiting for heartbeats and mark the run as crashed, even though the process is still running.
- Process killed manually: Using `kill -9` or `SIGKILL` bypasses Python’s signal handlers, preventing `wandb.finish()` from being called.
- In the project sidebar, click Runs.
- Click the name of your run and then click the Files tab.
- Download `output.log` for stdout/stderr. This file usually contains the error that caused the crash.
- Download `debug.log` and `debug-internal.log` for W&B-level diagnostics (connectivity issues, upload errors).
- If the run was on a cluster, also check the scheduler’s job log for preemption or OOM signals.
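As a rough first pass over a downloaded `output.log`, the sketch below scans the text for the messages mentioned in this guide. The pattern list and the names `CRASH_PATTERNS` and `classify_crash` are illustrative assumptions, not part of W&B, and a real log may need more patterns.

```python
import re

# Illustrative patterns only (an assumption, not an official W&B list).
CRASH_PATTERNS = [
    ("oom", re.compile(r"CUDA out of memory|\bKilled\b")),
    ("exception", re.compile(r"Traceback \(most recent call last\)")),
]


def classify_crash(log_text):
    """Return the label of every pattern that appears in the log text."""
    return [label for label, pattern in CRASH_PATTERNS if pattern.search(log_text)]


def classify_log_file(path="output.log"):
    """Read a downloaded log file and classify its contents."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return classify_crash(handle.read())
```

An empty result does not mean the run was healthy. It only means none of these patterns matched, which is typical for preemption or a `SIGKILL`, where the process has no chance to write anything.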
wandb sync. Replace [TIMESTAMP] and [ID] with your run’s values:
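If you know the run ID but not the timestamp, a small helper can locate the local run directory and build the `wandb sync` command for it. This is a minimal sketch that assumes the default local layout, where each run is stored under `wandb/` in a directory named `run-<timestamp>-<id>`. The helper names `find_run_dirs` and `sync_command` are hypothetical.

```python
from pathlib import Path


def find_run_dirs(run_id, wandb_dir="wandb"):
    """Return local run directories whose names end in -<run_id>.

    Assumes the default layout wandb/run-<timestamp>-<id>; the leading
    wildcard also matches directories such as offline-run-<timestamp>-<id>.
    """
    matches = Path(wandb_dir).glob(f"*run-*-{run_id}")
    return sorted(path for path in matches if path.is_dir())


def sync_command(run_dir):
    """Build the argument list for `wandb sync <run_dir>`."""
    return ["wandb", "sync", str(run_dir)]
```

You can pass the resulting list to `subprocess.run`, or print it and run the command by hand.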
Use `wandb.init()` as a context manager so the run finishes cleanly when your script raises an exception. The run is marked as Failed (rather than Crashed) and buffered data is flushed:
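A minimal sketch of the context-manager pattern follows. The project name `crash-demo`, the toy `train` loop, and its `fail_at` parameter are hypothetical and exist only to illustrate the flow. `wandb` is imported inside `main()` so that `train` can be exercised without the package installed.

```python
def train(run, num_steps=3, fail_at=None):
    """Toy training loop (hypothetical): logs one loss value per step."""
    for step in range(num_steps):
        if fail_at is not None and step == fail_at:
            # Simulates an unhandled error in the middle of training.
            raise RuntimeError(f"simulated failure at step {step}")
        run.log({"loss": 1.0 / (step + 1)})


def main():
    import wandb  # requires the wandb package to be installed

    # Leaving the with-block, normally or through an exception,
    # finishes the run, so wandb.finish() is not called by hand.
    with wandb.init(project="crash-demo") as run:
        train(run)
```

Call `main()` from your script's entry point. The exception still propagates after the run is closed, so your script exits with an error as before.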