W&B marks a run as Crashed when it stops receiving heartbeats from the process that called wandb.init(), without the process having called wandb.finish(). This happens when the training process is killed, exits unexpectedly, or loses connectivity before it can report a clean finish.
Common causes
  • Out-of-memory (OOM) error: The process is killed by the OS or GPU driver when it exceeds available memory. Check output.log for CUDA out of memory or Killed messages.
  • Uncaught exception: An unhandled Python exception causes the process to exit without calling wandb.finish(). The exception appears in output.log.
  • Job scheduler preemption: On SLURM or other cluster schedulers, jobs can be preempted and killed without warning. The run never gets a chance to finish cleanly.
  • Network loss: In rare cases, a long network outage causes the W&B backend to time out waiting for heartbeats and mark the run as crashed, even though the process is still running.
  • Process killed manually: kill -9 sends SIGKILL, which a process cannot catch or handle, so Python exits immediately and wandb.finish() is never called.
How to debug
  1. In the project sidebar, click Runs.
  2. Click the name of your run and then click the Files tab.
  3. Download output.log for stdout/stderr. This file usually contains the error that caused the crash.
  4. Download debug.log and debug-internal.log for W&B-level diagnostics (connectivity issues, upload errors).
  5. If the run was on a cluster, also check the scheduler’s job log for preemption or OOM signals.
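Once output.log is downloaded, a quick scan for the messages mentioned above can point to the likely cause. The following is a minimal sketch, not a W&B tool: scan_log and CRASH_MARKERS are hypothetical names, and the marker list is an assumption that covers only the OOM messages named in this article plus the header Python prints for an uncaught exception.

```python
# Hypothetical helper: scan a downloaded output.log for common crash markers.
# The first two markers come from the causes listed in this article; the third
# is the header line Python prints before the traceback of an uncaught exception.
CRASH_MARKERS = {
    "CUDA out of memory": "GPU out-of-memory",
    "Killed": "process killed (for example by the OS OOM killer)",
    "Traceback (most recent call last)": "uncaught Python exception",
}


def scan_log(path):
    """Return a list of (line_number, likely_cause, line_text) tuples."""
    hits = []
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, encoding="utf-8", errors="replace") as log_file:
        for line_number, line in enumerate(log_file, start=1):
            for marker, cause in CRASH_MARKERS.items():
                if marker in line:
                    hits.append((line_number, cause, line.rstrip()))
    return hits
```

Calling scan_log("output.log") returns the matching lines in file order. A substring match like this can produce false positives (for example, a log line that happens to contain the word Killed), so treat the result as a pointer to where to start reading, not as a diagnosis.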
Data from a crashed run
Metrics logged before the crash are preserved and visible in the UI. The run’s charts, system metrics, and any artifacts that were fully uploaded before the crash are all accessible. Partially-uploaded artifacts may be incomplete.
If steps that were logged locally are missing from the UI (for example, the process kept running after the run was marked crashed), push the buffered data from the local run directory with wandb sync. Replace [TIMESTAMP] and [ID] with your run’s values:
wandb sync wandb/run-[TIMESTAMP]-[ID]
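If you know the run ID but not the timestamp, the local directory can be looked up by its suffix. This is a small sketch that assumes the default wandb directory in the current working directory and the run-[TIMESTAMP]-[ID] naming shown above; find_run_dirs is a hypothetical helper, not part of the wandb package.

```python
from pathlib import Path


def find_run_dirs(run_id, wandb_dir="wandb"):
    """Return local run directories whose names end with the given run ID."""
    # Matches names of the form run-[TIMESTAMP]-[ID] directly under wandb_dir.
    pattern = f"run-*-{run_id}"
    return sorted(p for p in Path(wandb_dir).glob(pattern) if p.is_dir())


def sync_commands(run_id, wandb_dir="wandb"):
    """Build the wandb sync command line for each matching directory."""
    return [f"wandb sync {path.as_posix()}" for path in find_run_dirs(run_id, wandb_dir)]
```

sync_commands only builds the command strings so you can review them before running anything. An empty list means no directory matched, which usually indicates that the script ran from a different working directory.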
See My run’s state is crashed on the UI but is still running on my machine for details.
Preventing crashes from losing data
Use wandb.init() as a context manager so the run finishes cleanly when your script raises an exception. The run is marked as Failed (rather than Crashed) and buffered data is flushed:
import wandb

with wandb.init(project="[YOUR-PROJECT]") as run:
    for step in range(1000):
        loss = 1.0 / (step + 1)  # placeholder value; replace with your training step
        run.log({"loss": loss})
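The context manager only helps when Python gets a chance to unwind the with block. By default, SIGTERM ends a Python process without raising an exception, so a preempted job can still end up Crashed. The sketch below shows one possible way to cover that case, assuming your scheduler sends SIGTERM and allows a grace period before force-killing the job (this depends on cluster configuration). It converts SIGTERM into a normal Python exit so that with blocks and finally clauses run. The names exit_on_sigterm and install_sigterm_handler are hypothetical and not part of the wandb API; SIGKILL cannot be handled this way.

```python
import signal
import sys


def exit_on_sigterm(signum, frame):
    """Turn SIGTERM into SystemExit so `with` blocks and `finally` clauses run."""
    # 128 + signal number is the conventional shell exit status for a signal.
    sys.exit(128 + signum)


def install_sigterm_handler():
    """Register the handler and return whatever handler was installed before."""
    # Must be called from the main thread, before the training loop starts.
    return signal.signal(signal.SIGTERM, exit_on_sigterm)
```

Call install_sigterm_handler() once near the top of the script, before the with wandb.init(...) block. When SIGTERM arrives, SystemExit propagates out of the with block like any other exception, which gives the run a chance to finish before the process exits.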
For run state definitions, see Run states. For console logs after a crash, see Why is console output not captured for my run?.