One approach is to call wandb.init() and wandb.log() from a single process, e.g. the rank 0 process. This is the most common solution for logging with PyTorch DDP. In some cases, users funnel data over from other processes using a multiprocessing queue (or another communication primitive) to the main logging process.
Alternatively, you can call wandb.init() and wandb.log() in every process. These are effectively separate experiments, so use the group parameter to set a shared experiment name and group the logged values together in the UI.
PyTorch DDP (DistributedDataParallel in torch.nn) is a popular library for distributed training. In this walkthrough, we'll show how to track metrics with Weights & Biases using PyTorch DDP on two GPUs on a single machine. The basic principles apply to any distributed training setup, but the details of implementation may differ.
The rank 0 process is the main process and coordinates the other processes. Often, it's useful to track just this single process as a W&B run, calling wandb.init() in just the rank 0 process and only calling wandb.log() there, not in any sub-processes.
We launch the training script with torch.distributed.launch. With this module, we can determine the rank of the process from the --local_rank argument. Now that we have the rank of the process, we can set up wandb logging conditionally in the training script, as in the sketch below.
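A minimal sketch of the rank-0-only pattern might look like the following; the project name and the placeholder loss are illustrative, and the actual DDP setup (process group initialization, model wrapping) is omitted for brevity.

```python
import argparse

import wandb

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # supplied by torch.distributed.launch
args = parser.parse_args()

# Only the rank 0 process starts a W&B run; all other ranks skip logging entirely.
run = wandb.init(project="ddp-demo") if args.local_rank == 0 else None

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder for the real training step
    if run is not None:
        run.log({"loss": loss})

if run is not None:
    run.finish()
```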
Alternatively, you can call wandb.init on all processes and log with wandb.log() from each process separately. It's also useful to call wandb.finish() at the end of training, to mark that the run has completed so that all processes exit properly.
To tie the runs together, pass a shared group name to wandb.init(). These results will be shown together on a group page in the W&B UI, so our experiments stay organized.
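A minimal sketch of this per-process pattern, assuming a placeholder project name, group name, and metric (the DDP model setup is again omitted):

```python
import argparse

import wandb

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # supplied by torch.distributed.launch
args = parser.parse_args()

# Every process starts its own run; the shared group name ties them together in the UI.
run = wandb.init(project="ddp-demo", group="experiment-1")

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder for the real training step
    run.log({"loss": loss, "rank": args.local_rank})

# Mark the run as finished so each process exits cleanly.
run.finish()
```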
If the wandb process hangs at the beginning of training, it could be because the wandb multiprocessing is interfering with the multiprocessing from distributed training. Try setting the WANDB_START_METHOD environment variable to "thread" to use multithreading instead. We also recommend using the new wandb service to improve the reliability of your distributed jobs.
At the end of training, the wandb process might not know it needs to exit, and that will cause your job to hang. In this case, call wandb.finish() at the end of your script to mark the run as finished and cause wandb to exit. If you are using wandb in a distributed training setup and experiencing hangs, please consider using the wandb service to improve the reliability of your runs.
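One way to ensure the run is always closed, even if the training loop raises an exception, is to wrap it in try/finally; this is a sketch with placeholder names, not a required pattern.

```python
import wandb

run = wandb.init(project="ddp-demo")  # placeholder project name
try:
    for step in range(100):
        run.log({"loss": 1.0 / (step + 1)})  # placeholder metric
finally:
    # Always mark the run as finished so the wandb process knows it can exit.
    run.finish()
```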
The wandb service enhances how W&B handles multiprocessing runs and thus improves reliability in a distributed training setting.
Users who previously ran wandb in a distributed training setup could experience hanging jobs, which made for an overall poor experience. Now, with wandb service enabled by default, there is no extra work required by the user to log multiprocessing runs. On older wandb releases, you can enable wandb service directly in your script, or install a pre-release wandb package with it enabled by default.
The service can be enabled by adding the following to your script:
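A minimal sketch of that opt-in, assuming a wandb version that still treats the service as an experimental requirement; the project name and metric are placeholders.

```python
import wandb

def main():
    # Opt in to the wandb service before any runs are created.
    wandb.require("service")
    run = wandb.init(project="ddp-demo")  # placeholder project name
    run.log({"loss": 0.5})  # placeholder metric
    run.finish()

if __name__ == "__main__":
    main()
```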
You also need to call wandb.setup() in the main process:
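A minimal sketch, assuming the extra processes are spawned with Python's multiprocessing module and using a placeholder project and group name:

```python
import multiprocessing as mp

import wandb

def worker(rank: int):
    # Each spawned process starts its own run; the group name keeps them together in the UI.
    run = wandb.init(project="ddp-demo", group="experiment-1")
    run.log({"rank": rank})
    run.finish()

def main():
    wandb.setup()  # prime wandb in the main process before spawning workers
    processes = [mp.Process(target=worker, args=(rank,)) for rank in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

if __name__ == "__main__":
    main()
```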
The service is enabled by default in PyTorch Lightning, so if you are using version 1.6.0 or later, you are already using it.