How can I use wandb with multiprocessing, e.g. distributed training?
If a training program uses multiple processes, structure it so that wandb method calls are made only from processes that have called `wandb.init()`.
Manage multiprocess training using these approaches:
- Call `wandb.init` in every process and use the `group` keyword argument to create a shared group. Each process gets its own wandb run, and the UI groups the training processes together (see the first sketch after this list).
- Call `wandb.init` from only one process and pass data to log through multiprocessing queues (see the second sketch after this list).
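
A minimal sketch of the grouped-runs approach, assuming the wandb Python library; the project name, group name, worker count, and logged metric are placeholder values for illustration:

```python
# Sketch: every worker calls wandb.init() with the same `group`,
# so each process gets its own run and the UI groups them together.
import multiprocessing as mp

import wandb


def worker(rank, group_name):
    run = wandb.init(
        project="my-project",   # hypothetical project name
        group=group_name,       # shared group links the runs in the UI
        name=f"worker-{rank}",  # distinguishes runs within the group
    )
    for step in range(10):
        run.log({"loss": 1.0 / (step + 1)})  # placeholder metric
    run.finish()


if __name__ == "__main__":
    group = "ddp-experiment-1"  # hypothetical shared group name
    procs = [mp.Process(target=worker, args=(rank, group)) for rank in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```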
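
And a minimal sketch of the single-run approach, where only the main process calls `wandb.init()` and workers forward metrics through a `multiprocessing.Queue`; the metric names and the `None` sentinel protocol here are illustrative choices, not a prescribed pattern:

```python
# Sketch: only the main process talks to wandb; workers push metric
# dicts onto a queue and signal completion with a None sentinel.
import multiprocessing as mp

import wandb


def worker(rank, queue):
    # No wandb calls here -- this process never ran wandb.init().
    for step in range(10):
        queue.put({"loss": 1.0 / (step + 1), "rank": rank})  # placeholder metric
    queue.put(None)  # sentinel: this worker is done


if __name__ == "__main__":
    run = wandb.init(project="my-project")  # single run for the whole job
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(rank, queue)) for rank in range(4)]
    for p in procs:
        p.start()
    finished = 0
    while finished < len(procs):  # drain until every worker has signaled
        item = queue.get()
        if item is None:
            finished += 1
        else:
            run.log(item)
    for p in procs:
        p.join()
    run.finish()
```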
> **Info:** Refer to the Distributed Training Guide for detailed explanations of these approaches, including code examples with Torch DDP.