Observe infrastructure alerts, such as GPU failures and thermal violations, during machine learning experiments that you log to W&B. During a W&B run, CoreWeave Mission Control monitors your compute infrastructure.
This feature is in Preview and only available when training on a CoreWeave cluster. Contact your W&B representative for access.
If an error occurs, CoreWeave sends that information to W&B. W&B adds the infrastructure information to your run’s plots in your project’s workspace. CoreWeave attempts to resolve some issues automatically, and W&B surfaces that information on the run’s page.