Metaflow

How to integrate W&B with Metaflow.

Overview

Metaflow is a framework created by Netflix for building and running ML workflows.

This integration lets users apply decorators to Metaflow steps and flows to automatically log parameters and artifacts to W&B.

  • Decorating a step will turn logging off or on for certain types within that step.
  • Decorating the flow will turn logging off or on for every step in the flow.

Quickstart

Sign up and create an API key

An API key authenticates your machine to W&B. You can generate an API key from your user profile.

  1. Click your user profile icon in the upper right corner.
  2. Select User Settings, then scroll to the API Keys section.
  3. Click Reveal. Copy the displayed API key. To hide the API key, reload the page.

Install the wandb library and log in

To install the wandb library locally and log in:

  1. Set the WANDB_API_KEY environment variable to your API key.

    export WANDB_API_KEY=<your_api_key>
    
  2. Install the wandb library and log in.

    pip install -Uqqq metaflow fastcore wandb
    
    wandb login
    
If you are working in a Python notebook instead, install the dependencies and log in from Python:

    !pip install -Uqqq metaflow fastcore wandb

    import wandb
    wandb.login()

Decorate your flows and steps

Decorating a step turns logging off or on for certain types within that step.

In this example, all datasets and models in start are logged:

from metaflow import FlowSpec, step
import pandas as pd
import torch
import wandb

from wandb.integration.metaflow import wandb_log

class WandbExampleFlow(FlowSpec):
    @wandb_log(datasets=True, models=True, settings=wandb.Settings(...))
    @step
    def start(self):
        self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
        self.model_file = torch.load(...)  # nn.Module    -> upload as model
        self.next(self.transform)

Decorating a flow is equivalent to decorating all of its steps with the same default.

In this case, every step in WandbExampleFlow logs datasets and models, just as if each step were decorated with @wandb_log(datasets=True, models=True):

from metaflow import FlowSpec, step
import pandas as pd
import torch

from wandb.integration.metaflow import wandb_log

@wandb_log(datasets=True, models=True)  # decorate all @step methods
class WandbExampleFlow(FlowSpec):
    @step
    def start(self):
        self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
        self.model_file = torch.load(...)  # nn.Module    -> upload as model
        self.next(self.transform)

Because decorating the flow applies a default to every step, decorating a step with its own @wandb_log overrides the flow-level decoration.

In this example:

  • start and mid log both datasets and models.
  • end logs neither datasets nor models.
from metaflow import FlowSpec, step
import pandas as pd
import torch

from wandb.integration.metaflow import wandb_log

@wandb_log(datasets=True, models=True)  # same as decorating start and mid
class WandbExampleFlow(FlowSpec):
  # this step logs datasets and models
  @step
  def start(self):
    self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
    self.model_file = torch.load(...)  # nn.Module    -> upload as model
    self.next(self.mid)

  # this step also logs datasets and models
  @step
  def mid(self):
    self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
    self.model_file = torch.load(...)  # nn.Module    -> upload as model
    self.next(self.end)

  # this step's decorator overrides the flow-level one: it logs NEITHER datasets NOR models
  @wandb_log(datasets=False, models=False)
  @step
  def end(self):
    self.raw_df = pd.read_csv(...)
    self.model_file = torch.load(...)

Access your data programmatically

You can access the captured information in three ways: inside the original Python process being logged, using the wandb client library; in the web app UI; or programmatically, using the Public API.

  • Parameters are saved to W&B's config and can be found in the Overview tab.
  • datasets, models, and others are saved to W&B Artifacts and can be found in the Artifacts tab.
  • Base Python types are saved to W&B's summary dict and can be found in the Overview tab.

See our guide to the Public API for details on using the API to get this information programmatically from outside.
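A minimal sketch of the Public API route. The entity, project, and run ID below are placeholders, and the API calls are shown commented out because they require a W&B account and a completed run:

```python
# Placeholder identifiers; substitute your own entity, project, and run ID.
entity, project, run_id = "my-team", "my-project", "abc123"
run_path = f"{entity}/{project}/{run_id}"
print(run_path)  # my-team/my-project/abc123

# With wandb installed and an API key configured, the Public API looks like:
#   import wandb
#   run = wandb.Api().run(run_path)
#   print(run.config)         # parameters (Overview tab, Config)
#   print(dict(run.summary))  # base Python types (Overview tab, Summary)
```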

Quick reference

| Data | Client library | UI |
| --- | --- | --- |
| Parameter(...) | wandb.config | Overview tab, Config |
| datasets, models, others | wandb.use_artifact("{var_name}:latest") | Artifacts tab |
| Base Python types (dict, list, str, etc.) | wandb.summary | Overview tab, Summary |
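Per the table above, an artifact is named after the instance variable it was assigned to. A small sketch of building such a reference (the actual fetch is commented out because it needs an active W&B run):

```python
# wandb_log names each artifact after the instance variable it was assigned to,
# so a dataset stored as self.raw_df can be referenced by that name:
var_name = "raw_df"
artifact_name = f"{var_name}:latest"
print(artifact_name)  # raw_df:latest

# Inside an active run you could then fetch it:
#   import wandb
#   artifact = wandb.use_artifact(artifact_name)
#   local_dir = artifact.download()
```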

wandb_log kwargs

| kwarg | Options |
| --- | --- |
| datasets | True: log instance variables that are a dataset. False: do not log datasets. |
| models | True: log instance variables that are a model. False: do not log models. |
| others | True: log anything else that is serializable as a pickle. False: do not log other types. |
| settings | wandb.Settings(…): specify your own wandb settings for this step or flow. None: equivalent to passing wandb.Settings(). |

By default:

  • If settings.run_group is None, it is set to {flow_name}/{run_id}.
  • If settings.run_job_type is None, it is set to {run_job_type}/{step_name}.
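The default run group is just the flow name and Metaflow run ID joined with a slash. An illustrative sketch (the values below are made up; in a real run Metaflow supplies them):

```python
# Illustrative values; in a real run Metaflow supplies the flow name and run ID.
flow_name = "WandbExampleFlow"
run_id = "1688412083123456"

run_group = f"{flow_name}/{run_id}"  # default for settings.run_group
print(run_group)  # WandbExampleFlow/1688412083123456
```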

Frequently Asked Questions

What exactly do you log? Do you log all instance and local variables?

wandb_log only logs instance variables. Local variables are NEVER logged. This is useful to avoid logging unnecessary data.
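A minimal sketch of why this distinction matters: instance variables live on the object and remain visible to a decorator after the step runs, while locals vanish when the method returns. This illustration contains no W&B calls, it only shows what a decorator like wandb_log can see:

```python
class ExampleStep:
    def start(self):
        accuracy = 0.9        # local variable: gone once the method returns
        self.accuracy = 0.9   # instance variable: eligible for logging

step = ExampleStep()
step.start()

# A decorator can only inspect the instance's attributes:
print(vars(step))  # {'accuracy': 0.9} -- only the instance variable survives
```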

Which data types get logged?

We currently support these types:

| Logging setting | Type |
| --- | --- |
| default (always on) | dict, list, set, str, int, float, bool |
| datasets | pd.DataFrame, pathlib.Path |
| models | nn.Module, sklearn.base.BaseEstimator |
| others | Anything else that is serializable as a pickle |
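The dispatch above can be sketched as a type check. This is an illustrative approximation, not the integration's actual implementation, and it omits the pandas, torch, and sklearn checks to stay dependency-free:

```python
import pathlib

BASE_TYPES = (dict, list, set, str, int, float, bool)

def logging_bucket(value):
    """Approximate the table above: which wandb_log setting governs this value?"""
    if isinstance(value, BASE_TYPES):
        return "default"   # always logged to wandb.summary
    if isinstance(value, pathlib.Path):
        return "datasets"  # the real integration also matches pd.DataFrame here
    return "others"        # logged only if others=True and pickle-serializable

print(logging_bucket(0.9))                   # default
print(logging_bucket(pathlib.Path("data")))  # datasets
print(logging_bucket(object()))              # others
```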

How can I configure logging behavior?

| Kind of variable | Behavior | Example | Data type |
| --- | --- | --- | --- |
| Instance | Auto-logged | self.accuracy | float |
| Instance | Logged if datasets=True | self.df | pd.DataFrame |
| Instance | Not logged if datasets=False | self.df | pd.DataFrame |
| Local | Never logged | accuracy | float |
| Local | Never logged | df | pd.DataFrame |

Is artifact lineage tracked?

Yes. If you have an artifact that is an output of step A and an input to step B, we automatically construct the lineage DAG for you.

For an example of this behavior, see this notebook and its corresponding W&B Artifacts page.


Last modified February 13, 2025: a09c831