Import and export data
Export data or import data with W&B Public APIs.
This feature requires python>=3.8
Import data from MLFlow
W&B supports importing data from MLFlow, including experiments, runs, artifacts, metrics, and other metadata.
Install dependencies:
# note: this requires py38+
pip install wandb[importers]
Log in to W&B. Follow the prompts if you have not logged in before.
wandb login
Import all runs from an existing MLFlow server:
from wandb.apis.importers.mlflow import MlflowImporter
importer = MlflowImporter(mlflow_tracking_uri="...")
runs = importer.collect_runs()
importer.import_runs(runs)
By default, importer.collect_runs() collects all runs from the MLFlow server. To upload only a subset, construct your own iterable of runs and pass it to the importer.
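from typing import List

import mlflow
from wandb.apis.importers.mlflow import MlflowRun

# mlflow_tracking_uri is the same URI that was passed to MlflowImporter
client = mlflow.tracking.MlflowClient(mlflow_tracking_uri)
runs: List[MlflowRun] = []
for run in client.search_runs(...):
    # Wrap each MLFlow run so the importer can read it
    runs.append(MlflowRun(run, client))
importer.import_runs(runs)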
If you import from Databricks MLFlow, you might need to configure the Databricks CLI first. Set mlflow_tracking_uri="databricks" in the previous step.
To skip importing artifacts, pass artifacts=False:
importer.import_runs(runs, artifacts=False)
To import to a specific W&B entity and project, pass a Namespace:
from wandb.apis.importers import Namespace
importer.import_runs(runs, namespace=Namespace(entity, project))
Export Data
Use the Public API to export or update data that you have saved to W&B. Before using this API, log data from your script. See the Quickstart for more details.
Use Cases for the Public API
- Export Data: Pull down a dataframe for custom analysis in a Jupyter Notebook. Once you have explored the data, you can sync your findings by creating a new analysis run and logging results, for example:
wandb.init(job_type="analysis")
- Update Existing Runs: You can update the data logged in association with a W&B run. For example, you might want to update the config of a set of runs to include additional information, like the architecture or a hyperparameter that wasn't originally logged.
See the Generated Reference Docs for details on available functions.
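As a minimal sketch of the "Update Existing Runs" use case, the helper below adds a missing config field to a set of runs. The function name add_config_field is hypothetical. It assumes each run exposes a dict-like config and an update() method that persists changes, which is how Public API run objects are expected to behave; check the reference docs before relying on it.

```python
from typing import Any, Iterable


def add_config_field(runs: Iterable[Any], key: str, value: Any) -> int:
    """Set config[key] = value on each run that lacks it; return how many changed.

    Assumption: every run has a dict-like `config` attribute and an
    `update()` method that saves local changes back to the server.
    """
    changed = 0
    for run in runs:
        if key in run.config:
            continue  # keep values that were logged originally
        run.config[key] = value
        run.update()  # persist the modified config
        changed += 1
    return changed
```

With the Public API, a call might look like add_config_field(wandb.Api().runs("<entity>/<project>"), "architecture", "resnet50"), where the key and value are placeholders.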
Authentication
Authenticate your machine with your API key in one of two ways:
- Run wandb login on the command line and paste in your API key.
- Set the WANDB_API_KEY environment variable to your API key.
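In an automated environment it can help to fail early when the key is missing. The sketch below only reads the WANDB_API_KEY variable named above; the helper name resolve_api_key and its error message are assumptions for illustration, not part of the wandb package.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from WANDB_API_KEY, or raise a descriptive error."""
    env = os.environ if env is None else env
    key = env.get("WANDB_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "WANDB_API_KEY is not set; run `wandb login` or export the variable."
        )
    return key
```

Passing a mapping explicitly, as in resolve_api_key({"WANDB_API_KEY": "..."}), makes the check easy to exercise without touching the real environment.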
Find the run path
To use the Public API, you'll often need the run path, which is <entity>/<project>/<run_id>. In the app UI, open a run page and click the Overview tab to get the run path.
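Because the run path is a three-part string, a small helper can build and validate it before it is passed to the API. The RunPath class and parse_run_path function below are hypothetical names for this sketch, and the entity, project, and run ID values are placeholders.

```python
from typing import NamedTuple


class RunPath(NamedTuple):
    """The three components of a run path."""

    entity: str
    project: str
    run_id: str

    def __str__(self) -> str:
        return f"{self.entity}/{self.project}/{self.run_id}"


def parse_run_path(path: str) -> RunPath:
    """Split '<entity>/<project>/<run_id>' and reject malformed input."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <entity>/<project>/<run_id>, got {path!r}")
    return RunPath(*parts)


path = RunPath("my-team", "my-project", "abc123")
print(str(path))  # the string form is what the Public API expects
```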
Export Run Data
Download data from a finished or active run. Common usage includes downloading a dataframe for custom analysis in a Jupyter notebook, or using custom logic in an automated environment.
import wandb
api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
The most commonly used attributes of a run object are:
Attribute | Meaning |
---|---|
run.config | A dictionary of the run's configuration information, such as the hyperparameters for a training run or the preprocessing methods for a run that creates a dataset Artifact. Think of these as the run's "inputs". |
run.history() | A list of dictionaries meant to store values that change while the model is training such as loss. The command wandb.log() appends to this object. |
run.summary | A dictionary of information that summarizes the run's results. This can be scalars like accuracy and loss, or large files. By default, wandb.log() sets the summary to the final value of a logged time series. The contents of the summary can also be set directly. Think of the summary as the run's "outputs". |
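To make the relationship between the three attributes concrete, the following sketch mimics it with a plain in-memory class. SketchRun is a hypothetical stand-in and not part of wandb: here history is a simple list attribute rather than a method, and the real objects carry extra bookkeeping fields that this sketch omits.

```python
import random
from typing import Any, Dict, List


class SketchRun:
    """Hypothetical in-memory stand-in; not part of the wandb package."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = dict(config)  # the run's "inputs", set up front
        self.history: List[Dict[str, Any]] = []  # one row per log() call
        self.summary: Dict[str, Any] = {}  # the run's "outputs"

    def log(self, row: Dict[str, Any]) -> None:
        self.history.append(dict(row))
        # Default behaviour described above: summary keeps the final value.
        self.summary.update(row)


run = SketchRun({"n_epochs": 5})
for _ in range(run.config["n_epochs"]):
    run.log({"val": random.randint(0, 1000), "loss": random.randint(0, 1000) / 1000})

print(run.config)  # the single configuration dictionary
print(len(run.history))  # number of logged rows, one per epoch
print(run.summary == run.history[-1])  # summary mirrors the last logged row
```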
You can also modify or update the data of past runs. By default, a single instance of an api object caches all network requests. If your use case requires real-time information in a running script, call api.flush() to get updated values.
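For custom analysis outside a notebook, history rows can be written to CSV with the standard library alone. This is a sketch under the assumption that the rows arrive as dictionaries, as the table above describes; rows_to_csv is a hypothetical helper and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Any, Dict, Iterable


def rows_to_csv(rows: Iterable[Dict[str, Any]]) -> str:
    """Serialize history-style rows (dicts) to CSV text, one column per key."""
    rows = list(rows)
    # Collect column names in first-seen order; rows may log different keys.
    columns: Dict[str, None] = {}
    for row in rows:
        for key in row:
            columns.setdefault(key, None)
    buffer = io.StringIO()
    # restval fills in an empty cell when a row did not log a given key.
    writer = csv.DictWriter(buffer, fieldnames=list(columns), restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows standing in for a run's logged history.
sample = [{"_step": 0, "loss": 0.9}, {"_step": 1, "loss": 0.4, "val": 12}]
print(rows_to_csv(sample))
```

With a real run, the rows would likely come from run.scan_history(), which is expected to yield one dictionary per logged step; confirm the return type in the reference docs for your wandb version.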
Understanding the Different Attributes
Consider the following run:
import random
import wandb

n_epochs = 5
config = {"n_epochs": n_epochs}
# project is a placeholder for your W&B project name
run = wandb.init(project=project, config=config)
for n in range(run.config.get("n_epochs")):
    # Log one row of random metrics per epoch
    run.log(
        {"val": random.randint(0, 1000), "loss": (random.randint(0, 1000) / 1000.00)}
    )
run.finish()
The run object attributes described above produce the following outputs for this run: