Artifacts

Overview of what W&B Artifacts are, how they work, and how to get started using W&B Artifacts.

Use W&B Artifacts to track and version data as the inputs and outputs of your W&B Runs. For example, a model training run might take in a dataset as input and produce a trained model as output. You can log hyperparameters, metadata, and metrics to a run, and you can use one artifact to log, track, and version the dataset used to train the model and another artifact for the resulting model checkpoints.

Use cases

You can use artifacts throughout your entire ML workflow as inputs and outputs of runs. You can use datasets, models, or even other artifacts as inputs for processing.

  • Model training: takes a dataset (training and validation data) as input and produces a trained model.
  • Dataset pre-processing: takes a raw dataset as input and produces a pre-processed dataset.
  • Model evaluation: takes a model plus a test dataset as input and produces a W&B Table.
  • Model optimization: takes a model as input and produces an optimized model.

Create an artifact

Create an artifact with four lines of code:

  1. Create a W&B run.
  2. Create an artifact object with the wandb.Artifact API.
  3. Add one or more files, such as a model file or dataset, to your artifact object.
  4. Log your artifact to W&B.

For example, the following code snippet shows how to log a file called dataset.h5 to an artifact called example_artifact:

import wandb

run = wandb.init(project="artifacts-example", job_type="add-dataset")
artifact = wandb.Artifact(name="example_artifact", type="dataset")
artifact.add_file(local_path="./dataset.h5", name="training_dataset")
artifact.save()

# Logs a version of "example_artifact" with the contents of
# dataset.h5 stored inside the artifact as "training_dataset"

Download an artifact

Indicate the artifact you want to mark as input to your run with the use_artifact method.

Following the preceding code snippet, this next code block shows how to use the example_artifact artifact:

artifact = run.use_artifact("example_artifact:latest")

This returns an artifact object.

Next, use the returned object to download all contents of the artifact:

datadir = artifact.download()  # downloads the full artifact to the default directory

Next steps

1 - Create an artifact

Create and construct a W&B Artifact. Learn how to add one or more files or a URI reference to an Artifact.

Use the W&B Python SDK to construct artifacts from W&B Runs. You can add files, directories, URIs, and files from parallel runs to artifacts. After you add a file to an artifact, save the artifact to the W&B Server or your own private server.

For information on how to track external files, such as files stored in Amazon S3, see the Track external files page.

How to construct an artifact

Construct a W&B Artifact in three steps:

1. Create an artifact Python object with wandb.Artifact()

Initialize the wandb.Artifact() class to create an artifact object. Specify the following parameters:

  • Name: Specify a name for your artifact. The name should be unique, descriptive, and easy to remember. You use an artifact's name both to identify the artifact in the W&B App UI and to refer to it when you want to use that artifact.
  • Type: Provide a type. The type should be simple, descriptive, and correspond to a single step of your machine learning pipeline. Common artifact types include 'dataset' or 'model'.

You can optionally provide a description and metadata when you initialize an artifact object. For more information on available attributes and parameters, see wandb.Artifact Class definition in the Python SDK Reference Guide.

The following example demonstrates how to create a dataset artifact:

import wandb

artifact = wandb.Artifact(name="<replace>", type="<replace>")

Replace the string arguments in the preceding code snippet with your own name and type.

2. Add one or more files to the artifact

Add files, directories, external URI references (such as Amazon S3) and more with artifact methods. For example, to add a single text file, use the add_file method:

artifact.add_file(local_path="hello_world.txt", name="optional-name")

You can also add multiple files with the add_dir method. For more information on how to add files, see Update an artifact.

3. Save your artifact to the W&B server

Finally, save your artifact to the W&B server. Artifacts are associated with a run. Therefore, use a run object's log_artifact() method to save the artifact.

# Create a W&B Run. Replace 'job-type'.
run = wandb.init(project="artifacts-example", job_type="job-type")

run.log_artifact(artifact)

You can optionally construct an artifact outside of a W&B run. For more information, see Track external files.

Add files to an artifact

The following sections demonstrate how to construct artifacts with different file types and from parallel runs.

For the following examples, assume you have a project directory with multiple files and a directory structure:

project-directory
|-- images
|   |-- cat.png
|   +-- dog.png
|-- checkpoints
|   +-- model.h5
+-- model.h5

Add a single file

The following code snippet demonstrates how to add a single local file to your artifact:

# Add a single file
artifact.add_file(local_path="path/file.format")

For example, suppose you have a file called file.txt in your local working directory.

artifact.add_file("path/file.txt")  # added as `file.txt`

The artifact now has the following content:

file.txt

Optionally, pass the desired path within the artifact for the name parameter.

artifact.add_file(local_path="path/file.format", name="new/path/file.format")

The file is then stored in the artifact as:

new/path/file.format

The path recorded in the artifact depends on the arguments you pass:

  • artifact.add_file('model.h5') stores model.h5
  • artifact.add_file('checkpoints/model.h5') stores model.h5
  • artifact.add_file('model.h5', name='models/mymodel.h5') stores models/mymodel.h5

Add multiple files

The following code snippet demonstrates how to add an entire local directory to your artifact:

# Recursively add a directory
artifact.add_dir(local_path="path/directory", name="optional-prefix")

The following API calls produce the listed artifact contents:

  • artifact.add_dir('images') stores cat.png and dog.png
  • artifact.add_dir('images', name='images') stores images/cat.png and images/dog.png
  • artifact.new_file('hello.txt') stores hello.txt

Add a URI reference

Artifacts track checksums and other information for reproducibility if the URI has a scheme that the W&B library knows how to handle.

Add an external URI reference to an artifact with the add_reference method. Replace the 'uri' string with your own URI. Optionally pass the desired path within the artifact for the name parameter.

# Add a URI reference
artifact.add_reference(uri="uri", name="optional-name")

Artifacts currently support the following URI schemes:

  • http(s)://: A path to a file accessible over HTTP. The artifact will track checksums in the form of etags and size metadata if the HTTP server supports the ETag and Content-Length response headers.
  • s3://: A path to an object or object prefix in S3. The artifact will track checksums and versioning information (if the bucket has object versioning enabled) for the referenced objects. Object prefixes are expanded to include the objects under the prefix, up to a maximum of 10,000 objects.
  • gs://: A path to an object or object prefix in GCS. The artifact will track checksums and versioning information (if the bucket has object versioning enabled) for the referenced objects. Object prefixes are expanded to include the objects under the prefix, up to a maximum of 10,000 objects.
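The checksums mentioned above are content digests. Conceptually (this is an illustration of the idea, not the exact scheme W&B uses for every provider), a digest changes exactly when the referenced bytes change, which is what lets W&B detect that a new version is needed:

```python
import hashlib

# A digest of the same bytes is stable; a digest of different bytes differs.
original = hashlib.md5(b"image bytes v1").hexdigest()
unchanged = hashlib.md5(b"image bytes v1").hexdigest()
changed = hashlib.md5(b"image bytes v2").hexdigest()

print(original == unchanged)  # True
print(original == changed)    # False
```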

The following API calls produce the listed artifact contents:

  • artifact.add_reference('s3://my-bucket/model.h5') stores model.h5
  • artifact.add_reference('s3://my-bucket/checkpoints/model.h5') stores model.h5
  • artifact.add_reference('s3://my-bucket/model.h5', name='models/mymodel.h5') stores models/mymodel.h5
  • artifact.add_reference('s3://my-bucket/images') stores cat.png and dog.png
  • artifact.add_reference('s3://my-bucket/images', name='images') stores images/cat.png and images/dog.png

Add files to artifacts from parallel runs

For large datasets or distributed training, multiple parallel runs might need to contribute to a single artifact.

import wandb
import time

# We will use ray to launch our runs in parallel
# for demonstration purposes. You can orchestrate
# your parallel runs however you want.
import ray

ray.init()

artifact_type = "dataset"
artifact_name = "parallel-artifact"
table_name = "distributed_table"
parts_path = "parts"
num_parallel = 5

# Each batch of parallel writers should have its own
# unique group name.
group_name = "writer-group-{}".format(round(time.time()))


@ray.remote
def train(i):
    """
    Our writer job. Each writer will add one image to the artifact.
    """
    with wandb.init(group=group_name) as run:
        artifact = wandb.Artifact(name=artifact_name, type=artifact_type)

        # Add data to a wandb table. In this case we use example data
        table = wandb.Table(columns=["a", "b", "c"], data=[[i, i * 2, 2**i]])

        # Add the table to folder in the artifact
        artifact.add(table, "{}/table_{}".format(parts_path, i))

        # Upserting the artifact creates or appends data to the artifact
        run.upsert_artifact(artifact)


# Launch your runs in parallel
result_ids = [train.remote(i) for i in range(num_parallel)]

# Join on all the writers to make sure their files have
# been added before finishing the artifact.
ray.get(result_ids)

# Once all the writers are finished, finish the artifact
# to mark it ready.
with wandb.init(group=group_name) as run:
    artifact = wandb.Artifact(artifact_name, type=artifact_type)

    # Create a PartitionedTable pointing to the folder of tables
    # and add it to the artifact.
    artifact.add(wandb.data_types.PartitionedTable(parts_path), table_name)

    # Finish artifact finalizes the artifact, disallowing future "upserts"
    # to this version.
    run.finish_artifact(artifact)

2 - Download and use artifacts

Download and use Artifacts from multiple projects.

Download and use an artifact that is already stored on the W&B server, or construct an artifact object and pass it in, with de-duplication handled as necessary.

Download and use an artifact stored on W&B

Download and use an artifact stored in W&B either inside or outside of a W&B Run. Use the Public API (wandb.Api) to export (or update data) already saved in W&B. For more information, see the W&B Public API Reference guide.

First, import the W&B Python SDK. Next, create a W&B Run:

import wandb

run = wandb.init(project="<example>", job_type="<job-type>")

Indicate the artifact you want to use with the use_artifact method. This returns an artifact object. The following code snippet specifies an artifact called 'bike-dataset' with the alias 'latest':

artifact = run.use_artifact("bike-dataset:latest")

Use the object returned to download all the contents of the artifact:

datadir = artifact.download()

You can optionally pass a path to the root parameter to download the contents of the artifact to a specific directory. For more information, see the Python SDK Reference Guide.

Use the get_path method to download only a subset of files:

path = artifact.get_path(name)

This fetches only the file at the path name. It returns an Entry object with the following methods:

  • Entry.download: Downloads file from the artifact at path name
  • Entry.ref: If add_reference stored the entry as a reference, returns the URI

References that have schemes that W&B knows how to handle get downloaded just like artifact files. For more information, see Track external files.

To use an artifact outside of a run, first import the W&B SDK. Next, create an artifact object with the Public API class. Provide the entity, project, artifact name, and alias associated with that artifact:

import wandb

api = wandb.Api()
artifact = api.artifact("entity/project/artifact:alias")

Use the object returned to download the contents of the artifact:

artifact.download()

You can optionally pass a path to the root parameter to download the contents of the artifact to a specific directory. For more information, see the API Reference Guide.

Use the wandb artifact get command to download an artifact from the W&B server.

$ wandb artifact get project/artifact:alias --root mnist/

Partially download an artifact

You can optionally download part of an artifact based on a prefix. Using the path_prefix parameter, you can download a single file or the content of a sub-folder.

artifact = run.use_artifact("bike-dataset:latest")

artifact.download(path_prefix="bike.png") # downloads only bike.png

Alternatively, you can download files from a certain directory:

artifact.download(path_prefix="images/bikes/") # downloads files in the images/bikes directory
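Conceptually, path_prefix selects entries whose path inside the artifact starts with the given string. The following is a sketch of those semantics in plain Python, not the W&B implementation; the file names are hypothetical:

```python
# Hypothetical artifact contents, for illustration only.
files = [
    "bike.png",
    "images/bikes/a.png",
    "images/bikes/b.png",
    "images/cars/c.png",
]

prefix = "images/bikes/"
matches = [f for f in files if f.startswith(prefix)]
print(matches)  # ['images/bikes/a.png', 'images/bikes/b.png']
```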

Use an artifact from a different project

Specify the name of the artifact along with its project name to reference an artifact. You can also reference artifacts across entities by specifying the name of the artifact with its entity name.

The following code example demonstrates how to query an artifact from another project as input to the current W&B run.

import wandb

run = wandb.init(project="<example>", job_type="<job-type>")
# Query W&B for an artifact from another project and mark it
# as an input to this run.
artifact = run.use_artifact("my-project/artifact:alias")

# Use an artifact from another entity and mark it as an input
# to this run.
artifact = run.use_artifact("my-entity/my-project/artifact:alias")

Construct and use an artifact simultaneously

Simultaneously construct and use an artifact. Create an artifact object and pass it to use_artifact. This creates an artifact in W&B if it does not exist yet. The use_artifact API is idempotent, so you can call it as many times as you like.

import wandb

run = wandb.init(project="<example>", job_type="<job-type>")

artifact = wandb.Artifact("reference-model", type="model")
artifact.add_file("model.h5")
run.use_artifact(artifact)

For more information about constructing an artifact, see Construct an artifact.

3 - Update an artifact

Update an existing Artifact inside and outside of a W&B Run.

Pass desired values to update the description, metadata, and alias of an artifact. Call the save() method to update the artifact on the W&B servers. You can update an artifact during a W&B Run or outside of a Run.

Use the W&B Public API (wandb.Api) to update an artifact outside of a run. Use the Artifact API (wandb.Artifact) to update an artifact during a run.

The following code example demonstrates how to update the description of an artifact during a W&B Run:

import wandb

run = wandb.init(project="<example>")
artifact = run.use_artifact("<artifact-name>:<alias>")
artifact.description = "<description>"
artifact.save()

The following code example demonstrates how to update the description of an artifact using the wandb.Api API:

import wandb

api = wandb.Api()

artifact = api.artifact("entity/project/artifact:alias")

# Update the description
artifact.description = "My new description"

# Selectively update metadata keys
artifact.metadata["oldKey"] = "new value"

# Replace the metadata entirely
artifact.metadata = {"newKey": "new value"}

# Add an alias
artifact.aliases.append("best")

# Remove an alias
artifact.aliases.remove("latest")

# Completely replace the aliases
artifact.aliases = ["replaced"]

# Persist all artifact modifications
artifact.save()

For more information, see the Weights and Biases Artifact API.

You can also update an Artifact collection in the same way as a singular artifact:

import wandb
run = wandb.init(project="<example>")
api = wandb.Api()
artifact = api.artifact_collection(type="<type-name>", collection="<collection-name>")
artifact.name = "<new-collection-name>"
artifact.description = "<This is where you'd describe the purpose of your collection.>"
artifact.save()

For more information, see the Artifacts Collection reference.

4 - Create an artifact alias

Create custom aliases for W&B Artifacts.

Use aliases as pointers to specific versions. By default, Run.log_artifact adds the latest alias to the logged version.

An artifact version v0 is created and attached to your artifact when you log an artifact for the first time. W&B checksums the contents when you log again to the same artifact. If the artifact changed, W&B saves a new version v1.

For example, if you want your training script to pull the most recent version of a dataset, specify latest when you use that artifact. The following code example demonstrates how to download the most recent version of a dataset artifact named bike-dataset using the latest alias:

import wandb

run = wandb.init(project="<example-project>")

artifact = run.use_artifact("bike-dataset:latest")

artifact.download()

You can also apply a custom alias to an artifact version. For example, if you want to mark a model checkpoint as the best on the AP-50 metric, you could add the string 'best-ap50' as an alias when you log the model artifact.

artifact = wandb.Artifact("run-3nq3ctyy-bike-model", type="model")
artifact.add_file("model.h5")
run.log_artifact(artifact, aliases=["latest", "best-ap50"])

5 - Create an artifact version

Create a new artifact version from a single run or from a distributed process.

Create a new artifact version with a single run or collaboratively with distributed runs. You can optionally create a new artifact version from a previous version, known as an incremental artifact.

Create new artifact versions from scratch

There are two ways to create a new artifact version: from a single run and from distributed runs. They are defined as follows:

  • Single run: A single run provides all the data for a new version. This is the most common case and is best suited when the run fully recreates the needed data. For example: outputting saved models or model predictions in a table for analysis.
  • Distributed runs: A set of runs collectively provides all the data for a new version. This is best suited for distributed jobs which have multiple runs generating data, often in parallel. For example: evaluating a model in a distributed manner, and outputting the predictions.

W&B will create a new artifact and assign it a v0 alias if you pass a name to the wandb.Artifact API that does not exist in your project. W&B checksums the contents when you log again to the same artifact. If the artifact changed, W&B saves a new version v1.

W&B will retrieve an existing artifact if you pass a name and artifact type to the wandb.Artifact API that matches an existing artifact in your project. Logging new contents to the retrieved artifact creates a version greater than v0.

Single run

Log a new version of an artifact with a single run that produces all the files in the artifact.

Based on your use case, use one of the following approaches to create a new artifact version inside or outside of a run:

Create an artifact version within a W&B run:

  1. Create a run with wandb.init. (Line 1)
  2. Create a new artifact or retrieve an existing one with wandb.Artifact. (Line 2)
  3. Add files to the artifact with .add_file. (Line 6)
  4. Log the artifact to the run with .log_artifact. (Line 7)
with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")

    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image1.png")
    run.log_artifact(artifact)

Create an artifact version outside of a W&B run:

  1. Create a new artifact or retrieve an existing one with wandb.Artifact. (Line 1)
  2. Add files to the artifact with .add_file. (Line 4)
  3. Save the artifact with .save. (Line 5)
artifact = wandb.Artifact("artifact_name", "artifact_type")
# Add Files and Assets to the artifact using
# `.add`, `.add_file`, `.add_dir`, and `.add_reference`
artifact.add_file("image1.png")
artifact.save()

Distributed runs

Allow a collection of runs to collaborate on a version before committing it. This is in contrast to single run mode described above where one run provides all the data for a new version.

Consider the following example. Different runs (labelled below as Run 1, Run 2, and Run 3) add a different image file to the same artifact with upsert_artifact.

Run 1:

with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")
    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image1.png")
    run.upsert_artifact(artifact, distributed_id="my_dist_artifact")

Run 2:

with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")
    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image2.png")
    run.upsert_artifact(artifact, distributed_id="my_dist_artifact")

Run 3:

Run 3 must run after Run 1 and Run 2 complete. The run that calls finish_artifact can include files in the artifact, but does not need to.

with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")
    # Add Files and Assets to the artifact
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image3.png")
    run.finish_artifact(artifact, distributed_id="my_dist_artifact")

Create a new artifact version from an existing version

Add, modify, or remove a subset of files from a previous artifact version without the need to re-index the files that didn’t change. Adding, modifying, or removing a subset of files from a previous artifact version creates a new artifact version known as an incremental artifact.

Here are some scenarios for each type of incremental change you might encounter:

  • add: you periodically add a new subset of files to a dataset after collecting a new batch.
  • remove: you discovered several duplicate files and want to remove them from your artifact.
  • update: you corrected annotations for a subset of files and want to replace the old files with the correct ones.

You could create an artifact from scratch to perform the same function as an incremental artifact. However, when you create an artifact from scratch, you will need to have all the contents of your artifact on your local disk. When making an incremental change, you can add, remove, or modify a single file without changing the files from a previous artifact version.

Follow the procedure below to incrementally change an artifact:

  1. Obtain the artifact version you want to perform an incremental change on. Inside a run:

saved_artifact = run.use_artifact("my_artifact:latest")

Or outside of a run, with the Public API:

client = wandb.Api()
saved_artifact = client.artifact("my_artifact:latest")

  2. Create a draft with:

draft_artifact = saved_artifact.new_draft()

  3. Perform any incremental changes you want to see in the next version. You can add, remove, or modify an existing entry.

The following examples show how to perform each of these changes:

Add a file to an existing artifact version with the add_file method:

draft_artifact.add_file("file_to_add.txt")

Remove a file from an existing artifact version with the remove method:

draft_artifact.remove("file_to_remove.txt")

Modify or replace contents by removing the old contents from the draft and adding the new contents back in:

draft_artifact.remove("modified_file.txt")
draft_artifact.add_file("modified_file.txt")
  4. Lastly, log or save your changes. To log the changes inside a W&B run:

run.log_artifact(draft_artifact)

To save them outside of a run:

draft_artifact.save()

Putting it all together, the code examples above look like the following. Inside a W&B run:

with wandb.init(job_type="modify dataset") as run:
    saved_artifact = run.use_artifact(
        "my_artifact:latest"
    )  # fetch the artifact and mark it as input to your run
    draft_artifact = saved_artifact.new_draft()  # create a draft version

    # modify a subset of files in the draft version
    draft_artifact.add_file("file_to_add.txt")
    draft_artifact.remove("dir_to_remove/")
    run.log_artifact(
        draft_artifact
    )  # log your changes to create a new version and mark it as output to your run

Outside of a run, with the Public API:

client = wandb.Api()
saved_artifact = client.artifact("my_artifact:latest")  # load your artifact
draft_artifact = saved_artifact.new_draft()  # create a draft version

# modify a subset of files in the draft version
draft_artifact.remove("deleted_file.txt")
draft_artifact.add_file("modified_file.txt")
draft_artifact.save()  # commit changes to the draft

6 - Track external files

Track files saved outside of W&B, such as in an Amazon S3 bucket, GCS bucket, HTTP file server, or even an NFS share.

Use reference artifacts to track files saved outside the W&B system, for example in an Amazon S3 bucket, GCS bucket, Azure blob, HTTP file server, or even an NFS share. Log artifacts outside of a W&B Run with the W&B CLI.

Log artifacts outside of runs

W&B creates a run when you log an artifact outside of a run. Each artifact belongs to a run, which in turn belongs to a project. An artifact (version) also belongs to a collection, and has a type.

Use the wandb artifact put command to upload an artifact to the W&B server outside of a W&B run. Provide the name of the project you want the artifact to belong to, along with the name of the artifact (project/artifact_name). Optionally provide the type (TYPE). Replace PATH in the code snippet below with the file path of the artifact you want to upload.

$ wandb artifact put --name project/artifact_name --type TYPE PATH

W&B will create a new project if the project you specify does not exist. For information on how to download an artifact, see Download and use artifacts.

Track artifacts outside of W&B

Use W&B Artifacts for dataset versioning and model lineage, and use reference artifacts to track files saved outside the W&B server. In this mode an artifact only stores metadata about the files, such as URLs, size, and checksums. The underlying data never leaves your system. See the Quick start for information on how to save files and directories to W&B servers instead.

The following describes how to construct reference artifacts and how to best incorporate them into your workflows.

Amazon S3 / GCS / Azure Blob Storage References

Use W&B Artifacts for dataset and model versioning to track references in cloud storage buckets. With artifact references, seamlessly layer tracking on top of your buckets with no modifications to your existing storage layout.

Artifacts abstract away the underlying cloud storage vendor (such as AWS, GCP, or Azure). The information described in this section applies uniformly to Amazon S3, Google Cloud Storage, and Azure Blob Storage.

Assume we have a bucket with the following structure:

s3://my-bucket
+-- datasets/
|   +-- mnist/
+-- models/
    +-- cnn/

Under mnist/ we have our dataset, a collection of images. Let's track it with an artifact:

import wandb

run = wandb.init()
artifact = wandb.Artifact("mnist", type="dataset")
artifact.add_reference("s3://my-bucket/datasets/mnist")
run.log_artifact(artifact)

Our new reference artifact mnist:latest looks and behaves similarly to a regular artifact. The only difference is that the artifact only consists of metadata about the S3/GCS/Azure object such as its ETag, size, and version ID (if object versioning is enabled on the bucket).

W&B will use the default mechanism to look for credentials based on the cloud provider you use. Read the documentation from your cloud provider to learn more about the credentials used:

  • AWS: Boto3 documentation
  • GCP: Google Cloud documentation
  • Azure: Azure documentation

For AWS, if the bucket is not located in the configured user’s default region, you must set the AWS_REGION environment variable to match the bucket region.

Interact with this artifact similarly to a normal artifact. In the App UI, you can look through the contents of the reference artifact using the file browser, explore the full dependency graph, and scan through the versioned history of your artifact.

Download a reference artifact

import wandb

run = wandb.init()
artifact = run.use_artifact("mnist:latest", type="dataset")
artifact_dir = artifact.download()

W&B will use the metadata recorded when the artifact was logged to retrieve the files from the underlying bucket when it downloads a reference artifact. If your bucket has object versioning enabled, W&B will retrieve the object version corresponding to the state of the file at the time an artifact was logged. This means that as you evolve the contents of your bucket, you can still point to the exact iteration of your data a given model was trained on since the artifact serves as a snapshot of your bucket at the time of training.

Tying it together

The following code example demonstrates a simple workflow you can use to track a dataset in Amazon S3, GCS, or Azure that feeds into a training job:

import wandb

run = wandb.init()

artifact = wandb.Artifact("mnist", type="dataset")
artifact.add_reference("s3://my-bucket/datasets/mnist")

# Track the artifact and mark it as an input to
# this run in one swoop. A new artifact version
# is only logged if the files in the bucket changed.
run.use_artifact(artifact)

artifact_dir = artifact.download()

# Perform training here...

To track models, we can log the model artifact after the training script uploads the model files to the bucket:

import boto3
import wandb

run = wandb.init()

# Training here...

s3_client = boto3.client("s3")
s3_client.upload_file("my_model.h5", "my-bucket", "models/cnn/my_model.h5")

model_artifact = wandb.Artifact("cnn", type="model")
model_artifact.add_reference("s3://my-bucket/models/cnn/")
run.log_artifact(model_artifact)

Filesystem References

Another common pattern for fast access to datasets is to expose an NFS mount point to a remote filesystem on all machines running training jobs. This can be an even simpler solution than a cloud storage bucket because from the perspective of the training script, the files look just like they are sitting on your local filesystem. Luckily, that ease of use extends into using Artifacts to track references to file systems, whether they are mounted or not.

Assume we have a filesystem mounted at /mount with the following structure:

mount
+-- datasets/
|   +-- mnist/
+-- models/
    +-- cnn/

Under mnist/ we have our dataset, a collection of images. Let’s track it with an artifact:

import wandb

run = wandb.init()
artifact = wandb.Artifact("mnist", type="dataset")
artifact.add_reference("file:///mount/datasets/mnist/")
run.log_artifact(artifact)

By default, W&B imposes a 10,000 file limit when adding a reference to a directory. You can adjust this limit by specifying max_objects= in calls to add_reference.

Note the triple slash in the URL. The first component is the file:// prefix that denotes the use of filesystem references. The second is the path to our dataset, /mount/datasets/mnist/.
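You can see this structure with Python's standard URL parser (a conceptual illustration):

```python
from urllib.parse import urlparse

parsed = urlparse("file:///mount/datasets/mnist/")
print(parsed.scheme)  # file
print(parsed.path)    # /mount/datasets/mnist/
```

The empty segment between the second and third slash is the (absent) host, which is why a local absolute path produces three slashes in a row.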

The resulting artifact mnist:latest looks and acts just like a regular artifact. The only difference is that the artifact only consists of metadata about the files, such as their sizes and MD5 checksums. The files themselves never leave your system.

You can interact with this artifact just as you would a normal artifact. In the UI, you can browse the contents of the reference artifact using the file browser, explore the full dependency graph, and scan through the versioned history of your artifact. However, the UI will not be able to render rich media such as images, audio, etc. as the data itself is not contained within the artifact.

Downloading a reference artifact is simple:

import wandb

run = wandb.init()
artifact = run.use_artifact("entity/project/mnist:latest", type="dataset")
artifact_dir = artifact.download()

For filesystem references, a download() operation copies the files from the referenced paths to construct the artifact directory. In the above example, the contents of /mount/datasets/mnist will be copied into the directory artifacts/mnist:v0/. If an artifact contains a reference to a file that was overwritten, then download() will throw an error as the artifact can no longer be reconstructed.

Putting everything together, here’s a simple workflow you can use to track a dataset under a mounted filesystem that feeds into a training job:

import wandb

run = wandb.init()

artifact = wandb.Artifact("mnist", type="dataset")
artifact.add_reference("file:///mount/datasets/mnist/")

# Track the artifact and mark it as an input to
# this run in one swoop. A new artifact version
# is only logged if the files under the directory
# changed.
run.use_artifact(artifact)

artifact_dir = artifact.download()

# Perform training here...

To track models, we can log the model artifact after the training script writes the model files to the mount point:

import wandb

run = wandb.init()

# Training here...

# Write model to disk

model_artifact = wandb.Artifact("cnn", type="model")
model_artifact.add_reference("file:///mount/models/cnn/my_model.h5")
run.log_artifact(model_artifact)

7 - Manage data

7.1 - Delete an artifact

Delete artifacts interactively with the App UI or programmatically with the W&B SDK.

Delete artifacts interactively with the App UI or programmatically with the W&B SDK. When you delete an artifact, W&B marks that artifact as a soft-delete. In other words, the artifact is marked for deletion but files are not immediately deleted from storage.

The contents of the artifact remain in a soft-delete, or pending deletion, state until a regularly run garbage collection process reviews all artifacts marked for deletion. The garbage collection process deletes associated files from storage if the artifact and its associated files are not used by previous or subsequent artifact versions.

The sections in this page describe how to delete specific artifact versions, how to delete an artifact collection, how to delete artifacts with and without aliases, and more. You can schedule when artifacts are deleted from W&B with TTL policies. For more information, see Manage data retention with Artifact TTL policy.

Delete an artifact version

To delete an artifact version:

  1. Select the name of the artifact. This will expand the artifact view and list all the artifact versions associated with that artifact.
  2. From the list of artifacts, select the artifact version you want to delete.
  3. On the right hand side of the workspace, select the kebab dropdown.
  4. Choose Delete.

An artifact version can also be deleted programmatically via the delete() method. See the examples below.

Delete multiple artifact versions with aliases

The following code example demonstrates how to delete artifacts that have aliases associated with them. Provide the entity, project name, and run ID that created the artifacts.

import wandb

api = wandb.Api()
run = api.run("entity/project/run_id")

for artifact in run.logged_artifacts():
    artifact.delete()

Set the delete_aliases parameter to True to delete the artifact even if it has one or more aliases.

import wandb

api = wandb.Api()
run = api.run("entity/project/run_id")

for artifact in run.logged_artifacts():
    # Set delete_aliases=True in order to delete
    # artifacts with one or more aliases
    artifact.delete(delete_aliases=True)

Delete multiple artifact versions with a specific alias

The following code demonstrates how to delete multiple artifact versions that have a specific alias. Provide the entity, project name, and run ID that created the artifacts. Replace the deletion logic with your own:

import wandb

api = wandb.Api()
run = api.run("entity/project_name/run_id")

# Delete artifact versions with alias 'v3' or 'v4'
for artifact_version in run.logged_artifacts():
    # Replace with your own deletion logic.
    if artifact_version.name[-2:] == "v3" or artifact_version.name[-2:] == "v4":
        artifact_version.delete(delete_aliases=True)
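
Note that a slice like artifact_version.name[-2:] only matches single-digit versions. A slightly more robust way to pull the version suffix out of a name like "dataset:v12" (an illustrative helper, not part of the W&B API, assuming names end in ":vN"):

```python
def version_suffix(artifact_name):
    # "dataset:v12" -> "v12"; works for any number of digits
    return artifact_name.rsplit(":", 1)[-1]


print(version_suffix("dataset:v3"))   # v3
print(version_suffix("dataset:v12"))  # v12
```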

Delete all versions of an artifact that do not have an alias

The following code snippet demonstrates how to delete all versions of an artifact that do not have an alias. Provide the name of the project and entity for the project and entity keys in wandb.Api, respectively. Replace the values in <> with your artifact's type and name:

import wandb

# Provide your entity and a project name when you
# use wandb.Api methods.
api = wandb.Api(overrides={"project": "project", "entity": "entity"})

artifact_type, artifact_name = "<artifact-type>", "<artifact-name>"
for v in api.artifact_versions(artifact_type, artifact_name):
    # Clean up versions that don't have an alias such as 'latest'.
    # NOTE: You can put whatever deletion logic you want here.
    if len(v.aliases) == 0:
        v.delete()

Delete an artifact collection

To delete an artifact collection:

  1. Navigate to the artifact collection you want to delete and hover over it.
  2. Select the kebab dropdown next to the artifact collection name.
  3. Choose Delete.

You can also delete an artifact collection programmatically with the delete() method. Provide the name of the project and entity for the project and entity keys in wandb.Api, respectively:

import wandb

# Provide your entity and a project name when you
# use wandb.Api methods.
api = wandb.Api(overrides={"project": "project", "entity": "entity"})
collection = api.artifact_collection(
    "<artifact_type>", "entity/project/artifact_collection_name"
)
collection.delete()

How to enable garbage collection based on how W&B is hosted

Garbage collection is enabled by default if you use W&B’s shared cloud. Based on how you host W&B, you might need to take additional steps to enable garbage collection:

  • Set the GORILLA_ARTIFACT_GC_ENABLED environment variable to true: GORILLA_ARTIFACT_GC_ENABLED=true
  • Enable bucket versioning if you use AWS, GCP, or any other storage provider such as MinIO. If you use Azure, enable soft deletion.

The following table describes how to satisfy requirements to enable garbage collection based on your deployment type.

The X indicates you must satisfy the requirement:

Deployment type                                 Environment variable   Enable versioning
Shared cloud
Shared cloud with secure storage connector                             X
Dedicated cloud
Dedicated cloud with secure storage connector                          X
Customer-managed cloud                          X                      X
Customer-managed on-prem                        X                      X

7.2 - Manage artifact data retention

Time to live policies (TTL)

Schedule when artifacts are deleted from W&B with W&B Artifact time-to-live (TTL) policy. When you delete an artifact, W&B marks that artifact as a soft-delete. In other words, the artifact is marked for deletion but files are not immediately deleted from storage. For more information on how W&B deletes artifacts, see the Delete artifacts page.

Check out this video tutorial to learn how to manage data retention with Artifacts TTL in the W&B App.

Auto-generated Artifacts

Only user-generated artifacts can use TTL policies. Artifacts auto-generated by W&B cannot have TTL policies set for them.

The following Artifact types indicate an auto-generated Artifact:

  • run_table
  • code
  • job
  • Any Artifact type starting with: wandb-*

You can check an Artifact’s type on the W&B platform or programmatically:

import wandb

run = wandb.init(project="<my-project-name>")
artifact = run.use_artifact(artifact_or_name="<my-artifact-name>")
print(artifact.type)

Replace the values enclosed with <> with your own.
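
The rules above can be expressed as a small helper, sketched here for illustration (this is not a W&B API):

```python
AUTO_GENERATED_TYPES = {"run_table", "code", "job"}


def is_auto_generated(artifact_type):
    """Return True if a TTL policy cannot be set for this artifact type."""
    return artifact_type in AUTO_GENERATED_TYPES or artifact_type.startswith("wandb-")


print(is_auto_generated("wandb-history"))  # True
print(is_auto_generated("dataset"))        # False
```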

Define who can edit and set TTL policies

Define who can set and edit TTL policies within a team. You can either grant TTL permissions only to team admins, or you can grant both team admins and team members TTL permissions.

  1. Navigate to your team’s profile page.
  2. Select the Settings tab.
  3. Navigate to the Artifacts time-to-live (TTL) section.
  4. From the TTL permissions dropdown, select who can set and edit TTL policies.
  5. Click on Review and save settings.
  6. Confirm the changes and select Save settings.

Create a TTL policy

Set a TTL policy for an artifact either when you create the artifact or retroactively after the artifact is created.

For all the code snippets below, replace the content wrapped in <> with your information to use the code snippet.

Set a TTL policy when you create an artifact

Use the W&B Python SDK to define a TTL policy when you create an artifact. TTL policies are typically defined in days.

The steps are as follows:

  1. Create an artifact.
  2. Add content to the artifact such as files, a directory, or a reference.
  3. Define a TTL time limit with the datetime.timedelta data type that is part of Python’s standard library.
  4. Log the artifact.

The following code snippet demonstrates how to create an artifact and set a TTL policy.

import wandb
from datetime import timedelta

run = wandb.init(project="<my-project-name>", entity="<my-entity>")
artifact = wandb.Artifact(name="<artifact-name>", type="<type>")
artifact.add_file("<my_file>")

artifact.ttl = timedelta(days=30)  # Set TTL policy
run.log_artifact(artifact)

The preceding code snippet sets the TTL policy for the artifact to 30 days. In other words, W&B deletes the artifact after 30 days.
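
Because the TTL is a timedelta relative to the artifact's creation time, the scheduled deletion date is simply creation time plus TTL. Purely as an illustration:

```python
from datetime import datetime, timedelta

created_at = datetime(2024, 1, 1)   # hypothetical creation time
ttl = timedelta(days=30)

expires_at = created_at + ttl
print(expires_at)  # 2024-01-31 00:00:00
```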

Set or edit a TTL policy after you create an artifact

Use the W&B App UI or the W&B Python SDK to define a TTL policy for an artifact that already exists.

  1. Fetch your artifact.
  2. Pass in a time delta to the artifact’s ttl attribute.
  3. Update the artifact with the save method.

The following code snippet shows how to set a TTL policy for an artifact:

import wandb
from datetime import timedelta

artifact = run.use_artifact("<my-entity/my-project/my-artifact:alias>")
artifact.ttl = timedelta(days=365 * 2)  # Delete in two years
artifact.save()

The preceding code example sets the TTL policy to two years.

  1. Navigate to your W&B project in the W&B App UI.
  2. Select the artifact icon on the left panel.
  3. From the list of artifacts, expand the artifact type.
  4. Select the artifact version you want to edit the TTL policy for.
  5. Click on the Version tab.
  6. From the dropdown, select Edit TTL policy.
  7. Within the modal that appears, select Custom from the TTL policy dropdown.
  8. Within the TTL duration field, set the TTL policy in units of days.
  9. Select the Update TTL button to save your changes.

Set default TTL policies for a team

Set a default TTL policy for your team. Default TTL policies apply to all existing and future artifacts based on their respective creation dates. Artifacts with existing version-level TTL policies are not affected by the team’s default TTL.

  1. Navigate to your team’s profile page.
  2. Select the Settings tab.
  3. Navigate to the Artifacts time-to-live (TTL) section.
  4. Click on the Set team’s default TTL policy.
  5. Within the Duration field, set the TTL policy in units of days.
  6. Click on Review and save settings.
  7. Confirm the changes and then select Save settings.

Set a TTL policy outside of a run

Use the public API to retrieve an artifact without fetching a run, and set the TTL policy. TTL policies are typically defined in days.

The following code sample shows how to fetch an artifact using the public API and set the TTL policy.

import wandb
from datetime import timedelta

api = wandb.Api()

artifact = api.artifact("entity/project/artifact:alias")

artifact.ttl = timedelta(days=365)  # Delete in one year

artifact.save()

Deactivate a TTL policy

Use the W&B Python SDK or W&B App UI to deactivate a TTL policy for a specific artifact version.

  1. Fetch your artifact.
  2. Set the artifact’s ttl attribute to None.
  3. Update the artifact with the save method.

The following code snippet shows how to turn off a TTL policy for an artifact:

artifact = run.use_artifact("<my-entity/my-project/my-artifact:alias>")
artifact.ttl = None
artifact.save()

  1. Navigate to your W&B project in the W&B App UI.
  2. Select the artifact icon on the left panel.
  3. From the list of artifacts, expand the artifact type.
  4. Select the artifact version you want to edit the TTL policy for.
  5. Click on the Version tab.
  6. Click on the meatball UI icon next to the Link to registry button.
  7. From the dropdown, select Edit TTL policy.
  8. Within the modal that appears, select Deactivate from the TTL policy dropdown.
  9. Select the Update TTL button to save your changes.

View TTL policies

View TTL policies for artifacts with the Python SDK or with the W&B App UI.

Use a print statement to view an artifact’s TTL policy. The following example shows how to retrieve an artifact and view its TTL policy:

artifact = run.use_artifact("<my-entity/my-project/my-artifact:alias>")
print(artifact.ttl)

View a TTL policy for an artifact with the W&B App UI.

  1. Navigate to the W&B App at https://wandb.ai.
  2. Go to your W&B Project.
  3. Within your project, select the Artifacts tab in the left sidebar.
  4. Click on a collection.

Within the collection view you can see all of the artifacts in the selected collection. Within the Time to Live column you will see the TTL policy assigned to that artifact.

7.3 - Manage artifact storage and memory allocation

Manage storage, memory allocation of W&B Artifacts.

W&B stores artifact files in a private Google Cloud Storage bucket located in the United States by default. All files are encrypted at rest and in transit.

For sensitive files, we recommend you set up Private Hosting or use reference artifacts.

During training, W&B locally saves logs, artifacts, and configuration files in the following local directories:

File       Default location    To change the default location, set:
logs       ./wandb             dir in wandb.init, or the WANDB_DIR environment variable
artifacts  ~/.cache/wandb      the WANDB_CACHE_DIR environment variable
configs    ~/.config/wandb     the WANDB_CONFIG_DIR environment variable

Clean up local artifact cache

W&B caches artifact files to speed up downloads across versions that share files in common. Over time this cache directory can become large. Run the wandb artifact cache cleanup command to prune the cache and to remove any files that have not been used recently.

The following code snippet demonstrates how to limit the size of the cache to 1GB. Copy and paste the code snippet into your terminal:

$ wandb artifact cache cleanup 1GB
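
Conceptually, the cleanup evicts the least-recently-used files until the cache fits under the target size. A standard-library sketch of that policy on an in-memory file list (not W&B's actual implementation):

```python
def prune_cache(files, limit_bytes):
    """files: list of (name, size_bytes, last_used_timestamp) tuples.
    Return the names to delete so the remaining total fits under
    limit_bytes, evicting least-recently-used files first."""
    total = sum(size for _, size, _ in files)
    to_delete = []
    for name, size, _ in sorted(files, key=lambda f: f[2]):  # oldest first
        if total <= limit_bytes:
            break
        to_delete.append(name)
        total -= size
    return to_delete


files = [("a", 600, 1), ("b", 300, 3), ("c", 400, 2)]
print(prune_cache(files, 1000))  # ['a'] -- 1300 bytes total, drop the oldest 600
```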

8 - Explore artifact graphs

Traverse automatically created directed acyclic W&B Artifact graphs.

W&B automatically tracks the artifacts a given run logged as well as the artifacts a given run uses. These artifacts can include datasets, models, evaluation results, or more. You can explore an artifact’s lineage to track and manage the various artifacts produced throughout the machine learning lifecycle.

Lineage

Tracking an artifact’s lineage has several key benefits:

  • Reproducibility: By tracking the lineage of all artifacts, teams can reproduce experiments, models, and results, which is essential for debugging, experimentation, and validating machine learning models.

  • Version Control: Artifact lineage involves versioning artifacts and tracking their changes over time. This allows teams to roll back to previous versions of data or models if needed.

  • Auditing: Having a detailed history of the artifacts and their transformations enables organizations to comply with regulatory and governance requirements.

  • Collaboration and Knowledge Sharing: Artifact lineage facilitates better collaboration among team members by providing a clear record of attempts as well as what worked, and what didn’t. This helps in avoiding duplication of efforts and accelerates the development process.

Finding an artifact’s lineage

When selecting an artifact in the Artifacts tab, you can see your artifact’s lineage. This graph view shows a general overview of your pipeline.

To view an artifact graph:

  1. Navigate to your project in the W&B App UI
  2. Choose the artifact icon on the left panel.
  3. Select Lineage.

Getting to the Lineage tab

The artifact or job type you provide appears in front of its name, with artifacts represented by blue icons and runs represented by green icons. Arrows detail the input and output of a run or artifact on the graph.

Run and artifact nodes Inputs and outputs

For a more detailed view, click any individual artifact or run to get more information on a particular object.

Previewing a run

Artifact clusters

When a level of the graph has five or more runs or artifacts, it creates a cluster. A cluster has a search bar to find specific versions of runs or artifacts, and lets you pull an individual node out of the cluster to continue investigating that node's lineage.

Clicking on a node opens a preview with an overview of the node. Clicking on the arrow extracts the individual run or artifact so you can examine the lineage of the extracted node.

Searching a run cluster

Use the API to track lineage

You can also navigate a graph using the W&B API.

Create an artifact. First, create a run with wandb.init. Then, create a new artifact or retrieve an existing one with wandb.Artifact. Next, add files to the artifact with .add_file. Finally, log the artifact to the run with .log_artifact. The finished code looks something like this:

with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")

    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image1.png")
    run.log_artifact(artifact)

Use the artifact object’s logged_by and used_by methods to walk the graph from the artifact:

# Walk up and down the graph from an artifact:
producer_run = artifact.logged_by()
consumer_runs = artifact.used_by()
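
Starting from those two calls, you can walk the whole graph by alternating between runs and artifacts. Here is a hedged sketch of that traversal over a generic adjacency mapping, where the dict stands in for repeated logged_by/used_by lookups and the node names are hypothetical:

```python
from collections import deque


def walk_lineage(graph, start):
    """Breadth-first traversal of a lineage graph.
    graph maps each node to the nodes it produced or consumed."""
    seen, order = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


lineage = {
    "raw-data:v0": ["preprocess-run"],
    "preprocess-run": ["clean-data:v0"],
    "clean-data:v0": ["train-run"],
    "train-run": ["model:v0"],
}
print(walk_lineage(lineage, "raw-data:v0"))
```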

9 - Artifact data privacy and compliance

Learn where W&B files are stored by default. Explore how to save, store sensitive information.

Files are uploaded to a Google Cloud bucket managed by W&B when you log artifacts. The contents of the bucket are encrypted both at rest and in transit. Artifact files are only visible to users who have access to the corresponding project.

GCS W&B Client Server diagram

When you delete a version of an artifact, it is marked for soft deletion in our database and removed from your storage cost. When you delete an entire artifact, it is queued for permanent deletion and all of its contents are removed from the W&B bucket. If you have specific needs around file deletion, please reach out to Customer Support.

For sensitive datasets that cannot reside in a multi-tenant environment, you can use either a private W&B server connected to your cloud bucket or reference artifacts. Reference artifacts track references to private buckets without sending file contents to W&B. Reference artifacts maintain links to files on your buckets or servers. In other words, W&B only keeps track of the metadata associated with the files and not the files themselves.

W&B Client Server Cloud diagram

Create a reference artifact the same way you create a non-reference artifact:

import wandb

run = wandb.init()
artifact = wandb.Artifact("animals", type="dataset")
artifact.add_reference("s3://my-bucket/animals")
run.log_artifact(artifact)

For alternatives, contact us at contact@wandb.com to talk about private cloud and on-premises installations.

10 - Tutorial: Create, track, and use a dataset artifact

Artifacts quickstart shows how to create, track, and use a dataset artifact with W&B.

This walkthrough demonstrates how to create, track, and use a dataset artifact from W&B Runs.

1. Log into W&B

Import the W&B library and log in to W&B. You will need to sign up for a free W&B account if you have not done so already.

import wandb

wandb.login()

2. Initialize a run

Use the wandb.init() API to generate a background process to sync and log data as a W&B Run. Provide a project name and a job type:

# Create a W&B Run. Here we specify 'dataset' as the job type since this example
# shows how to create a dataset artifact.
run = wandb.init(project="artifacts-example", job_type="upload-dataset")

3. Create an artifact object

Create an artifact object with the wandb.Artifact() API. Provide a name for the artifact and a description of the file type for the name and type parameters, respectively.

For example, the following code snippet demonstrates how to create an artifact called ‘bicycle-dataset’ with a ‘dataset’ label:

artifact = wandb.Artifact(name="bicycle-dataset", type="dataset")

For more information about how to construct an artifact, see Construct artifacts.

Add the dataset to the artifact

Add a file to the artifact. Common file types include models and datasets. The following example adds a dataset named dataset.h5 that is saved locally on our machine to the artifact:

# Add a file to the artifact's contents
artifact.add_file(local_path="dataset.h5")

Replace the filename dataset.h5 in the preceding code snippet with the path to the file you want to add to the artifact.

4. Log the dataset

Use the W&B run object's log_artifact() method to both save your artifact version and declare the artifact as an output of the run.

# Save the artifact version to W&B and mark it
# as the output of this run
run.log_artifact(artifact)

A 'latest' alias is created by default when you log an artifact. For more information about artifact aliases and versions, see Create a custom alias and Create new artifact versions, respectively.

5. Download and use the artifact

The following code example demonstrates the steps you can take to use an artifact you have logged and saved to the W&B servers.

  1. First, initialize a new run object with wandb.init().
  2. Second, use the run object's use_artifact() method to tell W&B what artifact to use. This returns an artifact object.
  3. Third, use the artifact's download() method to download the contents of the artifact.

# Create a W&B Run. Here we specify 'training' for 'type'
# because we will use this run to track training.
run = wandb.init(project="artifacts-example", job_type="training")

# Query W&B for an artifact and mark it as input to this run
artifact = run.use_artifact("bicycle-dataset:latest")

# Download the artifact's contents
artifact_dir = artifact.download()

Alternatively, you can use the Public API (wandb.Api) to export (or update) data already saved to W&B, outside of a Run. See Track external files for more information.