Artifacts Walkthrough
Use W&B Artifacts for dataset tracking and model versioning. Initialize a run, create an artifact, and then use it in another part of your workflow. You can use artifacts to track and save files, or track external URIs.
This feature is available in the client starting from wandb version 0.9.0.
Check out our video tutorial on artifacts! It comes with an interactive Colab notebook.

1. Initialize a run

To track a step of your pipeline, initialize a run in your script. Specify a string for job_type to distinguish pipeline steps such as preprocessing, training, and evaluation. If you've never instrumented a run with W&B, see the more detailed guidance on experiment tracking in our Python Library docs.
import wandb
run = wandb.init(job_type='train')

2. Create an artifact

An artifact is like a folder of data, with contents that are actual files stored in the artifact or references to external URIs. To create an artifact, log it as the output of a run. Specify a string for type to distinguish kinds of artifacts, such as dataset, model, or result. Give this artifact a name, like bike-dataset, to help you remember what is inside the artifact. In a later step of your pipeline, you can use this name along with a version like bike-dataset:v1 to download this artifact.
When you call log_artifact, we check whether the contents of the artifact have changed, and if so we automatically create a new version of the artifact: v0, v1, v2, and so on.
wandb.Artifact()
    type (str): Differentiate kinds of artifacts, used for organizational purposes. We recommend things like "dataset", "model" and "result".
    name (str): Give your artifact a unique name, used when you reference the artifact elsewhere. You can use numbers, letters, underscores, hyphens, and dots in the name.
    description (str, optional): Free text displayed next to the artifact version in the UI
    metadata (dict, optional): Structured data associated with the artifact, for example class distribution of a dataset. As we build out the web interface, you'll be able to use this data to query and make plots.
artifact = wandb.Artifact('bike-dataset', type='dataset')

# Add a file to the artifact's contents
artifact.add_file('bicycle-data.h5')

# Save the artifact version to W&B and mark it as the output of this run
run.log_artifact(artifact)
NOTE: Calls to log_artifact are performed asynchronously for performant uploads. This can cause surprising behavior when logging artifacts in a loop. For example:
for i in range(10):
    a = wandb.Artifact('race', type='dataset', metadata={
        "index": i,
    })
    # ... add files to artifact a ...
    run.log_artifact(a)
The artifact version v0 is NOT guaranteed to have an index of 0 in its metadata, as the artifacts may be logged in an arbitrary order.
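The ordering caveat can be illustrated without W&B at all. The sketch below is a standard-library simulation, not the wandb client: log_all and fake_upload are hypothetical names, and version numbers are handed out in completion order to mimic an asynchronous uploader. Waiting for each upload before starting the next one restores the expected order.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def log_all(n, wait_each):
    """Simulate logging n artifacts; return {version number: metadata index}."""
    versions = {}
    lock = threading.Lock()

    def fake_upload(index):
        # Earlier artifacts take longer here, so they tend to finish later.
        time.sleep(0.01 * (n - index))
        with lock:
            # The version number is assigned when the upload completes.
            versions[len(versions)] = index

    with ThreadPoolExecutor(max_workers=n) as pool:
        for i in range(n):
            future = pool.submit(fake_upload, i)
            if wait_each:
                future.result()  # block until this upload is done
    return versions


unordered = log_all(5, wait_each=False)  # any permutation is possible
ordered = log_all(5, wait_each=True)     # version i always holds index i
```

In real code the equivalent fix is to wait for each artifact to finish logging before creating the next one. Recent wandb releases appear to offer a wait() method on a logged artifact for this; check the API reference for your installed version before relying on it.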

3. Use an artifact

You can use an artifact as input to a run. For example, we could take bike-dataset:v0, the first version of bike-dataset, and use it in the next script in our pipeline. When you call use_artifact, your script queries W&B to find that named artifact and marks it as input to the run.
# Query W&B for an artifact and mark it as input to this run
artifact = run.use_artifact('bike-dataset:v0')

# Download the artifact's contents
artifact_dir = artifact.download()

Using an artifact from a different project

You can freely reference artifacts from any project to which you have access by qualifying the name of the artifact with its project name. You can also reference artifacts across entities by further qualifying the name of the artifact with its entity name.
# Query W&B for an artifact from another project and mark it
# as an input to this run.
artifact = run.use_artifact('my-project/bike-model:v0')

# Use an artifact from another entity and mark it as an input
# to this run.
artifact = run.use_artifact('my-entity/my-project/bike-model:v0')
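To make the naming scheme concrete, here is a minimal sketch that splits a qualified name of the form [entity/][project/]name[:alias] into its parts. parse_artifact_name is a hypothetical helper written for illustration; it is not part of wandb and does not claim to match how the client parses names internally.

```python
def parse_artifact_name(qualified, default_entity=None, default_project=None):
    """Split '[entity/][project/]name[:alias]' into (entity, project, name, alias)."""
    path, _, alias = qualified.partition(':')
    parts = path.split('/')
    if len(parts) > 3 or not all(parts):
        raise ValueError(f'invalid artifact name: {qualified!r}')
    name = parts[-1]
    # Missing qualifiers fall back to the caller's defaults.
    project = parts[-2] if len(parts) >= 2 else default_project
    entity = parts[-3] if len(parts) == 3 else default_entity
    return entity, project, name, alias or None


same_project = parse_artifact_name('bike-dataset:v0')
other_project = parse_artifact_name('my-project/bike-model:v0')
other_entity = parse_artifact_name('my-entity/my-project/bike-model:v0')
```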

Using an artifact that has not been logged

You can also construct an artifact object and pass it to use_artifact. We check whether the artifact already exists in W&B, and if not, we create a new artifact. This is idempotent: you can pass an artifact to use_artifact as many times as you like, and we'll deduplicate it as long as the contents stay the same.
artifact = wandb.Artifact('bike-model', type='model')
artifact.add_file('model.h5')
run.use_artifact(artifact)

Versions and aliases

When you log an artifact for the first time, we create version v0. When you log again to the same artifact, we checksum the contents, and if the artifact has changed we save a new version v1.
You can use aliases as pointers to specific versions. By default, run.log_artifact adds the latest alias to the logged version.
You can fetch an artifact using an alias. For example, if you want your training script to always pull the most recent version of a dataset, specify latest when you use that artifact.
artifact = run.use_artifact('bike-dataset:latest')
You can also apply a custom alias to an artifact version. For example, if you want to mark which model checkpoint is the best on the metric AP-50, you could add the string best-ap50 as an alias when you log the model artifact.
artifact = wandb.Artifact('run-3nq3ctyy-bike-model', type='model')
artifact.add_file('model.h5')
run.log_artifact(artifact, aliases=['latest','best-ap50'])
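The interplay of checksums, versions, and aliases can be pictured with a toy in-memory model. This is a sketch for intuition only, not how W&B stores artifacts: ArtifactHistory is a hypothetical class, SHA-256 stands in for whatever checksum the service uses, and new contents are compared only against the most recent version.

```python
import hashlib


class ArtifactHistory:
    """Toy model of one artifact's versions and aliases."""

    def __init__(self):
        self.digests = []  # digests[i] is the content digest of version v{i}
        self.aliases = {}  # alias -> version index

    def log(self, contents, aliases=('latest',)):
        digest = hashlib.sha256(contents).hexdigest()
        # Assumption: only the most recent version is checked for a match.
        if not self.digests or self.digests[-1] != digest:
            self.digests.append(digest)  # contents changed: new version
        index = len(self.digests) - 1
        for alias in aliases:
            self.aliases[alias] = index  # each alias points at one version
        return f'v{index}'

    def resolve(self, alias):
        return f'v{self.aliases[alias]}'


history = ArtifactHistory()
first = history.log(b'weights-a')    # 'v0'
repeat = history.log(b'weights-a')   # unchanged contents, still 'v0'
second = history.log(b'weights-b', aliases=['latest', 'best-ap50'])  # 'v1'
```

After these calls both latest and best-ap50 resolve to v1. Logging a third, different payload with the default alias would move latest to v2 and leave best-ap50 on v1.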

Constructing artifacts

An artifact is like a folder of data. Each entry is either an actual file stored in the artifact, or a reference to an external URI. You can nest folders inside an artifact just like a regular filesystem. Construct new artifacts by initializing the wandb.Artifact() class.
You can pass the following fields to an Artifact() constructor, or set them directly on an artifact object:
    type: Freeform string, like ‘dataset’, ‘model’, or ‘result’
    description: Freeform text that will be displayed in the UI.
    metadata: A dictionary that can contain any structured data. You’ll be able to use this data for querying and making plots. E.g. you may choose to store the class distribution for a dataset artifact as metadata.
artifact = wandb.Artifact('bike-dataset', type='dataset')
Use name to specify an optional file name, or a file path prefix if you're adding a directory.
# Add a single file
artifact.add_file(path, name='optional-name')

# Recursively add a directory
artifact.add_dir(path, name='optional-prefix')

# Return a writeable file-like object, stored as <name> in the artifact
with artifact.new_file(name) as f:
    ...  # Write contents into the file

# Add a URI reference
artifact.add_reference(uri, name='optional-name')

Adding files and directories

For the following examples, assume we have a project directory with these files:
project-directory
|-- images
|   |-- cat.png
|   +-- dog.png
|-- checkpoints
|   +-- model.h5
+-- model.h5
API call                                                 Resulting artifact contents
-------------------------------------------------------  ---------------------------
artifact.add_file('model.h5')                            model.h5
artifact.add_file('checkpoints/model.h5')                model.h5
artifact.add_file('model.h5', name='models/mymodel.h5')  models/mymodel.h5
artifact.add_dir('images')                               cat.png, dog.png
artifact.add_dir('images', name='images')                images/cat.png, images/dog.png
artifact.new_file('hello.txt')                           hello.txt
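The naming rules in the table can be reproduced with a few lines of standard-library Python. The helpers below, file_entry_name and dir_entry_names, are hypothetical and only mirror the mapping shown above; they do not call wandb.

```python
import os
import tempfile


def file_entry_name(local_path, name=None):
    """Entry name for a single file: the override, else the file's base name."""
    return name if name else os.path.basename(local_path)


def dir_entry_names(local_dir, name=None):
    """Entry names for every file under local_dir, with an optional prefix."""
    names = []
    for root, _dirs, files in os.walk(local_dir):
        for filename in files:
            rel = os.path.relpath(os.path.join(root, filename), local_dir)
            rel = rel.replace(os.sep, '/')  # entry names use forward slashes
            names.append(f'{name}/{rel}' if name else rel)
    return sorted(names)


# Recreate the example 'images' folder in a temporary directory.
with tempfile.TemporaryDirectory() as project:
    images = os.path.join(project, 'images')
    os.makedirs(images)
    for filename in ('cat.png', 'dog.png'):
        open(os.path.join(images, filename), 'wb').close()
    without_prefix = dir_entry_names(images)
    with_prefix = dir_entry_names(images, name='images')
```

Note that a file's own parent folders are dropped unless you pass name, which is why both model.h5 and checkpoints/model.h5 end up as model.h5.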

Adding references

artifact.add_reference(uri, name=None, checksum=True)
    uri (string): The reference URI to track.
    name (string): An optional name override. If not provided, a name is inferred from uri.
    checksum (bool): If true, the reference collects checksum information and metadata from uri for validation purposes.
You can add references to external URIs to artifacts, instead of actual files. If a URI has a scheme that wandb knows how to handle, the artifact will track checksums and other information for reproducibility. Artifacts currently support the following URI schemes:
    http(s)://: A path to a file accessible over HTTP. The artifact will track checksums in the form of etags and size metadata if the HTTP server supports the ETag and Content-Length response headers.
    s3://: A path to an object or object prefix in S3. The artifact will track checksums and versioning information (if the bucket has object versioning enabled) for the referenced objects. Object prefixes are expanded to include the objects under the prefix, up to a maximum of 10,000 objects.
    gs://: A path to an object or object prefix in GCS. The artifact will track checksums and versioning information (if the bucket has object versioning enabled) for the referenced objects. Object prefixes are expanded to include the objects under the prefix, up to a maximum of 10,000 objects.
For the following examples, assume we have an S3 bucket with these files:
s3://my-bucket
|-- images
|   |-- cat.png
|   +-- dog.png
|-- checkpoints
|   +-- model.h5
+-- model.h5
API call                                                                     Resulting artifact contents
---------------------------------------------------------------------------  ---------------------------
artifact.add_reference('s3://my-bucket/model.h5')                            model.h5
artifact.add_reference('s3://my-bucket/checkpoints/model.h5')                model.h5
artifact.add_reference('s3://my-bucket/model.h5', name='models/mymodel.h5')  models/mymodel.h5
artifact.add_reference('s3://my-bucket/images')                              cat.png, dog.png
artifact.add_reference('s3://my-bucket/images', name='images')               images/cat.png, images/dog.png
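The same mapping can be sketched for references. The helper below is hypothetical and works on a plain set of object keys rather than a real bucket. It assumes that expansion simply stops at the 10,000-object limit; the text above does not say whether wandb truncates or raises at that point.

```python
import posixpath
from urllib.parse import urlparse

MAX_OBJECTS = 10_000  # limit stated above for prefix expansion


def reference_entry_names(uri, bucket_keys, name=None, max_objects=MAX_OBJECTS):
    """Entry names for a reference to one object or to an object prefix."""
    key = urlparse(uri).path.lstrip('/')
    if key in bucket_keys:
        # A single object: use the override, else the last path segment.
        return [name or posixpath.basename(key)]
    prefix = key.rstrip('/') + '/'
    # Assumption: keys beyond the limit are silently dropped.
    matches = sorted(k for k in bucket_keys if k.startswith(prefix))[:max_objects]
    relative = [k[len(prefix):] for k in matches]
    return [posixpath.join(name, r) if name else r for r in relative]


keys = {'images/cat.png', 'images/dog.png', 'checkpoints/model.h5', 'model.h5'}
single = reference_entry_names('s3://my-bucket/checkpoints/model.h5', keys)
expanded = reference_entry_names('s3://my-bucket/images', keys, name='images')
```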

Adding files from parallel runs

For large datasets or distributed training, multiple parallel runs might need to contribute to a single artifact. You can use the following pattern to construct such parallel artifacts:
import wandb
import time

# We will use ray to launch our runs in parallel
# for demonstration purposes. You can orchestrate
# your parallel runs however you want.
import ray

ray.init()

artifact_type = "dataset"
artifact_name = "parallel-artifact"
table_name = "distributed_table"
parts_path = "parts"
num_parallel = 5

# Each batch of parallel writers should have its own
# unique group name.
group_name = "writer-group-{}".format(round(time.time()))

@ray.remote
def train(i):
    """
    Our writer job. Each writer will add one table to the artifact.
    """
    with wandb.init(group=group_name) as run:
        artifact = wandb.Artifact(name=artifact_name, type=artifact_type)

        # Add data to a wandb table. In this case we use example data
        table = wandb.Table(columns=["a", "b", "c"], data=[[i, i*2, 2**i]])

        # Add the table to a folder in the artifact
        artifact.add(table, "{}/table_{}".format(parts_path, i))

        # Upserting the artifact creates or appends data to the artifact
        run.upsert_artifact(artifact)

# Launch your runs in parallel
result_ids = [train.remote(i) for i in range(num_parallel)]

# Join on all the writers to make sure their files have
# been added before finishing the artifact.
ray.get(result_ids)

# Once all the writers are finished, finish the artifact
# to mark it ready.
with wandb.init(group=group_name) as run:
    artifact = wandb.Artifact(artifact_name, type=artifact_type)

    # Create a "PartitionedTable" pointing to the folder of tables
    # and add it to the artifact.
    artifact.add(wandb.data_types.PartitionedTable(parts_path), table_name)

    # Finish artifact finalizes the artifact, disallowing future "upserts"
    # to this version.
    run.finish_artifact(artifact)
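The upsert and finish semantics used above can be summarized with a small stand-in that needs neither ray nor wandb. DraftVersion is a hypothetical class: it models only the rule that writers may keep adding entries until the version is finished, after which further upserts are rejected.

```python
class DraftVersion:
    """Toy model of an artifact version assembled by several writers."""

    def __init__(self):
        self.entries = {}
        self.finished = False

    def upsert(self, entries):
        if self.finished:
            raise RuntimeError('version is finished; upserts are not allowed')
        # Create the version on first use, then add to what is already there.
        self.entries.update(entries)

    def finish(self, entries):
        self.upsert(entries)
        self.finished = True  # seal the version


draft = DraftVersion()
for i in range(5):  # stands in for the five parallel writers
    draft.upsert({f'parts/table_{i}': [i, i * 2, 2 ** i]})
draft.finish({'distributed_table': 'parts'})
```

This is why the script joins on all writers before calling finish_artifact: in this model, a writer that arrives after the version is sealed can no longer contribute to it.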

Using and downloading artifacts

There are two patterns for using artifacts. You can use an artifact name that is explicitly stored in W&B, or you can construct an artifact object and pass it in to be deduplicated as necessary.

Use an artifact stored in W&B

To get started, first indicate which artifact you'd like to use with use_artifact.
artifact = run.use_artifact('bike-dataset:latest')
Then, if you want to use most or all of the files in the artifact, call the .download method.
datadir = artifact.download(root=None)
This will download all of the artifact’s contents that aren't currently present. The returned datadir is a path to a directory containing the artifact’s contents. You can explicitly specify the download destination by setting root.
If you're only interested in a subset of files, use the get_path method.
path = artifact.get_path(name)
This fetches only the file at the path name. It returns an Entry object with the following methods:
    Entry.download: Downloads file from the artifact at path name
    Entry.ref: If the entry was stored as a reference using add_reference, returns the URI
References that have schemes that W&B knows how to handle can be downloaded just like artifact files. The consumer API is the same.
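The idea of downloading only the contents that aren't currently present can be sketched as a checksum comparison against the local directory. This is an illustration under assumptions, not the wandb download logic: sync, manifest, and fetch are hypothetical names, and SHA-256 stands in for the real checksum.

```python
import hashlib
import os
import tempfile


def sync(manifest, root, fetch):
    """Fetch only entries that are missing from root or whose digest differs.

    manifest maps entry name -> expected SHA-256 hex digest.
    fetch(name) returns the entry's bytes. Returns the names fetched.
    """
    fetched = []
    for name, digest in manifest.items():
        path = os.path.join(root, *name.split('/'))
        if os.path.exists(path):
            with open(path, 'rb') as f:
                if hashlib.sha256(f.read()).hexdigest() == digest:
                    continue  # already present and intact
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'wb') as f:
            f.write(fetch(name))
        fetched.append(name)
    return fetched


remote = {'images/cat.png': b'cat', 'images/dog.png': b'dog'}
manifest = {n: hashlib.sha256(b).hexdigest() for n, b in remote.items()}

with tempfile.TemporaryDirectory() as root:
    first = sync(manifest, root, remote.__getitem__)   # fetches both files
    second = sync(manifest, root, remote.__getitem__)  # fetches nothing
```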

Construct and use an artifact

You can also construct an artifact object and pass it to use_artifact. This will create the artifact in W&B if it doesn’t exist yet. This is idempotent, so you can do it as many times as you like. The artifact will only be created once, as long as the contents of model.h5 remain the same.
# A type is required, and names may not contain spaces
artifact = wandb.Artifact('reference-model', type='model')
artifact.add_file('model.h5')
run.use_artifact(artifact)

Download an artifact outside of a run

api = wandb.Api()
artifact = api.artifact('entity/project/artifact:alias')
artifact.download()

Updating artifacts

You can update the description, metadata, and aliases of an artifact by just setting them to the desired values and then calling save().
api = wandb.Api()
artifact = api.artifact('bike-dataset:latest')

# Update the description
artifact.description = "My new description"

# Selectively update metadata keys
artifact.metadata["oldKey"] = "new value"

# Replace the metadata entirely
artifact.metadata = {"newKey": "new value"}

# Add an alias
artifact.aliases.append('best')

# Remove an alias
artifact.aliases.remove('latest')

# Completely replace the aliases
artifact.aliases = ['replaced']

# Persist all artifact modifications
artifact.save()

Traversing the artifact graph

W&B automatically tracks the artifacts a given run has logged as well as the artifacts a given run has used. You can walk this graph by using the following APIs:
api = wandb.Api()

artifact = api.artifact('data:v0')

# Walk up and down the graph from an artifact:
producer_run = artifact.logged_by()
consumer_runs = artifact.used_by()

# Walk up and down the graph from a run.
# The path below is a placeholder; substitute your own entity, project, and run ID.
run = api.run('entity/project/run_id')
logged_artifacts = run.logged_artifacts()
used_artifacts = run.used_artifacts()
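Walking the graph usually means repeating these calls until there is nothing left upstream. The sketch below shows that traversal on plain dictionaries so it can run without a W&B account. The lineage records and the upstream function are made up for illustration; with the real API you would replace the dictionary lookups with logged_by() and used_artifacts().

```python
from collections import deque

# Hypothetical lineage records: which run logged each artifact,
# and which artifacts each run used as input.
logged_by = {'raw:v0': 'ingest', 'clean:v0': 'preprocess', 'model:v0': 'train'}
used = {'ingest': [], 'preprocess': ['raw:v0'], 'train': ['clean:v0']}


def upstream(artifact):
    """Return every run and artifact that `artifact` depends on, nearest first."""
    seen, order, queue = {artifact}, [], deque([artifact])
    while queue:
        run = logged_by.get(queue.popleft())
        if run is None or run in seen:
            continue
        seen.add(run)
        order.append(run)
        for parent in used.get(run, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


lineage = upstream('model:v0')
```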

Cleaning up unused versions

As an artifact evolves over time, you might end up with a large number of versions that clutter the UI. This is especially true if you are using artifacts for model checkpoints, where only the most recent version (the version tagged latest) of your artifact is useful. W&B makes it easy to clean up these unneeded versions:
api = wandb.Api()

artifact_type, artifact_name = ... # fill in the desired type + name
for version in api.artifact_versions(artifact_type, artifact_name):
    # Clean up all versions that don't have an alias such as 'latest'.
    if len(version.aliases) == 0:
        version.delete()
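Because deletion cannot easily be undone, it can help to preview what would be removed first. The sketch below wraps the same alias check in a hypothetical delete_unaliased helper with a dry_run flag. It relies only on the aliases attribute and delete() method used in the loop above, and FakeVersion is a stand-in so the example runs without W&B.

```python
def delete_unaliased(versions, dry_run=True):
    """Select versions with no aliases; delete them unless dry_run is set."""
    selected = [v for v in versions if len(v.aliases) == 0]
    if not dry_run:
        for version in selected:
            version.delete()
    return selected


class FakeVersion:
    """Stand-in exposing the two members the cleanup loop relies on."""

    def __init__(self, name, aliases):
        self.name = name
        self.aliases = aliases
        self.deleted = False

    def delete(self):
        self.deleted = True


fakes = [
    FakeVersion('v0', []),
    FakeVersion('v1', ['best']),
    FakeVersion('v2', ['latest']),
]
preview = delete_unaliased(fakes)                 # nothing is deleted yet
removed = delete_unaliased(fakes, dry_run=False)  # v0 is deleted
```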

Data privacy

Artifacts use secure API-level access control. Files are encrypted at rest and in transit. Artifacts can also track references to private buckets without sending file contents to W&B. For alternatives, contact us to talk about private cloud and on-prem installations.

Explore the Graph

To navigate from the graph tab on an artifact, click "Explode" to see all the individual instances of each job type and artifact type. Then click a node to open that run or artifact in a new tab. Try it yourself on this example Graph page.