Artifact Creation Modes
Create a new artifact version using a single run, collaboratively using distributed writers, or as a patch against a prior version.
New versions of an Artifact can be created in one of three ways:
  • Simple: A single run provides all the data for a new version. This is the most common case and is best suited for when the run fully recreates the needed data. For example: outputting saved models or model predictions in a table for analysis.
  • Collaborative: A set of runs collectively provides all the data for a new version. This is best suited for distributed jobs which have multiple runs generating data, often in parallel. For example: evaluating a model in a distributed manner, and outputting the predictions.
  • Patch (coming soon): A single run provides a patch of the differences to be applied. This is best suited when a run wants to add data to an artifact without recreating the data that already exists. For example: you have a golden dataset created by a daily web scraper, and you want each run to append new data to the dataset.

Simple Mode

To log a new version of an Artifact with a single run which produces all the files in the artifact, use the Simple Mode:
import wandb

with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")
    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image1.png")
    run.log_artifact(artifact)
This pattern is common enough that Artifact.save() lets you create the version without explicitly starting a run:
import wandb

artifact = wandb.Artifact("artifact_name", "artifact_type")
# Add Files and Assets to the artifact using
# `.add`, `.add_file`, `.add_dir`, and `.add_reference`
artifact.add_file("image1.png")
artifact.save()

Collaborative Mode

To allow a collection of runs to collaborate on a version before committing it, use the Collaborative Mode. There are two key ideas to keep in mind when using Collaborative Mode:
  1. Each Run in the collection needs to be aware of the same, unique ID (called distributed_id) in order to collaborate on the same version. As a default, if present, we will use the run's group as set by wandb.init(group=GROUP) as the distributed_id.
  2. There must be a final run which "commits" the version, permanently locking its state.
Consider the following example. Note that rather than using log_artifact, we use upsert_artifact to add to the collaborative artifact and finish_artifact to finalize the commit:
Run 1:
import wandb

with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")
    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image1.png")
    run.upsert_artifact(artifact, distributed_id="my_dist_artifact")
Run 2:
import wandb

with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")
    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image2.png")
    run.upsert_artifact(artifact, distributed_id="my_dist_artifact")
Run 3: Must run after Run 1 and Run 2 complete. The run that calls finish_artifact may include files in the artifact, but does not need to.
import wandb

with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")
    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image3.png")
    run.finish_artifact(artifact, distributed_id="my_dist_artifact")
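The upsert-then-finish lifecycle in this section can be pictured with a small in-memory model. This is only a conceptual sketch, not how wandb implements it: the names ToyArtifactStore, CollaborativeVersion, and resolve_id are hypothetical, and raising an error on a late upsert is an assumption made here to make the "locked" state visible.

```python
class CollaborativeVersion:
    """Toy stand-in for one draft artifact version (hypothetical, not wandb)."""

    def __init__(self):
        self.files = set()
        self.committed = False


class ToyArtifactStore:
    """Toy in-memory store keyed by distributed_id (hypothetical, not wandb)."""

    def __init__(self):
        self._drafts = {}

    @staticmethod
    def resolve_id(distributed_id=None, group=None):
        # Mirrors the documented default: fall back to the run's group.
        key = distributed_id or group
        if key is None:
            raise ValueError("need a distributed_id or a run group")
        return key

    def upsert(self, files, distributed_id=None, group=None):
        key = self.resolve_id(distributed_id, group)
        draft = self._drafts.setdefault(key, CollaborativeVersion())
        if draft.committed:
            # Assumption for illustration: a committed version rejects writes.
            raise RuntimeError("version {!r} is already committed".format(key))
        draft.files.update(files)
        return draft

    def finish(self, files=(), distributed_id=None, group=None):
        # The finishing run may contribute files too, then locks the version.
        draft = self.upsert(files, distributed_id, group)
        draft.committed = True
        return draft


store = ToyArtifactStore()
store.upsert(["image1.png"], distributed_id="my_dist_artifact")  # Run 1
store.upsert(["image2.png"], distributed_id="my_dist_artifact")  # Run 2
version = store.finish(["image3.png"], distributed_id="my_dist_artifact")  # Run 3
print(sorted(version.files))
```

The point of the model is that every writer must resolve to the same key, and that only the finishing call changes the version from a draft to a committed state.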

Patch Mode (Coming Soon)

To create a new version of an Artifact by modifying a previous version, use Patch Mode. A code snippet for Patch Mode will be provided once available for use.

FAQ

How do I log a Table in Collaborative Mode?

For large datasets, multiple runs might need to contribute to a single Table. These runs can execute in parallel or sequentially. You can use the following pattern to construct such artifacts. The key idea is that each worker run puts its own Table in a directory in the artifact. Then, the final worker adds a PartitionedTable to the artifact which points to the folder of "parts". The example below shows how to do this in a parallel setting. The core functions do not need to change for a non-parallel setting; a comment marks the one line to change to construct the artifact sequentially.
import wandb
import time

# We will use ray to launch our runs in parallel
# for demonstration purposes. You can orchestrate
# your parallel runs however you want.
import ray

ray.init()

artifact_type = "dataset"
artifact_name = "parallel-artifact"
table_name = "distributed_table"
parts_path = "parts"
num_parallel = 5

# Each batch of parallel writers should have its own
# unique group name.
group_name = "writer-group-{}".format(round(time.time()))

@ray.remote
def train(i):
    """
    Our writer job. Each writer will add one table to the artifact.
    """
    with wandb.init(group=group_name) as run:
        artifact = wandb.Artifact(name=artifact_name, type=artifact_type)

        # Add data to a wandb table. In this case we use example data
        table = wandb.Table(columns=["a", "b", "c"], data=[[i, i*2, 2**i]])

        # Add the table to a folder in the artifact
        artifact.add(table, "{}/table_{}".format(parts_path, i))

        # Upserting the artifact creates or appends data to the artifact
        run.upsert_artifact(artifact)

# Launch your runs in parallel
result_ids = [train.remote(i) for i in range(num_parallel)]
# OR Launch your runs sequentially. Remember to remove the ray components if you do so
# _ = [train(i) for i in range(num_parallel)]

# Join on all the writers to make sure their files have
# been added before finishing the artifact.
ray.get(result_ids)

# Once all the writers are done writing, finish the artifact
# to mark it ready.
with wandb.init(group=group_name) as run:
    artifact = wandb.Artifact(artifact_name, type=artifact_type)

    # Create a "PartitionedTable" pointing to the folder of tables
    # and add it to the artifact.
    artifact.add(wandb.data_types.PartitionedTable(parts_path), table_name)

    # Finish artifact finalizes the artifact, disallowing future "upserts"
    # to this version.
    run.finish_artifact(artifact)
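To make the layout from the example above concrete, the sketch below models what "a table that points to a folder of parts" means: each writer leaves one small table under parts/, and the logical table is the concatenation of their rows. This is a plain-Python illustration only, not wandb code. The function combine_parts is hypothetical, the entry names are simplified to the names passed to artifact.add, and reading the parts in sorted order is an assumption.

```python
def combine_parts(entries, parts_path):
    """Concatenate the rows of every part stored under parts_path.

    entries maps an entry name to a (columns, rows) pair; this dict is a
    hypothetical stand-in for the artifact's contents.
    """
    prefix = parts_path.rstrip("/") + "/"
    # Assumption: parts are read in sorted name order.
    part_names = sorted(name for name in entries if name.startswith(prefix))
    if not part_names:
        raise ValueError("no parts found under {!r}".format(parts_path))

    columns = entries[part_names[0]][0]
    rows = []
    for name in part_names:
        part_columns, part_rows = entries[name]
        if part_columns != columns:
            raise ValueError("part {!r} has different columns".format(name))
        rows.extend(part_rows)
    return columns, rows


# Same example data as the writers above: one row per writer.
entries = {
    "parts/table_{}".format(i): (["a", "b", "c"], [[i, i * 2, 2 ** i]])
    for i in range(5)
}
columns, rows = combine_parts(entries, "parts")
print(columns)
for row in rows:
    print(row)
```

Because every part shares the same columns, writers never need to coordinate beyond agreeing on the folder name and the schema.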