Artifact Creation Modes
Create a new artifact version using a single run, collaboratively using distributed writers, or as a patch against a prior version.
New versions of an Artifact can be created in one of three ways:
    Simple: A single run provides all the data for a new version. This is the most common case and is best suited for when the run fully recreates the needed data. For example: outputting saved models or model predictions in a table for analysis.
    Collaborative: A set of runs collectively provides all the data for a new version. This is best suited for distributed jobs which have multiple runs generating data, often in parallel. For example: evaluating a model in a distributed manner, and outputting the predictions.
    Patch: (coming soon) A single run provides a patch of the differences to be applied. This is best suited to a run that needs to add data to an artifact without recreating all the existing data. For example: you have a golden dataset which is created by running a daily web scraper, and you want each run to append new data to the dataset.

Simple Mode

To log a new version of an Artifact with a single run which produces all the files in the artifact, use the Simple Mode:
    import wandb

    with wandb.init() as run:
        artifact = wandb.Artifact("artifact_name", "artifact_type")
        # Add files and assets to the artifact using
        # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
        artifact.add_file("image1.png")
        run.log_artifact(artifact)
This pattern is common enough that Artifact.save() lets you create the version without explicitly starting a run:
    import wandb

    artifact = wandb.Artifact("artifact_name", "artifact_type")
    # Add files and assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image1.png")
    artifact.save()

Collaborative Mode

To allow a collection of runs to collaborate on a version before committing it, use the Collaborative Mode. There are two key ideas to keep in mind when using Collaborative Mode:
    1. Each Run in the collection needs to be aware of the same unique ID (called distributed_id) in order to collaborate on the same version. By default, if the run has a group set by wandb.init(group=GROUP), that group is used as the distributed_id.
    2. There must be a final run which "commits" the version, permanently locking its state.
Consider the following example. Note that instead of log_artifact, the writer runs call upsert_artifact to add to the collaborative artifact, and the final run calls finish_artifact to commit it:
Run 1:
    import wandb

    with wandb.init() as run:
        artifact = wandb.Artifact("artifact_name", "artifact_type")
        # Add files and assets to the artifact using
        # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
        artifact.add_file("image1.png")
        # Contribute to the pending version shared under this distributed_id
        run.upsert_artifact(artifact, distributed_id="my_dist_artifact")
Run 2:
    import wandb

    with wandb.init() as run:
        artifact = wandb.Artifact("artifact_name", "artifact_type")
        # Add files and assets to the artifact using
        # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
        artifact.add_file("image2.png")
        # Same distributed_id as Run 1, so both runs write to the same version
        run.upsert_artifact(artifact, distributed_id="my_dist_artifact")
Run 3: Must run after Run 1 and Run 2 complete. The run which calls finish_artifact may include files in the artifact, but does not need to.
    import wandb

    with wandb.init() as run:
        artifact = wandb.Artifact("artifact_name", "artifact_type")
        # Add files and assets to the artifact using
        # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
        artifact.add_file("image3.png")
        # Commit the version; no further upserts to it are allowed
        run.finish_artifact(artifact, distributed_id="my_dist_artifact")
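To make the two key ideas concrete, here is a minimal in-memory sketch of the upsert-then-finish life cycle. It is a toy model for illustration only, not the W&B client or server: `ToyArtifactStore`, `VersionLockedError`, and the choice to raise an exception on a late upsert are all assumptions made for this example. It only mirrors the rules stated above: writers that share a distributed_id (or, failing that, a run group) contribute to the same pending version, and the finishing call locks it.

```python
class VersionLockedError(RuntimeError):
    """Hypothetical error: a writer targeted an already committed version."""


class ToyArtifactStore:
    """Toy stand-in that tracks pending and committed artifact versions."""

    def __init__(self):
        self._pending = {}    # distributed_id -> set of file names
        self._committed = {}  # distributed_id -> frozenset of file names

    @staticmethod
    def _resolve_id(distributed_id, group):
        # An explicit distributed_id wins; otherwise fall back to the group.
        key = distributed_id or group
        if key is None:
            # Assumption: the toy refuses to guess when neither is given.
            raise ValueError("need a distributed_id or a run group")
        return key

    def upsert(self, files, distributed_id=None, group=None):
        """Create the pending version if needed, then add files to it."""
        key = self._resolve_id(distributed_id, group)
        if key in self._committed:
            raise VersionLockedError(f"version {key!r} is already committed")
        self._pending.setdefault(key, set()).update(files)

    def finish(self, files=(), distributed_id=None, group=None):
        """Merge any final files, then lock the version permanently."""
        key = self._resolve_id(distributed_id, group)
        if key in self._committed:
            raise VersionLockedError(f"version {key!r} is already committed")
        merged = self._pending.pop(key, set()) | set(files)
        self._committed[key] = frozenset(merged)
        return self._committed[key]


store = ToyArtifactStore()
store.upsert(["image1.png"], distributed_id="my_dist_artifact")  # Run 1
store.upsert(["image2.png"], distributed_id="my_dist_artifact")  # Run 2
committed = store.finish(["image3.png"], distributed_id="my_dist_artifact")  # Run 3
print(sorted(committed))
```

In this sketch the order of Run 1 and Run 2 does not matter, but any upsert that arrives after `finish` is rejected, which is why the finishing run must wait for every writer to complete.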

Patch Mode (Coming Soon)

To create a new version of an Artifact by modifying a previous version, use Patch Mode. A code snippet for Patch Mode will be provided once available for use.

FAQ

How do I log a Table in Collaborative Mode?

For large datasets, multiple parallel runs might need to contribute to a single Table. You can use the following pattern to construct such parallel artifacts. The key idea is that each worker puts its own Table in a directory in the artifact. Then, the final worker adds a PartitionedTable to the artifact which points to the folder of "parts".
    import wandb
    import time

    # We will use ray to launch our runs in parallel
    # for demonstration purposes. You can orchestrate
    # your parallel runs however you want.
    import ray

    ray.init()

    artifact_type = "dataset"
    artifact_name = "parallel-artifact"
    table_name = "distributed_table"
    parts_path = "parts"
    num_parallel = 5

    # Each batch of parallel writers should have its own
    # unique group name.
    group_name = "writer-group-{}".format(round(time.time()))

    @ray.remote
    def train(i):
        """
        Our writer job. Each writer will add one table to the artifact.
        """
        with wandb.init(group=group_name) as run:
            artifact = wandb.Artifact(name=artifact_name, type=artifact_type)

            # Add data to a wandb table. In this case we use example data
            table = wandb.Table(columns=["a", "b", "c"], data=[[i, i * 2, 2**i]])

            # Add the table to a folder in the artifact
            artifact.add(table, "{}/table_{}".format(parts_path, i))

            # Upserting the artifact creates or appends data to the artifact.
            # No distributed_id is passed, so the run's group is used.
            run.upsert_artifact(artifact)

    # Launch your runs in parallel
    result_ids = [train.remote(i) for i in range(num_parallel)]

    # Join on all the writers to make sure their files have
    # been added before finishing the artifact.
    ray.get(result_ids)

    # Once all the writers are done writing, finish the artifact
    # to mark it ready.
    with wandb.init(group=group_name) as run:
        artifact = wandb.Artifact(artifact_name, type=artifact_type)

        # Create a "PartitionedTable" pointing to the folder of tables
        # and add it to the artifact.
        artifact.add(wandb.data_types.PartitionedTable(parts_path), table_name)

        # Finish artifact finalizes the artifact, disallowing future "upserts"
        # to this version.
        run.finish_artifact(artifact)
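The "folder of parts" layout can also be pictured without W&B or ray. The sketch below is a standard-library analogy, not a description of how PartitionedTable is implemented: the helper names `write_part` and `read_partitioned`, the JSON file format, and the use of threads as writers are all assumptions chosen for illustration. Each writer stores its own small table under a shared `parts` directory, and a final step, run only after every writer has finished, presents the directory as one logical table.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

PARTS_DIR = "parts"  # plays the role of parts_path in the example above
NUM_WRITERS = 5


def write_part(root: Path, i: int) -> None:
    """Writer job: store one single-row table as its own part file."""
    part = {"columns": ["a", "b", "c"], "data": [[i, i * 2, 2**i]]}
    (root / PARTS_DIR / f"table_{i}.json").write_text(json.dumps(part))


def read_partitioned(root: Path) -> dict:
    """Final step: view every part file in the folder as one table."""
    columns, rows = None, []
    for path in sorted((root / PARTS_DIR).glob("table_*.json")):
        part = json.loads(path.read_text())
        if columns is None:
            columns = part["columns"]
        elif part["columns"] != columns:
            raise ValueError(f"{path.name} has mismatched columns")
        rows.extend(part["data"])
    return {"columns": columns, "data": rows}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / PARTS_DIR).mkdir()

    # Run the writers in parallel; leaving the `with` block joins them all.
    with ThreadPoolExecutor(max_workers=NUM_WRITERS) as pool:
        list(pool.map(lambda i: write_part(root, i), range(NUM_WRITERS)))

    # Only combine after every writer has finished.
    combined = read_partitioned(root)

print(combined["data"])
```

Because every writer owns a distinct file name, the writers never contend for the same file, which is the property that makes the pattern safe to run in parallel.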
Last modified 6mo ago