Use W&B Artifacts to track and version data as the inputs and outputs of your W&B Runs. For example, a model training run might take in a dataset as input and produce a trained model as output. You can log hyperparameters, metadata, and metrics to a run, and you can use artifacts to log, track, and version the dataset used to train the model as input and the resulting model checkpoints as output.
Use cases
You can use artifacts throughout your entire ML workflow as inputs and outputs of runs. You can use datasets, models, or even other artifacts as inputs for processing.
Add one or more files, such as a model file or dataset, to your artifact object.
Log your artifact to W&B.
For example, the following code snippet shows how to log a file called dataset.h5 to an artifact called example_artifact:
import wandb
run = wandb.init(project="artifacts-example", job_type="add-dataset")
artifact = wandb.Artifact(name="example_artifact", type="dataset")
artifact.add_file(local_path="./dataset.h5", name="training_dataset")
artifact.save()

# Logs the artifact version as a dataset with data from dataset.h5
See the track external files page for information on how to add references to files or directories stored in external object storage, like an Amazon S3 bucket.
Download an artifact
Indicate the artifact you want to mark as input to your run with the use_artifact method.
Continuing from the preceding code snippet, this next code block shows how to use the example_artifact artifact:
artifact = run.use_artifact("example_artifact:latest")
This returns an artifact object.
Next, use the returned object to download all contents of the artifact:
datadir = artifact.download()  # downloads the full artifact to the default directory
You can pass a custom path to the root parameter to download an artifact to a specific directory. For alternate ways to download artifacts and to see additional parameters, see the guide on downloading and using artifacts.
1. Create an artifact Python object with wandb.Artifact()
Initialize the wandb.Artifact() class to create an artifact object. Specify the following parameters:
Name: Specify a name for your artifact. The name should be unique, descriptive, and easy to remember. Use an artifact's name to identify it in the W&B App UI and to reference it when you want to use that artifact.
Type: Provide a type. The type should be simple, descriptive and correspond to a single step of your machine learning pipeline. Common artifact types include 'dataset' or 'model'.
The name and type you provide are used to create a directed acyclic graph. This means you can view the lineage of an artifact on the W&B App.
Artifacts can not have the same name, even if you specify a different type for the type parameter. In other words, you can not create an artifact named cats of type dataset and another artifact with the same name of type model.
You can optionally provide a description and metadata when you initialize an artifact object. For more information on available attributes and parameters, see wandb.Artifact Class definition in the Python SDK Reference Guide.
The following example demonstrates how to create a dataset artifact:
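import wandb

artifact = wandb.Artifact(name="<replace>", type="<replace>")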
Replace the string arguments in the preceding code snippet with your own name and type.
2. Add one or more files to the artifact
Add files, directories, external URI references (such as Amazon S3) and more with artifact methods. For example, to add a single text file, use the add_file method:
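artifact.add_file(local_path="hello_world.txt")  # hypothetical local file name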
You can also add multiple files with the add_dir method. For more information on how to add files, see Update an artifact.
3. Save your artifact to the W&B server
Finally, save your artifact to the W&B server. Artifacts are associated with a run. Therefore, use a run object's log_artifact() method to save the artifact.
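For example, a minimal sketch that assumes the run and artifact objects from the snippets above:
run.log_artifact(artifact)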
You can optionally construct an artifact outside of a W&B run. For more information, see Track external files.
Calls to log_artifact are performed asynchronously for performant uploads. This can cause surprising behavior when logging artifacts in a loop. For example:
for i in range(10):
    a = wandb.Artifact(
        "race",
        type="dataset",
        metadata={
            "index": i,
        },
    )
    # ... add files to artifact a ...
    run.log_artifact(a)
The artifact version v0 is NOT guaranteed to have an index of 0 in its metadata, as the artifacts may be logged in an arbitrary order.
Add files to an artifact
The following sections demonstrate how to construct artifacts with different file types and from parallel runs.
For the following examples, assume you have a project directory with multiple files and a directory structure:
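A hypothetical layout, consistent with the examples below:

project-directory/
|-- images/
|   |-- cat.png
|   |-- dog.png
|-- hello.txt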
The following code snippet demonstrates how to add an entire, local directory to your artifact:
# Recursively add a directory
artifact.add_dir(local_path="path/file.format", name="optional-prefix")
The following API calls produce the following artifact content:
| API call | Resulting artifact |
|---|---|
| artifact.add_dir('images') | cat.png, dog.png |
| artifact.add_dir('images', name='images') | images/cat.png, images/dog.png |
| artifact.new_file('hello.txt') | hello.txt |
Add a URI reference
Artifacts track checksums and other information for reproducibility if the URI has a scheme that the W&B library knows how to handle.
Add an external URI reference to an artifact with the add_reference method. Replace the 'uri' string with your own URI. Optionally pass the desired path within the artifact for the name parameter.
# Add a URI reference
artifact.add_reference(uri="uri", name="optional-name")
Artifacts currently support the following URI schemes:
http(s)://: A path to a file accessible over HTTP. The artifact will track checksums in the form of etags and size metadata if the HTTP server supports the ETag and Content-Length response headers.
s3://: A path to an object or object prefix in S3. The artifact will track checksums and versioning information (if the bucket has object versioning enabled) for the referenced objects. Object prefixes are expanded to include the objects under the prefix, up to a maximum of 10,000 objects.
gs://: A path to an object or object prefix in GCS. The artifact will track checksums and versioning information (if the bucket has object versioning enabled) for the referenced objects. Object prefixes are expanded to include the objects under the prefix, up to a maximum of 10,000 objects.
The following API calls produce the following artifact contents:
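For illustration, assuming a hypothetical bucket s3://my-bucket containing model.h5 and the images/ folder from the earlier example:

| API call | Resulting artifact |
|---|---|
| artifact.add_reference('s3://my-bucket/model.h5') | model.h5 |
| artifact.add_reference('s3://my-bucket/model.h5', name='models/mymodel.h5') | models/mymodel.h5 |
| artifact.add_reference('s3://my-bucket/images') | cat.png, dog.png |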
For large datasets or distributed training, multiple parallel runs might need to contribute to a single artifact.
import wandb
import time
# We will use ray to launch our runs in parallel
# for demonstration purposes. You can orchestrate
# your parallel runs however you want.
import ray

ray.init()

artifact_type = "dataset"
artifact_name = "parallel-artifact"
table_name = "distributed_table"
parts_path = "parts"
num_parallel = 5

# Each batch of parallel writers should have its own
# unique group name.
group_name = "writer-group-{}".format(round(time.time()))

@ray.remote
def train(i):
    """
    Our writer job. Each writer will add one image to the artifact.
    """
    with wandb.init(group=group_name) as run:
        artifact = wandb.Artifact(name=artifact_name, type=artifact_type)

        # Add data to a wandb table. In this case we use example data
        table = wandb.Table(columns=["a", "b", "c"], data=[[i, i * 2, 2**i]])

        # Add the table to a folder in the artifact
        artifact.add(table, "{}/table_{}".format(parts_path, i))

        # Upserting the artifact creates or appends data to the artifact
        run.upsert_artifact(artifact)

# Launch your runs in parallel
result_ids = [train.remote(i) for i in range(num_parallel)]

# Join on all the writers to make sure their files have
# been added before finishing the artifact.
ray.get(result_ids)

# Once all the writers are finished, finish the artifact
# to mark it ready.
with wandb.init(group=group_name) as run:
    artifact = wandb.Artifact(artifact_name, type=artifact_type)

    # Create a "PartitionedTable" pointing to the folder of tables
    # and add it to the artifact.
    artifact.add(wandb.data_types.PartitionedTable(parts_path), table_name)

    # finish_artifact finalizes the artifact, disallowing future "upserts"
    # to this version.
    run.finish_artifact(artifact)
2 - Download and use artifacts
Download and use Artifacts from multiple projects.
Download and use an artifact that is already stored on the W&B server, or construct an artifact object and pass it in for deduplication as necessary.
Team members with view-only seats cannot download artifacts.
Download and use an artifact stored on W&B
Download and use an artifact stored in W&B either inside or outside of a W&B Run. Use the Public API (wandb.Api) to export (or update) data already saved in W&B. For more information, see the W&B Public API Reference guide.
First, import the W&B Python SDK. Next, create a W&B Run:
import wandb
run = wandb.init(project="<example>", job_type="<job-type>")
Indicate the artifact you want to use with the use_artifact method. This returns an artifact object. The following code snippet specifies an artifact called 'bike-dataset' with the alias 'latest':
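artifact = run.use_artifact("bike-dataset:latest")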
Use the object returned to download all the contents of the artifact:
datadir = artifact.download()
You can optionally pass a path to the root parameter to download the contents of the artifact to a specific directory. For more information, see the Python SDK Reference Guide.
Use the get_path method to download only a subset of files:
path = artifact.get_path(name)
This fetches only the file at the path name. It returns an Entry object with the following methods:
Entry.download: Downloads the file from the artifact at the path name
Entry.ref: If add_reference stored the entry as a reference, returns the URI
References that have schemes that W&B knows how to handle get downloaded just like artifact files. For more information, see Track external files.
First, import the W&B SDK. Next, create an artifact from the Public API Class. Provide the entity, project, artifact, and alias associated with that artifact:
import wandb
api = wandb.Api()
artifact = api.artifact("entity/project/artifact:alias")
Use the object returned to download the contents of the artifact:
artifact.download()
You can optionally pass a path to the root parameter to download the contents of the artifact to a specific directory. For more information, see the API Reference Guide.
Use the wandb artifact get command to download an artifact from the W&B server.
$ wandb artifact get project/artifact:alias --root mnist/
Partially download an artifact
You can optionally download part of an artifact based on a prefix. Using the path_prefix parameter, you can download a single file or the content of a sub-folder.
artifact = run.use_artifact("bike-dataset:latest")
artifact.download(path_prefix="bike.png") # downloads only bike.png
Alternatively, you can download files from a certain directory:
artifact.download(path_prefix="images/bikes/") # downloads files in the images/bikes directory
Use an artifact from a different project
Specify the name of the artifact along with its project name to reference an artifact. You can also reference artifacts across entities by specifying the artifact name with its entity name.
The following code example demonstrates how to query an artifact from another project as input to the current W&B run.
import wandb
run = wandb.init(project="<example>", job_type="<job-type>")
# Query W&B for an artifact from another project and mark it
# as an input to this run.
artifact = run.use_artifact("my-project/artifact:alias")

# Use an artifact from another entity and mark it as an input
# to this run.
artifact = run.use_artifact("my-entity/my-project/artifact:alias")
Construct and use an artifact simultaneously
Simultaneously construct and use an artifact. Create an artifact object and pass it to use_artifact. This creates an artifact in W&B if it does not exist yet. The use_artifact API is idempotent, so you can call it as many times as you like.
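A minimal sketch, with hypothetical names:

import wandb

run = wandb.init(project="<example>", job_type="<job-type>")

# Creates "new-artifact" if it does not exist yet; otherwise
# retrieves the existing artifact and marks it as an input.
artifact = wandb.Artifact("new-artifact", type="dataset")
artifact.add_file("path/to/file.txt")
run.use_artifact(artifact)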
3 - Update an artifact
Update an existing Artifact inside and outside of a W&B Run.
Pass desired values to update the description, metadata, and alias of an artifact. Call the save() method to update the artifact on the W&B servers. You can update an artifact during a W&B Run or outside of a Run.
Use the W&B Public API (wandb.Api) to update an artifact outside of a run. Use the Artifact API (wandb.Artifact) to update an artifact during a run.
You can not update the alias of an artifact linked to a model in the Model Registry.
The following code example demonstrates how to update the description of an artifact using the wandb.Artifact API:
import wandb
run = wandb.init(project="<example>")
artifact = run.use_artifact("<artifact-name>:<alias>")
artifact.description = "<description>"
artifact.save()
The following code example demonstrates how to update the description of an artifact using the wandb.Api API:
import wandb
api = wandb.Api()
artifact = api.artifact("entity/project/artifact:alias")
# Update the description
artifact.description = "My new description"

# Selectively update metadata keys
artifact.metadata["oldKey"] = "new value"

# Replace the metadata entirely
artifact.metadata = {"newKey": "new value"}

# Add an alias
artifact.aliases.append("best")

# Remove an alias
artifact.aliases.remove("latest")

# Completely replace the aliases
artifact.aliases = ["replaced"]

# Persist all artifact modifications
artifact.save()
For more information, see the Weights and Biases Artifact API.
You can also update an Artifact collection in the same way as a singular artifact:
import wandb
run = wandb.init(project="<example>")
api = wandb.Api()
artifact = api.artifact_collection(type="<type-name>", collection="<collection-name>")
artifact.name = "<new-collection-name>"
artifact.description = "<This is where you'd describe the purpose of your collection.>"
artifact.save()
4 - Create a custom alias
Use aliases as pointers to specific versions. By default, Run.log_artifact adds the latest alias to the logged version.
An artifact version v0 is created and attached to your artifact when you log an artifact for the first time. W&B checksums the contents when you log again to the same artifact. If the artifact changed, W&B saves a new version v1.
For example, if you want your training script to pull the most recent version of a dataset, specify latest when you use that artifact. The following code example demonstrates how to download a recent dataset artifact named bike-dataset that has the alias latest:
import wandb
run = wandb.init(project="<example-project>")
artifact = run.use_artifact("bike-dataset:latest")
artifact.download()
You can also apply a custom alias to an artifact version. For example, if you want to mark which model checkpoint is the best on the metric AP-50, you could add the string 'best-ap50' as an alias when you log the model artifact.
5 - Create an artifact version
Create a new artifact version from a single run or from a distributed process.
Create a new artifact version with a single run or collaboratively with distributed runs. You can optionally create a new artifact version from a previous version, known as an incremental artifact.
We recommend that you create an incremental artifact when you need to apply changes to a small subset of files in an otherwise large artifact.
Create new artifact versions from scratch
There are two ways to create a new artifact version: from a single run and from distributed runs. They are defined as follows:
Single run: A single run provides all the data for a new version. This is the most common case and is best suited when the run fully recreates the needed data. For example: outputting saved models or model predictions in a table for analysis.
Distributed runs: A set of runs collectively provides all the data for a new version. This is best suited for distributed jobs which have multiple runs generating data, often in parallel. For example: evaluating a model in a distributed manner, and outputting the predictions.
W&B creates a new artifact and assigns it a v0 alias if you pass a name to the wandb.Artifact API that does not exist in your project. W&B checksums the contents when you log again to the same artifact. If the artifact changed, W&B saves a new version v1.
W&B retrieves an existing artifact if you pass a name and artifact type to the wandb.Artifact API that matches an existing artifact in your project. The retrieved artifact will have a version greater than v0.
Single run
Log a new version of an artifact with a single run that produces all the files in the artifact.
Based on your use case, select one of the tabs below to create a new artifact version inside or outside of a run:
Create an artifact version within a W&B run:
Create a run with wandb.init.
Create a new artifact or retrieve an existing one with wandb.Artifact.
Add files to the artifact with .add_file.
Log the artifact to the run with .log_artifact.
with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")

    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image1.png")
    run.log_artifact(artifact)
Create an artifact version outside of a W&B run:
Create a new artifact or retrieve an existing one with wandb.Artifact.
Add files to the artifact with .add_file.
Save the artifact with .save.
artifact = wandb.Artifact("artifact_name", "artifact_type")

# Add Files and Assets to the artifact using
# `.add`, `.add_file`, `.add_dir`, and `.add_reference`
artifact.add_file("image1.png")
artifact.save()
Distributed runs
Allow a collection of runs to collaborate on a version before committing it. This is in contrast to single run mode described above where one run provides all the data for a new version.
Each run in the collection needs to be aware of the same unique ID (called distributed_id) in order to collaborate on the same version. By default, if present, W&B uses the run’s group as set by wandb.init(group=GROUP) as the distributed_id.
There must be a final run that “commits” the version, permanently locking its state.
Use upsert_artifact to add to the collaborative artifact and finish_artifact to finalize the commit.
Consider the following example. Different runs (labelled below as Run 1, Run 2, and Run 3) add a different image file to the same artifact with upsert_artifact.
Run 1:
with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")

    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image1.png")
    run.upsert_artifact(artifact, distributed_id="my_dist_artifact")
Run 2:
with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")

    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image2.png")
    run.upsert_artifact(artifact, distributed_id="my_dist_artifact")
Run 3:
Must run after Run 1 and Run 2 complete. The Run that calls finish_artifact can include files in the artifact, but does not need to.
with wandb.init() as run:
    artifact = wandb.Artifact("artifact_name", "artifact_type")

    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image3.png")
    run.finish_artifact(artifact, distributed_id="my_dist_artifact")
Create a new artifact version from an existing version
Add, modify, or remove a subset of files from a previous artifact version without re-indexing the files that didn't change. Doing so creates a new artifact version, known as an incremental artifact.
Here are some scenarios for each type of incremental change you might encounter:
add: you periodically add a new subset of files to a dataset after collecting a new batch.
remove: you discovered several duplicate files and want to remove them from your artifact.
update: you corrected annotations for a subset of files and want to replace the old files with the correct ones.
You could create an artifact from scratch to perform the same function as an incremental artifact. However, when you create an artifact from scratch, you will need to have all the contents of your artifact on your local disk. When making an incremental change, you can add, remove, or modify a single file without changing the files from a previous artifact version.
You can create an incremental artifact within a single run or with a set of runs (distributed mode).
Follow the procedure below to incrementally change an artifact:
Obtain the artifact version you want to perform an incremental change on:
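Inside a run, fetch the artifact with use_artifact; outside a run, use the Public API. Then create a draft version to modify. A sketch using the same calls as the complete examples below:

# Inside a W&B run:
saved_artifact = run.use_artifact("my_artifact:latest")

# Or outside a run, with the Public API:
client = wandb.Api()
saved_artifact = client.artifact("my_artifact:latest")

# Create a draft version and modify a subset of files
draft_artifact = saved_artifact.new_draft()
draft_artifact.add_file("file_to_add.txt")
draft_artifact.remove("dir_to_remove/")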
Lastly, log or save your changes. The following tabs show you how to save your changes inside and outside of a W&B run. Select the tab that is appropriate for your use case:
run.log_artifact(draft_artifact)
draft_artifact.save()
Putting it all together, the code examples above look like:
with wandb.init(job_type="modify dataset") as run:
    saved_artifact = run.use_artifact(
        "my_artifact:latest"
    )  # fetch artifact and input it into your run

    draft_artifact = saved_artifact.new_draft()  # create a draft version

    # modify a subset of files in the draft version
    draft_artifact.add_file("file_to_add.txt")
    draft_artifact.remove("dir_to_remove/")
    run.log_artifact(
        draft_artifact
    )  # log your changes to create a new version and mark it as output to your run
client = wandb.Api()
saved_artifact = client.artifact("my_artifact:latest")  # load your artifact

draft_artifact = saved_artifact.new_draft()  # create a draft version

# modify a subset of files in the draft version
draft_artifact.remove("deleted_file.txt")
draft_artifact.add_file("modified_file.txt")
draft_artifact.save()  # commit changes to the draft
6 - Track external files
Track files saved outside of W&B, such as in an Amazon S3 bucket, GCS bucket, HTTP file server, or even an NFS share.
Use reference artifacts to track files saved outside the W&B system, for example in an Amazon S3 bucket, GCS bucket, Azure blob, HTTP file server, or even an NFS share. Log artifacts outside of a W&B Run with the W&B CLI.
Log artifacts outside of runs
W&B creates a run when you log an artifact outside of a run. Each artifact belongs to a run, which in turn belongs to a project. An artifact (version) also belongs to a collection, and has a type.
Use the wandb artifact put command to upload an artifact to the W&B server outside of a W&B run. Provide the name of the project you want the artifact to belong to along with the name of the artifact (project/artifact_name). Optionally provide the type (TYPE). Replace PATH in the code snippet below with the file path of the artifact you want to upload.
$ wandb artifact put --name project/artifact_name --type TYPE PATH
W&B creates a new project if the project you specify does not exist. For information on how to download an artifact, see Download and use artifacts.
Track artifacts outside of W&B
Use W&B Artifacts for dataset versioning and model lineage, and use reference artifacts to track files saved outside the W&B server. In this mode an artifact only stores metadata about the files, such as URLs, size, and checksums. The underlying data never leaves your system. See the Quick start for information on how to save files and directories to W&B servers instead.
The following describes how to construct reference artifacts and how to best incorporate them into your workflows.
Amazon S3 / GCS / Azure Blob Storage References
Use W&B Artifacts for dataset and model versioning to track references in cloud storage buckets. With artifact references, seamlessly layer tracking on top of your buckets with no modifications to your existing storage layout.
Artifacts abstract away the underlying cloud storage vendor (such as AWS, GCP, or Azure). The information in this section applies uniformly to Amazon S3, Google Cloud Storage, and Azure Blob Storage.
W&B Artifacts support any Amazon S3 compatible interface, including MinIO. The scripts below work as-is, when you set the AWS_S3_ENDPOINT_URL environment variable to point at your MinIO server.
Assume we have a bucket with the following structure:
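A hypothetical layout, mirroring the filesystem example later in this page:

s3://my-bucket
+-- datasets/
|   +-- mnist/
+-- models/
    +-- cnn/

Under datasets/mnist we have a collection of images. A sketch of tracking it with a reference artifact, using the same calls as the full workflow below:

import wandb

run = wandb.init()
artifact = wandb.Artifact("mnist", type="dataset")
artifact.add_reference("s3://my-bucket/datasets/mnist")
run.log_artifact(artifact)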
By default, W&B imposes a 10,000 object limit when adding an object prefix. You can adjust this limit by specifying max_objects= in calls to add_reference.
Our new reference artifact mnist:latest looks and behaves similarly to a regular artifact. The only difference is that the artifact only consists of metadata about the S3/GCS/Azure object such as its ETag, size, and version ID (if object versioning is enabled on the bucket).
W&B will use the default mechanism to look for credentials based on the cloud provider you use. Read the documentation from your cloud provider to learn more about the credentials used:
For AWS, if the bucket is not located in the configured user’s default region, you must set the AWS_REGION environment variable to match the bucket region.
Interact with this artifact similarly to a normal artifact. In the App UI, you can look through the contents of the reference artifact using the file browser, explore the full dependency graph, and scan through the versioned history of your artifact.
Rich media such as images, audio, video, and point clouds may fail to render in the App UI depending on the CORS configuration of your bucket. Allowlisting app.wandb.ai in your bucket's CORS settings allows the App UI to render such rich media properly.
Panels might fail to render in the App UI for private buckets. If your company has a VPN, you could update your bucket's access policy to allowlist IPs within your VPN.
W&B will use the metadata recorded when the artifact was logged to retrieve the files from the underlying bucket when it downloads a reference artifact. If your bucket has object versioning enabled, W&B will retrieve the object version corresponding to the state of the file at the time an artifact was logged. This means that as you evolve the contents of your bucket, you can still point to the exact iteration of your data a given model was trained on since the artifact serves as a snapshot of your bucket at the time of training.
W&B recommends that you enable ‘Object Versioning’ on your storage buckets if you overwrite files as part of your workflow. With versioning enabled on your buckets, artifacts with references to files that have been overwritten will still be intact because the older object versions are retained.
Based on your use case, read the instructions to enable object versioning: AWS, GCP, Azure.
Tying it together
The following code example demonstrates a simple workflow you can use to track a dataset in Amazon S3, GCS, or Azure that feeds into a training job:
import wandb
run = wandb.init()
artifact = wandb.Artifact("mnist", type="dataset")
artifact.add_reference("s3://my-bucket/datasets/mnist")
# Track the artifact and mark it as an input to
# this run in one swoop. A new artifact version
# is only logged if the files in the bucket changed.
run.use_artifact(artifact)
artifact_dir = artifact.download()
# Perform training here...
To track models, we can log the model artifact after the training script uploads the model files to the bucket:
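A sketch, assuming the training script uploads the model with a boto3 S3 client before the artifact references the uploaded object (bucket and key names are hypothetical):

import boto3
import wandb

run = wandb.init()

# Training here...

# Upload the model file to the bucket (hypothetical bucket and key)
s3_client = boto3.client("s3")
s3_client.upload_file("my_model.h5", "my-bucket", "models/cnn/my_model.h5")

model_artifact = wandb.Artifact("cnn", type="model")
model_artifact.add_reference("s3://my-bucket/models/cnn/")
run.log_artifact(model_artifact)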
Another common pattern for fast access to datasets is to expose an NFS mount point to a remote filesystem on all machines running training jobs. This can be an even simpler solution than a cloud storage bucket because from the perspective of the training script, the files look just like they are sitting on your local filesystem. Luckily, that ease of use extends into using Artifacts to track references to file systems, whether they are mounted or not.
Assume we have a filesystem mounted at /mount with the following structure:
mount
+-- datasets/
| +-- mnist/
+-- models/
+-- cnn/
Under mnist/ we have our dataset, a collection of images. Let’s track it with an artifact:
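A sketch using the same calls as the full workflow below:

import wandb

run = wandb.init()
artifact = wandb.Artifact("mnist", type="dataset")
artifact.add_reference("file:///mount/datasets/mnist/")
run.log_artifact(artifact)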
By default, W&B imposes a 10,000 file limit when adding a reference to a directory. You can adjust this limit by specifying max_objects= in calls to add_reference.
Note the triple slash in the URL. The first component is the file:// prefix that denotes the use of filesystem references. The second is the path to our dataset, /mount/datasets/mnist/.
The resulting artifact mnist:latest looks and acts just like a regular artifact. The only difference is that the artifact only consists of metadata about the files, such as their sizes and MD5 checksums. The files themselves never leave your system.
You can interact with this artifact just as you would a normal artifact. In the UI, you can browse the contents of the reference artifact using the file browser, explore the full dependency graph, and scan through the versioned history of your artifact. However, the UI will not be able to render rich media such as images, audio, etc. as the data itself is not contained within the artifact.
For filesystem references, a download() operation copies the files from the referenced paths to construct the artifact directory. In the above example, the contents of /mount/datasets/mnist will be copied into the directory artifacts/mnist:v0/. If an artifact contains a reference to a file that was overwritten, then download() will throw an error as the artifact can no longer be reconstructed.
Putting everything together, here’s a simple workflow you can use to track a dataset under a mounted filesystem that feeds into a training job:
import wandb
run = wandb.init()
artifact = wandb.Artifact("mnist", type="dataset")
artifact.add_reference("file:///mount/datasets/mnist/")
# Track the artifact and mark it as an input to
# this run in one swoop. A new artifact version
# is only logged if the files under the directory
# changed.
run.use_artifact(artifact)
artifact_dir = artifact.download()
# Perform training here...
To track models, we can log the model artifact after the training script writes the model files to the mount point:
import wandb
run = wandb.init()
# Training here...

# Write model to disk

model_artifact = wandb.Artifact("cnn", type="model")
model_artifact.add_reference("file:///mount/cnn/my_model.h5")
run.log_artifact(model_artifact)
7 - Manage data
7.1 - Delete an artifact
Delete artifacts interactively with the App UI or programmatically with the W&B SDK.
Delete artifacts interactively with the App UI or programmatically with the W&B SDK. When you delete an artifact, W&B marks that artifact as a soft-delete. In other words, the artifact is marked for deletion but files are not immediately deleted from storage.
The contents of the artifact remain in a soft-delete, or pending deletion, state until a regularly run garbage collection process reviews all artifacts marked for deletion. The garbage collection process deletes associated files from storage if the artifact and its associated files are not used by previous or subsequent artifact versions.
The sections in this page describe how to delete specific artifact versions, how to delete an artifact collection, how to delete artifacts with and without aliases, and more. You can schedule when artifacts are deleted from W&B with TTL policies. For more information, see Manage data retention with Artifact TTL policy.
Artifacts that are scheduled for deletion with a TTL policy, deleted with the W&B SDK, or deleted with the W&B App UI are first soft-deleted. Artifacts that are soft deleted undergo garbage collection before they are hard-deleted.
Delete an artifact version
To delete an artifact version:
Select the name of the artifact. This will expand the artifact view and list all the artifact versions associated with that artifact.
From the list of artifacts, select the artifact version you want to delete.
On the right hand side of the workspace, select the kebab dropdown.
Choose Delete.
An artifact version can also be deleted programmatically via the delete() method. See the examples below.
Delete multiple artifact versions with aliases
The following code example demonstrates how to delete artifacts that have aliases associated with them. Provide the entity, project name, and run ID that created the artifacts.
import wandb
api = wandb.Api()
run = api.run("entity/project/run_id")

for artifact in run.logged_artifacts():
    artifact.delete()
Set the delete_aliases parameter to True to delete aliases if the artifact has one or more aliases.
import wandb
api = wandb.Api()
run = api.run("entity/project/run_id")

for artifact in run.logged_artifacts():
    # Set delete_aliases=True in order to delete
    # artifacts with one or more aliases
    artifact.delete(delete_aliases=True)
Delete multiple artifact versions with a specific alias
The following code demonstrates how to delete multiple artifact versions that have a specific alias. Provide the entity, project name, and run ID that created the artifacts. Replace the deletion logic with your own:
import wandb
api = wandb.Api()
runs = api.run("entity/project_name/run_id")

# Delete artifact versions with alias 'v3' or 'v4'
for artifact_version in runs.logged_artifacts():
    # Replace with your own deletion logic.
    if artifact_version.name[-2:] == "v3" or artifact_version.name[-2:] == "v4":
        artifact_version.delete(delete_aliases=True)
Delete all versions of an artifact that do not have an alias
The following code snippet demonstrates how to delete all versions of an artifact that do not have an alias. Provide the name of the project and entity for the project and entity keys in wandb.Api, respectively. Replace the <> with the name of your artifact:
import wandb
# Provide your entity and a project name when you
# use wandb.Api methods.
api = wandb.Api(overrides={"project": "project", "entity": "entity"})

artifact_type, artifact_name = "<type>", "<name>"  # provide type and name
for v in api.artifact_versions(artifact_type, artifact_name):
    # Clean up versions that don't have an alias such as 'latest'.
    # NOTE: You can put whatever deletion logic you want here.
    if len(v.aliases) == 0:
        v.delete()
Delete an artifact collection
To delete an artifact collection:
Navigate to the artifact collection you want to delete and hover over it.
Select the kebab dropdown next to the artifact collection name.
Choose Delete.
You can also delete an artifact collection programmatically with the delete() method. Provide the name of the project and entity for the project and entity keys in wandb.Api, respectively:
import wandb
# Provide your entity and a project name when you
# use wandb.Api methods.
api = wandb.Api(overrides={"project": "project", "entity": "entity"})

collection = api.artifact_collection(
    "<artifact_type>", "entity/project/artifact_collection_name"
)
collection.delete()
How to enable garbage collection based on how W&B is hosted
Garbage collection is enabled by default if you use W&B's shared cloud. Based on how you host W&B, you might need to take additional steps to enable garbage collection:
Set the GORILLA_ARTIFACT_GC_ENABLED environment variable to true: GORILLA_ARTIFACT_GC_ENABLED=true
Enable bucket versioning if you use AWS, GCP, or any other S3-compatible storage provider such as MinIO. If you use Azure, enable soft deletion.
Soft deletion in Azure is equivalent to bucket versioning in other storage providers.
Which of these requirements apply depends on your deployment type.
7.2 - Manage data retention with Artifact TTL policy
Schedule when artifacts are deleted from W&B with a W&B Artifact time-to-live (TTL) policy. When you delete an artifact, W&B marks that artifact as a soft-delete. In other words, the artifact is marked for deletion but files are not immediately deleted from storage. For more information on how W&B deletes artifacts, see the Delete artifacts page.
Check out this video tutorial to learn how to manage data retention with Artifacts TTL in the W&B App.
W&B deactivates the option to set a TTL policy for model artifacts linked to the Model Registry. This is to help ensure that linked models do not accidentally expire if used in production workflows.
Only team admins can view a team’s settings and access team level TTL settings such as (1) permitting who can set or edit a TTL policy or (2) setting a team default TTL.
If you do not see the option to set or edit a TTL policy in an artifact’s details in the W&B App UI or if setting a TTL programmatically does not successfully change an artifact’s TTL property, your team admin has not given you permissions to do so.
Auto-generated Artifacts
Only user-generated artifacts can use TTL policies. Artifacts auto-generated by W&B cannot have TTL policies set for them.
The following Artifact types indicate an auto-generated Artifact:
run_table
code
job
Any Artifact type starting with: wandb-*
You can check an Artifact’s type on the W&B platform or programmatically:
import wandb
run = wandb.init(project="<my-project-name>")
artifact = run.use_artifact(artifact_or_name="<my-artifact-name>")
print(artifact.type)
Replace the values enclosed with <> with your own.
Define who can edit and set TTL policies
Define who can set and edit TTL policies within a team. You can either grant TTL permissions only to team admins, or you can grant both team admins and team members TTL permissions.
Only team admins can define who can set or edit a TTL policy.
Navigate to your team’s profile page.
Select the Settings tab.
Navigate to the Artifacts time-to-live (TTL) section.
From the TTL permissions dropdown, select who can set and edit TTL policies.
Click on Review and save settings.
Confirm the changes and select Save settings.
Create a TTL policy
Set a TTL policy for an artifact either when you create the artifact or retroactively after the artifact is created.
For all the code snippets below, replace the content wrapped in <> with your information to use the code snippet.
Set a TTL policy when you create an artifact
Use the W&B Python SDK to define a TTL policy when you create an artifact. TTL policies are typically defined in days. Defining a TTL policy when you create an artifact is similar to how you normally create an artifact, except that you also pass a time delta to the artifact's ttl attribute.
The following code snippet shows how to set a TTL policy for an artifact:
import wandb
from datetime import timedelta

run = wandb.init(project="<my-project-name>")
artifact = wandb.Artifact(name="<my-artifact-name>", type="<type>")
artifact.add_file("<my-file>")

artifact.ttl = timedelta(days=365 * 2)  # Delete in two years
run.log_artifact(artifact)
The preceding code example sets the TTL policy to two years.
Navigate to your W&B project in the W&B App UI.
Select the artifact icon on the left panel.
From the list of artifacts, expand the artifact type you want to edit.
Select the artifact version you want to edit the TTL policy for.
Click on the Version tab.
From the dropdown, select Edit TTL policy.
Within the modal that appears, select Custom from the TTL policy dropdown.
Within the TTL duration field, set the TTL policy in units of days.
Select the Update TTL button to save your changes.
Set default TTL policies for a team
Only team admins can set a default TTL policy for a team.
Set a default TTL policy for your team. Default TTL policies apply to all existing and future artifacts based on their respective creation dates. Artifacts with existing version-level TTL policies are not affected by the team’s default TTL.
Navigate to your team’s profile page.
Select the Settings tab.
Navigate to the Artifacts time-to-live (TTL) section.
Click on the Set team’s default TTL policy.
Within the Duration field, set the TTL policy in units of days.
Click on Review and save settings.
Confirm the changes and then select Save settings.
Set a TTL policy outside of a run
Use the public API to retrieve an artifact without fetching a run, and set the TTL policy. TTL policies are typically defined in days.
The following code sample shows how to fetch an artifact using the public API and set the TTL policy.
import wandb
from datetime import timedelta

api = wandb.Api()
artifact = api.artifact("entity/project/artifact:alias")
artifact.ttl = timedelta(days=365)  # Delete in one year
artifact.save()
Deactivate a TTL policy
Use the W&B Python SDK or W&B App UI to deactivate a TTL policy for a specific artifact version.
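With the SDK, a minimal sketch, assuming an artifact object fetched as shown earlier; setting ttl to None removes the policy:

artifact.ttl = None  # Deactivate the TTL policy
artifact.save()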
Within your project, select the Artifacts tab in the left sidebar.
Click on a collection.
Within the collection view you can see all of the artifacts in the selected collection. Within the Time to Live column you will see the TTL policy assigned to that artifact.
7.3 - Manage artifact storage and memory allocation
Manage storage and memory allocation of W&B Artifacts.
W&B stores artifact files in a private Google Cloud Storage bucket located in the United States by default. All files are encrypted at rest and in transit.
During training, W&B locally saves logs, artifacts, and configuration files in the following local directories:
| File | Default location | To change the default location, set: |
|---|---|---|
| logs | ./wandb | dir in wandb.init or the WANDB_DIR environment variable |
| artifacts | ~/.cache/wandb | the WANDB_CACHE_DIR environment variable |
| configs | ~/.config/wandb | the WANDB_CONFIG_DIR environment variable |
Depending on the machine where wandb is initialized, these default folders may not be located in a writable part of the file system. This might trigger an error.
Clean up local artifact cache
W&B caches artifact files to speed up downloads across versions that share files in common. Over time this cache directory can become large. Run the wandb artifact cache cleanup command to prune the cache and to remove any files that have not been used recently.
The following code snippet demonstrates how to limit the size of the cache to 1 GB. Copy and paste the code snippet into your terminal:
$ wandb artifact cache cleanup 1GB
8 - Explore artifact graphs
Traverse automatically created directed acyclic W&B Artifact graphs.
W&B automatically tracks the artifacts a given run logs as well as the artifacts a given run uses. These artifacts can include datasets, models, evaluation results, and more. You can explore an artifact's lineage to track and manage the various artifacts produced throughout the machine learning lifecycle.
Lineage
Tracking an artifact’s lineage has several key benefits:
Reproducibility: By tracking the lineage of all artifacts, teams can reproduce experiments, models, and results, which is essential for debugging, experimentation, and validating machine learning models.
Version Control: Artifact lineage involves versioning artifacts and tracking their changes over time. This allows teams to roll back to previous versions of data or models if needed.
Auditing: Having a detailed history of the artifacts and their transformations enables organizations to comply with regulatory and governance requirements.
Collaboration and Knowledge Sharing: Artifact lineage facilitates better collaboration among team members by providing a clear record of attempts as well as what worked, and what didn’t. This helps in avoiding duplication of efforts and accelerates the development process.
Finding an artifact’s lineage
When selecting an artifact in the Artifacts tab, you can see your artifact’s lineage. This graph view shows a general overview of your pipeline.
To view an artifact graph:
Navigate to your project in the W&B App UI
Choose the artifact icon on the left panel.
Select Lineage.
Navigating the lineage graph
The artifact or job type you provide appears in front of its name, with artifacts represented by blue icons and runs represented by green icons. Arrows detail the input and output of a run or artifact on the graph.
You can view the type and the name of an artifact in both the left sidebar and in the Lineage tab.
For a more detailed view, click any individual artifact or run to get more information on a particular object.
Artifact clusters
When a level of the graph has five or more runs or artifacts, it creates a cluster. A cluster has a search bar to find specific versions of runs or artifacts, and you can pull an individual node out of a cluster to continue investigating its lineage.
Clicking on a node opens a preview with an overview of the node. Clicking on the arrow extracts the individual run or artifact so you can examine the lineage of the extracted node.
Create an artifact. First, create a run with wandb.init. Then, create a new artifact or retrieve an existing one with wandb.Artifact. Next, add files to the artifact with .add_file. Finally, log the artifact to the run with .log_artifact. The finished code looks something like this:
with wandb.init() as run:
artifact = wandb.Artifact("artifact_name", "artifact_type")
    # Add Files and Assets to the artifact using
    # `.add`, `.add_file`, `.add_dir`, and `.add_reference`
    artifact.add_file("image1.png")
run.log_artifact(artifact)
Use the artifact object’s logged_by and used_by methods to walk the graph from the artifact:
# Walk up and down the graph from an artifact:
producer_run = artifact.logged_by()
consumer_runs = artifact.used_by()
9 - Data privacy and compliance
Learn where W&B files are stored by default. Explore how to save and store sensitive information.
Files are uploaded to a Google Cloud bucket managed by W&B when you log artifacts. The contents of the bucket are encrypted both at rest and in transit. Artifact files are only visible to users who have access to the corresponding project.
When you delete a version of an artifact, it is marked for soft deletion in our database and removed from your storage cost. When you delete an entire artifact, it is queued for permanent deletion and all of its contents are removed from the W&B bucket. If you have specific needs around file deletion, please reach out to Customer Support.
For sensitive datasets that cannot reside in a multi-tenant environment, you can use either a private W&B server connected to your cloud bucket or reference artifacts. Reference artifacts track references to private buckets without sending file contents to W&B. Reference artifacts maintain links to files on your buckets or servers. In other words, W&B only keeps track of the metadata associated with the files and not the files themselves.
Create a reference artifact similar to how you create a non-reference artifact:
import wandb
run = wandb.init()
artifact = wandb.Artifact("animals", type="dataset")
artifact.add_reference("s3://my-bucket/animals")
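# Log the reference artifact as usual; only metadata about the
# references leaves your system
run.log_artifact(artifact)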
For alternatives, contact us at contact@wandb.com to talk about private cloud and on-premises installations.
10 - Tutorial: Create, track, and use a dataset artifact
Artifacts quickstart shows how to create, track, and use a dataset artifact with W&B.
This walkthrough demonstrates how to create, track, and use a dataset artifact from W&B Runs.
1. Log into W&B
Import the W&B library and log in to W&B. You will need to sign up for a free W&B account if you have not done so already.
import wandb
wandb.login()
2. Initialize a run
Use the wandb.init() API to generate a background process to sync and log data as a W&B Run. Provide a project name and a job type:
# Create a W&B Run. Here we specify 'upload-dataset' as the job type
# since this example shows how to create a dataset artifact.
run = wandb.init(project="artifacts-example", job_type="upload-dataset")
3. Create an artifact object
Create an artifact object with the wandb.Artifact() API. Provide a name for the artifact and its type for the name and type parameters, respectively.
For example, the following code snippet demonstrates how to create an artifact called ‘bicycle-dataset’ with a ‘dataset’ label:
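artifact = wandb.Artifact(name="bicycle-dataset", type="dataset")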
For more information about how to construct an artifact, see Construct artifacts.
Add the dataset to the artifact
Add a file to the artifact. Common file types include models and datasets. The following example adds a dataset named dataset.h5 that is saved locally on our machine to the artifact:
# Add a file to the artifact's contents
artifact.add_file(local_path="dataset.h5")
Replace the filename dataset.h5 in the preceding code snippet with the path to the file you want to add to the artifact.
4. Log the dataset
Use the W&B run object's log_artifact() method to both save your artifact version and declare the artifact as an output of the run.
# Save the artifact version to W&B and mark it
# as the output of this run
run.log_artifact(artifact)
A 'latest' alias is created by default when you log an artifact. For more information about artifact aliases and versions, see Create a custom alias and Create new artifact versions, respectively.
5. Download and use the artifact
The following code example demonstrates the steps you can take to use an artifact you have logged and saved to the W&B servers.
First, initialize a new run object with wandb.init().
Second, use the run object's use_artifact() method to tell W&B what artifact to use. This returns an artifact object.
Third, use the artifact's download() method to download the contents of the artifact.
# Create a W&B Run. Here we specify 'training' as the job type
# because we will use this run to track training.
run = wandb.init(project="artifacts-example", job_type="training")

# Query W&B for an artifact and mark it as input to this run
artifact = run.use_artifact("bicycle-dataset:latest")

# Download the artifact's contents
artifact_dir = artifact.download()
Alternatively, you can use the Public API (wandb.Api) to export (or update) data already saved in W&B outside of a Run. See Track external files for more information.