Deployment options
- 1: Use W&B Multi-tenant SaaS
- 2: Dedicated Cloud
- 3: Self-managed
1 - Use W&B Multi-tenant SaaS
W&B Multi-tenant Cloud is a fully managed platform deployed in W&B’s Google Cloud Platform (GCP) account in GCP’s North America regions. W&B Multi-tenant Cloud utilizes autoscaling in GCP to ensure that the platform scales appropriately as traffic increases or decreases.
Data security
For non-enterprise plan users, all data is stored only in the shared cloud storage and is processed with shared cloud compute services. Depending on your pricing plan, you may be subject to storage limits.
Enterprise plan users can bring their own bucket (BYOB) using the secure storage connector at the team level to store files such as models, datasets, and more. You can configure a single bucket for multiple teams or use separate buckets for different W&B Teams. If you do not configure the secure storage connector for a team, that data is stored in the shared cloud storage.
Identity and access management (IAM)
If you are on the enterprise plan, you can use the identity and access management capabilities for secure authentication and effective authorization in your W&B Organization. The following IAM features are available in Multi-tenant Cloud:
- SSO authentication with OIDC or SAML. Reach out to your W&B team or support if you would like to configure SSO for your organization.
- Configure appropriate user roles at the scope of the organization and within a team.
- Define the scope of a W&B project to limit who can view, edit, and submit W&B runs to it with restricted projects.
Monitor
Organization admins can manage usage and billing for their account from the Billing tab in their account view. If using the shared cloud storage on Multi-tenant Cloud, an admin can optimize storage usage across different teams in their organization.
Maintenance
W&B Multi-tenant Cloud is a multi-tenant, fully managed platform. Since W&B Multi-tenant Cloud is managed by W&B, you do not incur the overhead and costs of provisioning and maintaining the W&B platform.
Compliance
Security controls for Multi-tenant Cloud are periodically audited internally and externally. Refer to the W&B Security Portal to request the SOC2 report and other security and compliance documents.
Next steps
Access Multi-tenant Cloud directly if you are looking for non-enterprise capabilities. To start with the enterprise plan, submit this form.
2 - Dedicated Cloud
Use Dedicated Cloud (single-tenant SaaS)
W&B Dedicated Cloud is a single-tenant, fully managed platform deployed in W&B’s AWS, GCP or Azure cloud accounts. Each Dedicated Cloud instance has its own isolated network, compute and storage from other W&B Dedicated Cloud instances. Your W&B specific metadata and data is stored in an isolated cloud storage and is processed using isolated cloud compute services.
W&B Dedicated Cloud is available in multiple global regions for each cloud provider.
Data security
You can bring your own bucket (BYOB) using the secure storage connector at the instance and team levels to store your files such as models, datasets, and more.
Similar to W&B Multi-tenant Cloud, you can configure a single bucket for multiple teams or use separate buckets for different teams. If you do not configure the secure storage connector for a team, that data is stored in the instance-level bucket.
In addition to BYOB with secure storage connector, you can utilize IP allowlisting to restrict access to your Dedicated Cloud instance from only trusted network locations.
You can also privately connect to your Dedicated Cloud instance using your cloud provider’s secure connectivity solution.
Identity and access management (IAM)
Use the identity and access management capabilities for secure authentication and effective authorization in your W&B Organization. The following features are available for IAM in Dedicated Cloud instances:
- Authenticate with SSO using OpenID Connect (OIDC) or with LDAP.
- Configure appropriate user roles at the scope of the organization and within a team.
- Define the scope of a W&B project to limit who can view, edit, and submit W&B runs to it with restricted projects.
- Leverage JSON Web Tokens with identity federation to access W&B APIs.
Monitor
Use Audit logs to track user activity within your teams and to conform to your enterprise governance requirements. Also, you can view organization usage in your Dedicated Cloud instance with the W&B Organization Dashboard.
Maintenance
Similar to W&B Multi-tenant Cloud, you do not incur the overhead and costs of provisioning and maintaining the W&B platform with Dedicated Cloud.
To understand how W&B manages updates on Dedicated Cloud, refer to the server release process.
Compliance
Security controls for W&B Dedicated Cloud are periodically audited internally and externally. Refer to the W&B Security Portal to request the security and compliance documents for your product assessment exercise.
Migration options
Migration to Dedicated Cloud from a Self-managed instance or Multi-tenant Cloud is supported.
Next steps
Submit this form if you are interested in using Dedicated Cloud.
2.1 - Supported Dedicated Cloud regions
AWS, GCP, and Azure support cloud computing services in multiple locations worldwide. Global regions help ensure that you satisfy requirements related to data residency & compliance, latency, cost efficiency and more. W&B supports many of the available global regions for Dedicated Cloud.
Supported AWS Regions
The following table lists AWS Regions that W&B currently supports for Dedicated Cloud instances.
Region location | Region name |
---|---|
US East (Ohio) | us-east-2 |
US East (N. Virginia) | us-east-1 |
US West (N. California) | us-west-1 |
US West (Oregon) | us-west-2 |
Canada (Central) | ca-central-1 |
Europe (Frankfurt) | eu-central-1 |
Europe (Ireland) | eu-west-1 |
Europe (London) | eu-west-2 |
Europe (Milan) | eu-south-1 |
Europe (Stockholm) | eu-north-1 |
Asia Pacific (Mumbai) | ap-south-1 |
Asia Pacific (Singapore) | ap-southeast-1 |
Asia Pacific (Sydney) | ap-southeast-2 |
Asia Pacific (Tokyo) | ap-northeast-1 |
Asia Pacific (Seoul) | ap-northeast-2 |
For more information about AWS Regions, see the Regions, Availability Zones, and Local Zones in the AWS Documentation.
See What to Consider when Selecting a Region for your Workloads for an overview of factors that you should consider when choosing an AWS Region.
Supported GCP Regions
The following table lists GCP Regions that W&B currently supports for Dedicated Cloud instances.
Region location | Region name |
---|---|
South Carolina | us-east1 |
N. Virginia | us-east4 |
Iowa | us-central1 |
Oregon | us-west1 |
Los Angeles | us-west2 |
Las Vegas | us-west4 |
Toronto | northamerica-northeast2 |
Belgium | europe-west1 |
London | europe-west2 |
Frankfurt | europe-west3 |
Netherlands | europe-west4 |
Sydney | australia-southeast1 |
Tokyo | asia-northeast1 |
Seoul | asia-northeast3 |
For more information about GCP Regions, see Regions and zones in the GCP Documentation.
Supported Azure regions
The following table lists Azure regions that W&B currently supports for Dedicated Cloud instances.
Region location | Region name |
---|---|
Virginia | eastus |
Iowa | centralus |
Washington | westus2 |
California | westus |
Canada Central | canadacentral |
France Central | francecentral |
Netherlands | westeurope |
Tokyo, Saitama | japaneast |
Seoul | koreacentral |
For more information about Azure regions, see Azure geographies in the Azure Documentation.
2.2 - Export data from Dedicated Cloud
If you would like to export all the data managed in your Dedicated Cloud instance, you can use the W&B SDK API to extract runs, metrics, artifacts, and more with the Import and Export API. The following table covers some of the key export use cases.
Purpose | Documentation |
---|---|
Export project metadata | Projects API |
Export runs in a project | Runs API |
Export reports | Reports API |
Export artifacts | Explore artifact graphs, Download and use artifacts |
If you manage artifacts stored in Dedicated Cloud with Secure Storage Connector, you may not need to export the artifacts using the W&B SDK API.
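For instance, a minimal sketch using the W&B SDK public API (the entity, project, and host below are hypothetical placeholders; the SDK reads credentials from wandb login or the WANDB_API_KEY environment variable, and history() requires pandas):
import wandb

# Point the API at your Dedicated Cloud instance rather than the default host
api = wandb.Api(overrides={"base_url": "https://YOUR-INSTANCE.wandb.io"})

# Export the logged metrics of every run in a project to CSV files
for run in api.runs("my-entity/my-project"):
    run.history().to_csv(f"{run.id}_metrics.csv")  # history() returns a pandas DataFrame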
3 - Self-managed
Use self-managed cloud or on-prem infrastructure
Deploy W&B Server on your AWS, GCP, or Azure cloud account or within your on-premises infrastructure.
Your IT/DevOps/MLOps team is responsible for provisioning your deployment, managing upgrades, and continuously maintaining your self-managed W&B Server instance.
Deploy W&B Server within self-managed cloud accounts
W&B recommends that you use official W&B Terraform scripts to deploy W&B Server into your AWS, GCP, or Azure cloud account.
See specific cloud provider documentation for more information on how to set up W&B Server in AWS, GCP or Azure.
Deploy W&B Server in on-prem infrastructure
You need to configure several infrastructure components in order to set up W&B Server in your on-prem infrastructure. These components include, but are not limited to:
- (Strongly recommended) Kubernetes cluster
- MySQL 8 database cluster
- Amazon S3-compatible object storage
- Redis cache cluster
See Install on on-prem infrastructure for more information on how to install W&B Server on your on-prem infrastructure. W&B can provide recommendations for the different components and provide guidance through the installation process.
Deploy W&B Server on a custom cloud platform
You can deploy W&B Server to a cloud platform that is not AWS, GCP, or Azure. The requirements are similar to those for deploying in on-prem infrastructure.
Obtain your W&B Server license
You need a W&B trial license to complete your configuration of the W&B server. Open the Deploy Manager to generate a free trial license.
If you do not already have a W&B account, create one to generate your free license.
If you need an enterprise license for W&B Server, which includes support for important security and other enterprise-friendly capabilities, submit this form or reach out to your W&B team.
The URL redirects you to a Get a License for W&B Local form. Provide the following information:
- Choose a deployment type from the Choose Platform step.
- Select the owner of the license or add a new organization in the Basic Information step.
- Provide a name for the instance in the Name of Instance field and optionally provide a description in the Description field in the Get a License step.
- Select the Generate License Key button.
A page displays with an overview of your deployment along with the license associated with the instance.
3.1 - Reference Architecture
This page describes a reference architecture for a Weights & Biases deployment and outlines the recommended infrastructure and resources to support a production deployment of the platform.
Depending on your chosen deployment environment for Weights & Biases (W&B), various services can help to enhance the resiliency of your deployment.
For instance, major cloud providers offer robust managed database services which help to reduce the complexity of database configuration, maintenance, high availability, and resilience.
This reference architecture addresses some common deployment scenarios and shows how you can integrate your W&B deployment with cloud vendor services for optimal performance and reliability.
Before you start
Running any application in production comes with its own set of challenges, and W&B is no exception. While we aim to streamline the process, certain complexities may arise depending on your unique architecture and design decisions. Typically, managing a production deployment involves overseeing various components, including hardware, operating systems, networking, storage, security, the W&B platform itself, and other dependencies. This responsibility extends to both the initial setup of the environment and its ongoing maintenance.
Consider carefully whether a self-managed approach with W&B is suitable for your team and specific requirements.
A strong understanding of how to run and maintain production-grade applications is an important prerequisite before you deploy self-managed W&B. If your team needs assistance, our Professional Services team and partners offer support for implementation and optimization.
To learn more about managed solutions for running W&B instead of managing it yourself, refer to W&B Multi-tenant Cloud and W&B Dedicated Cloud.
Infrastructure
Application layer
The application layer consists of a multi-node Kubernetes cluster, with resilience against node failures. The Kubernetes cluster runs and maintains W&B’s pods.
Storage layer
The storage layer consists of a MySQL database and object storage. The MySQL database stores metadata and the object storage stores artifacts such as models and datasets.
Infrastructure requirements
Kubernetes
The W&B Server application is deployed as a Kubernetes Operator that deploys multiple Pods. For this reason, W&B requires a Kubernetes cluster with:
- A fully configured and functioning Ingress controller
- The capability to provision Persistent Volumes.
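As a quick sanity check of these two requirements (assuming kubectl is pointed at the target cluster), you can confirm that an ingress class and a storage class for Persistent Volumes exist:
kubectl get ingressclass   # at least one ingress controller should be listed
kubectl get storageclass   # a default storage class enables PV provisioning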
MySQL
W&B stores metadata in a MySQL database. The database’s performance and storage requirements depend on the shapes of the model parameters and related metadata. For example, the database grows in size as you track more training runs, and load on the database increases based on queries in run tables, user workspaces, and reports.
Consider the following when you deploy a self-managed MySQL database:
- Backups. You should periodically back up the database to a separate facility. W&B recommends daily backups with at least 1 week of retention.
- Performance. The disk the server is running on should be fast. W&B recommends running the database on an SSD or accelerated NAS.
- Monitoring. The database should be monitored for load. If CPU usage is sustained at more than 40% of the system for more than 5 minutes, it is a good indication that the server is resource-starved.
- Availability. Depending on your availability and durability requirements, you might want to configure a hot standby on a separate machine that streams all updates in real time from the primary server and can be failed over to in the event that the primary server crashes or becomes corrupted.
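As a minimal sketch of the backup recommendation above (host, credentials, database name, and paths are placeholders), a nightly job might look like this:
# Nightly logical backup, keeping 7 days of compressed dumps
mysqldump --single-transaction --host "$DB_HOST" --user "$DB_USER" --password="$DB_PASS" wandb_local | gzip > "/backups/wandb-$(date +%F).sql.gz"
find /backups -name 'wandb-*.sql.gz' -mtime +7 -delete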
Object storage
W&B requires object storage with pre-signed URL and CORS support, deployed in Amazon S3, Azure Blob Storage, Google Cloud Storage, or a storage service compatible with Amazon S3.
Versions
- Kubernetes: at least version 1.29.
- MySQL: at least 8.0.
Networking
In a deployment connected to a public or private network, egress to the following endpoints is required during installation and at runtime:
* https://deploy.wandb.ai
* https://charts.wandb.ai
* https://docker.io
* https://quay.io
* https://gcr.io
Access to W&B and to the object storage is required for the training infrastructure and for each system that needs to track experiments.
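One hypothetical way to sanity-check that egress is open, run from a host inside the deployment network:
for url in https://deploy.wandb.ai https://charts.wandb.ai https://docker.io https://quay.io https://gcr.io; do
  curl -sS -o /dev/null --max-time 10 "$url" && echo "OK   $url" || echo "FAIL $url"
done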
DNS
The fully qualified domain name (FQDN) of the W&B deployment must resolve to the IP address of the ingress/load balancer using an A record.
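For example, a matching zone entry might look like the following (hostname and address are placeholders):
wandb.company-name.com.  300  IN  A  203.0.113.10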
SSL/TLS
W&B requires a valid signed SSL/TLS certificate for secure communication between clients and the server. SSL/TLS termination must occur on the ingress/load balancer. The W&B Server application does not terminate SSL or TLS connections.
W&B does not recommend using self-signed certificates or custom CAs.
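To check which certificate the ingress/load balancer actually presents (hostname is a placeholder), an OpenSSL probe such as this can help:
openssl s_client -connect wandb.company-name.com:443 -servername wandb.company-name.com </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates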
Supported CPU architectures
W&B runs on the Intel (x86) CPU architecture. ARM is not supported.
Infrastructure provisioning
Terraform is the recommended way to deploy W&B for production. Using Terraform, you define the required resources, their references to other resources, and their dependencies. W&B provides Terraform modules for the major cloud providers. For details, refer to Deploy W&B Server within self managed cloud accounts.
Sizing
Use the following general guidelines as a starting point when planning a deployment. W&B recommends that you monitor all components of a new deployment closely and that you make adjustments based on observed usage patterns. Continue to monitor production deployments over time and make adjustments as needed to maintain optimal performance.
Models only
Kubernetes
Environment | CPU | Memory | Disk |
---|---|---|---|
Test/Dev | 2 cores | 16 GB | 100 GB |
Production | 8 cores | 64 GB | 100 GB |
Numbers are per Kubernetes worker node.
MySQL
Environment | CPU | Memory | Disk |
---|---|---|---|
Test/Dev | 2 cores | 16 GB | 100 GB |
Production | 8 cores | 64 GB | 500 GB |
Numbers are per MySQL node.
Weave only
Kubernetes
Environment | CPU | Memory | Disk |
---|---|---|---|
Test/Dev | 4 cores | 32 GB | 100 GB |
Production | 12 cores | 96 GB | 100 GB |
Numbers are per Kubernetes worker node.
MySQL
Environment | CPU | Memory | Disk |
---|---|---|---|
Test/Dev | 2 cores | 16 GB | 100 GB |
Production | 8 cores | 64 GB | 500 GB |
Numbers are per MySQL node.
Models and Weave
Kubernetes
Environment | CPU | Memory | Disk |
---|---|---|---|
Test/Dev | 4 cores | 32 GB | 100 GB |
Production | 16 cores | 128 GB | 100 GB |
Numbers are per Kubernetes worker node.
MySQL
Environment | CPU | Memory | Disk |
---|---|---|---|
Test/Dev | 2 cores | 16 GB | 100 GB |
Production | 8 cores | 64 GB | 500 GB |
Numbers are per MySQL node.
Cloud provider instance recommendations
Services
Cloud | Kubernetes | MySQL | Object Storage |
---|---|---|---|
AWS | EKS | RDS Aurora | S3 |
GCP | GKE | Google Cloud SQL for MySQL | Google Cloud Storage (GCS) |
Azure | AKS | Azure Database for MySQL | Azure Blob Storage |
Machine types
These recommendations apply to each node of a self-managed deployment of W&B in cloud infrastructure.
AWS
Environment | K8s (Models only) | K8s (Weave only) | K8s (Models&Weave) | MySQL |
---|---|---|---|---|
Test/Dev | r6i.large | r6i.xlarge | r6i.xlarge | db.r6g.large |
Production | r6i.2xlarge | r6i.4xlarge | r6i.4xlarge | db.r6g.2xlarge |
GCP
Environment | K8s (Models only) | K8s (Weave only) | K8s (Models&Weave) | MySQL |
---|---|---|---|---|
Test/Dev | n2-highmem-2 | n2-highmem-4 | n2-highmem-4 | db-n1-highmem-2 |
Production | n2-highmem-8 | n2-highmem-16 | n2-highmem-16 | db-n1-highmem-8 |
Azure
Environment | K8s (Models only) | K8s (Weave only) | K8s (Models&Weave) | MySQL |
---|---|---|---|---|
Test/Dev | Standard_E2_v5 | Standard_E4_v5 | Standard_E4_v5 | MO_Standard_E2ds_v4 |
Production | Standard_E8_v5 | Standard_E16_v5 | Standard_E16_v5 | MO_Standard_E8ds_v4 |
3.2 - Run W&B Server on Kubernetes
W&B Kubernetes Operator
Use the W&B Kubernetes Operator to simplify deploying, administering, troubleshooting, and scaling your W&B Server deployments on Kubernetes. You can think of the operator as a smart assistant for your W&B instance.
The W&B Server architecture and design continuously evolve to expand AI developer tooling capabilities and to provide appropriate primitives for high performance, better scalability, and easier administration. That evolution applies to the compute services, the relevant storage, and the connectivity between them. To help facilitate continuous updates and improvements across deployment types, W&B uses a Kubernetes operator.
For more information about Kubernetes operators, see Operator pattern in the Kubernetes documentation.
Reasons for the architecture shift
Historically, the W&B application was deployed as a single deployment and pod within a Kubernetes cluster, or as a single Docker container. W&B has recommended, and continues to recommend, externalizing the database and object store. Externalizing the database and object store decouples the application’s state.
As the application grew, the need to evolve from a monolithic container to a distributed system (microservices) became apparent. This change facilitates backend logic handling and seamlessly introduces built-in Kubernetes infrastructure capabilities. A distributed system also supports deploying new services essential for additional features that W&B relies on.
Before 2024, any Kubernetes-related change required manually updating the terraform-kubernetes-wandb Terraform module: ensuring compatibility across cloud providers, configuring the necessary Terraform variables, and executing a Terraform apply for each backend or Kubernetes-level change.
This process was not scalable, since W&B Support had to assist each customer with upgrading their Terraform module.
The solution was to implement an operator that connects to a central deploy.wandb.ai server to request the latest specification changes for a given release channel and apply them. Updates are received as long as the license is valid. Helm is used both as the deployment mechanism for the W&B operator and as the means for the operator to handle all configuration templating of the W&B Kubernetes stack (Helm-ception).
How it works
You can install the operator with helm or from the source. See charts/operator for detailed instructions.
The installation process creates a deployment called controller-manager and uses a custom resource definition named weightsandbiases.apps.wandb.com (shortName: wandb) that takes a single spec and applies it to the cluster:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: weightsandbiases.apps.wandb.com
The controller-manager installs charts/operator-wandb based on the spec of the custom resource, the release channel, and a user-defined config. The configuration specification hierarchy enables maximum configuration flexibility at the user end and enables W&B to release new images, configurations, features, and Helm updates automatically.
Refer to the configuration specification hierarchy and configuration reference for configuration options.
Configuration specification hierarchy
Configuration specifications follow a hierarchical model where higher-level specifications override lower-level ones. Here’s how it works:
- Release Channel Values: This base level configuration sets default values and configurations based on the release channel set by W&B for the deployment.
- User Input Values: Users can override the default settings provided by the Release Channel Spec through the System Console.
- Custom Resource Values: The highest level of specification, which comes from the user. Any values specified here override both the User Input and Release Channel specifications. For a detailed description of the configuration options, see Configuration Reference.
This hierarchical model ensures that configurations are flexible and customizable to meet varying needs while maintaining a manageable and systematic approach to upgrades and changes.
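As an illustration of the highest level of that hierarchy (the host value is hypothetical), a Custom Resource value pins a setting that then wins over any console input or release channel default:
apiVersion: apps.wandb.com/v1
kind: WeightsAndBiases
metadata:
  name: wandb
  namespace: default
spec:
  values:
    global:
      # Overrides the same key from User Input or Release Channel values
      host: https://wandb.example.com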
Requirements to use the W&B Kubernetes Operator
Satisfy the following requirements to deploy W&B with the W&B Kubernetes operator:
Refer to the reference architecture. In addition, obtain a valid W&B Server license.
See this guide for a detailed explanation on how to set up and configure a self-managed installation.
Depending on the installation method, you might need to meet the following requirements:
- Kubectl installed and configured with the correct Kubernetes cluster context.
- Helm is installed.
Air-gapped installations
See the Deploy W&B in airgapped environment with Kubernetes tutorial on how to install the W&B Kubernetes Operator in an airgapped environment.
Deploy W&B Server application
This section describes different ways to deploy the W&B Kubernetes operator.
Choose one of the following:
- If you have provisioned all required external services and want to deploy W&B onto Kubernetes with Helm CLI, continue here.
- If you prefer managing infrastructure and the W&B Server with Terraform, continue here.
- If you want to utilize the W&B Cloud Terraform Modules, continue here.
Deploy W&B with Helm CLI
W&B provides a Helm Chart to deploy the W&B Kubernetes operator to a Kubernetes cluster. This approach allows you to deploy W&B Server with the Helm CLI or a continuous delivery tool like ArgoCD. Make sure that the above-mentioned requirements are in place.
Follow these steps to install the W&B Kubernetes Operator with the Helm CLI:
- Add the W&B Helm repository. The W&B Helm chart is available in the W&B Helm repository. Add the repo with the following commands:
helm repo add wandb https://charts.wandb.ai
helm repo update
- Install the Operator on a Kubernetes cluster. Copy and paste the following:
helm upgrade --install operator wandb/operator -n wandb-cr --create-namespace
- Configure the W&B operator custom resource to trigger the W&B Server installation. Create an operator.yaml file to customize the W&B Operator deployment, specifying your custom configuration. See Configuration Reference for details.
Once you have the specification YAML created and filled with your values, run the following command. The operator applies the configuration and installs the W&B Server application based on it.
kubectl apply -f operator.yaml
Wait until the deployment completes. This takes a few minutes.
- To verify the installation using the web UI, create the first admin user account, then follow the verification steps outlined in Verify the installation.
Deploy W&B with Helm Terraform Module
This method allows for customized deployments tailored to specific requirements, leveraging Terraform’s infrastructure-as-code approach for consistency and repeatability. The official W&B Helm-based Terraform Module is located here.
The following code can be used as a starting point and includes all necessary configuration options for a production-grade deployment.
module "wandb" {
source = "wandb/wandb/helm"
spec = {
values = {
global = {
host = "https://<HOST_URI>"
license = "eyJhbGnUzaH...j9ZieKQ2x5GGfw"
bucket = {
<details depend on the provider>
}
mysql = {
<redacted>
}
}
ingress = {
annotations = {
"a" = "b"
"x" = "y"
}
}
}
}
}
Note that the configuration options are the same as described in Configuration Reference, but the syntax must follow the HashiCorp Configuration Language (HCL). The Terraform module creates the W&B custom resource definition (CRD).
To see how Weights & Biases uses the Helm Terraform module to deploy Dedicated Cloud installations for customers, follow these links:
Deploy W&B with W&B Cloud Terraform modules
W&B provides a set of Terraform Modules for AWS, GCP, and Azure. These modules deploy entire infrastructures, including Kubernetes clusters, load balancers, and MySQL databases, as well as the W&B Server application. The W&B Kubernetes Operator comes pre-baked in these official W&B cloud-specific Terraform Modules with the following versions:
Terraform Registry | Source Code | Version |
---|---|---|
AWS | https://github.com/wandb/terraform-aws-wandb | v4.0.0+ |
Azure | https://github.com/wandb/terraform-azurerm-wandb | v2.0.0+ |
GCP | https://github.com/wandb/terraform-google-wandb | v2.0.0+ |
This integration ensures that W&B Kubernetes Operator is ready to use for your instance with minimal setup, providing a streamlined path to deploying and managing W&B Server in your cloud environment.
For a detailed description of how to use these modules, refer to the self-managed installations section in the docs.
Verify the installation
To verify the installation, W&B recommends using the W&B CLI. The verify command executes several tests that verify all components and configurations.
Follow these steps to verify the installation:
- Install the W&B CLI:
pip install wandb
- Log in to W&B:
wandb login --host=https://YOUR_DNS_DOMAIN
For example:
wandb login --host=https://wandb.company-name.com
- Verify the installation:
wandb verify
A successful installation and fully working W&B deployment shows the following output:
Default host selected: https://wandb.company-name.com
Find detailed logs for this test at: /var/folders/pn/b3g3gnc11_sbsykqkm3tx5rh0000gp/T/tmpdtdjbxua/wandb
Checking if logged in...................................................✅
Checking signed URL upload..............................................✅
Checking ability to send large payloads through proxy...................✅
Checking requests to base url...........................................✅
Checking requests made over signed URLs.................................✅
Checking CORs configuration of the bucket...............................✅
Checking wandb package version is up to date............................✅
Checking logged metrics, saving and downloading a file..................✅
Checking artifact save and download workflows...........................✅
Access the W&B Management Console
The W&B Kubernetes operator comes with a management console. It is located at ${HOST_URI}/console, for example https://wandb.company-name.com/console.
There are two ways to log in to the management console:
Option 1:
- Open the W&B application in the browser and log in at ${HOST_URI}/, for example https://wandb.company-name.com/.
- Access the console. Click the icon in the top right corner, then click System console. Only users with admin privileges can see the System console entry.
Option 2:
- Open the console application in the browser. Open the URL described above, which redirects you to the login screen.
- Retrieve the password from the Kubernetes secret that the installation generates, then copy it:
kubectl get secret wandb-password -o jsonpath='{.data.password}' | base64 -d
- Log in to the console. Paste the copied password, then click Login.
Update the W&B Kubernetes operator
This section describes how to update the W&B Kubernetes operator.
- Updating the W&B Kubernetes operator does not update the W&B Server application.
- If you use a Helm chart that does not use the W&B Kubernetes operator, see the instructions here before you follow the instructions below to update the W&B operator.
Copy and paste the code snippets below into your terminal.
- First, update the repo with helm repo update:
helm repo update
- Next, update the Helm chart with helm upgrade:
helm upgrade operator wandb/operator -n wandb-cr --reuse-values
Update the W&B Server application
You no longer need to update the W&B Server application manually if you use the W&B Kubernetes operator.
The operator automatically updates your W&B Server application when a new version of W&B is released.
Migrate self-managed instances to W&B Operator
This section describes how to migrate from self-managing your own W&B Server installation to using the W&B Operator to do this for you. The migration process depends on how you installed W&B Server:
- If you used the official W&B Cloud Terraform Modules, navigate to the appropriate documentation and follow the steps there:
- If you used the W&B Non-Operator Helm chart, continue here.
- If you used the W&B Non-Operator Helm chart with Terraform, continue here.
- If you created the Kubernetes resources with manifests, continue here.
Migrate to Operator-based AWS Terraform Modules
For a detailed description of the migration process, continue here.
Migrate to Operator-based GCP Terraform Modules
Reach out to Customer Support or your W&B team if you have any questions or need assistance.
Migrate to Operator-based Azure Terraform Modules
Reach out to Customer Support or your W&B team if you have any questions or need assistance.
Migrate to Operator-based Helm chart
Follow these steps to migrate to the Operator-based Helm chart:
- Get the current W&B configuration. If W&B was deployed with a non-operator-based version of the Helm chart, export the values like this:
helm get values wandb
If W&B was deployed with Kubernetes manifests, export the values like this:
kubectl get deployment wandb -o yaml
You now have all the configuration values you need for the next step.
- Create a file called operator.yaml. Follow the format described in the Configuration Reference. Use the values from step 1.
- Scale the current deployment to 0 pods. This step stops the current deployment.
kubectl scale --replicas=0 deployment wandb
- Update the Helm chart repo:
helm repo update
- Install the new Helm chart:
helm upgrade --install operator wandb/operator -n wandb-cr --create-namespace
- Configure the new Helm chart and trigger the W&B application deployment. Apply the new configuration:
kubectl apply -f operator.yaml
The deployment takes a few minutes to complete.
- Verify the installation. Make sure that everything works by following the steps in Verify the installation.
- Remove the old installation. Uninstall the old Helm chart or delete the resources that were created with manifests.
Migrate to Operator-based Terraform Helm chart
Follow these steps to migrate to the Operator-based Helm chart:
- Prepare the Terraform config. Replace the Terraform code from the old deployment in your Terraform config with the one described here. Set the same variables as before. Do not change the .tfvars file if you have one.
- Execute the Terraform run. Execute terraform init, terraform plan, and terraform apply.
- Verify the installation. Make sure that everything works by following the steps in Verify the installation.
- Remove the old installation. Uninstall the old Helm chart or delete the resources that were created with manifests.
Configuration Reference for W&B Server
This section describes the configuration options for the W&B Server application. The application receives its configuration as a custom resource named WeightsAndBiases. Some configuration options are exposed through the configuration below; others need to be set as environment variables.
The documentation has two lists of environment variables: basic and advanced. Only use environment variables if the configuration option that you need is not exposed through the Helm chart.
The W&B Server application configuration file for a production deployment requires the following contents. This YAML file defines the desired state of your W&B deployment, including the version, environment variables, external resources like databases, and other necessary settings.
apiVersion: apps.wandb.com/v1
kind: WeightsAndBiases
metadata:
labels:
app.kubernetes.io/name: weightsandbiases
app.kubernetes.io/instance: wandb
name: wandb
namespace: default
spec:
values:
global:
host: https://<HOST_URI>
license: eyJhbGnUzaH...j9ZieKQ2x5GGfw
bucket:
<details depend on the provider>
mysql:
<redacted>
ingress:
annotations:
<redacted>
Find the full set of values in the W&B Helm repository, and change only those values you need to override.
Complete example
This is an example configuration that uses GCP Kubernetes with GCP Ingress and GCS (GCP Object storage):
apiVersion: apps.wandb.com/v1
kind: WeightsAndBiases
metadata:
labels:
app.kubernetes.io/name: weightsandbiases
app.kubernetes.io/instance: wandb
name: wandb
namespace: default
spec:
values:
global:
host: https://abc-wandb.sandbox-gcp.wandb.ml
bucket:
name: abc-wandb-moving-pipefish
provider: gcs
mysql:
database: wandb_local
host: 10.218.0.2
name: wandb_local
password: 8wtX6cJHizAZvYScjDzZcUarK4zZGjpV
port: 3306
user: wandb
license: eyJhbGnUzaHgyQjQyQWhEU3...ZieKQ2x5GGfw
ingress:
annotations:
ingress.gcp.kubernetes.io/pre-shared-cert: abc-wandb-cert-creative-puma
kubernetes.io/ingress.class: gce
kubernetes.io/ingress.global-static-ip-name: abc-wandb-operator-address
Host
# Provide the FQDN with protocol
global:
# example host name, replace with your own
host: https://abc-wandb.sandbox-gcp.wandb.ml
Object storage (bucket)
AWS
global:
bucket:
provider: "s3"
name: ""
kmsKey: ""
region: ""
GCP
global:
bucket:
provider: "gcs"
name: ""
Azure
global:
bucket:
provider: "az"
name: ""
secretKey: ""
Other providers (Minio, Ceph, etc.)
For other S3-compatible providers, set the bucket configuration as an environment variable as follows:
global:
extraEnv:
"BUCKET": "s3://wandb:changeme@mydb.com/wandb?tls=true"
The variable contains a connection string in this form:
s3://$ACCESS_KEY:$SECRET_KEY@$HOST/$BUCKET_NAME
You can optionally tell W&B to connect only over TLS if you configure a trusted SSL certificate for your object store. To do so, add the tls query parameter to the URL:
s3://$ACCESS_KEY:$SECRET_KEY@$HOST/$BUCKET_NAME?tls=true
MySQL
global:
mysql:
# Example values, replace with your own
database: wandb_local
host: 10.218.0.2
name: wandb_local
password: 8wtX6cJH...ZcUarK4zZGjpV
port: 3306
user: wandb
License
global:
# Example license, replace with your own
license: eyJhbGnUzaHgyQjQy...VFnPS_KETXg1hi
Ingress
To identify the ingress class, see this FAQ entry.
Without TLS
global:
# IMPORTANT: Ingress is on the same level in the YAML as ‘global’ (not a child)
ingress:
class: ""
With TLS
Create a secret that contains the certificate
kubectl create secret tls wandb-ingress-tls --key wandb-ingress-tls.key --cert wandb-ingress-tls.crt
Reference the secret in the ingress configuration
global:
# IMPORTANT: Ingress is on the same level in the YAML as ‘global’ (not a child)
ingress:
class: ""
annotations:
{}
# kubernetes.io/ingress.class: nginx
# kubernetes.io/tls-acme: "true"
tls:
- secretName: wandb-ingress-tls
hosts:
- <HOST_URI>
In case of Nginx you might have to add the following annotation:
ingress:
annotations:
nginx.ingress.kubernetes.io/proxy-body-size: 64m
Custom Kubernetes ServiceAccounts
Specify custom Kubernetes service accounts to run the W&B pods.
The following snippet creates a service account as part of the deployment with the specified name:
app:
serviceAccount:
name: custom-service-account
create: true
parquet:
serviceAccount:
name: custom-service-account
create: true
global:
...
The subsystems “app” and “parquet” run under the specified service account. The other subsystems run under the default service account.
If the service account already exists on the cluster, set create: false:
app:
serviceAccount:
name: custom-service-account
create: false
parquet:
serviceAccount:
name: custom-service-account
create: false
global:
...
You can specify service accounts on different subsystems such as app, parquet, console, and others:
app:
serviceAccount:
name: custom-service-account
create: true
console:
serviceAccount:
name: custom-service-account
create: true
global:
...
The service accounts can be different between the subsystems:
app:
serviceAccount:
name: custom-service-account
create: false
console:
serviceAccount:
name: another-custom-service-account
create: true
global:
...
External Redis
redis:
install: false
global:
redis:
host: ""
port: 6379
password: ""
parameters: {}
caCert: ""
Alternatively, store the Redis password in a Kubernetes secret:
kubectl create secret generic redis-secret --from-literal=redis-password=supersecret
Reference it in the configuration below:
redis:
install: false
global:
redis:
host: redis.example
port: 9001
auth:
enabled: true
secret: redis-secret
key: redis-password
LDAP
Without TLS
global:
ldap:
enabled: true
# LDAP server address including "ldap://" or "ldaps://"
host:
# LDAP search base to use for finding users
baseDN:
# LDAP user to bind with (if not using anonymous bind)
bindDN:
# Secret name and key with LDAP password to bind with (if not using anonymous bind)
bindPW:
# LDAP attribute for email and group ID attribute names as comma separated string values.
attributes:
# LDAP group allow list
groupAllowList:
# Enable LDAP TLS
tls: false
With TLS
The LDAP TLS cert configuration requires a config map pre-created with the certificate content.
To create the config map you can use the following command:
kubectl create configmap ldap-tls-cert --from-file=certificate.crt
Then use the config map in the YAML as in the example below:
global:
ldap:
enabled: true
# LDAP server address including "ldap://" or "ldaps://"
host:
# LDAP search base to use for finding users
baseDN:
# LDAP user to bind with (if not using anonymous bind)
bindDN:
# Secret name and key with LDAP password to bind with (if not using anonymous bind)
bindPW:
# LDAP attribute for email and group ID attribute names as comma separated string values.
attributes:
# LDAP group allow list
groupAllowList:
# Enable LDAP TLS
tls: true
# ConfigMap name and key with CA certificate for LDAP server
tlsCert:
configMap:
name: "ldap-tls-cert"
key: "certificate.crt"
OIDC SSO
global:
auth:
sessionLengthHours: 720
oidc:
clientId: ""
secret: ""
authMethod: ""
issuer: ""
SMTP
global:
email:
smtp:
host: ""
port: 587
user: ""
password: ""
Environment Variables
global:
extraEnv:
GLOBAL_ENV: "example"
Custom certificate authority
customCACerts is a list and can take many certificates. Certificate authorities specified in customCACerts only apply to the W&B Server application.
global:
customCACerts:
- |
-----BEGIN CERTIFICATE-----
MIIBnDCCAUKgAwIBAg.....................fucMwCgYIKoZIzj0EAwIwLDEQ
MA4GA1UEChMHSG9tZU.....................tZUxhYiBSb290IENBMB4XDTI0
MDQwMTA4MjgzMFoXDT.....................oNWYggsMo8O+0mWLYMAoGCCqG
SM49BAMCA0gAMEUCIQ.....................hwuJgyQRaqMI149div72V2QIg
P5GD+5I+02yEp58Cwxd5Bj2CvyQwTjTO4hiVl1Xd0M0=
-----END CERTIFICATE-----
- |
-----BEGIN CERTIFICATE-----
MIIBxTCCAWugAwIB.......................qaJcwCgYIKoZIzj0EAwIwLDEQ
MA4GA1UEChMHSG9t.......................tZUxhYiBSb290IENBMB4XDTI0
MDQwMTA4MjgzMVoX.......................UK+moK4nZYvpNpqfvz/7m5wKU
SAAwRQIhAIzXZMW4.......................E8UFqsCcILdXjAiA7iTluM0IU
aIgJYVqKxXt25blH/VyBRzvNhViesfkNUQ==
-----END CERTIFICATE-----
Configuration Reference for W&B Operator
This section describes configuration options for the W&B Kubernetes operator (wandb-controller-manager). The operator receives its configuration in the form of a YAML file.
By default, the W&B Kubernetes operator does not need a configuration file. Create a configuration file if required. For example, you might need a configuration file to specify custom certificate authorities, deploy in an air-gapped environment, and so forth.
Find the full list of spec customization in the Helm repository.
Custom CA
A custom certificate authority (customCACerts) is a list and can take many certificates. These certificate authorities only apply to the W&B Kubernetes operator (wandb-controller-manager).
customCACerts:
- |
-----BEGIN CERTIFICATE-----
MIIBnDCCAUKgAwIBAg.....................fucMwCgYIKoZIzj0EAwIwLDEQ
MA4GA1UEChMHSG9tZU.....................tZUxhYiBSb290IENBMB4XDTI0
MDQwMTA4MjgzMFoXDT.....................oNWYggsMo8O+0mWLYMAoGCCqG
SM49BAMCA0gAMEUCIQ.....................hwuJgyQRaqMI149div72V2QIg
P5GD+5I+02yEp58Cwxd5Bj2CvyQwTjTO4hiVl1Xd0M0=
-----END CERTIFICATE-----
- |
-----BEGIN CERTIFICATE-----
MIIBxTCCAWugAwIB.......................qaJcwCgYIKoZIzj0EAwIwLDEQ
MA4GA1UEChMHSG9t.......................tZUxhYiBSb290IENBMB4XDTI0
MDQwMTA4MjgzMVoX.......................UK+moK4nZYvpNpqfvz/7m5wKU
SAAwRQIhAIzXZMW4.......................E8UFqsCcILdXjAiA7iTluM0IU
aIgJYVqKxXt25blH/VyBRzvNhViesfkNUQ==
-----END CERTIFICATE-----
FAQ
How to get the W&B Operator Console password
See Accessing the W&B Kubernetes Operator Management Console.
How to access the W&B Operator Console if Ingress doesn’t work
Execute the following command on a host that can reach the Kubernetes cluster:
kubectl port-forward svc/wandb-console 8082
Access the console in the browser at https://localhost:8082/console.
See Accessing the W&B Kubernetes Operator Management Console on how to get the password (Option 2).
How to view W&B Server logs
The application pod is named wandb-app-xxx.
kubectl get pods
kubectl logs wandb-XXXXX-XXXXX
How to identify the Kubernetes ingress class
You can get the ingress class installed in your cluster by running:
kubectl get ingressclass
3.2.1 - Kubernetes operator for air-gapped instances
Introduction
This guide provides step-by-step instructions to deploy the W&B Platform in air-gapped customer-managed environments.
Use an internal repository or registry to host the Helm charts and container images. Run all commands in a shell console with proper access to the Kubernetes cluster.
You could utilize similar commands in any continuous delivery tooling that you use to deploy Kubernetes applications.
Step 1: Prerequisites
Before starting, make sure your environment meets the following requirements:
- Kubernetes version >= 1.28
- Helm version >= 3
- Access to an internal container registry with the required W&B images
- Access to an internal Helm repository for W&B Helm charts
Step 2: Prepare internal container registry
Before proceeding with the deployment, you must ensure that the following container images are available in your internal container registry. These images are critical for the successful deployment of W&B components.
wandb/local 0.59.2
wandb/console 2.12.2
wandb/controller 1.13.0
otel/opentelemetry-collector-contrib 0.97.0
bitnami/redis 7.2.4-debian-12-r9
quay.io/prometheus/prometheus v2.47.0
quay.io/prometheus-operator/prometheus-config-reloader v0.67.0
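As a hedged sketch of how these images could be mirrored from a connected host (registry.yourdomain.com is a placeholder, as elsewhere in this guide):
for img in wandb/local:0.59.2 wandb/console:2.12.2 wandb/controller:1.13.0 \
  otel/opentelemetry-collector-contrib:0.97.0 bitnami/redis:7.2.4-debian-12-r9 \
  quay.io/prometheus/prometheus:v2.47.0 \
  quay.io/prometheus-operator/prometheus-config-reloader:v0.67.0; do
  docker pull "$img"                                      # pull from the public registry
  docker tag "$img" "registry.yourdomain.com/${img##*/}"  # keep only name:tag under the internal registry
  docker push "registry.yourdomain.com/${img##*/}"
done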
Step 3: Prepare internal Helm chart repository
Along with the container images, you must also ensure that the following Helm charts are available in your internal Helm chart repository.
The operator chart is used to deploy the W&B Operator (the Controller Manager), while the platform chart is used to deploy the W&B Platform using the values configured in the custom resource definition (CRD).
Step 4: Set up Helm repository
Now, configure the Helm repository to pull the W&B Helm charts from your internal repository. Run the following commands to add and update the Helm repository:
helm repo add local-repo https://charts.yourdomain.com
helm repo update
Step 5: Install the Kubernetes operator
The W&B Kubernetes operator, also known as the controller manager, is responsible for managing the W&B platform components. To install it in an air-gapped environment, you must configure it to use your internal container registry.
To do so, override the default image settings to use your internal container registry and set the key airgapped: true to indicate the expected deployment type. Update the values.yaml file as shown below:
image:
repository: registry.yourdomain.com/library/controller
tag: 1.13.3
airgapped: true
You can find all supported values in the official Kubernetes operator repository.
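With that values.yaml in place, the operator can be installed from the internal repository added in Step 4 (the release name and namespace follow the conventions used elsewhere in this guide; adjust to your setup):
helm upgrade --install operator local-repo/operator -n wandb-cr --create-namespace -f values.yaml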
Step 6: Configure CustomResourceDefinitions
After installing the W&B Kubernetes operator, you must configure the Custom Resource Definitions (CRDs) to point to your internal Helm repository and container registry.
This configuration ensures that the Kubernetes operator uses your internal registry and repository when it deploys the required components of the W&B platform.
Below is an example of how to configure the CRD.
apiVersion: apps.wandb.com/v1
kind: WeightsAndBiases
metadata:
labels:
app.kubernetes.io/instance: wandb
app.kubernetes.io/name: weightsandbiases
name: wandb
namespace: default
spec:
chart:
url: http://charts.yourdomain.com
name: operator-wandb
version: 0.18.0
values:
global:
host: https://wandb.yourdomain.com
license: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
bucket:
accessKey: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
secretKey: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
name: s3.yourdomain.com:port #Ex.: s3.yourdomain.com:9000
path: bucket_name
provider: s3
region: us-east-1
mysql:
database: wandb
host: mysql.home.lab
password: password
port: 3306
user: wandb
# Ensure it's set to use your own MySQL
mysql:
install: false
app:
image:
repository: registry.yourdomain.com/local
tag: 0.59.2
console:
image:
repository: registry.yourdomain.com/console
tag: 2.12.2
ingress:
annotations:
nginx.ingress.kubernetes.io/proxy-body-size: 64m
class: nginx
To deploy the W&B platform, the Kubernetes Operator uses the operator-wandb chart from your internal repository and uses the values from your CRD to configure the Helm chart.
You can find all supported values in the official Kubernetes operator repository.
Step 7: Deploy the W&B platform
Finally, after setting up the Kubernetes operator and the CRD, deploy the W&B platform using the following command:
kubectl apply -f wandb.yaml
FAQ
Refer to the below frequently asked questions (FAQs) and troubleshooting tips during the deployment process:
There is another ingress class. Can that class be used?
Yes, you can configure your ingress class by modifying the ingress settings in values.yaml.
The certificate bundle has more than one certificate. Would that work?
You must split the certificates into multiple entries in the customCACerts section of values.yaml.
How do you prevent the Kubernetes operator from applying unattended updates? Is that possible?
You can turn off auto-updates from the W&B console. Reach out to your W&B team with any questions about the supported versions. Also, note that W&B supports platform versions released in the last 6 months. W&B recommends performing periodic upgrades.
Does the deployment work if the environment has no connection to public repositories?
As long as you have enabled the airgapped: true configuration, the Kubernetes operator does not attempt to reach public repositories. The Kubernetes operator attempts to use your internal resources instead.
3.3 - Install on public cloud
3.3.1 - Deploy W&B Platform on AWS
W&B recommends using the W&B Server AWS Terraform Module to deploy the platform on AWS.
Before you start, W&B recommends that you choose one of the remote backends available for Terraform to store the State File.
The State File is the necessary resource to roll out upgrades or make changes in your deployment without recreating all components.
The Terraform Module deploys the following mandatory components:
- Load Balancer
- AWS Identity & Access Management (IAM)
- AWS Key Management System (KMS)
- Amazon Aurora MySQL
- Amazon VPC
- Amazon S3
- Amazon Route53
- Amazon Certificate Manager (ACM)
- Amazon Elastic Load Balancing (ALB)
- Amazon Secrets Manager
Other deployment options can also include the following optional components:
- ElastiCache for Redis
- SQS
Pre-requisite permissions
The account that runs Terraform must be able to create all components described in the Introduction, and it must have permission to create IAM Policies and IAM Roles and to assign roles to resources.
General steps
The steps on this topic are common for any deployment option covered by this documentation.
- Prepare the development environment.
  - Install Terraform
  - W&B recommends creating a Git repository for version control.
- Create the terraform.tfvars file. The tfvars file content can be customized according to the installation type, but the minimum recommended content looks like the example below:
namespace                 = "wandb"
license                   = "xxxxxxxxxxyyyyyyyyyyyzzzzzzz"
subdomain                 = "wandb-aws"
domain_name               = "wandb.ml"
zone_id                   = "xxxxxxxxxxxxxxxx"
allowed_inbound_cidr      = ["0.0.0.0/0"]
allowed_inbound_ipv6_cidr = ["::/0"]
Ensure that you define the variables in your tfvars file before you deploy; the namespace variable is a string that prefixes all resources created by Terraform.
The combination of subdomain and domain_name forms the FQDN under which W&B is configured. In the example above, the W&B FQDN is wandb-aws.wandb.ml, and zone_id identifies the DNS zone where the FQDN record is created.
Both allowed_inbound_cidr and allowed_inbound_ipv6_cidr also require setting; in the module, they are mandatory inputs. The preceding example permits access to the W&B installation from any source.
- Create the file versions.tf. This file contains the Terraform and Terraform provider versions required to deploy W&B in AWS:
provider "aws" {
  region = "eu-central-1"

  default_tags {
    tags = {
      GithubRepo  = "terraform-aws-wandb"
      GithubOrg   = "wandb"
      Environment = "Example"
      Example     = "PublicDnsExternal"
    }
  }
}
Refer to the Terraform Official Documentation to configure the AWS provider.
Optionally, but highly recommended, add the remote backend configuration mentioned at the beginning of this documentation.
- Create the file variables.tf. For every option configured in terraform.tfvars, Terraform requires a corresponding variable declaration:
variable "namespace" {
  type        = string
  description = "Name prefix used for resources"
}

variable "domain_name" {
  type        = string
  description = "Domain name used to access instance."
}

variable "subdomain" {
  type        = string
  default     = null
  description = "Subdomain for accessing the Weights & Biases UI."
}

variable "license" {
  type = string
}

variable "zone_id" {
  type        = string
  description = "Domain for creating the Weights & Biases subdomain on."
}

variable "allowed_inbound_cidr" {
  description = "CIDRs allowed to access wandb-server."
  nullable    = false
  type        = list(string)
}

variable "allowed_inbound_ipv6_cidr" {
  description = "CIDRs allowed to access wandb-server."
  nullable    = false
  type        = list(string)
}
Recommended deployment option
This is the most straightforward deployment option configuration: it creates all mandatory components and installs the latest version of W&B in the Kubernetes cluster.
- Create the main.tf file. In the same directory where you created the files in the General steps, create a file main.tf with the following content:
module "wandb_infra" {
  source  = "wandb/wandb/aws"
  version = "~>2.0"

  namespace   = var.namespace
  domain_name = var.domain_name
  subdomain   = var.subdomain
  zone_id     = var.zone_id

  allowed_inbound_cidr      = var.allowed_inbound_cidr
  allowed_inbound_ipv6_cidr = var.allowed_inbound_ipv6_cidr

  public_access                  = true
  external_dns                   = true
  kubernetes_public_access       = true
  kubernetes_public_access_cidrs = ["0.0.0.0/0"]
}

data "aws_eks_cluster" "app_cluster" {
  name = module.wandb_infra.cluster_id
}

data "aws_eks_cluster_auth" "app_cluster" {
  name = module.wandb_infra.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.app_cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.app_cluster.certificate_authority.0.data)
  token                  = data.aws_eks_cluster_auth.app_cluster.token
}

module "wandb_app" {
  source  = "wandb/wandb/kubernetes"
  version = "~>1.0"

  license = var.license

  host                       = module.wandb_infra.url
  bucket                     = "s3://${module.wandb_infra.bucket_name}"
  bucket_aws_region          = module.wandb_infra.bucket_region
  bucket_queue               = "internal://"
  database_connection_string = "mysql://${module.wandb_infra.database_connection_string}"

  # TF attempts to deploy while the work group is
  # still spinning up if you do not wait
  depends_on = [module.wandb_infra]
}

output "bucket_name" {
  value = module.wandb_infra.bucket_name
}

output "url" {
  value = module.wandb_infra.url
}
- Deploy W&B. Execute the following commands:
terraform init
terraform apply -var-file=terraform.tfvars
Enable REDIS
Another deployment option uses Redis to cache SQL queries and speed up the application response when loading metrics for experiments.
You need to add the option create_elasticache_subnet = true to the same main.tf file described in the Recommended deployment section to enable the cache.
module "wandb_infra" {
source = "wandb/wandb/aws"
version = "~>2.0"
namespace = var.namespace
domain_name = var.domain_name
subdomain = var.subdomain
zone_id = var.zone_id
create_elasticache_subnet = true
}
[...]
Enable message broker (queue)
Deployment option 3 consists of enabling an external message broker. This is optional, because W&B includes an embedded broker. This option does not bring a performance improvement.
The AWS resource that provides the message broker is SQS. To enable it, add the option use_internal_queue = false to the same main.tf file described in the Recommended deployment section.
module "wandb_infra" {
source = "wandb/wandb/aws"
version = "~>2.0"
namespace = var.namespace
domain_name = var.domain_name
subdomain = var.subdomain
zone_id = var.zone_id
use_internal_queue = false
[...]
}
Other deployment options
You can combine all three deployment options by adding all configurations to the same file.
The Terraform Module provides several options that can be combined along with the standard options and the minimal configuration found in Deployment - Recommended.
Manual configuration
To use an Amazon S3 bucket as a file storage backend for W&B, you will need to:
- Create an Amazon S3 Bucket and Bucket Notifications
- Create SQS Queue
- Grant Permissions to Node Running W&B
That is, you need to create a bucket, along with an SQS queue configured to receive object creation notifications from that bucket. Your instance needs permission to read from this queue.
Create an S3 Bucket and Bucket Notifications
Follow the procedure below to create an Amazon S3 bucket and enable bucket notifications.
- Navigate to Amazon S3 in the AWS Console.
- Select Create bucket.
- Within the Advanced settings, select Add notification within the Events section.
- Configure all object creation events to be sent to the SQS Queue you configured earlier.
Enable CORS access. Your CORS configuration should look like the following:
<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<CORSRule>
<AllowedOrigin>http://YOUR-W&B-SERVER-IP</AllowedOrigin>
<AllowedMethod>GET</AllowedMethod>
<AllowedMethod>PUT</AllowedMethod>
<AllowedHeader>*</AllowedHeader>
</CORSRule>
</CORSConfiguration>
Create an SQS Queue
Follow the procedure below to create an SQS Queue:
- Navigate to Amazon SQS in the AWS Console.
- Select Create queue.
- From the Details section, select a Standard queue type.
- Within the Access policy section, grant permission for the following actions:
SendMessage
ReceiveMessage
ChangeMessageVisibility
DeleteMessage
GetQueueUrl
Optionally, add an advanced access policy in the Access policy section. For example, the following policy statement allows Amazon S3 to send messages to the queue:
{
"Version" : "2012-10-17",
"Statement" : [
{
"Effect" : "Allow",
"Principal" : "*",
"Action" : ["sqs:SendMessage"],
"Resource" : "<sqs-queue-arn>",
"Condition" : {
"ArnEquals" : { "aws:SourceArn" : "<s3-bucket-arn>" }
}
}
]
}
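If you manage AWS with Terraform rather than the console, the bucket, CORS rule, queue, and notification described above can be sketched as follows; the bucket and queue names are hypothetical placeholders:
resource "aws_s3_bucket" "wandb" {
  bucket = "my-wandb-bucket" # hypothetical name
}

resource "aws_s3_bucket_cors_configuration" "wandb" {
  bucket = aws_s3_bucket.wandb.id

  cors_rule {
    allowed_origins = ["http://YOUR-W&B-SERVER-IP"]
    allowed_methods = ["GET", "PUT"]
    allowed_headers = ["*"]
  }
}

resource "aws_sqs_queue" "wandb" {
  name = "my-wandb-queue" # hypothetical name
}

# Allow the bucket to send object-creation events to the queue
resource "aws_sqs_queue_policy" "wandb" {
  queue_url = aws_sqs_queue.wandb.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = ["sqs:SendMessage"]
      Resource  = aws_sqs_queue.wandb.arn
      Condition = { ArnEquals = { "aws:SourceArn" = aws_s3_bucket.wandb.arn } }
    }]
  })
}

resource "aws_s3_bucket_notification" "wandb" {
  bucket = aws_s3_bucket.wandb.id

  queue {
    queue_arn = aws_sqs_queue.wandb.arn
    events    = ["s3:ObjectCreated:*"]
  }

  depends_on = [aws_sqs_queue_policy.wandb]
}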
Grant permissions to node that runs W&B
The node where W&B server is running must be configured to permit access to Amazon S3 and Amazon SQS. Depending on the type of server deployment you have opted for, you may need to add the following policy statements to your node role:
{
"Statement":[
{
"Sid":"",
"Effect":"Allow",
"Action":"s3:*",
"Resource":"arn:aws:s3:::<WANDB_BUCKET>"
},
{
"Sid":"",
"Effect":"Allow",
"Action":[
"sqs:*"
],
"Resource":"arn:aws:sqs:<REGION>:<ACCOUNT>:<WANDB_QUEUE>"
}
]
}
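As a hedged Terraform equivalent of the preceding policy (the role name here is a hypothetical placeholder; use the instance role your deployment actually runs under):
resource "aws_iam_role_policy" "wandb_node" {
  name = "wandb-file-storage-access"
  role = "my-wandb-node-role" # hypothetical instance role name

  policy = jsonencode({
    Statement = [
      {
        Sid      = ""
        Effect   = "Allow"
        Action   = "s3:*"
        Resource = "arn:aws:s3:::<WANDB_BUCKET>"
      },
      {
        Sid      = ""
        Effect   = "Allow"
        Action   = ["sqs:*"]
        Resource = "arn:aws:sqs:<REGION>:<ACCOUNT>:<WANDB_QUEUE>"
      }
    ]
  })
}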
Configure W&B server
Finally, configure your W&B Server.
- Navigate to the W&B settings page at
http(s)://YOUR-W&B-SERVER-HOST/system-admin
. - Enable the Use an external file storage backend option.
- Provide information about your Amazon S3 bucket, region, and Amazon SQS queue in the following format:
- File Storage Bucket:
s3://<bucket-name>
- File Storage Region (AWS only):
<region>
- Notification Subscription:
sqs://<queue-name>
- Select Update settings to apply the new settings.
Upgrade your W&B version
Follow the steps outlined here to update W&B:
- Add
wandb_version
to your configuration in yourwandb_app
module. Provide the version of W&B you want to upgrade to. For example, the following line specifies W&B version 0.48.1:
module "wandb_app" {
source = "wandb/wandb/kubernetes"
version = "~>1.0"
license = var.license
wandb_version = "0.48.1"
}
Alternatively, you can add wandb_version to the terraform.tfvars file: create a variable with the same name and, instead of the literal value, use var.wandb_version (see the sketch after this list).
- After you update your configuration, complete the steps described in the Recommended deployment section.
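A minimal sketch of the variable-based approach; the variable declaration itself is illustrative, only wandb_version is an actual module argument:
# variables.tf
variable "wandb_version" {
  type        = string
  description = "Version of W&B Server to deploy"
}

# terraform.tfvars
# wandb_version = "0.48.1"

# main.tf
module "wandb_app" {
  source  = "wandb/wandb/kubernetes"
  version = "~>1.0"

  license       = var.license
  wandb_version = var.wandb_version
  ...
}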
Migrate to operator-based AWS Terraform modules
This section details the steps required to upgrade from pre-operator to post-operator environments using the terraform-aws-wandb module.
Before and after architecture
Previously, the W&B architecture used:
module "wandb_infra" {
source = "wandb/wandb/aws"
version = "1.16.10"
...
}
to control the infrastructure, and this module to deploy the W&B Server:
module "wandb_app" {
source = "wandb/wandb/kubernetes"
version = "1.12.0"
}
Post-transition, the architecture uses:
module "wandb_infra" {
source = "wandb/wandb/aws"
version = "4.7.2"
...
}
to manage both the installation of infrastructure and the W&B Server to the Kubernetes cluster, thus eliminating the need for the module "wandb_app"
in post-operator.tf
.
This architectural shift enables additional features (like OpenTelemetry, Prometheus, HPAs, Kafka, and image updates) without requiring manual Terraform operations by SRE/Infrastructure teams.
To commence with a base installation of the W&B Pre-Operator, ensure that post-operator.tf
has a .disabled
file extension and pre-operator.tf
is active (that is, does not have a .disabled
extension). Those files can be found here.
Prerequisites
Before initiating the migration process, ensure the following prerequisites are met:
- Egress: The deployment can’t be airgapped. It needs access to deploy.wandb.ai to get the latest spec for the Release Channel.
- AWS Credentials: Proper AWS credentials configured to interact with your AWS resources.
- Terraform Installed: The latest version of Terraform should be installed on your system.
- Route53 Hosted Zone: An existing Route53 hosted zone corresponding to the domain under which the application will be served.
- Pre-Operator Terraform Files: Ensure
pre-operator.tf
and associated variable files likepre-operator.tfvars
are correctly set up.
Pre-Operator Setup
Execute the following Terraform commands to initialize and apply the configuration for the Pre-Operator setup:
terraform init -upgrade
terraform apply -var-file=./pre-operator.tfvars
Your pre-operator.tfvars file should look something like this:
namespace = "operator-upgrade"
domain_name = "sandbox-aws.wandb.ml"
zone_id = "Z032246913CW32RVRY0WU"
subdomain = "operator-upgrade"
wandb_license = "ey..."
wandb_version = "0.51.2"
The pre-operator.tf
configuration calls two modules:
module "wandb_infra" {
source = "wandb/wandb/aws"
version = "1.16.10"
...
}
This module spins up the infrastructure.
module "wandb_app" {
source = "wandb/wandb/kubernetes"
version = "1.12.0"
}
This module deploys the application.
Post-Operator Setup
Make sure that pre-operator.tf
has a .disabled
extension, and post-operator.tf
is active.
The post-operator.tfvars
includes additional variables:
...
# wandb_version = "0.51.2" is now managed via the Release Channel or set in the User Spec.
# Required Operator Variables for Upgrade:
size = "small"
enable_dummy_dns = true
enable_operator_alb = true
custom_domain_filter = "sandbox-aws.wandb.ml"
Run the following commands to initialize and apply the Post-Operator configuration:
terraform init -upgrade
terraform apply -var-file=./post-operator.tfvars
The plan and apply steps will update the following resources:
actions:
create:
- aws_efs_backup_policy.storage_class
- aws_efs_file_system.storage_class
- aws_efs_mount_target.storage_class["0"]
- aws_efs_mount_target.storage_class["1"]
- aws_eks_addon.efs
- aws_iam_openid_connect_provider.eks
- aws_iam_policy.secrets_manager
- aws_iam_role_policy_attachment.ebs_csi
- aws_iam_role_policy_attachment.eks_efs
- aws_iam_role_policy_attachment.node_secrets_manager
- aws_security_group.storage_class_nfs
- aws_security_group_rule.nfs_ingress
- random_pet.efs
- aws_s3_bucket_acl.file_storage
- aws_s3_bucket_cors_configuration.file_storage
- aws_s3_bucket_ownership_controls.file_storage
- aws_s3_bucket_server_side_encryption_configuration.file_storage
- helm_release.operator
- helm_release.wandb
- aws_cloudwatch_log_group.this[0]
- aws_iam_policy.default
- aws_iam_role.default
- aws_iam_role_policy_attachment.default
- helm_release.external_dns
- aws_default_network_acl.this[0]
- aws_default_route_table.default[0]
- aws_iam_policy.default
- aws_iam_role.default
- aws_iam_role_policy_attachment.default
- helm_release.aws_load_balancer_controller
update_in_place:
- aws_iam_policy.node_IMDSv2
- aws_iam_policy.node_cloudwatch
- aws_iam_policy.node_kms
- aws_iam_policy.node_s3
- aws_iam_policy.node_sqs
- aws_eks_cluster.this[0]
- aws_elasticache_replication_group.default
- aws_rds_cluster.this[0]
- aws_rds_cluster_instance.this["1"]
- aws_default_security_group.this[0]
- aws_subnet.private[0]
- aws_subnet.private[1]
- aws_subnet.public[0]
- aws_subnet.public[1]
- aws_launch_template.workers["primary"]
destroy:
- kubernetes_config_map.config_map
- kubernetes_deployment.wandb
- kubernetes_priority_class.priority
- kubernetes_secret.secret
- kubernetes_service.prometheus
- kubernetes_service.service
- random_id.snapshot_identifier[0]
replace:
- aws_autoscaling_attachment.autoscaling_attachment["primary"]
- aws_route53_record.alb
- aws_eks_node_group.workers["primary"]
Note that in post-operator.tf
, there is a single:
module "wandb_infra" {
source = "wandb/wandb/aws"
version = "4.7.2"
...
}
Changes in the post-operator configuration:
- Update Required Providers: Change required_providers.aws.version from 3.6 to 4.0 for provider compatibility.
- DNS and Load Balancer Configuration: Integrate enable_dummy_dns and enable_operator_alb to manage DNS records and AWS Load Balancer setup through an Ingress.
- License and Size Configuration: Transfer the license and size parameters directly to the wandb_infra module to match new operational requirements.
- Custom Domain Handling: If necessary, use custom_domain_filter to troubleshoot DNS issues by checking the External DNS pod logs within the kube-system namespace.
- Helm Provider Configuration: Enable and configure the Helm provider to manage Kubernetes resources effectively:
provider "helm" {
kubernetes {
host = data.aws_eks_cluster.app_cluster.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.app_cluster.certificate_authority[0].data)
token = data.aws_eks_cluster_auth.app_cluster.token
exec {
api_version = "client.authentication.k8s.io/v1beta1"
args = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.app_cluster.name]
command = "aws"
}
}
}
This comprehensive setup ensures a smooth transition from the Pre-Operator to the Post-Operator configuration, leveraging new efficiencies and capabilities enabled by the operator model.
3.3.2 - Deploy W&B Platform on GCP
If you've decided to self-manage W&B Server, W&B recommends using the W&B Server GCP Terraform Module to deploy the platform on GCP.
The module documentation is extensive and contains all available options that can be used.
Before you start, W&B recommends that you choose one of the remote backends available for Terraform to store the State File.
The state file is required to roll out upgrades or make changes in your deployment without recreating all components.
The Terraform Module will deploy the following mandatory
components:
- VPC
- Cloud SQL for MySQL
- Cloud Storage Bucket
- Google Kubernetes Engine
- KMS Crypto Key
- Load Balancer
Other deployment options can also include the following optional components:
- Memorystore for Redis
- Pub/Sub messaging system
Pre-requisite permissions
The account that runs Terraform needs to have the role roles/owner
in the GCP project used.
General steps
The steps on this topic are common for any deployment option covered by this documentation.
- Prepare the development environment.
- Install Terraform
- We recommend creating a Git repository with the code that will be used, but you can keep your files locally.
- Create a project in Google Cloud Console
- Authenticate with GCP (make sure to install gcloud first)
gcloud auth application-default login
- Create the terraform.tfvars file. The tfvars file content can be customized according to the installation type, but the minimum recommended configuration will look like the example below.
project_id  = "wandb-project"
region      = "europe-west2"
zone        = "europe-west2-a"
namespace   = "wandb"
license     = "xxxxxxxxxxyyyyyyyyyyyzzzzzzz"
subdomain   = "wandb-gcp"
domain_name = "wandb.ml"
The variables defined here need to be decided before the deployment. The namespace variable is a string that prefixes all resources created by Terraform. The combination of subdomain and domain_name forms the FQDN at which W&B is served. In the example above, the W&B FQDN is wandb-gcp.wandb.ml.
- Create the file variables.tf
For every option configured in the terraform.tfvars, Terraform requires a corresponding variable declaration.
variable "project_id" {
  type        = string
  description = "Project ID"
}

variable "region" {
  type        = string
  description = "Google region"
}

variable "zone" {
  type        = string
  description = "Google zone"
}

variable "namespace" {
  type        = string
  description = "Namespace prefix used for resources"
}

variable "domain_name" {
  type        = string
  description = "Domain name for accessing the Weights & Biases UI."
}

variable "subdomain" {
  type        = string
  description = "Subdomain for accessing the Weights & Biases UI."
}

variable "license" {
  type        = string
  description = "W&B License"
}
Deployment - Recommended (~20 mins)
This is the most straightforward deployment configuration: it creates all mandatory components and installs the latest version of W&B in the Kubernetes cluster.
- Create the main.tf
In the same directory where you created the files in the General Steps, create a file main.tf with the following content:
provider "google" {
  project = var.project_id
  region  = var.region
  zone    = var.zone
}

provider "google-beta" {
  project = var.project_id
  region  = var.region
  zone    = var.zone
}

data "google_client_config" "current" {}

provider "kubernetes" {
  host                   = "https://${module.wandb.cluster_endpoint}"
  cluster_ca_certificate = base64decode(module.wandb.cluster_ca_certificate)
  token                  = data.google_client_config.current.access_token
}

# Spin up all required services
module "wandb" {
  source  = "wandb/wandb/google"
  version = "~> 5.0"

  namespace   = var.namespace
  license     = var.license
  domain_name = var.domain_name
  subdomain   = var.subdomain
}

# You'll want to update your DNS with the provisioned IP address
output "url" {
  value = module.wandb.url
}

output "address" {
  value = module.wandb.address
}

output "bucket_name" {
  value = module.wandb.bucket_name
}
- Deploy W&B
To deploy W&B, execute the following commands:
terraform init
terraform apply -var-file=terraform.tfvars
Deployment with REDIS Cache
Another deployment option uses Redis
to cache the SQL queries and speed up the application response when loading the metrics for the experiments.
You need to add the option create_redis = true
to the same main.tf
file specified in the recommended Deployment option section to enable the cache.
[...]
module "wandb" {
source = "wandb/wandb/google"
version = "~> 5.0"
namespace = var.namespace
license = var.license
domain_name = var.domain_name
subdomain = var.subdomain
allowed_inbound_cidrs = ["*"]
#Enable Redis
create_redis = true
}
[...]
Deployment with External Queue
Deployment option 3 enables an external message broker. This is optional because W&B ships with an embedded broker, and using an external one does not improve performance.
The GCP resource that provides the message broker is Pub/Sub. To enable it, add the option use_internal_queue = false to the same main.tf specified in the recommended deployment section.
[...]
module "wandb" {
source = "wandb/wandb/google"
version = "~> 5.0"
namespace = var.namespace
license = var.license
domain_name = var.domain_name
subdomain = var.subdomain
allowed_inbound_cidrs = ["*"]
#Create and use Pub/Sub
use_internal_queue = false
}
[...]
Other deployment options
You can combine all three deployment options by adding all configurations to the same file.
The Terraform Module provides several options that can be combined along with the standard options and the minimal configuration found in Deployment - Recommended.
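For example, a single module block enabling both the Redis cache and the Pub/Sub queue might look like the following sketch, built only from the options shown above:
[...]
module "wandb" {
  source  = "wandb/wandb/google"
  version = "~> 5.0"

  namespace   = var.namespace
  license     = var.license
  domain_name = var.domain_name
  subdomain   = var.subdomain

  # Option: Redis cache for SQL queries
  create_redis = true

  # Option: Pub/Sub message broker
  use_internal_queue = false
}
[...]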
Manual configuration
To use a GCP Storage bucket as a file storage backend for W&B, you will need to create a Pub/Sub topic and subscription, a storage bucket, and a bucket notification:
Create PubSub Topic and Subscription
Follow the procedure below to create a PubSub topic and subscription:
- Navigate to the Pub/Sub service within the GCP Console
- Select Create Topic and provide a name for your topic.
- At the bottom of the page, select Create subscription. Ensure Delivery Type is set to Pull.
- Click Create.
Make sure the service account or account that your instance is running has the pubsub.admin
role on this subscription. For details, see https://cloud.google.com/pubsub/docs/access-control#console.
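If you manage GCP IAM with Terraform, this grant can be sketched as follows; the subscription and service account names are hypothetical placeholders:
resource "google_pubsub_subscription_iam_member" "wandb" {
  subscription = "my-wandb-subscription"
  role         = "roles/pubsub.admin"
  member       = "serviceAccount:my-wandb-sa@my-project.iam.gserviceaccount.com"
}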
Create Storage Bucket
- Navigate to the Cloud Storage Buckets page.
- Select Create bucket and provide a name for your bucket. Ensure you choose a Standard storage class.
Ensure that the service account or account that your instance runs as has:
- access to the bucket you created in the previous step
- the storage.objectAdmin role on this bucket. For details, see https://cloud.google.com/storage/docs/access-control/using-iam-permissions#bucket-add
- the iam.serviceAccounts.signBlob permission in GCP to create signed file URLs. Add the Service Account Token Creator role to the service account or IAM member that your instance runs as to grant this permission (a Terraform sketch of these grants appears after the CORS steps below).
Then enable CORS access. This can only be done using the command line. First, create a file with the following CORS configuration.
cors:
- maxAgeSeconds: 3600
method:
- GET
- PUT
origin:
- '<YOUR_W&B_SERVER_HOST>'
responseHeader:
- Content-Type
Note that the scheme, host, and port of the values for the origin must match exactly.
- Make sure you have gcloud installed and are logged into the correct GCP project.
- Next, run the following:
gcloud storage buckets update gs://<BUCKET_NAME> --cors-file=<CORS_CONFIG_FILE>
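The bucket and signing grants described above can likewise be sketched in Terraform; the bucket, project, and service account names are hypothetical placeholders:
resource "google_storage_bucket_iam_member" "wandb_object_admin" {
  bucket = "my-wandb-bucket"
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:my-wandb-sa@my-project.iam.gserviceaccount.com"
}

# Token Creator on the service account itself enables signBlob for signed file URLs
resource "google_service_account_iam_member" "wandb_token_creator" {
  service_account_id = "projects/my-project/serviceAccounts/my-wandb-sa@my-project.iam.gserviceaccount.com"
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "serviceAccount:my-wandb-sa@my-project.iam.gserviceaccount.com"
}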
Create PubSub Notification
Follow the procedure below in your command line to create a notification stream from the Storage Bucket to the Pub/Sub topic.
- Make sure you have gcloud installed.
- Log into your GCP Project.
- Run the following in your terminal:
gcloud pubsub topics list # list names of topics for reference
gcloud storage ls # list names of buckets for reference
# create bucket notification
gcloud storage buckets notifications create gs://<BUCKET_NAME> --topic=<TOPIC_NAME>
Further reference is available on the Cloud Storage website.
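A hedged Terraform equivalent of this notification stream; it assumes the GCS service agent must be allowed to publish to the topic, and the bucket and topic names are hypothetical:
data "google_storage_project_service_account" "gcs" {}

resource "google_pubsub_topic_iam_member" "gcs_publisher" {
  topic  = "my-wandb-topic"
  role   = "roles/pubsub.publisher"
  member = "serviceAccount:${data.google_storage_project_service_account.gcs.email_address}"
}

resource "google_storage_notification" "wandb" {
  bucket         = "my-wandb-bucket"
  topic          = "my-wandb-topic"
  payload_format = "JSON_API_V1"
  event_types    = ["OBJECT_FINALIZE"]

  depends_on = [google_pubsub_topic_iam_member.gcs_publisher]
}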
Configure W&B server
- Finally, navigate to the W&B
System Connections
page athttp(s)://YOUR-W&B-SERVER-HOST/console/settings/system
. - Select the provider Google Cloud Storage (gcs).
- Provide the name of the GCS bucket.
- Press Update settings to apply the new settings.
Upgrade W&B Server
Follow the steps outlined here to update W&B:
- Add
wandb_version
to your configuration in yourwandb_app
module. Provide the version of W&B you want to upgrade to. For example, the following line specifies W&B version 0.58.1:
module "wandb_app" {
source = "wandb/wandb/kubernetes"
version = "~>5.0"
license = var.license
wandb_version = "0.58.1"
}
Alternatively, you can add wandb_version to the terraform.tfvars file: create a variable with the same name and, instead of the literal value, use var.wandb_version.
- After you update your configuration, complete the steps described in the Deployment option section.
3.3.3 - Deploy W&B Platform on Azure
If you've decided to self-manage W&B Server, W&B recommends using the W&B Server Azure Terraform Module to deploy the platform on Azure.
The module documentation is extensive and contains all available options that can be used. We will cover some deployment options in this document.
Before you start, we recommend you choose one of the remote backends available for Terraform to store the State File.
The state file is required to roll out upgrades or make changes in your deployment without recreating all components.
The Terraform Module will deploy the following mandatory
components:
- Azure Resource Group
- Azure Virtual Network (VPC)
- Azure MySQL Flexible Server
- Azure Storage Account & Blob Storage
- Azure Kubernetes Service
- Azure Application Gateway
Other deployment options can also include the following optional components:
- Azure Cache for Redis
- Azure Event Grid
Pre-requisite permissions
The simplest way to configure the AzureRM provider is via the Azure CLI, but in the case of automation, using an Azure Service Principal can also be useful. Regardless of the authentication method used, the account that runs Terraform needs to be able to create all components described in the Introduction.
General steps
The steps on this topic are common for any deployment option covered by this documentation.
- Prepare the development environment.
- Install Terraform
- We recommend creating a Git repository with the code that will be used, but you can keep your files locally.
- Create the terraform.tfvars file. The tfvars file content can be customized according to the installation type, but the minimum recommended configuration will look like the example below.
namespace     = "wandb"
wandb_license = "xxxxxxxxxxyyyyyyyyyyyzzzzzzz"
subdomain     = "wandb-aws"
domain_name   = "wandb.ml"
location      = "westeurope"
The variables defined here need to be decided before the deployment. The namespace variable is a string that prefixes all resources created by Terraform. The combination of subdomain and domain_name forms the FQDN at which W&B is served. In the example above, the W&B FQDN is wandb-aws.wandb.ml.
- Create the file versions.tf
This file will contain the Terraform and Terraform provider versions required to deploy W&B in Azure.
terraform {
required_version = "~> 1.3"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 3.17"
}
}
}
Refer to the Terraform Official Documentation to configure the Azure provider.
Optionally, but highly recommended, you can add the remote backend configuration mentioned at the beginning of this documentation.
- Create the file variables.tf. For every option configured in the terraform.tfvars, Terraform requires a corresponding variable declaration.
variable "namespace" {
type = string
description = "String used for prefix resources."
}
variable "location" {
type = string
description = "Azure Resource Group location"
}
variable "domain_name" {
type = string
description = "Domain for accessing the Weights & Biases UI."
}
variable "subdomain" {
type = string
default = null
description = "Subdomain for accessing the Weights & Biases UI. Default creates record at Route53 Route."
}
variable "license" {
type = string
description = "Your wandb/local license"
}
Recommended deployment
This is the most straightforward deployment configuration: it creates all mandatory components and installs the latest version of W&B in the Kubernetes cluster.
- Create the main.tf
In the same directory where you created the files in the General Steps, create a file main.tf with the following content:
provider "azurerm" {
features {}
}
provider "kubernetes" {
host = module.wandb.cluster_host
cluster_ca_certificate = base64decode(module.wandb.cluster_ca_certificate)
client_key = base64decode(module.wandb.cluster_client_key)
client_certificate = base64decode(module.wandb.cluster_client_certificate)
}
provider "helm" {
kubernetes {
host = module.wandb.cluster_host
cluster_ca_certificate = base64decode(module.wandb.cluster_ca_certificate)
client_key = base64decode(module.wandb.cluster_client_key)
client_certificate = base64decode(module.wandb.cluster_client_certificate)
}
}
# Spin up all required services
module "wandb" {
source = "wandb/wandb/azurerm"
version = "~> 1.2"
namespace = var.namespace
location = var.location
license = var.license
domain_name = var.domain_name
subdomain = var.subdomain
deletion_protection = false
tags = {
"Example" : "PublicDns"
}
}
output "address" {
value = module.wandb.address
}
output "url" {
value = module.wandb.url
}
- Deploy W&B
To deploy W&B, execute the following commands:
terraform init
terraform apply -var-file=terraform.tfvars
Deployment with REDIS Cache
Another deployment option uses Redis
to cache the SQL queries and speed up the application response when loading the metrics for the experiments.
You must add the option create_redis = true
to the same main.tf
file that you used in recommended deployment to enable the cache.
# Spin up all required services
module "wandb" {
source = "wandb/wandb/azurerm"
version = "~> 1.2"
namespace = var.namespace
location = var.location
license = var.license
domain_name = var.domain_name
subdomain = var.subdomain
create_redis = true # Create Redis
[...]
}
Deployment with External Queue
Deployment option 3 enables an external message broker. This is optional because W&B ships with an embedded broker, and using an external one does not improve performance.
The Azure resource that provides the message broker is Azure Event Grid. To enable it, add the option use_internal_queue = false to the same main.tf that you used in the recommended deployment.
# Spin up all required services
module "wandb" {
source = "wandb/wandb/azurerm"
version = "~> 1.2"
namespace = var.namespace
location = var.location
license = var.license
domain_name = var.domain_name
subdomain = var.subdomain
use_internal_queue = false # Enable Azure Event Grid
[...]
}
Other deployment options
You can combine all three deployment options by adding all configurations to the same file. The Terraform Module provides several options that you can combine, along with the standard options and the minimal configuration found in the recommended deployment.
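For example, a single module block enabling both Redis and Azure Event Grid might look like the following sketch, built only from the options shown above:
# Spin up all required services
module "wandb" {
  source  = "wandb/wandb/azurerm"
  version = "~> 1.2"

  namespace   = var.namespace
  location    = var.location
  license     = var.license
  domain_name = var.domain_name
  subdomain   = var.subdomain

  create_redis       = true  # Create Redis
  use_internal_queue = false # Enable Azure Event Grid
}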
3.4 - Deploy W&B Platform On-premises
Reach out to the W&B Sales Team for related questions: contact@wandb.com.
Infrastructure guidelines
Before you start deploying W&B, refer to the reference architecture, especially the infrastructure requirements.
MySQL database
W&B supports MySQL 8, versions 8.0.28 and above. There are a number of enterprise services that make operating a scalable MySQL database simpler. W&B recommends looking into one of the following solutions:
https://www.percona.com/software/mysql-database/percona-server
https://github.com/mysql/mysql-operator
Satisfy the conditions below if you run W&B Server MySQL 8.0 or when you upgrade from MySQL 5.7 to 8.0:
binlog_format = 'ROW'
innodb_online_alter_log_max_size = 268435456
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
binlog_row_image = 'MINIMAL'
Due to some changes in the way that MySQL 8.0 handles sort_buffer_size
, you might need to update the sort_buffer_size
parameter from its default value of 262144
. The recommendation is to set the value to 67108864
(64MiB) to ensure that MySQL works efficiently with W&B. MySQL supports this configuration starting with v8.0.28.
Database considerations
Create a database and a user with the following SQL query. Replace SOME_PASSWORD
with a password of your choice:
CREATE USER 'wandb_local'@'%' IDENTIFIED BY 'SOME_PASSWORD';
CREATE DATABASE wandb_local CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
GRANT ALL ON wandb_local.* TO 'wandb_local'@'%' WITH GRANT OPTION;
Parameter group configuration
Ensure that the following parameter groups are set to tune the database performance:
binlog_format = 'ROW'
innodb_online_alter_log_max_size = 268435456
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
binlog_row_image = 'MINIMAL'
sort_buffer_size = 67108864
Object storage
The object store can be externally hosted on a MinIO cluster, or any Amazon S3 compatible object store that has support for signed URLs. Run the following script to check if your object store supports signed URLs.
Additionally, the following CORS policy needs to be applied to the object store.
<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<CORSRule>
<AllowedOrigin>http://YOUR-W&B-SERVER-IP</AllowedOrigin>
<AllowedMethod>GET</AllowedMethod>
<AllowedMethod>PUT</AllowedMethod>
<AllowedMethod>HEAD</AllowedMethod>
<AllowedHeader>*</AllowedHeader>
</CORSRule>
</CORSConfiguration>
You can specify your credentials in a connection string when you connect to an Amazon S3 compatible object store. For example, you can specify the following:
s3://$ACCESS_KEY:$SECRET_KEY@$HOST/$BUCKET_NAME
You can optionally tell W&B to only connect over TLS if you configure a trusted SSL certificate for your object store. To do so, add the tls
query parameter to the URL. For example, the following URL example demonstrates how to add the TLS query parameter to an Amazon S3 URI:
s3://$ACCESS_KEY:$SECRET_KEY@$HOST/$BUCKET_NAME?tls=true
Set BUCKET_QUEUE
to internal://
if you use third-party object stores. This tells the W&B server to manage all object notifications internally instead of depending on an external SQS queue or equivalent.
The most important things to consider when running your own object store are:
- Storage capacity and performance. It's fine to use magnetic disks, but you should monitor the capacity of these disks. Average W&B usage results in tens to hundreds of gigabytes. Heavy usage could result in petabytes of storage consumption.
- Fault tolerance. At a minimum, the physical disk storing the objects should be on a RAID array. If you use MinIO, consider running it in distributed mode.
- Availability. Monitoring should be configured to ensure the storage is available.
There are many enterprise alternatives to running your own object storage service.
MinIO set up
If you use MinIO, you can run the following commands to create a bucket.
mc config host add local http://$MINIO_HOST:$MINIO_PORT "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" --api s3v4
mc mb --region=us-east1 local/local-files
Deploy W&B Server application to Kubernetes
The recommended installation method is with the official W&B Helm chart. Follow this section to deploy the W&B Server application.
OpenShift
W&B supports operating from within an OpenShift Kubernetes cluster.
Run the container as an unprivileged user
By default, containers use a $UID
of 999. Specify $UID
>= 100000 and a $GID
of 0 if your orchestrator requires that the container run as a non-root user.
The group ID must remain 0 ($GID=0) for file system permissions to function properly.
An example security context for Kubernetes looks similar to the following:
spec:
securityContext:
runAsUser: 100000
runAsGroup: 0
Networking
Load balancer
Run a load balancer that terminates network requests at the appropriate network boundary.
Common load balancers include:
Ensure that all machines used to execute machine learning payloads, and the devices used to access the service through web browsers, can communicate to this endpoint.
SSL / TLS
W&B Server does not terminate SSL. If your security policies require SSL communication within your trusted networks, consider using a tool like Istio and sidecar containers. The load balancer itself should terminate SSL with a valid certificate. Using self-signed certificates is not supported and will cause a number of challenges for users. If possible, using a service like Let's Encrypt is a great way to provide trusted certificates to your load balancer. Services like Caddy and Cloudflare manage SSL for you.
Example nginx configuration
The following is an example configuration using nginx as a reverse proxy.
events {}
http {
# If we receive X-Forwarded-Proto, pass it through; otherwise, pass along the
# scheme used to connect to this server
map $http_x_forwarded_proto $proxy_x_forwarded_proto {
default $http_x_forwarded_proto;
'' $scheme;
}
# Also, in the above case, force HTTPS
map $http_x_forwarded_proto $sts {
default '';
"https" "max-age=31536000; includeSubDomains";
}
# If we receive X-Forwarded-Host, pass it through; otherwise, pass along $http_host
map $http_x_forwarded_host $proxy_x_forwarded_host {
default $http_x_forwarded_host;
'' $http_host;
}
# If we receive X-Forwarded-Port, pass it through; otherwise, pass along the
# server port the client connected to
map $http_x_forwarded_port $proxy_x_forwarded_port {
default $http_x_forwarded_port;
'' $server_port;
}
# If we receive Upgrade, set Connection to "upgrade"; otherwise, delete any
# Connection header that may have been passed to this server
map $http_upgrade $proxy_connection {
default upgrade;
'' close;
}
server {
listen 443 ssl;
server_name www.example.com;
ssl_certificate www.example.com.crt;
ssl_certificate_key www.example.com.key;
proxy_http_version 1.1;
proxy_buffering off;
proxy_set_header Host $http_host;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $proxy_connection;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $proxy_x_forwarded_proto;
proxy_set_header X-Forwarded-Host $proxy_x_forwarded_host;
location / {
proxy_pass http://$YOUR_UPSTREAM_SERVER_IP:8080/;
}
keepalive_timeout 10;
}
}
Verify your installation
Verify that your W&B Server is configured properly. Run the following commands in your terminal:
pip install wandb
wandb login --host=https://YOUR_DNS_DOMAIN
wandb verify
Check log files to view any errors the W&B Server hits at startup. Run the following commands:
docker logs wandb-local
kubectl get pods
kubectl logs wandb-XXXXX-XXXXX
Contact W&B Support if you encounter errors.
3.5 - Update W&B license and version
Update your W&B Server Version and License with the same method you installed W&B Server with. The following table lists how to update your license and version based on different deployment methods:
Release Type | Description |
---|---|
Terraform | W&B supports three public Terraform modules for cloud deployment: AWS, GCP, and Azure. |
Helm | You can use the Helm Chart to install W&B into an existing Kubernetes cluster. |
Update with Terraform
Update your license and version with Terraform. The following table lists W&B-managed Terraform modules by cloud platform.
Cloud provider | Terraform module |
---|---|
AWS | AWS Terraform module |
GCP | GCP Terraform module |
Azure | Azure Terraform module |
- First, navigate to the W&B maintained Terraform module for your appropriate cloud provider. See the preceding table to find the appropriate Terraform module based on your cloud provider.
- Within your Terraform configuration, update
wandb_version
andlicense
in your Terraformwandb_app
module configuration:
module "wandb_app" {
  source  = "wandb/wandb/<cloud-specific-module>"
  version = "new_version"

  license       = "new_license_key"   # Your new license key
  wandb_version = "new_wandb_version" # Desired W&B version
  ...
}
- Apply the Terraform configuration with
terraform plan
and terraform apply.
terraform init
terraform apply
- (Optional) If you use a
terraform.tfvars
or other.tfvars
file.
Update or create a terraform.tfvars file with the new W&B version and license key.
terraform plan -var-file="terraform.tfvars"
Apply the configuration. In your Terraform workspace directory execute:
terraform apply -var-file="terraform.tfvars"
Update with Helm
Update W&B with spec
- Specify a new version by modifying the
image.tag
and/orlicense
values in your Helm chart*.yaml
configuration file:
license: 'new_license'
image:
  repository: wandb/local
  tag: 'new_version'
- Execute the Helm upgrade with the following command:
helm repo update
helm upgrade --namespace=wandb --create-namespace \
  --install wandb wandb/wandb --version ${chart_version} \
  -f ${wandb_install_spec.yaml}
Update license and version directly
- Set the new license key and image tag as environment variables:
export LICENSE='new_license'
export TAG='new_version'
- Upgrade your Helm release with the command below, merging the new values with the existing configuration:
helm repo update
helm upgrade --namespace=wandb --create-namespace \
  --install wandb wandb/wandb --version ${chart_version} \
  --reuse-values --set license=$LICENSE --set image.tag=$TAG
For more details, see the upgrade guide in the public repository.
Update with admin UI
This method only works for updating licenses that are not set with an environment variable in the W&B server container, typically in self-hosted Docker installations.
- Obtain a new license from the W&B Deployment Page, ensuring it matches the correct organization and deployment ID for the deployment you are looking to upgrade.
- Access the W&B Admin UI at
<host-url>/system-settings
. - Navigate to the license management section.
- Enter the new license key and save your changes.