1 - Deploy W&B Platform on AWS

Hosting W&B Server on AWS.

W&B recommends using the W&B Server AWS Terraform Module to deploy the platform on AWS.

Before you start, W&B recommends that you choose one of the remote backends available for Terraform to store the State File.

The State File is required to roll out upgrades or make changes to your deployment without recreating all components.
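
For example, a minimal sketch of an S3 remote backend, assuming you have already created a state bucket and a DynamoDB table for state locking (both names below are placeholders), could look like this:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket" # placeholder: pre-existing state bucket
    key            = "wandb/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "terraform-state-lock"      # placeholder: pre-existing lock table
    encrypt        = true
  }
}

With this in place, terraform init configures the backend, and all subsequent runs share the same State File.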

The Terraform Module deploys the following mandatory components:

  • Load Balancer
  • AWS Identity & Access Management (IAM)
  • AWS Key Management Service (KMS)
  • Amazon Aurora MySQL
  • Amazon VPC
  • Amazon S3
  • Amazon Route53
  • AWS Certificate Manager (ACM)
  • Amazon Elastic Load Balancing (ALB)
  • Amazon Secrets Manager

Other deployment options can also include the following optional components:

  • ElastiCache for Redis
  • SQS

Pre-requisite permissions

The account that runs Terraform must be able to create all components described in the Introduction, and must have permission to create IAM Policies and IAM Roles and to assign roles to resources.

General steps

The steps on this topic are common for any deployment option covered by this documentation.

  1. Prepare the development environment.

    • Install Terraform
    • W&B recommends creating a Git repository for version control.
  2. Create the terraform.tfvars file.

    The tfvars file content can be customized according to the installation type, but the minimum recommended configuration looks like the example below.

    namespace                  = "wandb"
    license                    = "xxxxxxxxxxyyyyyyyyyyyzzzzzzz"
    subdomain                  = "wandb-aws"
    domain_name                = "wandb.ml"
    zone_id                    = "xxxxxxxxxxxxxxxx"
    allowed_inbound_cidr       = ["0.0.0.0/0"]
    allowed_inbound_ipv6_cidr  = ["::/0"]
    

    Define the variables in your tfvars file before you deploy. The namespace variable is a string that prefixes all resources created by Terraform.

    The combination of subdomain and domain_name forms the FQDN where W&B will be configured. In the example above, the W&B FQDN will be wandb-aws.wandb.ml, and zone_id is the DNS zone where the FQDN record will be created.

    Both allowed_inbound_cidr and allowed_inbound_ipv6_cidr also require values; they are mandatory inputs in the module. The preceding example permits access to the W&B installation from any source. See the example after these steps for a more restrictive configuration.

  3. Create the file versions.tf

    This file will contain the Terraform and Terraform provider versions required to deploy W&B in AWS.

    provider "aws" {
      region = "eu-central-1"
    
      default_tags {
        tags = {
          GithubRepo = "terraform-aws-wandb"
          GithubOrg  = "wandb"
          Environment = "Example"
          Example    = "PublicDnsExternal"
        }
      }
    }
    

    Refer to the Terraform Official Documentation to configure the AWS provider.

    Optionally, but highly recommended, add the remote backend configuration mentioned at the beginning of this documentation.

  4. Create the file variables.tf

    For every option configured in terraform.tfvars, Terraform requires a corresponding variable declaration.

    variable "namespace" {
      type        = string
      description = "Name prefix used for resources"
    }
    
    variable "domain_name" {
      type        = string
      description = "Domain name used to access instance."
    }
    
    variable "subdomain" {
      type        = string
      default     = null
      description = "Subdomain for accessing the Weights & Biases UI."
    }
    
    variable "license" {
      type = string
    }
    
    variable "zone_id" {
      type        = string
      description = "Domain for creating the Weights & Biases subdomain on."
    }
    
    variable "allowed_inbound_cidr" {
     description = "CIDRs allowed to access wandb-server."
     nullable    = false
     type        = list(string)
    }
    
    variable "allowed_inbound_ipv6_cidr" {
     description = "CIDRs allowed to access wandb-server."
     nullable    = false
     type        = list(string)
    }
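
For example, to restrict inbound access to a specific network range instead of the open CIDRs shown in the tfvars example above, the entries might look like the following (both ranges are placeholders; substitute your own):

allowed_inbound_cidr       = ["10.0.0.0/8"]   # placeholder: your corporate IPv4 range
allowed_inbound_ipv6_cidr  = ["fd00::/8"]     # placeholder: your private IPv6 range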
    

Recommended deployment

This is the most straightforward deployment configuration. It creates all mandatory components and installs the latest version of W&B in the Kubernetes cluster.

  1. Create the main.tf

    In the same directory where you created the files in the General Steps, create a file main.tf with the following content:

    module "wandb_infra" {
      source  = "wandb/wandb/aws"
      version = "~>2.0"
    
      namespace   = var.namespace
      domain_name = var.domain_name
      subdomain   = var.subdomain
      zone_id     = var.zone_id
    
      allowed_inbound_cidr           = var.allowed_inbound_cidr
      allowed_inbound_ipv6_cidr      = var.allowed_inbound_ipv6_cidr
    
      public_access                  = true
      external_dns                   = true
      kubernetes_public_access       = true
      kubernetes_public_access_cidrs = ["0.0.0.0/0"]
    }
    
    data "aws_eks_cluster" "app_cluster" {
      name = module.wandb_infra.cluster_id
    }
    
    data "aws_eks_cluster_auth" "app_cluster" {
      name = module.wandb_infra.cluster_id
    }
    
    provider "kubernetes" {
      host                   = data.aws_eks_cluster.app_cluster.endpoint
      cluster_ca_certificate = base64decode(data.aws_eks_cluster.app_cluster.certificate_authority.0.data)
      token                  = data.aws_eks_cluster_auth.app_cluster.token
    }
    
    module "wandb_app" {
      source  = "wandb/wandb/kubernetes"
      version = "~>1.0"
    
      license                    = var.license
      host                       = module.wandb_infra.url
      bucket                     = "s3://${module.wandb_infra.bucket_name}"
      bucket_aws_region          = module.wandb_infra.bucket_region
      bucket_queue               = "internal://"
      database_connection_string = "mysql://${module.wandb_infra.database_connection_string}"
    
      # TF attempts to deploy while the work group is
      # still spinning up if you do not wait
      depends_on = [module.wandb_infra]
    }
    
    output "bucket_name" {
      value = module.wandb_infra.bucket_name
    }
    
    output "url" {
      value = module.wandb_infra.url
    }
    
  2. Deploy W&B

    To deploy W&B, execute the following commands:

    terraform init
    terraform apply -var-file=terraform.tfvars
    

Enable Redis

Another deployment option uses Redis to cache SQL queries and speed up application response times when loading metrics for experiments.

You need to add the option create_elasticache_subnet = true to the same main.tf file described in the Recommended deployment section to enable the cache.

module "wandb_infra" {
  source  = "wandb/wandb/aws"
  version = "~>2.0"

  namespace   = var.namespace
  domain_name = var.domain_name
  subdomain   = var.subdomain
  zone_id     = var.zone_id
  create_elasticache_subnet = true
}
[...]

Enable message broker (queue)

This deployment option enables an external message broker. It is optional because W&B ships with an embedded broker, and enabling an external one does not improve performance.

The AWS resource that provides the message broker is SQS. To enable it, add the option use_internal_queue = false to the same main.tf file described in the Recommended deployment section.

module "wandb_infra" {
  source  = "wandb/wandb/aws"
  version = "~>2.0"

  namespace   = var.namespace
  domain_name = var.domain_name
  subdomain   = var.subdomain
  zone_id     = var.zone_id
  use_internal_queue = false

[...]
}

Other deployment options

You can combine all three deployment options by adding all configurations to the same file. The Terraform Module provides several options that can be combined along with the standard options and the minimal configuration found in the Recommended deployment section; a combined example is sketched below.
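
For example, a sketch of a single wandb_infra module block that combines the Redis cache and SQS message broker options with the recommended configuration could look like this:

module "wandb_infra" {
  source  = "wandb/wandb/aws"
  version = "~>2.0"

  namespace   = var.namespace
  domain_name = var.domain_name
  subdomain   = var.subdomain
  zone_id     = var.zone_id

  allowed_inbound_cidr           = var.allowed_inbound_cidr
  allowed_inbound_ipv6_cidr      = var.allowed_inbound_ipv6_cidr

  public_access                  = true
  external_dns                   = true
  kubernetes_public_access       = true
  kubernetes_public_access_cidrs = ["0.0.0.0/0"]

  # Options from the sections above
  create_elasticache_subnet = true  # enable the Redis cache
  use_internal_queue        = false # use SQS as the message broker
}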

Manual configuration

To use an Amazon S3 bucket as a file storage backend for W&B, create a bucket along with an SQS queue configured to receive object creation notifications from that bucket. Your instance needs permission to read from this queue.

Create an S3 Bucket and Bucket Notifications

Follow the procedure below to create an Amazon S3 bucket and enable bucket notifications.

  1. Navigate to Amazon S3 in the AWS Console.
  2. Select Create bucket.
  3. Within the Advanced settings, select Add notification within the Events section.
  4. Configure all object creation events to be sent to the SQS Queue you configured earlier.
Enterprise file storage settings

Enable CORS access. Your CORS configuration should look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<CORSRule>
    <AllowedOrigin>http://YOUR-W&B-SERVER-IP</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <AllowedMethod>PUT</AllowedMethod>
    <AllowedHeader>*</AllowedHeader>
</CORSRule>
</CORSConfiguration>

Create an SQS Queue

Follow the procedure below to create an SQS Queue:

  1. Navigate to Amazon SQS in the AWS Console.
  2. Select Create queue.
  3. From the Details section, select a Standard queue type.
  4. Within the Access policy section, grant permission for the following actions:
  • SendMessage
  • ReceiveMessage
  • ChangeMessageVisibility
  • DeleteMessage
  • GetQueueUrl

Optionally, add an advanced access policy in the Access Policy section. For example, the following policy statement allows Amazon S3 to send messages to the queue:

{
    "Version" : "2012-10-17",
    "Statement" : [
      {
        "Effect" : "Allow",
        "Principal" : "*",
        "Action" : ["sqs:SendMessage"],
        "Resource" : "<sqs-queue-arn>",
        "Condition" : {
          "ArnEquals" : { "aws:SourceArn" : "<s3-bucket-arn>" }
        }
      }
    ]
}
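
If you prefer to manage these resources with Terraform instead of the console, a minimal sketch of the same bucket, queue, access policy, and notification wiring could look like the following. All resource, bucket, and queue names here are hypothetical; adapt them to your environment.

resource "aws_s3_bucket" "wandb" {
  bucket = "my-wandb-bucket" # hypothetical bucket name
}

resource "aws_sqs_queue" "wandb" {
  name = "my-wandb-queue" # hypothetical queue name
}

# Allow the bucket to send object creation events to the queue,
# mirroring the access policy statement shown above.
resource "aws_sqs_queue_policy" "wandb" {
  queue_url = aws_sqs_queue.wandb.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = ["sqs:SendMessage"]
      Resource  = aws_sqs_queue.wandb.arn
      Condition = {
        ArnEquals = { "aws:SourceArn" = aws_s3_bucket.wandb.arn }
      }
    }]
  })
}

# Send all object creation events from the bucket to the queue.
resource "aws_s3_bucket_notification" "wandb" {
  bucket = aws_s3_bucket.wandb.id

  queue {
    queue_arn = aws_sqs_queue.wandb.arn
    events    = ["s3:ObjectCreated:*"]
  }

  depends_on = [aws_sqs_queue_policy.wandb]
}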

Grant permissions to the node that runs W&B

The node where W&B server is running must be configured to permit access to Amazon S3 and Amazon SQS. Depending on the type of server deployment you have opted for, you may need to add the following policy statements to your node role:

{
   "Statement":[
      {
         "Sid":"",
         "Effect":"Allow",
         "Action":"s3:*",
         "Resource":"arn:aws:s3:::<WANDB_BUCKET>"
      },
      {
         "Sid":"",
         "Effect":"Allow",
         "Action":[
            "sqs:*"
         ],
         "Resource":"arn:aws:sqs:<REGION>:<ACCOUNT>:<WANDB_QUEUE>"
      }
   ]
}

Configure W&B server

Finally, configure your W&B Server.

  1. Navigate to the W&B settings page at http(s)://YOUR-W&B-SERVER-HOST/system-admin.
  2. Enable the Use an external file storage backend option.
  3. Provide information about your Amazon S3 bucket, region, and Amazon SQS queue in the following format:
  • File Storage Bucket: s3://<bucket-name>
  • File Storage Region (AWS only): <region>
  • Notification Subscription: sqs://<queue-name>
  4. Select Update settings to apply the new settings.

Upgrade your W&B version

Follow the steps outlined here to update W&B:

  1. Add wandb_version to your configuration in your wandb_app module. Provide the version of W&B you want to upgrade to. For example, the following line specifies W&B version 0.48.1:
module "wandb_app" {
    source  = "wandb/wandb/kubernetes"
    version = "~>1.0"

    license       = var.license
    wandb_version = "0.48.1"

    [...]
}
  2. After you update your configuration, complete the steps described in the Recommended deployment section.

Migrate to operator-based AWS Terraform modules

This section details the steps required to upgrade from pre-operator to post-operator environments using the terraform-aws-wandb module.

Before and after architecture

Previously, the W&B architecture used:

module "wandb_infra" {
  source  = "wandb/wandb/aws"
  version = "1.16.10"
  ...
}

to control the infrastructure:

pre-operator-infra

and this module to deploy the W&B Server:

module "wandb_app" {
  source  = "wandb/wandb/kubernetes"
  version = "1.12.0"
}
pre-operator-k8s

Post-transition, the architecture uses:

module "wandb_infra" {
  source  = "wandb/wandb/aws"
  version = "4.7.2"
  ...
}

to manage both the installation of infrastructure and the W&B Server to the Kubernetes cluster, thus eliminating the need for the module "wandb_app" in post-operator.tf.

post-operator-k8s

This architectural shift enables additional features (like OpenTelemetry, Prometheus, HPAs, Kafka, and image updates) without requiring manual Terraform operations by SRE/Infrastructure teams.

To begin with a base installation of the W&B Pre-Operator, ensure that post-operator.tf has a .disabled file extension and that pre-operator.tf is active (that is, it does not have a .disabled extension). Those files can be found here.

Prerequisites

Before initiating the migration process, ensure the following prerequisites are met:

  • Egress: The deployment can’t be airgapped. It needs access to deploy.wandb.ai to get the latest spec for the Release Channel.
  • AWS Credentials: Proper AWS credentials configured to interact with your AWS resources.
  • Terraform Installed: The latest version of Terraform should be installed on your system.
  • Route53 Hosted Zone: An existing Route53 hosted zone corresponding to the domain under which the application will be served.
  • Pre-Operator Terraform Files: Ensure pre-operator.tf and associated variable files like pre-operator.tfvars are correctly set up.

Pre-Operator setup

Execute the following Terraform commands to initialize and apply the configuration for the Pre-Operator setup:

terraform init -upgrade
terraform apply -var-file=./pre-operator.tfvars

pre-operator.tfvars should look something like this:

namespace     = "operator-upgrade"
domain_name   = "sandbox-aws.wandb.ml"
zone_id       = "Z032246913CW32RVRY0WU"
subdomain     = "operator-upgrade"
wandb_license = "ey..."
wandb_version = "0.51.2"

The pre-operator.tf configuration calls two modules:

module "wandb_infra" {
  source  = "wandb/wandb/aws"
  version = "1.16.10"
  ...
}

This module spins up the infrastructure.

module "wandb_app" {
  source  = "wandb/wandb/kubernetes"
  version = "1.12.0"
}

This module deploys the application.

Post-Operator Setup

Make sure that pre-operator.tf has a .disabled extension, and post-operator.tf is active.

The post-operator.tfvars includes additional variables:

...
# wandb_version = "0.51.2" is now managed via the Release Channel or set in the User Spec.

# Required Operator Variables for Upgrade:
size                 = "small"
enable_dummy_dns     = true
enable_operator_alb  = true
custom_domain_filter = "sandbox-aws.wandb.ml"

Run the following commands to initialize and apply the Post-Operator configuration:

terraform init -upgrade
terraform apply -var-file=./post-operator.tfvars

The plan and apply steps will update the following resources:

actions:
  create:
    - aws_efs_backup_policy.storage_class
    - aws_efs_file_system.storage_class
    - aws_efs_mount_target.storage_class["0"]
    - aws_efs_mount_target.storage_class["1"]
    - aws_eks_addon.efs
    - aws_iam_openid_connect_provider.eks
    - aws_iam_policy.secrets_manager
    - aws_iam_role_policy_attachment.ebs_csi
    - aws_iam_role_policy_attachment.eks_efs
    - aws_iam_role_policy_attachment.node_secrets_manager
    - aws_security_group.storage_class_nfs
    - aws_security_group_rule.nfs_ingress
    - random_pet.efs
    - aws_s3_bucket_acl.file_storage
    - aws_s3_bucket_cors_configuration.file_storage
    - aws_s3_bucket_ownership_controls.file_storage
    - aws_s3_bucket_server_side_encryption_configuration.file_storage
    - helm_release.operator
    - helm_release.wandb
    - aws_cloudwatch_log_group.this[0]
    - aws_iam_policy.default
    - aws_iam_role.default
    - aws_iam_role_policy_attachment.default
    - helm_release.external_dns
    - aws_default_network_acl.this[0]
    - aws_default_route_table.default[0]
    - aws_iam_policy.default
    - aws_iam_role.default
    - aws_iam_role_policy_attachment.default
    - helm_release.aws_load_balancer_controller

  update_in_place:
    - aws_iam_policy.node_IMDSv2
    - aws_iam_policy.node_cloudwatch
    - aws_iam_policy.node_kms
    - aws_iam_policy.node_s3
    - aws_iam_policy.node_sqs
    - aws_eks_cluster.this[0]
    - aws_elasticache_replication_group.default
    - aws_rds_cluster.this[0]
    - aws_rds_cluster_instance.this["1"]
    - aws_default_security_group.this[0]
    - aws_subnet.private[0]
    - aws_subnet.private[1]
    - aws_subnet.public[0]
    - aws_subnet.public[1]
    - aws_launch_template.workers["primary"]

  destroy:
    - kubernetes_config_map.config_map
    - kubernetes_deployment.wandb
    - kubernetes_priority_class.priority
    - kubernetes_secret.secret
    - kubernetes_service.prometheus
    - kubernetes_service.service
    - random_id.snapshot_identifier[0]

  replace:
    - aws_autoscaling_attachment.autoscaling_attachment["primary"]
    - aws_route53_record.alb
    - aws_eks_node_group.workers["primary"]

You should see something like this:

post-operator-apply

Note that in post-operator.tf, there is a single module:

module "wandb_infra" {
  source  = "wandb/wandb/aws"
  version = "4.7.2"
  ...
}

Changes in the post-operator configuration:

  1. Update Required Providers: Change required_providers.aws.version from 3.6 to 4.0 for provider compatibility.
  2. DNS and Load Balancer Configuration: Integrate enable_dummy_dns and enable_operator_alb to manage DNS records and AWS Load Balancer setup through an Ingress.
  3. License and Size Configuration: Transfer the license and size parameters directly to the wandb_infra module to match new operational requirements.
  4. Custom Domain Handling: If necessary, use custom_domain_filter to troubleshoot DNS issues by checking the External DNS pod logs within the kube-system namespace.
  5. Helm Provider Configuration: Enable and configure the Helm provider to manage Kubernetes resources effectively:
provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.app_cluster.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.app_cluster.certificate_authority[0].data)
    token                  = data.aws_eks_cluster_auth.app_cluster.token
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      args        = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.app_cluster.name]
      command     = "aws"
    }
  }
}

This comprehensive setup ensures a smooth transition from the Pre-Operator to the Post-Operator configuration, leveraging new efficiencies and capabilities enabled by the operator model.

2 - Deploy W&B Platform on GCP

Hosting W&B Server on GCP.

If you've decided to self-manage W&B Server, W&B recommends using the W&B Server GCP Terraform Module to deploy the platform on GCP.

The module documentation is extensive and contains all available options that can be used.

Before you start, W&B recommends that you choose one of the remote backends available for Terraform to store the State File.

The State File is required to roll out upgrades or make changes to your deployment without recreating all components.
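
For example, a minimal sketch of a GCS remote backend, assuming you have already created a bucket to hold the state (the bucket name below is a placeholder), could look like this:

terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket" # placeholder: pre-existing state bucket
    prefix = "wandb"
  }
}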

The Terraform Module will deploy the following mandatory components:

  • VPC
  • Cloud SQL for MySQL
  • Cloud Storage Bucket
  • Google Kubernetes Engine
  • KMS Crypto Key
  • Load Balancer

Other deployment options can also include the following optional components:

  • Memorystore for Redis
  • Pub/Sub messaging system

Pre-requisite permissions

The account that runs Terraform needs to have the roles/owner role in the GCP project used.

General steps

The steps on this topic are common for any deployment option covered by this documentation.

  1. Prepare the development environment.

    • Install Terraform
    • We recommend creating a Git repository with the code that will be used, but you can keep your files locally.
    • Create a project in Google Cloud Console
    • Authenticate with GCP (make sure to install gcloud first): gcloud auth application-default login
  2. Create the terraform.tfvars file.

    The tfvars file content can be customized according to the installation type, but the minimum recommended configuration looks like the example below.

    project_id  = "wandb-project"
    region      = "europe-west2"
    zone        = "europe-west2-a"
    namespace   = "wandb"
    license     = "xxxxxxxxxxyyyyyyyyyyyzzzzzzz"
    subdomain   = "wandb-gcp"
    domain_name = "wandb.ml"
    

    Decide the values of these variables before the deployment. The namespace variable is a string that will prefix all resources created by Terraform.

    The combination of subdomain and domain_name will form the FQDN where W&B will be configured. In the example above, the W&B FQDN will be wandb-gcp.wandb.ml.

  3. Create the file variables.tf

    For every option configured in terraform.tfvars, Terraform requires a corresponding variable declaration.

    variable "project_id" {
      type        = string
      description = "Project ID"
    }
    
    variable "region" {
      type        = string
      description = "Google region"
    }
    
    variable "zone" {
      type        = string
      description = "Google zone"
    }
    
    variable "namespace" {
      type        = string
      description = "Namespace prefix used for resources"
    }
    
    variable "domain_name" {
      type        = string
      description = "Domain name for accessing the Weights & Biases UI."
    }
    
    variable "subdomain" {
      type        = string
      description = "Subdomain for access the Weights & Biases UI."
    }
    
    variable "license" {
      type        = string
      description = "W&B License"
    }
    

Recommended deployment

This is the most straightforward deployment configuration. It will create all mandatory components and install the latest version of W&B in the Kubernetes cluster.

  1. Create the main.tf

    In the same directory where you created the files in the General Steps, create a file main.tf with the following content:

    provider "google" {
     project = var.project_id
     region  = var.region
     zone    = var.zone
    }
    
    provider "google-beta" {
     project = var.project_id
     region  = var.region
     zone    = var.zone
    }
    
    data "google_client_config" "current" {}
    
    provider "kubernetes" {
      host                   = "https://${module.wandb.cluster_endpoint}"
      cluster_ca_certificate = base64decode(module.wandb.cluster_ca_certificate)
      token                  = data.google_client_config.current.access_token
    }
    
    # Spin up all required services
    module "wandb" {
      source  = "wandb/wandb/google"
      version = "~> 5.0"
    
      namespace   = var.namespace
      license     = var.license
      domain_name = var.domain_name
      subdomain   = var.subdomain
    }
    
    # You'll want to update your DNS with the provisioned IP address
    output "url" {
      value = module.wandb.url
    }
    
    output "address" {
      value = module.wandb.address
    }
    
    output "bucket_name" {
      value = module.wandb.bucket_name
    }
    
  2. Deploy W&B

    To deploy W&B, execute the following commands:

    terraform init
    terraform apply -var-file=terraform.tfvars
    

Deployment with Redis Cache

Another deployment option uses Redis to cache SQL queries and speed up application response times when loading metrics for experiments.

To enable the cache, add the option create_redis = true to the same main.tf file specified in the Recommended deployment section.

[...]

module "wandb" {
  source  = "wandb/wandb/google"
  version = "~> 1.0"

  namespace    = var.namespace
  license      = var.license
  domain_name  = var.domain_name
  subdomain    = var.subdomain
  allowed_inbound_cidrs = ["*"]
  #Enable Redis
  create_redis = true

}
[...]

Deployment with External Queue

This deployment option enables an external message broker. It is optional because W&B ships with an embedded broker, and enabling an external one does not improve performance.

The GCP resource that provides the message broker is Pub/Sub. To enable it, add the option use_internal_queue = false to the same main.tf file specified in the Recommended deployment section.

[...]

module "wandb" {
  source  = "wandb/wandb/google"
  version = "~> 1.0"

  namespace          = var.namespace
  license            = var.license
  domain_name        = var.domain_name
  subdomain          = var.subdomain
  allowed_inbound_cidrs = ["*"]
  #Create and use Pub/Sub
  use_internal_queue = false

}

[...]

Other deployment options

You can combine all three deployment options by adding all configurations to the same file. The Terraform Module provides several options that can be combined along with the standard options and the minimal configuration found in the Recommended deployment section.

Manual configuration

To use a GCP Storage bucket as a file storage backend for W&B, you need to create a Pub/Sub topic and subscription, a storage bucket, and a notification stream from the bucket to the topic.

Create PubSub Topic and Subscription

Follow the procedure below to create a PubSub topic and subscription:

  1. Navigate to the Pub/Sub service within the GCP Console
  2. Select Create Topic and provide a name for your topic.
  3. At the bottom of the page, select Create subscription. Ensure Delivery Type is set to Pull.
  4. Click Create.

Make sure the service account or account that your instance is running has the pubsub.admin role on this subscription. For details, see https://cloud.google.com/pubsub/docs/access-control#console.

Create Storage Bucket

  1. Navigate to the Cloud Storage Buckets page.
  2. Select Create bucket and provide a name for your bucket. Ensure you choose a Standard storage class.

Ensure that the service account or account that your instance runs under has access to this bucket.

  1. Enable CORS access. This can only be done using the command line. First, create a JSON file with the following CORS configuration.
cors:
- maxAgeSeconds: 3600
  method:
  - GET
  - PUT
  origin:
  - '<YOUR_W&B_SERVER_HOST>'
  responseHeader:
  - Content-Type

Note that the scheme, host, and port of the values for the origin must match exactly.

  2. Make sure you have gcloud installed, and logged into the correct GCP Project.
  3. Next, run the following:
gcloud storage buckets update gs://<BUCKET_NAME> --cors-file=<CORS_CONFIG_FILE>

Create PubSub Notification

Follow the procedure below in your command line to create a notification stream from the Storage Bucket to the Pub/Sub topic.

  1. Log into your GCP Project.
  2. Run the following in your terminal:
gcloud pubsub topics list  # list names of topics for reference
gcloud storage ls          # list names of buckets for reference

# create bucket notification
gcloud storage buckets notifications create gs://<BUCKET_NAME> --topic=<TOPIC_NAME>

Further reference is available on the Cloud Storage website.
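
If you prefer to manage these resources with Terraform instead of the console and gcloud, a minimal sketch of the same topic, subscription, bucket, and notification wiring could look like the following. All names are placeholders; adapt them to your environment.

resource "google_pubsub_topic" "wandb" {
  name = "wandb-object-events" # placeholder topic name
}

# Pull subscription that the W&B instance reads from.
resource "google_pubsub_subscription" "wandb" {
  name  = "wandb-object-events-sub" # placeholder subscription name
  topic = google_pubsub_topic.wandb.id
}

resource "google_storage_bucket" "wandb" {
  name     = "my-wandb-bucket" # placeholder bucket name
  location = "EU"
}

# The Cloud Storage service agent must be allowed to publish to the topic.
data "google_storage_project_service_account" "gcs" {}

resource "google_pubsub_topic_iam_member" "gcs_publisher" {
  topic  = google_pubsub_topic.wandb.id
  role   = "roles/pubsub.publisher"
  member = "serviceAccount:${data.google_storage_project_service_account.gcs.email_address}"
}

# Send object creation events from the bucket to the topic.
resource "google_storage_notification" "wandb" {
  bucket         = google_storage_bucket.wandb.name
  payload_format = "JSON_API_V1"
  topic          = google_pubsub_topic.wandb.id
  event_types    = ["OBJECT_FINALIZE"]

  depends_on = [google_pubsub_topic_iam_member.gcs_publisher]
}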

Configure W&B server

  1. Finally, navigate to the W&B System Connections page at http(s)://YOUR-W&B-SERVER-HOST/console/settings/system.
  2. Select the provider Google Cloud Storage (gcs).
  3. Provide the name of the GCS bucket.
  4. Press Update settings to apply the new settings.

Upgrade W&B Server

Follow the steps outlined here to update W&B:

  1. Add wandb_version to your configuration in your wandb_app module. Provide the version of W&B you want to upgrade to. For example, the following line specifies W&B version 0.58.1:
module "wandb_app" {
    source  = "wandb/wandb/kubernetes"
    version = "~>5.0"

    license       = var.license
    wandb_version = "0.58.1"

    [...]
}
  2. After you update your configuration, complete the steps described in the Recommended deployment section.

3 - Deploy W&B Platform on Azure

Hosting W&B Server on Azure.

If you've decided to self-manage W&B Server, W&B recommends using the W&B Server Azure Terraform Module to deploy the platform on Azure.

The module documentation is extensive and contains all available options that can be used. We will cover some deployment options in this document.

Before you start, we recommend you choose one of the remote backends available for Terraform to store the State File.

The State File is required to roll out upgrades or make changes to your deployment without recreating all components.
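
For example, a minimal sketch of an azurerm remote backend, assuming you have already created a resource group, storage account, and blob container for the state (all names below are placeholders), could look like this:

terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"     # placeholder: pre-existing resource group
    storage_account_name = "tfstatestorage" # placeholder: pre-existing storage account
    container_name       = "tfstate"        # placeholder: pre-existing blob container
    key                  = "wandb.terraform.tfstate"
  }
}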

The Terraform Module will deploy the following mandatory components:

  • Azure Resource Group
  • Azure Virtual Network (VPC)
  • Azure MySQL Flexible Server
  • Azure Storage Account & Blob Storage
  • Azure Kubernetes Service
  • Azure Application Gateway

Other deployment options can also include the following optional components:

  • Azure Cache for Redis
  • Azure Event Grid

Pre-requisite permissions

The simplest way to configure the AzureRM provider is via the Azure CLI, but in the case of automation, using an Azure Service Principal can also be useful. Regardless of the authentication method used, the account that runs Terraform must be able to create all components described in the Introduction.

General steps

The steps on this topic are common for any deployment option covered by this documentation.

  1. Prepare the development environment.
  • Install Terraform
  • We recommend creating a Git repository with the code that will be used, but you can keep your files locally.
  2. Create the terraform.tfvars file. The tfvars file content can be customized according to the installation type, but the minimum recommended configuration looks like the example below.

     namespace     = "wandb"
     wandb_license = "xxxxxxxxxxyyyyyyyyyyyzzzzzzz"
     subdomain     = "wandb-aws"
     domain_name   = "wandb.ml"
     location      = "westeurope"
    

    Decide the values of these variables before the deployment. The namespace variable is a string that will prefix all resources created by Terraform.

    The combination of subdomain and domain_name will form the FQDN where W&B will be configured. In the example above, the W&B FQDN will be wandb-aws.wandb.ml.

  3. Create the file versions.tf. This file will contain the Terraform and Terraform provider versions required to deploy W&B in Azure.

terraform {
  required_version = "~> 1.3"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.17"
    }
  }
}

Refer to the Terraform Official Documentation to configure the AzureRM provider.

Optionally, but highly recommended, you can add the remote backend configuration mentioned at the beginning of this documentation.

  4. Create the file variables.tf. For every option configured in terraform.tfvars, Terraform requires a corresponding variable declaration.
  variable "namespace" {
    type        = string
    description = "String used for prefix resources."
  }

  variable "location" {
    type        = string
    description = "Azure Resource Group location"
  }

  variable "domain_name" {
    type        = string
    description = "Domain for accessing the Weights & Biases UI."
  }

  variable "subdomain" {
    type        = string
    default     = null
    description = "Subdomain for accessing the Weights & Biases UI. Default creates record at Route53 Route."
  }

  variable "license" {
    type        = string
    description = "Your wandb/local license"
  }

Recommended deployment

This is the most straightforward deployment configuration. It will create all mandatory components and install the latest version of W&B in the Kubernetes cluster.

  1. Create the main.tf. In the same directory where you created the files in the General Steps, create a file main.tf with the following content:
provider "azurerm" {
  features {}
}

provider "kubernetes" {
  host                   = module.wandb.cluster_host
  cluster_ca_certificate = base64decode(module.wandb.cluster_ca_certificate)
  client_key             = base64decode(module.wandb.cluster_client_key)
  client_certificate     = base64decode(module.wandb.cluster_client_certificate)
}

provider "helm" {
  kubernetes {
    host                   = module.wandb.cluster_host
    cluster_ca_certificate = base64decode(module.wandb.cluster_ca_certificate)
    client_key             = base64decode(module.wandb.cluster_client_key)
    client_certificate     = base64decode(module.wandb.cluster_client_certificate)
  }
}

# Spin up all required services
module "wandb" {
  source  = "wandb/wandb/azurerm"
  version = "~> 1.2"

  namespace   = var.namespace
  location    = var.location
  license     = var.license
  domain_name = var.domain_name
  subdomain   = var.subdomain

  deletion_protection = false

  tags = {
    "Example" : "PublicDns"
  }
}

output "address" {
  value = module.wandb.address
}

output "url" {
  value = module.wandb.url
}
  2. Deploy W&B. To deploy W&B, execute the following commands:

    terraform init
    terraform apply -var-file=terraform.tfvars
    

Deployment with Redis Cache

Another deployment option uses Redis to cache the SQL queries and speed up the application response when loading the metrics for the experiments.

To enable the cache, add the option create_redis = true to the same main.tf file that you used in the Recommended deployment section.

# Spin up all required services
module "wandb" {
  source  = "wandb/wandb/azurerm"
  version = "~> 1.2"


  namespace   = var.namespace
  location    = var.location
  license     = var.license
  domain_name = var.domain_name
  subdomain   = var.subdomain

  create_redis       = true # Create Redis
  [...]
}

Deployment with External Queue

This deployment option enables an external message broker. It is optional because W&B ships with an embedded broker, and enabling an external one does not improve performance.

The Azure resource that provides the message broker is Azure Event Grid. To enable it, add the option use_internal_queue = false to the same main.tf file that you used in the Recommended deployment section.

# Spin up all required services
module "wandb" {
  source  = "wandb/wandb/azurerm"
  version = "~> 1.2"


  namespace   = var.namespace
  location    = var.location
  license     = var.license
  domain_name = var.domain_name
  subdomain   = var.subdomain

  use_internal_queue       = false # Enable Azure Event Grid
  [...]
}

Other deployment options

You can combine all three deployment options by adding all configurations to the same file. The Terraform Module provides several options that you can combine along with the standard options and the minimal configuration found in the Recommended deployment section.

4 - Reference Architecture

W&B Reference Architecture

This page describes a reference architecture for a Weights & Biases deployment and outlines the recommended infrastructure and resources to support a production deployment of the platform.

Depending on your chosen deployment environment for Weights & Biases (W&B), various services can help to enhance the resiliency of your deployment.

For instance, major cloud providers offer robust managed database services which help to reduce the complexity of database configuration, maintenance, high availability, and resilience.

This reference architecture addresses some common deployment scenarios and shows how you can integrate your W&B deployment with cloud vendor services for optimal performance and reliability.

Before you start

Running any application in production comes with its own set of challenges, and W&B is no exception. While we aim to streamline the process, certain complexities may arise depending on your unique architecture and design decisions. Typically, managing a production deployment involves overseeing various components, including hardware, operating systems, networking, storage, security, the W&B platform itself, and other dependencies. This responsibility extends to both the initial setup of the environment and its ongoing maintenance.

Consider carefully whether a self-managed approach with W&B is suitable for your team and specific requirements.

A strong understanding of how to run and maintain production-grade applications is an important prerequisite before you deploy self-managed W&B. If your team needs assistance, our Professional Services team and partners offer support for implementation and optimization.

To learn more about managed solutions for running W&B instead of managing it yourself, refer to W&B Multi-tenant Cloud and W&B Dedicated Cloud.

Infrastructure

W&B infrastructure diagram

Application layer

The application layer consists of a multi-node Kubernetes cluster, with resilience against node failures. The Kubernetes cluster runs and maintains W&B’s pods.

Storage layer

The storage layer consists of a MySQL database and object storage. The MySQL database stores metadata and the object storage stores artifacts such as models and datasets.

Infrastructure requirements

Kubernetes

The W&B Server application is deployed as a Kubernetes Operator that deploys multiple Pods. For this reason, W&B requires a Kubernetes cluster with:

  • A fully configured and functioning Ingress controller
  • The capability to provision Persistent Volumes.

MySQL

W&B stores metadata in a MySQL database. The database’s performance and storage requirements depend on the shapes of the model parameters and related metadata. For example, the database grows in size as you track more training runs, and load on the database increases based on queries in run tables, user workspaces, and reports.

Consider the following when you deploy a self-managed MySQL database:

  • Backups. You should periodically back up the database to a separate facility. W&B recommends daily backups with at least 1 week of retention.
  • Performance. The disk the server is running on should be fast. W&B recommends running the database on an SSD or accelerated NAS.
  • Monitoring. The database should be monitored for load. If CPU usage is sustained above 40% for more than 5 minutes, it is likely a good indication that the server is resource starved.
  • Availability. Depending on your availability and durability requirements, you might want to configure a hot standby on a separate machine that streams all updates in real time from the primary server and can be used to fail over to if the primary server crashes or becomes corrupted.

Object storage

W&B requires object storage with pre-signed URL and CORS support, deployed in Amazon S3, Azure Blob Storage, Google Cloud Storage, or a storage service compatible with Amazon S3.

Versions

  • Kubernetes: at least version 1.29.
  • MySQL: at least 8.0.

Networking

In a deployment connected to a public or private network, egress to the following endpoints is required during installation and during runtime:

  • https://deploy.wandb.ai
  • https://charts.wandb.ai
  • https://docker.io
  • https://quay.io
  • https://gcr.io

Access to W&B and to the object storage is required for the training infrastructure and for each system that needs to track experiments.

DNS

The fully qualified domain name (FQDN) of the W&B deployment must resolve to the IP address of the ingress/load balancer using an A record.

SSL/TLS

W&B requires a valid signed SSL/TLS certificate for secure communication between clients and the server. SSL/TLS termination must occur on the ingress/load balancer. The W&B Server application does not terminate SSL or TLS connections.

Please note: W&B does not recommend the use of self-signed certificates or custom CAs.

Supported CPU architectures

W&B runs on the Intel (x86) CPU architecture. ARM is not supported.

Infrastructure provisioning

Terraform is the recommended way to deploy W&B for production. Using Terraform, you define the required resources, their references to other resources, and their dependencies. W&B provides Terraform modules for the major cloud providers. For details, refer to Deploy W&B Server within self managed cloud accounts.

Sizing

Use the following general guidelines as a starting point when planning a deployment. W&B recommends that you monitor all components of a new deployment closely and that you make adjustments based on observed usage patterns. Continue to monitor production deployments over time and make adjustments as needed to maintain optimal performance.

Models only

Kubernetes

Environment CPU Memory Disk
Test/Dev 2 cores 16 GB 100 GB
Production 8 cores 64 GB 100 GB

Numbers are per Kubernetes worker node.

MySQL

Environment CPU Memory Disk
Test/Dev 2 cores 16 GB 100 GB
Production 8 cores 64 GB 500 GB

Numbers are per MySQL node.

Weave only

Kubernetes

Environment CPU Memory Disk
Test/Dev 4 cores 32 GB 100 GB
Production 12 cores 96 GB 100 GB

Numbers are per Kubernetes worker node.

MySQL

Environment CPU Memory Disk
Test/Dev 2 cores 16 GB 100 GB
Production 8 cores 64 GB 500 GB

Numbers are per MySQL node.

Models and Weave

Kubernetes

Environment CPU Memory Disk
Test/Dev 4 cores 32 GB 100 GB
Production 16 cores 128 GB 100 GB

Numbers are per Kubernetes worker node.

MySQL

Environment CPU Memory Disk
Test/Dev 2 cores 16 GB 100 GB
Production 8 cores 64 GB 500 GB

Numbers are per MySQL node.

Cloud provider instance recommendations

Services

Cloud Kubernetes MySQL Object Storage
AWS EKS RDS Aurora S3
GCP GKE Google Cloud SQL for MySQL Google Cloud Storage (GCS)
Azure AKS Azure Database for MySQL Azure Blob Storage

Machine types

These recommendations apply to each node of a self-managed deployment of W&B in cloud infrastructure.

AWS

Environment K8s (Models only) K8s (Weave only) K8s (Models&Weave) MySQL
Test/Dev r6i.large r6i.xlarge r6i.xlarge db.r6g.large
Production r6i.2xlarge r6i.4xlarge r6i.4xlarge db.r6g.2xlarge

GCP

Environment K8s (Models only) K8s (Weave only) K8s (Models&Weave) MySQL
Test/Dev n2-highmem-2 n2-highmem-4 n2-highmem-4 db-n1-highmem-2
Production n2-highmem-8 n2-highmem-16 n2-highmem-16 db-n1-highmem-8

Azure

Environment K8s (Models only) K8s (Weave only) K8s (Models&Weave) MySQL
Test/Dev Standard_E2_v5 Standard_E4_v5 Standard_E4_v5 MO_Standard_E2ds_v4
Production Standard_E8_v5 Standard_E16_v5 Standard_E16_v5 MO_Standard_E8ds_v4