Kubernetes operator for air-gapped instances
Deploy W&B Platform with Kubernetes Operator (Airgapped)
Use the W&B Kubernetes Operator to simplify deploying, administering, troubleshooting, and scaling your W&B Server deployments on Kubernetes. You can think of the operator as a smart assistant for your W&B instance.
The W&B Server architecture and design continuously evolve to expand AI developer tooling capabilities and to provide appropriate primitives for high performance, better scalability, and easier administration. That evolution applies to the compute services, the relevant storage, and the connectivity between them. To help facilitate continuous updates and improvements across deployment types, W&B uses a Kubernetes operator.
For more information about Kubernetes operators, see Operator pattern in the Kubernetes documentation.
Historically, the W&B application was deployed as a single deployment and pod within a Kubernetes cluster or as a single Docker container. W&B has recommended, and continues to recommend, externalizing the database and object store. Externalizing the database and object store decouples the application’s state.
As the application grew, the need to evolve from a monolithic container to a distributed system (microservices) became apparent. This change facilitates backend logic handling and seamlessly introduces built-in Kubernetes infrastructure capabilities. A distributed system also supports deploying new services essential for additional features that W&B relies on.
Before 2024, any Kubernetes-related change required manually updating the terraform-kubernetes-wandb Terraform module. Updating the Terraform module involved ensuring compatibility across cloud providers, configuring the necessary Terraform variables, and running terraform apply for each backend or Kubernetes-level change.
This process was not scalable since W&B Support had to assist each customer with upgrading their Terraform module.
The solution was to implement an operator that connects to a central deploy.wandb.ai server to request the latest specification changes for a given release channel and apply them. Updates are received as long as the license is valid. Helm is used as both the deployment mechanism for the W&B operator and the means for the operator to handle all configuration templating of the W&B Kubernetes stack, Helm-ception.
You can install the operator with helm or from the source. See charts/operator for detailed instructions.
The installation process creates a deployment called controller-manager and uses a custom resource definition named weightsandbiases.apps.wandb.com (shortName: wandb), that takes a single spec and applies it to the cluster:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: weightsandbiases.apps.wandb.com
The controller-manager installs charts/operator-wandb based on the spec of the custom resource, release channel, and a user-defined config. The configuration specification hierarchy enables maximum configuration flexibility at the user end and enables W&B to release new images, configurations, features, and Helm updates automatically.
Refer to the configuration specification hierarchy and configuration reference for configuration options.
Configuration specifications follow a hierarchical model where higher-level specifications override lower-level ones. Here’s how it works:
This hierarchical model ensures that configurations are flexible and customizable to meet varying needs while maintaining a manageable and systematic approach to upgrades and changes.
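The override behavior described above can be pictured as a recursive merge of nested values, where the higher-level layer wins on conflicts. The following is a minimal sketch of that idea only: the operator's actual merge logic is not shown on this page, and the layer names and values (`channel_defaults`, `user_overrides`) are hypothetical.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_values(lower: Dict[str, Any], higher: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which values from `higher` override those in `lower`."""
    merged = deepcopy(lower)
    for key, value in higher.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars, lists, and new keys are taken from the higher layer.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical layers, lowest precedence first.
channel_defaults = {"global": {"host": "http://localhost:8080"}, "ingress": {"class": ""}}
user_overrides = {"global": {"host": "https://wandb.company-name.com"}}

effective = merge_values(channel_defaults, user_overrides)
print(effective)
```

Keys that the higher layer does not mention, such as `ingress` here, keep their lower-level values, which is why you only need to specify the values you want to change.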
Satisfy the following requirements to deploy W&B with the W&B Kubernetes operator:
Refer to the reference architecture. In addition, obtain a valid W&B Server license.
See this guide for a detailed explanation on how to set up and configure a self-managed installation.
Depending on the installation method, you might need to meet the following requirements:
See the Deploy W&B in airgapped environment with Kubernetes tutorial on how to install the W&B Kubernetes Operator in an airgapped environment.
This section describes different ways to deploy the W&B Kubernetes operator.
Choose one of the following:
W&B provides a Helm Chart to deploy the W&B Kubernetes operator to a Kubernetes cluster. This approach allows you to deploy W&B Server with Helm CLI or a continuous delivery tool like ArgoCD. Make sure that the above mentioned requirements are in place.
Follow these steps to install the W&B Kubernetes Operator with Helm CLI:
helm repo add wandb https://charts.wandb.ai
helm repo update
helm upgrade --install operator wandb/operator -n wandb-cr --create-namespace
Configure the W&B operator custom resource to trigger the W&B Server installation. Create an operator.yaml file to customize the W&B Operator deployment, specifying your custom configuration. See Configuration Reference for details.
Once you have created the specification YAML and filled it with your values, run the following command. The operator applies the configuration and installs the W&B Server application based on it.
kubectl apply -f operator.yaml
Wait until the deployment completes. This takes a few minutes.
To verify the installation using the web UI, create the first admin user account, then follow the verification steps outlined in Verify the installation.
This method allows for customized deployments tailored to specific requirements, leveraging Terraform’s infrastructure-as-code approach for consistency and repeatability. The official W&B Helm-based Terraform Module is located here.
The following code can be used as a starting point and includes all necessary configuration options for a production grade deployment.
module "wandb" {
  source = "wandb/wandb/helm"
  spec = {
    values = {
      global = {
        host    = "https://<HOST_URI>"
        license = "eyJhbGnUzaH...j9ZieKQ2x5GGfw"
        bucket = {
          <details depend on the provider>
        }
        mysql = {
          <redacted>
        }
      }
      ingress = {
        annotations = {
          "a" = "b"
          "x" = "y"
        }
      }
    }
  }
}
Note that the configuration options are the same as described in Configuration Reference, but that the syntax has to follow the HashiCorp Configuration Language (HCL). The Terraform module creates the W&B custom resource definition (CRD).
To see how W&B uses the Helm Terraform module to deploy “Dedicated cloud” installations for customers, follow these links:
W&B provides a set of Terraform modules for AWS, GCP, and Azure. These modules deploy entire infrastructures, including Kubernetes clusters, load balancers, and MySQL databases, as well as the W&B Server application. The W&B Kubernetes Operator is already included in the official W&B cloud-specific Terraform modules as of the following versions:
Terraform Registry | Source Code | Version |
---|---|---|
AWS | https://github.com/wandb/terraform-aws-wandb | v4.0.0+ |
Azure | https://github.com/wandb/terraform-azurerm-wandb | v2.0.0+ |
GCP | https://github.com/wandb/terraform-google-wandb | v2.0.0+ |
This integration ensures that W&B Kubernetes Operator is ready to use for your instance with minimal setup, providing a streamlined path to deploying and managing W&B Server in your cloud environment.
For a detailed description of how to use these modules, refer to the self-managed installations section in the docs.
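If you pin module versions in automation, a small check against the minimum versions in the table above can catch a module that predates the operator. This is an illustrative sketch, not part of any W&B tooling: the helper names are hypothetical, and it assumes plain `vMAJOR.MINOR.PATCH` tags without pre-release suffixes.

```python
import re
from typing import Tuple

# Minimum module versions that include the operator, taken from the table above.
MIN_VERSIONS = {"aws": "v4.0.0", "azure": "v2.0.0", "gcp": "v2.0.0"}


def parse_version(tag: str) -> Tuple[int, int, int]:
    """Parse a tag such as 'v4.0.0' or '4.0.0' into a comparable tuple."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unsupported version tag: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def includes_operator(provider: str, module_version: str) -> bool:
    """Return True if the given module version meets the minimum for its provider."""
    minimum = MIN_VERSIONS[provider.lower()]
    return parse_version(module_version) >= parse_version(minimum)


print(includes_operator("aws", "v4.2.1"))
print(includes_operator("azure", "v1.9.0"))
```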
To verify the installation, W&B recommends using the W&B CLI. The verify command executes several tests that verify all components and configurations.
Follow these steps to verify the installation:
Install the W&B CLI:
pip install wandb
Log in to W&B:
wandb login --host=https://YOUR_DNS_DOMAIN
For example:
wandb login --host=https://wandb.company-name.com
Verify the installation:
wandb verify
A successful installation and fully working W&B deployment shows the following output:
Default host selected: https://wandb.company-name.com
Find detailed logs for this test at: /var/folders/pn/b3g3gnc11_sbsykqkm3tx5rh0000gp/T/tmpdtdjbxua/wandb
Checking if logged in...................................................✅
Checking signed URL upload..............................................✅
Checking ability to send large payloads through proxy...................✅
Checking requests to base url...........................................✅
Checking requests made over signed URLs.................................✅
Checking CORs configuration of the bucket...............................✅
Checking wandb package version is up to date............................✅
Checking logged metrics, saving and downloading a file..................✅
Checking artifact save and download workflows...........................✅
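When you run the verification in a script, you may want to detect failed checks automatically. The sketch below scans captured `wandb verify` output for "Checking" lines that do not end with the check mark shown above. How a failed check is rendered is not shown on this page, so treating any other line ending as a failure is an assumption, and `failed_checks` is a hypothetical helper.

```python
from typing import List


def failed_checks(verify_output: str) -> List[str]:
    """Return the 'Checking ...' lines that do not end with a check mark."""
    failures = []
    for line in verify_output.splitlines():
        line = line.strip()
        if line.startswith("Checking") and not line.endswith("✅"):
            failures.append(line)
    return failures


# Shortened sample based on the output above.
sample = """\
Default host selected: https://wandb.company-name.com
Checking if logged in...................................................✅
Checking signed URL upload..............................................✅
"""

print(failed_checks(sample))
```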
The W&B Kubernetes operator comes with a management console. It is located at ${HOST_URI}/console, for example https://wandb.company-name.com/console.
There are two ways to log in to the management console:
Open the W&B application in the browser and log in. Log in to the W&B application with ${HOST_URI}/, for example https://wandb.company-name.com/
Access the console. Click on the icon in the top right corner and then click System console. Only users with admin privileges can see the System console entry.
kubectl get secret wandb-password -o jsonpath='{.data.password}' | base64 -d
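The command above pipes the value through `base64 -d` because Kubernetes returns secret data base64-encoded. If you read the secret from a script instead, the same decoding step looks roughly like this. The encoded value below is a made-up placeholder, not a real password.

```python
import base64


def decode_secret_value(encoded: str) -> str:
    """Decode a base64-encoded Kubernetes secret value into text."""
    return base64.b64decode(encoded).decode("utf-8")


# Placeholder value standing in for the output of the jsonpath query.
example = base64.b64encode(b"not-a-real-password").decode("ascii")
print(decode_secret_value(example))
```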
This section describes how to update the W&B Kubernetes operator.
Copy and paste the code snippets below into your terminal.
First, update the repo with helm repo update:
helm repo update
Next, update the Helm chart with helm upgrade:
helm upgrade operator wandb/operator -n wandb-cr --reuse-values
You no longer need to update the W&B Server application if you use the W&B Kubernetes operator.
The operator automatically updates your W&B Server application when a new version of W&B is released.
The following section describes how to migrate from self-managing your own W&B Server installation to using the W&B Operator to do this for you. The migration process depends on how you installed W&B Server:
For a detailed description of the migration process, continue here.
Reach out to Customer Support or your W&B team if you have any questions or need assistance.
Follow these steps to migrate to the Operator-based Helm chart:
Get the current W&B configuration. If W&B was deployed with a non-operator-based version of the Helm chart, export the values like this:
helm get values wandb
If W&B was deployed with Kubernetes manifests, export the values like this:
kubectl get deployment wandb -o yaml
You now have all the configuration values you need for the next step.
Create a file called operator.yaml. Follow the format described in the Configuration Reference. Use the values from step 1.
Scale the current deployment to 0 pods. This step stops the current deployment.
kubectl scale --replicas=0 deployment wandb
Update the Helm chart repo:
helm repo update
Install the new Helm chart:
helm upgrade --install operator wandb/operator -n wandb-cr --create-namespace
Configure the new Helm chart and trigger the W&B application deployment. Apply the new configuration.
kubectl apply -f operator.yaml
The deployment takes a few minutes to complete.
Verify the installation. Make sure that everything works by following the steps in Verify the installation.
Remove the old installation. Uninstall the old Helm chart or delete the resources that were created with manifests.
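For step 1, when W&B was deployed with Kubernetes manifests, the environment variables on the old deployment are a likely source of the values to carry over. The sketch below assumes you export the deployment as JSON (`kubectl get deployment wandb -o json`) and reads the literal environment variables from the standard Deployment structure. The sample variable names and values are hypothetical.

```python
import json
from typing import Dict


def container_env(deployment_json: str) -> Dict[str, str]:
    """Collect literal environment variables from all containers of a Deployment."""
    deployment = json.loads(deployment_json)
    containers = deployment["spec"]["template"]["spec"]["containers"]
    env: Dict[str, str] = {}
    for container in containers:
        for item in container.get("env", []):
            # Entries that use valueFrom (for example, secret references)
            # carry no literal value and are skipped here.
            if "value" in item:
                env[item["name"]] = item["value"]
    return env


# Hypothetical, heavily shortened deployment export.
sample = json.dumps({
    "spec": {"template": {"spec": {"containers": [{
        "name": "wandb",
        "env": [
            {"name": "HOST", "value": "https://wandb.company-name.com"},
            {"name": "LICENSE", "valueFrom": {"secretKeyRef": {"name": "s", "key": "k"}}},
        ],
    }]}}}
})

print(container_env(sample))
```

Variables that reference secrets are skipped on purpose; look those up separately before you fill in operator.yaml.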
This section describes the configuration options for the W&B Server application. The application receives its configuration as a custom resource definition named WeightsAndBiases. Some configuration options are exposed with the configuration below; others need to be set as environment variables.
The documentation has two lists of environment variables: basic and advanced. Only use environment variables if the configuration option that you need is not exposed through the Helm chart.
The W&B Server application configuration file for a production deployment requires the following contents. This YAML file defines the desired state of your W&B deployment, including the version, environment variables, external resources like databases, and other necessary settings.
apiVersion: apps.wandb.com/v1
kind: WeightsAndBiases
metadata:
  labels:
    app.kubernetes.io/name: weightsandbiases
    app.kubernetes.io/instance: wandb
  name: wandb
  namespace: default
spec:
  values:
    global:
      host: https://<HOST_URI>
      license: eyJhbGnUzaH...j9ZieKQ2x5GGfw
      bucket:
        <details depend on the provider>
      mysql:
        <redacted>
    ingress:
      annotations:
        <redacted>
Find the full set of values in the W&B Helm repository, and change only those values you need to override.
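If you generate the custom resource from a script rather than writing it by hand, one option is to build it as a dictionary and write it out as JSON, which `kubectl apply -f` accepts in place of YAML. The sketch below mirrors only the top-level structure shown above. `build_custom_resource` is a hypothetical helper, and the bucket, MySQL, and ingress settings are left out because they depend on your environment.

```python
import json
from typing import Any, Dict


def build_custom_resource(host: str, license_key: str,
                          name: str = "wandb",
                          namespace: str = "default") -> Dict[str, Any]:
    """Build a WeightsAndBiases custom resource with only host and license set."""
    return {
        "apiVersion": "apps.wandb.com/v1",
        "kind": "WeightsAndBiases",
        "metadata": {
            "labels": {
                "app.kubernetes.io/name": "weightsandbiases",
                "app.kubernetes.io/instance": "wandb",
            },
            "name": name,
            "namespace": namespace,
        },
        "spec": {"values": {"global": {"host": host, "license": license_key}}},
    }


# Placeholder values; replace with your own host and license.
resource = build_custom_resource("https://wandb.company-name.com", "<LICENSE>")
print(json.dumps(resource, indent=2))
```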
This is an example configuration that uses GCP Kubernetes with GCP Ingress and GCS (GCP Object storage):
apiVersion: apps.wandb.com/v1
kind: WeightsAndBiases
metadata:
  labels:
    app.kubernetes.io/name: weightsandbiases
    app.kubernetes.io/instance: wandb
  name: wandb
  namespace: default
spec:
  values:
    global:
      host: https://abc-wandb.sandbox-gcp.wandb.ml
      bucket:
        name: abc-wandb-moving-pipefish
        provider: gcs
      mysql:
        database: wandb_local
        host: 10.218.0.2
        name: wandb_local
        password: 8wtX6cJHizAZvYScjDzZcUarK4zZGjpV
        port: 3306
        user: wandb
      license: eyJhbGnUzaHgyQjQyQWhEU3...ZieKQ2x5GGfw
    ingress:
      annotations:
        ingress.gcp.kubernetes.io/pre-shared-cert: abc-wandb-cert-creative-puma
        kubernetes.io/ingress.class: gce
        kubernetes.io/ingress.global-static-ip-name: abc-wandb-operator-address
# Provide the FQDN with protocol
global:
  # example host name, replace with your own
  host: https://abc-wandb.sandbox-gcp.wandb.ml
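Because the host value must include the protocol, a quick check before you apply the configuration can catch a bare domain name. This is a minimal sketch with a hypothetical helper. Accepting both `http` and `https` is an assumption; the examples on this page all use `https`.

```python
from urllib.parse import urlsplit


def is_valid_host(host: str) -> bool:
    """Return True if the value has an http(s) scheme and a host part."""
    parts = urlsplit(host)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_valid_host("https://abc-wandb.sandbox-gcp.wandb.ml"))
print(is_valid_host("abc-wandb.sandbox-gcp.wandb.ml"))
```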
AWS
global:
  bucket:
    provider: "s3"
    name: ""
    kmsKey: ""
    region: ""
GCP
global:
  bucket:
    provider: "gcs"
    name: ""
Azure
global:
  bucket:
    provider: "az"
    name: ""
    secretKey: ""
Other providers (Minio, Ceph, etc.)
For other S3-compatible providers, set the bucket configuration as an environment variable as follows:
global:
  extraEnv:
    "BUCKET": "s3://wandb:changeme@mydb.com/wandb?tls=true"
The variable contains a connection string in this form:
s3://$ACCESS_KEY:$SECRET_KEY@$HOST/$BUCKET_NAME
You can optionally tell W&B to only connect over TLS if you configure a trusted SSL certificate for your object store. To do so, add the tls query parameter to the URL:
s3://$ACCESS_KEY:$SECRET_KEY@$HOST/$BUCKET_NAME?tls=true
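To make the parts of the connection string concrete, the sketch below assembles one from its components and splits it back apart with Python's standard URL parser. Both helpers are hypothetical, and the sketch does not escape special characters in the credentials, so treat it as an illustration of the format rather than a complete implementation.

```python
from typing import Dict, Optional, Union
from urllib.parse import parse_qs, urlsplit


def build_bucket_url(access_key: str, secret_key: str, host: str,
                     bucket_name: str, tls: bool = False) -> str:
    """Assemble s3://ACCESS_KEY:SECRET_KEY@HOST/BUCKET_NAME, optionally with ?tls=true."""
    url = f"s3://{access_key}:{secret_key}@{host}/{bucket_name}"
    if tls:
        url += "?tls=true"
    return url


def describe_bucket_url(url: str) -> Dict[str, Union[Optional[str], bool]]:
    """Split a connection string into its named parts."""
    parts = urlsplit(url)
    return {
        "access_key": parts.username,
        "secret_key": parts.password,
        "host": parts.hostname,
        "bucket": parts.path.lstrip("/"),
        "tls": parse_qs(parts.query).get("tls") == ["true"],
    }


url = build_bucket_url("wandb", "changeme", "mydb.com", "wandb", tls=True)
print(url)
print(describe_bucket_url(url))
```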
global:
  mysql:
    # Example values, replace with your own
    database: wandb_local
    host: 10.218.0.2
    name: wandb_local
    password: 8wtX6cJH...ZcUarK4zZGjpV
    port: 3306
    user: wandb
global:
  # Example license, replace with your own
  license: eyJhbGnUzaHgyQjQy...VFnPS_KETXg1hi
To identify the ingress class, see this FAQ entry.
Without TLS
global:
# IMPORTANT: Ingress is on the same level in the YAML as ‘global’ (not a child)
ingress:
  class: ""
With TLS
Create a secret that contains the certificate
kubectl create secret tls wandb-ingress-tls --key wandb-ingress-tls.key --cert wandb-ingress-tls.crt
Reference the secret in the ingress configuration
global:
# IMPORTANT: Ingress is on the same level in the YAML as ‘global’ (not a child)
ingress:
  class: ""
  annotations:
    {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  tls:
    - secretName: wandb-ingress-tls
      hosts:
        - <HOST_URI>
If you use NGINX, you might have to add the following annotation:
ingress:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: 64m
Specify custom Kubernetes service accounts to run the W&B pods.
The following snippet creates a service account as part of the deployment with the specified name:
app:
  serviceAccount:
    name: custom-service-account
    create: true
parquet:
  serviceAccount:
    name: custom-service-account
    create: true
global:
  ...
The subsystems “app” and “parquet” run under the specified service account. The other subsystems run under the default service account.
If the service account already exists on the cluster, set create: false:
app:
  serviceAccount:
    name: custom-service-account
    create: false
parquet:
  serviceAccount:
    name: custom-service-account
    create: false
global:
  ...
You can specify service accounts on different subsystems such as app, parquet, console, and others:
app:
  serviceAccount:
    name: custom-service-account
    create: true
console:
  serviceAccount:
    name: custom-service-account
    create: true
global:
  ...
The service accounts can be different between the subsystems:
app:
  serviceAccount:
    name: custom-service-account
    create: false
console:
  serviceAccount:
    name: another-custom-service-account
    create: true
global:
  ...
redis:
  install: false
global:
  redis:
    host: ""
    port: 6379
    password: ""
    parameters: {}
    caCert: ""
Alternatively, store the Redis password in a Kubernetes secret:
kubectl create secret generic redis-secret --from-literal=redis-password=supersecret
Reference the secret in the following configuration:
redis:
  install: false
global:
  redis:
    host: redis.example
    port: 9001
    auth:
      enabled: true
      secret: redis-secret
      key: redis-password
Without TLS
global:
  ldap:
    enabled: true
    # LDAP server address including "ldap://" or "ldaps://"
    host:
    # LDAP search base to use for finding users
    baseDN:
    # LDAP user to bind with (if not using anonymous bind)
    bindDN:
    # Secret name and key with LDAP password to bind with (if not using anonymous bind)
    bindPW:
    # LDAP attribute for email and group ID attribute names as comma separated string values.
    attributes:
    # LDAP group allow list
    groupAllowList:
    # Enable LDAP TLS
    tls: false
With TLS
The LDAP TLS certificate configuration requires a pre-created config map that contains the certificate.
To create the config map, you can use the following command:
kubectl create configmap ldap-tls-cert --from-file=certificate.crt
Then reference the config map in the YAML, as in the following example:
global:
  ldap:
    enabled: true
    # LDAP server address including "ldap://" or "ldaps://"
    host:
    # LDAP search base to use for finding users
    baseDN:
    # LDAP user to bind with (if not using anonymous bind)
    bindDN:
    # Secret name and key with LDAP password to bind with (if not using anonymous bind)
    bindPW:
    # LDAP attribute for email and group ID attribute names as comma separated string values.
    attributes:
    # LDAP group allow list
    groupAllowList:
    # Enable LDAP TLS
    tls: true
    # ConfigMap name and key with CA certificate for LDAP server
    tlsCert:
      configMap:
        name: "ldap-tls-cert"
        key: "certificate.crt"
global:
  auth:
    sessionLengthHours: 720
    oidc:
      clientId: ""
      secret: ""
      authMethod: ""
      issuer: ""
global:
  email:
    smtp:
      host: ""
      port: 587
      user: ""
      password: ""
global:
  extraEnv:
    GLOBAL_ENV: "example"
customCACerts is a list and can take many certificates. Certificate authorities specified in customCACerts only apply to the W&B Server application.
global:
  customCACerts:
  - |
    -----BEGIN CERTIFICATE-----
    MIIBnDCCAUKgAwIBAg.....................fucMwCgYIKoZIzj0EAwIwLDEQ
    MA4GA1UEChMHSG9tZU.....................tZUxhYiBSb290IENBMB4XDTI0
    MDQwMTA4MjgzMFoXDT.....................oNWYggsMo8O+0mWLYMAoGCCqG
    SM49BAMCA0gAMEUCIQ.....................hwuJgyQRaqMI149div72V2QIg
    P5GD+5I+02yEp58Cwxd5Bj2CvyQwTjTO4hiVl1Xd0M0=
    -----END CERTIFICATE-----
  - |
    -----BEGIN CERTIFICATE-----
    MIIBxTCCAWugAwIB.......................qaJcwCgYIKoZIzj0EAwIwLDEQ
    MA4GA1UEChMHSG9t.......................tZUxhYiBSb290IENBMB4XDTI0
    MDQwMTA4MjgzMVoX.......................UK+moK4nZYvpNpqfvz/7m5wKU
    SAAwRQIhAIzXZMW4.......................E8UFqsCcILdXjAiA7iTluM0IU
    aIgJYVqKxXt25blH/VyBRzvNhViesfkNUQ==
    -----END CERTIFICATE-----
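Because customCACerts takes one list entry per certificate, a CA bundle file that contains several certificates has to be split first. The sketch below shows one way to do that with a regular expression. `split_pem_bundle` is a hypothetical helper, it does not validate the certificate contents, and the sample bundle uses placeholder bodies.

```python
import re
from typing import List

# Matches one PEM certificate block, including its BEGIN and END markers.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(bundle: str) -> List[str]:
    """Return each certificate block in the bundle as its own string."""
    return [match.group(0) + "\n" for match in PEM_BLOCK.finditer(bundle)]


# Placeholder bundle; the bodies are not real certificates.
bundle = """\
-----BEGIN CERTIFICATE-----
AAAA
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
BBBB
-----END CERTIFICATE-----
"""

certs = split_pem_bundle(bundle)
print(len(certs))
```

Each returned string can then be used as one entry in the list.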
This section describes configuration options for the W&B Kubernetes operator (wandb-controller-manager). The operator receives its configuration in the form of a YAML file.
By default, the W&B Kubernetes operator does not need a configuration file. Create a configuration file if required. For example, you might need a configuration file to specify custom certificate authorities or to deploy in an air-gapped environment.
Find the full list of spec customization in the Helm repository.
A custom certificate authority (customCACerts) is a list and can take many certificates. Certificate authorities added here only apply to the W&B Kubernetes operator (wandb-controller-manager).
customCACerts:
- |
  -----BEGIN CERTIFICATE-----
  MIIBnDCCAUKgAwIBAg.....................fucMwCgYIKoZIzj0EAwIwLDEQ
  MA4GA1UEChMHSG9tZU.....................tZUxhYiBSb290IENBMB4XDTI0
  MDQwMTA4MjgzMFoXDT.....................oNWYggsMo8O+0mWLYMAoGCCqG
  SM49BAMCA0gAMEUCIQ.....................hwuJgyQRaqMI149div72V2QIg
  P5GD+5I+02yEp58Cwxd5Bj2CvyQwTjTO4hiVl1Xd0M0=
  -----END CERTIFICATE-----
- |
  -----BEGIN CERTIFICATE-----
  MIIBxTCCAWugAwIB.......................qaJcwCgYIKoZIzj0EAwIwLDEQ
  MA4GA1UEChMHSG9t.......................tZUxhYiBSb290IENBMB4XDTI0
  MDQwMTA4MjgzMVoX.......................UK+moK4nZYvpNpqfvz/7m5wKU
  SAAwRQIhAIzXZMW4.......................E8UFqsCcILdXjAiA7iTluM0IU
  aIgJYVqKxXt25blH/VyBRzvNhViesfkNUQ==
  -----END CERTIFICATE-----
See Accessing the W&B Kubernetes Operator Management Console.
Execute the following command on a host that can reach the Kubernetes cluster:
kubectl port-forward svc/wandb-console 8082
Access the console in the browser with https://localhost:8082/console.
See Accessing the W&B Kubernetes Operator Management Console on how to get the password (Option 2).
The application pod is named wandb-app-xxx.
kubectl get pods
kubectl logs wandb-XXXXX-XXXXX
You can get the ingress class installed in your cluster by running:
kubectl get ingressclass