# Reference Architecture

> Review the reference architecture for self-managed W&B deployments covering Kubernetes, MySQL, object storage, and networking.

This page describes a reference architecture for a W\&B deployment and outlines the recommended infrastructure and resources to support a production deployment of the platform.

Depending on your chosen deployment environment for W\&B, various services can help to enhance the resiliency of your deployment.

For instance, major cloud providers offer robust managed database services which help to reduce the complexity of database configuration, maintenance, high availability, and resilience.

This reference architecture addresses some common deployment scenarios and shows how you can integrate your W\&B deployment with cloud vendor services for optimal performance and reliability.

## Before you start

Running any application in production comes with its own set of challenges, and W\&B is no exception. While we aim to streamline the process, certain complexities may arise depending on your unique architecture and design decisions. Typically, managing a production deployment involves overseeing various components, including hardware, operating systems, networking, storage, security, the W\&B platform itself, and other dependencies. This responsibility extends to both the initial setup of the environment and its ongoing maintenance.

Consider carefully whether a Self-Managed approach with W\&B is suitable for your team and specific requirements.

A strong understanding of how to run and maintain production-grade applications is an important prerequisite before you deploy Self-Managed W\&B. If your team needs assistance, our Professional Services team and partners offer support for implementation and optimization.

To learn more about managed solutions for running W\&B instead of managing it yourself, refer to [W\&B Multi-tenant Cloud](/platform/hosting/hosting-options/multi_tenant_cloud) and [W\&B Dedicated Cloud](/platform/hosting/hosting-options/dedicated-cloud).

## Infrastructure

<Frame>
  <img src="https://mintcdn.com/wb-21fd5541/7mSicW8MfO9qZmb2/images/hosting/reference_architecture.png?fit=max&auto=format&n=7mSicW8MfO9qZmb2&q=85&s=d79b176eccdb655d806fc58d08f412cb" alt="W&B infrastructure diagram" width="851" height="1151" data-path="images/hosting/reference_architecture.png" />
</Frame>

### Application layer

The application layer consists of a multi-node Kubernetes cluster, with resilience against node failures. The Kubernetes cluster runs and maintains W\&B's pods.

### Storage layer

The storage layer consists of a MySQL database and object storage. The MySQL database stores metadata and the object storage stores artifacts such as models and datasets.

## Infrastructure requirements

The following sections detail requirements for various aspects of a W\&B deployment, including Kubernetes cluster details, MySQL, Redis, object storage, software versions, networking, DNS, load balancer and ingress, SSL/TLS, and supported CPU architectures.

### Kubernetes

The W\&B Server application is deployed as a [Kubernetes Operator](/platform/hosting/self-managed/operator) that deploys multiple pods. For this reason, W\&B requires a Kubernetes cluster with:

* A fully configured and functioning Ingress controller.
* The capability to provision Persistent Volumes.

W\&B supports deployment on [OpenShift Kubernetes clusters](https://www.redhat.com/en/technologies/cloud-computing/openshift) in cloud, on-premises, and air-gapped environments. For specific configuration instructions, see the [OpenShift section](/platform/hosting/self-managed/operator#openshift-kubernetes-clusters) in the Operator guide.

### MySQL

W\&B stores metadata in a MySQL database. The database's performance and storage requirements depend on the shapes of the model parameters and related metadata. For example, the database grows in size as you track more training runs, and load on the database increases based on queries in run tables, user workspaces, and reports.

**W\&B strongly recommends using managed database services** (such as AWS RDS Aurora MySQL, Google Cloud SQL for MySQL, or Azure Database for MySQL) for production deployments. Managed services provide automated backups, monitoring, high availability, and patching, and they significantly reduce operational complexity. See the [Cloud provider instance recommendations](#cloud-provider-instance-recommendations) section below for specific service recommendations.

If you choose to deploy a self-managed MySQL database, consider the following:

* **Backups**: You should periodically back up the database to a separate facility. W\&B recommends daily backups with at least 1 week of retention.
* **Performance**: The database requires fast storage hardware, such as SSD or accelerated NAS.
* **Monitoring**: The database requires adequate CPU resources. Monitor the database server's CPU load. If CPU usage stays above 90% for more than 5 minutes, consider adding CPU capacity.
* **Availability**: To meet your availability and durability requirements, W\&B recommends configuring a hot standby deployment on a separate machine that streams all updates in real time from the primary deployment and is ready to fail over if the primary server crashes, becomes corrupted, or experiences sustained downtime.
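
The monitoring guidance above can be expressed as a small check. The following is a minimal sketch, assuming you already collect `(timestamp, cpu_percent)` samples from your own monitoring system. The function name `cpu_sustained_high` and the sample format are illustrative and are not part of W\&B.

```python
from datetime import datetime, timedelta

CPU_THRESHOLD_PERCENT = 90.0            # threshold from the guidance above
SUSTAINED_WINDOW = timedelta(minutes=5)  # duration from the guidance above


def cpu_sustained_high(samples, threshold=CPU_THRESHOLD_PERCENT, window=SUSTAINED_WINDOW):
    """Return True if CPU stays above `threshold` continuously for longer than `window`.

    `samples` is an iterable of (timestamp, cpu_percent) tuples sorted by time.
    """
    streak_start = None
    for timestamp, cpu_percent in samples:
        if cpu_percent > threshold:
            if streak_start is None:
                streak_start = timestamp  # first sample of a new high-CPU streak
            if timestamp - streak_start > window:
                return True
        else:
            streak_start = None  # any dip at or below the threshold resets the streak
    return False


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    one_per_minute = [(start + timedelta(minutes=i), 95.0) for i in range(7)]
    print(cpu_sustained_high(one_per_minute))
```

In practice, most monitoring systems can express the same rule as an alert condition, so a script like this is mainly useful for ad hoc analysis of exported metrics.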

#### MySQL topology

For production, a managed MySQL service is the simplest path to high availability because the cloud provider handles failover, backups, and patching. Use the provider's high availability option, for example Aurora Multi-AZ on AWS.

If you run self-managed MySQL, use a primary database with a hot standby that receives a real-time replication stream and can take over on failure. W\&B does not support a multi-master topology or read-only replicas for the application database.

#### MySQL database creation

For instructions to manually create the MySQL database and user, see the [bare-metal guide MySQL database section](/platform/hosting/self-managed/operator#mysql-database).

#### MySQL configuration parameters

If you are running your own MySQL instance, configure MySQL with these settings:

```
binlog_format = 'ROW'
binlog_row_image = 'MINIMAL'
innodb_flush_log_at_trx_commit = 1
innodb_online_alter_log_max_size = 268435456
max_prepared_stmt_count = 1048576
sort_buffer_size = '67108864'
sync_binlog = 1
```

These settings have been validated by W\&B for optimal performance and reliability.
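
To confirm that a running instance matches these values, you can compare them against the server's current variables. The following is a minimal sketch, assuming you have already read the output of `SHOW GLOBAL VARIABLES` into a dictionary with the MySQL client of your choice. The function name `find_mysql_setting_drift` is hypothetical.

```python
# Recommended values from the configuration block above, stored as strings
# because MySQL clients commonly return variable values as text.
RECOMMENDED_MYSQL_SETTINGS = {
    "binlog_format": "ROW",
    "binlog_row_image": "MINIMAL",
    "innodb_flush_log_at_trx_commit": "1",
    "innodb_online_alter_log_max_size": "268435456",
    "max_prepared_stmt_count": "1048576",
    "sort_buffer_size": "67108864",
    "sync_binlog": "1",
}


def find_mysql_setting_drift(current):
    """Return {name: (expected, actual)} for every setting that differs or is missing.

    `current` maps variable names to their current values (str or int).
    """
    drift = {}
    for name, expected in RECOMMENDED_MYSQL_SETTINGS.items():
        actual = current.get(name)
        normalized = None if actual is None else str(actual).strip().upper()
        if normalized != expected:
            drift[name] = (expected, actual)
    return drift


if __name__ == "__main__":
    observed = dict(RECOMMENDED_MYSQL_SETTINGS, binlog_row_image="FULL")
    for name, (expected, actual) in find_mysql_setting_drift(observed).items():
        print(f"{name}: expected {expected}, found {actual}")
```

An empty result means every listed variable matches the recommended value.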

### Redis

W\&B depends on a single-node Redis 7.x deployment used by W\&B's components for job queuing and data caching. For convenience during testing and development of proofs of concept, W\&B Self-Managed includes a local Redis deployment that is not appropriate for production deployments.

W\&B can connect to a Redis instance in the following environments:

* [Amazon ElastiCache](https://aws.amazon.com/elasticache/)
* [Google Cloud Memorystore](https://cloud.google.com/memorystore?hl=en)
* [Azure Cache for Redis](https://azure.microsoft.com/en-us/products/cache)
* Redis deployment hosted in your cloud or on-premises infrastructure

### Object storage

W\&B requires object storage with pre-signed URL and CORS support, provided by one of the following:

* [CoreWeave AI Object Storage](https://docs.coreweave.com/products/storage/object-storage) is a high-performance, S3-compatible object storage service optimized for AI workloads.
* [Amazon S3](https://aws.amazon.com/s3/) is an object storage service offering industry-leading scalability, data availability, security, and performance.
* [Google Cloud Storage](https://cloud.google.com/storage) is a managed service for storing unstructured data at scale.
* [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs) is a cloud-based object storage solution for storing massive amounts of unstructured data like text, binary data, images, videos, and logs.
* S3-compatible storage such as [MinIO Enterprise (AIStor)](https://www.min.io/product/aistor), [NetApp StorageGRID](https://www.netapp.com/data-storage/storagegrid/), or other enterprise-grade solutions hosted in your cloud or on-premises infrastructure.

### Versions

| Software   | Minimum version                                                                                                                 |
| ---------- | ------------------------------------------------------------------------------------------------------------------------------- |
| Kubernetes | v1.32 or newer ([Supported Kubernetes versions](https://kubernetes.io/releases/patch-releases/))                                |
| Helm       | v3.x                                                                                                                            |
| MySQL      | v8.0.x is required: v8.0.32 is the minimum, and v8.0.44 or newer is recommended.<br />For Aurora MySQL 3.x releases, v3.05.2 or newer is required. |
| Redis      | v7.x                                                                                                                            |
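
The MySQL row in the table above can be checked programmatically. The following is a minimal sketch, assuming you pass in the string returned by `SELECT VERSION()`. The function names are hypothetical, and the sketch covers community MySQL version strings only, not Aurora release numbers.

```python
import re

MYSQL_MINIMUM = (8, 0, 32)      # minimum version from the table above
MYSQL_RECOMMENDED = (8, 0, 44)  # recommended version from the table above


def parse_version(text):
    """Extract leading major.minor.patch numbers, ignoring suffixes such as '-log'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def classify_mysql_version(text):
    """Return 'unsupported', 'supported', or 'recommended' for a MySQL version string."""
    version = parse_version(text)
    # The table requires the 8.0.x series, so other series are treated as unsupported.
    if version[:2] != (8, 0) or version < MYSQL_MINIMUM:
        return "unsupported"
    if version < MYSQL_RECOMMENDED:
        return "supported"
    return "recommended"


if __name__ == "__main__":
    for candidate in ("8.0.31", "8.0.36-log", "8.0.44"):
        print(candidate, classify_mysql_version(candidate))
```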

### Networking

For a networked deployment, egress to these endpoints is required during *both* installation and runtime:

* [https://deploy.wandb.ai](https://deploy.wandb.ai)
* [https://charts.wandb.ai](https://charts.wandb.ai)
* [https://quay.io](https://quay.io) (used for Prometheus images)

<Note>
  Additional container registries may be required depending on your deployment configuration:

  * `https://gcr.io` is needed when deploying Bufstream and etcd for Weave online evaluations.
</Note>
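
Before installing, it can help to verify egress from a host inside the target network. The following is a minimal sketch that attempts a TCP connection to port 443 of each required endpoint. It assumes direct outbound connectivity: if your environment routes traffic through an HTTP proxy, a raw TCP check like this one does not reflect what the cluster can reach. The function names are hypothetical.

```python
import socket
from urllib.parse import urlsplit

# Endpoints listed in the networking requirements above.
REQUIRED_ENDPOINTS = [
    "https://deploy.wandb.ai",
    "https://charts.wandb.ai",
    "https://quay.io",
]


def tcp_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_egress(urls=REQUIRED_ENDPOINTS, connect=tcp_connect):
    """Map each URL to True or False depending on whether its host and port are reachable."""
    results = {}
    for url in urls:
        parts = urlsplit(url)
        port = parts.port or (443 if parts.scheme == "https" else 80)
        results[url] = connect(parts.hostname, port)
    return results


if __name__ == "__main__":
    for url, reachable in check_egress().items():
        print(f"{url}: {'reachable' if reachable else 'NOT reachable'}")
```

The `connect` parameter exists so that you can substitute a different probe, for example one that goes through your proxy.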

To learn about air-gapped deployments, refer to [Kubernetes operator for air-gapped instances](/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped).

The training infrastructure, and every system that tracks experiments, needs access to both W\&B and the object storage.

### DNS

The fully qualified domain name (FQDN) of the W\&B deployment must resolve to the IP address of the ingress/load balancer using an A record.
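
A quick way to confirm the DNS setup is to resolve the FQDN and compare the result with the load balancer address. The following is a minimal sketch using the Python standard library. The hostname `wandb.example.com` and the address `203.0.113.10` are placeholders, and the function names are hypothetical. Because `getaddrinfo` also follows CNAME records, a match shows that the name resolves to the expected address, not that the record type is specifically an A record.

```python
import socket


def resolve_ipv4(fqdn):
    """Return the sorted, de-duplicated IPv4 addresses that `fqdn` resolves to."""
    infos = socket.getaddrinfo(fqdn, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def fqdn_points_to(fqdn, expected_ip, resolver=resolve_ipv4):
    """Return True if `expected_ip` is among the addresses the resolver returns for `fqdn`."""
    try:
        return expected_ip in resolver(fqdn)
    except socket.gaierror:
        return False  # the name did not resolve at all


if __name__ == "__main__":
    # Placeholder values: replace with your deployment's FQDN and load balancer IP.
    print(fqdn_points_to("wandb.example.com", "203.0.113.10"))
```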

### Load balancer and ingress

The W\&B Kubernetes Operator can expose services using a Kubernetes ingress controller, which routes each request to a service endpoint, each listening on its own port, based on the URL path. The ingress controller must be accessible by all machines that execute machine learning workloads or access the service through web browsers.

#### Ingress controller requirements

Your Kubernetes cluster must have an `IngressClass` available. Common ingress controller options include:

* [Nginx Ingress Controller](https://kubernetes.github.io/ingress-nginx/)
* [Istio](https://istio.io)
* [Traefik](https://traefik.io/)
* Cloud provider ingress controllers (AWS ALB, GCP Ingress, Azure Application Gateway)

#### W\&B service routing

The W\&B Operator automatically routes requests to multiple backend services based on path:

| Path        | Service             | Default port | Purpose                            |
| ----------- | ------------------- | ------------ | ---------------------------------- |
| `/`         | `wandb-app`         | 8080         | Main web application UI            |
| `/api`      | `wandb-api`         | 8081         | API service                        |
| `/graphql`  | `wandb-api`         | 8081         | GraphQL API endpoint               |
| `/graphql2` | `wandb-api`         | 8081         | GraphQL API v2 endpoint            |
| `/console`  | `wandb-console`     | 8082         | System Console                     |
| `/traces`   | `wandb-weave-trace` | 8722         | Weave tracing service (if enabled) |
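
To illustrate how the table resolves a request, the following minimal sketch models Kubernetes `Prefix` path matching, which compares paths element by element and prefers the longest matching path. It is a simplified model for reasoning about the routes above, not the Operator's or any ingress controller's implementation. Individual controllers can differ in details, and the sketch expects a path without a query string.

```python
# (path prefix, service, port) rows from the routing table above.
ROUTES = [
    ("/", "wandb-app", 8080),
    ("/api", "wandb-api", 8081),
    ("/graphql", "wandb-api", 8081),
    ("/graphql2", "wandb-api", 8081),
    ("/console", "wandb-console", 8082),
    ("/traces", "wandb-weave-trace", 8722),
]


def _elements(path):
    """Split a URL path into its non-empty elements: '/api/v1' -> ['api', 'v1']."""
    return [part for part in path.split("/") if part]


def route(request_path):
    """Return (service, port) for the longest element-wise prefix match."""
    request = _elements(request_path)
    best = None
    for prefix, service, port in ROUTES:
        wanted = _elements(prefix)
        if request[:len(wanted)] == wanted:
            if best is None or len(wanted) > best[0]:
                best = (len(wanted), service, port)
    # The "/" route has no elements and matches every request, so best is never None.
    return best[1], best[2]


if __name__ == "__main__":
    for path in ("/", "/api/v1/runs", "/graphql2", "/console/settings"):
        print(path, "->", route(path))
```

Because matching is element-wise, `/graphql2` is routed by its own rule and is not treated as a sub-path of `/graphql`.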

#### Example ingress configuration

The following shows an example ingress resource created by the W\&B Operator:

```yaml theme={null}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: wandb
  namespace: wandb
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
spec:
  ingressClassName: nginx
  rules:
  - host: wandb.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: wandb-app
            port:
              number: 8080
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: wandb-api
            port:
              number: 8081
      - path: /graphql
        pathType: Prefix
        backend:
          service:
            name: wandb-api
            port:
              number: 8081
      - path: /graphql2
        pathType: Prefix
        backend:
          service:
            name: wandb-api
            port:
              number: 8081
      - path: /console
        pathType: Prefix
        backend:
          service:
            name: wandb-console
            port:
              number: 8082
  tls:
  - hosts:
    - wandb.example.com
    secretName: wandb-tls
```

<Note>
  The W\&B Operator creates and manages the ingress configuration automatically. You typically do not need to create ingress resources manually. Ensure your cluster has a functioning ingress controller and the appropriate `IngressClass` configured.
</Note>

### SSL/TLS

W\&B requires a valid signed SSL/TLS certificate for secure communication between clients and the server. SSL/TLS termination must occur on the ingress/load balancer. The W\&B Server application does not terminate SSL or TLS connections.

**Important**: W\&B does not support self-signed certificates or custom CAs. Self-signed certificates cause problems for users and are not supported.

If possible, using a service like [Let's Encrypt](https://letsencrypt.org) is a great way to provide trusted certificates to your load balancer. Services like Caddy and Cloudflare manage SSL for you.

If your security policies require SSL communication within your trusted networks, consider using a tool like Istio and [sidecar containers](https://istio.io/latest/docs/reference/config/networking/sidecar/).
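
One way to see whether clients will accept your certificate is to connect with a default TLS client configuration, which rejects self-signed certificates and certificates issued by CAs that are not in the system trust store. The following is a minimal sketch using the Python standard library. The hostname is a placeholder, the function names are hypothetical, and the result depends on the trust store of the machine that runs the check.

```python
import socket
import ssl


def strict_tls_context():
    """Default client context: verifies the chain against system CAs and checks the hostname."""
    return ssl.create_default_context()


def certificate_is_trusted(host, port=443, timeout=5.0):
    """Return True if a TLS handshake with full verification succeeds.

    Returns False for verification failures and also for plain network errors,
    so a False result means "not verified", not necessarily "bad certificate".
    """
    context = strict_tls_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except (ssl.SSLError, OSError):
        return False


if __name__ == "__main__":
    # Placeholder hostname: replace with the FQDN of your W&B deployment.
    print(certificate_is_trusted("wandb.example.com"))
```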

### Supported CPU architectures

W\&B runs on 64-bit Intel and AMD (x86-64) architectures. ARM is not supported.

## Deployment method

### Recommended: W\&B Kubernetes Operator with Helm

The recommended installation method for W\&B Self-Managed is using the **W\&B Kubernetes Operator**, deployed via Helm. This approach provides:

* Automated updates and management of W\&B components
* Simplified configuration and deployment
* Support for all deployment scenarios (cloud, on-premises, air-gapped)

For detailed installation instructions, see:

* [Deploy W\&B Platform On-premises](/platform/hosting/self-managed/operator) - Primary installation guide
* [Kubernetes operator for air-gapped instances](/platform/hosting/self-managed/on-premises-deployments/kubernetes-airgapped) - For disconnected environments

### Infrastructure provisioning

Terraform is the recommended way to provision infrastructure for W\&B production deployments. Using Terraform, you define the required resources, their references to other resources, and their dependencies. W\&B provides Terraform modules for the major cloud providers. For details, refer to [Deploy W\&B Server within Self-Managed cloud accounts](/platform/hosting/hosting-options/self-managed#deploy-wb-server-within-Self-Managed-cloud-accounts).

## Sizing

Use the following general guidelines as a starting point when planning a deployment. W\&B recommends that you monitor all components of a new deployment closely and that you make adjustments based on observed usage patterns. Continue to monitor production deployments over time and make adjustments as needed to maintain optimal performance.

When you plan capacity, you size two core components: a Kubernetes cluster for the W\&B Operator workload and a MySQL database for metadata. Recommendations vary by **environment** (Test/Dev or Production) and, for Kubernetes only, by **product mix** (Models only, Weave only, or Models and Weave). W\&B recommends starting with a minimum of 3 worker nodes for both Test/Dev and Production, with cluster autoscaling enabled in Production.

### Kubernetes sizing

<Tabs>
  <Tab title="Models only">
    | Environment | CPU     | Memory | Disk   |
    | ----------- | ------- | ------ | ------ |
    | Test/Dev    | 2 cores | 16 GB  | 100 GB |
    | Production  | 8 cores | 64 GB  | 100 GB |

    Numbers are per Kubernetes worker node.
  </Tab>

  <Tab title="Weave only">
    | Environment | CPU      | Memory | Disk   |
    | ----------- | -------- | ------ | ------ |
    | Test/Dev    | 4 cores  | 32 GB  | 100 GB |
    | Production  | 12 cores | 96 GB  | 100 GB |

    Numbers are per Kubernetes worker node.
  </Tab>

  <Tab title="Models and Weave">
    | Environment | CPU      | Memory | Disk   |
    | ----------- | -------- | ------ | ------ |
    | Test/Dev    | 4 cores  | 32 GB  | 100 GB |
    | Production  | 16 cores | 128 GB | 100 GB |

    Numbers are per Kubernetes worker node.
  </Tab>
</Tabs>
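
Because the figures above are per worker node, total cluster capacity depends on the node count. The following minimal sketch multiplies the per-node values by the number of nodes, using the recommended minimum of 3 worker nodes as the default. The function name `cluster_capacity` is hypothetical, and the result is a planning estimate that does not account for system overhead on each node.

```python
# Per-node (CPU cores, memory in GB) from the Kubernetes sizing tables above.
PER_NODE_SIZING = {
    ("Models only", "Test/Dev"): (2, 16),
    ("Models only", "Production"): (8, 64),
    ("Weave only", "Test/Dev"): (4, 32),
    ("Weave only", "Production"): (12, 96),
    ("Models and Weave", "Test/Dev"): (4, 32),
    ("Models and Weave", "Production"): (16, 128),
}

MIN_WORKER_NODES = 3  # recommended minimum for both Test/Dev and Production


def cluster_capacity(product_mix, environment, nodes=MIN_WORKER_NODES):
    """Return the total CPU cores and memory (GB) across `nodes` worker nodes."""
    if nodes < MIN_WORKER_NODES:
        raise ValueError(f"at least {MIN_WORKER_NODES} worker nodes are recommended")
    cpu_cores, memory_gb = PER_NODE_SIZING[(product_mix, environment)]
    return {"cpu_cores": cpu_cores * nodes, "memory_gb": memory_gb * nodes}


if __name__ == "__main__":
    print(cluster_capacity("Models and Weave", "Production"))
```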

### MySQL sizing

These recommendations do not vary by product mix. For topology and availability guidance, see [MySQL topology](#mysql-topology) under [MySQL](#mysql).

| Environment | CPU     | Memory | Disk   |
| ----------- | ------- | ------ | ------ |
| Test/Dev    | 2 cores | 16 GB  | 100 GB |
| Production  | 8 cores | 64 GB  | 500 GB |

Numbers are per MySQL node.

## Cloud provider instance recommendations

These recommendations apply to each node of a Self-Managed deployment of W\&B in cloud infrastructure.

<Tabs>
  <Tab title="AWS">
    **Recommended managed services**

    * **Kubernetes**: Amazon EKS
    * **MySQL**: Amazon RDS Aurora
    * **Object storage**: Amazon S3

    | Environment | K8s (Models only) | K8s (Weave only) | K8s (Models and Weave) | MySQL          |
    | ----------- | ----------------- | ---------------- | ------------------- | -------------- |
    | Test/Dev    | r6i.large         | r6i.xlarge       | r6i.xlarge          | db.r6g.large   |
    | Production  | r6i.2xlarge       | r6i.4xlarge      | r6i.4xlarge         | db.r6g.2xlarge |
  </Tab>

  <Tab title="Google Cloud">
    **Recommended managed services**

    * **Kubernetes**: Google Kubernetes Engine (GKE)
    * **MySQL**: Google Cloud SQL for MySQL
    * **Object storage**: Google Cloud Storage (GCS)

    | Environment | K8s (Models only) | K8s (Weave only) | K8s (Models and Weave) | MySQL           |
    | ----------- | ----------------- | ---------------- | ------------------- | --------------- |
    | Test/Dev    | n2-highmem-2      | n2-highmem-4     | n2-highmem-4        | db-n1-highmem-2 |
    | Production  | n2-highmem-8      | n2-highmem-16    | n2-highmem-16       | db-n1-highmem-8 |
  </Tab>

  <Tab title="Azure">
    **Recommended managed services**

    * **Kubernetes**: Azure Kubernetes Service (AKS)
    * **MySQL**: Azure Database for MySQL
    * **Object storage**: Azure Blob Storage

    | Environment | K8s (Models only) | K8s (Weave only)  | K8s (Models and Weave) | MySQL                  |
    | ----------- | ----------------- | ----------------- | ------------------- | ---------------------- |
    | Test/Dev    | Standard\_E2\_v5  | Standard\_E4\_v5  | Standard\_E4\_v5    | MO\_Standard\_E2ds\_v4 |
    | Production  | Standard\_E8\_v5  | Standard\_E16\_v5 | Standard\_E16\_v5   | MO\_Standard\_E8ds\_v4 |
  </Tab>
</Tabs>
