> Metrics automatically logged by W&B.

# System Metrics Reference

This page provides detailed information about the system metrics that are tracked by the W\&B SDK.

<Note>
  `wandb` automatically logs system metrics every 15 seconds.
</Note>

## View system metrics

You can view and monitor system metrics using the W\&B App or the `wandb beta leet` terminal UI.

<Tabs>
  <Tab title="App">
    To view system metrics in the W\&B App:

    1. Navigate to your project in the W\&B App.
    2. Select a run from the **Runs** table.
    3. In the workspace, look for the **System** section which displays charts for:
       * GPU utilization and memory
       * CPU usage
       * Memory usage
       * Disk I/O
       * Network traffic

    You can customize which system metrics to display by adding panels to your workspace. See [Panels](/models/app/features/panels/) for more information on creating and customizing visualizations.
  </Tab>

  <Tab title="LEET">
    To view system metrics for a run in your terminal using the `wandb beta leet` terminal UI:

    1. If you started the run locally from a script, navigate to the directory where you ran your code. It contains a `wandb/` directory with a subdirectory per run and a `latest-run/` symbolic link. Each run directory contains a transaction log named in the format `run-<run-ID>.wandb`.

       If you did not start the run locally but downloaded a `.wandb` transaction log file instead, make a note of its location.
    2. Start `wandb beta leet` using one of these commands:

       ```bash theme={null}
       # View the latest run, stored in ./wandb/latest-run/
       wandb beta leet

       # Specify a run directory
       wandb beta leet ./wandb/run-20250813_124246-n67z9ude

       # Specify a .wandb file
       wandb beta leet ./wandb/run-20250813_124246-n67z9ude/run-n67z9ude.wandb
       ```

    LEET displays system metrics in the **right sidebar**, showing:

    * GPU utilization (%) and memory usage (GB)
    * CPU usage
    * RAM usage (GB)
    * Disk I/O
    * Network activity

    Get started with these keyboard shortcuts:

    * `h` or `?` - View all keyboard shortcuts
    * `/` - Filter metrics by pattern
    * `[` / `]` - Toggle left/right sidebars
    * `n` / `N` - Navigate between metric pages
    * `q` / `CTRL+C` - Quit

    See [`wandb beta leet`](/models/ref/cli/wandb-beta/wandb-beta-leet) for more details.
  </Tab>
</Tabs>

## CPU

### Process CPU Percent (CPU)

Percentage of CPU usage by the process, normalized by the number of available CPUs.

W\&B assigns a `cpu` tag to this metric.
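The normalization can be illustrated with a minimal sketch. It assumes the raw reading counts 100% for each fully busy core, so a multi-threaded process can exceed 100% before normalization. The function name `normalize_cpu_percent` is hypothetical and is not part of the `wandb` API.

```python
def normalize_cpu_percent(raw_percent: float, cpu_count: int) -> float:
    """Scale a per-core CPU percentage so that 100 means every core is busy."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return raw_percent / cpu_count


# A process that keeps 2 of 8 cores fully busy has a raw reading of 200%.
print(normalize_cpu_percent(200.0, 8))  # 25.0
```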

### Process CPU Threads

The number of threads utilized by the process.

W\&B assigns a `proc.cpu.threads` tag to this metric.

## Disk

By default, the usage metrics are collected for the `/` path. To configure the paths to be monitored, use the following setting:

```python theme={null}
import wandb

# Monitor disk usage for several mount points instead of only "/".
run = wandb.init(
    settings=wandb.Settings(
        x_stats_disk_paths=("/System/Volumes/Data", "/home", "/mnt/data"),
    ),
)
```

### Disk Usage Percent

Represents the total system disk usage as a percentage for the specified paths.

W\&B assigns a `disk.{path}.usagePercent` tag to this metric.

### Disk Usage

Represents the total system disk usage in gigabytes (GB) for specified paths.
The paths that are accessible are sampled, and the disk usage (in GB) for each path is appended to the samples.

W\&B assigns a `disk.{path}.usageGB` tag to this metric.

### Disk In

Indicates the total system disk read in megabytes (MB).
The initial disk read bytes are recorded when the first sample is taken. Subsequent samples calculate the difference between the current read bytes and the initial value.

W\&B assigns a `disk.in` tag to this metric.

### Disk Out

Represents the total system disk write in megabytes (MB).
Similar to [Disk In](#disk-in), the initial disk write bytes are recorded when the first sample is taken. Subsequent samples calculate the difference between the current write bytes and the initial value.

W\&B assigns a `disk.out` tag to this metric.
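The baseline-and-difference calculation described for Disk In and Disk Out can be sketched as follows. This is an illustration rather than the SDK's implementation: the class name `CumulativeDelta` is hypothetical, and the conversion assumes 1 MB = 1024 × 1024 bytes.

```python
from typing import Optional


class CumulativeDelta:
    """Report a growing counter relative to its value at the first sample."""

    def __init__(self) -> None:
        self._initial: Optional[int] = None

    def sample_mb(self, current_bytes: int) -> float:
        # The first sample becomes the baseline, so it always reports 0.
        if self._initial is None:
            self._initial = current_bytes
        return (current_bytes - self._initial) / (1024 * 1024)


disk_in = CumulativeDelta()
print(disk_in.sample_mb(10 * 1024 * 1024))  # 0.0
print(disk_in.sample_mb(14 * 1024 * 1024))  # 4.0
```

Because the baseline is taken at the first sample, the reported value reflects activity since monitoring started, not since the machine booted.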

## Memory

### Process Memory RSS

Represents the Memory Resident Set Size (RSS) in megabytes (MB) for the process. RSS is the portion of memory occupied by a process that is held in main memory (RAM).

W\&B assigns a `proc.memory.rssMB` tag to this metric.

### Process Memory Percent

Indicates the memory usage of the process as a percentage of the total available memory.

W\&B assigns a `proc.memory.percent` tag to this metric.

### Memory Percent

Represents the total system memory usage as a percentage of the total available memory.

W\&B assigns a `memory_percent` tag to this metric.

### Memory Available

Indicates the total available system memory in megabytes (MB).

W\&B assigns a `proc.memory.availableMB` tag to this metric.

## Network

### Network Sent

Represents the total bytes sent over the network.
The initial bytes sent are recorded when the metric is first initialized. Subsequent samples calculate the difference between the current bytes sent and the initial value.

W\&B assigns a `network.sent` tag to this metric.

### Network Received

Indicates the total bytes received over the network.
Similar to [Network Sent](#network-sent), the initial bytes received are recorded when the metric is first initialized. Subsequent samples calculate the difference between the current bytes received and the initial value.

W\&B assigns a `network.recv` tag to this metric.

## NVIDIA GPU

In addition to the metrics described below, if the process or its descendants use a particular GPU, W\&B captures the corresponding metrics as `gpu.process.{gpu_index}.{metric_name}`.
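When you post-process exported metrics, it can help to split these keys into their parts. The following sketch assumes keys follow exactly the `gpu.{gpu_index}.{metric_name}` and `gpu.process.{gpu_index}.{metric_name}` patterns. `GpuKey` and `parse_gpu_key` are hypothetical helper names, not part of the `wandb` API.

```python
import re
from typing import NamedTuple, Optional

# Matches "gpu.<index>.<metric>" and "gpu.process.<index>.<metric>".
GPU_KEY = re.compile(r"^gpu\.(?:(process)\.)?(\d+)\.(\w+)$")


class GpuKey(NamedTuple):
    gpu_index: int
    metric_name: str
    process_only: bool  # True for the per-process variant of the metric


def parse_gpu_key(key: str) -> Optional[GpuKey]:
    """Split a GPU metric key into its parts, or return None if it is not one."""
    match = GPU_KEY.match(key)
    if match is None:
        return None
    return GpuKey(int(match.group(2)), match.group(3), match.group(1) is not None)


print(parse_gpu_key("gpu.0.memoryAllocated"))
print(parse_gpu_key("gpu.process.1.temp"))
```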

### GPU Memory Utilization

Represents the GPU memory utilization in percent for each GPU.

W\&B assigns a `gpu.{gpu_index}.memory` tag to this metric.

### GPU Memory Allocated

Indicates the GPU memory allocated as a percentage of the total available memory for each GPU.

W\&B assigns a `gpu.{gpu_index}.memoryAllocated` tag to this metric.

### GPU Memory Allocated Bytes

Specifies the GPU memory allocated in bytes for each GPU.

W\&B assigns a `gpu.{gpu_index}.memoryAllocatedBytes` tag to this metric.

### GPU Utilization

Reflects the GPU utilization in percent for each GPU.

W\&B assigns a `gpu.{gpu_index}.gpu` tag to this metric.

### GPU Temperature

The GPU temperature in Celsius for each GPU.

W\&B assigns a `gpu.{gpu_index}.temp` tag to this metric.

### GPU Power Usage Watts

Indicates the GPU power usage in Watts for each GPU.

W\&B assigns a `gpu.{gpu_index}.powerWatts` tag to this metric.

### GPU Power Usage Percent

Reflects the GPU power usage as a percentage of its power capacity for each GPU.

W\&B assigns a `gpu.{gpu_index}.powerPercent` tag to this metric.

### GPU SM Clock Speed

Represents the clock speed of the Streaming Multiprocessor (SM) on the GPU in MHz. This metric is indicative of the processing speed within the GPU cores responsible for computation tasks.

W\&B assigns a `gpu.{gpu_index}.smClock` tag to this metric.

### GPU Memory Clock Speed

Represents the clock speed of the GPU memory in MHz, which influences the rate of data transfer between the GPU memory and processing cores.

W\&B assigns a `gpu.{gpu_index}.memoryClock` tag to this metric.

### GPU Graphics Clock Speed

Represents the base clock speed for graphics rendering operations on the GPU, expressed in MHz. This metric often reflects performance during visualization or rendering tasks.

W\&B assigns a `gpu.{gpu_index}.graphicsClock` tag to this metric.

### GPU Corrected Memory Errors

Tracks the count of memory errors on the GPU that are automatically corrected by error-checking protocols, indicating recoverable hardware issues.

W\&B assigns a `gpu.{gpu_index}.correctedMemoryErrors` tag to this metric.

### GPU Uncorrected Memory Errors

Tracks the count of memory errors on the GPU that remain uncorrected, indicating non-recoverable errors that can impact processing reliability.

W\&B assigns a `gpu.{gpu_index}.unCorrectedMemoryErrors` tag to this metric.

### GPU Encoder Utilization

Represents the percentage utilization of the GPU's video encoder, indicating its load when encoding tasks (for example, video rendering) are running.

W\&B assigns a `gpu.{gpu_index}.encoderUtilization` tag to this metric.

## AMD GPU

W\&B extracts metrics from the output of the `rocm-smi` tool supplied by AMD (`rocm-smi -a --json`).

ROCm [6.x (latest)](https://rocm.docs.amd.com/en/latest/) and [5.x](https://rocm.docs.amd.com/en/docs-5.6.0/) formats are supported. Learn more about ROCm formats in the [AMD ROCm documentation](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html). The newer format includes more details.

### AMD GPU Utilization

Represents the GPU utilization in percent for each AMD GPU device.

W\&B assigns a `gpu.{gpu_index}.gpu` tag to this metric.

### AMD GPU Memory Allocated

Indicates the GPU memory allocated as a percentage of the total available memory for each AMD GPU device.

W\&B assigns a `gpu.{gpu_index}.memoryAllocated` tag to this metric.

### AMD GPU Temperature

The GPU temperature in Celsius for each AMD GPU device.

W\&B assigns a `gpu.{gpu_index}.temp` tag to this metric.

### AMD GPU Power Usage Watts

The GPU power usage in Watts for each AMD GPU device.

W\&B assigns a `gpu.{gpu_index}.powerWatts` tag to this metric.

### AMD GPU Power Usage Percent

Reflects the GPU power usage as a percentage of its power capacity for each AMD GPU device.

W\&B assigns a `gpu.{gpu_index}.powerPercent` tag to this metric.

## Apple ARM Mac GPU

### Apple GPU Utilization

Indicates the GPU utilization in percent for Apple GPU devices, specifically on ARM Macs.

W\&B assigns a `gpu.0.gpu` tag to this metric.

### Apple GPU Memory Allocated

The GPU memory allocated as a percentage of the total available memory for Apple GPU devices on ARM Macs.

W\&B assigns a `gpu.0.memoryAllocated` tag to this metric.

### Apple GPU Temperature

The GPU temperature in Celsius for Apple GPU devices on ARM Macs.

W\&B assigns a `gpu.0.temp` tag to this metric.

### Apple GPU Power Usage Watts

The GPU power usage in Watts for Apple GPU devices on ARM Macs.

W\&B assigns a `gpu.0.powerWatts` tag to this metric.

### Apple GPU Power Usage Percent

The GPU power usage as a percentage of its power capacity for Apple GPU devices on ARM Macs.

W\&B assigns a `gpu.0.powerPercent` tag to this metric.

## Graphcore IPU

Graphcore IPUs (Intelligence Processing Units) are unique hardware accelerators designed specifically for machine intelligence tasks.

### IPU Device Metrics

These metrics represent various statistics for a specific IPU device. Each metric has a device ID (`device_id`) and a metric key (`metric_key`) to identify it. W\&B assigns a `ipu.{device_id}.{metric_key}` tag to this metric.

Metrics are extracted using the proprietary `gcipuinfo` library, which interacts with Graphcore's `gcipuinfo` binary. The `sample` method fetches these metrics for each IPU device associated with the process ID (`pid`). Only the metrics that change over time, or the first time a device's metrics are fetched, are logged to avoid logging redundant data.

For each metric, the method `parse_metric` is used to extract the metric's value from its raw string representation. The metrics are then aggregated across multiple samples using the `aggregate` method.

The following lists available metrics and their units:

* **Average Board Temperature** (`average board temp (C)`): Temperature of the IPU board in Celsius.
* **Average Die Temperature** (`average die temp (C)`): Temperature of the IPU die in Celsius.
* **Clock Speed** (`clock (MHz)`): The clock speed of the IPU in MHz.
* **IPU Power** (`ipu power (W)`): Power consumption of the IPU in Watts.
* **IPU Utilization** (`ipu utilisation (%)`): Percentage of IPU utilization.
* **IPU Session Utilization** (`ipu utilisation (session) (%)`): IPU utilization percentage specific to the current session.
* **Data Link Speed** (`speed (GT/s)`): Speed of data transmission in Giga-transfers per second.
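The parsing and change-detection steps described above can be sketched as follows. This is an illustration under stated assumptions, not the SDK's code: it assumes raw values arrive as strings with a trailing unit (for example `"41.5C"`). The `UNIT_SUFFIXES` table and the `changed_metrics` helper are hypothetical, and this `parse_metric` body is a simplified stand-in for the method of the same name.

```python
from typing import Dict, Optional, Tuple

# Assumed unit suffixes, chosen to match the units in the list above.
UNIT_SUFFIXES: Dict[str, str] = {
    "temp": "C",
    "clock": "MHz",
    "power": "W",
    "utilisation": "%",
    "utilisation (session)": "%",
    "speed": "GT/s",
}


def parse_metric(key: str, raw: str) -> Optional[Tuple[str, float]]:
    """Strip a known unit suffix from `raw`; return (key with unit, number)."""
    for name, unit in UNIT_SUFFIXES.items():
        if key.endswith(name) and raw.endswith(unit):
            try:
                return f"{key} ({unit})", float(raw[: -len(unit)])
            except ValueError:
                return None
    return None


def changed_metrics(
    previous: Dict[str, float], current: Dict[str, float]
) -> Dict[str, float]:
    """Keep only values that are new or differ from the previous sample."""
    return {k: v for k, v in current.items() if previous.get(k) != v}


print(parse_metric("average board temp", "41.5C"))
print(changed_metrics({"clock (MHz)": 1300.0}, {"clock (MHz)": 1300.0, "ipu power (W)": 95.0}))
```

On the first fetch for a device, `previous` is empty, so every metric is treated as changed and logged.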

## Google Cloud TPU

Tensor Processing Units (TPUs) are Google's custom-developed ASICs (Application Specific Integrated Circuits) used to accelerate machine learning workloads.

<Note>
  The exact metrics reported depend on the TPU chip generation (such as v4, v5e, v5p, v6e, or 7x) and on the runtime environment.
</Note>

See [Supported metrics in Google's Cloud TPU documentation](https://docs.cloud.google.com/tpu/docs/tpu-monitoring-library) for more information about the available metrics and their definitions.

### TPU Compute metrics

* `tpu.{tpu_index}.tensorcoreUtilization`: TensorCore utilization percentage per device. It is the most direct measure of whether TensorCore compute units are being fully utilized.

* `tpu.{tpu_index}.dutyCycle`: The percentage of time during the sample period when the accelerator TensorCore was actively processing. This metric is reported per chip and fanned out to all devices on multi-device chips. Higher values indicate better TensorCore utilization.

### TPU Memory metrics

* `tpu.{tpu_index}.hbmCapacityTotal`: Total High Bandwidth Memory (HBM) capacity in bytes per device.

* `tpu.{tpu_index}.hbmCapacityUsage`: Current HBM usage in bytes per device.
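Both HBM metrics are reported in bytes, so a usage percentage has to be derived. The sketch below shows one way to compute it per device from a flat mapping of metric keys to values. The helper name `hbm_usage_percent` and the sample values are hypothetical.

```python
import re
from typing import Dict

HBM_KEY = re.compile(r"^tpu\.(\d+)\.hbmCapacity(Total|Usage)$")


def hbm_usage_percent(metrics: Dict[str, float]) -> Dict[int, float]:
    """Return HBM usage as a percentage of capacity for each TPU device."""
    totals: Dict[int, float] = {}
    usages: Dict[int, float] = {}
    for key, value in metrics.items():
        match = HBM_KEY.match(key)
        if match is None:
            continue  # Ignore unrelated metrics such as dutyCycle.
        target = totals if match.group(2) == "Total" else usages
        target[int(match.group(1))] = value
    # Skip devices with a missing or zero total to avoid dividing by zero.
    return {i: 100.0 * usages[i] / totals[i] for i in sorted(usages) if totals.get(i)}


sample = {
    "tpu.0.hbmCapacityTotal": 16 * 2**30,
    "tpu.0.hbmCapacityUsage": 4 * 2**30,
    "tpu.1.hbmCapacityTotal": 16 * 2**30,
    "tpu.1.hbmCapacityUsage": 8 * 2**30,
    "tpu.0.dutyCycle": 93.0,
}
print(hbm_usage_percent(sample))  # {0: 25.0, 1: 50.0}
```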

### TPU Interconnect health

* `tpu.{tpu_index}.iciLinkHealth`: Inter-Chip Interconnect (ICI) link health per device. This metric is available only through the libtpu SDK path.

### TPU Transfer latency

Latency distributions in microseconds. Reported statistics can include mean, p50, p90, p95, and p999. These metrics are most relevant for multi-slice TPU pod configurations.

* `tpu.bufferTransferLatency.{label}.{stat}Us`: DCN (Data Center Network) buffer transfer latencies between slices.

* `tpu.inboundBufferTransferLatency.{label}.{stat}Us`: Inbound DCN buffer transfer latencies.

* `tpu.hostToDeviceTransferLatency.{label}.{stat}Us`: Host-to-device data transfer latencies.

* `tpu.deviceToHostTransferLatency.{label}.{stat}Us`: Device-to-host data transfer latencies.

### TPU Collective communication

* `tpu.collectiveE2ELatency.{label}.{stat}Us`: End-to-end latency for collective operations, such as all-reduce and all-gather.

* `tpu.hostComputeLatency.{label}.{stat}Us`: Host-side compute latencies, including MXLA compute latencies.

### TPU Network metrics

* `tpu.grpcTcpMinRtt.{stat}Us`: Minimum TCP round-trip time for gRPC connections.

* `tpu.grpcTcpDeliveryRate.{stat}Mbps`: TCP delivery rate for gRPC connections, in megabits per second.

### TPU HLO execution metrics

* `tpu.hloExecTiming.{label}.{stat}Us`: HLO (High Level Operations) execution timing distributions in microseconds. This metric reports per-operation execution time.

* `tpu.hloQueueSize.{label}`: Current size of the HLO execution queue. This metric indicates how many operations are queued for execution.

## AWS Trainium

[AWS Trainium](https://aws.amazon.com/machine-learning/trainium/) is a specialized hardware platform offered by AWS that focuses on accelerating machine learning workloads. The `neuron-monitor` tool from AWS is used to capture the AWS Trainium metrics.

### Trainium Neuron Core Utilization

The utilization percentage of each NeuronCore, reported on a per-core basis.

W\&B assigns a `trn.{core_index}.neuroncore_utilization` tag to this metric.

### Trainium Host Memory Usage, Total

The total memory consumption on the host in bytes.

W\&B assigns a `trn.host_total_memory_usage` tag to this metric.

### Trainium Neuron Device Total Memory Usage

The total memory usage on the Neuron device in bytes.

W\&B assigns a `trn.neuron_device_total_memory_usage` tag to this metric.

### Trainium Host Memory Usage Breakdown

The following is a breakdown of memory usage on the host:

* **Application Memory** (`trn.host_total_memory_usage.application_memory`): Memory used by the application.
* **Constants** (`trn.host_total_memory_usage.constants`): Memory used for constants.
* **DMA Buffers** (`trn.host_total_memory_usage.dma_buffers`): Memory used for Direct Memory Access buffers.
* **Tensors** (`trn.host_total_memory_usage.tensors`): Memory used for tensors.

### Trainium Neuron Core Memory Usage Breakdown

Detailed memory usage information for each NeuronCore:

* **Constants** (`trn.{core_index}.neuroncore_memory_usage.constants`)
* **Model Code** (`trn.{core_index}.neuroncore_memory_usage.model_code`)
* **Model Shared Scratchpad** (`trn.{core_index}.neuroncore_memory_usage.model_shared_scratchpad`)
* **Runtime Memory** (`trn.{core_index}.neuroncore_memory_usage.runtime_memory`)
* **Tensors** (`trn.{core_index}.neuroncore_memory_usage.tensors`)

## OpenMetrics

W\&B can capture and log metrics from external endpoints that expose OpenMetrics or Prometheus-compatible data. You can apply custom regex-based filters to choose which metrics are consumed from each endpoint.

Refer to [Monitoring GPU cluster performance in W\&B](https://wandb.ai/dimaduev/dcgm/reports/Monitoring-GPU-cluster-performance-with-NVIDIA-DCGM-Exporter-and-Weights-Biases--Vmlldzo0MDYxMTA1) for a detailed example that uses this feature to monitor GPU cluster performance with the [NVIDIA DCGM-Exporter](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html).
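To illustrate what a regex-based metric filter does, the following sketch applies patterns to a small Prometheus-style text payload. It does not show how the SDK is configured or implemented. The helper `filter_metric_lines` is hypothetical, the sample payload is made up for illustration, and the parser assumes one `name{labels} value` sample per line with no timestamps.

```python
import re
from typing import Dict, Iterable, List


def filter_metric_lines(lines: Iterable[str], patterns: List[str]) -> Dict[str, float]:
    """Keep samples whose metric name matches any pattern; skip comments."""
    compiled = [re.compile(p) for p in patterns]
    kept: Dict[str, float] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # Drop the label set, if any.
        if any(p.search(name) for p in compiled):
            kept[series] = float(value)
    return kept


payload = """\
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 212.5
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 64
"""
selected = filter_metric_lines(
    payload.splitlines(), [r"^DCGM_FI_DEV_(GPU_UTIL|POWER_USAGE)$"]
)
print(selected)
```

Filtering by name before logging keeps high-cardinality exporters from flooding a run with series you do not need.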
