W&B Inference

Access open-source foundation models through W&B Weave and an OpenAI-compatible API

W&B Inference gives you access to leading open-source foundation models through W&B Weave and an OpenAI-compatible API. You can:

  • Build AI applications and agents without signing up for a hosting provider or self-hosting a model
  • Try supported models in the W&B Weave Playground

With Weave, you can trace, evaluate, monitor, and improve your W&B Inference-powered applications.

Quickstart

Here’s a simple example using Python:

import openai

client = openai.OpenAI(
    # The custom base URL points to W&B Inference
    base_url='https://api.inference.wandb.ai/v1',
    
    # Get your API key from https://wandb.ai/authorize
    api_key="<your-api-key>"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."}
    ],
)

print(response.choices[0].message.content)

Next steps

  1. Review the available models and usage information and limits
  2. Set up your account using the prerequisites
  3. Use the service through the API or UI
  4. Try the usage examples

Usage details

1 - Prerequisites

Set up your environment to use W&B Inference

Complete these steps before using the W&B Inference service through the API or UI.

Set up your W&B account and project

You need these items to access W&B Inference:

  1. A W&B account
    Sign up at W&B

  2. A W&B API key
    Get your API key at https://wandb.ai/authorize

  3. A W&B project
    Create a project in your W&B account to track usage

Set up your environment (Python)

To use the Inference API with Python, you also need to:

  1. Complete the general requirements above

  2. Install the required libraries:

    pip install openai weave
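
To confirm the install worked before writing application code, a quick import check can help. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and not part of W&B or OpenAI tooling.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "openai" and "weave" are the two libraries installed in the step above
    missing = missing_packages(["openai", "weave"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```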
    

Next steps

After completing the prerequisites, use the service through the API or UI, or try the usage examples.

2 - Available Models

Browse the foundation models available through W&B Inference

W&B Inference provides access to several open-source foundation models. Each model has different strengths and use cases.

Model comparison

| Model | Model ID (for API usage) | Type | Context Window | Parameters | Description |
|---|---|---|---|---|---|
| OpenAI GPT OSS 120B | openai/gpt-oss-120b | Text | 131K | 5.1B-117B (Active-Total) | Efficient Mixture-of-Experts model designed for high-reasoning, agentic and general-purpose use cases. |
| OpenAI GPT OSS 20B | openai/gpt-oss-20b | Text | 131K | 3.6B-20B (Active-Total) | Lower latency Mixture-of-Experts model trained on OpenAI’s Harmony response format with reasoning capabilities. |
| Qwen3 235B A22B Thinking-2507 | Qwen/Qwen3-235B-A22B-Thinking-2507 | Text | 262K | 22B-235B (Active-Total) | High-performance Mixture-of-Experts model optimized for structured reasoning, math, and long-form generation |
| Qwen3 235B A22B-2507 | Qwen/Qwen3-235B-A22B-Instruct-2507 | Text | 262K | 22B-235B (Active-Total) | Efficient multilingual, Mixture-of-Experts, instruction-tuned model, optimized for logical reasoning |
| Qwen3 Coder 480B A35B | Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | 262K | 35B-480B (Active-Total) | Mixture-of-Experts model optimized for coding tasks such as function calling, tooling use, and long-context reasoning |
| MoonshotAI Kimi K2 | moonshotai/Kimi-K2-Instruct | Text | 128K | 32B-1T (Active-Total) | Mixture-of-Experts model optimized for complex tool use, reasoning, and code synthesis |
| DeepSeek R1-0528 | deepseek-ai/DeepSeek-R1-0528 | Text | 161K | 37B-680B (Active-Total) | Optimized for precise reasoning tasks including complex coding, math, and structured document analysis |
| DeepSeek V3-0324 | deepseek-ai/DeepSeek-V3-0324 | Text | 161K | 37B-680B (Active-Total) | Robust Mixture-of-Experts model tailored for high-complexity language processing and comprehensive document analysis |
| Meta Llama 3.1 8B | meta-llama/Llama-3.1-8B-Instruct | Text | 128K | 8B (Total) | Efficient conversational model optimized for responsive multilingual chatbot interactions |
| Meta Llama 3.3 70B | meta-llama/Llama-3.3-70B-Instruct | Text | 128K | 70B (Total) | Multilingual model excelling in conversational tasks, detailed instruction-following, and coding |
| Meta Llama 4 Scout | meta-llama/Llama-4-Scout-17B-16E-Instruct | Text, Vision | 64K | 17B-109B (Active-Total) | Multi-modal model integrating text and image understanding, ideal for visual tasks and combined analysis |
| Microsoft Phi 4 Mini 3.8B | microsoft/Phi-4-mini-instruct | Text | 128K | 3.8B (Active-Total) | Compact, efficient model ideal for fast responses in resource-constrained environments |

Using model IDs

When using the API, specify the model using its ID from the table above. For example:

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[...]
)
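
If your application picks a model ID from configuration or user input, it may help to check the ID locally before sending a request. The sketch below copies the model IDs and context windows from the comparison table into a dictionary; the names `MODEL_CONTEXT_K`, `validate_model_id`, and `models_with_context` are hypothetical helpers, and the hard-coded data can go stale if the model list changes.

```python
# Context windows in thousands of tokens, copied from the comparison table.
# This is a static snapshot and an assumption; the live model list may differ.
MODEL_CONTEXT_K = {
    "openai/gpt-oss-120b": 131,
    "openai/gpt-oss-20b": 131,
    "Qwen/Qwen3-235B-A22B-Thinking-2507": 262,
    "Qwen/Qwen3-235B-A22B-Instruct-2507": 262,
    "Qwen/Qwen3-Coder-480B-A35B-Instruct": 262,
    "moonshotai/Kimi-K2-Instruct": 128,
    "deepseek-ai/DeepSeek-R1-0528": 161,
    "deepseek-ai/DeepSeek-V3-0324": 161,
    "meta-llama/Llama-3.1-8B-Instruct": 128,
    "meta-llama/Llama-3.3-70B-Instruct": 128,
    "meta-llama/Llama-4-Scout-17B-16E-Instruct": 64,
    "microsoft/Phi-4-mini-instruct": 128,
}


def validate_model_id(model_id):
    """Return model_id unchanged if it is in the table, otherwise raise ValueError."""
    if model_id not in MODEL_CONTEXT_K:
        raise ValueError(f"Unknown model ID: {model_id!r}")
    return model_id


def models_with_context(min_k):
    """Return the sorted model IDs whose context window is at least min_k thousand tokens."""
    return sorted(m for m, k in MODEL_CONTEXT_K.items() if k >= min_k)
```

For example, `models_with_context(262)` returns the three Qwen3 model IDs, which are the only entries in the table with a 262K context window.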

3 - Usage Information and Limits

Understand pricing, usage limits, and account restrictions for W&B Inference

Learn about pricing, limits, and other important usage information before using W&B Inference.

Pricing

For detailed model pricing information, visit W&B Inference pricing.

Purchase more credits

W&B Inference credits come with Free, Pro, and Academic plans for a limited time. Enterprise availability may vary. When credits run out:

Account tiers and default usage caps

Each account tier has a default spending cap to help manage costs and prevent unexpected charges. W&B requires prepayment for paid Inference access.

Some users may need to change their cap. Contact your account executive or support to adjust your limit.

| Account Tier | Default Cap | How to Change Limit |
|---|---|---|
| Pro | $6,000/month | Contact your account executive or support for manual review |
| Enterprise | $700,000/year | Contact your account executive or support for manual review |

Concurrency limits

If you exceed the rate limit, the API returns a 429 response with the message "Concurrency limit reached for requests". To fix this error, reduce the number of concurrent requests. For detailed troubleshooting, see W&B Inference support articles.
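
One way to reduce the number of concurrent requests is to send them through a fixed-size worker pool. The following is a minimal sketch, not an official W&B utility: `run_bounded` is a hypothetical helper, and the default of 4 workers is an arbitrary assumption because this page does not state the actual limit.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, max_concurrent=4):
    """Run zero-argument callables with at most max_concurrent in flight.

    Results are returned in the same order as the input tasks.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]
```

Each task would typically wrap one API call, for example `lambda p=prompt: client.chat.completions.create(model=..., messages=[{"role": "user", "content": p}])`, where `client` is the OpenAI client from the Quickstart.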

W&B applies rate limits per W&B project. For example, if you have 3 projects in a team, each project has its own rate limit quota.
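
When a 429 response does occur, retrying after a pause is a common pattern. The sketch below shows exponential backoff with jitter around any callable. It assumes the raised exception exposes a `status_code` attribute, as the status errors in the `openai` package do; `call_with_backoff` and its default values are hypothetical choices, not documented W&B settings.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); if it raises an error with status_code == 429, wait and retry.

    The wait doubles on each attempt (base_delay, 2x, 4x, ...) plus random jitter.
    Any other error, or a 429 after the final retry, is re-raised unchanged.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            is_rate_limited = getattr(exc, "status_code", None) == 429
            if not is_rate_limited or attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

The `sleep` parameter exists so that the wait can be replaced in tests. In normal use you would call `call_with_backoff(lambda: client.chat.completions.create(...))`.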

Personal entities unsupported

Personal accounts (personal entities) don’t support W&B Inference. To access W&B Inference, switch to a non-personal account by creating a Team.

Geographic restrictions

The Inference service is only available from supported geographic locations. For more information, see the Terms of Service.

4 - API Reference

Complete API reference for W&B Inference service

Learn how to use the W&B Inference API to access foundation models programmatically.

Endpoint

Access the Inference service at:

https://api.inference.wandb.ai/v1

Available methods

The Inference API supports these methods:

Chat completions

Create a chat completion using the /chat/completions endpoint. This endpoint follows the OpenAI format for sending messages and receiving responses.

To create a chat completion, provide:

  • The Inference service base URL: https://api.inference.wandb.ai/v1
  • Your W&B API key: <your-api-key>
  • Your W&B entity and project: <your-team>/<your-project>
  • A model ID from the available models

curl https://api.inference.wandb.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "OpenAI-Project: <your-team>/<your-project>" \
  -d '{
    "model": "<model-id>",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Tell me a joke." }
    ]
  }'

import openai

client = openai.OpenAI(
    # The custom base URL points to W&B Inference
    base_url='https://api.inference.wandb.ai/v1',

    # Get your API key from https://wandb.ai/authorize
    # Consider setting it in the environment as OPENAI_API_KEY instead for safety
    api_key="<your-api-key>"
)

# Replace <model-id> with any model ID from the available models list
response = client.chat.completions.create(
    model="<model-id>",
    messages=[
        {"role": "system", "content": "<your-system-prompt>"},
        {"role": "user", "content": "<your-prompt>"}
    ],
)

print(response.choices[0].message.content)
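
If you cannot install the `openai` package, the same request can be assembled with the Python standard library. The following sketch mirrors the curl command, including the `OpenAI-Project` header. `build_chat_request` is a hypothetical helper; it only builds the request object and does not send it.

```python
import json
import urllib.request

BASE_URL = "https://api.inference.wandb.ai/v1"


def build_chat_request(api_key, project, model, messages):
    """Build a POST request for /chat/completions without sending it."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
            # Same value as the curl example: <your-team>/<your-project>
            "OpenAI-Project": project,
        },
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and decode the body with `json.load`. Reading the key from an environment variable, for example `os.environ["OPENAI_API_KEY"]`, keeps it out of source code.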

List supported models

Get all available models and their IDs. Use this to select models dynamically or check what’s available.

curl https://api.inference.wandb.ai/v1/models \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "OpenAI-Project: <your-team>/<your-project>"

import openai

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key="<your-api-key>"
)

response = client.models.list()

for model in response.data:
    print(model.id)
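
To select a model dynamically, you could compare the returned IDs against an ordered preference list. This is a minimal sketch; `pick_model` is a hypothetical helper, and the fallback order is up to you.

```python
def pick_model(available_ids, preferred_ids):
    """Return the first ID in preferred_ids that also appears in available_ids.

    Returns None when none of the preferred models is available.
    """
    available = set(available_ids)
    for model_id in preferred_ids:
        if model_id in available:
            return model_id
    return None
```

With the client from the example above, `available_ids` could be built as `[model.id for model in client.models.list().data]`.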

Response format

The API returns responses in OpenAI-compatible format:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here's a joke for you..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 50,
    "total_tokens": 75
  }
}
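
When working with the raw JSON rather than the `openai` client objects, the fields you usually need are the message text, the finish reason, and the token usage. The sketch below extracts them from a response shaped like the one above. `summarize_completion` is a hypothetical helper, and `SAMPLE` is a shortened copy of the example response.

```python
import json

# Shortened copy of the example response shown above
SAMPLE = """
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Here's a joke for you..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 25, "completion_tokens": 50, "total_tokens": 75}
}
"""


def summarize_completion(raw_json):
    """Extract the first choice's text, its finish reason, and the total token count."""
    data = json.loads(raw_json)
    choice = data["choices"][0]
    usage = data.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": usage.get("total_tokens"),
    }


if __name__ == "__main__":
    print(summarize_completion(SAMPLE))
```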

5 - Usage Examples

Learn how to use W&B Inference with practical code examples

These examples show how to use W&B Inference with Weave for tracing, evaluation, and comparison.

Basic example: Trace Llama 3.1 8B with Weave

This example shows how to send a prompt to the Llama 3.1 8B model and trace the call with Weave. Tracing captures the full input and output of the LLM call, monitors performance, and lets you analyze results in the Weave UI.

In this example:

  • You define a @weave.op()-decorated function that makes a chat completion request
  • Your traces are recorded and linked to your W&B entity and project
  • The function is automatically traced, logging inputs, outputs, latency, and metadata
  • The result prints in the terminal, and the trace appears in your Traces tab at https://wandb.ai

Before running this example, complete the prerequisites.

import weave
import openai

# Set the Weave team and project for tracing
weave.init("<your-team>/<your-project>")

client = openai.OpenAI(
    base_url='https://api.inference.wandb.ai/v1',

    # Get your API key from https://wandb.ai/authorize
    api_key="<your-api-key>",
)

# Trace the model call in Weave
@weave.op()
def run_chat():
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a joke."}
        ],
    )
    return response.choices[0].message.content

# Run and log the traced call
output = run_chat()
print(output)

After running the code, view the trace in Weave by:

  1. Clicking the link printed in the terminal (for example: https://wandb.ai/<your-team>/<your-project>/r/call/01977f8f-839d-7dda-b0c2-27292ef0e04g)
  2. Or navigating to https://wandb.ai and selecting the Traces tab

Advanced example: Use Weave Evaluations and Leaderboards

Besides tracing model calls, you can also evaluate performance and publish leaderboards. This example compares two models on a question-answer dataset, and sets a custom project name in the client initialization, specifying where to send logs.

Before running this example, complete the prerequisites.

import os
import asyncio
import openai
import weave
from weave.flow import leaderboard
from weave.trace.ref_util import get_ref

# Set the Weave team and project for tracing
weave.init("<your-team>/<your-project>")

dataset = [
    {"input": "What is 2 + 2?", "target": "4"},
    {"input": "Name a primary color.", "target": "red"},
]

@weave.op
def exact_match(target: str, output: str) -> float:
    return float(target.strip().lower() == output.strip().lower())

class WBInferenceModel(weave.Model):
    model: str

    @weave.op
    def predict(self, prompt: str) -> str:
        client = openai.OpenAI(
            base_url="https://api.inference.wandb.ai/v1",
            # Get your API key from https://wandb.ai/authorize
            api_key="<your-api-key>",
            # Optional: Customizes the logs destination
            project="<your-team>/<your-project>"
        )
        resp = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

llama = WBInferenceModel(model="meta-llama/Llama-3.1-8B-Instruct")
deepseek = WBInferenceModel(model="deepseek-ai/DeepSeek-V3-0324")

def preprocess_model_input(example):
    return {"prompt": example["input"]}

evaluation = weave.Evaluation(
    name="QA",
    dataset=dataset,
    scorers=[exact_match],
    preprocess_model_input=preprocess_model_input,
)

async def run_eval():
    await evaluation.evaluate(llama)
    await evaluation.evaluate(deepseek)

asyncio.run(run_eval())

spec = leaderboard.Leaderboard(
    name="Inference Leaderboard",
    description="Compare models on a QA dataset",
    columns=[
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluation).uri(),
            scorer_name="exact_match",
            summary_metric_path="mean",
        )
    ],
)

weave.publish(spec)

After running this code, go to your W&B account at https://wandb.ai/ and:

  • View your model evaluations
  • View your leaderboard
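
The `exact_match` scorer in this example gives credit only when the model output equals the target after trimming and lowercasing. Chat models often answer in full sentences, so a more lenient scorer may be useful. The following is a sketch of one alternative; `contains_match` is a hypothetical function that keeps the same `target` and `output` parameters as `exact_match`.

```python
import re


def contains_match(target: str, output: str) -> float:
    """Return 1.0 if target appears as a whole word in output (case-insensitive), else 0.0."""
    pattern = r"\b" + re.escape(target.strip().lower()) + r"\b"
    return float(re.search(pattern, output.lower()) is not None)
```

To use it in the evaluation, you would decorate it with `@weave.op` and add it to the `scorers` list, following the same pattern as `exact_match`. The word-boundary check is a design choice: it stops a target of "4" from matching "14".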

6 - UI Guide

Access W&B Inference models through the web interface

Learn how to use the W&B Inference service through the web UI. Before using the UI, complete the prerequisites.

Access the Inference service

You can access the Inference service from three places:

Direct link

Go to https://wandb.ai/inference.

From the Inference tab

  1. Go to your W&B account at https://wandb.ai/
  2. Select Inference from the left sidebar
  3. A page displays with available models and model information

From the Playground tab

  1. Select Playground from the left sidebar. The Playground chat UI appears
  2. Hover over W&B Inference in the LLM dropdown list. A dropdown with available models appears on the right
  3. From the models dropdown, you can:

Try a model in the Playground

After selecting a model, you can test it in the Playground. Available actions include:

Compare multiple models

You can compare Inference models side by side in the Playground. Access the Compare view from two places:

From the Inference tab

  1. Select Inference from the left sidebar. The available models page appears
  2. Click anywhere on a model card (except the model name) to select it. The card border turns blue
  3. Repeat for each model you want to compare
  4. Click Compare N models in the Playground on any selected card. N shows the number of models selected
  5. The comparison view opens

Now you can compare models and use all features from Try a model in the Playground.

From the Playground tab

  1. Select Playground from the left sidebar. The Playground chat UI appears
  2. Hover over W&B Inference in the LLM dropdown list. The models dropdown appears on the right
  3. Select Compare from the dropdown. The Inference tab appears
  4. Click anywhere on a model card (except the model name) to select it. The card border turns blue
  5. Repeat for each model you want to compare
  6. Click Compare N models in the Playground on any selected card. The comparison view opens

Now you can compare models and use all features from Try a model in the Playground.

View billing and usage information

Organization admins can track credit balance, usage history, and upcoming bills from the W&B UI:

  1. Go to the W&B Billing page in the UI
  2. Find the Inference billing information card in the bottom right corner
  3. From here you can:
    • Click View usage to see your usage over time
    • View upcoming inference charges (for paid plans)

7 - Support

Find answers to common W&B Inference questions