W&B Inference
Access open-source foundation models through W&B Weave and an OpenAI-compatible API
W&B Inference gives you access to leading open-source foundation models through W&B Weave and an OpenAI-compatible API. You can:
- Build AI applications and agents without signing up for a hosting provider or self-hosting a model
- Try supported models in the W&B Weave Playground
With Weave, you can trace, evaluate, monitor, and improve your W&B Inference-powered applications.
Quickstart
Here’s a simple example using Python:
import openai

client = openai.OpenAI(
    # The custom base URL points to W&B Inference
    base_url='https://api.inference.wandb.ai/v1',
    # Get your API key from https://wandb.ai/authorize
    api_key="<your-api-key>"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."}
    ],
)

print(response.choices[0].message.content)
Next steps
- Review the available models and usage information and limits
- Set up your account using the prerequisites
- Use the service through the API or UI
- Try the usage examples
Usage details
Important
W&B Inference credits come with Free, Pro, and Academic plans for a limited time. Availability may vary for Enterprise accounts. When credits run out:
To learn more, visit the pricing page or see model-specific costs.
1 - Prerequisites
Set up your environment to use W&B Inference
Complete these steps before using the W&B Inference service through the API or UI.
Set up your W&B account and project
You need these items to access W&B Inference:
- A W&B account. Sign up at W&B.
- A W&B API key. Get your API key at https://wandb.ai/authorize.
- A W&B project. Create a project in your W&B account to track usage.
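Rather than pasting the API key into source files, one option is to read it from the environment. The sketch below assumes the key is stored in a variable named `WANDB_API_KEY`; the helper name `load_api_key` is hypothetical and not part of any W&B library.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None,
                 var_name: str = "WANDB_API_KEY") -> str:
    """Return the API key stored in `var_name`, or raise if it is missing."""
    # Fall back to the real process environment when no mapping is passed in.
    source = os.environ if env is None else env
    key = source.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; get a key at https://wandb.ai/authorize"
        )
    return key
```

The returned value can then be passed wherever the examples use `<your-api-key>`.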
Set up your environment (Python)
To use the Inference API with Python, you also need to:
- Complete the general requirements above
- Install the required libraries:
Note
The weave library is optional but recommended. It lets you trace your LLM applications. Learn more in the Weave Quickstart.
See usage examples for code samples using W&B Inference with Weave.
Next steps
After completing the prerequisites:
2 - Available Models
Browse the foundation models available through W&B Inference
W&B Inference provides access to several open-source foundation models. Each model has different strengths and use cases.
Model comparison
Model | Model ID (for API usage) | Type | Context Window | Parameters | Description |
OpenAI GPT OSS 120B | openai/gpt-oss-120b | Text | 131K | 5.1B-117B (Active-Total) | Efficient Mixture-of-Experts model designed for high-reasoning, agentic and general-purpose use cases. |
OpenAI GPT OSS 20B | openai/gpt-oss-20b | Text | 131K | 3.6B-20B (Active-Total) | Lower latency Mixture-of-Experts model trained on OpenAI’s Harmony response format with reasoning capabilities. |
Qwen3 235B A22B Thinking-2507 | Qwen/Qwen3-235B-A22B-Thinking-2507 | Text | 262K | 22B-235B (Active-Total) | High-performance Mixture-of-Experts model optimized for structured reasoning, math, and long-form generation |
Qwen3 235B A22B-2507 | Qwen/Qwen3-235B-A22B-Instruct-2507 | Text | 262K | 22B-235B (Active-Total) | Efficient multilingual, Mixture-of-Experts, instruction-tuned model, optimized for logical reasoning |
Qwen3 Coder 480B A35B | Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | 262K | 35B-480B (Active-Total) | Mixture-of-Experts model optimized for coding tasks such as function calling, tooling use, and long-context reasoning |
MoonshotAI Kimi K2 | moonshotai/Kimi-K2-Instruct | Text | 128K | 32B-1T (Active-Total) | Mixture-of-Experts model optimized for complex tool use, reasoning, and code synthesis |
DeepSeek R1-0528 | deepseek-ai/DeepSeek-R1-0528 | Text | 161K | 37B-680B (Active-Total) | Optimized for precise reasoning tasks including complex coding, math, and structured document analysis |
DeepSeek V3-0324 | deepseek-ai/DeepSeek-V3-0324 | Text | 161K | 37B-680B (Active-Total) | Robust Mixture-of-Experts model tailored for high-complexity language processing and comprehensive document analysis |
Meta Llama 3.1 8B | meta-llama/Llama-3.1-8B-Instruct | Text | 128K | 8B (Total) | Efficient conversational model optimized for responsive multilingual chatbot interactions |
Meta Llama 3.3 70B | meta-llama/Llama-3.3-70B-Instruct | Text | 128K | 70B (Total) | Multilingual model excelling in conversational tasks, detailed instruction-following, and coding |
Meta Llama 4 Scout | meta-llama/Llama-4-Scout-17B-16E-Instruct | Text, Vision | 64K | 17B-109B (Active-Total) | Multi-modal model integrating text and image understanding, ideal for visual tasks and combined analysis |
Microsoft Phi 4 Mini 3.8B | microsoft/Phi-4-mini-instruct | Text | 128K | 3.8B (Active-Total) | Compact, efficient model ideal for fast responses in resource-constrained environments |
Using model IDs
When using the API, specify the model using its ID from the table above. For example:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[...]
)
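When a prompt is long, it can help to check the context window before settling on a model ID. The sketch below hard-codes a few rows from the model comparison table and treats "K" as 1,000 tokens, which is an approximation. The helper `models_fitting` is hypothetical and not part of the W&B API.

```python
from typing import List

# Context windows copied from the model comparison table, in tokens.
# "K" is treated as 1,000 here, which is an assumption for illustration.
CONTEXT_WINDOWS = {
    "openai/gpt-oss-120b": 131_000,
    "Qwen/Qwen3-235B-A22B-Instruct-2507": 262_000,
    "deepseek-ai/DeepSeek-V3-0324": 161_000,
    "meta-llama/Llama-3.1-8B-Instruct": 128_000,
    "meta-llama/Llama-4-Scout-17B-16E-Instruct": 64_000,
}


def models_fitting(required_tokens: int) -> List[str]:
    """Return model IDs whose context window holds `required_tokens`, sorted."""
    return sorted(
        model_id
        for model_id, window in CONTEXT_WINDOWS.items()
        if window >= required_tokens
    )


# Of the models listed here, only two have windows of 150,000 tokens or more.
print(models_fitting(150_000))
```

Any ID returned this way can be passed as the `model` argument in the call shown earlier.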
Next steps
3 - Usage Information and Limits
Understand pricing, usage limits, and account restrictions for W&B Inference
Learn about pricing, limits, and other important usage information before using W&B Inference.
Pricing
For detailed model pricing information, visit W&B Inference pricing.
Purchase more credits
W&B Inference credits come with Free, Pro, and Academic plans for a limited time. Enterprise availability may vary. When credits run out:
Account tiers and default usage caps
Each account tier has a default spending cap to help manage costs and prevent unexpected charges. W&B requires prepayment for paid Inference access.
Some users may need to change their cap. Contact your account executive or support to adjust your limit.
Account Tier | Default Cap | How to Change Limit |
Pro | $6,000/month | Contact your account executive or support for manual review |
Enterprise | $700,000/year | Contact your account executive or support for manual review |
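As a rough illustration of how the default caps could be used in your own budgeting scripts, the sketch below encodes the two rows of the table and reports how much headroom remains in the current period. How W&B actually meters spend against the cap is not described here, so `remaining_budget` is a hypothetical helper for local bookkeeping only.

```python
# Default caps copied from the table above: (amount in USD, billing period).
DEFAULT_CAPS = {
    "Pro": (6_000, "month"),
    "Enterprise": (700_000, "year"),
}


def remaining_budget(tier: str, spent_usd: float) -> float:
    """Return the USD left under the tier's default cap, never below zero."""
    cap_usd, _period = DEFAULT_CAPS[tier]
    return max(cap_usd - spent_usd, 0.0)


# A Pro account that has spent $4,500 this month has $1,500 of headroom.
print(remaining_budget("Pro", 4_500))
```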
Concurrency limits
If you exceed the rate limit, the API returns a 429 Concurrency limit reached for requests response. To fix this error, reduce the number of concurrent requests. For detailed troubleshooting, see W&B Inference support articles.
W&B applies rate limits per W&B project. For example, if you have 3 projects in a team, each project has its own rate limit quota.
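One way to stay under the limit is to cap how many requests run at once, and to retry with exponential backoff when a 429 still occurs. The sketch below shows both ideas with the standard library only. The helper names `run_bounded` and `call_with_backoff`, and the default values, are assumptions for illustration and not part of the W&B or OpenAI APIs.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def run_bounded(tasks: Sequence[Callable[[], T]], max_concurrency: int = 4) -> List[T]:
    """Run zero-argument callables with at most `max_concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = [pool.submit(task) for task in tasks]
        # Collect results in submission order.
        return [future.result() for future in futures]


def call_with_backoff(
    request: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying with exponential backoff on `retry_on` errors."""
    for attempt in range(max_attempts):
        try:
            return request()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Double the wait on each attempt and add jitter to spread retries.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

With the `openai` Python client, HTTP 429 responses surface as `openai.RateLimitError`, so a call might be wrapped as `call_with_backoff(lambda: client.chat.completions.create(...), retry_on=(openai.RateLimitError,))`.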
Personal entities unsupported
Note
W&B deprecated personal entities in May 2024, so this only applies to legacy accounts.
Personal accounts (personal entities) don’t support W&B Inference. To access W&B Inference, switch to a non-personal account by creating a Team.
Geographic restrictions
The Inference service is only available from supported geographic locations. For more information, see the Terms of Service.
Next steps
4 - API Reference
Complete API reference for W&B Inference service
Learn how to use the W&B Inference API to access foundation models programmatically.
Endpoint
Access the Inference service at:
https://api.inference.wandb.ai/v1
Important
To use this endpoint, you need:
- A W&B account with Inference credits
- A valid W&B API key
- A W&B entity (team) and project
In code samples, these appear as <your-team>/<your-project>.
Available methods
The Inference API supports these methods:
Chat completions
Create a chat completion using the /chat/completions endpoint. This endpoint follows the OpenAI format for sending messages and receiving responses.
To create a chat completion, provide:
- The Inference service base URL: https://api.inference.wandb.ai/v1
- Your W&B API key: <your-api-key>
- Your W&B entity and project: <your-team>/<your-project>
- A model ID from the available models
curl https://api.inference.wandb.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "OpenAI-Project: <your-team>/<your-project>" \
  -d '{
    "model": "<model-id>",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Tell me a joke." }
    ]
  }'
import openai

client = openai.OpenAI(
    # The custom base URL points to W&B Inference
    base_url='https://api.inference.wandb.ai/v1',
    # Get your API key from https://wandb.ai/authorize
    # Consider setting it in the environment as OPENAI_API_KEY instead for safety
    api_key="<your-api-key>"
)

# Replace <model-id> with any model ID from the available models list
response = client.chat.completions.create(
    model="<model-id>",
    messages=[
        {"role": "system", "content": "<your-system-prompt>"},
        {"role": "user", "content": "<your-prompt>"}
    ],
)

print(response.choices[0].message.content)
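If you call the endpoint without the `openai` client, the request needs the same three headers and JSON body as the curl example. The sketch below only builds those pieces and sends nothing. The helper names `build_headers` and `build_chat_body` are hypothetical.

```python
import json
from typing import Dict


def build_headers(api_key: str, team: str, project: str) -> Dict[str, str]:
    """Return the headers used in the curl example above."""
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
        # The entity and project are sent as a single "<team>/<project>" value.
        "OpenAI-Project": f"{team}/{project}",
    }


def build_chat_body(model_id: str, system_prompt: str, user_prompt: str) -> str:
    """Serialize a chat completion request body in the OpenAI message format."""
    payload = {
        "model": model_id,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
    return json.dumps(payload)


headers = build_headers("<your-api-key>", "<your-team>", "<your-project>")
body = build_chat_body("<model-id>", "You are a helpful assistant.", "Tell me a joke.")
```

Both values can be handed to any HTTP client that posts to https://api.inference.wandb.ai/v1/chat/completions.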
List supported models
Get all available models and their IDs. Use this to select models dynamically or check what’s available.
curl https://api.inference.wandb.ai/v1/models \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "OpenAI-Project: <your-team>/<your-project>"
import openai

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key="<your-api-key>"
)

response = client.models.list()

for model in response.data:
    print(model.id)
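To select a model dynamically, one approach is to keep an ordered list of preferred IDs and take the first one the service reports as available. The sketch below is a plain-Python illustration of that idea; `pick_model` is a hypothetical helper, and the preference order is only an example.

```python
from typing import Iterable, Optional, Sequence


def pick_model(available_ids: Iterable[str], preferred: Sequence[str]) -> Optional[str]:
    """Return the first preferred model ID that is available, else None."""
    available = set(available_ids)
    for model_id in preferred:
        if model_id in available:
            return model_id
    return None


# Stand-in for the IDs printed by the models list call above.
available = ["meta-llama/Llama-3.1-8B-Instruct", "deepseek-ai/DeepSeek-V3-0324"]
preferred = ["openai/gpt-oss-120b", "deepseek-ai/DeepSeek-V3-0324"]
print(pick_model(available, preferred))
```

In a real script, the first argument could be `(m.id for m in client.models.list().data)`.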
The API returns responses in OpenAI-compatible format. For example, a chat completion response looks like this:
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here's a joke for you..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 50,
    "total_tokens": 75
  }
}
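When working with the raw JSON rather than the `openai` client objects, the fields of interest can be pulled out with the standard library. The sketch below parses a copy of the sample response above. The helper `summarize` is hypothetical.

```python
import json

# A copy of the sample chat completion response shown above.
SAMPLE_RESPONSE = '''
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Here's a joke for you..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 25, "completion_tokens": 50, "total_tokens": 75}
}
'''


def summarize(raw: str) -> dict:
    """Extract the reply text, finish reason, and token counts from a response."""
    data = json.loads(raw)
    first_choice = data["choices"][0]
    usage = data["usage"]
    return {
        "text": first_choice["message"]["content"],
        "finish_reason": first_choice["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }


summary = summarize(SAMPLE_RESPONSE)
print(summary["text"], summary["total_tokens"])
```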
Next steps
5 - Usage Examples
Learn how to use W&B Inference with practical code examples
These examples show how to use W&B Inference with Weave for tracing, evaluation, and comparison.
Basic example: Trace Llama 3.1 8B with Weave
This example shows how to send a prompt to the Llama 3.1 8B model and trace the call with Weave. Tracing captures the full input and output of the LLM call, monitors performance, and lets you analyze results in the Weave UI.
In this example:
- You define a @weave.op()-decorated function that makes a chat completion request
- Your traces are recorded and linked to your W&B entity and project
- The function is automatically traced, logging inputs, outputs, latency, and metadata
- The result prints in the terminal, and the trace appears in your Traces tab at https://wandb.ai
Before running this example, complete the prerequisites.
import weave
import openai

# Set the Weave team and project for tracing
weave.init("<your-team>/<your-project>")

client = openai.OpenAI(
    base_url='https://api.inference.wandb.ai/v1',
    # Get your API key from https://wandb.ai/authorize
    api_key="<your-api-key>",
)

# Trace the model call in Weave
@weave.op()
def run_chat():
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a joke."}
        ],
    )
    return response.choices[0].message.content

# Run and log the traced call
output = run_chat()
print(output)
After running the code, view the trace in Weave by:
- Clicking the link printed in the terminal (for example: https://wandb.ai/<your-team>/<your-project>/r/call/01977f8f-839d-7dda-b0c2-27292ef0e04g)
- Or navigating to https://wandb.ai and selecting the Traces tab
Advanced example: Use Weave Evaluations and Leaderboards
Besides tracing model calls, you can also evaluate performance and publish leaderboards.
This example compares two models on a question-answer dataset. It also sets a custom project name in the client initialization to specify where logs are sent.
Before running this example, complete the prerequisites.
import os
import asyncio
import openai
import weave
from weave.flow import leaderboard
from weave.trace.ref_util import get_ref

# Set the Weave team and project for tracing
weave.init("<your-team>/<your-project>")

dataset = [
    {"input": "What is 2 + 2?", "target": "4"},
    {"input": "Name a primary color.", "target": "red"},
]

@weave.op
def exact_match(target: str, output: str) -> float:
    return float(target.strip().lower() == output.strip().lower())

class WBInferenceModel(weave.Model):
    model: str

    @weave.op
    def predict(self, prompt: str) -> str:
        client = openai.OpenAI(
            base_url="https://api.inference.wandb.ai/v1",
            # Get your API key from https://wandb.ai/authorize
            api_key="<your-api-key>",
            # Optional: Customizes the logs destination
            project="<your-team>/<your-project>"
        )
        resp = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

llama = WBInferenceModel(model="meta-llama/Llama-3.1-8B-Instruct")
deepseek = WBInferenceModel(model="deepseek-ai/DeepSeek-V3-0324")

def preprocess_model_input(example):
    return {"prompt": example["input"]}

evaluation = weave.Evaluation(
    name="QA",
    dataset=dataset,
    scorers=[exact_match],
    preprocess_model_input=preprocess_model_input,
)

async def run_eval():
    await evaluation.evaluate(llama)
    await evaluation.evaluate(deepseek)

asyncio.run(run_eval())

spec = leaderboard.Leaderboard(
    name="Inference Leaderboard",
    description="Compare models on a QA dataset",
    columns=[
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluation).uri(),
            scorer_name="exact_match",
            summary_metric_path="mean",
        )
    ],
)

weave.publish(spec)
After running this code, go to your W&B account at https://wandb.ai/ to view the traces, evaluation results, and published leaderboard.
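A note on the `exact_match` scorer used in the example: it gives credit only when the normalized output equals the target, so a model that answers in a full sentence scores 0 even if the answer is right. The sketch below reproduces the scorer as a plain function (without the `@weave.op` decorator) and contrasts it with a hypothetical, looser `contains_match`, which is not part of Weave.

```python
def exact_match(target: str, output: str) -> float:
    """Same comparison as the scorer above: equal after strip and lowercase."""
    return float(target.strip().lower() == output.strip().lower())


def contains_match(target: str, output: str) -> float:
    """Hypothetical looser scorer: 1.0 if the target appears in the output."""
    return float(target.strip().lower() in output.strip().lower())


print(exact_match("4", " 4 "))                   # 1.0
print(exact_match("4", "The answer is 4."))      # 0.0
print(contains_match("4", "The answer is 4."))   # 1.0
```

Substring matching trades one failure mode for another: it would also accept "14" for a target of "4", so which scorer fits depends on how constrained the model outputs are.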
Next steps
6 - UI Guide
Access W&B Inference models through the web interface
Learn how to use the W&B Inference service through the web UI. Before using the UI, complete the prerequisites.
Access the Inference service
You can access the Inference service from three places:
Direct link
Go to https://wandb.ai/inference.
From the Inference tab
- Go to your W&B account at https://wandb.ai/
- Select Inference from the left sidebar
- A page displays with available models and model information
From the Playground tab
- Select Playground from the left sidebar. The Playground chat UI appears
- Hover over W&B Inference in the LLM dropdown list. A dropdown with available models appears on the right
- From the models dropdown, you can:
Try a model in the Playground
After selecting a model, you can test it in the Playground. Available actions include:
Compare multiple models
You can compare Inference models side by side in the Playground. Access the Compare view from two places:
From the Inference tab
- Select Inference from the left sidebar. The available models page appears
- Click anywhere on a model card (except the model name) to select it. The card border turns blue
- Repeat for each model you want to compare
- Click Compare N models in the Playground on any selected card. N shows the number of models selected
- The comparison view opens
Now you can compare models and use all features from Try a model in the Playground.
From the Playground tab
- Select Playground from the left sidebar. The Playground chat UI appears
- Hover over W&B Inference in the LLM dropdown list. The models dropdown appears on the right
- Select Compare from the dropdown. The Inference tab appears
- Click anywhere on a model card (except the model name) to select it. The card border turns blue
- Repeat for each model you want to compare
- Click Compare N models in the Playground on any selected card. The comparison view opens
Now you can compare models and use all features from Try a model in the Playground.
Organization admins can track credit balance, usage history, and upcoming bills from the W&B UI:
- Go to the W&B Billing page in the UI
- Find the Inference billing information card in the bottom right corner
- From here you can:
- Click View usage to see your usage over time
- View upcoming inference charges (for paid plans)
Next steps
7 - Support
Find answers to common W&B Inference questions