> ## Documentation Index
> Fetch the complete documentation index at: https://docs.wandb.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Available models

> Browse the foundation models available through Serverless Inference


Serverless Inference provides access to several open source foundation models. Each model has different strengths and use cases.

## Generally available models

The following models are [generally available](/inference/lifecycle#model-lifecycle-stages):

| Model                         | Model ID (for API usage)                       | Type         | Context Window | Parameters                | Description                                                                                                                                       |
| ----------------------------- | ---------------------------------------------- | ------------ | -------------- | ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| DeepSeek V4-Flash             | `deepseek-ai/DeepSeek-V4-Flash`                | Text         | 1049k          | 13B-284B (Active-Total)   | DeepSeek V4-Flash is an MoE model with 1M context length great for coding, reasoning, and agentic workloads.                                      |
| DeepSeek V4-Pro               | `deepseek-ai/DeepSeek-V4-Pro`                  | Text         | 1049k          | 49B-1.6T (Active-Total)   | DeepSeek V4-Pro is a 1.6T-parameter MoE model with 49B active parameters excelling at advanced reasoning, coding, and complex agentic workloads.  |
| DeepSeek V3.1                 | `deepseek-ai/DeepSeek-V3.1`                    | Text         | 161k           | 37B-671B (Active-Total)   | A large hybrid model that supports both thinking and non-thinking modes via prompt templates.                                                     |
| Google Gemma 4 31B            | `google/gemma-4-31B-it`                        | Text, Vision | 262k           | 31B (Total)               | Gemma 4 31B Dense is designed for advanced reasoning, agentic workflows, and longer context and is natively trained on 140+ languages.            |
| IBM Granite 4.1 8B            | `ibm-granite/granite-4.1-8b`                   | Text         | 131k           | 8B (Total)                | Granite 4.1 8B is a long-context instruct model capable of enhanced tool calling, instruction following, and chat capabilities.                   |
| JetBrains Mellum2 12B A2.5B   | `JetBrains/Mellum2-12B-A2.5B-Instruct`         | Text         | 131k           | 2.5B-12B (Active-Total)   | Mellum2-12B-A2.5B-Instruct is a fast MoE model with 131K context built for coding, tool use, and low-latency AI workflows.                        |
| Meta Llama 3.3 70B            | `meta-llama/Llama-3.3-70B-Instruct`            | Text         | 128k           | 70B (Total)               | Multilingual model excelling in conversational tasks, detailed instruction-following, and coding.                                                 |
| Meta Llama 3.1 70B            | `meta-llama/Llama-3.1-70B-Instruct`            | Text         | 128k           | 70B (Total)               | Efficient conversational model optimized for responsive multilingual chatbot interactions.                                                        |
| Meta Llama 3.1 8B             | `meta-llama/Llama-3.1-8B-Instruct`             | Text         | 128k           | 8B (Total)                | Efficient conversational model optimized for responsive multilingual chatbot interactions.                                                        |
| Microsoft Phi 4 Mini 3.8B     | `microsoft/Phi-4-mini-instruct`                | Text         | 128k           | 3.8B (Total)              | Compact, efficient model ideal for fast responses in resource-constrained environments.                                                           |
| MiniMax M2.5                  | `MiniMaxAI/MiniMax-M2.5`                       | Text         | 197k           | 10B-230B (Active-Total)   | MoE model with a highly sparse architecture designed for high-throughput and low latency with strong coding capabilities.                         |
| Moonshot AI Kimi K2.6         | `moonshotai/Kimi-K2.6`                         | Text, Vision | 262k           | 32B-1T (Active-Total)     | Kimi K2.6 is a multimodal Mixture-of-Experts language model featuring 32 billion activated parameters and a total of 1 trillion parameters.       |
| Moonshot AI Kimi K2.5         | `moonshotai/Kimi-K2.5`                         | Text, Vision | 262k           | 32B-1T (Active-Total)     | Kimi K2.5 is a multimodal Mixture-of-Experts language model featuring 32 billion activated parameters and a total of 1 trillion parameters.       |
| NVIDIA Nemotron 3 Super 120B  | `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8` | Text         | 262k           | 12B-120B (Active-Total)   | Nemotron 3 is a LatentMoE model designed to deliver strong agentic, reasoning, and conversational capabilities.                                   |
| NVIDIA Nemotron 3 Ultra       | `nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B`     | Text         | 262k           | 55B-550B (Active-Total)   | Nemotron 3 Ultra is a powerful MoE model designed for long-running agents across coding, deep research, and enterprise automation.                |
| OpenAI GPT OSS 120B           | `openai/gpt-oss-120b`                          | Text         | 131k           | 5.1B-117B (Active-Total)  | Efficient Mixture-of-Experts model designed for high-reasoning, agentic and general-purpose use cases.                                            |
| OpenAI GPT OSS 20B            | `openai/gpt-oss-20b`                           | Text         | 131k           | 3.6B-20B (Active-Total)   | Lower latency Mixture-of-Experts model trained on OpenAI's Harmony response format with reasoning capabilities.                                   |
| OpenPipe Qwen3 14B Instruct   | `OpenPipe/Qwen3-14B-Instruct`                  | Text         | 32.8k          | 14.8B (Total)             | An efficient multilingual, dense, instruction-tuned model, optimized by OpenPipe for building agents with finetuning.                             |
| Qwen3.6 35B A3B               | `Qwen/Qwen3.6-35B-A3B`                         | Text, Vision | 262k           | 3B-35B (Active-Total)     | Qwen3.6-35B-A3B is an MoE multimodal model with 262K context optimized for agentic coding workflows.                                              |
| Qwen3.6 27B                   | `Qwen/Qwen3.6-27B`                             | Text, Vision | 262k           | 27B (Total)               | Qwen3.6-27B is a 27B dense multimodal model with 262K context built for flagship-level agentic coding.                                            |
| Qwen3.5 35B A3B               | `Qwen/Qwen3.5-35B-A3B`                         | Text, Vision | 262k           | 3B-35B (Active-Total)     | Qwen3.5-35B-A3B is an open-weights multimodal MoE model built for efficient, high-throughput inference across chat, reasoning, and agentic tasks. |
| Qwen3 235B A22B Thinking-2507 | `Qwen/Qwen3-235B-A22B-Thinking-2507`           | Text         | 262k           | 22B-235B (Active-Total)   | High-performance Mixture-of-Experts model optimized for structured reasoning, math, and long-form generation.                                     |
| Qwen3 235B A22B-2507          | `Qwen/Qwen3-235B-A22B-Instruct-2507`           | Text         | 262k           | 22B-235B (Active-Total)   | Efficient multilingual, Mixture-of-Experts, instruction-tuned model, optimized for logical reasoning.                                             |
| Qwen3 30B A3B                 | `Qwen/Qwen3-30B-A3B-Instruct-2507`             | Text         | 262k           | 3.3B-30.5B (Active-Total) | Qwen3-30B-A3B-Instruct-2507 is a 30.5B MoE instruction-tuned model with enhanced reasoning, coding, and long-context understanding.               |
| Qwen3 Coder 480B A35B         | `Qwen/Qwen3-Coder-480B-A35B-Instruct`          | Text         | 262k           | 35B-480B (Active-Total)   | Mixture-of-Experts model optimized for agentic coding tasks such as function calling, tool use, and long-context reasoning.                       |
| Z.AI GLM 5.1                  | `zai-org/GLM-5.1`                              | Text         | 203k           | 40B-744B (Active-Total)   | Powerful MoE model for long-horizon agentic engineering and advanced reasoning.                                                                   |

## Experimental models

The following models are [experimental](/inference/lifecycle#model-lifecycle-stages):

| Model       | Model ID (for API usage) | Type         | Context Window | Parameters  | Description                                                                                                         |
| ----------- | ------------------------ | ------------ | -------------- | ----------- | ------------------------------------------------------------------------------------------------------------------- |
| Qwen3.5 27B | `Qwen/Qwen3.5-27B`       | Text, Vision | 262k           | 27B (Total) | Qwen3.5-27B is a dense model from the Qwen3.5 family built for high performance across a large range of benchmarks. |

## Deprecated models

The following models are [deprecated](/inference/lifecycle#model-lifecycle-stages):

*None currently*

## Use model IDs

To specify a model when calling the API, use its `Model ID` from the preceding tables. For example:

```python theme={null}
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[...]
)
```

## Next steps

After you've chosen a model, continue with one of the following resources:

* Check [usage limits and pricing](/inference/usage-limits/) for each model.
* See the [API reference](/inference/api-reference/) for how to use these models.
* Try models in the [W\&B Playground](/inference/ui-guide/).