# Evaluation benchmark catalog

> Browse the evaluation benchmarks available through LLM Evaluation Jobs


<Note>
  LLM Evaluation Jobs is in **Preview** for [W\&B Multi-tenant Cloud](/platform/hosting/hosting-options/multi_tenant_cloud). Compute is free during the preview period. [Learn more](/models/launch#pricing)
</Note>

This page lists, by category, the evaluation benchmarks available through [LLM Evaluation Jobs](/models/launch).

To run certain benchmarks, a team admin must add the required API keys as [team-scoped secrets](/platform/secrets#add-a-secret). Any team member can specify the secret when configuring an evaluation job.

* If a benchmark has `Yes` in the **OpenAI Scorer** column, the benchmark uses OpenAI models for scoring. An organization or team admin must add an OpenAI API key as a team secret. When you configure an evaluation job that uses a benchmark with this requirement, set the **Scorer API key** field to the secret.
* If a benchmark has `Yes` in the **Gated Hugging Face Dataset** column (abbreviated **Gated HF Dataset** in some tables), the benchmark requires access to a gated Hugging Face dataset. An organization or team admin must request access to the dataset in Hugging Face, create a Hugging Face user access token, and configure a team secret with the access token. When you configure an evaluation job that uses a benchmark with this requirement, set the **Hugging Face Token** field to the secret.
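
The two requirement columns can be treated as a simple lookup when planning which secrets a job needs. The following is a minimal sketch, not part of any W&B API: the names `BenchmarkRequirements`, `REQUIREMENTS`, and `required_fields` are hypothetical, and the dictionary holds only a small subset of task IDs copied from the tables on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRequirements:
    """Hypothetical record mirroring the two requirement columns."""
    openai_scorer: bool = False      # "OpenAI Scorer" column is Yes
    gated_hf_dataset: bool = False   # "Gated Hugging Face Dataset" column is Yes


# Illustrative subset of task IDs, copied from the tables on this page.
REQUIREMENTS = {
    "boolq": BenchmarkRequirements(),
    "hle": BenchmarkRequirements(gated_hf_dataset=True),
    "simpleqa": BenchmarkRequirements(openai_scorer=True),
    "mask": BenchmarkRequirements(openai_scorer=True, gated_hf_dataset=True),
}


def required_fields(task_ids):
    """Return the job configuration fields that need a team secret."""
    fields = set()
    for task_id in task_ids:
        req = REQUIREMENTS[task_id]
        if req.openai_scorer:
            fields.add("Scorer API key")
        if req.gated_hf_dataset:
            fields.add("Hugging Face Token")
    return sorted(fields)


if __name__ == "__main__":
    print(required_fields(["boolq", "hle", "simpleqa"]))
```

A helper like this only tells you which fields to fill in. The secrets themselves still have to be created by an admin as described above.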

## Knowledge

Evaluate factual knowledge across various domains like science, language, and general reasoning.

| Evaluation                                                                                    | Task ID             | <div className="!w-[100px]">OpenAI Scorer</div> | Gated Hugging Face Dataset | Description                                                 |
| --------------------------------------------------------------------------------------------- | ------------------- | ----------------------------------------------- | -------------------------- | ----------------------------------------------------------- |
| [BoolQ](https://github.com/google-research-datasets/boolean-questions)                        | `boolq`             |                                                 |                            | Boolean yes/no questions from natural language queries      |
| [GPQA Diamond](https://arxiv.org/abs/2311.12022)                                              | `gpqa_diamond`      |                                                 |                            | Graduate-level science questions (highest quality subset)   |
| [HLE](https://arxiv.org/abs/2501.14249)                                                       | `hle`               |                                                 | Yes                        | Humanity's Last Exam: expert-level academic questions       |
| [Lingoly](https://arxiv.org/abs/2406.06196)                                                   | `lingoly`           |                                                 | Yes                        | Linguistics olympiad problems                               |
| [Lingoly Too](https://arxiv.org/abs/2503.02972)                                               | `lingoly_too`       |                                                 | Yes                        | Extended linguistics challenge problems                     |
| [MMIU](https://arxiv.org/abs/2408.02718)                                                      | `mmiu`              |                                                 |                            | Multimodal Multi-image Understanding benchmark              |
| [MMLU (0-shot)](https://github.com/hendrycks/test)                                            | `mmlu_0_shot`       |                                                 |                            | Massive Multitask Language Understanding without examples   |
| [MMLU (5-shot)](https://github.com/hendrycks/test)                                            | `mmlu_5_shot`       |                                                 |                            | Massive Multitask Language Understanding with 5 examples    |
| [MMLU-Pro](https://arxiv.org/abs/2406.01574)                                                  | `mmlu_pro`          |                                                 |                            | More challenging version of MMLU                            |
| [ONET M6](https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/onet) | `onet_m6`           |                                                 |                            | Thai O-NET national exam questions for grade 12 (M6)        |
| [PAWS](https://github.com/google-research-datasets/paws)                                      | `paws`              |                                                 |                            | Paraphrase Adversaries from Word Scrambling                 |
| [SevenLLM MCQ (English)](https://arxiv.org/abs/2405.03446)                                    | `sevenllm_mcq_en`   |                                                 |                            | Cybersecurity multiple choice questions in English          |
| [SevenLLM MCQ (Chinese)](https://arxiv.org/abs/2405.03446)                                    | `sevenllm_mcq_zh`   |                                                 |                            | Cybersecurity multiple choice questions in Chinese          |
| [SevenLLM QA (English)](https://arxiv.org/abs/2405.03446)                                     | `sevenllm_qa_en`    |                                                 |                            | Cybersecurity question answering in English                 |
| [SevenLLM QA (Chinese)](https://arxiv.org/abs/2405.03446)                                     | `sevenllm_qa_zh`    |                                                 |                            | Cybersecurity question answering in Chinese                 |
| [SimpleQA](https://openai.com/index/introducing-simpleqa/)                                    | `simpleqa`          | Yes                                             |                            | Straightforward factual question answering                  |
| [SimpleQA Verified](https://openai.com/index/introducing-simpleqa/)                           | `simpleqa_verified` |                                                 |                            | Verified subset of SimpleQA with validated answers          |
| [WorldSense](https://github.com/facebookresearch/worldsense)                                  | `worldsense`        |                                                 |                            | Evaluates understanding of world knowledge and common sense |

## Reasoning

Evaluate logical thinking, problem-solving, and common-sense reasoning capabilities.

| Evaluation                                                          | Task ID          | OpenAI Scorer | Gated HF Dataset | Description                                                                 |
| ------------------------------------------------------------------- | ---------------- | ------------- | ---------------- | --------------------------------------------------------------------------- |
| [AGIE AQUA-RAT](https://arxiv.org/abs/1705.04146)                   | `agie_aqua_rat`  |               |                  | Algebraic question answering with rationales                                |
| [AGIE LogiQA (English)](https://arxiv.org/abs/2007.08124)           | `agie_logiqa_en` |               |                  | Logical reasoning questions in English                                      |
| [AGIE LSAT Analytical Reasoning](https://www.lsac.org/)             | `agie_lsat_ar`   |               |                  | LSAT analytical reasoning (logic games) problems                            |
| [AGIE LSAT Logical Reasoning](https://www.lsac.org/)                | `agie_lsat_lr`   |               |                  | LSAT logical reasoning questions                                            |
| [ARC Challenge](https://huggingface.co/datasets/allenai/ai2_arc)    | `arc_challenge`  |               |                  | Challenging science questions requiring reasoning (AI2 Reasoning Challenge) |
| [ARC Easy](https://huggingface.co/datasets/allenai/ai2_arc)         | `arc_easy`       |               |                  | Easier set of science questions from the ARC dataset                        |
| [BBH](https://github.com/suzgunmirac/BIG-Bench-Hard)                | `bbh`            |               |                  | BIG-Bench Hard: challenging tasks from BIG-Bench                            |
| [CoCoNot](https://arxiv.org/abs/2310.03697)                         | `coconot`        |               |                  | Contextual noncompliance: tests when models should decline requests         |
| [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa) | `commonsense_qa` |               |                  | Commonsense reasoning questions                                             |
| [HellaSwag](https://arxiv.org/abs/1905.07830)                       | `hellaswag`      |               |                  | Commonsense natural language inference                                      |
| [MUSR](https://arxiv.org/abs/2310.16049)                            | `musr`           |               |                  | Multi-step reasoning benchmark                                              |
| [PIQA](https://yonatanbisk.com/piqa/)                               | `piqa`           |               |                  | Physical commonsense reasoning                                              |
| [WinoGrande](https://winogrande.allenai.org/)                       | `winogrande`     |               |                  | Commonsense reasoning via pronoun resolution                                |

## Math

Evaluate mathematical problem-solving at various difficulty levels, from grade school to competition-level problems.

| Evaluation                                                                              | Task ID                    | OpenAI Scorer | Gated HF Dataset | Description                                                      |
| --------------------------------------------------------------------------------------- | -------------------------- | ------------- | ---------------- | ---------------------------------------------------------------- |
| [AGIE Math](https://arxiv.org/abs/2410.12211)                                           | `agie_math`                |               |                  | Advanced mathematical reasoning from AGIE benchmark suite        |
| [AGIE SAT Math](https://satsuite.collegeboard.org/sat)                                  | `agie_sat_math`            |               |                  | SAT mathematics questions                                        |
| [AIME 2024](https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions) | `aime2024`                 |               |                  | American Invitational Mathematics Examination problems from 2024 |
| [AIME 2025](https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions) | `aime2025`                 |               |                  | American Invitational Mathematics Examination problems from 2025 |
| [GSM8K](https://github.com/openai/grade-school-math)                                    | `gsm8k`                    |               |                  | Grade School Math 8K: multi-step math word problems              |
| [InfiniteBench Math Calc](https://arxiv.org/abs/2402.13718)                             | `infinite_bench_math_calc` |               |                  | Mathematical calculations in long contexts                       |
| [InfiniteBench Math Find](https://arxiv.org/abs/2402.13718)                             | `infinite_bench_math_find` |               |                  | Finding mathematical patterns in long contexts                   |
| [MATH](https://github.com/hendrycks/math)                                               | `math`                     |               |                  | Competition-level mathematics problems                           |
| [MGSM](https://github.com/google-research/url-nlp/tree/main/mgsm)                       | `mgsm`                     |               |                  | Multilingual Grade School Math                                   |

## Code

Evaluate programming and software development capabilities like debugging, code execution prediction, and function calling.

| Evaluation                                                                                 | Task ID                     | OpenAI Scorer | Gated HF Dataset | Description                                                                             |
| ------------------------------------------------------------------------------------------ | --------------------------- | ------------- | ---------------- | --------------------------------------------------------------------------------------- |
| [BFCL](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html) | `bfcl`                      |               |                  | Berkeley Function Calling Leaderboard: tests function calling and tool use capabilities |
| [InfiniteBench Code Debug](https://arxiv.org/abs/2402.13718)                               | `infinite_bench_code_debug` |               |                  | Long-context code debugging tasks                                                       |
| [InfiniteBench Code Run](https://arxiv.org/abs/2402.13718)                                 | `infinite_bench_code_run`   |               |                  | Long-context code execution prediction                                                  |

## Reading

Evaluate reading comprehension and information extraction from complex texts.

| Evaluation                                                                                                  | Task ID                       | OpenAI Scorer | Gated HF Dataset | Description                                                                              |
| ----------------------------------------------------------------------------------------------------------- | ----------------------------- | ------------- | ---------------- | ---------------------------------------------------------------------------------------- |
| [AGIE LSAT Reading Comprehension](https://www.lsac.org/)                                                    | `agie_lsat_rc`                |               |                  | LSAT reading comprehension passages and questions                                        |
| [AGIE SAT English](https://satsuite.collegeboard.org/sat)                                                   | `agie_sat_en`                 |               |                  | SAT reading and writing questions with passages                                          |
| [AGIE SAT English (No Passage)](https://satsuite.collegeboard.org/sat)                                      | `agie_sat_en_without_passage` |               |                  | SAT English questions without accompanying passages                                      |
| [DROP](https://github.com/allenai/allennlp-reading-comprehension/blob/master/allennlp_rc/eval/drop_eval.py) | `drop`                        |               |                  | Discrete Reasoning Over Paragraphs: reading comprehension requiring numerical reasoning  |
| [RACE-H](https://www.cs.cmu.edu/~glai1/data/race/)                                                          | `race_h`                      |               |                  | Reading comprehension from English exams (high difficulty)                               |
| [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)                                                        | `squad`                       |               |                  | Stanford Question Answering Dataset: extractive question answering on Wikipedia articles |

## Long context

Evaluate the ability to process and reason over extended contexts, including retrieval and pattern recognition.

| Evaluation                                                                  | Task ID                              | OpenAI Scorer | Gated HF Dataset | Description                                       |
| --------------------------------------------------------------------------- | ------------------------------------ | ------------- | ---------------- | ------------------------------------------------- |
| [InfiniteBench KV Retrieval](https://arxiv.org/abs/2402.13718)              | `infinite_bench_kv_retrieval`        |               |                  | Key-value retrieval in long contexts              |
| [InfiniteBench LongBook (English)](https://arxiv.org/abs/2402.13718)        | `infinite_bench_longbook_choice_eng` |               |                  | Multiple choice questions on long books           |
| [InfiniteBench LongDialogue QA (English)](https://arxiv.org/abs/2402.13718) | `infinite_bench_longdialogue_qa_eng` |               |                  | Question answering over long dialogues            |
| [InfiniteBench Number String](https://arxiv.org/abs/2402.13718)             | `infinite_bench_number_string`       |               |                  | Number pattern recognition in long sequences      |
| [InfiniteBench Passkey](https://arxiv.org/abs/2402.13718)                   | `infinite_bench_passkey`             |               |                  | Retrieval of information from long context        |
| [NIAH](https://arxiv.org/abs/2406.07230)                                    | `niah`                               |               |                  | Needle in a Haystack: long-context retrieval test |

## Safety

Evaluate alignment, bias detection, harmful content resistance, and truthfulness.

| Evaluation                                                                                                   | Task ID                         | OpenAI Scorer | Gated HF Dataset | Description                                                           |
| ------------------------------------------------------------------------------------------------------------ | ------------------------------- | ------------- | ---------------- | --------------------------------------------------------------------- |
| [AgentHarm](https://arxiv.org/abs/2410.09024)                                                                | `agentharm`                     | Yes           |                  | Tests model resistance to harmful agent behavior and misuse scenarios |
| [AgentHarm Benign](https://arxiv.org/abs/2410.09024)                                                         | `agentharm_benign`              | Yes           |                  | Benign baseline for AgentHarm to measure false positive rates         |
| [Agentic Misalignment](https://arxiv.org/abs/2510.05179)                                                     | `agentic_misalignment`          |               |                  | Evaluates potential misalignment in agentic behavior                  |
| [AHB](https://arxiv.org/abs/2503.04804)                                                                      | `ahb`                           |               |                  | Animal Harm Benchmark: evaluates risk of animal harm in model output  |
| [AIRBench](https://arxiv.org/abs/2410.02407)                                                                 | `air_bench`                     |               |                  | AI risk benchmark based on regulations and policies                   |
| [BBEH](https://arxiv.org/abs/2502.19187)                                                                     | `bbeh`                          |               |                  | BIG-Bench Extra Hard: challenging reasoning tasks                     |
| [BBEH Mini](https://arxiv.org/abs/2502.19187)                                                                | `bbeh_mini`                     |               |                  | Smaller version of BBEH benchmark                                     |
| [BBQ](https://arxiv.org/abs/2110.08193)                                                                      | `bbq`                           |               |                  | Bias Benchmark for Question Answering                                 |
| [BOLD](https://arxiv.org/abs/2101.11718)                                                                     | `bold`                          |               |                  | Bias in Open-Ended Language Generation Dataset                        |
| [CYSE3 Visual Prompt Injection](https://arxiv.org/abs/2408.01605)                                            | `cyse3_visual_prompt_injection` |               |                  | Tests resistance to visual prompt injection attacks                   |
| [Make Me Pay](https://arxiv.org/abs/2410.08691)                                                              | `make_me_pay`                   |               |                  | Tests resistance to financial scam and fraud scenarios                |
| [MASK](https://arxiv.org/abs/2503.03750)                                                                     | `mask`                          | Yes           | Yes              | Tests honesty: whether models contradict their beliefs when pressured |
| [Personality BFI](https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/personality) | `personality_BFI`               |               |                  | Big Five personality trait assessment                                 |
| [Personality TRAIT](https://arxiv.org/abs/2406.14703)                                                        | `personality_TRAIT`             |               | Yes              | Comprehensive personality trait evaluation                            |
| SOSBench                                                                                                     | `sosbench`                      | Yes           |                  | Safety alignment on hazardous scientific knowledge                    |
| [StereoSet](https://github.com/moinnadeem/StereoSet)                                                         | `stereoset`                     |               |                  | Measures stereotypical biases in language models                      |
| [StrongREJECT](https://arxiv.org/abs/2402.10260)                                                             | `strong_reject`                 |               |                  | Tests model's ability to reject harmful requests                      |
| [Sycophancy](https://arxiv.org/abs/2310.13548)                                                               | `sycophancy`                    |               |                  | Evaluates tendency toward sycophantic behavior                        |
| [TruthfulQA](https://github.com/sylinrl/TruthfulQA)                                                          | `truthfulqa`                    |               |                  | Tests model truthfulness and resistance to falsehoods                 |
| [UCCB](https://huggingface.co/datasets/CraneAILabs/UCCB)                                                     | `uccb`                          |               |                  | Uganda Cultural and Cognitive Benchmark                               |
| [WMDP Bio](https://www.wmdp.ai/)                                                                             | `wmdp_bio`                      |               |                  | Tests hazardous knowledge in biology                                  |
| [WMDP Chem](https://www.wmdp.ai/)                                                                            | `wmdp_chem`                     |               |                  | Tests hazardous knowledge in chemistry                                |
| [WMDP Cyber](https://www.wmdp.ai/)                                                                           | `wmdp_cyber`                    |               |                  | Tests hazardous knowledge in cybersecurity                            |
| [XSTest](https://arxiv.org/abs/2308.01263)                                                                   | `xstest`                        | Yes           |                  | Exaggerated safety test for over-refusal detection                    |

## Domain-specific

Evaluate specialized knowledge in medicine, chemistry, law, biology, and other professional fields.

| Evaluation                                                     | Task ID                       | OpenAI Scorer | Gated HF Dataset | Description                                           |
| -------------------------------------------------------------- | ----------------------------- | ------------- | ---------------- | ----------------------------------------------------- |
| [ChemBench](https://arxiv.org/abs/2404.01475)                  | `chembench`                   |               |                  | Chemistry knowledge and problem-solving benchmark     |
| [HealthBench](https://arxiv.org/abs/2406.09746)                | `healthbench`                 | Yes           |                  | Healthcare and medical knowledge evaluation           |
| [HealthBench Consensus](https://arxiv.org/abs/2406.09746)      | `healthbench_consensus`       | Yes           |                  | Healthcare questions with expert consensus            |
| [HealthBench Hard](https://arxiv.org/abs/2406.09746)           | `healthbench_hard`            | Yes           |                  | Challenging healthcare scenarios                      |
| [LabBench Cloning Scenarios](https://arxiv.org/abs/2407.10362) | `lab_bench_cloning_scenarios` |               |                  | Laboratory experiment planning and cloning            |
| [LabBench DBQA](https://arxiv.org/abs/2407.10362)              | `lab_bench_dbqa`              |               |                  | Database question answering for lab scenarios         |
| [LabBench FigQA](https://arxiv.org/abs/2407.10362)             | `lab_bench_figqa`             |               |                  | Figure interpretation in scientific contexts          |
| [LabBench LitQA](https://arxiv.org/abs/2407.10362)             | `lab_bench_litqa`             |               |                  | Literature-based question answering for research      |
| [LabBench ProtocolQA](https://arxiv.org/abs/2407.10362)        | `lab_bench_protocolqa`        |               |                  | Laboratory protocol understanding                     |
| [LabBench SeqQA](https://arxiv.org/abs/2407.10362)             | `lab_bench_seqqa`             |               |                  | Biological sequence analysis questions                |
| [LabBench SuppQA](https://arxiv.org/abs/2407.10362)            | `lab_bench_suppqa`            |               |                  | Supplementary material interpretation                 |
| [LabBench TableQA](https://arxiv.org/abs/2407.10362)           | `lab_bench_tableqa`           |               |                  | Table interpretation in scientific papers             |
| [MedQA](https://github.com/jind11/MedQA)                       | `medqa`                       |               |                  | Medical licensing exam questions                      |
| [PubMedQA](https://pubmedqa.github.io/)                        | `pubmedqa`                    |               |                  | Biomedical question answering from research abstracts |
| [SEC-QA v1](https://arxiv.org/abs/2406.14806)                  | `sec_qa_v1`                   |               |                  | Computer security question answering                  |
| [SEC-QA v1 (5-shot)](https://arxiv.org/abs/2406.14806)         | `sec_qa_v1_5_shot`            |               |                  | SEC-QA v1 with 5 examples                             |
| [SEC-QA v2](https://arxiv.org/abs/2406.14806)                  | `sec_qa_v2`                   |               |                  | Updated computer security benchmark                   |
| [SEC-QA v2 (5-shot)](https://arxiv.org/abs/2406.14806)         | `sec_qa_v2_5_shot`            |               |                  | SEC-QA v2 with 5 examples                             |

## Multimodal

Evaluate vision and language understanding combining visual and textual inputs.

| Evaluation                                                              | Task ID                                      | OpenAI Scorer | Gated HF Dataset | Description                                                           |
| ----------------------------------------------------------------------- | -------------------------------------------- | ------------- | ---------------- | --------------------------------------------------------------------- |
| [DocVQA](https://www.docvqa.org/)                                       | `docvqa`                                     |               |                  | Document Visual Question Answering: questions about document images   |
| [MathVista](https://mathvista.github.io/)                               | `mathvista`                                  |               |                  | Mathematical reasoning with visual contexts combining vision and math |
| [MMMU Multiple Choice](https://mmmu-benchmark.github.io/)               | `mmmu_multiple_choice`                       |               |                  | Multimodal understanding with multiple choice format                  |
| [MMMU Open](https://mmmu-benchmark.github.io/)                          | `mmmu_open`                                  |               |                  | Multimodal understanding with open-ended responses                    |
| [V\*Star Bench Attribute Recognition](https://arxiv.org/abs/2411.10006) | `vstar_bench_attribute_recognition`          |               |                  | Visual attribute recognition tasks                                    |
| [V\*Star Bench Spatial Relationship](https://arxiv.org/abs/2411.10006)  | `vstar_bench_spatial_relationship_reasoning` |               |                  | Spatial reasoning with visual inputs                                  |

## Instruction following

Evaluate adherence to specific instructions and formatting requirements.

| Evaluation                                 | Task ID  | OpenAI Scorer | Gated HF Dataset | Description                                      |
| ------------------------------------------ | -------- | ------------- | ---------------- | ------------------------------------------------ |
| [IFEval](https://arxiv.org/abs/2311.07911) | `ifeval` |               |                  | Tests precise instruction-following capabilities |

## System

Basic system validation and pre-flight checks.

| Evaluation                                                                                 | Task ID      | OpenAI Scorer | Gated HF Dataset | Description                            |
| ------------------------------------------------------------------------------------------ | ------------ | ------------- | ---------------- | -------------------------------------- |
| [Pre-Flight](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/pre_flight/) | `pre_flight` |               |                  | Basic system check and validation test |

## Next steps

* [Evaluate a model checkpoint](/models/launch/evaluate-model-checkpoint)
* [Evaluate a hosted API model](/models/launch/evaluate-hosted-model)
* View details about specific benchmarks at [AISI Inspect Evals](https://inspect.aisi.org.uk/evals/)
