LLM Evaluation Jobs is in Preview for W&B Multi-tenant Cloud. Compute is free during the preview period.
- If a benchmark shows Yes in the OpenAI Model Scorer column, the benchmark uses OpenAI models for scoring. An organization or team admin must add an OpenAI API key as a team secret. When you configure an evaluation job with such a benchmark, set the Scorer API key field to that secret.
- If a benchmark has a link in the Gated Hugging Face Dataset column, the benchmark requires access to a gated Hugging Face dataset. An organization or team admin must request access to the dataset on Hugging Face, create a Hugging Face user access token, and store the token as a team secret. When you configure such a benchmark, set the Hugging Face Token field to that secret. A sketch for sanity-checking both credentials before storing them appears after this list.
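Before an admin saves either value as a team secret, it can be worth confirming that the credentials actually work. The following is a minimal illustrative sketch, not part of the W&B evaluation-job workflow: it assumes the `openai` and `huggingface_hub` Python packages are installed, that the key and token are available as the environment variables `OPENAI_API_KEY` and `HF_TOKEN`, and that the gated dataset ID is a placeholder you would replace with whichever gated dataset your benchmark requires.

```python
# Sanity-check credentials before an admin saves them as W&B team secrets.
# Illustrative only: the dataset ID below is a placeholder, not a specific
# benchmark requirement.
import os

from openai import OpenAI
from huggingface_hub import HfApi


def check_openai_key(api_key: str) -> None:
    """Confirm the OpenAI key is valid by listing available models."""
    client = OpenAI(api_key=api_key)
    models = client.models.list()
    print(f"OpenAI key OK; {len(models.data)} models visible")


def check_hf_gated_access(token: str, dataset_id: str) -> None:
    """Confirm the Hugging Face token can read a gated dataset.

    Raises an error if access has not been granted on the Hub.
    """
    api = HfApi(token=token)
    info = api.dataset_info(dataset_id)
    print(f"Hugging Face token OK; can read {info.id}")


if __name__ == "__main__":
    check_openai_key(os.environ["OPENAI_API_KEY"])
    # Replace with the gated dataset required by your benchmark.
    check_hf_gated_access(os.environ["HF_TOKEN"], "some-org/some-gated-dataset")
```

If the dataset call fails with an authorization error, the account that issued the token has not yet been granted access to that dataset on Hugging Face.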
Knowledge
Evaluate factual knowledge across various domains like science, language, and general reasoning.

| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| BoolQ | boolq | | | Boolean yes/no questions from natural language queries |
| GPQA Diamond | gpqa_diamond | | | Graduate-level science questions (highest quality subset) |
| HLE | hle | Yes | | Humanity's Last Exam: expert-written questions across a broad range of subjects |
| Lingoly | lingoly | Yes | | Linguistics olympiad problems |
| Lingoly Too | lingoly_too | Yes | | Extended linguistics challenge problems |
| MMIU | mmiu | | | Multimodal multi-image understanding benchmark |
| MMLU (0-shot) | mmlu_0_shot | | | Massive Multitask Language Understanding without examples |
| MMLU (5-shot) | mmlu_5_shot | | | Massive Multitask Language Understanding with 5 examples |
| MMLU-Pro | mmlu_pro | | | More challenging version of MMLU |
| ONET M6 | onet_m6 | | | Thai Ordinary National Educational Test (O-NET) questions for grade 12 (M6) |
| PAWS | paws | | | Paraphrase Adversaries from Word Scrambling: paraphrase identification |
| SevenLLM MCQ (English) | sevenllm_mcq_en | | | Cybersecurity incident analysis: multiple choice questions in English |
| SevenLLM MCQ (Chinese) | sevenllm_mcq_zh | | | Cybersecurity incident analysis: multiple choice questions in Chinese |
| SevenLLM QA (English) | sevenllm_qa_en | | | Cybersecurity incident analysis: question answering in English |
| SevenLLM QA (Chinese) | sevenllm_qa_zh | | | Cybersecurity incident analysis: question answering in Chinese |
| SimpleQA | simpleqa | Yes | | Straightforward factual question answering |
| SimpleQA Verified | simpleqa_verified | | | Verified subset of SimpleQA with validated answers |
| WorldSense | worldsense | | | Evaluates understanding of world knowledge and common sense |
Reasoning
Evaluate logical thinking, problem-solving, and common-sense reasoning capabilities.

| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| AGIE AQUA-RAT | agie_aqua_rat | | | Algebraic question answering with rationales |
| AGIE LogiQA (English) | agie_logiqa_en | | | Logical reasoning questions in English |
| AGIE LSAT Analytical Reasoning | agie_lsat_ar | | | LSAT analytical reasoning (logic games) problems |
| AGIE LSAT Logical Reasoning | agie_lsat_lr | | | LSAT logical reasoning questions |
| ARC Challenge | arc_challenge | | | Challenging science questions requiring reasoning (AI2 Reasoning Challenge) |
| ARC Easy | arc_easy | | | Easier set of science questions from the ARC dataset |
| BBH | bbh | | | BIG-Bench Hard: challenging tasks from BIG-Bench |
| CoCoNot | coconot | | | Evaluates contextual noncompliance: identifying requests that should not be fulfilled |
| CommonsenseQA | commonsense_qa | | | Commonsense reasoning questions |
| HellaSwag | hellaswag | | | Commonsense natural language inference |
| MUSR | musr | | | Multistep soft reasoning over natural-language narratives |
| PIQA | piqa | | | Physical commonsense reasoning |
| WinoGrande | winogrande | | | Commonsense reasoning via pronoun resolution |
Math
Evaluate mathematical problem-solving at various difficulty levels, from grade school to competition-level problems.

| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| AGIE Math | agie_math | | | Advanced mathematical reasoning from the AGIE benchmark suite |
| AGIE SAT Math | agie_sat_math | | | SAT mathematics questions |
| AIME 2024 | aime2024 | | | American Invitational Mathematics Examination problems from 2024 |
| AIME 2025 | aime2025 | | | American Invitational Mathematics Examination problems from 2025 |
| GSM8K | gsm8k | | | Grade School Math 8K: multi-step math word problems |
| InfiniteBench Math Calc | infinite_bench_math_calc | | | Mathematical calculations in long contexts |
| InfiniteBench Math Find | infinite_bench_math_find | | | Finding mathematical patterns in long contexts |
| MATH | math | | | Competition-level mathematics problems |
| MGSM | mgsm | | | Multilingual Grade School Math |
Code
Evaluate programming and software development capabilities like debugging, code execution prediction, and function calling.

| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| BFCL | bfcl | | | Berkeley Function Calling Leaderboard: tests function calling and tool use capabilities |
| InfiniteBench Code Debug | infinite_bench_code_debug | | | Long-context code debugging tasks |
| InfiniteBench Code Run | infinite_bench_code_run | | | Long-context code execution prediction |
Reading
Evaluate reading comprehension and information extraction from complex texts.

| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| AGIE LSAT Reading Comprehension | agie_lsat_rc | | | LSAT reading comprehension passages and questions |
| AGIE SAT English | agie_sat_en | | | SAT reading and writing questions with passages |
| AGIE SAT English (No Passage) | agie_sat_en_without_passage | | | SAT English questions without accompanying passages |
| DROP | drop | | | Discrete Reasoning Over Paragraphs: reading comprehension requiring numerical reasoning |
| RACE-H | race_h | | | Reading comprehension from English exams (high difficulty) |
| SQuAD | squad | | | Stanford Question Answering Dataset: extractive question answering on Wikipedia articles |
Long context
Evaluate the ability to process and reason over extended contexts, including retrieval and pattern recognition.

| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| InfiniteBench KV Retrieval | infinite_bench_kv_retrieval | | | Key-value retrieval in long contexts |
| InfiniteBench LongBook (English) | infinite_bench_longbook_choice_eng | | | Multiple choice questions on long books |
| InfiniteBench LongDialogue QA (English) | infinite_bench_longdialogue_qa_eng | | | Question answering over long dialogues |
| InfiniteBench Number String | infinite_bench_number_string | | | Number pattern recognition in long sequences |
| InfiniteBench Passkey | infinite_bench_passkey | | | Retrieval of information from long contexts |
| NIAH | niah | | | Needle in a Haystack: long-context retrieval test |
Safety
Evaluate alignment, bias detection, harmful content resistance, and truthfulness.

| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| AgentHarm | agentharm | Yes | | Tests model resistance to harmful agent behavior and misuse scenarios |
| AgentHarm Benign | agentharm_benign | Yes | | Benign baseline for AgentHarm to measure false positive rates |
| Agentic Misalignment | agentic_misalignment | | | Evaluates potential misalignment in agentic behavior |
| AHB | ahb | | | Agent Harmful Behavior: tests resistance to harmful agentic actions |
| AIRBench | air_bench | | | AI risk benchmark with safety categories derived from regulations and corporate policies |
| BBEH | bbeh | | | BIG-Bench Extra Hard: more challenging successor to BIG-Bench Hard |
| BBEH Mini | bbeh_mini | | | Smaller subset of the BBEH benchmark |
| BBQ | bbq | | | Bias Benchmark for Question Answering |
| BOLD | bold | | | Bias in Open-Ended Language Generation Dataset |
| CYSE3 Visual Prompt Injection | cyse3_visual_prompt_injection | | | Tests resistance to visual prompt injection attacks |
| Make Me Pay | make_me_pay | | | Tests resistance to financial scam and fraud scenarios |
| MASK | mask | Yes | Yes | Tests model honesty: whether stated beliefs remain consistent with the model's knowledge under pressure |
| Personality BFI | personality_BFI | | | Big Five personality trait assessment |
| Personality TRAIT | personality_TRAIT | Yes | | Comprehensive personality trait evaluation |
| SOSBench | sosbench | Yes | | Safety and oversight stress test |
| StereoSet | stereoset | | | Measures stereotypical biases in language models |
| StrongREJECT | strong_reject | | | Tests model's ability to reject harmful requests |
| Sycophancy | sycophancy | | | Evaluates tendency toward sycophantic behavior |
| TruthfulQA | truthfulqa | | | Tests model truthfulness and resistance to falsehoods |
| UCCB | uccb | | | Unsafe Content Classification Benchmark |
| WMDP Bio | wmdp_bio | | | Tests hazardous knowledge in biology |
| WMDP Chem | wmdp_chem | | | Tests hazardous knowledge in chemistry |
| WMDP Cyber | wmdp_cyber | | | Tests hazardous knowledge in cybersecurity |
| XSTest | xstest | Yes | | Exaggerated safety test for over-refusal detection |
Domain-Specific
Evaluate specialized knowledge in medicine, chemistry, law, biology, and other professional fields.

| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| ChemBench | chembench | | | Chemistry knowledge and problem-solving benchmark |
| HealthBench | healthbench | Yes | | Healthcare and medical knowledge evaluation |
| HealthBench Consensus | healthbench_consensus | Yes | | Healthcare questions with expert consensus |
| HealthBench Hard | healthbench_hard | Yes | | Challenging healthcare scenarios |
| LabBench Cloning Scenarios | lab_bench_cloning_scenarios | | | Laboratory experiment planning and cloning |
| LabBench DBQA | lab_bench_dbqa | | | Database question answering for lab scenarios |
| LabBench FigQA | lab_bench_figqa | | | Figure interpretation in scientific contexts |
| LabBench LitQA | lab_bench_litqa | | | Literature-based question answering for research |
| LabBench ProtocolQA | lab_bench_protocolqa | | | Laboratory protocol understanding |
| LabBench SeqQA | lab_bench_seqqa | | | Biological sequence analysis questions |
| LabBench SuppQA | lab_bench_suppqa | | | Supplementary material interpretation |
| LabBench TableQA | lab_bench_tableqa | | | Table interpretation in scientific papers |
| MedQA | medqa | | | Medical licensing exam questions |
| PubMedQA | pubmedqa | | | Biomedical question answering from research abstracts |
| SEC-QA v1 | sec_qa_v1 | | | SEC filing question answering |
| SEC-QA v1 (5-shot) | sec_qa_v1_5_shot | | | SEC-QA with 5 examples |
| SEC-QA v2 | sec_qa_v2 | | | Updated SEC filing benchmark |
| SEC-QA v2 (5-shot) | sec_qa_v2_5_shot | | | SEC-QA v2 with 5 examples |
Multimodal
Evaluate vision and language understanding combining visual and textual inputs.

| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| DocVQA | docvqa | | | Document Visual Question Answering: questions about document images |
| MathVista | mathvista | | | Mathematical reasoning with visual contexts combining vision and math |
| MMMU Multiple Choice | mmmu_multiple_choice | | | Multimodal understanding with multiple choice format |
| MMMU Open | mmmu_open | | | Multimodal understanding with open-ended responses |
| V*Star Bench Attribute Recognition | vstar_bench_attribute_recognition | | | Visual attribute recognition tasks |
| V*Star Bench Spatial Relationship | vstar_bench_spatial_relationship_reasoning | | | Spatial reasoning with visual inputs |
Instruction Following
Evaluate adherence to specific instructions and formatting requirements.

| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| IFEval | ifeval | | | Tests precise instruction-following capabilities |
System
Basic system validation and pre-flight checks.

| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| Pre-Flight | pre_flight | | | Basic system check and validation test |
Next steps
- Evaluate a model checkpoint
- Evaluate a hosted API model
- View details about specific benchmarks at AISI Inspect Evals (a local spot-check sketch follows below)
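The task IDs in the tables above generally correspond to task names in AISI Inspect Evals. If you want to spot-check a benchmark locally before launching an evaluation job, a minimal sketch using the open-source `inspect-ai` and `inspect_evals` packages might look like the following. This is an assumption about local tooling rather than part of the managed LLM Evaluation Jobs service, and the model name and sample limit are placeholders.

```python
# Minimal local spot-check of a benchmark task with Inspect, assuming
# `pip install inspect-ai inspect-evals` and an OPENAI_API_KEY in the
# environment. Not part of the W&B LLM Evaluation Jobs service.
from inspect_ai import eval
from inspect_evals.gsm8k import gsm8k  # same task ID as the Math table above

# Run a small sample against a placeholder model to verify the task wiring.
logs = eval(
    gsm8k(),
    model="openai/gpt-4o-mini",  # placeholder; any Inspect-supported model works
    limit=10,                    # only the first 10 samples for a quick check
)
print(logs[0].results)
```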