LLM Evaluation Jobs is in Preview for W&B Multi-tenant Cloud. Compute is free during the preview period.
This page lists the evaluation benchmarks LLM Evaluation Jobs provides, organized by category. To run certain benchmarks, a team admin must first add the required API keys as team-scoped secrets. Any team member can then reference those secrets when configuring an evaluation job.
  • If a benchmark shows Yes in the OpenAI Model Scorer column, the benchmark uses OpenAI models for scoring. An organization or team admin must add an OpenAI API key as a team secret. When you configure an evaluation job with a benchmark that has this requirement, set the Scorer API key field to that secret.
  • If a benchmark shows Yes in the Gated Hugging Face Dataset column, the benchmark requires access to a gated Hugging Face dataset. An organization or team admin must request access to the dataset on Hugging Face, create a Hugging Face user access token, and store that token as a team secret. When you configure a benchmark that has this requirement, set the Hugging Face Token field to that secret.

Knowledge

Evaluate factual knowledge across various domains like science, language, and general reasoning.
| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| BoolQ | `boolq` | | | Boolean yes/no questions from natural language queries |
| GPQA Diamond | `gpqa_diamond` | | | Graduate-level science questions (highest-quality subset) |
| HLE | `hle` | Yes | | Humanity's Last Exam: frontier-difficulty academic questions |
| Lingoly | `lingoly` | | Yes | Linguistics olympiad problems |
| Lingoly Too | `lingoly_too` | | Yes | Extended linguistics challenge problems |
| MMIU | `mmiu` | | | Massive Multi-image Understanding benchmark |
| MMLU (0-shot) | `mmlu_0_shot` | | | Massive Multitask Language Understanding without in-context examples |
| MMLU (5-shot) | `mmlu_5_shot` | | | Massive Multitask Language Understanding with five in-context examples |
| MMLU-Pro | `mmlu_pro` | | | More challenging version of MMLU |
| ONET M6 | `onet_m6` | | | Questions from Thailand's O-NET national exam (Mathayom 6) |
| PAWS | `paws` | | | Paraphrase Adversaries from Word Scrambling: paraphrase identification |
| SevenLLM MCQ (English) | `sevenllm_mcq_en` | | | Multiple-choice questions in English |
| SevenLLM MCQ (Chinese) | `sevenllm_mcq_zh` | | | Multiple-choice questions in Chinese |
| SevenLLM QA (English) | `sevenllm_qa_en` | | | Question answering in English |
| SevenLLM QA (Chinese) | `sevenllm_qa_zh` | | | Question answering in Chinese |
| SimpleQA | `simpleqa` | Yes | | Straightforward factual question answering |
| SimpleQA Verified | `simpleqa_verified` | | | Verified subset of SimpleQA with validated answers |
| WorldSense | `worldsense` | | | Evaluates understanding of world knowledge and common sense |

Reasoning

Evaluate logical thinking, problem-solving, and common-sense reasoning capabilities.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AGIE AQUA-RAT | `agie_aqua_rat` | | | Algebraic question answering with rationales |
| AGIE LogiQA (English) | `agie_logiqa_en` | | | Logical reasoning questions in English |
| AGIE LSAT Analytical Reasoning | `agie_lsat_ar` | | | LSAT analytical reasoning (logic games) problems |
| AGIE LSAT Logical Reasoning | `agie_lsat_lr` | | | LSAT logical reasoning questions |
| ARC Challenge | `arc_challenge` | | | Challenging science questions requiring reasoning (AI2 Reasoning Challenge) |
| ARC Easy | `arc_easy` | | | Easier set of science questions from the ARC dataset |
| BBH | `bbh` | | | BIG-Bench Hard: challenging tasks from BIG-Bench |
| CoCoNot | `coconot` | | | Contextually, Comply Not: evaluates appropriate noncompliance with requests |
| CommonsenseQA | `commonsense_qa` | | | Commonsense reasoning questions |
| HellaSwag | `hellaswag` | | | Commonsense natural language inference |
| MUSR | `musr` | | | MuSR: multistep soft reasoning benchmark |
| PIQA | `piqa` | | | Physical commonsense reasoning |
| WinoGrande | `winogrande` | | | Commonsense reasoning via pronoun resolution |

Math

Evaluate mathematical problem-solving at various difficulty levels, from grade school to competition-level problems.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AGIE Math | `agie_math` | | | Advanced mathematical reasoning from the AGIE benchmark suite |
| AGIE SAT Math | `agie_sat_math` | | | SAT mathematics questions |
| AIME 2024 | `aime2024` | | | American Invitational Mathematics Examination problems from 2024 |
| AIME 2025 | `aime2025` | | | American Invitational Mathematics Examination problems from 2025 |
| GSM8K | `gsm8k` | | | Grade School Math 8K: multi-step math word problems |
| InfiniteBench Math Calc | `infinite_bench_math_calc` | | | Mathematical calculations in long contexts |
| InfiniteBench Math Find | `infinite_bench_math_find` | | | Finding mathematical patterns in long contexts |
| MATH | `math` | | | Competition-level mathematics problems |
| MGSM | `mgsm` | | | Multilingual Grade School Math |

Code

Evaluate programming and software development capabilities like debugging, code execution prediction, and function calling.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| BFCL | `bfcl` | | | Berkeley Function Calling Leaderboard: tests function calling and tool use capabilities |
| InfiniteBench Code Debug | `infinite_bench_code_debug` | | | Long-context code debugging tasks |
| InfiniteBench Code Run | `infinite_bench_code_run` | | | Long-context code execution prediction |

Reading

Evaluate reading comprehension and information extraction from complex texts.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AGIE LSAT Reading Comprehension | `agie_lsat_rc` | | | LSAT reading comprehension passages and questions |
| AGIE SAT English | `agie_sat_en` | | | SAT reading and writing questions with passages |
| AGIE SAT English (No Passage) | `agie_sat_en_without_passage` | | | SAT English questions without accompanying passages |
| DROP | `drop` | | | Discrete Reasoning Over Paragraphs: reading comprehension requiring numerical reasoning |
| RACE-H | `race_h` | | | Reading comprehension from English exams (high difficulty) |
| SQuAD | `squad` | | | Stanford Question Answering Dataset: extractive question answering on Wikipedia articles |

Long context

Evaluate the ability to process and reason over extended contexts, including retrieval and pattern recognition.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| InfiniteBench KV Retrieval | `infinite_bench_kv_retrieval` | | | Key-value retrieval in long contexts |
| InfiniteBench LongBook (English) | `infinite_bench_longbook_choice_eng` | | | Multiple-choice questions on long books |
| InfiniteBench LongDialogue QA (English) | `infinite_bench_longdialogue_qa_eng` | | | Question answering over long dialogues |
| InfiniteBench Number String | `infinite_bench_number_string` | | | Number pattern recognition in long sequences |
| InfiniteBench Passkey | `infinite_bench_passkey` | | | Retrieval of information from long contexts |
| NIAH | `niah` | | | Needle in a Haystack: long-context retrieval test |

Safety

Evaluate alignment, bias detection, harmful content resistance, and truthfulness.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AgentHarm | `agentharm` | | Yes | Tests model resistance to harmful agent behavior and misuse scenarios |
| AgentHarm Benign | `agentharm_benign` | | Yes | Benign baseline for AgentHarm to measure false-positive rates |
| Agentic Misalignment | `agentic_misalignment` | | | Evaluates potential misalignment in agentic behavior |
| AHB | `ahb` | | | Agent Harmful Behavior: tests resistance to harmful agentic actions |
| AIRBench | `air_bench` | | | Tests adversarial instruction resistance |
| BBEH | `bbeh` | | | BIG-Bench Extra Hard: harder successor to BBH |
| BBEH Mini | `bbeh_mini` | | | Smaller subset of the BBEH benchmark |
| BBQ | `bbq` | | | Bias Benchmark for Question Answering |
| BOLD | `bold` | | | Bias in Open-Ended Language Generation Dataset |
| CYSE3 Visual Prompt Injection | `cyse3_visual_prompt_injection` | | | Tests resistance to visual prompt-injection attacks |
| Make Me Pay | `make_me_pay` | | | Tests resistance to financial scam and fraud scenarios |
| MASK | `mask` | Yes | Yes | Tests whether a model's statements remain honest under pressure |
| Personality BFI | `personality_BFI` | | | Big Five personality trait assessment |
| Personality TRAIT | `personality_TRAIT` | | Yes | Comprehensive personality trait evaluation |
| SOSBench | `sosbench` | Yes | | Safety and oversight stress test |
| StereoSet | `stereoset` | | | Measures stereotypical biases in language models |
| StrongREJECT | `strong_reject` | | | Tests a model's ability to reject harmful requests |
| Sycophancy | `sycophancy` | | | Evaluates tendency toward sycophantic behavior |
| TruthfulQA | `truthfulqa` | | | Tests model truthfulness and resistance to falsehoods |
| UCCB | `uccb` | | | Unsafe Content Classification Benchmark |
| WMDP Bio | `wmdp_bio` | | | Tests hazardous knowledge in biology |
| WMDP Chem | `wmdp_chem` | | | Tests hazardous knowledge in chemistry |
| WMDP Cyber | `wmdp_cyber` | | | Tests hazardous knowledge in cybersecurity |
| XSTest | `xstest` | Yes | | Exaggerated-safety test for over-refusal detection |

Domain-Specific

Evaluate specialized knowledge in medicine, chemistry, law, biology, and other professional fields.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| ChemBench | `chembench` | | | Chemistry knowledge and problem-solving benchmark |
| HealthBench | `healthbench` | Yes | | Healthcare and medical knowledge evaluation |
| HealthBench Consensus | `healthbench_consensus` | Yes | | Healthcare questions with expert consensus |
| HealthBench Hard | `healthbench_hard` | Yes | | Challenging healthcare scenarios |
| LabBench Cloning Scenarios | `lab_bench_cloning_scenarios` | | | Laboratory experiment planning and cloning |
| LabBench DBQA | `lab_bench_dbqa` | | | Database question answering for lab scenarios |
| LabBench FigQA | `lab_bench_figqa` | | | Figure interpretation in scientific contexts |
| LabBench LitQA | `lab_bench_litqa` | | | Literature-based question answering for research |
| LabBench ProtocolQA | `lab_bench_protocolqa` | | | Laboratory protocol understanding |
| LabBench SeqQA | `lab_bench_seqqa` | | | Biological sequence analysis questions |
| LabBench SuppQA | `lab_bench_suppqa` | | | Supplementary material interpretation |
| LabBench TableQA | `lab_bench_tableqa` | | | Table interpretation in scientific papers |
| MedQA | `medqa` | | | Medical licensing exam questions |
| PubMedQA | `pubmedqa` | | | Biomedical question answering from research abstracts |
| SEC-QA v1 | `sec_qa_v1` | | | SEC filing question answering |
| SEC-QA v1 (5-shot) | `sec_qa_v1_5_shot` | | | SEC-QA with 5 examples |
| SEC-QA v2 | `sec_qa_v2` | | | Updated SEC filing benchmark |
| SEC-QA v2 (5-shot) | `sec_qa_v2_5_shot` | | | SEC-QA v2 with 5 examples |

Multimodal

Evaluate vision and language understanding combining visual and textual inputs.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| DocVQA | `docvqa` | | | Document Visual Question Answering: questions about document images |
| MathVista | `mathvista` | | | Mathematical reasoning with visual contexts, combining vision and math |
| MMMU Multiple Choice | `mmmu_multiple_choice` | | | Multimodal understanding with multiple-choice format |
| MMMU Open | `mmmu_open` | | | Multimodal understanding with open-ended responses |
| V*Star Bench Attribute Recognition | `vstar_bench_attribute_recognition` | | | Visual attribute recognition tasks |
| V*Star Bench Spatial Relationship | `vstar_bench_spatial_relationship_reasoning` | | | Spatial reasoning with visual inputs |

Instruction Following

Evaluate adherence to specific instructions and formatting requirements.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| IFEval | `ifeval` | | | Tests precise instruction-following capabilities |

System

Basic system validation and pre-flight checks.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| Pre-Flight | `pre_flight` | | | Basic system check and validation test |

Next steps