LLM Evaluation Jobs is in Preview for W&B Multi-tenant Cloud. Compute is free during the preview period.
This page lists the evaluation benchmarks LLM Evaluation Jobs provides, organized by category. To run certain benchmarks, a team admin must first add the required API keys as team-scoped secrets. Any team member can then reference those secrets when configuring an evaluation job.
  • If a benchmark shows Yes in the OpenAI Model Scorer column, the benchmark uses OpenAI models for scoring. An organization or team admin must add an OpenAI API key as a team secret. When you configure an evaluation job with a benchmark that has this requirement, set the Scorer API key field to that secret.
  • If a benchmark shows Yes in the Gated Hugging Face Dataset column, the benchmark requires access to a gated Hugging Face dataset. An organization or team admin must request access to the dataset on Hugging Face, create a Hugging Face user access token, and store that token as a team secret. When you configure a benchmark that has this requirement, set the Hugging Face Token field to that secret.

Knowledge

Evaluate factual knowledge across various domains like science, language, and general reasoning.
| Evaluation | Task ID | OpenAI Model Scorer | Gated Hugging Face Dataset | Description |
|---|---|---|---|---|
| BoolQ | `boolq` | | | Boolean yes/no questions from natural language queries |
| GPQA Diamond | `gpqa_diamond` | | | Graduate-level science questions (highest-quality subset) |
| HLE | `hle` | Yes | | Humanity's Last Exam: frontier-difficulty academic questions |
| Lingoly | `lingoly` | | Yes | Linguistics olympiad problems |
| Lingoly Too | `lingoly_too` | | Yes | Extended linguistics challenge problems |
| MMIU | `mmiu` | | | Massive Multi-image Understanding benchmark |
| MMLU (0-shot) | `mmlu_0_shot` | | | Massive Multitask Language Understanding without in-context examples |
| MMLU (5-shot) | `mmlu_5_shot` | | | Massive Multitask Language Understanding with five in-context examples |
| MMLU-Pro | `mmlu_pro` | | | More challenging version of MMLU |
| ONET M6 | `onet_m6` | | | Questions from Thailand's O-NET national exam (Mathayom 6) |
| PAWS | `paws` | | | Paraphrase Adversaries from Word Scrambling: paraphrase identification |
| SevenLLM MCQ (English) | `sevenllm_mcq_en` | | | Multiple-choice questions in English |
| SevenLLM MCQ (Chinese) | `sevenllm_mcq_zh` | | | Multiple-choice questions in Chinese |
| SevenLLM QA (English) | `sevenllm_qa_en` | | | Question answering in English |
| SevenLLM QA (Chinese) | `sevenllm_qa_zh` | | | Question answering in Chinese |
| SimpleQA | `simpleqa` | Yes | | Straightforward factual question answering |
| SimpleQA Verified | `simpleqa_verified` | | | Verified subset of SimpleQA with validated answers |
| WorldSense | `worldsense` | | | Evaluates understanding of world knowledge and common sense |

Reasoning

Evaluate logical thinking, problem-solving, and common-sense reasoning capabilities.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AGIE AQUA-RAT | `agie_aqua_rat` | | | Algebraic question answering with rationales |
| AGIE LogiQA (English) | `agie_logiqa_en` | | | Logical reasoning questions in English |
| AGIE LSAT Analytical Reasoning | `agie_lsat_ar` | | | LSAT analytical reasoning (logic games) problems |
| AGIE LSAT Logical Reasoning | `agie_lsat_lr` | | | LSAT logical reasoning questions |
| ARC Challenge | `arc_challenge` | | | Challenging science questions requiring reasoning (AI2 Reasoning Challenge) |
| ARC Easy | `arc_easy` | | | Easier set of science questions from the ARC dataset |
| BBH | `bbh` | | | BIG-Bench Hard: challenging tasks from BIG-Bench |
| CoCoNot | `coconot` | | | Contextually, Comply Not: evaluates appropriate noncompliance with requests |
| CommonsenseQA | `commonsense_qa` | | | Commonsense reasoning questions |
| HellaSwag | `hellaswag` | | | Commonsense natural language inference |
| MUSR | `musr` | | | MuSR: multistep soft reasoning benchmark |
| PIQA | `piqa` | | | Physical commonsense reasoning |
| WinoGrande | `winogrande` | | | Commonsense reasoning via pronoun resolution |

Math

Evaluate mathematical problem-solving at various difficulty levels, from grade school to competition-level problems.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AGIE Math | `agie_math` | | | Advanced mathematical reasoning from the AGIE benchmark suite |
| AGIE SAT Math | `agie_sat_math` | | | SAT mathematics questions |
| AIME 2024 | `aime2024` | | | American Invitational Mathematics Examination problems from 2024 |
| AIME 2025 | `aime2025` | | | American Invitational Mathematics Examination problems from 2025 |
| GSM8K | `gsm8k` | | | Grade School Math 8K: multi-step math word problems |
| InfiniteBench Math Calc | `infinite_bench_math_calc` | | | Mathematical calculations in long contexts |
| InfiniteBench Math Find | `infinite_bench_math_find` | | | Finding mathematical patterns in long contexts |
| MATH | `math` | | | Competition-level mathematics problems |
| MGSM | `mgsm` | | | Multilingual Grade School Math |

Code

Evaluate programming and software development capabilities like debugging, code execution prediction, and function calling.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| BFCL | `bfcl` | | | Berkeley Function Calling Leaderboard: tests function calling and tool use capabilities |
| InfiniteBench Code Debug | `infinite_bench_code_debug` | | | Long-context code debugging tasks |
| InfiniteBench Code Run | `infinite_bench_code_run` | | | Long-context code execution prediction |

Reading

Evaluate reading comprehension and information extraction from complex texts.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AGIE LSAT Reading Comprehension | `agie_lsat_rc` | | | LSAT reading comprehension passages and questions |
| AGIE SAT English | `agie_sat_en` | | | SAT reading and writing questions with passages |
| AGIE SAT English (No Passage) | `agie_sat_en_without_passage` | | | SAT English questions without accompanying passages |
| DROP | `drop` | | | Discrete Reasoning Over Paragraphs: reading comprehension requiring numerical reasoning |
| RACE-H | `race_h` | | | Reading comprehension from English exams (high difficulty) |
| SQuAD | `squad` | | | Stanford Question Answering Dataset: extractive question answering on Wikipedia articles |

Long context

Evaluate the ability to process and reason over extended contexts, including retrieval and pattern recognition.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| InfiniteBench KV Retrieval | `infinite_bench_kv_retrieval` | | | Key-value retrieval in long contexts |
| InfiniteBench LongBook (English) | `infinite_bench_longbook_choice_eng` | | | Multiple-choice questions on long books |
| InfiniteBench LongDialogue QA (English) | `infinite_bench_longdialogue_qa_eng` | | | Question answering over long dialogues |
| InfiniteBench Number String | `infinite_bench_number_string` | | | Number pattern recognition in long sequences |
| InfiniteBench Passkey | `infinite_bench_passkey` | | | Retrieval of information from long contexts |
| NIAH | `niah` | | | Needle in a Haystack: long-context retrieval test |

Safety

Evaluate alignment, bias detection, harmful content resistance, and truthfulness.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AgentHarm | `agentharm` | | Yes | Tests model resistance to harmful agent behavior and misuse scenarios |
| AgentHarm Benign | `agentharm_benign` | | Yes | Benign baseline for AgentHarm to measure false-positive rates |
| Agentic Misalignment | `agentic_misalignment` | | | Evaluates potential misalignment in agentic behavior |
| AHB | `ahb` | | | Agent Harmful Behavior: tests resistance to harmful agentic actions |
| AIRBench | `air_bench` | | | Tests adversarial instruction resistance |
| BBEH | `bbeh` | | | BIG-Bench Extra Hard: harder successor to BBH |
| BBEH Mini | `bbeh_mini` | | | Smaller subset of the BBEH benchmark |
| BBQ | `bbq` | | | Bias Benchmark for Question Answering |
| BOLD | `bold` | | | Bias in Open-Ended Language Generation Dataset |
| CYSE3 Visual Prompt Injection | `cyse3_visual_prompt_injection` | | | Tests resistance to visual prompt-injection attacks |
| Make Me Pay | `make_me_pay` | | | Tests resistance to financial scam and fraud scenarios |
| MASK | `mask` | Yes | Yes | Tests whether a model's statements remain honest under pressure |
| Personality BFI | `personality_BFI` | | | Big Five personality trait assessment |
| Personality TRAIT | `personality_TRAIT` | | Yes | Comprehensive personality trait evaluation |
| SOSBench | `sosbench` | Yes | | Safety and oversight stress test |
| StereoSet | `stereoset` | | | Measures stereotypical biases in language models |
| StrongREJECT | `strong_reject` | | | Tests a model's ability to reject harmful requests |
| Sycophancy | `sycophancy` | | | Evaluates tendency toward sycophantic behavior |
| TruthfulQA | `truthfulqa` | | | Tests model truthfulness and resistance to falsehoods |
| UCCB | `uccb` | | | Unsafe Content Classification Benchmark |
| WMDP Bio | `wmdp_bio` | | | Tests hazardous knowledge in biology |
| WMDP Chem | `wmdp_chem` | | | Tests hazardous knowledge in chemistry |
| WMDP Cyber | `wmdp_cyber` | | | Tests hazardous knowledge in cybersecurity |
| XSTest | `xstest` | Yes | | Exaggerated-safety test for over-refusal detection |

Domain-Specific

Evaluate specialized knowledge in medicine, chemistry, law, biology, and other professional fields.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| ChemBench | `chembench` | | | Chemistry knowledge and problem-solving benchmark |
| HealthBench | `healthbench` | Yes | | Healthcare and medical knowledge evaluation |
| HealthBench Consensus | `healthbench_consensus` | Yes | | Healthcare questions with expert consensus |
| HealthBench Hard | `healthbench_hard` | Yes | | Challenging healthcare scenarios |
| LabBench Cloning Scenarios | `lab_bench_cloning_scenarios` | | | Laboratory experiment planning and cloning |
| LabBench DBQA | `lab_bench_dbqa` | | | Database question answering for lab scenarios |
| LabBench FigQA | `lab_bench_figqa` | | | Figure interpretation in scientific contexts |
| LabBench LitQA | `lab_bench_litqa` | | | Literature-based question answering for research |
| LabBench ProtocolQA | `lab_bench_protocolqa` | | | Laboratory protocol understanding |
| LabBench SeqQA | `lab_bench_seqqa` | | | Biological sequence analysis questions |
| LabBench SuppQA | `lab_bench_suppqa` | | | Supplementary material interpretation |
| LabBench TableQA | `lab_bench_tableqa` | | | Table interpretation in scientific papers |
| MedQA | `medqa` | | | Medical licensing exam questions |
| PubMedQA | `pubmedqa` | | | Biomedical question answering from research abstracts |
| SEC-QA v1 | `sec_qa_v1` | | | SEC filing question answering |
| SEC-QA v1 (5-shot) | `sec_qa_v1_5_shot` | | | SEC-QA with 5 examples |
| SEC-QA v2 | `sec_qa_v2` | | | Updated SEC filing benchmark |
| SEC-QA v2 (5-shot) | `sec_qa_v2_5_shot` | | | SEC-QA v2 with 5 examples |

Multimodal

Evaluate vision and language understanding combining visual and textual inputs.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| DocVQA | `docvqa` | | | Document Visual Question Answering: questions about document images |
| MathVista | `mathvista` | | | Mathematical reasoning with visual contexts, combining vision and math |
| MMMU Multiple Choice | `mmmu_multiple_choice` | | | Multimodal understanding with multiple-choice format |
| MMMU Open | `mmmu_open` | | | Multimodal understanding with open-ended responses |
| V*Star Bench Attribute Recognition | `vstar_bench_attribute_recognition` | | | Visual attribute recognition tasks |
| V*Star Bench Spatial Relationship | `vstar_bench_spatial_relationship_reasoning` | | | Spatial reasoning with visual inputs |

Instruction Following

Evaluate adherence to specific instructions and formatting requirements.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| IFEval | `ifeval` | | | Tests precise instruction-following capabilities |

System

Basic system validation and pre-flight checks.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| Pre-Flight | `pre_flight` | | | Basic system check and validation test |

Next steps