Best LLM Benchmarks in 2024
To do: gather videos and articles explaining the individual benchmarks.
Name | Category | Description | Popularity |
---|---|---|---|
MMLU | Knowledge & reasoning | Multiple-choice questions across 57 professional and academic subjects | High |
HellaSwag | Common-sense reasoning | Sentence completion about everyday events | High |
WinoGrande | Common-sense reasoning | Winograd-style pronoun resolution | High |
ARC (AI2 Reasoning Challenge) | Common-sense reasoning | Grade-school multiple-choice science questions (Challenge set) | High |
HumanEval | Coding | Python coding tasks generated from docstrings, scored with pass@k | High |
GSM8K | Math | Grade-school math word problems requiring multi-step reasoning | High |
DROP | Reading comprehension | Discrete reasoning (e.g. arithmetic) over paragraphs; scored with F1 | Medium |
MATH | Math | Competition-level mathematics problems | Medium |
BIG-Bench Hard | Reasoning | Suite of 23 challenging tasks from BIG-Bench | Medium |
TriviaQA | Knowledge | Trivia questions paired with evidence documents | |
TruthfulQA | Knowledge | Questions probing common misconceptions; measures truthfulness | |
MBPP | Coding | Mostly Basic Python Problems: entry-level Python tasks | |
MATH (maj@4) | Math | MATH scored by majority vote over 4 sampled answers | |
MT-Bench | Chat | Multi-turn open-ended questions scored by an LLM judge | |
GPQA | Knowledge | Graduate-level, "Google-proof" multiple-choice questions in biology, physics, and chemistry | |
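Two of the scoring schemes in the table can be sketched in a few lines: the unbiased pass@k estimator used for HumanEval-style coding benchmarks, and the majority-vote scoring behind "maj@4". Function names here are illustrative, not from any particular harness.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    probability that at least one of k samples drawn from n
    generated samples passes, given c of the n samples pass."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def maj_at_k(answers: list, reference) -> bool:
    """maj@k scoring: take the most common of k sampled answers
    and compare it to the reference answer (as in MATH maj@4)."""
    majority, _ = Counter(answers).most_common(1)[0]
    return majority == reference
```

For example, with 10 samples of which 3 pass, `pass_at_k(10, 3, 2)` gives roughly 0.53; `maj_at_k(["4", "5", "4", "4"], "4")` returns `True` because "4" wins the vote.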
// Ways to measure LLM performance, Benchmarks, Human Ratings, Simulated Exams, etc.
// Simulated exams like Bar, SAT, GRE, etc., https://openai.com/research/gpt-4
// Leaderboards
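Leaderboards such as LMSYS Chatbot Arena rank models from pairwise human preferences, commonly reported as Elo-style ratings. A minimal sketch of a single Elo update (standard chess-Elo formula; the K-factor and starting ratings below are illustrative, not Arena's exact parameters):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a pairwise model comparison.
    score_a: 1.0 if model A is preferred, 0.0 if B is, 0.5 for a tie.
    k: update step size (illustrative value)."""
    # Expected score of A given the current rating gap
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

Two equally rated models (say 1000 each) shift by ±16 points after one decisive vote with `k=32`; the total rating mass is conserved across every update.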