Last Update: July 18, 2024
Model Evaluations Explanation
Tromero groups its tests into six categories:
- Mathematics: Assess a model's mathematical problem-solving abilities.
- General AI: Measure a model's capabilities across a diverse set of tasks.
- Reasoning: Evaluate a model's discrete reasoning, commonsense reasoning, and reading comprehension skills.
- Knowledge: Test a model's fact-recall capabilities and knowledge in specific domains.
- Programming: Assess a model's ability to solve programming problems.
- Problem-Solving: Evaluate a model's performance on challenging benchmarks and real-world tasks.
Benchmarks
The following benchmarks are available for selection:
Mathematics
- MATH: Measures performance on competition-level mathematics problems.
- GSM8K: Measures performance on GSM8K, a dataset of grade-school math word problems.
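As an illustration of what "performance" can mean for a benchmark like GSM8K, the sketch below computes exact-match accuracy on final numeric answers. It relies on the fact that GSM8K reference solutions end with a `#### <answer>` line. The function names and the "last number in the output" heuristic are assumptions made for this example; they are not a description of how Tromero scores the benchmark.

```python
import re
from typing import Optional, Sequence

# Matches integers and decimals, allowing thousands separators (e.g. "1,200").
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Return the final answer from a GSM8K reference solution ('... #### 72')."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(output: str) -> Optional[str]:
    """Heuristic (an assumption of this sketch): take the last number in the output."""
    matches = _NUMBER.findall(output)
    return matches[-1].replace(",", "") if matches else None


def exact_match_accuracy(outputs: Sequence[str], solutions: Sequence[str]) -> float:
    """Fraction of outputs whose final number equals the reference answer."""
    if not solutions:
        return 0.0
    correct = 0
    for output, solution in zip(outputs, solutions):
        predicted = predicted_answer(output)
        if predicted is not None and float(predicted) == float(reference_answer(solution)):
            correct += 1
    return correct / len(solutions)
```

Comparing the values as floats rather than strings means `72` and `72.0` count as the same answer.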
General AI
- AGIEval: Measures general problem-solving ability on questions drawn from human standardized exams, such as college admission and professional qualification tests.
Reasoning
- DROP (Discrete Reasoning Over Paragraphs): Evaluates the model's ability to perform discrete reasoning in reading comprehension tasks.
- HellaSwag: Tests commonsense reasoning by asking the model to pick the most plausible continuation of a described scenario.
- CommonsenseQA: Tests commonsense reasoning through multiple-choice questions.
Knowledge
- TriviaQA: Tests the model's ability to answer trivia questions, reflecting its fact-recall capabilities.
- OpenBookQA: Evaluates performance on open-book question-answering tasks, testing both knowledge and reasoning.
Programming
- MBPP (Mostly Basic Python Problems): Evaluates the model's ability to write short Python programs from natural-language task descriptions, with solutions checked against test cases.
Problem-Solving
- BBH (BIG-Bench Hard): Assesses performance on a subset of especially challenging tasks from the BIG-Bench benchmark.
- MMLU (Massive Multitask Language Understanding): Assesses performance on multiple-choice questions spanning a wide range of academic and professional subjects.
- PIQA (Physical Interaction Question Answering): Assesses the model's understanding of physical interactions.
- ARC (AI2 Reasoning Challenge): Measures the ability to answer grade-school science questions that require reasoning.
- SIQA (Social IQA): Evaluates social intelligence by assessing how well the model understands social situations.
- BoolQ: Measures the model's performance on yes/no questions that are answered from a short passage.
- GPQA (Graduate-Level Google-Proof Q&A): Measures performance on difficult, expert-written multiple-choice questions in biology, physics, and chemistry.
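When selecting benchmarks programmatically, it can help to keep the grouping above in a single lookup table. The following is a minimal sketch that mirrors the category lists on this page. The names `BENCHMARKS_BY_CATEGORY`, `category_of`, and `group_selection` are hypothetical and are not part of any Tromero API.

```python
from typing import Dict, List

# Category-to-benchmark grouping, copied from the lists on this page.
BENCHMARKS_BY_CATEGORY: Dict[str, List[str]] = {
    "Mathematics": ["MATH", "GSM8K"],
    "General AI": ["AGIEval"],
    "Reasoning": ["DROP", "HellaSwag", "CommonsenseQA"],
    "Knowledge": ["TriviaQA", "OpenBookQA"],
    "Programming": ["MBPP"],
    "Problem-Solving": ["BBH", "MMLU", "PIQA", "ARC", "SIQA", "BoolQ", "GPQA"],
}


def category_of(benchmark: str) -> str:
    """Return the category a benchmark is listed under (case-insensitive)."""
    wanted = benchmark.casefold()
    for category, names in BENCHMARKS_BY_CATEGORY.items():
        if any(name.casefold() == wanted for name in names):
            return category
    raise KeyError(f"unknown benchmark: {benchmark!r}")


def group_selection(selected: List[str]) -> Dict[str, List[str]]:
    """Group selected benchmark names by category; unknown names raise KeyError."""
    grouped: Dict[str, List[str]] = {}
    for name in selected:
        grouped.setdefault(category_of(name), []).append(name)
    return grouped
```

For example, `group_selection(["MMLU", "MATH", "BoolQ"])` places `MMLU` and `BoolQ` under "Problem-Solving" and `MATH` under "Mathematics", which makes it easy to see at a glance which categories a selection leaves uncovered.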