Last Update: July 18, 2024

Model Evaluations Explanation

Tromero provides tests in the following six distinct categories:

  • Mathematics: Assess a model's mathematical problem-solving abilities.
  • General AI: Measure a model's capabilities across a diverse set of tasks.
  • Reasoning: Evaluate a model's discrete reasoning, commonsense reasoning, and reading comprehension skills.
  • Knowledge: Test a model's fact-recall capabilities and knowledge in specific domains.
  • Programming: Assess a model's ability to solve programming problems.
  • Problem-Solving: Evaluate a model's performance on challenging benchmarks and real-world tasks.

Model Evaluations

Benchmarks

The following benchmarks are available for selection:

Mathematics

  • MATH: Measures performance on competition-level mathematical problem-solving tasks.
  • GSM8k: Measures performance on the GSM8k dataset of grade-school math word problems (scoring sketched below).
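
Math benchmarks in this category are typically scored by exact match on the final numeric answer; GSM8k reference solutions mark it with a "####" delimiter. Below is a minimal sketch of that style of scoring. The answer-extraction heuristic (take the last number the model wrote) is an illustrative assumption, not necessarily the exact logic used here:

    import re

    def extract_gold_answer(gold: str) -> str:
        # GSM8k reference solutions end with a line of the form "#### <answer>".
        return gold.split("####")[-1].strip().replace(",", "")

    def extract_model_answer(completion: str) -> str:
        # Illustrative assumption: take the last number in the model's output.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
        return numbers[-1] if numbers else ""

    def exact_match(completion: str, gold: str) -> bool:
        return extract_model_answer(completion) == extract_gold_answer(gold)

    # A gold solution ending in "#### 18" vs. a model answer ending in "18".
    print(exact_match("9 - 4 + 13 = 18, so the answer is 18", "... #### 18"))  # True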

General AI

  • AGIEval: Measures general AI problem-solving abilities across a diverse set of tasks derived from human standardized exams.

Reasoning

  • DROP (Discrete Reasoning Over Paragraphs): Evaluates the model's ability to perform discrete reasoning in reading comprehension tasks.
  • HellaSwag: Tests the model's commonsense reasoning by asking it to choose the most plausible continuation of a scenario.
  • CommonsenseQA: Tests commonsense reasoning through multiple-choice questions about everyday concepts (scoring sketched below).
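
HellaSwag and CommonsenseQA are multiple-choice tasks, and a common way to score them is to compute the model's log-likelihood for each candidate option and pick the highest. A minimal sketch, assuming per-option log-likelihoods have already been obtained from the model (the numbers below are made up):

    def pick_choice(option_logprobs: list[float]) -> int:
        # Choose the option the model assigns the highest log-likelihood.
        return max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])

    def accuracy(examples: list[tuple[list[float], int]]) -> float:
        # Each example pairs per-option log-likelihoods with the gold option index.
        correct = sum(pick_choice(lps) == gold for lps, gold in examples)
        return correct / len(examples)

    # Hypothetical scores for two 4-way questions (gold answers: options 2 and 0).
    examples = [([-12.3, -10.1, -7.4, -11.0], 2), ([-5.2, -9.8, -6.1, -8.0], 0)]
    print(accuracy(examples))  # 1.0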

Knowledge

  • TriviaQA: Tests the model's ability to answer trivia questions, reflecting its fact-recall capabilities (answer matching sketched after this list).
  • OpenBookQA: Evaluates performance on open-book question-answering tasks, testing both knowledge and reasoning.
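
TriviaQA supplies several acceptable aliases per answer, and free-form predictions are usually compared after light text normalization. The sketch below assumes SQuAD-style normalization (lowercasing, stripping punctuation and articles); the exact matching rules are an assumption:

    import re
    import string

    def normalize(text: str) -> str:
        # SQuAD-style normalization: lowercase, drop punctuation and
        # articles, collapse whitespace.
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def trivia_match(prediction: str, aliases: list[str]) -> bool:
        # A prediction counts if it matches any acceptable alias
        # after normalization.
        return normalize(prediction) in {normalize(a) for a in aliases}

    print(trivia_match("The Beatles", ["Beatles", "the beatles"]))  # True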

Programming

  • MBPP (Mostly Basic Python Problems): Evaluates the model's ability to write short Python programs that pass the dataset's test cases (scoring sketched below).
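
MBPP is scored by executing each generated program against the dataset's test assertions and counting how many problems pass. A minimal sketch of that check follows; note that real evaluation harnesses sandbox this step, whereas calling exec on untrusted model output, as below, is for illustration only:

    def passes_tests(candidate_code: str, test_cases: list[str]) -> bool:
        # Execute the candidate solution, then each assert-style test,
        # in a shared namespace. Any exception or failed assert fails
        # the problem.
        namespace: dict = {}
        try:
            exec(candidate_code, namespace)
            for test in test_cases:
                exec(test, namespace)
        except Exception:
            return False
        return True

    candidate = "def add(a, b):\n    return a + b"
    tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
    print(passes_tests(candidate, tests))  # True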

Problem-Solving

  • BBH (BIG-Bench Hard): Assesses performance on a particularly challenging subset of tasks from the BIG-Bench suite.
  • MMLU (Massive Multitask Language Understanding): Assesses performance across a wide range of subjects and tasks (prompt format sketched after this list).
  • PIQA (Physical Interaction Question Answering): Assesses the model's understanding of physical interactions.
  • ARC (AI2 Reasoning Challenge): Measures the ability to answer questions that require reasoning.
  • SIQA (Social IQA): Evaluates social intelligence by assessing how well the model understands social situations.
  • BoolQ: Measures the model's performance on binary (yes/no) questions.
  • GPQA (Graduate-Level Google-Proof Q&A): Measures performance on difficult, expert-written science questions designed to resist simple web lookup.
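
Several of the benchmarks above (MMLU, ARC, PIQA, SIQA, BoolQ, GPQA) are posed as multiple-choice questions. As one concrete example, MMLU items are conventionally formatted with lettered options and an "Answer:" stem, and the model is graded on the letter it predicts next. A minimal sketch of that conventional formatting:

    CHOICE_LABELS = ["A", "B", "C", "D"]

    def format_mmlu_prompt(question: str, choices: list[str]) -> str:
        # Conventional MMLU presentation: question, lettered options,
        # then an "Answer:" stem for the model to complete.
        lines = [question]
        lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, choices)]
        lines.append("Answer:")
        return "\n".join(lines)

    print(format_mmlu_prompt(
        "What is the SI unit of force?",
        ["Joule", "Newton", "Watt", "Pascal"],
    ))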
