Last Update: July 18, 2024

Model Evaluations Explanation

Tromero provides tests in the following six distinct categories:

  • Mathematics: Assess a model's mathematical problem-solving abilities.
  • General AI: Measure a model's capabilities across a diverse set of tasks.
  • Reasoning: Evaluate a model's discrete reasoning, commonsense reasoning, and reading comprehension skills.
  • Knowledge: Test a model's fact-recall capabilities and knowledge in specific domains.
  • Programming: Assess a model's ability to solve programming problems.
  • Problem-Solving: Evaluate a model's performance on challenging benchmarks and real-world tasks.

Model Evaluations

Benchmarks

The following benchmarks are available for selection:

Mathematics

  • MATH: Measures performance on competition-level mathematical problem-solving tasks.
  • GSM8k: Measures performance on the GSM8k dataset of grade-school math word problems (scoring sketched below).
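
Math benchmarks in this category are typically scored by exact match on the final numeric answer; GSM8k reference solutions mark it with a "####" delimiter. Below is a minimal sketch of that style of scoring. The answer-extraction heuristic (take the last number the model wrote) is an illustrative assumption, not necessarily the exact logic used here:

    import re

    def extract_gold_answer(gold: str) -> str:
        # GSM8k reference solutions end with a line of the form "#### <answer>".
        return gold.split("####")[-1].strip().replace(",", "")

    def extract_model_answer(completion: str) -> str:
        # Illustrative assumption: take the last number in the model's output.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
        return numbers[-1] if numbers else ""

    def exact_match(completion: str, gold: str) -> bool:
        return extract_model_answer(completion) == extract_gold_answer(gold)

    # A gold solution ending in "#### 18" vs. a model answer ending in "18".
    print(exact_match("9 - 4 + 13 = 18, so the answer is 18", "... #### 18"))  # True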

General AI

  • AGIEval: Measures general AI problem-solving abilities across a diverse set of tasks derived from human standardized exams.

Reasoning

  • DROP (Discrete Reasoning Over Paragraphs): Evaluates the model's ability to perform discrete reasoning in reading comprehension tasks.
  • HellaSwag: Tests the model's commonsense reasoning by asking it to choose the most plausible continuation of a scenario.
  • CommonsenseQA: Tests commonsense reasoning through multiple-choice questions about everyday concepts (scoring sketched below).
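
HellaSwag and CommonsenseQA are multiple-choice tasks, and a common way to score them is to compute the model's log-likelihood for each candidate option and pick the highest. A minimal sketch, assuming per-option log-likelihoods have already been obtained from the model (the numbers below are made up):

    def pick_choice(option_logprobs: list[float]) -> int:
        # Choose the option the model assigns the highest log-likelihood.
        return max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])

    def accuracy(examples: list[tuple[list[float], int]]) -> float:
        # Each example pairs per-option log-likelihoods with the gold option index.
        correct = sum(pick_choice(lps) == gold for lps, gold in examples)
        return correct / len(examples)

    # Hypothetical scores for two 4-way questions (gold answers: options 2 and 0).
    examples = [([-12.3, -10.1, -7.4, -11.0], 2), ([-5.2, -9.8, -6.1, -8.0], 0)]
    print(accuracy(examples))  # 1.0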

Knowledge

  • TriviaQA: Tests the model's ability to answer trivia questions, reflecting its fact-recall capabilities (answer matching sketched after this list).
  • OpenBookQA: Evaluates performance on open-book question-answering tasks, testing both knowledge and reasoning.
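
TriviaQA supplies several acceptable aliases per answer, and free-form predictions are usually compared after light text normalization. The sketch below assumes SQuAD-style normalization (lowercasing, stripping punctuation and articles); the exact matching rules are an assumption:

    import re
    import string

    def normalize(text: str) -> str:
        # SQuAD-style normalization: lowercase, drop punctuation and
        # articles, collapse whitespace.
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def trivia_match(prediction: str, aliases: list[str]) -> bool:
        # A prediction counts if it matches any acceptable alias
        # after normalization.
        return normalize(prediction) in {normalize(a) for a in aliases}

    print(trivia_match("The Beatles", ["Beatles", "the beatles"]))  # True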

Programming

  • MBPP (Mostly Basic Python Problems): Evaluates the model's ability to write short Python programs that pass the dataset's test cases (scoring sketched below).
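
MBPP is scored by executing each generated program against the dataset's test assertions and counting how many problems pass. A minimal sketch of that check follows; note that real evaluation harnesses sandbox this step, whereas calling exec on untrusted model output, as below, is for illustration only:

    def passes_tests(candidate_code: str, test_cases: list[str]) -> bool:
        # Execute the candidate solution, then each assert-style test,
        # in a shared namespace. Any exception or failed assert fails
        # the problem.
        namespace: dict = {}
        try:
            exec(candidate_code, namespace)
            for test in test_cases:
                exec(test, namespace)
        except Exception:
            return False
        return True

    candidate = "def add(a, b):\n    return a + b"
    tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
    print(passes_tests(candidate, tests))  # True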

Problem-Solving

  • BBH (BIG-Bench Hard): Assesses performance on a particularly challenging subset of tasks from the BIG-Bench suite.
  • MMLU (Massive Multitask Language Understanding): Assesses performance across a wide range of subjects and tasks (prompt format sketched after this list).
  • PIQA (Physical Interaction Question Answering): Assesses the model's understanding of physical interactions.
  • ARC (AI2 Reasoning Challenge): Measures the ability to answer questions that require reasoning.
  • SIQA (Social IQA): Evaluates social intelligence by assessing how well the model understands social situations.
  • BoolQ: Measures the model's performance on binary (yes/no) questions.
  • GPQA (Graduate-Level Google-Proof Q&A): Measures performance on difficult, expert-written science questions designed to resist simple web lookup.
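
Several of the benchmarks above (MMLU, ARC, PIQA, SIQA, BoolQ, GPQA) are posed as multiple-choice questions. As one concrete example, MMLU items are conventionally formatted with lettered options and an "Answer:" stem, and the model is graded on the letter it predicts next. A minimal sketch of that conventional formatting:

    CHOICE_LABELS = ["A", "B", "C", "D"]

    def format_mmlu_prompt(question: str, choices: list[str]) -> str:
        # Conventional MMLU presentation: question, lettered options,
        # then an "Answer:" stem for the model to complete.
        lines = [question]
        lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, choices)]
        lines.append("Answer:")
        return "\n".join(lines)

    print(format_mmlu_prompt(
        "What is the SI unit of force?",
        ["Joule", "Newton", "Watt", "Pascal"],
    ))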
