Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models

by Daman Arora, et al.
Indian Institute of Technology Delhi

The performance of Large Language Models (LLMs) on existing reasoning benchmarks has improved considerably over the past few years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem-solving abilities of LLMs. We curate 450 challenging pre-engineering mathematics, physics, and chemistry problems from the IIT JEE-Advanced exam. Long-horizon reasoning on top of deep in-domain knowledge is essential for solving problems in this benchmark. Our evaluation of the GPT series of models reveals that although performance improves with newer models, with GPT-4 being the best, the highest performance, even after using techniques like Self-Consistency and Chain-of-Thought prompting, is less than 40 percent. Our analysis demonstrates that errors in algebraic manipulation and failure to retrieve relevant domain-specific concepts are the primary contributors to GPT-4's low performance. Given the challenging nature of the benchmark, we hope that it can guide future research in problem solving using LLMs. Our code and dataset are available here.
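For readers unfamiliar with Self-Consistency, the idea is to sample several reasoning chains for the same problem and keep the final answer that occurs most often. The snippet below is a minimal sketch of that voting step only, not the authors' implementation. The function name `self_consistency_vote` is hypothetical, and it assumes each sampled response has already been reduced to a final answer string (for example, a multiple-choice option letter).

```python
from collections import Counter


def self_consistency_vote(sampled_answers):
    """Return the most frequent final answer among several sampled responses.

    `sampled_answers` is assumed to hold one extracted final answer per
    sampled reasoning chain; entries may be None or blank when extraction
    failed. Returns None if no usable answer remains.
    """
    # Normalise case and whitespace, and drop samples with no usable answer.
    answers = [a.strip().upper() for a in sampled_answers if a and a.strip()]
    if not answers:
        return None
    # most_common(1) yields the answer with the highest count; ties are
    # resolved in favour of the answer that appeared first.
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Five hypothetical samples for one multiple-choice question.
    samples = ["B", "b", "C", None, "B "]
    print(self_consistency_vote(samples))
```

Majority voting is the simplest aggregation rule; how answers are extracted from each chain and how ties are handled are design choices that this sketch leaves open.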




ARB: Advanced Reasoning Benchmark for Large Language Models

Large Language Models (LLMs) have demonstrated remarkable performance on...

An Empirical Study on Challenging Math Problem Solving with GPT-4

Employing Large Language Models (LLMs) to address mathematical problems ...

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Recent advances in large language models (LLMs) have demonstrated notabl...

True Detective: A Challenging Benchmark for Deep Abductive Reasoning in Foundation Models

Large language models (LLMs) have demonstrated strong performance in zer...

Diversity Measures: Domain-Independent Proxies for Failure in Language Model Queries

Error prediction in large language models often relies on domain-specifi...

LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles

With the continuous evolution and refinement of LLMs, they are endowed w...
