Evaluating Language Models for Mathematics through Interactions

by Katherine M. Collins, et al.

The standard methodology of evaluating large language models (LLMs) based on static pairs of inputs and outputs is insufficient for developing assistants: this kind of assessment fails to take into account the essential interactive element of their deployment, and therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants ranging from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a preliminary taxonomy of human behaviours and find that, despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we identify useful scenarios and existing issues of GPT-4 in mathematical reasoning through a series of case studies contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and are more interpretable and concise may constitute better assistants; interactive evaluation is a promising way to continually track the capabilities of these models; and humans should be aware of language models' algebraic fallibility and, for that reason, discern where these models should be used.


