PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations

by Ruosen Li, et al.

The quality of responses generated by modern large language models (LLMs) is hard to evaluate and compare automatically. Recent studies suggest, and predominantly use, LLMs as a reference-free metric for open-ended question answering. More specifically, they use the recognized "strongest" LLM as the evaluator, which conducts pairwise comparisons of candidate models' answers and provides a ranking score. However, this intuitive method has multiple problems, such as self-enhancement bias (an evaluator favoring its own answers) and positional bias. We draw insights and lessons from the educational domain (Cho and MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations. Specifically, we propose (1) the peer rank (PR) algorithm, which takes into account each peer LLM's pairwise preferences over all answer pairs and outputs a final ranking of models; and (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on their preference between two answers. We conduct experiments on two benchmark datasets and find that our approaches achieve higher accuracy and align better with human judgments, respectively. Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model's name is unrevealed. Our work opens up space for evaluating models that are hard for humans to compare.
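The peer rank idea described above can be sketched as an iterative weighted tally: each peer model casts pairwise votes over anonymous answer pairs, and reviewers' votes are weighted by their own current standing. The function name, vote format, and the exact re-weighting rule below are illustrative assumptions, not the paper's precise algorithm:

```python
# Hypothetical, simplified sketch of a peer-rank-style aggregation.
# Each (reviewer, winner, loser) tuple records one pairwise judgment;
# reviewer weights are refined iteratively so that models ranked higher
# get a larger say as evaluators. This is an assumed scheme for
# illustration, not the authors' exact PR algorithm.
from collections import defaultdict

def peer_rank(pairwise_votes, models, iterations=10):
    """Return models sorted best-first by weighted pairwise wins."""
    # Start with uniform reviewer weights.
    weights = {m: 1.0 for m in models}
    scores = {}
    for _ in range(iterations):
        # Tally weighted wins for each contestant model.
        scores = defaultdict(float)
        for reviewer, winner, _loser in pairwise_votes:
            scores[winner] += weights[reviewer]
        # Re-weight reviewers in proportion to their normalized score.
        total = sum(scores.values()) or 1.0
        weights = {m: scores.get(m, 0.0) / total for m in models}
    return sorted(models, key=lambda m: scores.get(m, 0.0), reverse=True)

# Toy example: three models judging each other's answers (self-votes can
# occur; in PR the anonymous setting mitigates self-enhancement bias).
votes = [
    ("A", "B", "C"), ("B", "A", "C"), ("C", "A", "B"),
    ("A", "A", "C"),
]
print(peer_rank(votes, ["A", "B", "C"]))  # → ['A', 'B', 'C']
```

The iteration converges because each round only redistributes a fixed total weight; in practice one would also normalize per-reviewer vote counts so that a model evaluating more pairs does not dominate.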


