Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

by Miao Xiong, et al.

Empowering large language models (LLMs) to accurately express their confidence, a task referred to as confidence elicitation, is essential for reliable and trustworthy decision-making. Previous methods, which rely primarily on model logits, have become less suitable for LLMs and are even infeasible for closed-source LLMs (e.g., commercialized LLM APIs). This creates a growing need to explore non-logit-based approaches to estimating the uncertainty of LLMs. In this study, we investigate approaches to confidence elicitation that require neither model fine-tuning nor access to proprietary information. We benchmark three categories of methods (verbalize-based, consistency-based, and hybrids of the two) and evaluate their performance across five types of datasets and four widely used LLMs. Our analysis uncovers several key insights: 1) LLMs often exhibit a high degree of overconfidence when verbalizing their confidence; 2) prompting strategies such as CoT, Top-K, and Multi-step confidences improve the calibration of verbalized confidence; 3) consistency-based methods outperform verbalized confidences in most cases, with particularly notable improvements on the arithmetic reasoning task; 4) hybrid methods consistently outperform their baselines, emerging as a promising state-of-the-art approach; 5) despite these advancements, all investigated methods continue to struggle with challenging tasks, such as those requiring professional knowledge, leaving significant scope for improving confidence elicitation.
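The abstract names consistency-based and hybrid methods without giving their formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a consistency score can be taken as the fraction of sampled answers that agree with the majority answer, and that a hybrid score can weight each sampled answer by its verbalized confidence. The function names `consistency_confidence` and `hybrid_confidence` are hypothetical, and the sampled answers are illustrative placeholders rather than real model outputs.

```python
from collections import Counter


def consistency_confidence(answers):
    """Return (majority_answer, agreement_fraction) for sampled answers.

    Assumption: confidence is the share of samples that match the
    most frequent answer. This is one simple agreement measure.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def hybrid_confidence(answers, verbalized):
    """Return (best_answer, weighted_share) using verbalized confidences.

    Assumption: each sampled answer votes with a weight equal to the
    confidence the model stated for it (a number in [0, 1]).
    """
    if len(answers) != len(verbalized):
        raise ValueError("answers and verbalized must have equal length")
    totals = {}
    for answer, conf in zip(answers, verbalized):
        totals[answer] = totals.get(answer, 0.0) + conf
    total = sum(totals.values())
    if total == 0:
        raise ValueError("verbalized confidences sum to zero")
    best = max(totals, key=totals.get)
    return best, totals[best] / total


# Illustrative placeholder samples for one arithmetic question.
samples = ["42", "42", "41", "42"]
stated = [0.5, 0.5, 0.5, 0.5]
print(consistency_confidence(samples))
print(hybrid_confidence(samples, stated))
```

With equal stated confidences the two scores coincide; they diverge when the model states higher confidence for some sampled answers than for others.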


Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty

Open Information Extraction (OIE) task aims at extracting structured fac...

Improving Classifier Confidence using Lossy Label-Invariant Transformations

Providing reliable model uncertainty estimates is imperative to enabling...

Confidence-based Out-of-Distribution Detection: A Comparative Study and Analysis

Image classification models deployed in the real world may receive input...

MACEst: The reliable and trustworthy Model Agnostic Confidence Estimator

Reliable Confidence Estimates are hugely important for any machine learn...

Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Large language models (LLMs) have achieved widespread success on a varie...

How Can We Know When Language Models Know?

Recent works have shown that language models (LM) capture different type...

Reliable Gradient-free and Likelihood-free Prompt Tuning

Due to privacy or commercial constraints, large pre-trained language mod...
