FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models

by Liwen Zhang, et al.

Large language models (LLMs) have demonstrated exceptional performance on a variety of natural language processing tasks, yet their efficacy on more challenging, domain-specific tasks remains largely unexplored. This paper presents FinEval, a benchmark designed specifically to evaluate the financial domain knowledge of LLMs. FinEval is a collection of high-quality multiple-choice questions covering four categories: Finance, Economy, Accounting, and Certificate. It includes 4,661 questions spanning 34 academic subjects. To evaluate model performance comprehensively, FinEval uses several prompt types: zero-shot and few-shot prompts, each in answer-only and chain-of-thought form. In an evaluation of state-of-the-art Chinese and English LLMs on FinEval, only GPT-4 achieved an accuracy close to 70% across the different prompt settings, which indicates that LLMs have substantial room to improve in financial domain knowledge. Our work offers a more comprehensive benchmark for evaluating financial knowledge, drawing on mock-exam data and covering a wide range of evaluated LLMs.
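The prompt settings named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Question` type, the prompt wording, and the helper names are hypothetical and are not taken from FinEval's actual templates or code. Few-shot examples are shown in answer-only form; a chain-of-thought few-shot prompt would also include worked reasoning for each example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical multiple-choice item; not FinEval's real schema."""
    stem: str
    choices: dict[str, str]  # label -> option text, e.g. {"A": "..."}
    answer: str              # gold label, e.g. "B"


def format_question(q: Question, with_answer: bool = False,
                    chain_of_thought: bool = False) -> str:
    """Render one question; the template wording is an assumption."""
    lines = [q.stem]
    lines += [f"{label}. {text}" for label, text in q.choices.items()]
    if with_answer:
        lines.append(f"Answer: {q.answer}")
    elif chain_of_thought:
        lines.append("Answer: Let's think step by step.")
    else:
        lines.append("Answer:")
    return "\n".join(lines)


def build_prompt(target: Question, shots: tuple | list = (),
                 chain_of_thought: bool = False) -> str:
    """Zero-shot when `shots` is empty, few-shot otherwise."""
    blocks = [format_question(s, with_answer=True) for s in shots]
    blocks.append(format_question(target, chain_of_thought=chain_of_thought))
    return "\n\n".join(blocks)


def accuracy(predictions: list[str], questions: list[Question]) -> float:
    """Percentage of predicted labels that match the gold labels."""
    if not questions:
        return 0.0
    correct = sum(p.strip().upper() == q.answer
                  for p, q in zip(predictions, questions))
    return 100.0 * correct / len(questions)
```

Under this sketch, an accuracy "close to 70" corresponds to `accuracy(...)` returning a value near `70.0`, that is, roughly seven in ten predicted labels matching the gold answers.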



Evaluating the Performance of Large Language Models on GAOKAO Benchmark

Large language models have demonstrated remarkable performance across va...

CGCE: A Chinese Generative Chat Evaluation Benchmark for General and Financial Domains

Generative chat models, such as ChatGPT and GPT-4, have revolutionized n...

Evaluating the Generation Capabilities of Large Chinese Language Models

This paper presents CG-Eval, the first comprehensive evaluation of the g...

Beyond Classification: Financial Reasoning in State-of-the-Art Language Models

Large Language Models (LLMs), consisting of 100 billion or more paramete...

GLUECons: A Generic Benchmark for Learning Under Constraints

Recent research has shown that integrating domain knowledge into deep le...

Is Information Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors

ChatGPT has stimulated the research boom in the field of large language ...

Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? An Examination on Several Typical Tasks

The most recent large language models such as ChatGPT and GPT-4 have gar...
