An Empirical Study on the Overlapping Problem of Open-Domain Dialogue Datasets

01/17/2022
by   Yuqiao Wen, et al.

Open-domain dialogue systems aim to converse with humans through text, and research on them has relied heavily on benchmark datasets. In this work, we first identify an overlapping problem in DailyDialog and OpenSubtitles, two popular open-domain dialogue benchmark datasets. Our systematic analysis then shows that such overlapping can be exploited to obtain fake state-of-the-art performance. Finally, we address this issue by cleaning these datasets and setting up a proper data processing procedure for future research.
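
To make the idea of overlap concrete, the following is a minimal sketch and not the authors' procedure. It assumes that "overlapping" means a test sample whose (context, response) pair also appears in the training set after light normalization; the abstract does not give the exact definition. The names `normalize`, `overlap_rate`, and `remove_overlap` are hypothetical helpers introduced only for illustration.

```python
from typing import Iterable, List, Tuple

# Assumption: a dialogue sample is a (context, response) pair of strings.
Pair = Tuple[str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def _key(pair: Pair) -> Pair:
    """Normalized form of a sample, used as the comparison key."""
    context, response = pair
    return (normalize(context), normalize(response))


def overlap_rate(train: Iterable[Pair], test: Iterable[Pair]) -> float:
    """Fraction of test samples that also occur in the training set."""
    seen = {_key(p) for p in train}
    test_list = list(test)
    if not test_list:
        return 0.0
    hits = sum(1 for p in test_list if _key(p) in seen)
    return hits / len(test_list)


def remove_overlap(train: Iterable[Pair], test: Iterable[Pair]) -> List[Pair]:
    """Return the test samples that do not occur in the training set."""
    seen = {_key(p) for p in train}
    return [p for p in test if _key(p) not in seen]


if __name__ == "__main__":
    # Toy data, invented for this example only.
    train = [("Hi", "Hello there"), ("How are you?", "Fine")]
    test = [("hi", "hello  there"), ("Bye", "See you")]
    print(overlap_rate(train, test))
    print(remove_overlap(train, test))
```

A set lookup keeps the check linear in the size of the two splits. Exact matching after normalization is only one possible criterion; near-duplicate detection would need a different comparison.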
