Consistency is Key: Disentangling Label Variation in Natural Language Processing with Intra-Annotator Agreement
We commonly use agreement measures to assess the utility of judgements made by human annotators in Natural Language Processing (NLP) tasks. While inter-annotator agreement is frequently used as an indication of label reliability by measuring consistency between annotators, we argue for the additional use of intra-annotator agreement to measure label stability over time. However, in a systematic review, we find that the latter is rarely reported in this field. Calculating these measures can serve as an important form of quality control and provide insights into why annotators disagree. We propose exploratory annotation experiments to investigate the relationships between these measures and perceptions of subjectivity and ambiguity in text items.
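The following is a minimal sketch (not from the paper) illustrating the distinction the abstract draws: inter-annotator agreement compares labels from different annotators, while intra-annotator agreement compares labels from the same annotator at two points in time. Cohen's kappa is used here only as one common chance-corrected agreement measure; the annotator names, labels, and re-annotation round are hypothetical.

```python
# Sketch: inter- vs. intra-annotator agreement with Cohen's kappa.
# Labels and annotators below are illustrative, not data from the paper.
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels for the same eight text items.
annotator_a_round1 = [1, 0, 1, 1, 0, 1, 0, 0]
annotator_b_round1 = [1, 0, 0, 1, 0, 1, 1, 0]
annotator_a_round2 = [1, 0, 1, 0, 0, 1, 0, 0]  # annotator A re-labels the items later

# Inter-annotator agreement: two different annotators, same annotation round.
inter_kappa = cohen_kappa_score(annotator_a_round1, annotator_b_round1)

# Intra-annotator agreement: the same annotator across two rounds in time.
intra_kappa = cohen_kappa_score(annotator_a_round1, annotator_a_round2)

print(f"inter-annotator kappa: {inter_kappa:.2f}")
print(f"intra-annotator kappa: {intra_kappa:.2f}")
```

In this framing, a low intra-annotator kappa would flag unstable labels even when inter-annotator agreement looks acceptable, which is the kind of quality-control signal the abstract argues is rarely reported.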