Augmented Datasheets for Speech Datasets and Ethical Decision-Making

Speech datasets are crucial for training Speech Language Technologies (SLT); however, the lack of diversity of the underlying training data can lead to serious limitations in building equitable and robust SLT products, especially along dimensions of language, accent, dialect, variety, and speech impairment - and the intersectionality of speech features with socioeconomic and demographic features. Furthermore, there is often a lack of oversight on the underlying training data - commonly built on massive web-crawling and/or publicly available speech - with regard to the ethics of such data collection. To encourage standardized documentation of such speech data components, we introduce an augmented datasheet for speech datasets, which can be used in addition to "Datasheets for Datasets". We then exemplify the importance of each question in our augmented datasheet based on in-depth literature reviews of speech data used in domains such as machine learning, linguistics, and health. Finally, we encourage practitioners - ranging from dataset creators to researchers - to use our augmented datasheet to better define the scope, properties, and limits of speech datasets, while also encouraging consideration of data-subject protection and user community empowerment. Ethical dataset creation is not a one-size-fits-all process, but dataset creators can use our augmented datasheet to reflexively consider the social context of related SLT applications and data sources in order to foster more inclusive SLT products downstream.




Considerations for Ethical Speech Recognition Datasets

Speech AI Technologies are largely trained on publicly available dataset...

AI4D – African Language Dataset Challenge

As language and speech technologies become more advanced, the lack of fu...

Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation

The automatic detection of hate speech online is an active research area...

Does Speech enhancement of publicly available data help build robust Speech Recognition Systems?

Automatic speech recognition (ASR) systems play a key role in many comme...

On the Challenges of Building Datasets for Hate Speech Detection

Detection of hate speech has been formulated as a standalone application...

APEACH: Attacking Pejorative Expressions with Analysis on Crowd-Generated Hate Speech Evaluation Datasets

Detecting toxic or pejorative expressions in online communities has beco...

Directions in Abusive Language Training Data: Garbage In, Garbage Out

Data-driven analysis and detection of abusive online content covers many...
