Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI

by Mahima Pushkarna, et al.

As research and industry move towards large-scale models capable of numerous downstream tasks, the complexity of understanding multi-modal datasets that give nuance to models rapidly increases. A clear and thorough understanding of a dataset's origins, development, intent, ethical considerations, and evolution becomes a necessary step for the responsible and informed deployment of models, especially those in people-facing contexts and high-risk domains. However, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of the documentation. It requires consistency and comparability across the documentation of all datasets involved, and as such, documentation must be treated as a user-centric product in and of itself. In this paper, we propose Data Cards for fostering transparent, purposeful, and human-centered documentation of datasets within the practical contexts of industry and research. Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset's lifecycle for responsible AI development. These summaries provide explanations of the processes and rationales that shape the data and, consequently, the models: upstream sources, data collection and annotation methods, training and evaluation methods, intended use, and decisions that affect model performance. We also present frameworks that ground Data Cards in real-world utility and human-centricity. Using two case studies, we report on desirable characteristics that support adoption across domains, organizational structures, and audience groups. Finally, we present lessons learned from deploying over 20 Data Cards.



