AI Training Data Debacle: Massive Dataset Exposes Millions of Private Records

A massive open-source AI training dataset, DataComp CommonPool, has been found to contain millions of instances of exposed personal data, including images of passports, credit cards, and birth certificates. Researchers estimate that hundreds of millions of images containing personally identifiable information, from faces to sensitive documents, are likely included in the collection of 12.8 billion image-text samples used to train generative AI models.
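To make the auditing problem concrete, here is a minimal sketch of how one might flag credit-card-like strings in a sample's caption or alt text. This is purely illustrative: the `scan_caption` and `luhn_valid` helpers are hypothetical, and the researchers' actual detection pipeline is not described in the article.

```python
import re

def luhn_valid(number: str) -> bool:
    """Luhn checksum: filters out most random digit runs that are not real card numbers."""
    digits = [int(d) for d in number][::-1]
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# 13-19 digits, optionally separated by spaces or hyphens
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def scan_caption(caption: str) -> list[str]:
    """Return substrings of the caption that look like valid payment card numbers."""
    hits = []
    for match in CARD_PATTERN.finditer(caption):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(match.group())
    return hits

# Example with a standard Visa test number (not a real card)
print(scan_caption("order receipt, card 4111 1111 1111 1111, total $42"))
```

Even a simple filter like this illustrates the scale issue: run across billions of samples, both false positives and false negatives are unavoidable, which is why the researchers describe their counts as estimates.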

The findings, published on arXiv, highlight a significant privacy risk stemming from the indiscriminate web scraping used to build such datasets. Experts warn that web-scraped data at this scale invariably includes problematic and sensitive content. While the dataset's curators attempted to blur faces automatically, the study found that the detection algorithm missed millions of them.
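To illustrate why automated blurring is lossy, here is a minimal sketch of the standard detect-then-blur approach, using the Haar cascade face detector bundled with OpenCV. This is a generic example under assumed tooling, not the CommonPool curators' actual pipeline, which the article does not detail.

```python
import cv2  # pip install opencv-python

# Off-the-shelf frontal-face detector shipped with OpenCV
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def blur_faces(image_path: str, output_path: str) -> int:
    """Blur every detected face in an image; returns the number of faces found."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Replace each detected face region with a heavily blurred version
        roi = image[y:y + h, x:x + w]
        image[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)
    cv2.imwrite(output_path, image)
    return len(faces)
```

Detectors like this routinely miss profile views, small, occluded, or poorly lit faces. Applied across billions of images, even a small per-image miss rate compounds into the millions of unblurred faces the study reports.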

DataComp CommonPool was built using the same data-sourcing practices as the LAION-5B dataset, which was used to train prominent AI models such as Stable Diffusion and Midjourney. This shared lineage suggests that similar privacy concerns likely extend to LAION-5B and to any models trained on CommonPool data. Hugging Face, the platform hosting CommonPool, provides a tool for users to remove their data, but it only helps individuals who already know their data is in the dataset.

Experts like Tiffany Li of the University of New Hampshire School of Law stress that even if data is deleted from a training set, the harm may be irreversible: models already trained on that data retain what they learned unless they are retrained. The researchers call for a critical reevaluation of web-scraping practices in AI development, warning of potential violations of privacy regulations and expressing hope that the findings spark improvements in how AI datasets are created and used.