USA: Personal data leak detected in one of the largest AI training datasets

A large-scale privacy study has revealed serious risks in using publicly available datasets to train generative AI models. The dataset in question is DataComp CommonPool, which was built by web scraping and contained more than 12.8 billion samples as of 2023. In analyzing even a small sample (0.1%), the researchers found thousands of images containing personal data, including scans of passports, credit cards, birth certificates, resumes and other confidential documents.

The researchers estimate that the total number of images containing personally identifiable information could reach hundreds of millions. They paid special attention to employment documents, namely resumes and cover letters, which contained sensitive information about health, background check results, place of residence and marital status, as well as data on family members and references. In some cases, such documents could easily be linked to specific people through publicly available online profiles, which could give attackers access to email addresses, home addresses and government identifiers.

DataComp CommonPool was created as a follow-up to LAION-5B, a dataset widely used to train image generators, including models such as Stable Diffusion and Midjourney. Both datasets were built from automated web scraping carried out between 2014 and 2022. Although the developers of CommonPool cited research purposes and open access, its license did not rule out commercial use, which significantly widens the potential risk.

Among the key problems is the ineffectiveness of automatic anonymization methods. In the study sample, the researchers identified more than 800 faces that had not been blurred, which they say suggests more than 100 million such images in the entire dataset. The dataset also lacked filters for automatically detecting PII strings such as email addresses, social security numbers and bank details.
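To illustrate what a basic filter for PII strings might look like, here is a minimal Python sketch. It is not the method used by the researchers or by the dataset's developers. The patterns, the function names `find_pii` and `has_pii`, and the idea of running them over captions or extracted text are all assumptions made for illustration. The patterns check format only and would miss many real cases.

```python
import re

# Hypothetical patterns for illustration only; real PII detection needs far more care.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# US Social Security number written as AAA-GG-SSSS (format check only, no validation).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_pii(text: str) -> dict[str, list[str]]:
    """Return the email-like and SSN-like strings found in a piece of text."""
    return {
        "email": EMAIL_RE.findall(text),
        "ssn": SSN_RE.findall(text),
    }


def has_pii(text: str) -> bool:
    """True if any pattern matches, so the sample could be flagged for review."""
    return any(find_pii(text).values())


if __name__ == "__main__":
    caption = "Resume of Jane Doe, jane.doe@example.com, SSN 123-45-6789"
    print(find_pii(caption))
    print(has_pii("a photo of a cat on a sofa"))
```

A filter like this only sees text, so it would not catch personal data visible inside the images themselves, such as a scanned passport.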

Although CommonPool is distributed through a platform that accepts requests to delete personal data, only people who know their data is in the dataset can exercise that right. Moreover, if trained models have already absorbed this data, removing it from the original dataset does not guarantee that its traces are removed from those models.

The researchers stress the need for an urgent review of ethical and legal norms in machine learning. The current regulatory frameworks in both Europe and the United States contain loopholes that allow publicly available data to be used in ways that circumvent basic privacy principles. The lack of strict regulation in this area creates a risk of mass dissemination of personal data, uncontrolled model training and loss of trust in artificial intelligence technologies.

Maili News

Maili.uz is a news portal of Uzbekistan.
