USA: Personal data leak detected in one of the largest AI training datasets

A large-scale privacy study has revealed serious risks in using publicly available datasets to train generative artificial intelligence models. The dataset in question is DataComp commonPool, which was built by web scraping and contained more than 12.8 billion samples as of 2023. After analyzing only a small sample (0.1%), experts found thousands of images containing personal data, including scans of passports, credit cards, birth certificates, resumes and other confidential documents.

Researchers estimate that the total number of images containing personal information may reach hundreds of millions. They paid special attention to employment documents: resumes and cover letters containing sensitive information about health, background check results, place of residence and marital status, as well as data on family members and references. In some cases, such documents could easily be linked to specific people through publicly available online profiles, which could give attackers access to email addresses, home addresses, and government identifiers.

DataComp commonPool was created as a continuation of the LAION-5B project, a widely used dataset for training image generators, including models such as Stable Diffusion and Midjourney. Both datasets were built by automated scraping of web content dating from 2014 to 2022. Although the developers of commonPool cited scientific purposes and open access, the license did not rule out commercial use, which significantly widened the potential risk.

Among the key problems is the ineffectiveness of automatic anonymization methods. In the study sample, more than 800 unblurred faces were identified, from which the researchers estimate that the full dataset contains more than 100 million such images. The dataset was also built without filters for automatically detecting PII, such as email addresses, social security numbers, and bank details.
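To make the idea of an automatic PII filter concrete, here is a minimal sketch in Python of the kind of string matching the study says was missing. It is not the researchers' method or any part of the commonPool pipeline: the patterns, the function names find_pii and has_pii, and the assumption that the filter runs over the text attached to an image (for example its caption) are all hypothetical and chosen for illustration only.

```python
import re

# Hypothetical patterns for illustration only; real PII detection needs
# broader coverage and validation than these simple expressions provide.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US format: 123-45-6789


def find_pii(text: str) -> dict[str, list[str]]:
    """Return the email-like and SSN-like strings found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "ssns": SSN_RE.findall(text),
    }


def has_pii(text: str) -> bool:
    """True if any pattern matches, so the sample can be flagged for review."""
    found = find_pii(text)
    return bool(found["emails"] or found["ssns"])
```

A filter like this only catches strings that follow a fixed format and appear as text. It would not detect personal data that is visible only inside an image, such as a scanned passport or an unblurred face.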

Although commonPool is distributed through a platform that accepts requests to delete personal data, only people who know that their data is in the dataset can exercise this right. Moreover, if models have already been trained on this data, removing it from the original dataset does not guarantee that its traces will be removed from those models.

The researchers emphasize the need for an urgent review of ethical and legal norms in machine learning. The current regulatory frameworks in both Europe and the United States contain loopholes that allow publicly available data to be used in ways that circumvent basic privacy principles. The lack of strict regulation in this area creates a threat of mass dissemination of personal data, uncontrolled training of models, and loss of trust in artificial intelligence technologies.

2 days ago

Maili News

Maili.uz, a news portal of Uzbekistan.
