Personal data leaks in AI datasets
A large-scale privacy study has revealed serious risks in using publicly available datasets to train generative AI models. The dataset in question, DataComp CommonPool, was assembled through web scraping and contained more than 12.8 billion samples as of 2023. Analyzing even a small sample (0.1%), the researchers found thousands of images containing personal data, including scans of passports, credit cards, birth certificates, resumes, and other confidential documents.
The researchers estimate that the total number of images containing personally identifiable information may reach hundreds of millions. They paid special attention to employment documents: resumes and cover letters containing sensitive details about health, background-check results, place of residence, and marital status, as well as information about family members and references. In some cases, such documents could easily be linked to specific people through publicly available online profiles, giving attackers access to email addresses, home addresses, and government identifiers.
DataComp CommonPool was created as a successor to LAION-5B, a widely used dataset for training image generators, including models such as Stable Diffusion and Midjourney. Both datasets were assembled by automated web scraping between 2014 and 2022. Although the developers of CommonPool stated research purposes and open access, the license did not exclude commercial use, which significantly expanded the scope of potential risk.
Among the key problems is the ineffectiveness of automatic anonymization. In the study sample, the researchers identified more than 800 faces that had not been blurred, suggesting that the full dataset contains more than 100 million similar images. The curation pipeline also lacked filters for automatically detecting PII such as email addresses, social security numbers, and bank details.
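The kind of PII filter the pipeline lacked can be sketched as a simple regular-expression scan over text extracted from samples. This is a minimal illustration, not the study's actual tooling: the pattern set, field names, and matching logic below are assumptions, and a production filter would need validation, context checks, and ML-based entity recognition on top of this.

```python
import re

# Illustrative regex patterns for a few common PII types.
# These are assumptions for demonstration only; real curation
# pipelines need far more robust detection.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_for_pii(text: str) -> dict:
    """Return a mapping of PII type -> matches found in the text."""
    hits = {}
    for label, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits

sample = "Contact: jane.doe@example.com, SSN 123-45-6789"
print(scan_for_pii(sample))
# → {'email': ['jane.doe@example.com'], 'us_ssn': ['123-45-6789']}
```

In a curation pipeline, samples with any hits would be flagged for blurring, redaction, or exclusion rather than silently kept.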
Although CommonPool is distributed through a platform that accepts requests to delete personal data, only users who know their data is in the dataset can exercise that right. Moreover, if trained models have already ingested this data, removing it from the original dataset does not guarantee that its traces are removed from the models.
The researchers emphasize the urgent need to revise ethical and legal norms in machine learning. Current regulatory frameworks, in both Europe and the United States, contain loopholes that allow publicly available data to be used in ways that circumvent basic privacy principles. The lack of strict regulation in this area threatens mass dissemination of personal data, uncontrolled model training, and a loss of trust in artificial intelligence technologies.