
USA: Personal data leak detected in one of the largest AI training datasets

A large-scale privacy study has revealed serious risks in using publicly available datasets to train generative AI models. The dataset in question is DataComp CommonPool, which was assembled by web scraping and contained more than 12.8 billion samples as of 2023. Even in a small sample (0.1%), the researchers found thousands of images containing personal data, including scans of passports, credit cards, birth certificates, resumes and other confidential documents.

The researchers estimate that the total number of images containing personal information could reach hundreds of millions. They paid special attention to job application documents (resumes and cover letters) containing sensitive information about health, background check results, place of residence and marital status, as well as data on family members and references. In some cases, such documents could easily be linked to specific people through publicly available online profiles, which would give attackers access to email addresses, home addresses, and government identifiers.

DataComp CommonPool was created as a follow-up to LAION-5B, a widely used dataset for training image generators, including models such as Stable Diffusion and Midjourney. Both datasets were built from automated web scraping carried out between 2014 and 2022. Although the developers of CommonPool stated that it was intended for research and open access, its license does not prohibit commercial use, which significantly widens the potential risk.

Among the key problems is the ineffectiveness of automated anonymization. In the study sample, more than 800 unblurred faces were identified, which suggests that there are more than 100 million similar images in the entire dataset. The dataset was also not filtered for recognizable PII strings such as email addresses, social security numbers, and bank details.
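To illustrate what a string-level PII filter of this kind might look like, here is a minimal Python sketch. It is not the method used by the dataset's developers or by the researchers. The patterns and the names `EMAIL_RE`, `SSN_RE`, `find_pii` and `has_pii` are assumptions made for this example, and simple regular expressions like these will both miss real PII and flag harmless strings.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not a real
# dataset's filter). They catch common formats and nothing more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US SSN written as 123-45-6789


def find_pii(text):
    """Return PII-like strings found in a caption or alt text, grouped by kind."""
    return {
        "email": EMAIL_RE.findall(text),
        "ssn": SSN_RE.findall(text),
    }


def has_pii(text):
    """True if any of the illustrative patterns matches the text."""
    return any(find_pii(text).values())
```

A scraping pipeline could call `has_pii` on each caption and drop or flag matching samples. Text-only checks of this kind do nothing for personal data that appears inside the image itself, such as a scanned passport.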

Although CommonPool is distributed through a platform that accepts requests to delete personal data, only people who know that their data is in the dataset can exercise this right. Moreover, if trained models have already absorbed the data, removing it from the original dataset does not guarantee that its traces are removed from those models.

The researchers stress the need for an urgent review of ethical and legal norms in machine learning. Current regulations in both Europe and the United States contain loopholes that allow publicly available data to be used in ways that circumvent basic privacy principles. The lack of strict regulation in this area creates a risk of mass dissemination of personal data, uncontrolled model training and loss of trust in artificial intelligence technologies.

Maili News

Maili.uz, a news portal of Uzbekistan.
