Global: Enormous data pipelines powering major generative AI systems are rooted in mass invasions of privacy by design

Source: Amnesty International

Companies are extracting vast troves of online data through unlawful web scraping to build their generative artificial intelligence (AI) products in a way that is enabling a mass invasion of privacy, making these systems unlawful by design, Amnesty International said in a new briefing today.

The briefing, “Unlawful by Design: Exposing the Human Rights Costs of Generative AI”, documents serious risks in the large-scale data scraping and processing being used to build and train these systems, including violations of the right to privacy by design and adverse consequences for the environment and historically marginalized communities.

“Companies across the world are supplying generative AI products under the veneer of efficiency and sophistication, but in reality, these systems perpetuate mass invasions of privacy through unlawful web scraping: an automated process for extracting data from websites, including personal data, such as images and social media activity, to train AI models,” said Likhita Banerji, Head of the Algorithmic Accountability Lab, Amnesty International.

“The extractive data pipeline, inherent design choices made by tech companies and exploitative supply chains, to build generative AI systems have enabled a paradigm of technology development that opens up a risk of mass abuse of human rights.”

Amnesty International researched the models powering some of the most popular publicly available standalone generative AI tools, including GPT-3 by OpenAI, Google’s Gemini, Meta’s Llama, DeepSeek and tools by Midjourney and Stable Diffusion.

Such systems rely on extracting information from billions of public online posts and images, often without the explicit consent of the individuals appearing in or creating them. Not only does this infringe on privacy by design, but as the datasets powering AI models scale up, the presence of hateful and discriminatory content in their outputs is also amplified, along with negative stereotypes and prejudices, especially along racial and gendered lines.

Racial, gender and cultural biases are consistent features of generative AI systems, a product of the training data that is largely pulled from the web and therefore polluted with real-world biases which harm historically marginalized communities. Additionally, generative AI systems pose risks to the right to freedom of thought as they are capable of influencing users’ thoughts and shaping their personal beliefs through predictive suggestions. This is especially true for larger models reliant on expansive training data.

“These choices are not inevitable. We must challenge the design choices adopted by companies who build generative AI systems by relying on training data, including personal data, that is extracted non-consensually and on a grand scale,” said Likhita Banerji.

“This is one of the most egregious practices among AI companies operating with disregard for human rights and must urgently be addressed. A different trajectory of technology development is possible if authorities act urgently to course correct.”