# Who owns the knowledge that teaches AI?

**Authors:** Charles Salvaudon
**Categories:** Opinion
**Tags:** Analysis
**Last Updated:** 2026-06-17T10:14:41.934Z
**Reading Time:** 9 min read

---

## Summary

At the World News Media Congress by WAN-IFRA in Marseille (1–3 June 2026), the New York Times Chairman and Publisher A.G. Sulzberger called on publishers to unite against AI giants. Charles Salvaudon traces the legal, economic and cultural battle taking shape around training data.

---

Artificial intelligence is routinely presented as the greatest technological revolution since the internet. Its capabilities are striking: writing text, generating images and video, translating languages instantly, assisting with decisions, retrieving documents, producing code… Behind this apparent magic, however, lies a far less discussed reality. **AI models do not invent their knowledge.** They learn it from vast quantities of content produced by human beings.

Press articles, books, academic studies, photographs, illustrations, music, videos, blogs, forums, encyclopaedias, professional publications: everything that has ever been published online or digitised constitutes **valuable raw material for training algorithms.** The question this raises is fundamental. Are the AI giants building their fortunes on the back of a form of mass appropriation of content created by others?

# An industry built on a free resource

Since the emergence of large language models, the economic value of data has reached unprecedented levels. The companies that dominate the AI sector today have access to gigantic volumes of content, and that access constitutes a decisive advantage. **Training a modern model requires absorbing billions of documents.** The more numerous, varied and high-quality the data, the better the system performs.

The problem is that most of this content was not created to feed artificial intelligence. Researchers publish to advance knowledge. Photographers sell their images. Companies produce studies for their clients. Yet a substantial portion of these works has been scraped by automated bots and fed into training datasets.

For a long time, this collection process took place in relative opacity. Technology companies frequently invoked the **public nature of information available on the internet** to justify their use of it. But being visible is not the same as being freely exploitable. A press article available online remains protected by copyright. **A photograph published on a website remains the property of its creator.** A digitised book does not automatically enter the public domain. The argument that any accessible content can be used without compensation is increasingly being challenged.

# Creators discover the scale of the problem

For years, most content producers had no idea their work was being used to train artificial intelligence. A series of revelations about the datasets used by major players in the sector has gradually changed that. Writers discovered that their books appeared in massive data compilations. Illustrators found that image-generation systems could reproduce their style with unsettling precision. Media organisations realised their articles were being used to power tools capable of producing competing summaries. A sense of injustice spread rapidly.

Creators point to a striking paradox: AI platforms valued at hundreds of billions of dollars rely on content whose authors have received no remuneration. The impression is one of **a massive transfer of value.** Content producers fund intellectual creation and bear the costs of research, investigation and production, while AI companies capture a growing share of the economic value generated by those works. For many, the situation resembles **a form of resource extraction without consent,** comparable to the free exploitation of a raw material.

# The historical precedent of digital platforms

This controversy is not entirely new. The history of the internet is punctuated by similar conflicts. Search engines were long accused of profiting from media content without fairly sharing the revenues generated. Social networks built their power on content published freely by their users. Streaming platforms profoundly transformed the economics of music, often to the detriment of artists.

Artificial intelligence simply takes this logic to a higher level. Where search engines still directed users to the original source, **generative models produce a synthetic answer directly,** reducing the need to consult the original creator. Tomorrow, a user could obtain a detailed summary of a book, an analysis of an article or the synthesis of a study without ever visiting the original site. The economic consequences for content producers could be considerable.

# A threat to the knowledge ecosystem

Beyond the legal questions, **the debate concerns the long-term viability of intellectual creation.** Producing quality content takes time, money and expertise. An investigative journalism piece may require several months of work. A scientific study can mobilise years of research. A book often represents thousands of hours of effort. If AI models capture the bulk of user attention while freely using human productions, a systemic risk emerges. Why invest in creating original content if the economic value is captured elsewhere?

This question is particularly acute for the media. Professional journalism plays an essential role in the functioning of democracies. Yet its economic model is **already weakened by digital competition.** Over the past decade, the press in France has lost some 50% of its advertising revenues to digital platforms such as Google and Meta, according to Pierre Louette, former chief executive of the Groupe Les Échos-Le Parisien. The broader picture is equally striking: in 1998, when Google was founded, newspapers captured roughly half of global advertising spending. Today, that share has fallen to less than 10%.

If conversational assistants become the primary interface through which people access information, publishers’ revenues could fall further, potentially to the point of disappearance. In the long run, the paradox is evident: **without content producers, AI models would have no quality raw material left to exploit.** The risk is one of a progressive impoverishment of the information ecosystem.

The alarm is now sounding at the highest level. At the World News Media Congress, organised by WAN-IFRA in Marseille from 1 to 3 June, the chairman and publisher of the New York Times, Arthur Gregg Sulzberger, [called on publishers worldwide to unite against the AI giants](https://wan-ifra.org/2026/06/nyts-sulzberger-condemns-ai-giants-for-brazen-theft-of-intellectual-property/), warning that **the very foundations of independent journalism were at stake.**

# Legal battles multiply

Faced with this situation, legal action is proliferating around the world. **Authors, artists, media organisations and publishing groups have launched proceedings against several AI companies.** Five publishing groups (the Dutch company Elsevier, the American firms Cengage Learning and McGraw Hill, the French publisher Hachette, and the British company Macmillan Publishers), joined by American author Scott Turow, filed a class action lawsuit on 5 May 2026 against Meta Platforms and its chief executive Mark Zuckerberg, alleging wilful copyright infringement. They accuse Meta of having used millions of protected works, including literary texts, educational materials and scientific publications, without authorisation to train its large language model, Llama. In November 2025, the Munich Regional Court ruled that OpenAI had infringed the copyright in song lyrics in Germany, the first ruling of its kind in Europe.

The core of the debate is straightforward. **Does training a model constitute a legitimate use of works, or a violation of copyright?** Technology companies generally argue that statistical learning constitutes a transformative use and does not amount to a conventional reproduction of content. Plaintiffs counter that the mass exploitation of protected works without authorisation or compensation constitutes an infringement of their rights. The judicial decisions to come could profoundly redraw the rules of the digital economy. The stakes reach well beyond AI; they concern **intellectual property in the data economy as a whole.**

# Towards a licensing economy

Aware of the legal risk, some technology companies have begun negotiating agreements with content holders. In 2025, **Anthropic agreed to pay at least 1.5 billion dollars into a compensation fund for authors, rights holders and publishers** who had sued the company for illegally downloading millions of books. Press groups have also signed partnerships with AI players: Meta with News Corp, the dailies Le Figaro and Süddeutsche Zeitung, the television channels CNN and Fox News, the newspaper Le Monde and the Spanish group Prisa Media. **Le Monde and Prisa Media are also partners of OpenAI,** alongside the American news agency AP and the German media group Axel Springer. In parallel, publishers have agreed to provide access to their catalogues in exchange for payment, while image banks are developing specific licences for model training.

This evolution may mark the beginning of a new equilibrium. The idea is not necessarily to prohibit the use of content, but **to create a value-sharing mechanism.** Just as artists receive royalties when their music is broadcast, creators could be remunerated when their works contribute to training AI systems. This approach would reconcile technological innovation with respect for creative work.

# The geopolitical challenge of cultural sovereignty

The question also extends beyond the economic frame. The data used to train models reflects worldviews, languages, cultures and systems of values. Today, a significant share of AI infrastructure is controlled by a small number of predominantly American companies. This concentration raises questions of sovereignty. **Who decides which content is incorporated into models?** Which cultures are represented? Which languages receive the largest investments? Which narratives dominate the global information space?

The mass exploitation of content is therefore not only a matter of author remuneration. It also touches on **the preservation of cultural and intellectual diversity.** In a world where conversational assistants are becoming major intermediaries in accessing knowledge, control over training data constitutes **a form of strategic power.**

# A new frontier of digital capitalism

Economic history shows that every industrial revolution rests on a key resource. Coal powered the first industrial revolution. Oil sustained the growth of the twentieth century. **Data has become the strategic resource of the twenty-first.** But unlike traditional raw materials, this resource is produced daily by billions of individuals, companies, researchers, journalists and creators.

The current debate therefore concerns a fundamental question: who actually owns this wealth? AI giants tend to view data as an indispensable fuel for innovation. Creators view it as the product of their labour. States see it as a matter of economic and cultural sovereignty. **The confrontation taking shape around training content could become one of the defining economic conflicts of the decade.**

# Towards a new social contract for artificial intelligence

Artificial intelligence does not create from nothing. It draws on an **immense intellectual heritage** built by generations of authors, researchers, artists, journalists and entrepreneurs. The spectacular success of the AI giants rests in large part on this accumulated wealth of knowledge.

The real question is therefore not whether AI should use existing content. Without it, it simply could not function. The question is determining **under what rules that use should take place.** A sustainable system will probably need to rest on three principles:

1. Transparency about the data used
2. Consent from creators
3. A fair share of the value produced

Without this, there is a serious risk that artificial intelligence will thrive by weakening precisely those who produce the knowledge on which it depends.

The future of AI will not be decided solely in laboratories or data centres. It will also be decided by our collective capacity to define **a new social contract between technological innovation and human creation.**

## Key Takeaways

1. AI models do not invent their knowledge. They learn it from vast quantities of content produced by human beings, a substantial portion of it scraped without authorisation or compensation.
2. Being publicly visible is not the same as being freely exploitable. A press article, a photograph or a digitised book remains protected by copyright, regardless of where it appears online.
3. The economic logic is stark. Content producers bear the full cost of creation while AI companies capture a growing share of the value those works generate, raising the risk of a progressive impoverishment of the information ecosystem.
4. Legal battles are multiplying worldwide, and the judicial decisions to come could profoundly redraw the rules of the digital economy, well beyond the AI sector alone.
5. A sustainable system will need to rest on transparency about the data used, consent from creators, and a fair share of the value produced. Without this, AI risks thriving by weakening precisely those who produce the knowledge it depends on.


---

*Article from [Albert's Deep Dive](https://deepdive.albertschool.com) - Albert School's Journal*
