Danger and opportunity for news industry as AI woos it for vital human-written copy

OpenAI, the developer of ChatGPT, knows that high-quality data matters in the artificial intelligence business – and news publishers have vast amounts of it.

“It would be impossible to train today’s leading AI models without using copyrighted materials,” the company said this year in a submission to the UK’s House of Lords, adding that limiting its options to books and drawings in the public domain would create underwhelming products.

AI labs construct large language models – the technology that underpins tools such as OpenAI’s leading chatbot – using trillions of words taken from the internet, a vital source of the material that allows LLMs to understand text-based prompts and predict an appropriate response.

OpenAI’s deal with the Financial Times this week underscores the US company’s need for reliable, legitimately sourced material, with the FT group’s chief executive, John Ridding, saying: “It’s clearly in the interests of users that these products contain reliable sources.”

As AI labs grow increasingly hungry for reliable, timely, and above all human-written text to make those responses as good as possible, the news industry is assessing how best to react: while many are stepping up the fight to defend their copyrighted turf, others are engaging with the big AI players to reach a compromise – and potentially gain some commercial advantage.

The New York Times landed the first major blow for the defence in December, suing OpenAI and Microsoft, the AI company’s biggest investor, for copyright infringement. In court filings, the paper demonstrated that OpenAI’s chatbots could be induced to recreate, near-verbatim, articles from its archive.

OpenAI, in response, argued that the NYT’s “prompting” was more than just unrealistic: the publisher, it said, used “deceptive prompts that blatantly violate OpenAI’s terms of use … The truth, which will come out in the course of this case, is that the Times paid someone to hack OpenAI’s products.”

The cold war between the NYT and OpenAI had been simmering for months before the lawsuit was launched. In August, the paper blocked OpenAI’s web crawler – which hoovers up data for its models – from accessing its website. The Guardian and the BBC followed.

Reuters and CNN have taken action to prevent the company from reading their material, a move that carries little legal weight but makes it harder in practical terms for news to be used as training data.
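These blocks typically rely on the robots exclusion protocol: a publisher adds a rule for OpenAI’s crawler, which identifies itself as GPTBot, to its robots.txt file, and compliant crawlers stop fetching pages. As a minimal sketch of how such a block can be checked (the site URL below is a hypothetical placeholder, not any publisher’s actual configuration), Python’s standard urllib.robotparser module reads a live robots.txt and reports whether a given user agent may fetch a page:

```python
# Minimal sketch: check whether a site's robots.txt disallows named AI crawlers.
# The site URL is a hypothetical placeholder; GPTBot is the user agent OpenAI's
# crawler publicly identifies itself with.
from urllib import robotparser

SITE = "https://www.example-news-site.com"   # hypothetical publisher
CRAWLERS = ["GPTBot", "Googlebot"]

parser = robotparser.RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetch and parse the live robots.txt

for agent in CRAWLERS:
    allowed = parser.can_fetch(agent, f"{SITE}/any-article")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

The protocol is advisory rather than enforceable, which is why such blocks carry little legal weight even as they make unlicensed collection harder in practice.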

In the months since, others have launched their own lawsuits. The independent publishers the Intercept, Raw Story and AlterNet sued in February, while in April the hedge fund Alden Global Capital, which owns eight US newspapers, launched a flurry of lawsuits targeting both ChatGPT and Microsoft’s Copilot AI.

Speaking in January, OpenAI’s chief executive, Sam Altman, appeared dismissive of the NYT’s relevance to its products. “Any one particular training source, it doesn’t move the needle for us that much,” he said.

Nonetheless, deals have been struck with news publishers who spot a new revenue stream, while OpenAI, as it said of this week’s FT deal, wants to “enrich the ChatGPT experience with real-time, world-class journalism”.

The deal lets OpenAI train future models on FT content, while giving the news group access to the AI developer’s tech and expertise to build tools for its own business. ChatGPT users will also receive summaries and quotes from FT journalism, as well as links to articles, in responses to prompts, where appropriate.

OpenAI has already signed content licensing deals with the US news agency the Associated Press, the French newspaper Le Monde, the El País owner Prisa Media and Germany’s Axel Springer, which publishes the Bild tabloid.

A spokesperson for Guardian News & Media, publisher of the Guardian, confirmed that it does not currently have a deal with OpenAI, but added that it remains in discussions with a range of leading AI companies.

The deals highlight the uncertain balance of power between AI and the media. Uncertain copyright protections and the ease of accessing material online have encouraged many AI companies to take their chances with unlicensed data, hoping they will be able to claim fair use in any legal battle. When they do need to license material, the commodity nature of much reporting encourages a “divide and conquer” approach: if only one deal is needed to keep a chatbot up to date with the latest news, the AI company holds considerable bargaining power.

Niamh Burns, a senior analyst at Enders Analysis, argues that OpenAI and the FT share enough incentives to sign a deal, but publishers and tech companies bring different perspectives to the negotiating table.

“Publishers say using their content to train LLMs is against their terms of use and that licensing is essential. OpenAI says it doesn’t breach copyright, and frames deals as voluntary support of the journalism sector,” she says.

“Licensing is still a grey area, but these early deals are setting some precedents. The problem for publishers is we have no idea what AI products will look like in a year’s time. They might not even know what to ask for.”

At the same time, the ravenous nature of AI models means they always need more data. OpenAI’s James Betker argued last year that the difference in quality between AI models was entirely down to the dataset. “Model behaviour is not determined by architecture, hyperparameters, or optimizer choices,” he said, referring to the technical choices made when training a language model. “It’s determined by your dataset, nothing else. Everything else is a means to an end in efficiently [delivering] compute to approximating that dataset.”

If true, it means a company with few tech skills but a sufficiently large dataset would find it easier to build a top-tier AI system than an equally well-resourced company with expert engineers but no access to training data – a very different balance of skills from that normally assumed. Either way, it underlines the importance of news publishers’ work to the next generation of AI models.

The Guardian
