What OpenAI’s new GPT-4o model means for developers

Short for GPT-4 Omni, OpenAI’s personality-filled new model was trained from the ground up to be multimodal, and is at once faster, cheaper, and more powerful than its predecessors, and possibly than most of its rivals as well.

This is incredibly significant for software developers who plan to leverage AI models in their own apps and features, a fact emphasized by OpenAI’s Head of Product, API, Olivier Godement, and a member of his team, Product Manager for Developer APIs and Models Owen Campbell-Moore, both of whom spoke exclusively to VentureBeat in a conference call yesterday.

Why should developers know and care about GPT-4o? Simple: they can now put OpenAI’s new tech into their own apps and services, be they customer-facing such as customer service chatbots, or internal and employee-facing, such as a bot that answers team members’ questions about company policies, expenses, time-off, equipment, support tickets or other common questions. Developers can even build whole businesses atop OpenAI’s latest, or older, AI models.


How GPT-4o differs from what came before

While OpenAI previously offered multimodal capabilities through its GPT-4, GPT-4V (vision), and GPT-4 Turbo models, those models all worked by converting inputs such as documents, attachments, images, and even audio files into corresponding text, which was then mapped to underlying tokens, with outputs delivered via the reverse process.

“Before GPT-4o, if you wanted to build a voice personal assistant, you basically had to chain or plug together three different models: 1. audio in, such as [OpenAI’s] Whisper; 2. text intelligence, such as GPT-4 Turbo; then 3. back out with text-to-speech,” Godement told VentureBeat.

“That sequencing, that changing of model, led to a few issues,” he added, highlighting latency and loss of information as big ones.

The new GPT-4o model dispenses with that daisy-chain mechanism, instead turning other forms of media directly into tokens, making it the first truly natively multimodal model trained by the company.
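The three-stage chain Godement describes can be sketched as follows. The stage functions here are illustrative stand-ins with canned return values, not real OpenAI SDK calls; the point is that only text crosses each stage boundary, so tone, multiple speakers, and background audio are lost along the way.

```python
# Hypothetical sketch of the pre-GPT-4o voice-assistant chain.
# Each function is an illustrative stub, not an actual model call.

def transcribe_audio(audio: bytes) -> str:
    # Stage 1: speech-to-text (e.g., a Whisper-class model).
    # Everything non-textual in the audio is discarded here.
    return "what's the weather like?"  # canned transcript for illustration

def generate_reply(prompt: str) -> str:
    # Stage 2: text intelligence (e.g., a GPT-4 Turbo-class model).
    return f"Reply to: {prompt}"

def synthesize_speech(text: str) -> bytes:
    # Stage 3: text-to-speech back out to the user.
    return text.encode("utf-8")

def voice_assistant_old(audio: bytes) -> bytes:
    # Each hop adds latency, and only plain text survives between stages,
    # which is the "loss of signal" Godement describes.
    transcript = transcribe_audio(audio)
    reply = generate_reply(transcript)
    return synthesize_speech(reply)
```

A natively multimodal model collapses these three hops into one, which is where both the latency win and the preserved signal come from.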

As a consequence, GPT-4o boasts an impressive speed boost in audio response time over its predecessor GPT-4: it can respond to audio inputs in as little as 232 milliseconds (320 milliseconds on average), comparable to human conversational response times, whereas GPT-4 looks sluggish by comparison, taking up to five seconds to respond.

By comparison, the old GPT-4 Voice Mode felt “a little laggy,” according to Godement.

Impressively, GPT-4o also extracts more information from multimodal inputs than its predecessors, resulting in greater accuracy both in understanding what a user means and in delivering the appropriate response.

While GPT-4/V/Turbo “can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion,” GPT-4o can do all of these things and more.

“Because there is a single model, there’s no loss of signal,” Godement said. “A good example: if you were talking to me in a very happy way, that information would likely be lost” by older models.

In addition, thanks to the new model’s increased performance and speed, OpenAI says it is passing the cost reductions associated with running inference on GPT-4o along to its community of paying third-party developer customers.

GPT-4o is available through OpenAI’s API at half the price of GPT-4: $5 USD per 1 million input tokens (what users prompt the model with) and $15 USD per 1 million output tokens (what the model responds with). OpenAI also charges $0.001275 USD to analyze a 150-by-150-pixel image with GPT-4o, compared to $0.00255 for doing the same with GPT-4 Turbo.
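A quick back-of-envelope calculator using the per-token prices quoted above ($5 per million input tokens, $15 per million output tokens for GPT-4o); the example request sizes are arbitrary:

```python
# Estimate per-request cost from the article's quoted GPT-4o prices.

GPT4O_INPUT_PER_M = 5.00    # USD per 1 million input tokens
GPT4O_OUTPUT_PER_M = 15.00  # USD per 1 million output tokens

def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float = GPT4O_INPUT_PER_M,
                 out_rate: float = GPT4O_OUTPUT_PER_M) -> float:
    """Estimated USD cost of a single API call."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A call with a 2,000-token prompt and a 500-token reply:
cost = request_cost(2_000, 500)  # 0.01 + 0.0075 = 0.0175 USD
```

Passing GPT-4’s doubled rates (e.g., `request_cost(2_000, 500, 10.0, 30.0)`) shows the same call costing twice as much on the older model.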

See full OpenAI API model pricing details here.

At the same time, OpenAI has also raised the rate limit (how many tokens can be sent to and from the model per minute) fivefold with the new model compared to its predecessors, from 2 million tokens per minute up to 10 million.

“You would see some big developer building some application and they’d be hitting the limit,” said Campbell-Moore.

But GPT-4o is now “vastly more efficient than GPT-4 Turbo, even though it has the same level of intelligence,” he continued. “And so this allows us to essentially run more traffic with the same number of GPUs [graphics processing units, the computer chips on which generative AI models run and train].”
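In practical terms, those limits translate into request throughput roughly as follows; the average request size below is an assumption chosen for illustration, not an OpenAI figure:

```python
# Rough requests-per-minute headroom at the quoted token-per-minute limits.

OLD_TPM = 2_000_000    # tokens/minute, previous limit per the article
NEW_TPM = 10_000_000   # tokens/minute, GPT-4o limit per the article
AVG_TOKENS_PER_REQUEST = 1_500  # assumed: prompt + completion combined

old_rpm = OLD_TPM // AVG_TOKENS_PER_REQUEST  # ~1,333 requests/minute
new_rpm = NEW_TPM // AVG_TOKENS_PER_REQUEST  # ~6,666 requests/minute
```

For a developer "hitting the limit," as Campbell-Moore put it, the fivefold raise means roughly five times the sustained request volume at the same average request size.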

“That’s a very big deal for people building applications,” Godement said. “One of the top issues we hear from developers over and over again is that LLMs are too slow and expensive. ‘They’re cool and capable, but not fast enough for me to put them in a critical path of my applications.’”

The new GPT-4o model seeks to persuade those reluctant developers to jump in and start building OpenAI into their apps.

What kinds of third-party apps should developers build using GPT-4o?

While GPT-4o can be easily swapped into existing third-party apps that are built upon or leverage OpenAI’s older GPT-3.5 Turbo and GPT-4 class models, the leaders of OpenAI’s API team see a whole new class of applications being enabled by the new GPT-4o model.
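For apps already on OpenAI’s Chat Completions API, the swap amounts to changing the model name. The sketch below mirrors the OpenAI Python SDK’s `chat.completions.create()` parameters; the network call itself is commented out so the snippet stays self-contained, and the system/user messages are invented for illustration:

```python
# Swapping GPT-4o into an existing Chat Completions integration
# is a one-line model change.

request = {
    "model": "gpt-4o",  # previously "gpt-4-turbo" or "gpt-3.5-turbo"
    "messages": [
        {"role": "system", "content": "You are a customer service assistant."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
}

# To actually send it (requires an API key in OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(**request)
# print(response.choices[0].message.content)
```

Because the request shape is unchanged, existing prompts, message histories, and response handling carry over as-is.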

“Any application that was doing personal assistant tasks, such as an educational assistant, or anything relying on audio, will immediately benefit,” by switching to GPT-4o as its underlying intelligence, according to Godement.

Meanwhile, the “drop in price, latency, and higher limits will benefit everyone,” he added. “That will raise the entire industry in getting new applications spun up.”

Longer-term, Godement said he believed the debut of GPT-4o marked a shift in the way humans have interacted with computers throughout history.

While the internet-connected smartphone has enabled more portability and a whole host of new kinds of applications as a result, the underlying graphical user interface (GUI) paradigms that users rely on to interact with the device — visual icons that open into applications — are virtually the same as those pioneered by Xerox PARC in the 1970s (I previously worked for Xerox from 2019-2022).

But “when you talk to people in the real world, many people much prefer speaking and listening compared to typing and reading,” asserted Godement. “My bet is that GPT-4o is going to usher in a wave of new applications and products that are truly audio-first.”

Data retention and security

Individual users of ChatGPT can select whether or not they want the company to store and train on their data and inputs: on desktop, open the “Settings” menu in the lower left corner, choose “Data Controls,” and switch the “Improve the model for everyone” toggle on or off (on sends your data to OpenAI for training).

However, for third-party developers using OpenAI’s API to build apps, the company by default does not use any submitted data, retaining it only for 30 days for trust and safety purposes. After 30 days, the data is deleted.

This is the case for all OpenAI models available through the API, including the new GPT-4o and even the new kinds of data it can register, such as a speaker’s tone from audio or emotion from a facial expression in a still image and, later, video.

“We do not collect any data in the API for training purposes,” Godement said. “There are no exceptions.”

Voice and visual data, like text inputs, are first collected through an application developer’s server and then sent to OpenAI’s servers for processing by the AI model — in this case GPT-4o.

A copy of this data is retained for 30 days to ensure the usage of the model is in keeping with OpenAI’s terms of use — that it is not being used for fraud, abuse, or other nefarious purposes. But this data is “only accessible to Trust and Safety team members,” Godement noted.

OpenAI deletes the data, but an application developer could still choose to keep a copy on their own servers or use it as they see fit — OpenAI can’t control what the developer does with it, as the developer and their application is the first recipient of a user’s data and ultimately decides where it goes and what happens to it.

Limitations compared to rivals

And yet, both GPT-4o and GPT-4 Turbo have a 128,000-token context window, smaller than those of rivals such as Google Gemini, fine-tuned variants of Meta’s Llama 3, and Anthropic’s Claude 3, all of which offer context windows of 200,000 to 1 million tokens.

As AI and LLM developers and some users know well, the context window matters because it sets the maximum amount of text or information a foundation model can take in from a user in a single prompt.

A 128,000-token context window is equivalent to roughly 300 pages of text from a book, according to OpenAI and press coverage of the company, so developers and their end users can still count on a tremendous amount of room with GPT-4o, but substantially less than rivals offer.
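Taking those figures at face value (128,000 tokens equating to about 300 book pages), a rough fit check looks like this; the tokens-per-page rate is derived from the article’s numbers, not an official figure:

```python
# Rough capacity math from the figures quoted in the article.

CONTEXT_WINDOW_TOKENS = 128_000
APPROX_BOOK_PAGES = 300  # OpenAI's rough equivalence for 128K tokens

tokens_per_page = CONTEXT_WINDOW_TOKENS // APPROX_BOOK_PAGES  # ~426

def fits_in_context(document_pages: int) -> bool:
    """Very rough check: will a document of this length fit in one prompt?"""
    return document_pages * tokens_per_page <= CONTEXT_WINDOW_TOKENS
```

By this estimate a 300-page document just fits, while anything longer has to be chunked or summarized before prompting.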

For now, GPT-4o is available for developers to start building with through OpenAI’s API, though it is limited to text and vision (still image) capabilities. OpenAI “will be rolling out the audio and video capabilities in the coming weeks, with a small group of trusted partners,” said Campbell-Moore, noting that the company would formally announce general availability of audio and video from its social accounts.
