Google takes on GPT-4o with Project Astra, an AI agent that understands dynamics of the world

Today, at its annual I/O developer conference in Mountain View, Google made a ton of announcements focused on AI, including Project Astra – an effort to build a universal AI agent of the future.

An early version was demoed at the conference; however, the idea is to build a multimodal AI assistant that sits alongside the user as a helper, sees and understands the dynamics of the world, and responds in real time to help with routine tasks and questions. The premise is similar to what OpenAI showcased yesterday with GPT-4o-powered ChatGPT.

That said, as GPT-4o begins to roll out over the coming weeks for ChatGPT Plus subscribers, Google appears to be moving a tad slower. The company is still working on Astra and has not shared when its full-fledged AI agent will be launched. It only noted that some features from the project will land on its Gemini assistant later this year.

What to expect from Project Astra?

Building on the advances of Gemini 1.5 Pro and other task-specific models, Project Astra – short for advanced seeing and talking responsive agent – lets a user interact with an assistant while sharing the complex dynamics of their surroundings. The assistant understands what it sees and hears and responds with accurate answers in real time.

“To be truly useful, an agent needs to understand and respond to the complex and dynamic world just like people do — and take in and remember what it sees and hears to understand context and take action. It also needs to be proactive, teachable and personal, so users can talk to it naturally and without lag or delay,” Demis Hassabis, the CEO of Google DeepMind, wrote in a blog post.

In one of the demo videos released by Google, recorded in a single take, a prototype Project Astra agent, running on a Pixel smartphone, was able to identify objects, describe their specific components and understand code written on a whiteboard. It even identified the neighborhood by seeing through the camera viewfinder and displayed signs of memory by telling the user where they kept their glasses. 

Google Project Astra in action

The second demo video showed similar capabilities, including an agent suggesting improvements to a system architecture, but this time through a pair of glasses that overlaid the results on the user's view in real time.

Hassabis noted that while Google had made significant advancements in reasoning across multimodal inputs, getting the agents' response time down to human conversational levels was a difficult engineering challenge. To solve this, the company's agents process information by continuously encoding video frames, combining the video and speech input into a timeline of events, and caching this information for efficient recall.

“By leveraging our leading speech models, we also enhanced how they sound, giving the agents a wider range of intonations. These agents can better understand the context they’re being used in, and respond quickly, in conversation,” he added.
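Google has not published implementation details, but the pipeline Hassabis describes – continuously encoding frames, merging video and speech into a timeline of events, and caching that timeline for recall – can be sketched roughly as follows. This is a minimal illustration only; every name in it (Event, EventTimeline, the encoder callables) is an assumption made for the example, not Google's actual API.

```python
# Illustrative sketch only -- not Google's implementation. It mirrors the
# pipeline described in the blog post: encode video frames continuously,
# merge video and speech into a timeline of events, and cache the timeline
# for efficient recall.
from collections import deque
from dataclasses import dataclass
import time


@dataclass
class Event:
    timestamp: float        # when the event was observed
    modality: str           # "video" or "speech"
    embedding: list[float]  # hypothetical encoded representation


class EventTimeline:
    """Bounded rolling cache of encoded events, kept for later recall."""

    def __init__(self, max_events: int = 10_000):
        self.events: deque[Event] = deque(maxlen=max_events)

    def add(self, event: Event) -> None:
        self.events.append(event)

    def recall(self, since_seconds: float) -> list[Event]:
        """Return events observed within the last `since_seconds` seconds."""
        cutoff = time.time() - since_seconds
        return [e for e in self.events if e.timestamp >= cutoff]


def process_streams(video_frames, audio_chunks, frame_encoder, speech_encoder):
    """Continuously encode both streams and merge them into one timeline."""
    timeline = EventTimeline()
    for frame, chunk in zip(video_frames, audio_chunks):
        timeline.add(Event(time.time(), "video", frame_encoder(frame)))
        timeline.add(Event(time.time(), "speech", speech_encoder(chunk)))
    return timeline
```

The point of the sketch is the trade-off the post hints at: encode everything up front and keep it in a bounded cache, so answering a question like "where did I leave my glasses?" becomes a cheap lookup over recent events rather than a fresh pass over raw video.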

OpenAI did not use multiple models for GPT-4o. Instead, the company trained the model end-to-end across text, vision and audio, enabling it to process all inputs and outputs and deliver responses with an average latency of 320 milliseconds. Google has not shared a specific number for Astra's response time, but the latency, if any, is expected to decrease as the work progresses. It also remains unclear whether Project Astra agents will have the same kind of emotional range OpenAI has shown with GPT-4o.

Availability

For now, Astra is just Google's early work on a full-fledged AI agent that would sit by the user's side and help out with everyday life, be it work or a personal task, with relevant context and memory. The company has not shared when exactly this vision will translate into an actual product, but it did confirm that the ability to understand the real world and interact with it at the same time will come to the Gemini app on Android, iOS and the web.

Google will first add Gemini Live to the application, allowing users to engage in two-way conversations with the chatbot. Eventually, probably sometime later this year, Gemini Live will include some of the vision capabilities demonstrated today, allowing users to open up their cameras and discuss their surroundings. Notably, users will also be able to interrupt Gemini during these dialogs, much like what OpenAI is doing with ChatGPT.

“With technology like this, it’s easy to envision a future where people could have an expert AI assistant by their side, through a phone or glasses,” Hassabis added.
