Microsoft shows off VASA-1, an AI framework that makes human headshots talk, sing

Microsoft has taken a major leap in the field of AI-driven content generation. Just a few hours ago, the research arm of the Satya Nadella-led company presented VASA-1, an AI framework that can convert human headshots into talking and singing videos.

The project marks a significant shift in AI-generated content because it works with minimal input: all it needs is a single static headshot and a speech audio file, and the model brings the image to life, complete with lip-sync, matching facial expressions and head movements.

Microsoft shared multiple samples showcasing the prowess of the framework, including one of the Mona Lisa rapping. However, given the evident risk of deepfake generation with such technology, the company also emphasized that this is just a research demo and that there are no plans to bring the technology to market.

Microsoft VASA brings static images to life

Today, tools generating AI content, especially video, are a double-edged sword. They can be used for positive applications, like producing scenes for advertising projects, or for harmful ones, like producing deepfakes that damage the reputation of a person or celebrity.

But here’s the tricky thing: even deepfakes can have positive applications. Imagine an artist who agrees to have a digital replica created for advertising projects or social media promotions. With VASA-1, Microsoft walks this fine line of deepfake production with what it describes as “generating lifelike talking faces of virtual characters with appealing visual affective skills (VAS).”

According to the company, the model, when provided with a still image of a person’s face and a speech audio file, can produce a video complete with lip movements synchronized to the audio, as well as a spectrum of emotions, facial nuances and natural head motions that contribute to the perception of authenticity and liveliness. The company also shared multiple examples showcasing how a single headshot could be converted into a video of the same person talking or singing.

“The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos,” the researchers behind VASA wrote on the company website.
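To make that description concrete, here is a minimal, hypothetical sketch of the kind of pipeline the researchers describe: the single headshot is encoded into an appearance latent, an audio-conditioned generator produces a sequence of facial-dynamics and head-motion latents, and a decoder renders the frames. All function names, shapes and constants below are illustrative assumptions, not Microsoft’s actual API or code.

```python
# Illustrative sketch only: identity (appearance) stays fixed, while an
# audio-conditioned generator produces per-frame facial dynamics and head
# motion in a separate latent space, mirroring the "disentangled" design
# the researchers describe. Shapes and names are assumptions.
import numpy as np

APPEARANCE_DIM = 256   # assumed size of the identity/appearance latent
DYNAMICS_DIM = 64      # assumed size of the per-frame dynamics latent
FPS = 45               # offline batch-mode frame rate reported by Microsoft

def encode_appearance(headshot: np.ndarray) -> np.ndarray:
    """Stand-in for an encoder mapping one still image to an appearance latent."""
    return np.random.randn(APPEARANCE_DIM)

def generate_dynamics(audio: np.ndarray, num_frames: int) -> np.ndarray:
    """Stand-in for the audio-conditioned generator of facial dynamics and head motion."""
    return np.random.randn(num_frames, DYNAMICS_DIM)

def decode_frame(appearance: np.ndarray, dynamics: np.ndarray) -> np.ndarray:
    """Stand-in for the decoder combining identity and motion into one 512x512 frame."""
    return np.zeros((512, 512, 3), dtype=np.uint8)

def animate(headshot: np.ndarray, audio: np.ndarray, duration_s: float) -> list[np.ndarray]:
    appearance = encode_appearance(headshot)                     # fixed identity
    dynamics = generate_dynamics(audio, int(duration_s * FPS))   # motion varies per frame
    return [decode_frame(appearance, d) for d in dynamics]

frames = animate(np.zeros((512, 512, 3)), np.zeros(16000 * 3), duration_s=3.0)
print(len(frames), "frames at", FPS, "fps")
```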

More importantly, the team noted that the technology lets users control the generation, enabling them to tweak aspects like the motion sequence, eye gaze direction, head distance and emotion by simply moving a slider up and down. On top of that, it can handle inputs that were not included in the training data, including artistic photos, singing audio and non-English speech.
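For illustration, those control signals could be thought of as a small set of optional conditioning parameters passed alongside the image and audio. The field names and value ranges below are assumptions made for this sketch, not documented VASA-1 settings.

```python
# Hypothetical bundle of the control signals the researchers describe
# (eye gaze direction, head distance, emotion). Names and ranges are
# assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class GenerationControls:
    gaze_direction: tuple[float, float] = (0.0, 0.0)  # assumed (yaw, pitch) offsets in degrees
    head_distance: float = 1.0                        # assumed relative camera distance
    emotion: float = 0.0                              # assumed slider from -1 (sad) to +1 (happy)

# Moving a single slider, as the article describes, would amount to
# regenerating the video with one field changed while the rest stay fixed.
neutral = GenerationControls()
happier = GenerationControls(emotion=0.8)
print(neutral, happier, sep="\n")
```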

Long way to go for actual VASA implementation

While some of the samples shared by Microsoft look real, many clips give away that they have been generated with AI, as the movement does not appear smooth. The company, for its part, says the approach generates 512 x 512 videos at 45fps in offline batch-processing mode and can support up to 40fps in online streaming mode. It also claims the work outperforms other methods in this space when tested through extensive experiments, including comparisons on a new set of metrics.

That said, it is also important to note that this kind of work can easily be abused to misrepresent a person, making them appear to say things they never said in a video. This is why Microsoft is not releasing VASA as a product or an API. The company has emphasized that all human headshots showcased in the demo clips were generated with AI, and that the technology is largely aimed at generating visual affective skills for virtual AI avatars intended for positive applications, rather than content used to mislead or deceive.

In the long run, Microsoft sees VASA research as a step towards lifelike avatars emulating human movements and emotions. This, the company says, could help enhance educational equity, improve accessibility for individuals with communication challenges, and offer companionship or therapeutic support to those in need.