At its I/O developer conference, Google unveiled Gemini Omni, a multimodal AI model that transforms images, audio, and text into broadcast-quality videos through plain conversation.
Users can combine any inputs, such as a photo, a voiceover, or a text prompt, and Omni turns them into a coherent, visually detailed video that reflects an understanding of physics, culture, history, and science.
Google CEO Sundar Pichai stated the goal is to create "anything from any input." In one demo, Omni took a single text prompt and generated a claymation explainer on protein folding, complete with a scripted voiceover. Content creators have traditionally assembled this type of material manually.
Google is building safeguards against deepfakes. To create videos featuring a digital avatar, users must record themselves speaking a series of numbers during onboarding. Every video receives a watermark with Google's SynthID digital signature, allowing viewers to verify it is AI-generated.
Omni merges the intelligence of Gemini with media rendering capabilities. The long-term vision includes generating images from audio, audio from video, and other cross-modal combinations.
Omni Flash is rolling out now. The full suite is expected to affect YouTube creators and film production companies.



