
Gemini Omni: Google's New Any-to-Any Video Model
Google has pulled its scattered video tools into one place. The result is Gemini Omni, an any-to-any model Google DeepMind announced at I/O 2026 on May 19. You hand it text, an image, audio, or a video clip, and it returns something finished that actually makes sense of the request. No gluing three separate tools together. No starting from scratch every time you want one small change. Judging by what Google put on stage, it's a bigger swing than the tidy name suggests.
What makes Gemini Omni worth paying attention to isn't a spec sheet. It's the feeling of handing a machine a messy pile of ideas and getting back something coherent. You don't storyboard across five tools anymore. You describe, you watch, you adjust. So let's get into what Gemini Omni actually is, how it works under the hood, and why it feels different from the wave of video tools that came before it.
When the person touches the mirror, the entire environment turns into 3D voxel art.
What Is Gemini Omni?
For a long time, making something with AI meant picking a lane. You'd use one model for images, a different one for video, a third for sound, then assemble the pieces by hand and hope they matched. Gemini Omni tosses that assembly line aside. It's a single model family from Google DeepMind that accepts almost any input you can imagine and produces video that's grounded in real-world reasoning rather than just pretty pixels.
Google describes the model as the place where Gemini's ability to reason meets its ability to create. That phrasing isn't marketing fluff. The whole point of Gemini Omni is that it draws on the same world knowledge powering Gemini's text and reasoning work, then puts that knowledge to use generating moving images. Ask it for a clay-animation explainer of how proteins fold, and it doesn't just animate something vaguely science-y. It tries to get the biology right.
The model was introduced by Koray Kavukcuoglu, CTO of Google DeepMind, with Demis Hassabis framing it on stage as a step toward systems that genuinely understand the physical world. The first release, Gemini Omni Flash, is the fastest and smallest member of a family that Google has already said will grow. A larger Omni Pro has been teased, though Google hasn't attached a release date to it.
A marble rolling fast on a chain reaction style track, continuous smooth shot.
How Gemini Omni Works
What makes Gemini Omni tick is that it was built to be multimodal from the ground up, not patched together after the fact. The model treats text, images, audio, and video as native inputs it can reason across all at once. You're not feeding it one thing at a time. You can drop in a still photo, a short reference clip for motion, an audio file for rhythm, and a written instruction, and the model weaves all four into one coherent output instead of treating them as separate jobs.
The output, at launch, is high-resolution video with synchronized audio. Google has said image and audio outputs are coming in later releases of the family, which is why the official line is "starting with video." The "starting with" is doing real work in that sentence. Gemini Omni is meant to expand outward over time.
There's a real architectural distinction worth understanding here. Google's earlier video model, Veo, is built primarily for turning text into footage. It's excellent at that, but it expects precise, prescriptive prompts and it doesn't carry a conversation forward. Gemini Omni adds a reasoning layer on top of generation and accepts every input type at once. More importantly, it remembers. Every edit you make is understood in the context of everything that came before it, which is the feature that quietly changes the entire experience.
Key Features of Gemini Omni
There's a lot packed into this model, and a feature list can flatten it. So instead of rushing through, here are the capabilities that genuinely stand out once you start using Gemini Omni for real work.
Conversational, Multi-Turn Editing
This is the headline. Google internally frames the model as "Nano Banana, but for video," and that comparison lands once you see it in motion. With Gemini Omni, you generate a base scene, then refine it by talking. Change the butterfly to a bee. Now turn the bee into a swarm of fireflies. Shift the camera to over the violinist's shoulder. Make the violin invisible.
A violinist transported into a sunlit daisy field with Gemini Omni, the scene staying consistent across every edit.
Each instruction builds on the last one. The characters, the lighting, the physics, the scene context all persist across turns. You're not regenerating the whole video and praying it stays consistent. You're nudging one thing while everything else holds steady. Anyone who's fought with AI video tools that reset on every prompt will understand instantly why this is such a relief.
Grounding in Real-World Knowledge
Gemini Omni carries far more world knowledge than a typical video generator because it pulls from Gemini's training. That shows up in two ways. First, physics. The model has an intuitive feel for gravity, kinetic energy, and fluid dynamics, so a marble rolling down a chain-reaction track moves the way a marble actually would. Second, factual grounding. Ask for an explainer on the brain's hippocampus and it constructs something accurate, not just decorative.
Claymation explainer of protein folding, everything is made out of clay, no hands, stop motion, accurate.
Reference Anything
You can hand the model almost any kind of reference and it'll use it. An image to lock a character's look. A video to borrow motion or camera distortion. An audio clip to set tempo. A rough sketch to guide how elements should move. Gemini Omni can take a doodle and turn it into realistic footage, using the drawing only as a movement guide without ever showing the drawing in the final cut.
Style and Motion Transfer
Want a scene reimagined as anime, claymation, or watercolor while keeping the original motion intact? Ask for it. The model can apply a new visual style across an existing clip, or transfer the motion from one reference onto a completely different subject. One of Google's demos showed a whale's swimming motion mapped onto reflective liquid material, with the water replaced by smooth moving shapes. The whale never appears. The motion does.
Text That Syncs With the Action
Rendering legible text inside generated video has been a stubborn problem across the field. Gemini Omni goes further than just rendering words correctly. It can time text to what's happening on screen, letting words appear one at a time to a rhythm, each with its own animation style. That's the difference between captions slapped on top and text that's part of the scene.
The video shows items of the alphabet. An unusual item starting with each letter is shown sitting on a table (like a Capybara for C, disco globe for D and Lava Lamp for L).
Real-World Use Cases
So who's this actually for? More people than you'd guess, partly because of where Google chose to launch it.
Short-form creators are the obvious first wave. Gemini Omni hands real production power to people who would never open a traditional editing suite. You don't need timelines, keyframes, or a decade of muscle memory. You describe the change you want and the model does the heavy lifting, which lowers the barrier to a polished clip dramatically.
Filmmakers and studio-minded users are another natural fit. When you're stitching a narrative across multiple beats, the thing that usually breaks is continuity: characters drift, lighting shifts, and the world stops feeling like one place. Gemini Omni holds that consistency from shot to shot, which is exactly what project-based work needs to stay coherent.
Then there's the everyday slice. Marketers building quick campaign concepts. Teachers turning a dry topic into a claymation explainer. Designers translating a sketch into something that moves. Because you steer the model by conversation rather than by learning a dense tool, the barrier to a decent first result is genuinely low. You describe what you want, look at what comes back, and adjust by asking again.
When the finger in <video> touches the animal toy play the sound the animal makes.
How to Use Gemini Omni on Eachlabs
Getting started with Gemini Omni on Eachlabs is straightforward. Everything runs through one unified API, so there's no juggling logins or stitching tools together. You send your inputs (text, images, audio, video, or any blend of them), and the model reasons across them to build a single coherent clip. From there, you refine by talking to it: change the camera, shift the style, swap an object, one focused request at a time. And because it sits alongside every other model on Eachlabs, you're free to mix Gemini Omni into a larger pipeline whenever a project calls for it. The model is coming to Eachlabs soon, so it's worth getting familiar with how it thinks now, and keeping an eye on the catalog so you can start building the moment it lands.
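Since the model hasn't landed on Eachlabs yet, there's no real request format to show. Still, it can help to picture what a mixed-input request might look like. The sketch below is only that: the class name OmniRequest and every field name in it are hypothetical, not the Eachlabs API, and nothing here makes a network call. It just bundles a prompt with optional references and prints the result as JSON.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class OmniRequest:
    """Hypothetical request shape. Every field name here is an assumption."""
    prompt: str
    image_urls: List[str] = field(default_factory=list)
    audio_urls: List[str] = field(default_factory=list)
    video_urls: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Drop empty reference lists so the payload carries only supplied inputs.
        payload = {key: value for key, value in asdict(self).items() if value}
        return json.dumps(payload, indent=2)


# Example: one text prompt plus a single (placeholder) reference image.
request = OmniRequest(
    prompt="A marble rolling fast on a chain reaction style track, continuous smooth shot.",
    image_urls=["https://example.com/track.png"],
)
print(request.to_json())
```

Once the real API is published, expect the field names and structure to differ. The point is only that one request can carry several kinds of input at once.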
Tips for Getting the Best Results
A model this flexible rewards a slightly different prompting habit than you might be used to. A few things consistently help.
Lead With Intention, Not Micromanagement
Because Gemini Omni reasons about the world, you don't have to spell out every frame. Tell it the effect you're after and let the model fill in the physics and detail. Over-describing can actually fight against its instincts. Say what should happen and how it should feel, then refine from there.
Edit One Thing at a Time
The model's superpower is consistency across turns, so use it. Rather than rewriting a sprawling prompt, make a single targeted change, see the result, and stack the next change on top. Want a different background? Ask for just that. The rest of your scene stays put.
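If you're scripting your edits rather than typing them into a chat box, the same habit can be mirrored in code. Here's a minimal sketch of a session that keeps a base prompt and a running list of single-change instructions. EditSession and its methods are made up for illustration and aren't part of any Google or Eachlabs library. The model itself tracks context; this only keeps your side of the conversation tidy.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EditSession:
    """Hypothetical helper: a base prompt plus an ordered list of edits."""
    base_prompt: str
    edits: List[str] = field(default_factory=list)

    def add_edit(self, instruction: str) -> None:
        # Each turn carries one focused change; reject empty instructions.
        instruction = instruction.strip()
        if not instruction:
            raise ValueError("An edit needs a non-empty instruction.")
        self.edits.append(instruction)

    def undo(self) -> Optional[str]:
        # Remove and return the most recent edit, or None if there is none.
        return self.edits.pop() if self.edits else None

    def turns(self) -> List[str]:
        # The full conversation in order: base prompt first, then each edit.
        return [self.base_prompt, *self.edits]


session = EditSession("A violinist playing in a sunlit daisy field.")
session.add_edit("Change the butterfly to a bee.")
session.add_edit("Shift the camera to over the violinist's shoulder.")
for number, turn in enumerate(session.turns(), start=1):
    print(f"Turn {number}: {turn}")
```

Keeping each edit as its own entry makes it easy to step back one change without rewriting the whole prompt.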
Bring References When You Can
A reference image or clip carries more creative direction than words alone. If you need a character, an environment, or a specific motion to stay locked, supply a reference and Gemini Omni will hold it steady across the scene. References created in Nano Banana work nicely as starting points for characters and objects.
Be Specific About Camera and Sound
The model responds well to real videography language. Ask for "one continuous shot," a "dolly zoom," a "locked off" static frame, or a "natural smartphone zoom." On audio, you can request synchronized sound effects or background music tied to the action, like building lights flickering on in time with a beat. The more cinematic vocabulary you give it, the more control you get back.
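One way to make that vocabulary a habit is to keep action, camera, and sound as separate pieces and join them at the end. The helper below is a small sketch of that idea. compose_prompt and the "Camera:" and "Sound:" labels are conventions invented here, not a prompt format the model requires.

```python
from typing import Optional


def compose_prompt(action: str, camera: Optional[str] = None,
                   sound: Optional[str] = None) -> str:
    """Join an action with optional camera and sound direction."""
    # Strip a trailing period so the joined sentences are punctuated once.
    parts = [action.strip().rstrip(".")]
    if camera:
        parts.append(f"Camera: {camera.strip()}")
    if sound:
        parts.append(f"Sound: {sound.strip()}")
    return ". ".join(parts) + "."


prompt = compose_prompt(
    "Building lights flicker on across a city skyline",
    camera="one continuous shot, locked off",
    sound="background music, lights flickering on in time with the beat",
)
print(prompt)
```

Separating the pieces this way makes it simple to swap a "dolly zoom" for a "locked off" frame while leaving the action and audio direction untouched.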
Wrapping Up
Gemini Omni is the clearest sign yet that the messy, multi-tool era of AI video is giving way to something simpler. One model, almost any input, a finished clip you shape just by talking to it. That's a genuinely different relationship with creative software than most of us have had, closer to directing than operating. And this is only the first step in the family, with more on the way. Gemini Omni already looks less like a single product and more like the start of how Google wants creation to work from here.
Frequently Asked Questions
What can Gemini Omni take as input?
Pretty much anything you'd reach for in a creative project. You can combine text, images, audio, and video in a single prompt, and the model reasons across all of them at once to build one coherent video. That mix-and-match flexibility is the whole reason "any-to-any" stuck as the description.
Why are Gemini Omni Flash clips limited to 10 seconds?
It's a choice, not a wall. Google's DeepMind team has said the 10-second cap on Gemini Omni Flash is a deployment decision meant to get the model into more hands while compute demand is high, and a bet that most people don't need longer clips yet. The underlying model can generate more, and Google has signaled the limit will grow.
Is Gemini Omni the same thing as Veo?
Different model, different purpose. Veo is Google's specialized text-to-video line and it's still around. Gemini Omni is a separate, natively multimodal family that accepts mixed inputs, reasons across them, and supports multi-turn conversational editing that keeps your scene consistent. Google has framed the move as an architectural shift rather than a rename.