Google Gemini Omni Flash · Text to Video

Video · gemini-omni-flash · by Google

Create short text-prompted videos with synchronized audio using Gemini Omni Flash, including aspect ratio and duration controls.

Runtime (p50): 1m
Estimated price: Usage-based
Call the API
prediction.sh
curl -X POST \
  -H "X-API-Key: $EACHLABS_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "google-gemini-omni-flash-text-to-video",
    "version": "0.0.1",
    "input": {
        "prompt": "Ultra cinematic macro nature film, hyperrealistic, soft depth of field, slow motion, atmospheric, dreamlike, volumetric lighting, Fibonacci-inspired transitions, seamless morphing between natural spirals, elegant camera movement, documentary-quality realism",
        "duration": "10s",
        "aspect_ratio": "16:9"
    },
    "webhook_url": ""
}' \
  https://api.eachlabs.ai/v1/prediction/
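
Hand-editing the JSON body gets awkward once a prompt contains double quotes or backslashes. The following is a minimal sketch, not an official client: json_escape and build_payload are hypothetical helper names, and the field names, model ID, version, and endpoint are copied from the curl example above. It assumes a single-line prompt and does not handle newlines or control characters.

```sh
#!/bin/sh
# Sketch only: json_escape and build_payload are hypothetical helpers,
# not part of any each::labs tooling.

# Escape backslashes first, then double quotes, so the text can sit
# inside a JSON string. Assumption: the prompt is a single line.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print a request body with the same fields as the curl example above.
# Usage: build_payload PROMPT DURATION ASPECT_RATIO
build_payload() {
  printf '{"model":"google-gemini-omni-flash-text-to-video","version":"0.0.1","input":{"prompt":"%s","duration":"%s","aspect_ratio":"%s"},"webhook_url":""}' \
    "$(json_escape "$1")" "$2" "$3"
}

# Example (needs EACHLABS_API_KEY and performs a real, billable request):
# build_payload "A slow pan across a misty forest" "10s" "16:9" |
#   curl -X POST \
#     -H "X-API-Key: $EACHLABS_API_KEY" \
#     -H "Content-Type: application/json" \
#     --data @- \
#     https://api.eachlabs.ai/v1/prediction/
```

Piping the body to curl with --data @- keeps the prompt out of the shell's quoting rules, which is the main reason to build the payload in a function rather than inline.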
Documentation (8 sections)
  • Overview

    Google | Gemini Omni Flash | Text to Video Overview

    Google | Gemini Omni Flash | Text to Video is a multimodal generative video model from Google’s Gemini family that creates short video clips directly from natural language prompts. It targets rapid video prototyping: text, images, audio, or existing footage become coherent, editable video sequences without traditional editing tools. Its main differentiator is a conversational editing workflow, in which users describe changes in plain language and the model updates the video while preserving visual consistency across iterations. Integrated into Google’s Gemini ecosystem, it is designed for creators, marketers, and developers who need fast, iterative video generation with synchronized audio and flexible multimodal inputs. On each::labs, this model focuses on efficient 720p text-to-video generation and is billed by token usage.

  • Capabilities

    Capabilities

    • Generate short 720p video clips with synchronized audio directly from descriptive text prompts.
    • Accept multimodal inputs—text, images, audio clips, and existing video—to drive or refine video generation within the Gemini Omni Flash framework.
    • Support conversational editing, allowing users to iteratively adjust scenes, backgrounds, characters, and pacing through natural language instructions.
    • Maintain visual consistency across multiple edits to the same project, preserving characters and style while changing details.
    • Create coherent narrative sequences for short formats like social clips, explainer snippets, and concept demos.
    • Automatically align native audio with visual content, matching overall mood and timing of the generated clip.
    • Integrate with Gemini-based workflows, making it easier for developers to access the Google | Gemini Omni Flash | Text to Video API within broader applications.
    • Handle a range of visual styles, from stylized or illustrative looks to more realistic scenes, depending on prompt detail.
  • Use cases

    Use Cases for Google | Gemini Omni Flash | Text to Video

    For creators, Google | Gemini Omni Flash | Text to Video is ideal for quickly mocking up story beats or short animations with synchronized audio. A creator might use a prompt like, "Generate an 8-second animated fantasy forest reveal with orchestral swelling music" to test mood and pacing before full production. Marketers can rapidly produce social-ready clips that highlight products or campaigns, for example, "Create a 7-second 720p video of a new running shoe on a track, dynamic camera moves, energetic electronic soundtrack." Developers can embed the Google | Gemini Omni Flash | Text to Video API into apps that auto-generate onboarding videos or feature demos based on text descriptions. Designers can visualize motion concepts or UI interactions—"Produce a 5-second video showing a mobile app interface sliding between screens with subtle click sounds"—using multimodal inputs when needed to keep branding consistent.

  • Tips & tricks

    Tips and Tricks

    To get the most from Google | Gemini Omni Flash | Text to Video, write prompts that explicitly state duration, framing, motion, and audio style. For example, include phrases like “10-second clip,” “slow pan,” or “ambient electronic soundtrack” to steer both visuals and sound. Because Gemini Omni Flash supports conversational editing, iterate in steps: generate a base clip, then request targeted changes such as “make the background a sunset city skyline” or “add softer piano audio.” When using multimodal inputs, reference them directly in the prompt, e.g., “use the attached photo as the main character.” Avoid overly ambiguous instructions, which can lead to inconsistent motion or pacing. On each::labs, start with concise prompts, then refine with additional descriptive tokens only where they add clear value.

    Example prompts:

    • "Create a 6-second 720p video of a watercolor-style cat chasing a glowing butterfly at dusk, with gentle piano music."
    • "Generate a 10-second 16:9 city skyline timelapse at night, slow camera zoom, with soft ambient electronic audio."
    • "Produce a 5-second product showcase of a smartphone on a rotating pedestal, dramatic lighting, and minimal synth sound."
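
    The advice above (state duration, framing, motion, and audio style explicitly) can be turned into a small template. This is only an illustrative sketch: compose_prompt is a hypothetical helper, and the sentence pattern is one possible wording modeled on the example prompts, not a format the model requires.

```sh
#!/bin/sh
# Sketch only: compose_prompt is a hypothetical helper that assembles a
# prompt from the elements the tips recommend stating explicitly.
# Usage: compose_prompt SUBJECT SECONDS FRAMING MOTION AUDIO
compose_prompt() {
  printf 'Create a %s-second 720p video of %s, %s, %s, with %s.' \
    "$2" "$1" "$3" "$4" "$5"
}

# Example call:
# compose_prompt "a city skyline at night" 10 "16:9 wide shot" \
#   "slow camera zoom" "soft ambient electronic audio"
```

    Keeping each element in its own argument makes it easy to change one aspect at a time between generations, which fits the step-by-step iteration described above.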
  • Technical spec

    Technical Specifications

    • Provider / Family: Google Gemini family, Gemini Omni Flash variant focused on video generation and editing.
    • Input modalities: Text prompts; optionally images, audio clips, and existing video as conditioning sources in the broader Gemini Omni Flash workflow.
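    • Output: 3–10 second video clips at 720p resolution with native synchronized audio and 24 FPS playback. The request example on this page sets the length through the "duration" input (e.g., "10s"); the desired duration can also be described in the prompt.
    • Aspect ratio: Standard 16:9 for 720p output, set through the "aspect_ratio" input in the request example; other aspect ratios are not yet broadly documented.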
    • Formats: Outputs as standard web video formats (e.g., MP4) with an embedded audio track; exact container may vary by integration.
    • Processing time: The median (p50) runtime listed for this model is about 1 minute; individual clips may be faster or slower depending on prompt complexity and infrastructure load.
    • Architecture: Multimodal Gemini Omni architecture combining text, vision, and audio understanding for unified video synthesis and editing.
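
    A client can check the requested clip length against the 3–10 second range listed above before sending a request. The sketch below is an assumption-laden example: valid_duration is a hypothetical helper, the "<N>s" string format is inferred from the "10s" value in the request example, and whether the API itself rejects out-of-range values is not stated on this page.

```sh
#!/bin/sh
# Sketch only: valid_duration is a hypothetical helper.
# Succeeds if $1 looks like "<N>s" with N a whole number from 3 to 10.
# Note: POSIX sh has no local variables, so "n" is set globally.
valid_duration() {
  case "$1" in
    *s) n=${1%s} ;;      # strip the trailing "s"
    *) return 1 ;;       # missing unit suffix
  esac
  case "$n" in
    ''|*[!0-9]*) return 1 ;;  # empty or contains a non-digit
  esac
  [ "$n" -ge 3 ] && [ "$n" -le 10 ]
}

# Example call:
# if valid_duration "10s"; then echo "ok"; else echo "out of range"; fi
```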
  • Things to be aware of

    Things to Be Aware Of

    Because Google | Gemini Omni Flash | Text to Video focuses on short clips, attempts to force very long, complex narratives into a single generation may result in rushed or incoherent motion. Highly ambiguous or conflicting prompts can lead to unstable camera work or inconsistent character appearance across frames. Multimodal conditioning (images, audio, existing video) depends on the integration and may not always match user expectations, especially for very detailed brand guidelines. Users should also be aware of content policies and regional feature differences noted by Google, as some capabilities may not be available in all markets. Finally, token-based billing means repeated, verbose experimentation can increase costs; structured iteration helps control usage.

  • Key considerations

    Key Considerations

    Google | Gemini Omni Flash | Text to Video is best suited for short-form video generation and rapid iteration, not full-length productions. Users should provide clear prompts that specify clip duration, camera movement, and audio mood to guide the model. Because the model operates within the Gemini Omni multimodal framework, it can leverage visual or audio references, but coverage and quality may vary by integration. Costs are tied to token usage, so highly detailed or long prompts can increase spend. For longer, high-resolution or specialized cinematic outputs, users may prefer dedicated video tools, while this model excels at concept visualization, social content, and quick creative experiments.

  • Limitations

    Limitations

    Google | Gemini Omni Flash | Text to Video is currently best for short 3–10 second clips and does not serve as a full-length video production system. Resolution is limited to 720p in the described integration, which may not meet all broadcast or cinema requirements. Complex scene choreography, detailed human motion, and fine-grained lip-sync are not guaranteed and may show artifacts or minor inconsistencies. Aspect ratio and output format options are relatively constrained compared to dedicated video suites, and editing happens through the multimodal Gemini Omni Flash framework rather than through low-level timeline controls.
