What prompt strategies work best with Veo 3.1 text-to-video?

Veo 3.1 text-to-video responds best to prompts that specify camera behavior, subject motion, lighting conditions, environment details, and visual style. Including references to real-world cinematography styles, shot types such as wide or close-up, and specific action descriptions produces more controlled and higher-quality results than broad or abstract prompts.

How can I generate videos with Veo 3.1 text-to-video through the eachlabs API?

Veo 3.1 text-to-video is accessible on the eachlabs platform under the model ID veo3.1-text-to-video. Submit a detailed text prompt via the eachlabs unified API and receive a high-quality video clip generated by Google's model. eachlabs provides access to all Veo model versions on pay-as-you-go pricing, with no Google Cloud account required.

Example input

prompt: "Two-person street interview in Paris. The host holds a small microphone and casually talks with a passerby near a café terrace with the Eiffel Tower in the background. Natural daylight, lively ambient city sounds — people chatting, distant traffic, light breeze. Dialogue: Host: “Hey! Did you catch the update?” Person: “Of course — VE0 3.1 just dropped on eachlabs! You have to check it out, it’s unreal.”"
aspect_ratio: "16:9"
duration: "8"
enhance_prompt: true
auto_fix: true
resolution: "720p"
generate_audio: true

Veo 3.1 · Text to Video

Video·veo3.1·by Google

The most advanced video generation model by Google DeepMind. Creates realistic scenes, natural sounds, and physically consistent motion from a single text prompt. Perfect for storytelling, cinematic ads, and short films.

API reference

Runtime (p50): 1m
Estimated price: From $0.8

Call the API

prediction.sh

curl -X POST \
  -H "X-API-Key: $EACHLABS_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "veo3-1-text-to-video",
    "version": "0.0.1",
    "input": {
        "prompt": "Two-person street interview in Paris. The host holds a small microphone and casually talks with a passerby near a café terrace with the Eiffel Tower in the background. Natural daylight, lively ambient city sounds — people chatting, distant traffic, light breeze.\n\nDialogue:\nHost: “Hey! Did you catch the update?”\nPerson: “Of course — VE0 3.1 just dropped on eachlabs! You have to check it out, it’s unreal.”",
        "aspect_ratio": "16:9",
        "duration": "8",
        "enhance_prompt": true,
        "auto_fix": true,
        "resolution": "720p",
        "generate_audio": true
    },
    "webhook_url": ""
}' \
  https://api.eachlabs.ai/v1/prediction/
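
For callers who prefer Python over curl, the same request can be sketched with the standard library. This is a minimal sketch, not an official SDK: the endpoint, header names, and payload fields are copied from the curl example above, while the helper names `build_request` and `submit` are hypothetical, and the response is treated as opaque JSON because its schema is not described here.

```python
import json
import urllib.request

# Endpoint and model slug copied from the curl example above.
API_URL = "https://api.eachlabs.ai/v1/prediction/"
MODEL = "veo3-1-text-to-video"


def build_request(prompt, api_key, **overrides):
    """Build (but do not send) a POST request mirroring the curl example.

    Hypothetical helper: keyword overrides replace the default input fields.
    """
    model_input = {
        "prompt": prompt,
        "aspect_ratio": "16:9",
        "duration": "8",
        "enhance_prompt": True,
        "auto_fix": True,
        "resolution": "720p",
        "generate_audio": True,
    }
    model_input.update(overrides)
    payload = {
        "model": MODEL,
        "version": "0.0.1",
        "input": model_input,
        "webhook_url": "",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def submit(request, timeout=60):
    """Send the request and return the decoded JSON body (schema not assumed)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To send a request, read the key from the `EACHLABS_API_KEY` environment variable and call `submit(build_request(prompt, api_key))`. Keeping request construction separate from sending makes the payload easy to inspect before any credits are spent.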

Documentation (8 sections)

Overview
veo3.1-text-to-video — Text to Video AI Model

veo3.1-text-to-video, Google DeepMind's most advanced text-to-video AI model, transforms detailed text prompts into cinematic videos with native audio, physically realistic motion, and up to 4K resolution. Creators and developers looking for a Google text-to-video solution can generate broadcast-quality clips with synchronized soundscapes and consistent characters for storytelling, short films, ads, and social media content. As part of the Veo 3.1 family, the model excels at interpreting cinematic terminology, such as lighting and camera-movement directions.
Capabilities
- What the model can do well: Generates high-fidelity videos with realistic motion and synchronized audio.
- Special features or abilities: Supports video extension, frame-specific generation, and image-based direction.
- Quality of outputs: Produces cinematic-quality videos with true-to-life textures and sounds.
- Versatility and adaptability: Can be used for a wide range of visual and cinematic styles.
- Technical strengths: Offers strong prompt adherence and improved audiovisual quality.
Use cases
Use Cases for veo3.1-text-to-video

Filmmakers and video creators can pre-visualize scenes using Veo 3.1's 4K output and character consistency. Supplying up to 4 reference images of actors and locations alongside a prompt such as "A detective chases a suspect through rainy neon-lit streets at night, slow-motion puddles splashing, tense orchestral score rising" generates a coherent 8-second clip with synced audio for storyboarding.

Marketers producing cinematic ads leverage native vertical 9:16 support for TikTok and Reels, creating product demos with realistic motion and sound—such as a luxury watch ticking on a velvet surface with ambient clock chimes—directly optimized for social media without cropping.

Developers building AI video apps integrate the veo3.1-text-to-video API for e-commerce tools, using text prompts plus product images to output high-res videos with consistent branding and physics-accurate animations, streamlining automated content generation.

Content designers for short films extend base clips iteratively for narratives up to 60 seconds, maintaining temporal consistency and adding audio cues like "coffee shop chatter" to build immersive environments from simple prompts.
Tips & tricks
How to Use veo3.1-text-to-video on Eachlabs
Access veo3.1-text-to-video through the Eachlabs Playground, API, or SDK by providing a detailed text prompt structured as "Subject + Action + Lighting + Camera," optionally up to 4 reference images, an aspect ratio (16:9 or 9:16), and a duration setting. The model generates MP4 videos at 4K, 1080p, or 720p with native audio and physically consistent motion, ready for immediate use in apps or editing workflows.
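
The "Subject + Action + Lighting + Camera" structure can be assembled programmatically when prompts are generated from templates. The following is a minimal sketch under that assumption; `build_prompt` is a hypothetical helper, and the sentence-per-component layout is one reasonable choice rather than a format the model requires.

```python
def build_prompt(subject, action, lighting, camera, audio=None):
    """Join components in Subject + Action + Lighting + Camera order.

    Hypothetical helper: an optional audio description is appended last.
    """
    parts = [f"{subject} {action}", lighting, camera]
    if audio:
        parts.append(audio)
    # Drop empty parts and trailing periods so each component ends with one period.
    cleaned = [p.strip().rstrip(".") for p in parts if p and p.strip()]
    return ". ".join(cleaned) + "."


prompt = build_prompt(
    subject="A detective",
    action="chases a suspect through rainy neon-lit streets at night",
    lighting="Neon reflections on wet pavement",
    camera="Low-angle tracking shot",
    audio="Tense orchestral score rising",
)
print(prompt)
```

Keeping each component as a separate argument makes it easy to vary one element, such as the camera move, while holding the rest of the scene fixed.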
Technical spec
What Sets veo3.1-text-to-video Apart

veo3.1-text-to-video stands out among text-to-video AI models with 4K resolution support at 3840x2160, enabling professional-grade output for cinema displays and high-end YouTube videos, a level that competitors such as Sora 2, at 1080p, do not match. The higher resolution delivers sharper detail, so creators can produce visuals suitable for commercial projects without upscaling in post-processing.

Native audio generation sets it further apart by automatically syncing sound effects, ambient noise, and background scores to video content, such as sirens for a police chase or rain sounds for stormy scenes, enhancing immersion without manual editing. Users benefit from ready-to-use clips with realistic audio that elevates storytelling for platforms demanding full audiovisual experiences.

Support for up to 4 reference images ensures exceptional character and object consistency across scenes via the "Ingredients to Video" feature, maintaining identities and environments even in complex narratives. This enables seamless multi-shot videos with coherent expressions and backgrounds, perfect for developers integrating veo3.1-text-to-video API into production workflows.

Technical specs include 4K/1080p/720p resolutions, 4-8 second base durations extendable to 60+ seconds or up to 148 seconds via API, 16:9 and native 9:16 aspect ratios, MP4 output at 24 fps, and image/text inputs.
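
The listed specs can be checked client-side before a request is submitted. The sketch below assumes the input field names from the API example (`resolution`, `aspect_ratio`, `duration`); only the value "720p" appears in that example, so the spellings "1080p" and "4k" are assumptions, and `validate_input` is a hypothetical helper rather than part of any SDK.

```python
# Values taken from the technical specs; the "1080p" and "4k" spellings are assumed.
ALLOWED_RESOLUTIONS = {"720p", "1080p", "4k"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}
MIN_SECONDS, MAX_SECONDS = 4, 8  # base clip duration range


def validate_input(params):
    """Return a list of problems; an empty list means params match the specs."""
    problems = []
    if params.get("resolution") not in ALLOWED_RESOLUTIONS:
        problems.append(f"unsupported resolution: {params.get('resolution')!r}")
    if params.get("aspect_ratio") not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {params.get('aspect_ratio')!r}")
    try:
        # The API example passes duration as a string such as "8".
        seconds = int(params.get("duration", ""))
    except (TypeError, ValueError):
        problems.append("duration must be a whole number of seconds")
    else:
        if not MIN_SECONDS <= seconds <= MAX_SECONDS:
            problems.append(f"duration must be {MIN_SECONDS}-{MAX_SECONDS} seconds")
    return problems


print(validate_input({"resolution": "720p", "aspect_ratio": "16:9", "duration": "8"}))
```

Returning a list of problems instead of raising on the first one lets a caller report every invalid field at once.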
Things to be aware of
- Experimental features or behaviors: Audio capabilities are noted as experimental in some contexts.
- Known quirks or edge cases: May struggle with overly complex or abstract prompts.
- Performance considerations: Requires significant computational resources for high-quality outputs.
- Resource requirements: Demands powerful hardware for efficient video generation.
- Consistency factors: Outputs may vary slightly between different runs with the same prompt.
- Positive user feedback themes: Users appreciate the model's realism and ease of use.
- Common concerns or negative feedback patterns: Some users report inconsistencies in audio quality or availability.
Key considerations
- Important factors to keep in mind: Ensure clear and specific text prompts for optimal results.
- Best practices for optimal results: Use detailed descriptions and reference images when available.
- Common pitfalls to avoid: Overly vague prompts can lead to inconsistent outputs.
- Quality vs speed trade-offs: Models like Veo 3.1 Fast offer faster generation at a lower cost but may compromise slightly on quality.
- Prompt engineering tips: Use descriptive language and specify desired audio elements for better synchronization.
Limitations
- Primary technical constraints: Base generations are limited in length (e.g., 8 seconds for some configurations); longer videos require extending clips.
- Main scenarios where it may not be optimal: Struggles with very abstract or complex prompts, and may not be ideal for real-time video generation due to computational demands.

Related models

4 models

- Google Gemini Omni Flash · Text to Video (Google)
- Bytedance Seedance 2.0 Text to Video · Fast (Bytedance)
- Luma Ray 3.2 · Text to Video (Luma)
- Ltx v2.3 · Text to Video (LTX)

FAQ

About Veo 3.1 · Text to Video

What is Veo 3.1 text-to-video and what makes it Google's most capable video model?

Veo 3.1 text-to-video is Google's latest text-to-video generation model, delivering cinematic-quality video clips from natural language descriptions with advanced scene understanding, realistic motion physics, and high temporal coherence. It supports diverse video styles and is suitable for advertising, storytelling, product marketing, and professional content production.
