Alibaba HappyHorse 1.0 · Text to Video

Video·happyhorse-1.0·by Alibaba

Creates video sequences from text descriptions with smooth motion and cinematic control, offering precise frame-level artistic direction.

Runtime (p50): 3m
Estimated price: From $0.14
Call the API (prediction.sh)
curl -X POST \
  -H "X-API-Key: $EACHLABS_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "alibaba-happyhorse-1-0-text-to-video",
    "version": "0.0.1",
    "input": {
        "ratio": "16:9",
        "prompt": "Shot 1 (wide, 0–2s): A vast LEGO city sits in complete darkness. A single street lamp flickers on, then another, then another — warm amber light spreading block by block across cobblestone streets and brick buildings, the city slowly awakening from the dark.\\nShot 2 (low angle, 2–4s): The camera glides at ground level through a narrow LEGO street. A minifigure baker steps out of a shop doorway carrying a tiny bread tray, children run past a park bench, a police officer raises one stiff arm to direct traffic — all moving with the subtle jerkiness of stop-motion life.\\nShot 3 (tracking, 4–6s): A LEGO train rounds a brick corner on its tracks, headlights cutting through the dark. The camera tracks alongside it in slow motion, amber light flickering through its tiny windows, miniature passengers faintly visible inside each compartment.\\nShot 4 (aerial, 6–8s): The camera rises slowly straight up above the entire city, revealing the full glowing LEGO world from above — every street lamp lit, every window warm, tiny minifigures still moving below, the whole miniature world alive and breathing against the dark.\\nMacro cinematic lens, shallow depth of field, rich plastic textures, soft night glow, slow deliberate camera movement throughout",
        "watermark": false,
        "duration": 5,
        "resolution": "1080P"
    },
    "webhook_url": ""
}' \
  https://api.eachlabs.ai/v1/prediction/
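The request above hard-codes the prompt inside the JSON body. The following is a minimal sketch of building the same body from shell arguments, assuming single-line prompts. `json_escape` and `build_payload` are hypothetical helper names; the field names and the model and version values are copied from the example request, not from a published schema.

```sh
#!/bin/sh
# Escape backslashes and double quotes so the prompt is safe inside a JSON string.
# Assumption: the prompt is a single line with no control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_payload PROMPT DURATION RESOLUTION
# Prints a request body with the same fields as the example request above.
build_payload() {
  prompt=$(json_escape "$1")
  printf '{"model":"alibaba-happyhorse-1-0-text-to-video","version":"0.0.1","input":{"ratio":"16:9","prompt":"%s","watermark":false,"duration":%s,"resolution":"%s"},"webhook_url":""}' \
    "$prompt" "$2" "$3"
}

# Usage (sends a real, billable request, so it is left commented out):
# build_payload 'A LEGO train rounds a brick corner at night' 5 1080P |
#   curl -X POST \
#     -H "X-API-Key: $EACHLABS_API_KEY" \
#     -H "Content-Type: application/json" \
#     --data @- \
#     https://api.eachlabs.ai/v1/prediction/
```

Piping the body into `curl --data @-` keeps the prompt out of the command line, so quotes in the prompt only need JSON escaping and not an extra layer of shell quoting.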
Documentation (8 sections)
  • Overview

    Alibaba | HappyHorse 1.0 | Text to Video Overview

    Alibaba | HappyHorse 1.0 | Text to Video generates high-quality video sequences from text descriptions, with synchronized native audio produced in the same pass. Developed by Alibaba's ATH Innovation Division and Token Hub, this 15B-parameter model from the HappyHorse family handles both text-to-video and image-to-video tasks. It tops the Artificial Analysis Video Arena leaderboard and is noted for its lip-sync, cinematic motion, and multilingual support.

    Unlike traditional two-stage pipelines, HappyHorse 1.0 uses a unified 40-layer Self-Attention Transformer to process text, video, and audio tokens simultaneously, delivering precise frame-level control and natural human-centric performances. This makes it ideal for creators seeking smooth, realistic videos with dialogue in languages like English, Mandarin, Japanese, Korean, German, and French. Available via Alibaba's HappyHorse AI Video Platform, it supports 1080p outputs for short cinematic clips, with API access planned soon.

  • Capabilities

    Capabilities

    • Generates 1080p videos up to 10 seconds from text prompts with native audio, including dialogue and effects in one pass.
    • Image-to-video consistency, preserving character identity from reference images.
    • Phoneme-level lip-sync in 6 languages: Chinese, English, Japanese, Korean, German, French.
    • Cinematic camera controls (pans, zooms, follows) and smooth human motion.
    • Human-centric excellence: delicate facial performances, realistic body dynamics, natural speech coordination.
    • Single-model handling of text-to-video and image-to-video without specialized variants.
    • Fast 8-step inference for efficient short-clip production.
    • Top-ranked on Artificial Analysis leaderboards for text-to-video (1333 Elo) and image-to-video (1392 Elo).
  • Use cases

    Use Cases for Alibaba | HappyHorse 1.0 | Text to Video

    Content Creators (Virtual Streamers): Produce short AI micro-dramas with multilingual dialogue. Example: "A virtual streamer in Japanese reacts excitedly to game news, webcam angle, perfect lip sync." Leverages phoneme-level sync and facial performance.

    Marketers: Create cross-lingual promotional videos. Example: "German businessman pitches a product on stage, confident gestures, audience applause audio, cinematic zoom." Uses native audio and motion controls for engaging ads.

    Developers: Integrate via upcoming Alibaba | HappyHorse 1.0 | Text to Video API for app prototypes. Example: "English tutor explains math, animated whiteboard behind, clear speech sync." Benefits from single-pass efficiency and language support.

    Designers: Storyboard character-focused segments. Example: "French chef demonstrates recipe, close-up hands and face, ambient kitchen sounds." Excels in human motion and lip-sync for precise visuals.

  • Tips & tricks

    Tips and Tricks

    For best results with Alibaba | HappyHorse 1.0 | Text to Video, craft prompts with specific camera movements, facial expressions, and dialogue to leverage its cinematic controls and lip-sync strengths. Include language indicators (e.g., "in Mandarin") and reference human subjects early. Use image-to-video mode for consistent character identity across generations.

    Keep clips to the supported 5 or 10 seconds and focus on one key action per prompt to maintain motion smoothness. If both base and distilled models are available via the Alibaba | HappyHorse 1.0 | Text to Video API, compare their output. Avoid complex backgrounds; emphasize foreground humans.

    Example prompts:

    • "A young woman in a red dress speaks passionately in French about climate change, close-up shot with subtle head tilts, cinematic lighting, smooth pan right."
    • "Image of a samurai warrior: He draws his sword and charges forward in slow motion, shouting in Japanese, dynamic camera follow, 1080p."
    • "English narrator explains AI ethics, professional studio setting, perfect lip sync, steady shot with zoom on face."

    These yield strong phoneme-level sync and natural motion.
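    The advice in this section (name the human subject early, state the spoken language, give one camera direction) can be sketched as a small helper. `compose_prompt` is a hypothetical name, and the phrasing it emits is only one plausible template, not a format the model is known to require.

```sh
#!/bin/sh
# compose_prompt SUBJECT_AND_ACTION LANGUAGE CAMERA
# Puts the human subject first, then the spoken language, then one camera cue.
compose_prompt() {
  printf '%s, speaking in %s, %s.\n' "$1" "$2" "$3"
}

# Example call, modeled on the first example prompt in this section.
compose_prompt "A young woman in a red dress talks about climate change" \
  "French" "close-up shot with a smooth pan right"
```

    Keeping the three parts as separate arguments makes it easy to vary one of them, for example the language, while holding the rest of the prompt fixed.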

  • Technical spec

    Technical Specifications

    • Model Type: Text-to-video and image-to-video with joint audio generation (visuals, dialogue, ambient sounds, Foley effects in one pass)
    • Architecture: 40-layer single-stream Self-Attention Transformer (no Cross-Attention); processes unified token sequence for all modalities
    • Parameters: 15B
    • Inference: 8 denoising steps, no CFG required
    • Resolution: Up to 1080p (confirmed), 720p promotional examples
    • Duration: 5 or 10 seconds
    • Language Support: Chinese, English, Japanese, Korean, German, French (phoneme-level lip-sync)
    • Input: Text prompts, optional reference images
    • Output: Video with native audio
    • Processing: Fast inference due to 8-step denoising; promotional pricing at RMB 2.2 for 720p 5s video

    These specs enable efficient, high-fidelity generation on Alibaba | HappyHorse 1.0 | Text to Video API.
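    The duration and resolution limits listed above can be checked on the client before a request is sent. This is a minimal sketch with hypothetical function names. The `"1080P"` spelling comes from the example request, while `"720P"` is an assumed spelling for the 720p option.

```sh
#!/bin/sh
# Return 0 when the duration is one of the two documented clip lengths.
valid_duration() {
  case "$1" in
    5|10) return 0 ;;
    *)    return 1 ;;
  esac
}

# Return 0 for the resolution strings this sketch accepts.
# "1080P" appears in the example request; "720P" is an assumed spelling.
valid_resolution() {
  case "$1" in
    720P|1080P) return 0 ;;
    *)          return 1 ;;
  esac
}

# check_input DURATION RESOLUTION
# Prints a message to stderr and fails on the first unsupported value.
check_input() {
  valid_duration "$1"   || { echo "unsupported duration: $1" >&2; return 1; }
  valid_resolution "$2" || { echo "unsupported resolution: $2" >&2; return 1; }
}
```

    For example, `check_input 5 1080P` succeeds, while `check_input 7 1080P` fails with a message on stderr.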

  • Things to be aware of

    Things to Be Aware Of

    Alibaba | HappyHorse 1.0 | Text to Video may underperform on non-human scenes such as landscapes, where motion coherence drops compared to human subjects. A common mistake is a vague prompt that lacks dialogue or camera cues, which leads to generic output; always specify actions and languages. Edge cases such as rapid multi-character interactions can cause sync issues.

    Resource needs are moderate thanks to 8-step inference, but beta access limits scale. Requests for durations over 10 seconds are clipped automatically, and audio quality varies with complex Foley. Test iteratively for optimal lip-sync in multilingual use.
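    Rather than relying on clipping, a client can map a requested length onto one of the two documented durations before sending. The sketch below is a client-side assumption and does not describe the service's own behavior. `snap_duration` is a hypothetical name, and it expects a whole number of seconds.

```sh
#!/bin/sh
# snap_duration SECONDS
# Maps a requested length (whole seconds) onto the documented 5 or 10 second options:
# anything up to 5 becomes 5, anything longer becomes 10.
snap_duration() {
  if [ "$1" -le 5 ]; then
    echo 5
  else
    echo 10
  fi
}
```

    Rounding up means an 8-second storyboard is requested as a 10-second clip, so the model has room for the full action.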

  • Key considerations

    Key Considerations

    Before using Alibaba | HappyHorse 1.0 | Text to Video, note its focus on short human-centric clips with audio, which makes it best for scenarios that need lip-sync and motion realism rather than landscapes or long-form content. It requires clear, descriptive prompts that emphasize character actions and dialogue. The model is currently in internal beta with an API forthcoming; access is via Alibaba's HappyHorse platform, where new users get points and promotions apply (e.g., 30% off until May 10).

    Performance shines at 1080p for 5-10s videos, but tradeoffs include limited duration and potential variability in non-human scenes. Prioritize it over alternatives for multilingual talking-head videos; for product shots, other models may suit better.

  • Limitations

    Limitations

    Alibaba | HappyHorse 1.0 | Text to Video is constrained to 5 or 10 second clips, which makes it unsuitable for longer narratives. It prioritizes human-centric content and struggles to render landscapes or product shots with realistic motion. There is no confirmed support yet for durations beyond 10 seconds or resolutions above 1080p. The model is currently in beta with the full API pending, and its leaderboard lead is smaller for audio than for visuals.
