MOCHI
Mochi 1 preview is an open, state-of-the-art video generation model that shows high-fidelity motion and strong prompt adherence in preliminary evaluations.
Avg Run Time: 261.000s
Model Slug: mochi-1
Playground
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
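As a sketch, the create step can look like the snippet below. The endpoint URL, the `X-API-Key` header, the `model`/`input` body layout, and the `id` response field are illustrative assumptions, not the confirmed Eachlabs schema; check the API reference for the exact request shape.

```python
import json
import urllib.request

API_URL = "https://api.eachlabs.ai/v1/prediction"  # assumed endpoint


def build_prediction_request(prompt, fps=30, num_frames=162, seed=None):
    """Assemble the JSON body for a create-prediction call.

    The parameter keys here are illustrative assumptions.
    """
    inputs = {"prompt": prompt, "fps": fps, "num_frames": num_frames}
    if seed is not None:
        inputs["seed"] = seed
    return {"model": "mochi-1", "input": inputs}


def create_prediction(api_key, prompt, **params):
    """POST the request and return the new prediction's ID (stdlib only)."""
    body = json.dumps(build_prediction_request(prompt, **params)).encode()
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "X-API-Key": api_key,  # header name is an assumption
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]  # response field name is an assumption
```

Keeping the body builder separate from the HTTP call makes the request shape easy to inspect and test without touching the network.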
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
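The polling loop can be sketched as below. The `status` vocabulary (`"success"`, `"error"`) is an assumption about the API; `fetch_status` is any callable that takes a prediction ID and returns the prediction JSON as a dict, so the loop itself stays independent of the HTTP client.

```python
import time


def wait_for_result(prediction_id, fetch_status, poll_interval=5.0,
                    max_polls=120):
    """Poll until a prediction finishes or the poll budget runs out.

    The "success"/"error" status values are assumptions about the
    API's vocabulary -- check the Eachlabs API reference.
    """
    for _ in range(max_polls):
        data = fetch_status(prediction_id)
        status = data.get("status")
        if status == "success":
            return data  # expected to carry the MP4 output URL
        if status == "error":
            raise RuntimeError(f"prediction {prediction_id} failed: {data}")
        time.sleep(poll_interval)
    raise TimeoutError(f"prediction {prediction_id} did not finish in time")
```

In practice `fetch_status` would wrap an authenticated GET against the prediction-result endpoint; injecting it also makes the loop trivial to test with a stub.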
Readme
Overview
mochi-1 — Text to Video AI Model
Developed by Genmo as part of the mochi family, mochi-1 is a state-of-the-art open-source text-to-video AI model that generates high-fidelity short videos with exceptional prompt adherence and smooth motion dynamics. This 10-billion-parameter diffusion model, powered by the Asymmetric Diffusion Transformer (AsymmDiT) architecture, excels in producing photorealistic clips up to 5.4 seconds at 30 fps and 480p resolution (640x480), making it ideal for creators seeking precise Genmo text-to-video outputs without proprietary limitations. Whether you're exploring text-to-video AI model options or building custom workflows, mochi-1 stands out for its customizable fine-tuning on personal videos, bridging open-source accessibility with professional-grade results.
Technical Specifications
What Sets mochi-1 Apart
mochi-1 differentiates itself through its AsymmDiT architecture and AsymmVAE compression (128:1 ratio), enabling efficient high-fidelity video synthesis that outperforms many closed models in motion consistency during bold camera moves like pans and orbits. This allows developers and creators to generate steady, shape-preserving videos quickly via command line or Gradio UI, ideal for mochi-1 API integrations in production pipelines. Its intuitive fine-tuning process supports training on user videos, delivering tailored outputs for unique styles that generic models can't match. Additionally, mochi-1 produces 162-frame clips at 480p with strong prompt precision, supporting artistic filters and nuanced motion for expressive storytelling in advertising or concept art.
- AsymmDiT for top-tier efficiency: Handles complex dynamics like steady camera tracks, enabling seamless text-to-video AI model generation without shaking or distortion.
- Custom fine-tuning: Train on your own videos for personalized high-fidelity results, perfect for bespoke creative experiments.
- Prompt adherence and compression: Ensures accurate, fast outputs at 30 fps up to 5.4 seconds, with Apache 2.0 openness for full customization.
Key Considerations
- Frame Count Limitations: Mochi-1 supports a range of 30-170 frames. Exceeding these limits may result in errors or degraded performance.
- Frame Rate (FPS): Set between 10-60 FPS for smooth playback. Higher FPS values require additional computational power.
- Guidance Scale: Ranges from 1 to 10, controlling the adherence to the textual prompt. Extreme values may reduce output quality.
- Prompt Strength: Adjustable between 0 and 1; controls the influence of an image-based prompt relative to the text prompt.
- Seed Consistency: The seed value determines output reproducibility. Keep it consistent for identical results across runs.
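A quick client-side check against the ranges above can catch bad inputs before a paid run. This is a minimal sketch; the parameter names are illustrative and may differ from the hosted API's keys.

```python
def validate_inputs(num_frames, fps, guidance_scale, prompt_strength):
    """Raise ValueError if any value falls outside the documented range."""
    checks = [
        ("num_frames", num_frames, 30, 170),
        ("fps", fps, 10, 60),
        ("guidance_scale", guidance_scale, 1, 10),
        ("prompt_strength", prompt_strength, 0.0, 1.0),
    ]
    for name, value, lo, hi in checks:
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} is outside [{lo}, {hi}]")
```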
Tips & Tricks
How to Use mochi-1 on Eachlabs
Access mochi-1 seamlessly through Eachlabs' Playground for instant text-to-video generation—enter a detailed prompt, adjust styles or motion via sliders, and generate 480p clips up to 5.4 seconds at 30 fps. Integrate via the mochi-1 API or SDK for scalable apps, supporting fine-tuning inputs and outputs in MP4 format with high prompt fidelity. Eachlabs delivers fast, customizable access to this Genmo powerhouse without setup hassles.
Capabilities
- Text-to-Video: Mochi-1 converts descriptive text into high-quality video clips.
- Customizable Parameters: Provides extensive control over frame count, prompt strength, FPS, and more.
- Reproducibility: Seed control enables consistent outputs for the same configuration.
- Dynamic Visuals: Smooth transitions and coherent sequences.
What Can I Use It For?
Use Cases for mochi-1
Content creators can leverage mochi-1's artistic filters and motion details to animate static images into dynamic sequences, such as converting a product photo into a looping ad clip with smooth pans—saving hours on manual editing for social media campaigns.
Developers building Genmo text-to-video apps fine-tune mochi-1 on branded footage to generate consistent video assets, like "a sleek smartphone orbiting on a futuristic desk with neon glows," ensuring prompt-aligned outputs for e-commerce demos without external tools.
Marketers use its camera motion handling for professional shorts, inputting prompts like "a dancing peacock in a neon jungle with steady zoom-in," to craft vibrant, watermark-free visuals for ads that maintain object integrity across frames.
Designers experiment with its 480p 5.4-second clips for storyboarding, fine-tuning on reference videos to produce nuanced, expressive animations tailored to concept art needs in advertising workflows.
Things to Be Aware Of
- Creative Storytelling: Use vivid and imaginative prompts to craft compelling narratives.
- Dynamic Compositions: Experiment with various FPS and frame counts to suit different styles.
- Prompt Strength Balance: Adjust the image and text prompt strengths for hybrid inspirations.
- Reproducibility: Use a fixed seed to iterate on a consistent baseline.
Limitations
- Prompt Sensitivity: Ambiguous or overly complex prompts may result in inconsistent outputs.
- Balance Challenge: Finding the ideal parameter configuration may require multiple iterations.
- Output Consistency: While seeds ensure reproducibility, varying parameter combinations may lead to unexpected results.
Output Format: MP4
Pricing
Pricing Detail
This model runs at a cost of $0.001677 per second.
The average execution time is 261 seconds, but this may vary depending on your input data.
The average cost per run is $0.437827.
Pricing Type: Execution Time
Cost Per Second means the total cost is calculated based on how long the model runs. Instead of paying a fixed fee per run, you are charged for every second the model is actively processing. This pricing method provides flexibility, especially for models with variable execution times, because you only pay for the actual time used.
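Estimating a run's cost is a single multiplication. Note that the listed per-second rate is rounded, so the product below differs from the quoted average of $0.437827 by a fraction of a cent.

```python
COST_PER_SECOND = 0.001677  # USD, listed (rounded) rate for mochi-1


def estimated_cost(run_seconds):
    """Estimated charge for one run under execution-time pricing."""
    return run_seconds * COST_PER_SECOND


# At the 261 s average: 261 * 0.001677 = ~$0.4377 per run.
```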
Related AI Models
You can seamlessly integrate advanced AI capabilities into your applications without the hassle of managing complex infrastructure.
