MM-AUDIO
MMAudio generates synchronized audio given video and/or text inputs.
Avg Run Time: 5.000s
Model Slug: mmaudio
Playground
Input
Enter a URL or choose a file from your computer.
video/mp4 (Max 50MB)
Output
Example Result
Preview and download your result.
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
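A minimal sketch of the create step using only the standard library. The endpoint URL, `X-API-Key` header, and request field names (`model`, `input`, `num_steps`, `negative_prompt`) are illustrative assumptions, not the documented Eachlabs schema — check the API & SDK reference for the exact shape.

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # replace with your Eachlabs API key

def build_prediction_request(video_url, prompt, num_steps=25, negative_prompt=""):
    """Assemble the JSON body for a create-prediction call.

    Field names here are assumptions mirroring the inputs described on
    this page, not a documented schema.
    """
    return {
        "model": "mmaudio",
        "input": {
            "video": video_url,
            "prompt": prompt,
            "negative_prompt": negative_prompt,
            "num_steps": num_steps,
        },
    }

def create_prediction(endpoint, body):
    """POST the body and return the parsed JSON response (incl. prediction ID)."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json", "X-API-Key": API_KEY},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    body = build_prediction_request(
        "https://example.com/clip.mp4", "gentle forest ambience"
    )
    # create_prediction("<your-endpoint-url>", body)  # returns {"id": ...}
```

The returned prediction ID is what you pass to the result endpoint in the next step.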
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. Repeat the request at short intervals until you receive a success status.
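A polling loop along these lines would work; the result URL, auth header, and response fields (`status`, with terminal values `success`/`error`) are assumptions, so consult the Eachlabs API reference for the real schema.

```python
import json
import time
import urllib.request

def is_terminal(status):
    """Return True when polling should stop (assumed terminal statuses)."""
    return status in ("success", "error")

def poll_prediction(result_url, api_key, interval=1.0, timeout=120.0):
    """Repeatedly GET the prediction endpoint until a terminal status.

    result_url should already include the prediction ID; the response
    fields used here are assumptions, not a documented schema.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        req = urllib.request.Request(result_url, headers={"X-API-Key": api_key})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        if is_terminal(data.get("status")):
            return data
        time.sleep(interval)
    raise TimeoutError("prediction did not finish before the timeout")
```

Keep the interval modest (around a second) to avoid hammering the endpoint, and set the timeout comfortably above the ~5 s average run time.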
Readme
Overview
mmaudio — Video-to-Audio AI Model
mmaudio is a multimodal audio generation model that creates synchronized audio from video and/or text inputs. Rather than treating video-to-audio and text-to-audio as separate tasks, mmaudio unifies both capabilities within a single architecture, enabling developers to generate high-quality, temporally aligned audio for video content, narration, ambient soundscapes, and interactive applications.
The core problem mmaudio solves is the challenge of creating audio that perfectly synchronizes with visual content. Whether you're working with existing video footage or generating audio from text descriptions, temporal misalignment between audio and visual events breaks immersion and reduces perceived quality. mmaudio addresses this through fine-grained temporal synchronization, ensuring every audio event aligns precisely with its corresponding visual moment—critical for video-to-audio AI model applications where even millisecond-level drift becomes noticeable.
This unified approach eliminates the need to manage multiple models or APIs for different input types, making mmaudio ideal for developers building AI video audio synthesis tools that need flexibility across use cases.
Technical Specifications
What Sets mmaudio Apart
Unified Multi-Modal Architecture: Unlike competing models that require separate pipelines for video-to-audio versus text-to-audio generation, mmaudio handles both input types within a single model. This means you can feed video alone, text alone, or both simultaneously—without padding constraints or architectural workarounds. The model intelligently omits absent modalities, reducing latency and simplifying integration for developers building synchronized audio generation API solutions.
Fine-Grained Temporal Synchronization: mmaudio employs Synchformer architecture to extract dense temporal features from video, ensuring audio events align with visual content at frame-level precision. This is particularly critical for speech synchronization, sound effect timing, and music beats—areas where competing models often struggle. The result is audio that feels naturally embedded in the video rather than overlaid on top of it.
Flexible Input Handling: The model supports video-to-audio (V2A), text-to-audio (T2A), and combined video-text-to-audio (VT2A) generation within the same inference call. This flexibility enables use cases ranging from adding ambient audio to silent footage and generating voiceovers from text descriptions to enhancing existing audio with contextual sound design based on visual content.
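The three modes above follow directly from which modalities you supply. A small sketch of that selection logic (the mode names come from this page; the function itself is illustrative, not part of any SDK):

```python
def inference_mode(video=None, text=None):
    """Pick the mmaudio generation mode from the supplied modalities.

    Illustrative helper: the model itself handles absent modalities,
    but this makes the V2A / T2A / VT2A mapping explicit.
    """
    if video and text:
        return "VT2A"  # video + text to audio
    if video:
        return "V2A"   # video to audio
    if text:
        return "T2A"   # text to audio
    raise ValueError("at least one of video or text is required")
```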
Technical Specifications:
- Supports high-resolution video input with temporal feature extraction
- Generates stereo audio output with spatial awareness capabilities
- Achieves professional-grade audio quality (MOS score of 4.21 for acoustic fidelity)
- Maintains semantic and temporal alignment across modalities
- Processes variable-length video and text inputs
Key Considerations
- Video Quality: Use high-resolution videos for better audio alignment.
- Prompt Clarity: Ambiguous prompts may lead to less desirable outcomes. Be descriptive and precise.
- Processing Time: Higher num_steps improves quality but increases processing time.
- Negative Prompt Usage: Exclude unwanted sounds by specifying what should not appear in the audio.
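The considerations above map directly to request inputs. A sketch that bundles them, with a basic guard against one-word prompts; the field names are assumptions based on the parameters mentioned on this page, not a documented schema:

```python
def make_input(prompt, negative_prompt="", num_steps=25, video_url=None):
    """Bundle mmaudio input settings.

    Higher num_steps trades processing time for quality; negative_prompt
    lists sounds to exclude. Field names are assumptions, not a
    documented Eachlabs schema.
    """
    if not prompt or len(prompt.split()) < 3:
        raise ValueError("prompt should be descriptive, not a single word")
    settings = {
        "prompt": prompt,
        "negative_prompt": negative_prompt,  # e.g. "music, speech"
        "num_steps": num_steps,
    }
    if video_url:
        settings["video"] = video_url  # omit for text-only (T2A) generation
    return settings
```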
Tips & Tricks
How to Use mmaudio on Eachlabs
Access mmaudio through Eachlabs via the Playground for interactive testing or through the API and SDK for production integration. Provide video input, text prompts, or both—mmaudio intelligently processes whichever modalities you supply. The model outputs high-fidelity stereo audio synchronized with your visual content. Use Eachlabs' unified interface to experiment with different input combinations, adjust temporal alignment parameters, and integrate mmaudio's synchronized audio generation capabilities directly into your applications.
Capabilities
- Audio for Silent Films: Enhance silent footage with contextual soundscapes.
- Nature Ambiance: Generate immersive environmental audio for landscapes and wildlife videos.
- Content Creation: Add professional-quality sound to video projects.
- Virtual Reality: Create synchronized audio for VR environments, boosting immersion.
What Can I Use It For?
Use Cases for mmaudio
Content Creators & Video Producers: Filmmakers and video editors can use mmaudio to generate synchronized background audio, ambient soundscapes, or sound effects that match visual action without manual timing adjustments. For example, a creator editing a nature documentary can input video footage with a text prompt like "gentle forest ambience with distant bird calls and rustling leaves" and receive perfectly timed audio that enhances the visual narrative without requiring foley recording or manual synchronization.
E-Commerce & Product Marketing: Marketing teams building AI video audio synthesis tools can automatically generate product demonstration videos with synchronized narration and ambient audio. A product video showing a coffee maker in action can be paired with text-to-audio generation ("warm, inviting café ambience with subtle espresso machine sounds") to create professional-quality marketing content without hiring voice actors or sound designers.
Game Developers & Interactive Media: Developers creating dynamic game environments or interactive experiences can leverage mmaudio's flexible input handling to generate contextual audio that responds to both visual scenes and gameplay events. A game level showing a thunderstorm can receive both video input (for visual timing) and text input ("intense thunder with heavy rain") to create immersive, synchronized audio that reacts to player actions and environmental changes.
Accessibility & Content Localization: Media companies can use mmaudio to generate audio descriptions and localized voiceovers for video content. By inputting video footage with text descriptions in multiple languages, teams can efficiently create accessible versions of content with perfectly synchronized narration, reducing production time compared to traditional voice recording and editing workflows.
Things to Be Aware Of
- Silent Film Enhancement: Apply MMAudio to silent films to generate authentic soundtracks, revitalizing classic cinema.
- Nature Documentary Soundscapes: Use the model to add realistic environmental sounds to nature footage, creating an immersive experience.
- Action Sequence Audio: Generate dynamic sound effects for action scenes in videos, enhancing excitement and realism.
- Custom Narration: Input textual descriptions to produce corresponding audio narrations, useful for documentaries and presentations.
Limitations
- Complex Scenes: May encounter challenges when processing videos with rapid scene changes or intricate visual details.
- Unique Sound Effects: Certain distinctive sound effects might require additional customization beyond the model's standard capabilities.
- Resource Intensive: Processing high-resolution videos can be computationally demanding.
- Output Format: MP4
Pricing
Pricing Detail
This model runs at a cost of $0.001080 per second.
The average execution time is 5 seconds, but this may vary depending on your input data.
The average cost per run is $0.005400
Pricing Type: Execution Time
Cost Per Second means the total cost is calculated based on how long the model runs. Instead of paying a fixed fee per run, you are charged for every second the model is actively processing. This pricing method provides flexibility, especially for models with variable execution times, because you only pay for the actual time used.
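The pricing above reduces to a single multiplication. A worked example using the figures quoted on this page:

```python
def run_cost(seconds, rate_per_second=0.001080):
    """Execution-time pricing: total cost = runtime in seconds x per-second rate."""
    return round(seconds * rate_per_second, 6)

# The documented average run: 5 s x $0.001080/s
print(run_cost(5))  # -> 0.0054
```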
Related AI Models
You can seamlessly integrate advanced AI capabilities into your applications without the hassle of managing complex infrastructure.
