MMAudio

MMAudio generates synchronized audio given video and/or text inputs.

L40S 45GB · Fast Inference · REST API

Model Information

  • Response Time: ~5 sec
  • Status: Active
  • Version: 0.0.1
  • Updated: 12 days ago

Prerequisites

  • Create an API Key from the Eachlabs Console
  • Install the required dependencies for your chosen language (e.g., requests for Python; a minimal setup sketch follows below)
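
As a minimal setup sketch, the snippet below reads the API key from an environment variable instead of hard-coding it. The variable name EACHLABS_API_KEY is chosen for this example and is not an official convention.

import os
import requests  # install with: pip install requests

# EACHLABS_API_KEY is an example name; export it in your shell first.
API_KEY = os.environ["EACHLABS_API_KEY"]

HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json",
}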

API Integration Steps

1. Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.

import requests
import time

API_KEY = "YOUR_API_KEY"  # Replace with your API key

HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json"
}

def create_prediction():
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "mmaudio",
            "version": "0.0.1",
            "input": {
                "seed": "-1",              # -1 picks a random seed
                "video": "your_file.mp4",  # URL of the input video (MP4)
                "prompt": "your prompt here",
                "duration": "8",           # output length in seconds
                "num_steps": "25",
                "cfg_strength": "4.5",
                "negative_prompt": "music"
            }
        }
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    return prediction["predictionID"]

2. Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.

def get_prediction(prediction_id):
    while True:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS
        ).json()
        if result["status"] == "success":
            return result
        elif result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        time.sleep(1)  # Wait before polling again

3. Complete Example

Here's a complete example that puts it all together, including error handling and result processing. This shows how to create a prediction and wait for the result in a production environment.

try:
    # Create prediction
    prediction_id = create_prediction()
    print(f"Prediction created: {prediction_id}")

    # Get result
    result = get_prediction(prediction_id)
    print(f"Output URL: {result['output']}")
    print(f"Processing time: {result['metrics']['predict_time']}s")
except Exception as e:
    print(f"Error: {e}")

Additional Information

  • The API uses a two-step process: create prediction and poll for results
  • Response time: ~5 seconds
  • Rate limit: 60 requests/minute
  • Concurrent requests: 10 maximum (a client-side throttling sketch follows this list)
  • Use long-polling to check prediction status until completion
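
Given the limits above, a client-side throttle can keep you under both caps. The sketch below is illustrative and not part of any Eachlabs SDK; the helper name throttled is made up for this example.

import threading
import time

MAX_CONCURRENT = 10   # concurrency cap from the list above
MIN_INTERVAL = 1.0    # 60 requests/minute -> at most one request per second

_slots = threading.Semaphore(MAX_CONCURRENT)
_pace = threading.Lock()
_last_start = 0.0

def throttled(call, *args, **kwargs):
    """Run call() while respecting the documented rate and concurrency limits."""
    global _last_start
    with _slots:       # never more than 10 requests in flight
        with _pace:    # space request starts ~1 second apart
            wait = MIN_INTERVAL - (time.time() - _last_start)
            if wait > 0:
                time.sleep(wait)
            _last_start = time.time()
        return call(*args, **kwargs)

# Example usage: prediction_id = throttled(create_prediction)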

Overview

MMAudio is an innovative multi-modal AI model designed to analyze, process, and generate audio data with advanced capabilities. By integrating state-of-the-art techniques in audio analysis and synthesis, MMAudio supports tasks such as transcription, audio classification, and text-to-audio generation. Its versatility makes it ideal for applications in media, research, and interactive systems.

Technical Specifications

  • Architecture: Combines convolutional neural networks (CNNs) with transformer-based architectures for robust audio analysis and synthesis.
  • Supported Tasks:
    • Audio transcription and classification
    • Text-to-audio generation (a text-only request sketch follows this list)
    • Audio enhancement and denoising
  • Dataset Training: Trained on diverse audio datasets including speech, music, and environmental sounds.
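
Because MMAudio accepts video and/or text inputs, a text-only request would presumably omit the video field. That omission is an assumption in the sketch below; verify the exact input contract in the Eachlabs Console.

def create_text_only_prediction(prompt):
    # Assumption: leaving out "video" selects pure text-to-audio generation.
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "mmaudio",
            "version": "0.0.1",
            "input": {
                "seed": "-1",
                "prompt": prompt,
                "duration": "8",
                "num_steps": "25",
                "cfg_strength": "4.5",
                "negative_prompt": "music"
            }
        }
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    return prediction["predictionID"]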

Key Considerations

  • Video Quality: Use high-resolution videos for better audio alignment.
  • Prompt Clarity: Ambiguous prompts may lead to less desirable outcomes. Be descriptive and precise.
  • Processing Time: Higher num_steps improves quality but increases processing time (see the presets sketched after this list).
  • Negative Prompt Usage: Avoid distractions by specifying what not to include in the audio.
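
One way to act on these trade-offs is to keep two input presets, one fast and one high-quality. The values below are illustrative starting points, not tuned recommendations; merge them into the "input" payload shown earlier.

# Illustrative presets for the quality/speed trade-off described above.
FAST_INPUT = {
    "num_steps": "25",      # fewer steps: quicker, rougher audio
    "cfg_strength": "4.5",
}

QUALITY_INPUT = {
    "num_steps": "50",      # more steps: better quality, slower
    "cfg_strength": "4.5",
    "negative_prompt": "music, human voices",  # exclude unwanted elements
}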

Tips & Tricks

  • Optimize CFG Strength:
    • High values (e.g., 10): Strict adherence to the prompt.
    • Low values (e.g., 2-5): More creative and flexible outputs.
  • Leverage Negative Prompts: To refine results, use phrases like "no human voices" or "no loud background music."
  • Experiment with Seeds: Fixed seeds ensure repeatability, while varying seeds can inspire new outcomes (a seed-sweep sketch follows this list).
  • Balance Steps and Speed: Start with moderate num_steps (e.g., 50) for efficiency and adjust based on quality needs.
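
To put the seed advice into practice, submit the same inputs under a few fixed seeds and compare the results. The loop below is only a sketch; create_prediction_with is a hypothetical helper that posts a custom input dict, and the prompt is a placeholder.

for seed in ("1", "42", "1234"):
    inputs = {
        "seed": seed,              # fixed seed -> reproducible output
        "video": "your_file.mp4",
        "prompt": "your prompt here",
        "duration": "8",
        "num_steps": "50",         # moderate step count per the tip above
        "cfg_strength": "4.5",
        "negative_prompt": "music",
    }
    # prediction_id = create_prediction_with(inputs)  # hypothetical helper
    print(f"would submit seed={seed}")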


Capabilities

  • Audio for Silent Films: Enhance silent footage with contextual soundscapes.
  • Nature Ambiance: Generate immersive environmental audio for landscapes and wildlife videos.
  • Content Creation: Add professional-quality sound to video projects.
  • Virtual Reality: Create synchronized audio for VR environments, boosting immersion.


What can I use it for?

  • Media Production: Automate the addition of soundtracks to silent videos, enriching content without manual audio editing.
  • Gaming and VR: Create immersive environments by generating context-specific audio that responds dynamically to visual cues.
  • Educational Content: Enhance instructional videos with appropriate sound effects, aiding in better comprehension and engagement.

Things to be aware of

  • Silent Film Enhancement: Apply MMAudio to silent films to generate authentic soundtracks, revitalizing classic cinema.
  • Nature Documentary Soundscapes: Use the model to add realistic environmental sounds to nature footage, creating an immersive experience.
  • Action Sequence Audio: Generate dynamic sound effects for action scenes in videos, enhancing excitement and realism.
  • Custom Narration: Input textual descriptions to produce corresponding audio narrations, useful for documentaries and presentations.

Limitations

  • Complex Scenes: May encounter challenges when processing videos with rapid scene changes or intricate visual details.
  • Unique Sound Effects: Certain distinctive sound effects might require additional customization beyond the model's standard capabilities.
  • Resource Intensive: Processing high-resolution videos can be computationally demanding.
  • Output Format: MP4

Related AI Models

video-retalking

Audio Based Lip Synchronization (Video to Video)