MMAudio
MMAudio generates synchronized audio given video and/or text inputs.
Prerequisites
- Create an API Key from the Eachlabs Console
- Install the required dependencies for your chosen language (e.g., pip install requests for Python)
API Integration Steps
1. Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
import requests
import time

API_KEY = "YOUR_API_KEY"  # Replace with your API key

HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json"
}

def create_prediction():
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "mmaudio",
            "version": "0.0.1",
            "input": {
                "seed": "-1",
                "video": "your_file.video/mp4",
                "prompt": "your prompt here",
                "duration": "8",
                "num_steps": "25",
                "cfg_strength": "4.5",
                "negative_prompt": "music"
            }
        }
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    return prediction["predictionID"]
2. Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
def get_prediction(prediction_id):
    while True:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS
        ).json()
        if result["status"] == "success":
            return result
        elif result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        time.sleep(1)  # Wait before polling again
3. Complete Example
Here's a complete example that puts it all together, including error handling and result processing. This shows how to create a prediction and wait for the result in a production environment.
try:
    # Create prediction
    prediction_id = create_prediction()
    print(f"Prediction created: {prediction_id}")

    # Get result
    result = get_prediction(prediction_id)
    print(f"Output URL: {result['output']}")
    print(f"Processing time: {result['metrics']['predict_time']}s")
except Exception as e:
    print(f"Error: {e}")
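Once the prediction succeeds, result['output'] contains a URL to the generated file. As a follow-up sketch, you could save it locally; download_output() is a hypothetical helper that assumes the output URL is directly downloadable and reuses the requests import from above:

def download_output(output_url, filename="mmaudio_output.mp4"):
    # Stream the generated file to disk; assumes the URL requires no extra auth
    response = requests.get(output_url, stream=True)
    response.raise_for_status()
    with open(filename, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    return filename

# Example usage after get_prediction() returns:
# download_output(result["output"])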
Additional Information
- The API uses a two-step process: create prediction and poll for results
- Response time: ~5 seconds
- Rate limit: 60 requests/minute
- Concurrent requests: 10 maximum
- Use long-polling to check prediction status until completion (a backoff sketch follows this list)
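Given the 60 requests/minute rate limit, a polling loop that backs off gradually stays well under the cap even with several concurrent predictions. A minimal sketch, reusing requests, time, and HEADERS from the integration steps; the HTTP 429 handling is an assumption, since this page does not document how rate-limit errors are signaled:

def get_prediction_with_backoff(prediction_id, initial_delay=1.0, max_delay=10.0):
    delay = initial_delay
    while True:
        response = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS
        )
        if response.status_code == 429:  # assumed rate-limit signal: back off and retry
            delay = min(delay * 2, max_delay)
            time.sleep(delay)
            continue
        result = response.json()
        if result["status"] == "success":
            return result
        if result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # progressively slow the polling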
Overview
MMAudio is an innovative multi-modal AI model designed to analyze, process, and generate audio data with advanced capabilities. By integrating state-of-the-art techniques in audio analysis and synthesis, MMAudio supports tasks such as transcription, audio classification, and text-to-audio generation. Its versatility makes it ideal for applications in media, research, and interactive systems.
Technical Specifications
- Architecture: Combines convolutional neural networks (CNNs) with transformer-based architectures for robust audio analysis and synthesis.
- Supported Tasks:
  - Audio transcription and classification
  - Text-to-audio generation
  - Audio enhancement and denoising
- Dataset Training: Trained on diverse audio datasets including speech, music, and environmental sounds.
Key Considerations
- Video Quality: Use high-resolution videos for better audio alignment.
- Prompt Clarity: Ambiguous prompts may lead to less desirable outcomes. Be descriptive and precise.
- Processing Time: Higher num_steps improves quality but increases processing time (see the payload sketch after this list).
- Negative Prompt Usage: Avoid distractions by specifying what not to include in the audio.
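To make the quality/speed trade-off concrete, here is a minimal sketch of two input payloads for the create_prediction() request above. The prompt text and step values are illustrative assumptions; only the field names and placeholder video value come from the earlier example.

# Draft pass: fewer diffusion steps for quick iteration
draft_input = {
    "seed": "-1",
    "video": "your_file.video/mp4",
    "prompt": "rain falling on a tin roof, distant thunder",
    "duration": "8",
    "num_steps": "25",          # faster, lower fidelity
    "cfg_strength": "4.5",
    "negative_prompt": "music, human voices"
}

# Final pass: reuse the same payload with more steps once the prompt is dialed in
final_input = dict(draft_input, num_steps="50")  # slower, higher quality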
Tips & Tricks
- Optimize CFG Strength:
  - High values (e.g., 10): Strict adherence to the prompt.
  - Low values (e.g., 2-5): More creative and flexible outputs.
- Leverage Negative Prompts: To refine results, use phrases like "no human voices" or "no loud background music."
- Experiment with Seeds: Fixed seeds ensure repeatability, while varying seeds can inspire new outcomes.
- Balance Steps and Speed: Start with moderate num_steps (e.g., 50) for efficiency and adjust based on quality needs (a cfg_strength sweep sketch follows this list).
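A quick way to explore these settings is to sweep cfg_strength while holding the seed fixed, so prompt guidance is the only variable between runs. A minimal sketch, reusing HEADERS from the integration example; create_prediction_with_input() is a hypothetical variant of create_prediction() that takes the input dict as a parameter:

def create_prediction_with_input(model_input):
    # Hypothetical helper: same request as create_prediction(), parameterized input
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={"model": "mmaudio", "version": "0.0.1", "input": model_input}
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    return prediction["predictionID"]

base_input = {
    "seed": "42",                    # fixed seed for repeatable comparisons
    "video": "your_file.video/mp4",
    "prompt": "waves crashing on a rocky shore",
    "duration": "8",
    "num_steps": "25",
    "negative_prompt": "music"
}

# Higher cfg_strength follows the prompt more strictly; lower is more creative
for cfg in ("2.5", "4.5", "7.0", "10"):
    prediction_id = create_prediction_with_input(dict(base_input, cfg_strength=cfg))
    print(f"cfg_strength={cfg}: prediction {prediction_id}")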
Capabilities
- Audio for Silent Films: Enhance silent footage with contextual soundscapes.
- Nature Ambiance: Generate immersive environmental audio for landscapes and wildlife videos.
- Content Creation: Add professional-quality sound to video projects.
- Virtual Reality: Create synchronized audio for VR environments, boosting immersion.
What can I use it for?
- Media Production: Automate the addition of soundtracks to silent videos, enriching content without manual audio editing.
- Gaming and VR: Create immersive environments by generating context-specific audio that responds dynamically to visual cues.
- Educational Content: Enhance instructional videos with appropriate sound effects, aiding in better comprehension and engagement.
Things to be aware of
- Silent Film Enhancement: Apply MMAudio to silent films to generate authentic soundtracks, revitalizing classic cinema.
- Nature Documentary Soundscapes: Use the model to add realistic environmental sounds to nature footage, creating an immersive experience.
- Action Sequence Audio: Generate dynamic sound effects for action scenes in videos, enhancing excitement and realism.
- Custom Narration: Input textual descriptions to produce corresponding audio narrations, useful for documentaries and presentations.
Limitations
- Complex Scenes: May encounter challenges when processing videos with rapid scene changes or intricate visual details.
- Unique Sound Effects: Certain distinctive sound effects might require additional customization beyond the model's standard capabilities.
- Resource Intensive: Processing high-resolution videos can be computationally demanding.
- Output Format: MP4