EACHLABS
Realistic Voice Cloning v2 (RVC v2) is an advanced voice-to-voice model that transforms an input voice into a chosen target voice with realistic results, accessible through the RVC v2 Web UI on Replicate.
Avg Run Time: 80.000s
Model Slug: train-rvc
Playground
Input
Enter a URL or choose a file from your computer (max 50MB).
Output
Example Result
Preview and download your result.
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
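As a sketch, the request could be built in Python like this; the endpoint URL, header name, and JSON field names below are assumptions for illustration only, so verify them against the API reference before use.

```python
import json
from urllib import request

# Hypothetical endpoint -- confirm the real URL in the API reference.
API_URL = "https://api.eachlabs.ai/v1/prediction"

def build_prediction_request(api_key: str, inputs: dict) -> request.Request:
    """Build (but do not send) a POST request for the train-rvc model."""
    body = json.dumps({"model": "train-rvc", "input": inputs}).encode("utf-8")
    return request.Request(
        API_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

# Sending it would look like (response field name is also an assumption):
# with request.urlopen(build_prediction_request(key, {"audio": "https://..."})) as resp:
#     prediction_id = json.load(resp)["id"]
```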
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long polling: keep checking until you receive a success status.
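A minimal polling loop might look like the following; the URL pattern, header name, and status values are assumptions for illustration, not the documented API shapes.

```python
import json
import time
from urllib import request

def wait_for_prediction(prediction_id: str, api_key: str,
                        poll_interval: float = 2.0, timeout: float = 300.0) -> dict:
    """Poll until the prediction reaches a terminal status (names assumed)."""
    url = f"https://api.eachlabs.ai/v1/prediction/{prediction_id}"  # hypothetical
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        req = request.Request(url, headers={"X-API-Key": api_key})
        with request.urlopen(req) as resp:
            result = json.load(resp)
        if result.get("status") in ("success", "error"):
            return result
        time.sleep(poll_interval)
    raise TimeoutError(f"prediction {prediction_id} not ready after {timeout}s")
```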
Readme
Overview
Realistic Voice Cloning v2 (RVC v2), often referred to as train-rvc in community contexts, is an open-source voice conversion model developed by the AI community, building on Retrieval-based Voice Conversion techniques. It enables high-fidelity transformation of an input voice to a target voice by extracting and applying voice characteristics from training data. The model excels in real-time or near-real-time applications, preserving expressiveness, prosody, and speaker identity while handling dynamic speech elements like pitch variations and emotional nuances.
Key features include customizable training on user-provided audio datasets, support for low-latency inference suitable for streaming, and granular control over parameters like pitch detection and frame processing. Its underlying architecture leverages causal convolutional layers for left-to-right audio processing, adaptive frame buffering for latency optimization, and prosody-aware conditioning that integrates fundamental frequency (F0), energy, and voicing directly into generation. This makes it unique for balancing sub-50ms latency with high expressiveness retention, outperforming traditional pitch-shifting tools in vocal texture and dynamic range preservation.
RVC v2 stands out for its accessibility to users with Python skills and GPU hardware, allowing custom model training that captures speaker-specific quirks like breathiness and glottal fry, which are often lost in simpler voice changers.
Technical Specifications
- Architecture: Retrieval-based Voice Conversion with causal convolutional layers and prosody-aware vocoder
- Parameters: Not publicly specified; optimized for lightweight deployment (efficiency comparable to models in the ~80M-parameter class)
- Resolution: Sample rate should match the training dataset (48kHz recommended); hop lengths of 64-512 trade detail against speed
- Input/Output formats: Raw waveform audio input; spectrogram-aligned output with F0 contour, energy envelope preservation; TorchScript-compatible for inference
- Performance metrics: 42ms median latency (GPU), 4.6/5 expressiveness score; real-time factor up to 2000x in similar efficient models; training epochs 200-400 for optimal quality
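The hop-length range above maps directly to per-frame duration (hop length divided by sample rate), which is the standard DSP relationship rather than anything RVC-specific; a quick check:

```python
SAMPLE_RATE = 48_000  # recommended rate from the specs above

def frame_ms(hop_length: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration of one analysis frame in milliseconds."""
    return 1000.0 * hop_length / sample_rate

# At 48 kHz: hop 64 is ~1.3 ms per frame, hop 512 is ~10.7 ms,
# so smaller hops mean finer temporal detail but more frames to process.
```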
Key Considerations
- Collect 30-60 minutes of clean, varied audio for training; prioritize quality and diversity over quantity to capture emotions, speeds, and ranges
- Match sample rate and hop length to dataset; use Harvest or Crepe for F0 detection, starting with batch size 4-8 based on GPU VRAM
- Monitor training loss and test checkpoints every 10-20 epochs to avoid overfitting, which causes artifacts or degradation after 400-500 epochs
- Balance quality vs speed: lower hop lengths (64-128) for detail at cost of training time; GPU (NVIDIA RTX 3060+) required for sub-50ms latency
- Ensure clear, isolated voice data; avoid background noise or music, as it leads to mumbling, crackling, or inconsistent outputs
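The data checks above can be partially automated before training; this standard-library sketch (using the 30-minute floor and 48kHz target stated above) totals the duration of a folder of WAV files and flags sample-rate mismatches:

```python
import wave
from pathlib import Path

TARGET_SR = 48_000            # match training sample rate to the dataset
MIN_TOTAL_SECONDS = 30 * 60   # 30-minute minimum for usable quality

def audit_dataset(wav_dir: str) -> tuple[float, list[str]]:
    """Return (total seconds of audio, filenames not at TARGET_SR)."""
    total = 0.0
    mismatched = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as w:
            sr = w.getframerate()
            total += w.getnframes() / sr
            if sr != TARGET_SR:
                mismatched.append(path.name)
    return total, mismatched

# total, bad = audit_dataset("dataset/")
# Fix resampling and collect more audio if total < MIN_TOTAL_SECONDS or bad is non-empty.
```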
Tips & Tricks
- Optimal parameter settings: Sample rate 48kHz, hop length 128, epochs 200-400, save frequency every 10-20 epochs; batch size scaled to VRAM (e.g., 4-8 for most GPUs)
- Prompt structuring advice: For inference, condition on F0 scaling, energy envelopes, and harmonic-noise ratios to retain prosody; test with varied inputs matching training data
- How to achieve specific results: For breathy giggles or hesitant reactions, include such examples in the dataset; use adaptive buffering (5-12ms windows) for dynamic speech
- Iterative refinement strategies: Train initial model on 30min data, identify weaknesses (e.g., poor excited speech), add targeted audio, retrain from checkpoint
- Advanced techniques: Combine with real-time cloning pipelines for 42ms latency; critically compare checkpoints (e.g., epoch 100 vs 300) and revert if overfitting detected
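The recommended settings above can be collected into a single training configuration; the key names here are illustrative only, not the actual RVC v2 flags or config schema.

```python
# Hypothetical training configuration mirroring the tips above.
TRAIN_CONFIG = {
    "sample_rate": 48_000,     # match the dataset
    "hop_length": 128,         # lower (64-128) for detail, higher for speed
    "epochs": 300,             # within the recommended 200-400 range
    "save_every_epochs": 10,   # frequent checkpoints for overfitting comparison
    "batch_size": 8,           # scale down to 4 on low-VRAM GPUs
    "f0_method": "harvest",    # or "crepe" for F0 detection
}
```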
Capabilities
- High-fidelity voice cloning capturing timbral warmth, micro-pauses, emotional resonance, and speaker quirks like pitch wobble
- Real-time processing with 42ms median latency on GPU, preserving consonants, dynamic shifts, and prosody better than FFT-based tools
- Excellent expressiveness (4.6/5 score) in streaming, maintaining breathiness, glottal fry, and vocal texture
- Versatile across emotions, speeds, and volumes when trained on diverse data; granular customization of F0, energy, and breath intensity
- Technical strengths in causal processing and spectrogram alignment for low hardware demand relative to performance (RTX 3060+ sufficient)
What Can I Use It For?
- Streaming and live performances, e.g., charity streams where custom models retain authentic giggles and reactions for engaging interactions
- Custom voice model creation for content like videos or games, with users training on 45-90min data for clear pronunciation and emotional variety
- Real-time voice transformation in broadcasts, preserving prosody for jokes, questions, and dynamic speech as reported in user benchmarks
- Personal projects on GitHub-like repos, iterating models for specific weaknesses like excited speech or complex words via dataset expansion
- Technical audio experiments, e.g., low-latency vocoding for professional voice acting tests with blind expressiveness scoring
Things to Be Aware Of
- Experimental real-time behaviors shine in GPU setups but may hit 65-110ms on CPU; users report 25-35% CPU peaks on 8-core systems
- Known quirks: Artifacts like crackling from noisy data or overfitting; mumbling on unclear training audio; inconsistent traits without varied dataset
- Performance from benchmarks: Best balance at 42ms latency/4.6 expressiveness; scales well with NVIDIA GPUs but needs buffer tuning for stability
- Resource requirements: NVIDIA RTX 3060+ or equivalent for sub-50ms; Python fluency for training; negligible bandwidth add (<5kbps)
- Consistency improves with 30-60min diverse data; early checkpoints rough, later ones superior if not overtrained
- Positive feedback: Unmatched control over prosody transfer; high retention of speaker identity in user tests with voice actors
- Common concerns: Mediocre quality on <30min data; diminishing returns >2hrs; verify GPU usage to avoid training failures
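For the last point, a best-effort preflight check that an NVIDIA GPU is actually visible (via the `nvidia-smi` tool, when installed) can catch silent CPU-only training before wasting hours; this is a generic sketch, not part of the model's tooling.

```python
import shutil
import subprocess

def gpu_available() -> bool:
    """Best-effort check that an NVIDIA GPU is visible before training."""
    if shutil.which("nvidia-smi") is None:
        return False  # driver tools not installed
    try:
        subprocess.run(["nvidia-smi"], capture_output=True, check=True)
        return True
    except subprocess.CalledProcessError:
        return False  # tool present but no usable GPU

# if not gpu_available():
#     print("Warning: training will fall back to CPU (65ms+ latency).")
```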
Limitations
- Requires substantial clean, varied training data (min 30min) for usable quality; short or noisy datasets yield mediocre, artifact-prone results
- GPU dependency for real-time low latency; CPU fallback increases delay to 65+ms and reduces practicality for live use
- Risk of overfitting beyond 400-500 epochs, leading to worse performance on unseen inputs despite lower loss
Related AI Models
You can seamlessly integrate advanced AI capabilities into your applications without the hassle of managing complex infrastructure.
