RVC
Voice-to-Voice with RVC v2 converts your spoken voice into any RVC v2-trained AI voice while preserving your tone, emotion, and natural delivery.
Avg Run Time: 0.000s
Model Slug: rvc-v2
Release Date: December 10, 2025
Playground
Input
Enter a URL or choose a file from your computer.
(Max 50MB)
Output
Example Result
Preview and download your result.
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
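The two steps above can be sketched as a minimal Python client. The endpoint paths, JSON field names, and header format below are assumptions for illustration, not the documented API; adapt them to the actual reference before use.

```python
# Hypothetical create-then-poll client sketch. API_BASE, the /predictions
# paths, and the "id"/"status" fields are assumptions, not the real schema.
import json
import time
import urllib.request

API_BASE = "https://api.example.com/v1"  # placeholder base URL

def create_prediction(model_slug, inputs, api_key, opener=urllib.request.urlopen):
    """POST model inputs; return the prediction ID from the response."""
    req = urllib.request.Request(
        f"{API_BASE}/predictions",
        data=json.dumps({"model": model_slug, "input": inputs}).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with opener(req) as resp:
        return json.load(resp)["id"]

def poll_prediction(prediction_id, api_key, opener=urllib.request.urlopen,
                    interval=1.0, timeout=120.0):
    """Repeatedly fetch the prediction until it reaches a final status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        req = urllib.request.Request(
            f"{API_BASE}/predictions/{prediction_id}",
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with opener(req) as resp:
            record = json.load(resp)
        if record.get("status") in ("success", "failed"):
            return record
        time.sleep(interval)
    raise TimeoutError("prediction did not finish in time")
```

The `opener` parameter exists only so the HTTP layer can be swapped out (for retries, testing, or a different client library); `urllib.request.urlopen` is the stdlib default.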
Readme
Overview
RVC v2 refers to the second version of Retrieval-based Voice Conversion (RVC) models, an open-source AI technology for real-time and offline voice conversion. Developed by the RVC community through collaborative efforts on platforms like GitHub, it enables transforming input audio, such as spoken voice or singing, into target voices trained on specific datasets while aiming to preserve prosody, emotion, and intonation. Key contributors include projects like ultimate-rvc and voice-changer repositories that extend RVC v2 capabilities for training, inference, and real-time applications.
Core features include pitch adjustment (transpose), embedder models for speaker identity extraction (e.g., ContentVec, Spin), protection for voiceless consonants to reduce artifacts, volume envelope matching, and split audio processing for consistent output. It supports voice-to-voice conversion, text-to-speech with trained models, and multi-step pipelines for tasks like song covers or speech generation. The technology stands out for its efficiency in real-time scenarios, caching for faster inference, and flexibility in pitch extraction methods such as RMVPE and FCPE, making it suitable for both amateur and advanced users.
The underlying architecture relies on PyTorch models serialized as .pth files, using speaker embeddings to separate timbre from content, combined with pitch estimation algorithms for natural conversion. What makes RVC v2 unique is its balance of high-quality conversion and low-latency inference, along with community-driven improvements in noise reduction, autotuning, and embedder options that minimize "bleeding" of the original speaker's traits into outputs.
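The architecture described above can be summarized as a dataflow sketch. Every function body below is a stand-in (the real system runs PyTorch models loaded from .pth files); only the structure — content extraction, pitch extraction with transpose, and decoding with the target timbre — mirrors the text.

```python
# Structural sketch of the RVC v2 conversion flow; all bodies are dummies.
def extract_content(audio):
    """Speaker-agnostic phonetic features (ContentVec/Spin in practice)."""
    return [("frame", s) for s in audio]

def extract_pitch(audio):
    """Per-frame f0 track in Hz (RMVPE/FCPE in practice); dummy constant here."""
    return [100.0 for _ in audio]

def decode(content, pitch, target_model):
    """Re-synthesize with the target speaker's timbre from a .pth model."""
    return [(target_model, c, f0) for c, f0 in zip(content, pitch)]

def convert(audio, target_model, transpose=0.0):
    """Separate content from timbre, shift pitch by `transpose` semitones,
    then decode with the target speaker embedding."""
    content = extract_content(audio)
    pitch = [f0 * 2 ** (transpose / 12) for f0 in extract_pitch(audio)]
    return decode(content, pitch, target_model)
```

The key design point the sketch preserves is that pitch and content are estimated independently of the target voice, which is why prosody and intonation survive the timbre swap.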
Technical Specifications
- Architecture: PyTorch-based Retrieval Voice Conversion (RVC) with speaker embedders (ContentVec, Spin, Ch979) and pitch extractors (RMVPE, FCPE)
- Parameters: Not publicly specified; model sizes vary by training dataset (typically tens of MB for .pth files)
- Resolution: Audio sample rates typically 32kHz or 48kHz; supports variable lengths via split audio processing
- Input/Output formats: WAV, MP3 audio files; .pth model files; real-time microphone input/output
- Performance metrics: Real-time inference capable with caching (reduces time by skipping vocal extraction); RMVPE offers fast processing with decent precision, suitable for harmonic-rich voices
Key Considerations
- Adjust transpose (pitch) precisely, using decimals like -4.3, to match target model tone for natural results
- Select embedder matching the model's training (e.g., ContentVec for most models, Spin for better breath handling and noise robustness)
- Set Protect Voiceless Consonants to 0.5 or lower to reduce breath artifacts, but avoid extreme values, which can suppress words and sound unnatural
- Use Volume Envelope near 0 to preserve input loudness; closer to 1 matches training dataset volume
- Enable Split Audio for longer files to ensure consistent volume and faster inference by processing segments individually
- RMVPE is the go-to pitch extractor for speed and convenience, though it may sound harsh; test FCPE for fuller voices
- Balance quality vs speed: caching skips redundant steps, but high protection or advanced embedders increase processing time
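The considerations above can be collected into a parameter preset with sanity checks. The field names and allowed ranges here are illustrative assumptions based on the guidance in this list, not the exact API schema:

```python
# Hypothetical inference preset; names/ranges are assumptions, not the API.
DEFAULTS = {
    "transpose": 0.0,          # semitones; decimals like -4.3 are allowed
    "embedder": "contentvec",  # or "spin" for breath handling / noise robustness
    "protect": 0.5,            # voiceless-consonant protection; keep <= 0.5
    "volume_envelope": 0.0,    # 0 = keep input loudness, 1 = match training data
    "split_audio": True,       # segment long files for consistent volume
    "pitch_extractor": "rmvpe",
}

def validate(params):
    """Merge user params over defaults and reject out-of-range values."""
    p = {**DEFAULTS, **params}
    if not -24.0 <= p["transpose"] <= 24.0:
        raise ValueError("transpose out of range")
    if not 0.0 <= p["protect"] <= 0.5:
        raise ValueError("protect should stay in 0..0.5 to avoid artifacts")
    if not 0.0 <= p["volume_envelope"] <= 1.0:
        raise ValueError("volume_envelope should be 0..1")
    return p
```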
Tips & Tricks
- For optimal pitch: Start with transpose at 0, increment/decrement by 0.5 while previewing until matching target voice
- Prompt structuring: No text prompts needed; focus on clean input audio and preprocess with noise reduction for best embedder performance
- Achieve singing covers: Use multi-step tabs to isolate vocal extraction, conversion, and mixing; cache vocals for repeated model tests
- Iterative refinement: Listen to intermediate files (vocals, pitch, converted) in UI to tweak embedder or protection per step
- Advanced: Combine with compressor (threshold -20dB, ratio 4:1) for even volume; add reverb (room size 0.5, wet 0.3) for spatial effects; use high-pass filter (cutoff 80Hz) to remove rumble
- Real-time: Load clean sound files as "mic input" for playback conversion; select per-model embedder to avoid mismatches
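The compressor and high-pass settings suggested in the tips above reduce to straightforward gain math. This is a pure-Python sketch of the signal processing only; a real pipeline would use an audio library, and the first-order filter design here is an assumption for illustration, not something RVC ships.

```python
# Sketch of the suggested post chain: high-pass (~80 Hz) to remove rumble,
# then a static downward compressor (threshold -20 dB, ratio 4:1).
import math

def high_pass(samples, sample_rate=48000, cutoff=80.0):
    """First-order RC high-pass filter (simple difference equation)."""
    rc = 1.0 / (2 * math.pi * cutoff)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out, prev_in, prev_out = [], 0.0, 0.0
    for x in samples:
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out

def compress(samples, threshold_db=-20.0, ratio=4.0):
    """Per-sample downward compression (no attack/release smoothing)."""
    out = []
    for x in samples:
        level_db = 20 * math.log10(max(abs(x), 1e-9))
        if level_db > threshold_db:
            # Above threshold, output rises 1 dB per `ratio` dB of input.
            gain_db = (threshold_db - level_db) * (1 - 1 / ratio)
            x *= 10 ** (gain_db / 20)
        out.append(x)
    return out
```

For example, a full-scale sample (0 dB) is 20 dB over the threshold, so a 4:1 ratio attenuates it by 15 dB; samples below -20 dB pass through unchanged.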
Capabilities
- High-fidelity voice conversion preserving tone, emotion, and delivery in speech or song
- Real-time voice changing via microphone input with low latency using caching and efficient pitch methods
- Text-to-speech generation using trained RVC models for audiobooks or character voices
- Multi-step processing for song covers: vocal separation, pitch extraction, timbre swap, and remixing
- Robust to noise with advanced embedders like Spin, separating timbre from phonetic content accurately
- Customizable effects: pitch shift, volume matching, consonant protection, filters (low/high-pass, reverb, chorus)
- Versatile for polyphonic audio with RMVPE+ variants; supports user-trained models via .pth uploads
What Can I Use It For?
- Creating song covers by converting vocals to target artist voices, as shared in GitHub tutorials and user pipelines
- Real-time voice modulation for streaming or calls, reported in community setups with microphone passthrough
- Generating audiobooks or TTS in custom voices, highlighted in ultimate-rvc features for character speech
- Training personal voice models from datasets, used by developers for unique timbre applications per PyPI docs
- Live performances or content creation, with users discussing sound file playback conversion in real-time
Things to Be Aware Of
- Embedder mismatches (e.g., using Spin on ContentVec model) cause poor timbre transfer; always check model description
- RMVPE can sound harsh on non-harmonic voices; users recommend FCPE or RMVPEGPU forks for smoother results
- Volume drops in long audio fixed by Split Audio, per inference guides; improves speed and consistency
- Resource needs: GPU acceleration via RMVPEGPU variants reduces CPU load for real-time use
- Breaths and noise handled better by Spin embedder, but less common in older models
- Positive feedback: Fast setup, quality improvements in v2 with autotuning; users praise caching for efficiency
- Common concerns: Over-protection makes speech robotic; test intermediates to avoid artifacts
Limitations
- Dependent on training data quality; poor datasets lead to inconsistent timbre or pronunciation bleed
- Real-time mode sensitive to input noise or mic quality, potentially causing artifacts without preprocessing
- Limited to RVC-compatible .pth models; incompatible formats like gpt-sovits fail silently
Related AI Models