inference · 8.0s

RVC v2

Audio·rvc·by RVC Project

Voice-to-Voice with RVC v2 converts your spoken voice into any RVC v2 trained AI voice while preserving your tone, emotion, and natural delivery.

Runtime (p50)
-
Estimated price
$0.000247 / sec
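As a rough illustration of the per-second estimate above, the sketch below multiplies a duration by the listed rate. It assumes cost scales linearly with seconds, with no minimum charge or rounding, which this page does not state. `estimate_cost` is a hypothetical helper name.

```sh
# Listed estimate from this page, in USD per second.
PRICE_PER_SEC=0.000247

# estimate_cost SECONDS -> prints the estimated USD cost to 6 decimals.
# Assumption: cost is linear in seconds (no minimum charge or rounding).
estimate_cost() {
  awk -v s="$1" -v p="$PRICE_PER_SEC" 'BEGIN { printf "%.6f\n", s * p }'
}

estimate_cost 60   # one minute at the listed rate; prints 0.014820
```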
Call the API
prediction.sh
sh
curl -X POST \
  -H "X-API-Key: $EACHLABS_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "rvc-v2",
    "version": "0.0.1",
    "input": {
        "protect": 0.5,
        "f0_method": "rmvpe",
        "rvc_model": "CUSTOM",
        "index_rate": 1,
        "input_audio": "https://storage.googleapis.com/magicpoint/inputs/rvc-v2-input.mp3",
        "pitch_change": 8,
        "rms_mix_rate": 1,
        "filter_radius": 1,
        "output_format": "wav",
        "crepe_hop_length": 128,
        "custom_rvc_model_download_url": "https://huggingface.co/Argax/doofenshmirtz-RUS/resolve/main/doofenshmirtz.zip"
    },
    "webhook_url": ""
}' \
  https://api.eachlabs.ai/v1/prediction/
Documentation (8 sections)
  • Overview

    rvc-v2 — Voice-to-Voice AI Model

    Developed by RVC Project as part of the rvc family, rvc-v2 is a voice-to-voice AI model that converts spoken input into any RVC-trained AI voice while preserving tone, emotion, and natural delivery. It suits creators and developers who need voice conversion that keeps linguistic content and prosody without extensive retraining. Built on a conditional variational autoencoder that uses HuBERT for content encoding and CREPE for pitch extraction, rvc-v2 delivers high-fidelity conversion, with UTMOS perceptual quality scores up to 4.190, outperforming alternatives such as kNN-VC in naturalness.

    Whether you are cloning voices for content creation or building real-time applications, rvc-v2 from RVC Project handles clean audio inputs and supports voice-to-voice transformations.

  • Capabilities
    • High-fidelity voice conversion preserving tone, emotion, and delivery in speech or song
    • Real-time voice changing via microphone input with low latency using caching and efficient pitch methods
    • Text-to-speech generation using trained RVC models for audiobooks or character voices
    • Multi-step processing for song covers: vocal separation, pitch extraction, timbre swap, and remixing
    • Robust to noise with advanced embedders like Spin, separating timbre from phonetic content accurately
    • Customizable effects: pitch shift, volume matching, consonant protection, filters (low/high-pass, reverb, chorus)
    • Versatile for polyphonic audio with RMVPE+ variants; supports user-trained models via .pth uploads
  • Use cases

    Use Cases for rvc-v2

    Content creators cloning custom voices: Podcasters or YouTubers upload 3-5 minutes of clean target voice audio to train an rvc-v2 model, then convert their narration, preserving emotion without added noise or re-recording. For example, read a script in your own voice, select a trained model such as 'Baha', and keep the original pitch contour to get singing-capable output.

    Developers building real-time voice changers: Integrate rvc-v2 via API for apps that need live voice-to-voice conversion, using its cross-platform support (Windows, Mac, Linux) and DirectML for AMD GPUs. This suits gaming or streaming tools, where a low WER (0.120) keeps speech intelligible.

    Musicians experimenting with vocal styles: Train on a singer's 32-minute WAV dataset (44.1 kHz mono), then apply to tracks for seamless style transfer, using rvc-v2's HuBERT-CREPE pipeline to match vocal range intelligently without quality loss.

    Marketers personalizing audio ads: Convert spokesperson audio to brand voices quickly, relying on rvc-v2's high UTMOS scores for natural delivery in targeted campaigns.

  • Tips & tricks

    How to Use rvc-v2 on Eachlabs

    Access rvc-v2 through the Eachlabs Playground for instant testing, the API for production-scale integrations, or the SDK for custom apps. Upload clean WAV input (mono, 44.1 kHz recommended), specify your trained RVC model name, epoch count (200-1000), and parameters such as the pitch-tracking method, and generate high-fidelity voice output in seconds with preserved prosody and emotion.
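The recommended input format can be produced with ffmpeg before uploading. The sketch below only prints the command so it can be reviewed first. It assumes ffmpeg is installed locally; `wav_convert_cmd` and the file names are hypothetical, and the 16-bit depth follows the training-data recommendation elsewhere on this page rather than a stated input requirement.

```sh
# wav_convert_cmd INPUT OUTPUT -> prints an ffmpeg command that converts
# INPUT to mono (-ac 1), 44.1 kHz (-ar 44100), 16-bit PCM (-c:a pcm_s16le) WAV.
# Paths are wrapped in single quotes, so paths that themselves contain a
# single quote are not handled by this sketch.
wav_convert_cmd() {
  printf "ffmpeg -i '%s' -ac 1 -ar 44100 -c:a pcm_s16le '%s'\n" "$1" "$2"
}

wav_convert_cmd narration.mp3 narration.wav
```

Piping the printed line to `sh` runs the conversion once the command looks right.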

  • Technical spec

    What Sets rvc-v2 Apart

    rvc-v2 stands out in the competitive landscape of voice conversion tools due to its architecture tailored for realistic self-voice conversion and multi-speaker support. Unlike basic retrieval methods, it uses a conditional variational autoencoder with HuBERT content features and optional CREPE pitch tracking, enabling precise disentanglement of speaker identity from linguistic content. This allows users to generate high-perceptual-quality outputs (UTMOS 4.190) even in adversarial scenarios like watermark removal, while keeping word error rates low at 0.120.

    • Pitch-preserving conversion via CREPE integration: Extracts and fuses pitch contours accurately, enabling singing voice cloning or emotional speech transfer that retains prosody. This matters for applications where natural inflection is critical.
    • ECAPA-adapted multi-speaker embeddings: Supports one-to-one and same-speaker reconstruction through finetuned embeddings, preserving speaker similarity at 0.748 while outperforming kNN-VC in quality.
    • High-quality training from short clean audio: Requires just 3-5 minutes of mono 44.1 kHz 16-bit WAV data for effective models, with training over 200-1000 epochs for rapid deployment in voice cloning workflows.

    Processing supports standard audio formats such as WAV, with real-time-capable inference on GPUs (Nvidia/AMD) or high-end CPUs, which makes rvc-v2 well suited to API integrations.

  • Things to be aware of
    • Embedder mismatches (e.g., using Spin on ContentVec model) cause poor timbre transfer; always check model description
    • RMVPE can sound harsh on non-harmonic voices; users recommend FCPE or RMVPEGPU forks for smoother results
    • Volume drops in long audio can be fixed by enabling Split Audio, per inference guides; this also improves speed and consistency
    • Resource needs: GPU acceleration via RMVPEGPU variants reduces CPU load for real-time use
    • Breaths and noise are handled better by the Spin embedder, but Spin is less common in older models
    • Positive feedback: Fast setup, quality improvements in v2 with autotuning; users praise caching for efficiency
    • Common concerns: Over-protection makes speech robotic; test intermediates to avoid artifacts
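Where a client has no Split Audio toggle, a similar effect can be approximated by cutting a long file into fixed-length segments before conversion. The sketch below prints one ffmpeg command per segment. The 30-second default, the `split_cmds` name, and the output file naming are assumptions, and the Split Audio option in RVC front ends may choose its boundaries differently.

```sh
# split_cmds INPUT TOTAL_SECONDS [SEGMENT_SECONDS]
# Prints one ffmpeg command per segment. Whole seconds only.
split_cmds() {
  input=$1
  total=$2
  seg=${3:-30}          # assumed default segment length
  start=0
  i=0
  while [ "$start" -lt "$total" ]; do
    # The last segment may be shorter than $seg.
    len=$(( total - start ))
    if [ "$len" -gt "$seg" ]; then
      len=$seg
    fi
    printf "ffmpeg -i '%s' -ss %d -t %d 'part_%03d.wav'\n" \
      "$input" "$start" "$len" "$i"
    start=$(( start + seg ))
    i=$(( i + 1 ))
  done
}

split_cmds long_take.wav 70
```

Each segment would then be converted separately and the results concatenated in order.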
  • Key considerations
    • Adjust transpose (pitch) precisely, using decimals like -4.3, to match target model tone for natural results
    • Select embedder matching the model's training (e.g., ContentVec for most models, Spin for better breath handling and noise robustness)
    • Set Protect Voiceless Consonants to 0.5 or lower to reduce breath artifacts, but avoid extremes, which can suppress words and sound unnatural
    • Use Volume Envelope near 0 to preserve input loudness; closer to 1 matches training dataset volume
    • Enable Split Audio for longer files to ensure consistent volume and faster inference by processing segments individually
    • RMVPE is the go-to pitch extractor for speed and convenience, though it may sound harsh; test FCPE for fuller voices
    • Balance quality vs speed: caching skips redundant steps, but high protection or advanced embedders increase processing time
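The considerations above can be sketched as a request body for the API call shown earlier on this page. The field names come from that example. The mapping of transpose to `pitch_change`, Protect Voiceless Consonants to `protect`, and Volume Envelope to `rms_mix_rate` is an assumption based on common RVC naming, and whether the endpoint accepts decimal `pitch_change` values is not confirmed here. `build_payload` and the example URLs are hypothetical.

```sh
# build_payload AUDIO_URL MODEL_ZIP_URL [PITCH] [PROTECT] [RMS_MIX_RATE]
# Prints a JSON body in the shape of the curl example on this page.
# Values are inserted verbatim, so URLs must not contain double quotes.
# Defaults (pitch 0, protect 0.5, rms_mix_rate 0.25) are illustrative only.
build_payload() {
  cat <<EOF
{
  "model": "rvc-v2",
  "version": "0.0.1",
  "input": {
    "protect": ${4:-0.5},
    "f0_method": "rmvpe",
    "rvc_model": "CUSTOM",
    "index_rate": 1,
    "input_audio": "$1",
    "pitch_change": ${3:-0},
    "rms_mix_rate": ${5:-0.25},
    "filter_radius": 1,
    "output_format": "wav",
    "crepe_hop_length": 128,
    "custom_rvc_model_download_url": "$2"
  },
  "webhook_url": ""
}
EOF
}

# Transpose down 4 semitones, keep the default protect and volume settings.
build_payload "https://example.com/in.wav" "https://example.com/model.zip" -4
```

The output can be piped to the same curl command with `--data @-` in place of the inline JSON, which tells curl to read the body from standard input.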
  • Limitations
    • Dependent on training data quality; poor datasets lead to inconsistent timbre or pronunciation bleed
    • Real-time mode sensitive to input noise or mic quality, potentially causing artifacts without preprocessing
    • Limited to RVC-compatible .pth models; incompatible formats such as GPT-SoVITS fail silently

* FAQ

About RVC v2

What is RVC v2?

RVC v2 (Retrieval-based Voice Conversion v2) is an AI voice conversion model that transforms input audio to sound like a target voice trained on a custom dataset. It enables high-quality voice style transfer while preserving the original speech content, pitch, and timing.