rvc-v2

RVC

Voice-to-Voice with RVC v2 converts your spoken voice into any RVC v2 trained AI voice while preserving your tone, emotion, and natural delivery.

Avg Run Time: 0.000s

Model Slug: rvc-v2

Release Date: December 10, 2025

Playground

Input

Enter a URL or choose a file from your computer.

Output

Example Result

Preview and download your result.

The total cost depends on how long the model runs, at $0.000247 per second. At the average runtime of 20 seconds, each run costs about $0.00494, so a $1 budget covers roughly 202 runs.
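The pricing arithmetic above can be reproduced directly (figures taken from this page; the 20-second average runtime is the page's stated assumption):

```python
# Reproduce the pricing arithmetic from the note above.
PRICE_PER_SECOND = 0.000247   # USD per second, as listed on this page
AVG_RUNTIME_S = 20            # stated average runtime in seconds

cost_per_run = PRICE_PER_SECOND * AVG_RUNTIME_S
runs_per_dollar = int(1.0 / cost_per_run)

print(f"cost per run: ${cost_per_run:.5f}")   # ~ $0.00494
print(f"runs per $1:  {runs_per_dollar}")     # ~ 202
```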

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
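A minimal sketch of building that POST with only the standard library. The base URL, endpoint path, header name, and input field names are illustrative assumptions, not the documented schema; check the API reference for the exact values:

```python
import json
import urllib.request

API_BASE = "https://api.eachlabs.ai/v1"  # assumed base URL -- check the docs

def build_create_request(api_key: str, audio_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST that creates a prediction.

    The endpoint path, header name, and input field names here are
    illustrative assumptions, not the documented schema.
    """
    body = json.dumps({
        "model": "rvc-v2",                 # model slug from this page
        "input": {"audio_url": audio_url}, # field name is an assumption
    }).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/predictions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,          # header name is an assumption
        },
        method="POST",
    )

req = build_create_request("YOUR_API_KEY", "https://example.com/voice.wav")
# urllib.request.urlopen(req) would send it; the response JSON should
# contain the prediction ID used in the next step.
```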

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. The API does not push results to you, so you'll need to check repeatedly until you receive a success status.
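The polling loop can be sketched as below. The `fetch` callable stands in for a GET against the prediction endpoint, and the `"status"` values (`"success"`, `"error"`) are assumptions about the response schema; the simulated fetch at the bottom keeps the sketch runnable without network access:

```python
import time
from typing import Callable

def poll_prediction(fetch: Callable[[], dict],
                    interval_s: float = 1.0,
                    timeout_s: float = 120.0) -> dict:
    """Call `fetch()` until it reports a terminal status.

    `fetch` is any function returning the prediction JSON as a dict,
    e.g. a GET against the prediction-result endpoint. The status
    strings here are assumptions about the response schema.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = fetch()
        if result.get("status") in ("success", "error"):
            return result
        time.sleep(interval_s)
    raise TimeoutError("prediction did not finish in time")

# Simulated fetch: pretends the job finishes on the third check.
responses = iter([{"status": "processing"}, {"status": "processing"},
                  {"status": "success", "output": "result.wav"}])
final = poll_prediction(lambda: next(responses), interval_s=0.01)
print(final["status"])  # success
```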

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

RVC v2 refers to the second version of Retrieval-based Voice Conversion (RVC) models, an open-source AI technology for real-time and offline voice conversion. Developed by the RVC community through collaborative efforts on platforms like GitHub, it enables transforming input audio, such as spoken voice or singing, into target voices trained on specific datasets while aiming to preserve prosody, emotion, and intonation. Key contributors include projects like ultimate-rvc and voice-changer repositories that extend RVC v2 capabilities for training, inference, and real-time applications.

Core features include pitch adjustment (transpose), embedder models for speaker identity extraction (e.g., ContentVec, Spin), protection for voiceless consonants to reduce artifacts, volume envelope matching, and split-audio processing for consistent output. It supports voice-to-voice conversion, text-to-speech with trained models, and multi-step pipelines for tasks like song covers or speech generation. The technology stands out for its efficiency in real-time scenarios, caching for faster inference, and flexibility in pitch extraction methods such as RMVPE and FCPE, making it suitable for both amateur and advanced users.

The underlying architecture relies on PyTorch models serialized as .pth files, using speaker embeddings to separate timbre from content, combined with pitch estimation algorithms for natural conversion. What makes RVC v2 unique is its balance of high-quality conversion with low-latency inference, community-driven improvements in noise reduction, autotuning, and embedder options that minimize "bleeding" of original speaker traits into outputs.

Technical Specifications

  • Architecture: PyTorch-based Retrieval Voice Conversion (RVC) with speaker embedders (ContentVec, Spin, Ch979) and pitch extractors (RMVPE, FCPE)
  • Parameters: Not publicly specified; model sizes vary by training dataset (typically tens of MB for .pth files)
  • Resolution: Audio sample rates typically 32kHz or 48kHz; supports variable lengths via split audio processing
  • Input/Output formats: WAV, MP3 audio files; .pth model files; real-time microphone input/output
  • Performance metrics: Real-time inference capable with caching (reduces time by skipping vocal extraction); RMVPE offers fast processing with decent precision, suitable for harmonic-rich voices
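Since the specs call out 32 kHz / 48 kHz WAV input, it can be worth verifying a file's sample rate before uploading. This sketch uses only the standard library `wave` module and writes a short silent probe file so it is self-contained; the rate check itself is the part you would reuse:

```python
import wave

def wav_sample_rate(path: str) -> int:
    """Return the sample rate of a WAV file (RVC models commonly use
    32 kHz or 48 kHz, per the specs above)."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate()

# Write a short silent 32 kHz mono WAV so the example is self-contained.
with wave.open("probe.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)                 # 16-bit samples
    wf.setframerate(32000)
    wf.writeframes(b"\x00\x00" * 320)  # 10 ms of silence

rate = wav_sample_rate("probe.wav")
print(rate)  # 32000
if rate not in (32000, 48000):
    print("consider resampling before uploading")
```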

Key Considerations

  • Adjust transpose (pitch) precisely, using decimals like -4.3, to match target model tone for natural results
  • Select embedder matching the model's training (e.g., ContentVec for most models, Spin for better breath handling and noise robustness)
  • Set Protect Voiceless Consonants to 0.5 or lower to reduce breath artifacts, but avoid extremes to prevent unnatural-sounding suppression of words
  • Use Volume Envelope near 0 to preserve input loudness; closer to 1 matches training dataset volume
  • Enable Split Audio for longer files to ensure consistent volume and faster inference by processing segments individually
  • RMVPE is the go-to pitch extractor for speed and convenience, though it may sound harsh; test FCPE for fuller voices
  • Balance quality vs speed: caching skips redundant steps, but high protection or advanced embedders increase processing time
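The considerations above can be collected into a starting-point settings dict. The key names here are hypothetical, chosen for readability, and may not match the playground's actual field names:

```python
# Illustrative starting-point settings reflecting the guidance above.
# Key names are hypothetical, not the platform's documented schema.
rvc_settings = {
    "transpose": 0.0,           # tune in +/-0.5 steps; decimals like -4.3 are allowed
    "embedder": "contentvec",   # match the model's training embedder (or "spin")
    "pitch_extractor": "rmvpe", # fast default; try "fcpe" for fuller voices
    "protect_voiceless": 0.5,   # <= 0.5 reduces breath artifacts; avoid extremes
    "volume_envelope": 0.0,     # near 0 keeps input loudness; near 1 matches training data
    "split_audio": True,        # segment long files for consistent volume and speed
}

def validate(settings: dict) -> None:
    """Sanity-check ranges implied by the considerations above."""
    assert 0.0 <= settings["protect_voiceless"] <= 1.0
    assert 0.0 <= settings["volume_envelope"] <= 1.0

validate(rvc_settings)
print("settings ok")
```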

Tips & Tricks

  • For optimal pitch: Start with transpose at 0, increment/decrement by 0.5 while previewing until matching target voice
  • Prompt structuring: No text prompts are needed; focus on clean input audio and preprocess with noise reduction for best embedder performance
  • Achieve singing covers: Use multi-step tabs to isolate vocal extraction, conversion, and mixing; cache vocals for repeated model tests
  • Iterative refinement: Listen to intermediate files (vocals, pitch, converted) in UI to tweak embedder or protection per step
  • Advanced: Combine with compressor (threshold -20dB, ratio 4:1) for even volume; add reverb (room size 0.5, wet 0.3) for spatial effects; use high-pass filter (cutoff 80Hz) to remove rumble
  • Real-time: Load clean sound files as "mic input" for playback conversion; select per-model embedder to avoid mismatches
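The 80 Hz high-pass mentioned above can be sketched without any audio libraries. This is a minimal first-order filter in pure Python operating on float samples, shown to illustrate the idea; for production audio, use a proper DSP library with a steeper filter:

```python
import math

def high_pass(samples, cutoff_hz=80.0, sample_rate=32000):
    """First-order high-pass filter: attenuates DC and low-frequency
    rumble below roughly `cutoff_hz`.
    Recurrence: y[n] = a * (y[n-1] + x[n] - x[n-1])."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    a = rc / (rc + dt)
    out, y_prev, x_prev = [], 0.0, 0.0
    for x in samples:
        y = a * (y_prev + x - x_prev)
        out.append(y)
        y_prev, x_prev = y, x
    return out

# A constant (DC) offset is pure "rumble": the filter drives it to ~0.
dc = [1.0] * 2000
filtered = high_pass(dc)
print(abs(filtered[-1]) < 0.01)  # True
```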

Capabilities

  • High-fidelity voice conversion preserving tone, emotion, and delivery in speech or song
  • Real-time voice changing via microphone input with low latency using caching and efficient pitch methods
  • Text-to-speech generation using trained RVC models for audiobooks or character voices
  • Multi-step processing for song covers: vocal separation, pitch extraction, timbre swap, and remixing
  • Robust to noise with advanced embedders like Spin, separating timbre from phonetic content accurately
  • Customizable effects: pitch shift, volume matching, consonant protection, filters (low/high-pass, reverb, chorus)
  • Versatile for polyphonic audio with RMVPE+ variants; supports user-trained models via .pth uploads

What Can I Use It For?

  • Creating song covers by converting vocals to target artist voices, as shared in GitHub tutorials and user pipelines
  • Real-time voice modulation for streaming or calls, reported in community setups with microphone passthrough
  • Generating audiobooks or TTS in custom voices, highlighted in ultimate-rvc features for character speech
  • Training personal voice models from datasets, used by developers for unique timbre applications per PyPI docs
  • Live performances or content creation, with users discussing sound file playback conversion in real-time

Things to Be Aware Of

  • Embedder mismatches (e.g., using Spin on ContentVec model) cause poor timbre transfer; always check model description
  • RMVPE can sound harsh on non-harmonic voices; users recommend FCPE or RMVPEGPU forks for smoother results
  • Volume drops in long audio are fixed by Split Audio, per inference guides; it also improves speed and consistency
  • Resource needs: GPU acceleration via RMVPEGPU variants reduces CPU load for real-time use
  • Breaths and noise handled better by Spin embedder, but less common in older models
  • Positive feedback: Fast setup, quality improvements in v2 with autotuning; users praise caching for efficiency
  • Common concerns: Over-protection makes speech robotic; test intermediates to avoid artifacts

Limitations

  • Dependent on training data quality; poor datasets lead to inconsistent timbre or pronunciation bleed
  • Real-time mode sensitive to input noise or mic quality, potentially causing artifacts without preprocessing
  • Limited to RVC-compatible .pth models; incompatible formats such as GPT-SoVITS fail silently