Singing Voice Conversion (SVC)
Overview
Amphion's Singing Voice Conversion system transforms the voice characteristics of a recording into those of a target speaker while preserving the linguistic content and prosody.
Some demos can be found here.
Features
- Zero-shot voice conversion
- Many-to-many voice conversion
- Real-time conversion
- Emotion preservation/transfer
- Cross-lingual support
Quick Start
```python
from amphion import VoiceConverter

converter = VoiceConverter()
converted = converter.convert(
    source="input.wav",
    target_speaker="target_speaker"
)
converted.save("output.wav")
```
Advanced Features
Zero-shot Conversion
```python
# Convert using a reference audio
converted = converter.convert(
    source="source.wav",
    reference="reference.wav"
)

# Convert using multiple references
converted = converter.convert(
    source="source.wav",
    references=["ref1.wav", "ref2.wav"]
)
```
Style Transfer
```python
# Transfer emotion while converting voice
converted = converter.convert(
    source="source.wav",
    target_speaker="target_speaker",
    emotion="happy"
)

# Mix multiple styles
converted = converter.convert(
    source="source.wav",
    target_speaker="target_speaker",
    styles=["happy", "energetic"],
    weights=[0.7, 0.3]
)
```
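The `styles`/`weights` arguments suggest a weighted combination of style embeddings. As a rough illustration only (the function and embedding values below are hypothetical, not Amphion's internals), such mixing amounts to a convex combination of style vectors:

```python
import numpy as np

def mix_style_embeddings(embeddings, weights):
    """Blend style embeddings as a convex combination.

    `embeddings` is a list of 1-D style vectors (hypothetical; in
    practice they would come from a style encoder) and `weights`
    their mixing coefficients, normalized to sum to 1.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize weights
    stacked = np.stack(embeddings)           # (n_styles, dim)
    return (w[:, None] * stacked).sum(axis=0)  # (dim,)

# Example: mix "happy" and "energetic" style vectors 70/30
happy = np.array([1.0, 0.0, 0.5])
energetic = np.array([0.0, 1.0, 0.5])
mixed = mix_style_embeddings([happy, energetic], [0.7, 0.3])
```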
Model Configuration
```python
from amphion.config import Config

config = Config(
    model_name="contentvec",
    conversion_mode="zero-shot",
    preserve_content=True,
    enhance_quality=True
)
converter = VoiceConverter(config)
```
We provide a beginner recipe that demonstrates how to train a cutting-edge SVC model. It is also the official implementation of the paper "Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion" (2024 IEEE Spoken Language Technology Workshop).
Pipeline Overview
The main idea of SVC is to first disentangle speaker-agnostic representations from the source audio, and then inject the desired speaker information to synthesize the target. This usually relies on an acoustic decoder and a subsequent waveform synthesizer (vocoder).
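The disentangle-then-inject pipeline can be sketched end to end. Every function below is a toy placeholder (not Amphion's API), shown only to make the data flow concrete:

```python
import numpy as np

def extract_content(audio):
    # Placeholder for a content encoder (e.g. Whisper/ContentVec
    # features): frame the waveform, one feature vector per frame.
    frames = audio.reshape(-1, 160)             # 10 ms hops at 16 kHz
    return frames.mean(axis=1, keepdims=True)   # (n_frames, 1)

def extract_prosody(audio):
    # Placeholder for F0/energy extraction: per-frame RMS energy.
    frames = audio.reshape(-1, 160)
    return np.sqrt((frames ** 2).mean(axis=1, keepdims=True))

def speaker_embedding(speaker_id, dim=4):
    # Placeholder speaker look-up: deterministic pseudo-random vector.
    rng = np.random.default_rng(speaker_id)
    return rng.standard_normal(dim)

def acoustic_decoder(content, prosody, spk):
    # Placeholder decoder: combine speaker-agnostic features with the
    # broadcast target-speaker embedding into "mel-like" frames.
    spk_tiled = np.tile(spk, (content.shape[0], 1))
    return np.concatenate([content, prosody, spk_tiled], axis=1)

def vocoder(acoustic):
    # Placeholder vocoder: upsample acoustic frames back to samples.
    return np.repeat(acoustic.mean(axis=1), 160)

source = np.random.default_rng(0).standard_normal(16000)  # 1 s "audio"
content = extract_content(source)        # speaker-agnostic content
prosody = extract_prosody(source)        # speaker-agnostic prosody
spk = speaker_embedding(speaker_id=7)    # target speaker identity
acoustic = acoustic_decoder(content, prosody, spk)
converted = vocoder(acoustic)            # same duration as the source
```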
Supported Features
To date, Amphion SVC supports the following features and models:
Speaker-agnostic Representations
- Content Features: sourced from WeNet, Whisper, and ContentVec.
- Prosody Features: F0 and energy.
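As a rough illustration of these prosody features (not Amphion's implementation, which uses dedicated extraction tools), frame-level energy is the RMS of the frame, and F0 can be approximated by a naive autocorrelation peak:

```python
import numpy as np

SR = 16000  # sample rate (Hz)

def frame_energy(frame):
    # Root-mean-square energy of one frame.
    return float(np.sqrt(np.mean(frame ** 2)))

def frame_f0(frame, fmin=50, fmax=500):
    # Naive pitch estimate: the lag with the highest autocorrelation
    # inside the allowed period range [SR/fmax, SR/fmin).
    lags = np.arange(SR // fmax, SR // fmin)
    ac = np.array([np.dot(frame[:-lag], frame[lag:]) for lag in lags])
    return SR / lags[int(np.argmax(ac))]

# 64 ms frame of a 220 Hz sine: F0 should come out near 220 Hz
# and RMS energy near 1/sqrt(2) ~= 0.707.
t = np.arange(1024) / SR
frame = np.sin(2 * np.pi * 220 * t)
f0 = frame_f0(frame)
energy = frame_energy(frame)
```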
Speaker Embeddings
- Speaker Look-Up Table
- Reference Encoder (👨💻 developing): For zero-shot SVC
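A speaker look-up table is simply an embedding matrix indexed by speaker ID (in training it would be a learned embedding layer). A minimal NumPy sketch, for illustration only:

```python
import numpy as np

class SpeakerLUT:
    """Look-up table mapping integer speaker IDs to embedding vectors."""

    def __init__(self, n_speakers, dim, seed=0):
        rng = np.random.default_rng(seed)
        # One row per speaker; these rows are learned during training.
        self.table = rng.standard_normal((n_speakers, dim))

    def __call__(self, speaker_id):
        return self.table[speaker_id]

lut = SpeakerLUT(n_speakers=100, dim=256)
emb = lut(speaker_id=42)  # 256-dim embedding for speaker 42
```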
Acoustic Decoders
Diffusion-based Models
- DiffWaveNetSVC: Based on Bidirectional Non-Causal Dilated CNN
- DiffComoSVC (👨💻 developing): Based on Consistency Model
Transformer-based Models
- TransformerSVC: Encoder-only and Non-autoregressive Transformer Architecture
VAE- and Flow-based Models
- VitsSVC: VITS-like model with content features input
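The bidirectional non-causal dilated CNN behind DiffWaveNetSVC pads symmetrically, so each output frame sees both past and future context. A minimal single-channel sketch of that building block (an illustration under simplified assumptions, not the DiffWaveNetSVC code):

```python
import numpy as np

def noncausal_dilated_conv(x, kernel, dilation):
    """1-D dilated convolution with symmetric ("non-causal") padding:
    each output sample depends on past AND future input samples, and
    the output has the same length as the input."""
    k = len(kernel)
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, pad)
    out = np.zeros(len(x), dtype=float)
    for i, w in enumerate(kernel):
        # Tap i reads the input shifted by i * dilation samples.
        out += w * xp[i * dilation : i * dilation + len(x)]
    return out

x = np.arange(8, dtype=float)
y = noncausal_dilated_conv(x, kernel=[1.0, 1.0, 1.0], dilation=2)
# y[i] = x[i-2] + x[i] + x[i+2], with zeros beyond the edges
```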
Best Practices
- Audio Quality
  - Use high-quality input audio
  - Maintain a consistent audio format
  - Consider post-processing
- Performance
  - Batch processing for multiple files
  - GPU acceleration
  - Memory management
- Voice Preservation
  - Balance identity transfer
  - Preserve the original prosody
  - Maintain naturalness