Singing Voice Conversion (SVC)
Overview
Amphion's Singing Voice Conversion system transforms the voice characteristics of a recording into those of a target speaker while preserving the linguistic content and prosody.
Some demos can be found here.
Features
- Zero-shot voice conversion
- Many-to-many voice conversion
- Real-time conversion
- Emotion preservation/transfer
- Cross-lingual support
Quick Start
```python
from amphion import VoiceConverter

converter = VoiceConverter()
converted = converter.convert(
    source="input.wav",
    target_speaker="target_speaker"
)
converted.save("output.wav")
```
Advanced Features
Zero-shot Conversion
```python
# Convert using a reference audio
converted = converter.convert(
    source="source.wav",
    reference="reference.wav"
)

# Convert using multiple references
converted = converter.convert(
    source="source.wav",
    references=["ref1.wav", "ref2.wav"]
)
```
Style Transfer
```python
# Transfer emotion while converting voice
converted = converter.convert(
    source="source.wav",
    target_speaker="target_speaker",
    emotion="happy"
)

# Mix multiple styles
converted = converter.convert(
    source="source.wav",
    target_speaker="target_speaker",
    styles=["happy", "energetic"],
    weights=[0.7, 0.3]
)
```
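The `styles`/`weights` arguments suggest a weighted combination of style embeddings. As a rough illustration only (the function and embedding values below are hypothetical, not Amphion's internals), such mixing amounts to a convex combination of style vectors:

```python
import numpy as np

def mix_style_embeddings(embeddings, weights):
    """Blend style embeddings as a convex combination.

    `embeddings` is a list of 1-D style vectors (hypothetical; in
    practice they would come from a style encoder) and `weights`
    their mixing coefficients, normalized to sum to 1.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize weights
    stacked = np.stack(embeddings)           # (n_styles, dim)
    return (w[:, None] * stacked).sum(axis=0)  # (dim,)

# Example: mix "happy" and "energetic" style vectors 70/30
happy = np.array([1.0, 0.0, 0.5])
energetic = np.array([0.0, 1.0, 0.5])
mixed = mix_style_embeddings([happy, energetic], [0.7, 0.3])
```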
Model Configuration
```python
from amphion.config import Config

config = Config(
    model_name="contentvec",
    conversion_mode="zero-shot",
    preserve_content=True,
    enhance_quality=True
)
converter = VoiceConverter(config)
```
We provide a beginner recipe that demonstrates how to train a cutting-edge SVC model. It is also the official implementation of the paper "Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion" (2024 IEEE Spoken Language Technology Workshop).
Pipeline Overview
The main idea of SVC is to first disentangle speaker-agnostic representations from the source audio, and then inject the desired speaker information to synthesize the target. This usually relies on an acoustic decoder and a subsequent waveform synthesizer (vocoder).
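The disentangle-then-inject pipeline can be sketched end to end. Every function below is a toy placeholder (not Amphion's API), shown only to make the data flow concrete:

```python
import numpy as np

def extract_content(audio):
    # Placeholder for a content encoder (e.g. Whisper/ContentVec
    # features): frame the waveform, one feature vector per frame.
    frames = audio.reshape(-1, 160)             # 10 ms hops at 16 kHz
    return frames.mean(axis=1, keepdims=True)   # (n_frames, 1)

def extract_prosody(audio):
    # Placeholder for F0/energy extraction: per-frame RMS energy.
    frames = audio.reshape(-1, 160)
    return np.sqrt((frames ** 2).mean(axis=1, keepdims=True))

def speaker_embedding(speaker_id, dim=4):
    # Placeholder speaker look-up: deterministic pseudo-random vector.
    rng = np.random.default_rng(speaker_id)
    return rng.standard_normal(dim)

def acoustic_decoder(content, prosody, spk):
    # Placeholder decoder: combine speaker-agnostic features with the
    # broadcast target-speaker embedding into "mel-like" frames.
    spk_tiled = np.tile(spk, (content.shape[0], 1))
    return np.concatenate([content, prosody, spk_tiled], axis=1)

def vocoder(acoustic):
    # Placeholder vocoder: upsample acoustic frames back to samples.
    return np.repeat(acoustic.mean(axis=1), 160)

source = np.random.default_rng(0).standard_normal(16000)  # 1 s "audio"
content = extract_content(source)        # speaker-agnostic content
prosody = extract_prosody(source)        # speaker-agnostic prosody
spk = speaker_embedding(speaker_id=7)    # target speaker identity
acoustic = acoustic_decoder(content, prosody, spk)
converted = vocoder(acoustic)            # same duration as the source
```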
Supported Features
To date, Amphion SVC supports the following features and models:
Speaker-agnostic Representations
- Content Features: sourced from WeNet, Whisper, and ContentVec.
- Prosody Features: F0 and energy.
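As a rough illustration of these prosody features (not Amphion's implementation, which uses dedicated extraction tools), frame-level energy is the RMS of the frame, and F0 can be approximated by a naive autocorrelation peak:

```python
import numpy as np

SR = 16000  # sample rate (Hz)

def frame_energy(frame):
    # Root-mean-square energy of one frame.
    return float(np.sqrt(np.mean(frame ** 2)))

def frame_f0(frame, fmin=50, fmax=500):
    # Naive pitch estimate: the lag with the highest autocorrelation
    # inside the allowed period range [SR/fmax, SR/fmin).
    lags = np.arange(SR // fmax, SR // fmin)
    ac = np.array([np.dot(frame[:-lag], frame[lag:]) for lag in lags])
    return SR / lags[int(np.argmax(ac))]

# 64 ms frame of a 220 Hz sine: F0 should come out near 220 Hz
# and RMS energy near 1/sqrt(2) ~= 0.707.
t = np.arange(1024) / SR
frame = np.sin(2 * np.pi * 220 * t)
f0 = frame_f0(frame)
energy = frame_energy(frame)
```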
Speaker Embeddings
- Speaker Look-Up Table
- Reference Encoder (👨💻 developing): For zero-shot SVC
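A speaker look-up table is simply an embedding matrix indexed by speaker ID (in training it would be a learned embedding layer). A minimal NumPy sketch, for illustration only:

```python
import numpy as np

class SpeakerLUT:
    """Look-up table mapping integer speaker IDs to embedding vectors."""

    def __init__(self, n_speakers, dim, seed=0):
        rng = np.random.default_rng(seed)
        # One row per speaker; these rows are learned during training.
        self.table = rng.standard_normal((n_speakers, dim))

    def __call__(self, speaker_id):
        return self.table[speaker_id]

lut = SpeakerLUT(n_speakers=100, dim=256)
emb = lut(speaker_id=42)  # 256-dim embedding for speaker 42
```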
Acoustic Decoders
Diffusion-based Models
- DiffWaveNetSVC: Based on Bidirectional Non-Causal Dilated CNN
- DiffComoSVC (👨💻 developing): Based on Consistency Model
Transformer-based Models
- TransformerSVC: Encoder-only and Non-autoregressive Transformer Architecture
VAE- and Flow-based Models
- VitsSVC: VITS-like model with content features input
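The bidirectional non-causal dilated CNN behind DiffWaveNetSVC pads symmetrically, so each output frame sees both past and future context. A minimal single-channel sketch of that building block (an illustration under simplified assumptions, not the DiffWaveNetSVC code):

```python
import numpy as np

def noncausal_dilated_conv(x, kernel, dilation):
    """1-D dilated convolution with symmetric ("non-causal") padding:
    each output sample depends on past AND future input samples, and
    the output has the same length as the input."""
    k = len(kernel)
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, pad)
    out = np.zeros(len(x), dtype=float)
    for i, w in enumerate(kernel):
        # Tap i reads the input shifted by i * dilation samples.
        out += w * xp[i * dilation : i * dilation + len(x)]
    return out

x = np.arange(8, dtype=float)
y = noncausal_dilated_conv(x, kernel=[1.0, 1.0, 1.0], dilation=2)
# y[i] = x[i-2] + x[i] + x[i+2], with zeros beyond the edges
```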
Best Practices
- Audio Quality
  - Use high-quality input audio
  - Maintain a consistent audio format
  - Consider post-processing
- Performance
  - Batch processing for multiple files
  - GPU acceleration
  - Memory management
- Voice Preservation
  - Balance identity transfer
  - Preserve the original prosody
  - Maintain naturalness