Singing Voice Conversion (SVC)

Overview

Amphion's Singing Voice Conversion (SVC) system transforms the voice characteristics of a recording while preserving its linguistic content and prosody.

Some demos can be found here.

Features

  • Zero-shot voice conversion
  • Many-to-many voice conversion
  • Real-time conversion
  • Emotion preservation/transfer
  • Cross-lingual support

Quick Start

```python
from amphion import VoiceConverter

converter = VoiceConverter()
converted = converter.convert(
    source="input.wav",
    target_speaker="target_speaker",
)
converted.save("output.wav")
```

Advanced Features

Zero-shot Conversion

```python
# Convert using a reference audio
converted = converter.convert(
    source="source.wav",
    reference="reference.wav",
)

# Convert using multiple references
converted = converter.convert(
    source="source.wav",
    references=["ref1.wav", "ref2.wav"],
)
```

Style Transfer

```python
# Transfer emotion while converting voice
converted = converter.convert(
    source="source.wav",
    target_speaker="target_speaker",
    emotion="happy",
)

# Mix multiple styles
converted = converter.convert(
    source="source.wav",
    target_speaker="target_speaker",
    styles=["happy", "energetic"],
    weights=[0.7, 0.3],
)
```

Model Configuration

```python
from amphion.config import Config

config = Config(
    model_name="contentvec",
    conversion_mode="zero-shot",
    preserve_content=True,
    enhance_quality=True,
)
converter = VoiceConverter(config)
```

We provide a beginner recipe that demonstrates how to train a cutting-edge SVC model. It is also the official implementation of the paper "Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion" (2024 IEEE Spoken Language Technology Workshop).

Pipeline Overview

The main idea of SVC is to first disentangle speaker-agnostic representations from the source audio, and then inject the desired speaker information to synthesize the target. This typically involves an acoustic decoder followed by a waveform synthesizer (vocoder):



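The pipeline above can be sketched in plain Python. All functions here are hypothetical stand-ins for illustration, not Amphion APIs; real systems would use a pretrained content encoder (e.g. ContentVec), an acoustic decoder such as DiffWaveNetSVC, and a neural vocoder.

```python
import numpy as np

HOP = 256  # illustrative hop size: one feature frame per 256 waveform samples

def extract_content(audio: np.ndarray) -> np.ndarray:
    """Stand-in content encoder: speaker-agnostic features from the source."""
    n_frames = len(audio) // HOP
    return np.zeros((n_frames, 80))  # placeholder 80-dim frames

def inject_speaker(content: np.ndarray, spk_emb: np.ndarray) -> np.ndarray:
    """Stand-in acoustic decoder: content + speaker embedding -> mel frames."""
    return content + spk_emb  # broadcast the embedding over all frames

def vocode(mel: np.ndarray) -> np.ndarray:
    """Stand-in vocoder: mel frames -> waveform."""
    return np.zeros(mel.shape[0] * HOP)

source = np.random.randn(16000)   # 1 s of source audio at 16 kHz
target_spk = np.random.randn(80)  # target speaker embedding

mel = inject_speaker(extract_content(source), target_spk)
wav = vocode(mel)
print(mel.shape, wav.shape)
```

Each stage is swappable: any of the decoders listed below can fill the `inject_speaker` role without changing the surrounding pipeline.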
Supported Features

To date, Amphion SVC supports the following features and models:

Speaker-agnostic Representations

Speaker Embeddings

  • Speaker Look-Up Table
  • Reference Encoder (👨‍💻 developing): For zero-shot SVC
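A speaker look-up table is simply a trainable embedding per training speaker, fetched by ID at synthesis time. A minimal sketch (the names, dimension, and table contents are illustrative, not Amphion's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical look-up table: one trainable embedding per seen speaker.
speakers = ["alice", "bob", "carol"]
emb_dim = 80  # illustrative embedding dimension
table = rng.standard_normal((len(speakers), emb_dim))
spk2id = {name: i for i, name in enumerate(speakers)}

def lookup(name: str) -> np.ndarray:
    """Fetch the embedding that gets injected into the acoustic decoder."""
    return table[spk2id[name]]

print(lookup("bob").shape)
```

Because the table only covers speakers seen in training, zero-shot SVC instead needs a reference encoder that computes an embedding from arbitrary reference audio.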

Acoustic Decoders

Diffusion-based Models

  • DiffWaveNetSVC: Based on Bidirectional Non-Causal Dilated CNN
  • DiffComoSVC (👨‍💻 developing): Based on Consistency Model

Transformer-based Models

  • TransformerSVC: Encoder-only and Non-autoregressive Transformer Architecture

VAE- and Flow-based Models

  • VitsSVC: VITS-like model with content features input
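One way to think about this list is as a registry keyed by decoder name, which a config could select from. The mapping below is purely illustrative (it is not an Amphion API); the descriptions restate the families above.

```python
# Illustrative registry of the acoustic decoders listed above.
ACOUSTIC_DECODERS = {
    "DiffWaveNetSVC": "diffusion (bidirectional non-causal dilated CNN)",
    "DiffComoSVC": "diffusion (consistency model, in development)",
    "TransformerSVC": "encoder-only non-autoregressive transformer",
    "VitsSVC": "VAE/flow (VITS-like, content features as input)",
}

def describe(name: str) -> str:
    """Return the modelling family for a decoder name, or fail loudly."""
    if name not in ACOUSTIC_DECODERS:
        raise ValueError(
            f"unknown decoder {name!r}, expected one of {sorted(ACOUSTIC_DECODERS)}"
        )
    return ACOUSTIC_DECODERS[name]

print(describe("VitsSVC"))
```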

Best Practices

  1. Audio Quality

    • Use high-quality input audio
    • Maintain consistent audio format
    • Consider post-processing
  2. Performance

    • Batch processing for multiple files
    • GPU acceleration
    • Memory management
  3. Voice Preservation

    • Balance identity transfer
    • Preserve original prosody
    • Maintain naturalness
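The performance tips above amount to loading the model once and reusing it across files. A minimal batching sketch (the `convert_fn` callable stands in for a loaded converter's `convert` method; the helper itself is hypothetical, not part of Amphion):

```python
from pathlib import Path

def convert_batch(paths, convert_fn, out_dir="converted"):
    """Convert many files in one pass, amortizing model-loading cost.

    convert_fn stands in for a loaded converter's convert() method;
    keeping one model in memory avoids reloading it per file.
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    results = []
    for p in map(Path, paths):
        converted = convert_fn(p)  # heavy step: runs on GPU if available
        target = out / p.with_suffix(".wav").name
        results.append((p, target, converted))
    return results

# Stand-in conversion function for demonstration only.
results = convert_batch(["a.flac", "b.wav"], convert_fn=lambda p: f"converted:{p.name}")
print([str(t) for _, t, _ in results])
```

For large batches, freeing intermediate tensors between files (or processing in fixed-size chunks) keeps GPU memory bounded.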