Text-to-Speech (TTS)

Overview

Amphion's Text-to-Speech (TTS) system converts text into natural-sounding speech. It supports multiple languages, speakers, and emotions.

Here are some TTS samples from Amphion.

Features

  • Multi-speaker synthesis
  • Emotion control
  • Prosody modeling
  • Multilingual support
  • Custom voice creation

Quick Start

We provide a beginner recipe that demonstrates how to train a cutting-edge TTS model: Amphion's re-implementation of VALL-E, a zero-shot TTS architecture that uses a neural codec language model with discrete codes.

Supported Model Architectures

Amphion TTS currently supports the following models and architectures:

  • FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
  • VITS: An end-to-end TTS architecture that uses a conditional variational autoencoder with adversarial learning.
  • VALL-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes. This is our updated VALL-E implementation (June 2024), which uses Llama as its underlying architecture. The previous VALL-E release can be found here.
  • NaturalSpeech2 (👨‍💻 developing): An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.
  • Jets: An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
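To make "non-autoregressive" more concrete for FastSpeech2-style models, the sketch below shows the idea of a length regulator: each phoneme-level vector is repeated according to a predicted duration, so all output frames can then be produced in parallel. This is a framework-free illustration, not Amphion's implementation; the function name `length_regulate` and its `speed` parameter are hypothetical.

```python
from typing import List, Sequence


def length_regulate(
    hidden: Sequence[Sequence[float]],
    durations: Sequence[int],
    speed: float = 1.0,
) -> List[List[float]]:
    """Expand phoneme-level vectors to frame level (hypothetical helper)."""
    if len(hidden) != len(durations):
        raise ValueError("hidden and durations must have the same length")
    if speed <= 0:
        raise ValueError("speed must be positive")

    frames: List[List[float]] = []
    for vector, duration in zip(hidden, durations):
        # A speed above 1.0 shortens every phoneme; round to whole frames.
        n_frames = max(0, round(duration / speed))
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


phoneme_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
predicted_durations = [2, 1, 3]  # frames per phoneme (made-up values)
frames = length_regulate(phoneme_states, predicted_durations)
print(len(frames))  # 6 frames: 2 + 1 + 3
```

In a real model the durations come from a trained duration predictor and the vectors are encoder outputs; the values here are made up for illustration.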

Basic Usage

    from amphion import TextToSpeech

    # Initialize model
    tts = TextToSpeech()

    # Basic synthesis
    audio = tts.synthesize("Hello, world!")
    audio.save("output.wav")

Advanced Features

Speaker Control

    # Multi-speaker synthesis
    audio = tts.synthesize(
        text="Hello, world!",
        speaker="speaker_1"
    )

    # Speaker mixing
    audio = tts.synthesize(
        text="Hello, world!",
        speakers=["speaker_1", "speaker_2"],
        weights=[0.7, 0.3]
    )
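One common way to realize speaker mixing is a weighted average of speaker embedding vectors. The sketch below illustrates that idea in plain Python. It is an assumption about how such a feature could work, not a description of Amphion's internals; the helper `mix_speaker_embeddings` and the embedding values are hypothetical.

```python
from typing import List, Sequence


def mix_speaker_embeddings(
    embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Weighted average of speaker embeddings (hypothetical helper)."""
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("need one weight per embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")

    # Normalize so the weights sum to 1, then average dimension by dimension.
    normalized = [w / total for w in weights]
    return [
        sum(w * e[i] for w, e in zip(normalized, embeddings))
        for i in range(dim)
    ]


speaker_1 = [1.0, 0.0, 0.5]  # made-up 3-dimensional embeddings
speaker_2 = [0.0, 1.0, 0.5]
mixed = mix_speaker_embeddings([speaker_1, speaker_2], [0.7, 0.3])
print([round(x, 3) for x in mixed])  # [0.7, 0.3, 0.5]
```

Normalizing the weights means that `[7, 3]` and `[0.7, 0.3]` give the same mix, which avoids surprises when the caller's weights do not sum to exactly 1.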

Emotion Control

    # Synthesis with emotion
    audio = tts.synthesize(
        text="Hello, world!",
        emotion="happy"
    )

    # Custom emotion intensity
    audio = tts.synthesize(
        text="Hello, world!",
        emotion="happy",
        intensity=0.8
    )

Prosody Control

    # Control speech rate
    audio = tts.synthesize(
        text="Hello, world!",
        speed=1.5
    )

    # Control pitch
    audio = tts.synthesize(
        text="Hello, world!",
        pitch_shift=2
    )
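The units of `speed` and `pitch_shift` are not stated here. Assuming `speed` is a rate multiplier and `pitch_shift` is measured in equal-tempered semitones (both are assumptions, not confirmed behavior), the following sketch shows what the example values would mean numerically. The helper names are hypothetical.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def scaled_duration(duration_s: float, speed: float) -> float:
    """Duration after applying a rate multiplier (1.5 means 1.5x faster)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return duration_s / speed


# Assumed interpretation: +2 semitones raises every frequency by about 12%.
print(round(semitones_to_ratio(2), 4))  # 1.1225

# Assumed interpretation: a 3-second utterance at speed=1.5 lasts 2 seconds.
print(round(scaled_duration(3.0, 1.5), 2))  # 2.0
```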

Model Configuration

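    from amphion import TextToSpeech
    from amphion.config import Config

    # Select the acoustic model and vocoder, and enable optional modules
    config = Config(
        model_name="fastspeech2",
        vocoder="hifigan",
        speaker_embedding=True,
        prosody_modeling=True
    )

    tts = TextToSpeech(config)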

Custom Voice Training

    from amphion import VoiceTrainer

    trainer = VoiceTrainer()
    trainer.train(
        data_dir="path/to/voice/data",
        speaker_name="custom_speaker",
        epochs=1000
    )

Best Practices

  1. Text Preprocessing

    • Use proper punctuation
    • Consider text normalization
    • Handle special characters
  2. Audio Quality

    • Use appropriate sampling rate
    • Consider post-processing
    • Monitor CPU/GPU usage
  3. Performance Optimization

    • Batch processing for multiple texts
    • Use GPU when available
    • Cache frequently used voices
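Two of the practices above, text normalization and batch processing, can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not part of Amphion: `normalize_text` and `batched` are hypothetical helpers, and real TTS front ends usually also expand numbers, abbreviations, and symbols.

```python
import re
import unicodedata
from typing import Iterator, List, Sequence


def normalize_text(text: str) -> str:
    """Light cleanup before synthesis (hypothetical helper)."""
    # Fold compatibility characters such as full-width letters.
    text = unicodedata.normalize("NFKC", text)
    # Replace curly quotes with plain ASCII quotes.
    quotes = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
    text = text.translate(str.maketrans(quotes))
    # Collapse runs of whitespace, including newlines and tabs.
    text = re.sub(r"\s+", " ", text).strip()
    # Add terminal punctuation so the sentence boundary is explicit.
    if text and text[-1] not in ".!?":
        text += "."
    return text


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


texts = ["  Hello,\n  world ", "How are you?", "It\u2019s a test"]
cleaned = [normalize_text(t) for t in texts]
print(cleaned)  # ['Hello, world.', 'How are you?', "It's a test."]

batches = list(batched(cleaned, 2))
print(len(batches))  # 2
```

Each batch could then be passed to the synthesizer in a single call if the model supports batched input, which typically makes better use of a GPU than synthesizing one sentence at a time.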