Text-to-Speech (TTS)

Overview

Amphion's Text-to-Speech (TTS) system converts text into natural-sounding speech. It supports multiple languages, speakers, and emotions.

Here are some TTS samples from Amphion.

Features

Multi-speaker synthesis
Emotion control
Prosody modeling
Language support
Custom voice creation

Quick Start

We provide a beginner recipe that demonstrates how to train a cutting-edge TTS model. Specifically, it is Amphion's re-implementation of VALL-E, a zero-shot TTS architecture that uses a neural codec language model with discrete codes.

Supported Model Architectures

Amphion TTS currently supports the following models and architectures:

FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
VALL-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes. This is our updated VALL-E implementation (June 2024), which uses Llama as its underlying architecture. The previous VALL-E release can be found here.
NaturalSpeech2 (👨‍💻 in development): A TTS architecture that utilizes a latent diffusion model to generate natural-sounding voices.
Jets: An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module.

Basic Usage

from amphion import TextToSpeech

# Initialize model
tts = TextToSpeech()

# Basic synthesis
audio = tts.synthesize("Hello, world!")
audio.save("output.wav")

Advanced Features

Speaker Control

# Multi-speaker synthesis
audio = tts.synthesize(
    text="Hello, world!",
    speaker="speaker_1"
)

# Speaker mixing
audio = tts.synthesize(
    text="Hello, world!",
    speakers=["speaker_1", "speaker_2"],
    weights=[0.7, 0.3]
)
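One common way to realize speaker mixing is a weighted average of per-speaker embedding vectors. The sketch below illustrates that idea in plain Python. The `mix_speaker_embeddings` helper and the toy three-dimensional embeddings are hypothetical and not part of Amphion's API; how Amphion combines speakers internally is not specified here.

```python
from typing import List, Sequence


def mix_speaker_embeddings(
    embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Return the weighted average of several speaker embeddings.

    Hypothetical helper for illustration only; not an Amphion function.
    """
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per embedding")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    # Normalize the weights so they sum to 1, then blend each dimension.
    norm = [w / total for w in weights]
    return [
        sum(w * e[i] for w, e in zip(norm, embeddings))
        for i in range(dim)
    ]


# Toy embeddings standing in for "speaker_1" and "speaker_2".
speaker_1 = [1.0, 0.0, 2.0]
speaker_2 = [0.0, 1.0, 4.0]
mixed = mix_speaker_embeddings([speaker_1, speaker_2], [0.7, 0.3])
```

Normalizing the weights means `[7, 3]` and `[0.7, 0.3]` produce the same blend, which avoids surprises when the weights do not sum to exactly 1.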

Emotion Control

# Synthesis with emotion
audio = tts.synthesize(
    text="Hello, world!",
    emotion="happy"
)

# Custom emotion intensity
audio = tts.synthesize(
    text="Hello, world!",
    emotion="happy",
    intensity=0.8
)

Prosody Control

# Control speech rate
audio = tts.synthesize(
    text="Hello, world!",
    speed=1.5
)

# Control pitch
audio = tts.synthesize(
    text="Hello, world!",
    pitch_shift=2
)
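The units of `speed` and `pitch_shift` are not stated above. Assuming `speed` is a multiplicative rate factor and `pitch_shift` is measured in equal-tempered semitones (both are assumptions, not confirmed by the source), the effect of each setting can be estimated with the hypothetical helpers below.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of the given number of semitones.

    Assumes 12-tone equal temperament: +12 semitones doubles the frequency.
    """
    return 2.0 ** (semitones / 12.0)


def scaled_duration(duration_s: float, speed: float) -> float:
    """Duration of an utterance after applying a rate factor.

    Assumes speed > 1 means faster speech, so the duration shrinks.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return duration_s / speed


# Under these assumptions, pitch_shift=2 raises a 200 Hz voice by about 12%.
ratio = semitones_to_ratio(2)
shifted_f0 = 200.0 * ratio

# Under these assumptions, speed=1.5 turns a 3-second utterance into 2 seconds.
new_duration = scaled_duration(3.0, 1.5)
```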

Model Configuration

from amphion import TextToSpeech
from amphion.config import Config

config = Config(
    model_name="fastspeech2",
    vocoder="hifigan",
    speaker_embedding=True,
    prosody_modeling=True
)

tts = TextToSpeech(config)

Custom Voice Training

from amphion import VoiceTrainer

trainer = VoiceTrainer()
trainer.train(
    data_dir="path/to/voice/data",
    speaker_name="custom_speaker",
    epochs=1000
)

Best Practices

Text Preprocessing

- Use proper punctuation
- Consider text normalization
- Handle special characters

Audio Quality

- Use appropriate sampling rate
- Consider post-processing
- Monitor CPU/GPU usage

Performance Optimization

- Batch processing for multiple texts
- Use GPU when available
- Cache frequently used voices
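
Several of these practices can be sketched with the Python standard library alone. The example below is a minimal illustration, not Amphion code: `normalize_text`, `iter_batches`, and `load_voice` are hypothetical names, and the body of `load_voice` is a placeholder for whatever expensive voice-loading step a real pipeline performs.

```python
import re
import unicodedata
from functools import lru_cache
from typing import Iterator, List, Sequence

# Map typographic quotes to their ASCII equivalents.
_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def normalize_text(text: str) -> str:
    """Apply light text normalization before synthesis (illustrative)."""
    # NFKC folds compatibility characters such as full-width letters.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_QUOTES)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Make sure the sentence ends with terminal punctuation.
    if text and text[-1] not in ".!?":
        text += "."
    return text


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


@lru_cache(maxsize=32)
def load_voice(speaker: str) -> dict:
    """Placeholder for an expensive voice load; results are cached per speaker."""
    return {"speaker": speaker}


texts = [normalize_text(t) for t in ["  Hello,   world ", "How are you?", "Bye"]]
batches = list(iter_batches(texts, batch_size=2))
voice = load_voice("speaker_1")  # later calls with the same name hit the cache
```

Each batch could then be passed to the synthesizer in a single call, if the model supports batched input, while `lru_cache` keeps recently used voices in memory instead of reloading them.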