Text-to-Speech (TTS)
Overview
Amphion's Text-to-Speech (TTS) system converts text into natural-sounding speech. It supports multiple languages, speakers, and emotions.
Here are some TTS samples from Amphion.
Features
- Multi-speaker synthesis
- Emotion control
- Prosody modeling
- Multi-language support
- Custom voice creation
Quick Start
We provide a beginner recipe that demonstrates how to train a cutting-edge TTS model: Amphion's re-implementation of VALL-E, a zero-shot TTS architecture that uses a neural codec language model with discrete codes.
Supported Model Architectures
Amphion TTS currently supports the following models and architectures:
- FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
- VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
- VALL-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes. This model is our updated VALL-E implementation as of June 2024, which uses Llama as its underlying architecture. The previous VALL-E release can be found here.
- NaturalSpeech2 (in development): A TTS architecture that utilizes a latent diffusion model to generate natural-sounding voices.
- Jets: An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Basic Usage
    from amphion import TextToSpeech

    # Initialize model
    tts = TextToSpeech()

    # Basic synthesis
    audio = tts.synthesize("Hello, world!")
    audio.save("output.wav")
Advanced Features
Speaker Control
    # Multi-speaker synthesis
    audio = tts.synthesize(
        text="Hello, world!",
        speaker="speaker_1"
    )

    # Speaker mixing
    audio = tts.synthesize(
        text="Hello, world!",
        speakers=["speaker_1", "speaker_2"],
        weights=[0.7, 0.3]
    )
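The speaker-mixing call above passes a list of speakers together with per-speaker weights. One common way to realize such a blend is a weighted average of speaker embedding vectors. The sketch below illustrates that idea in plain Python; the function `mix_speaker_embeddings` and the toy embeddings are hypothetical, and nothing here is confirmed to be how Amphion mixes speakers internally.

```python
from typing import List, Sequence


def mix_speaker_embeddings(
    embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Return the weighted average of equally sized embedding vectors.

    Hypothetical helper for illustration only; not part of Amphion.
    """
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per embedding")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    # Normalize the weights so they sum to 1, then blend each dimension.
    norm = [w / total for w in weights]
    return [sum(w * e[i] for w, e in zip(norm, embeddings)) for i in range(dim)]


# Toy 3-dimensional vectors standing in for real speaker embeddings.
speaker_1 = [1.0, 0.0, 2.0]
speaker_2 = [0.0, 1.0, 4.0]
mixed = mix_speaker_embeddings([speaker_1, speaker_2], [0.7, 0.3])
```

Normalizing the weights first means that `[0.7, 0.3]` and `[7, 3]` give the same blend, which makes it harder to scale the embedding by accident.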
Emotion Control
    # Synthesis with emotion
    audio = tts.synthesize(
        text="Hello, world!",
        emotion="happy"
    )

    # Custom emotion intensity
    audio = tts.synthesize(
        text="Hello, world!",
        emotion="happy",
        intensity=0.8
    )
Prosody Control
    # Control speech rate
    audio = tts.synthesize(
        text="Hello, world!",
        speed=1.5
    )

    # Control pitch
    audio = tts.synthesize(
        text="Hello, world!",
        pitch_shift=2
    )
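The units of `speed` and `pitch_shift` are not stated above. As a rough guide, the sketch below assumes that `speed` is a rate multiplier (so the duration shrinks by that factor) and that `pitch_shift` is measured in semitones. Both readings are assumptions, and the helper names are hypothetical.

```python
def scaled_duration(duration_s: float, speed: float) -> float:
    """Duration after applying a rate multiplier (assumed meaning of `speed`)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return duration_s / speed


def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift in equal-tempered semitones
    (assumed unit of `pitch_shift`)."""
    return 2.0 ** (semitones / 12.0)


# Under these assumptions, a 3-second utterance at speed=1.5 lasts 2 seconds,
# and pitch_shift=2 raises every frequency by about 12%.
new_duration = scaled_duration(3.0, 1.5)
ratio = semitones_to_ratio(2)
```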
Model Configuration
    from amphion import TextToSpeech
    from amphion.config import Config

    config = Config(
        model_name="fastspeech2",
        vocoder="hifigan",
        speaker_embedding=True,
        prosody_modeling=True
    )
    tts = TextToSpeech(config)
Custom Voice Training
    from amphion import VoiceTrainer

    trainer = VoiceTrainer()
    trainer.train(
        data_dir="path/to/voice/data",
        speaker_name="custom_speaker",
        epochs=1000
    )
Best Practices
- Text Preprocessing
  - Use proper punctuation
  - Consider text normalization
  - Handle special characters
- Audio Quality
  - Use appropriate sampling rate
  - Consider post-processing
  - Monitor CPU/GPU usage
- Performance Optimization
  - Batch processing for multiple texts
  - Use GPU when available
  - Cache frequently used voices
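The text preprocessing and batching advice above can be sketched with the standard library alone. The helpers `normalize_text` and `batched`, and the small abbreviation table, are hypothetical illustrations rather than part of Amphion. A production pipeline would likely need language-specific rules for numbers, dates, and symbols.

```python
import re
import unicodedata
from typing import Iterator, List, Sequence

# Tiny illustrative table; a real system would need a much larger one.
_ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "etc.": "et cetera"}

# Map typographic quotes to their ASCII equivalents.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_text(text: str) -> str:
    """Clean raw input before synthesis (hypothetical helper)."""
    # NFKC folds compatibility characters such as full-width letters.
    text = unicodedata.normalize("NFKC", text).translate(_QUOTES)
    # Collapse runs of whitespace, including newlines and tabs.
    text = re.sub(r"\s+", " ", text).strip()
    # Expand known abbreviations word by word.
    text = " ".join(_ABBREVIATIONS.get(w.lower(), w) for w in text.split(" "))
    # Make sure the sentence ends with terminal punctuation.
    if text and text[-1] not in ".!?":
        text += "."
    return text


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


raw_texts = ["  Dr.  Smith   is here ", "Hello, world!", "Good\nmorning"]
clean_texts = [normalize_text(t) for t in raw_texts]
batches = list(batched(clean_texts, 2))
```

Each batch can then be handed to whatever synthesis call is in use, so the model is invoked once per batch rather than once per sentence.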