Datasets

Overview

Amphion unifies data preprocessing across a range of open-source datasets and provides tools for managing and preparing training data.

Supported Datasets

Speech Datasets

  • LibriTTS

    • Multi-speaker English speech dataset
    • 585 hours of speech data
    • 2,456 speakers
  • LJSpeech

    • Single-speaker English speech dataset
    • 24 hours of speech data
    • Female voice reading passages from non-fiction books
  • VCTK

    • Multi-speaker English speech dataset
    • 44 hours of speech data
    • 109 speakers

Singing Datasets

  • M4Singer

    • Mandarin singing voice dataset
    • Professional singers
    • Multiple singing styles
  • Opencpop

    • Chinese popular music dataset
    • Professional recordings
    • Phoneme-level alignments
  • OpenSinger

    • Open-source singing voice dataset
    • Mandarin Chinese recordings from multiple singers
    • Various singing styles

Audio Datasets

  • AudioCaps
    • Audio captioning dataset
    • 46K audio clips
    • Natural language descriptions

Emilia Dataset

The Emilia dataset is a large-scale multilingual speech dataset specifically designed for speech generation:

  • 101K hours of speech data
  • Six languages: English, Chinese, German, French, Japanese, and Korean
  • In-the-wild recordings
  • High-quality annotations

Accessing Emilia

    from amphion.data import EmiliaDataset

    dataset = EmiliaDataset(
        root="path/to/emilia",
        split="train"
    )
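Once a dataset object exists, a common next step is to walk it in fixed-size batches. The sketch below shows one way to do that for any object that supports `len()` and integer indexing. It assumes, without confirmation from the Amphion source, that `EmiliaDataset` follows this map-style protocol; `iter_batches` and the toy records are hypothetical names introduced only for illustration.

```python
from typing import Any, Iterator, List, Sequence


def iter_batches(dataset: Sequence[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of up to `batch_size` items from a map-style dataset."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        # Index item by item so this works with any object exposing __len__/__getitem__.
        yield [dataset[i] for i in range(start, stop)]


# Stand-in for a real dataset object: five made-up utterance records.
toy_dataset = [{"id": f"utt_{i}", "language": "en"} for i in range(5)]
batches = list(iter_batches(toy_dataset, batch_size=2))
```

The last batch is simply shorter when the dataset size is not a multiple of `batch_size`; a training loop that needs equal-sized batches would drop or pad it.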

Data Preprocessing

    from amphion.data import preprocess_dataset

    # Preprocess a supported dataset
    preprocess_dataset(
        dataset="ljspeech",
        input_dir="path/to/raw",
        output_dir="path/to/processed"
    )
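To make the idea of a preprocessing step concrete, here is a minimal, self-contained sketch of what such a step might do for LJSpeech. It is not Amphion's implementation. It relies only on the public LJSpeech layout (a pipe-separated `metadata.csv` with ID, transcription, and normalized transcription, plus a `wavs/` directory). The function name `build_ljspeech_manifest`, the `manifest.json` output file, and the demo rows are assumptions made for this example.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def build_ljspeech_manifest(input_dir: str, output_dir: str) -> List[Dict[str, str]]:
    """Parse an LJSpeech-style metadata.csv and write a JSON manifest of utterances."""
    raw = Path(input_dir)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    entries: List[Dict[str, str]] = []
    with open(raw / "metadata.csv", encoding="utf-8", newline="") as f:
        # Fields are pipe-separated; quoting is disabled because
        # transcripts may contain literal quote characters.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            utt_id, text = row[0], row[1]
            # Prefer the normalized transcription when the third field is present.
            normalized = row[2] if len(row) > 2 else text
            entries.append({
                "id": utt_id,
                "audio": str(raw / "wavs" / f"{utt_id}.wav"),
                "text": normalized,
            })

    # Hypothetical output name; a real pipeline may use a different layout.
    with open(out / "manifest.json", "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)
    return entries


# Demo on made-up rows written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "raw"
    raw_dir.mkdir()
    (raw_dir / "metadata.csv").write_text(
        'demo-0001|Chapter 1|chapter one\n'
        'demo-0002|He said "yes"|he said "yes"\n',
        encoding="utf-8",
    )
    manifest = build_ljspeech_manifest(str(raw_dir), str(Path(tmp) / "processed"))
```

Writing a single manifest keeps later stages (feature extraction, training) independent of each dataset's original file layout, which is one plausible reading of "unified" preprocessing.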

Custom Dataset

    from amphion.data import AudioDataset

    class CustomDataset(AudioDataset):
        def __init__(self, root, split):
            super().__init__(root, split)
            # Custom initialization

        def __getitem__(self, index):
            # Custom data loading logic
            item = None  # placeholder; replace with the sample loaded for this index
            return item
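As a fuller illustration of the custom-dataset pattern, the sketch below implements the same `root`/`split` constructor and `__getitem__` hook without depending on Amphion. It deliberately does not subclass `AudioDataset`, because the base class's behavior is not described here. The class name `ManifestDataset`, the `<root>/<split>.json` layout, and the record fields are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


class ManifestDataset:
    """Map-style dataset over a JSON manifest (a list of utterance records)."""

    def __init__(self, root: str, split: str):
        self.root = Path(root)
        self.split = split
        # Assumed layout: <root>/<split>.json holds a list of
        # {"audio": <relative path>, "text": <transcript>} records.
        with open(self.root / f"{split}.json", encoding="utf-8") as f:
            self.items: List[Dict[str, str]] = json.load(f)

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, index: int) -> Dict[str, str]:
        record = self.items[index]
        # Resolve the audio path against the dataset root;
        # decoding the waveform is left to the caller.
        return {
            "audio_path": str(self.root / record["audio"]),
            "text": record["text"],
        }


# Demo with two made-up records in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    records = [
        {"audio": "wavs/a.wav", "text": "hello"},
        {"audio": "wavs/b.wav", "text": "world"},
    ]
    (Path(tmp) / "train.json").write_text(json.dumps(records), encoding="utf-8")
    dataset = ManifestDataset(tmp, "train")
    size = len(dataset)
    first = dataset[0]
```

Returning paths rather than decoded audio keeps the sketch free of audio libraries; a real subclass would typically load and resample the waveform inside `__getitem__`.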