Evaluation Metrics
Overview
Amphion provides comprehensive objective evaluation metrics for generated audio. These metrics assess the quality, intelligibility, and speaker similarity of synthesized audio.
Available Metrics
F0 Modeling
- F0 Pearson Coefficients (FPC): correlation between the predicted and ground-truth F0 contours
- F0 Periodicity RMSE: root mean square error of F0 periodicity
- F0 RMSE: root mean square error of the fundamental frequency
- Voiced/Unvoiced F1 Score: F1 score of the voiced/unvoiced frame classification
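As a concrete illustration, these F0 metrics can be computed from two time-aligned F0 contours with plain NumPy. This is a minimal sketch, not Amphion's internal implementation; the function name f0_metrics and the convention that unvoiced frames carry F0 = 0 are assumptions made here.

    import numpy as np

    def f0_metrics(f0_ref, f0_gen):
        """FPC, F0 RMSE, and voiced/unvoiced F1 for two aligned F0 tracks.

        Assumes 0.0 marks unvoiced frames (an illustrative convention,
        not necessarily Amphion's).
        """
        f0_ref, f0_gen = np.asarray(f0_ref), np.asarray(f0_gen)
        voiced_ref, voiced_gen = f0_ref > 0, f0_gen > 0
        both_voiced = voiced_ref & voiced_gen

        # Pearson correlation over frames voiced in both tracks
        fpc = np.corrcoef(f0_ref[both_voiced], f0_gen[both_voiced])[0, 1]

        # RMSE over the same frames
        f0_rmse = np.sqrt(np.mean((f0_ref[both_voiced] - f0_gen[both_voiced]) ** 2))

        # F1 score of the voiced/unvoiced decision (voiced = positive class)
        tp = np.sum(voiced_ref & voiced_gen)
        fp = np.sum(~voiced_ref & voiced_gen)
        fn = np.sum(voiced_ref & ~voiced_gen)
        vuv_f1 = 2 * tp / (2 * tp + fp + fn)

        return {"fpc": float(fpc), "f0_rmse": float(f0_rmse), "vuv_f1": float(vuv_f1)}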
Energy Modeling
- Energy RMSE: root mean square error of the energy contours
- Energy Pearson Coefficients: correlation between the predicted and ground-truth energy contours
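The energy metrics can be sketched the same way. The snippet below defines frame energy as the L2 norm of STFT magnitudes, which is one common convention; Amphion's exact recipe (window size, hop, normalization) may differ, and energy_metrics is a name invented for this example.

    import numpy as np
    import librosa

    def energy_metrics(ref_wav, gen_wav, n_fft=1024, hop_length=256):
        """Energy RMSE and Pearson correlation for two waveforms,
        using per-frame L2 norms of STFT magnitudes as energy."""
        e_ref = np.linalg.norm(np.abs(librosa.stft(ref_wav, n_fft=n_fft, hop_length=hop_length)), axis=0)
        e_gen = np.linalg.norm(np.abs(librosa.stft(gen_wav, n_fft=n_fft, hop_length=hop_length)), axis=0)

        # Crude alignment: truncate to the shorter contour
        n = min(len(e_ref), len(e_gen))
        e_ref, e_gen = e_ref[:n], e_gen[:n]

        return {
            "energy_rmse": float(np.sqrt(np.mean((e_ref - e_gen) ** 2))),
            "energy_pearson": float(np.corrcoef(e_ref, e_gen)[0, 1]),
        }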
Intelligibility
- Character Error Rate (CER): character-level transcription accuracy, evaluated with Whisper ASR
- Word Error Rate (WER): word-level transcription accuracy
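A minimal, self-contained way to reproduce this kind of evaluation outside Amphion is to transcribe the generated audio with the openai-whisper package and score the transcript with jiwer. The function below is illustrative; Amphion's pipeline may apply its own text normalization before scoring.

    import whisper   # pip install openai-whisper
    import jiwer     # pip install jiwer

    def intelligibility(generated_wav_path, reference_text):
        """Transcribe generated audio with Whisper, then score the
        hypothesis against the reference text."""
        model = whisper.load_model("base")  # reload per call; cache in real use
        hypothesis = model.transcribe(generated_wav_path)["text"]
        return {
            "cer": jiwer.cer(reference_text, hypothesis),
            "wer": jiwer.wer(reference_text, hypothesis),
        }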
Spectrogram Distortion
- Frechet Audio Distance (FAD): distance between the embedding distributions of real and generated audio, reflecting both quality and diversity
- Mel Cepstral Distortion (MCD): difference between the spectral envelopes of generated and reference audio
- Multi-Resolution STFT Distance: time-frequency similarity measured across several STFT resolutions
- PESQ: Perceptual Evaluation of Speech Quality (ITU-T P.862)
- STOI: Short-Time Objective Intelligibility
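For reference, PESQ and STOI have widely used standalone implementations in the pesq and pystoi packages. The sketch below scores one reference/generated pair with both; the 16 kHz resampling and truncation-based alignment are simplifying assumptions of this example, not necessarily what Amphion does.

    import librosa
    from pesq import pesq    # pip install pesq
    from pystoi import stoi  # pip install pystoi

    def perceptual_metrics(ref_path, gen_path):
        """Wideband PESQ and STOI for a reference/generated pair."""
        sr = 16000  # wideband PESQ is defined at 16 kHz
        ref, _ = librosa.load(ref_path, sr=sr)
        gen, _ = librosa.load(gen_path, sr=sr)

        # Crude alignment: truncate to the shorter signal
        n = min(len(ref), len(gen))
        ref, gen = ref[:n], gen[:n]

        return {
            "pesq": pesq(sr, ref, gen, "wb"),
            "stoi": stoi(ref, gen, sr, extended=False),
        }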
Speaker Similarity
Speaker similarity is reported as the cosine similarity between speaker embeddings extracted with any of the following models:
- RawNet3
- Resemblyzer
- WeSpeaker
- WavLM
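Whichever backend is used, the pattern is the same: embed both utterances and take the cosine similarity of the embeddings. Below is a sketch using Resemblyzer, one of the listed models; speaker_similarity is an illustrative name, not an Amphion API.

    import numpy as np
    from resemblyzer import VoiceEncoder, preprocess_wav  # pip install resemblyzer

    def speaker_similarity(ref_path, gen_path):
        """Cosine similarity between Resemblyzer speaker embeddings."""
        encoder = VoiceEncoder()
        ref_embed = encoder.embed_utterance(preprocess_wav(ref_path))
        gen_embed = encoder.embed_utterance(preprocess_wav(gen_path))
        return float(np.dot(ref_embed, gen_embed) /
                     (np.linalg.norm(ref_embed) * np.linalg.norm(gen_embed)))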
Usage Example
    from amphion.metrics import evaluate_audio

    metrics = evaluate_audio(
        reference="path/to/reference.wav",
        generated="path/to/generated.wav",
        metrics=["mcd", "pesq", "fad"],
        sample_rate=44100,
    )
    print(metrics)
Batch Evaluation
    from amphion.metrics import batch_evaluate

    results = batch_evaluate(
        reference_dir="path/to/references",
        generated_dir="path/to/generated",
        metrics=["similarity", "cer", "wer"],
        num_workers=4,
    )