Evaluation Metrics

Overview

Amphion provides a comprehensive set of objective evaluation metrics for generated audio. These metrics assess the quality, intelligibility, and speaker similarity of synthesized audio.

Available Metrics

F0 Modeling

  • F0 Pearson Coefficients (FPC)
    • Pearson correlation between predicted and ground-truth F0 contours
  • F0 Periodicity RMSE
    • Root mean square error of the frame-level periodicity (voicing-strength) contour
  • F0 RMSE
    • Root mean square error of fundamental frequency
  • Voiced/Unvoiced F1 Score
    • F1 score for voiced/unvoiced frame classification
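
Below is a minimal NumPy sketch of how these F0 metrics can be computed from two time-aligned F0 contours. The function name and the convention that unvoiced frames carry an F0 of 0 Hz are assumptions made for illustration, not Amphion's internal API.

import numpy as np

def f0_metrics(f0_pred, f0_true):
    """FPC, F0 RMSE, and V/UV F1 from time-aligned F0 contours.

    Both inputs are 1-D arrays in Hz; 0 marks an unvoiced frame
    (an assumed convention for this sketch).
    """
    pred_voiced = f0_pred > 0
    true_voiced = f0_true > 0

    # Voiced/Unvoiced F1: "voiced" is the positive class.
    tp = np.sum(pred_voiced & true_voiced)
    fp = np.sum(pred_voiced & ~true_voiced)
    fn = np.sum(~pred_voiced & true_voiced)
    vuv_f1 = 2 * tp / (2 * tp + fp + fn)

    # FPC and F0 RMSE are restricted to frames voiced in both contours,
    # since F0 is undefined on unvoiced frames.
    both = pred_voiced & true_voiced
    fpc = np.corrcoef(f0_pred[both], f0_true[both])[0, 1]
    f0_rmse = np.sqrt(np.mean((f0_pred[both] - f0_true[both]) ** 2))

    return {"fpc": fpc, "f0_rmse": f0_rmse, "vuv_f1": vuv_f1}

Restricting FPC and F0 RMSE to frames voiced in both contours is a common convention, as F0 values on unvoiced frames are not meaningful.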

Energy Modeling

  • Energy RMSE
    • Root mean square error of energy contours
  • Energy Pearson Coefficients
    • Pearson correlation between predicted and ground-truth energy contours
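
As an illustration, the sketch below derives energy contours as frame-level RMS with librosa and scores them. The frame sizes and the function name are illustrative choices for this example, not Amphion's internal API.

import numpy as np
import librosa

def energy_metrics(wav_pred, wav_true, frame_length=1024, hop_length=256):
    """RMSE and Pearson correlation between frame-level energy contours.

    Energy is taken as frame RMS; the frame parameters are illustrative.
    """
    e_pred = librosa.feature.rms(y=wav_pred, frame_length=frame_length,
                                 hop_length=hop_length)[0]
    e_true = librosa.feature.rms(y=wav_true, frame_length=frame_length,
                                 hop_length=hop_length)[0]

    # Trim to the shorter contour so the frame sequences align.
    n = min(len(e_pred), len(e_true))
    e_pred, e_true = e_pred[:n], e_true[:n]

    rmse = np.sqrt(np.mean((e_pred - e_true) ** 2))
    pearson = np.corrcoef(e_pred, e_true)[0, 1]
    return {"energy_rmse": rmse, "energy_pearson": pearson}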

Intelligibility

  • Character Error Rate (CER)
    • Computed by transcribing generated audio with Whisper ASR and comparing the transcript against the reference text
  • Word Error Rate (WER)
    • Word-level counterpart of CER, computed from the same ASR transcripts
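
The sketch below shows one way to obtain these scores, pairing the openai-whisper package (the ASR named above) with the jiwer library for WER/CER. The use of jiwer, the model size, and the lowercase normalization are assumptions made for this example.

import whisper  # openai-whisper
import jiwer

# Transcribe the generated audio, then score the hypothesis transcript
# against the reference text.
model = whisper.load_model("base")
hypothesis = model.transcribe("path/to/generated.wav")["text"]
reference = "the reference transcript of the utterance"

print("WER:", jiwer.wer(reference.lower(), hypothesis.lower()))
print("CER:", jiwer.cer(reference.lower(), hypothesis.lower()))

In practice, transcripts are usually normalized (casing, punctuation removal) before scoring, since ASR formatting choices can otherwise dominate the error rates.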

Spectrogram Distortion

  • Frechet Audio Distance (FAD)
    • Distribution-level distance between embeddings of real and generated audio, reflecting overall quality and diversity
  • Mel Cepstral Distortion (MCD)
    • Evaluates spectral envelope differences
  • Multi-Resolution STFT Distance
    • Time-frequency domain similarity measure
  • PESQ
    • Perceptual evaluation of speech quality
  • STOI
    • Short-time objective intelligibility measure
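
As a hands-on illustration of the last two metrics, the sketch below computes PESQ and STOI with the third-party pesq and pystoi packages; these packages are illustrative stand-ins, not necessarily what Amphion uses internally.

import librosa
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

# PESQ is defined for 8 kHz (narrowband) or 16 kHz (wideband) input,
# so both signals are resampled to 16 kHz here.
ref, _ = librosa.load("path/to/reference.wav", sr=16000)
deg, _ = librosa.load("path/to/generated.wav", sr=16000)

# Both measures expect equal-length, time-aligned signals.
n = min(len(ref), len(deg))
ref, deg = ref[:n], deg[:n]

print("PESQ:", pesq(16000, ref, deg, "wb"))  # wideband mode
print("STOI:", stoi(ref, deg, 16000, extended=False))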

Speaker Similarity

Cosine similarity between speaker embeddings extracted by any of the following models:

  • RawNet3
  • Resemblyzer
  • WeSpeaker
  • WavLM
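
For example, a minimal sketch with Resemblyzer (one of the models listed above) extracts one embedding per utterance and compares them with cosine similarity; the call pattern here is standalone Resemblyzer usage, not Amphion's wrapper.

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

# Embed both utterances with a pretrained speaker encoder.
encoder = VoiceEncoder()
emb_ref = encoder.embed_utterance(preprocess_wav("path/to/reference.wav"))
emb_gen = encoder.embed_utterance(preprocess_wav("path/to/generated.wav"))

# Cosine similarity between the two speaker embeddings.
similarity = np.dot(emb_ref, emb_gen) / (
    np.linalg.norm(emb_ref) * np.linalg.norm(emb_gen)
)
print("Speaker similarity:", similarity)

Scores close to 1 suggest the two utterances come from the same speaker. The other listed models are used the same way, differing only in the embedding extractor.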

Usage Example

from amphion.metrics import evaluate_audio

metrics = evaluate_audio(
    reference="path/to/reference.wav",
    generated="path/to/generated.wav",
    metrics=["mcd", "pesq", "fad"],
    sample_rate=44100,
)
print(metrics)

Batch Evaluation

from amphion.metrics import batch_evaluate

results = batch_evaluate(
    reference_dir="path/to/references",
    generated_dir="path/to/generated",
    metrics=["similarity", "cer", "wer"],
    num_workers=4,
)