---
title: Deepfake Audio Detection
emoji: 🎤
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: "5.50.0"
app_file: app/app.py
pinned: false
license: mit
short_description: Detect AI-generated speech using Wav2Vec 2.0
---

# Deepfake Audio Detection — Wav2Vec 2.0 Fine-tuning

Detection of synthetic (deepfake) speech using a fine-tuned Wav2Vec 2.0 model. Trained on ASVspoof 2019 LA, evaluated cross-dataset on ASVspoof 2019 LA eval, ASVspoof 2021 LA, and WaveFake.

## Headline results

| Evaluation | EER | What it measures |
|---|---|---|
| ASVspoof 2019 LA dev (seen attacks A01-A06) | **0.69%** | In-distribution memorization check |
| ASVspoof 2019 LA eval (unseen attacks A07-A19) | **5.55%** | Generalization to new attack types |
| ASVspoof 2021 LA eval (unseen attacks + codec degradation) | **9.09%** | Real-world transmission conditions |
| WaveFake (LJSpeech vocoders, mean) | **29.4%** | Out-of-distribution vocoder synthesis |

### Comparison to published baselines

| System | 2019 LA EER | 2021 LA EER |
|---|---|---|
| Official LFCC-GMM baseline | 8.09% | 25.56% |
| Official CQCC-GMM baseline | 9.57% | 19.30% |
| Official LFCC-LCNN baseline | - | 9.26% |
| Official RawNet2 baseline | - | 9.50% |
| **This work (Wav2Vec 2.0 + fine-tuning)** | **5.55%** | **9.09%** |

Our model outperforms the LFCC-GMM 2019 baseline by 2.54 percentage points and matches the strongest neural baselines on 2021 LA, despite no codec-specific augmentation during training.

## Architecture

Pipeline:

1. Raw waveform input (16 kHz, 4 sec, 64,000 samples)
2. Wav2Vec 2.0 Base backbone (95M params, 12 transformer layers)
   - Stage 1: fully frozen
   - Stage 2: top 2 layers + final LayerNorm unfrozen (~14M trainable)
3. Mean pooling over time dimension
4. Linear classification head (768 -> 2)
5. Softmax -> P(spoof), P(bonafide)

### Two-stage training rationale

- **Stage 1**: train only the linear head on top of frozen pretrained features. Establishes that pretrained Wav2Vec representations already carry strong anti-spoofing signal. Result: 10.09% dev EER with 1,538 trainable params.
- **Stage 2**: unfreeze the top 2 transformer layers, lower learning rate from 1e-3 to 1e-5 (warmup + linear decay), enable mixed precision. Result: 0.69% dev EER, 14.18M trainable params.

## Quickstart

### Inference on a single file

    from src.inference.predict import DeepfakeDetector
    
    detector = DeepfakeDetector(checkpoint_path="path/to/stage2_best.pt")
    result = detector.predict("path/to/audio.wav")
    print(result)
    # {
    #   "spoof_probability": 0.84,
    #   "prediction": "spoof",
    #   "confidence": 0.84,
    #   "utterance_duration_sec": 3.42,
    #   "n_windows": 1,
    #   "threshold_used": 0.5
    # }

The detector handles any audio format readable by torchaudio. Audio is automatically resampled to 16 kHz and segmented into 4-second windows; per-window scores are mean-aggregated.

## Repository structure

    .
    ├── src/
    │   ├── data/
    │   │   ├── protocols.py             # ASVspoof 2019 LA protocol parser
    │   │   ├── protocols_2021.py        # ASVspoof 2021 LA protocol parser
    │   │   ├── preprocessing.py         # audio loading, resampling, windowing
    │   │   └── dataset.py               # PyTorch Dataset
    │   ├── models/
    │   │   └── wav2vec_classifier.py    # Wav2Vec backbone + head
    │   ├── training/
    │   │   └── trainer.py               # training loop with mixed precision + LR scheduler
    │   ├── evaluation/
    │   │   └── metrics.py               # EER, AUC, window-to-utterance aggregation
    │   └── inference/
    │       └── predict.py               # production inference wrapper
    ├── notebooks/
    │   ├── 01_data_acquisition.ipynb    # Phase 2: data exploration + pipeline
    │   ├── 02_train_stage2.ipynb        # Phase 3 + 4: training (Stage 1 + Stage 2)
    │   └── 03_evaluation.ipynb          # Phase 5: cross-dataset evaluation
    ├── results/
    │   ├── metrics/                     # JSON results for each phase
    │   │   ├── stage1_results.json
    │   │   ├── stage2_results.json
    │   │   ├── stage2_eval2019_results.json
    │   │   ├── stage2_eval2021_results.json
    │   │   └── stage2_eval_wavefake_results.json
    │   └── scores/                      # raw per-utterance inference scores (.npz)
    └── docs/
        └── environment.md               # verified runtime environment

## Datasets

This project uses three external datasets, none of which are committed to this repository:

- **ASVspoof 2019 LA** ([paper](https://arxiv.org/abs/1911.01601)) - training and primary eval. Available via Kaggle: `anishsarkar22/asvpoof-2019-dataset-la`.
- **ASVspoof 2021 LA** ([paper](https://arxiv.org/abs/2109.00537)) - secondary eval, real-world codec conditions. Available via Kaggle: `ajaysuryal/asvspoof2021-la` plus key file `simontrann/asvspoof2021-la-key`.
- **WaveFake** ([paper](https://arxiv.org/abs/2111.02813)) - supplementary eval, neural vocoder synthesis. Available via Kaggle: `walimuhammadahmad/fakeaudio` plus LJSpeech `mathurinache/the-lj-speech-dataset`.

## Experiment tracking

All training runs logged to Weights & Biases:
- Project: https://wandb.ai/sara-jaffrani17-dlp/deepfake-audio-detection
- Stage 2 final run: see `stage2-full` (5,320 training steps, 10 epochs)

## Key findings

1. **Pretrained Wav2Vec features carry significant anti-spoofing signal.** A frozen-backbone linear classifier achieves 10.09% dev EER on ASVspoof 2019 LA - competitive with hand-crafted feature baselines.
2. **Two-stage fine-tuning is highly effective.** Unfreezing the top 2 transformer layers (15% of model params) drops dev EER from 10.09% to 0.69% - a 93% relative error reduction.
3. **Generalization profile maps cleanly to distribution shift type:**
   - Unseen attacks (same dataset): +4.86 pp degradation
   - Real-world codec degradation: +3.54 pp additional degradation
   - Novel vocoder pipelines (different domain): +17.24 pp additional degradation
4. **Per-codec analysis identifies model weaknesses.** Aggressive lossy compression (GSM, PSTN) degrades performance ~6 pp vs uncompressed audio. Modern codecs (Opus, G.722) preserve detection signal well.
5. **WaveFake reveals an ASVspoof-specific overfitting pattern.** The model has learned ASVspoof-style attack artifacts but not standalone neural vocoder artifacts. This matches findings in the original WaveFake paper.

## Hardware

Trained on Google Colab Pro:
- Stage 1: T4 GPU, 4h 8m wall-clock
- Stage 2: T4 GPU with mixed precision, 2h 56m wall-clock
- All evaluations: T4 GPU, 35-45 minutes total

## Authors

- Sara Iqbal (23K-0669)
- Areeba Arif (23K-0618)

Course: Deep Learning Project, FAST-NUCES, Spring 2026.

## License

Code: MIT. Datasets retain their original licenses (see individual dataset pages).