"""
return (error_badge, None, None, None)
elapsed_ms = (time.time() - start) * 1000
pred = result["prediction"]
confidence = result["confidence"] * 100
spoof_pct = result["spoof_probability"] * 100
bona_pct = result["bonafide_probability"] * 100
# Plain-language hint about difficulty based on confidence
if confidence >= 97:
    difficulty_hint = "clear case"
elif confidence >= 80:
    difficulty_hint = "moderately confident"
elif confidence >= 65:
    difficulty_hint = "borderline"
else:
    difficulty_hint = "uncertain — interpret with caution"
if pred == "spoof":
    badge_class = "result-card-spoof"
    icon = "⚠"
    verdict = "Likely synthetic"
    verdict_sub = "This audio shows characteristics of AI-generated speech."
else:
    badge_class = "result-card-bonafide"
    icon = "✓"
    verdict = "Likely authentic"
    verdict_sub = "This audio shows characteristics of natural human speech."
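# Illustrative sketch, not part of the app: the label, confidence, and
# difficulty-hint logic above, gathered into one standalone helper.
# `summarize_prediction` is a hypothetical name, and the sketch assumes the two
# class probabilities sum to 1 (a two-class softmax output). The last hint
# string is shortened here.

```python
def summarize_prediction(spoof_probability: float) -> dict:
    """Turn one spoof probability into a label, a confidence, and a hint.

    Hypothetical helper for illustration; the thresholds mirror the
    confidence tiers used in the app code above.
    """
    if not 0.0 <= spoof_probability <= 1.0:
        raise ValueError("spoof_probability must lie in [0, 1]")
    # Assumption: binary output, so the two probabilities sum to 1.
    bonafide_probability = 1.0 - spoof_probability
    # The label is simply the class holding more probability mass
    # (ties go to "bonafide" here, which is an arbitrary choice).
    prediction = "spoof" if spoof_probability > bonafide_probability else "bonafide"
    # Confidence is the probability of the winning class, in percent.
    confidence = max(spoof_probability, bonafide_probability) * 100
    if confidence >= 97:
        hint = "clear case"
    elif confidence >= 80:
        hint = "moderately confident"
    elif confidence >= 65:
        hint = "borderline"
    else:
        hint = "uncertain"
    return {
        "prediction": prediction,
        "confidence": confidence,
        "difficulty_hint": hint,
    }
```

# Under these assumptions, a spoof probability of 0.66 gives the label "spoof"
# with about 66% confidence, which falls in the "borderline" tier.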
badge = f"""
{icon}
{verdict}
{verdict_sub}
Confidence {confidence:.1f}%
What does this number mean?
Confidence is how much probability the model puts behind its
prediction. If it says "Likely synthetic" at 66%, it means the model
sees a 66% chance this audio is synthetic and a 34% chance it's authentic.
That IS the answer — the prediction label is just the side with more probability.
High confidence does not always mean the model is right.
On the example clips below, the model is 100% confident on the easy ones and
less confident on the harder ones — that's expected. But it can also be 100%
confident and wrong, especially on attack types it struggles with
(like A10, the hardest example). When a deepfake is made by a method the
model hasn't learned to detect, it may see no spoofing signal at all and
confidently call it authentic.
Bottom line: treat any single prediction as one piece of
evidence, not a definitive answer. High confidence means the model sees
strong signal — but it can't detect what it hasn't been trained to detect.
Try the examples in order (easy → hardest) to see how confidence varies.
Synthetic
{spoof_pct:.1f}%
Authentic
{bona_pct:.1f}%
{result['utterance_duration_sec']:.2f}s audio · {result['n_windows']} windows · {elapsed_ms:.0f}ms on CPU · {difficulty_hint}
Modern AI can clone any voice from just a few seconds of audio.
This detector uses Wav2Vec 2.0 to tell synthetic speech apart from authentic recordings —
with 0.69% Equal Error Rate on the ASVspoof 2019 LA development set (5.55% on the evaluation set's unseen attacks).
""")
# Why this matters section
gr.HTML("""
Why this matters
Voice deepfakes are already in the wild
""")
with gr.Row():
    with gr.Column():
        gr.HTML("""
        📞
        Phone scams
        Voice clones are increasingly used to impersonate family members in
        "emergency call" scams. Reported cases have surged since 2022, with losses
        running into millions of dollars annually.
        """)
    with gr.Column():
        gr.HTML("""
        📰
        Misinformation
        Fabricated political speeches, fake celebrity endorsements, and false
        statements attributed to public figures have circulated widely on social
        media platforms in recent election cycles.
        """)
    with gr.Column():
        gr.HTML("""
        ⚖️
        Trust in evidence
        Courts now have to grapple with whether audio recordings are authentic.
        The same challenge applies to investigative journalism and historical
        archive verification.
        """)
# CTA section
gr.HTML("""
Try the detector
Upload your own audio, record from your microphone, or pick an example to start.
Upload audio, record yourself, or pick an example. The detector returns a calibrated
prediction with confidence, plus per-window analysis showing how evidence accumulates over time.
""")
with gr.Row(equal_height=False):
    with gr.Column(scale=1):
        gr.HTML("""
        1 Provide audio
        """)
        with gr.Tabs(elem_classes=["input-tabs"]):
            with gr.Tab("Upload file"):
                audio_upload = gr.Audio(
                    sources=["upload"],
                    type="filepath",
                    label="",
                    elem_classes=["audio-input-styled"],
                )
            with gr.Tab("Record mic"):
                gr.HTML("""
🎤
Click the record button below, speak for 3 to 10 seconds, then click stop.
A live waveform will show your audio being captured.
Try all 5 examples in order — they go from easy to hardest.
You'll see the model handle easy cases confidently, become uncertain on medium
ones, and get the hardest one (A10) completely wrong. Why?
A10 uses Tacotron 2 + WaveRNN — a system so advanced that even human listeners
can't tell its output from real speech. The acoustic features literally overlap
with authentic speech, leaving our model (and any acoustic-feature-based
detector) with no signal to detect. We included this example so you can see
where the limits are, not just where it succeeds.
""")
gr.Examples(
examples=EXAMPLE_FILES,
inputs=audio_upload,
label="",
)
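# Illustrative sketch, not the app's actual inference code: how a clip could be
# split into 4-second windows with 50% overlap and how per-window scores could
# be averaged into one utterance score, as described elsewhere in this file.
# `window_bounds` and `running_mean` are hypothetical names. The 16 kHz sample
# rate and the handling of short clips and of the trailing remainder are
# assumptions of this sketch.

```python
def window_bounds(n_samples, sample_rate=16_000, window_sec=4.0, overlap=0.5):
    """Return (start, end) sample indices for fixed-length overlapping windows."""
    window = int(window_sec * sample_rate)
    hop = int(window * (1.0 - overlap))
    if n_samples <= window:
        # Assumption: a short clip is scored as a single window.
        return [(0, n_samples)]
    bounds = [(s, s + window) for s in range(0, n_samples - window + 1, hop)]
    if bounds[-1][1] < n_samples:
        # Assumption: cover the tail with one extra window aligned to the end.
        bounds.append((n_samples - window, n_samples))
    return bounds


def running_mean(scores):
    """Mean of the first k scores for each k, showing how evidence accumulates."""
    means, total = [], 0.0
    for k, score in enumerate(scores, start=1):
        total += score
        means.append(total / k)
    return means


# A 10-second clip at 16 kHz yields four windows under these settings.
bounds = window_bounds(10 * 16_000)
# The last running mean is the mean-aggregated utterance score.
utterance_score = running_mean([0.9, 0.7, 0.8])[-1]
```

# The final running-mean value equals plain mean aggregation over all windows;
# the intermediate values are what a per-window plot would trace over time.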
with gr.Column(scale=1):
gr.HTML("""
Three datasets, two regimes (in-domain and out-of-domain), and full transparency about
where the model wins and where it struggles. Results are reported as Equal Error Rate (EER) —
lower is better.
""")
# Headline metric cards
gr.HTML("""
Headline results
Three benchmarks at a glance
""")
with gr.Row():
    gr.HTML("""
    In-domain
    5.55%
    ASVspoof 2019 LA
    Unseen attacks A07–A19
    """)
    gr.HTML("""
    Cross-dataset
    9.09%
    ASVspoof 2021 LA
    Codec-degraded audio
    """)
    gr.HTML("""
    Out-of-domain
    26.33%
    WaveFake
    Novel vocoder pipelines
    """)
# Baseline comparison
gr.HTML("""
Benchmark comparison
How we compare to published baselines
""")
gr.HTML("""
System
2019 LA EER
2021 LA EER
Official LFCC-GMM baseline
8.09%
25.56%
Official CQCC-GMM baseline
9.57%
19.30%
Official LFCC-LCNN baseline
—
9.26%
Official RawNet2 baseline
—
9.50%
This work (Wav2Vec 2.0)
5.55%
9.09%
Outperforms LFCC-GMM on 2019 LA by 2.54 pp and matches the strongest neural baselines
(LFCC-LCNN, RawNet2) on 2021 LA — without any codec-specific training augmentation.
""")
# Per-codec analysis
gr.HTML("""
Codec robustness
Performance by audio codec (ASVspoof 2021 LA)
Real-world speech goes through codecs for transmission. The model handles modern codecs
well but struggles with aggressive cellular compression.
""")
with gr.Row():
    with gr.Column(elem_classes=["chart-wrap"]):
        gr.Plot(value=make_per_codec_plot(), label=None)
# Per-attack analysis
gr.HTML("""
Attack-family robustness
Performance by attack type (ASVspoof 2019 LA eval)
13 different synthesis methods (A07–A19), all unseen during training. A10 is the
model's persistent weakness across both 2019 and 2021 evaluations.
""")
with gr.Row():
    with gr.Column(elem_classes=["chart-wrap"]):
        gr.Plot(value=make_per_attack_plot(), label=None)
# WaveFake story
gr.HTML("""
Out-of-domain limits
The WaveFake story — an honest negative result
On WaveFake the model performs significantly worse, particularly on LJSpeech-based
vocoders (22–34% EER). WaveFake tests pure neural vocoder synthesis, while the model
was trained on ASVspoof's mix of TTS and voice-conversion attacks.
Interpretation: the model has learned ASVspoof-specific synthesis
artifacts, not universal vocoder detection. JSUT (Japanese) numbers look artificially
good because the bonafide examples are English LJSpeech — the model is partly detecting
language and domain, not spoofing artifacts. The LJSpeech-based numbers are the
methodologically meaningful results.
""")
with gr.Row():
    with gr.Column(elem_classes=["chart-wrap"]):
        gr.Plot(value=make_wavefake_plot(), label=None)
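# Illustrative sketch, not the project's evaluation code: one simple way to
# compute the Equal Error Rate (EER) that the results above are reported in.
# `equal_error_rate` is a hypothetical name. The sketch assumes a higher score
# means "more likely spoof" and scans every observed score as a threshold,
# which is quadratic in the number of clips and only suitable for small lists.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER: the error rate where false alarms and misses balance."""
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False alarm rate: bonafide clips flagged as spoof (score >= t).
        far = sum(s >= t for s in bonafide_scores) / len(bonafide_scores)
        # Miss rate: spoof clips that slip through (score < t).
        frr = sum(s < t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one bonafide clip scores above one spoof clip.
toy_eer = equal_error_rate([0.1, 0.2, 0.3, 0.6], [0.4, 0.7, 0.8, 0.9])
```

# On the toy scores the two error rates meet at 25%. Lower is better, and 50%
# would correspond to chance-level separation.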
# ============================================================
# Is the model overfit? (honest analysis)
gr.HTML("""
Plain-language analysis
So — is our model overfit?
A fair question to ask of any deep learning model. We'll explain what overfitting is,
walk through what our numbers show, and give you a straight answer.
Part 1
What is overfitting?
Overfitting is when a model memorises specific examples instead of
learning general patterns. Sometimes called "rote learning" — the model gets very good at
recognising things that look like its training data, but anything that looks even slightly
different feels wrong to it and it gets confused.
A good model learns the underlying signal. A deepfake detector should learn what makes a synthetic
voice sound synthetic — patterns that show up across many different fake-voice methods, not
just the specific ones it studied. If it only recognises fake voices that look exactly like the
ones it trained on, it has overfit.
The way you spot overfitting is to test the model on examples it has never seen — and ideally on
examples that are quite different from what it trained on. If performance drops gracefully,
the model is generalising. If it falls off a cliff, the model has overfit.
Part 2
Where does our model actually land?
We tested the detector on four progressively harder challenges. Each step further from what it
trained on tells us how well it generalises.
Two things to notice. First — the model degrades gradually, not catastrophically. It doesn't go
from 0.69% to 50% (which would mean random guessing on anything new). That tells us it
did learn real patterns, not just memorise specific clips.
Second — there's still a big gap. Going from 0.69% on familiar territory to 26.33%
on brand new fake-voice technology is a 38× jump. That's not catastrophic, but it's also not
great. The model clearly learned features that matter for the kinds of fake voices it studied —
and those features don't fully transfer to fake voices made by methods it has never seen.
Part 3
The honest verdict
The honest answer: it's a mix. The model learned real patterns
and generalises to most unseen attacks — but it has a genuine blind spot, and
its confidence can be dangerously high even when it's wrong.
What works well: When tested on 13 fake-voice methods it had never
seen during training, it achieved a 5.55% equal error rate, which at that balanced operating point means roughly 94 out of 100 predictions
correct on completely new fakes. It becomes appropriately uncertain on medium-difficulty
attacks (66% confidence on A07). And it handles noisy, real-world audio without
false-alarming (93.7% confidence on a noisy real voice). These are signs of a model
that learned real anti-spoofing patterns, not just memorised its training data.
What doesn't work: Two real problems. First, the model has a
complete blind spot for A10 attacks — it classifies the hardest
spoof example as "100% authentic," completely wrong. But there's a specific reason:
A10 is a Tacotron 2 + WaveRNN system whose output is so natural that even human
listeners cannot distinguish it from real speech. The ASVspoof 2019 paper
itself confirms that A10's acoustic features literally overlap with authentic speech
in feature space. Since our model relies on acoustic representations (Wav2Vec 2.0
features), it faces the same fundamental limit human ears do — there's no acoustic
signal to detect.
Second, on the WaveFake dataset (modern neural vocoders like MelGAN
and HiFi-GAN — the same technology used in real-world voice cloning today), the error
rate jumps to 26.33%. These vocoders produce different artifacts from what the model
trained on. Since our project's goal is detecting AI voice cloning broadly, this is
a real coverage gap.
What this means: The model is not classically "overfit" in the sense of
having memorised its training data: the 5.55% EER on 13 unseen attack types is strong evidence against that. But
it does have limited coverage: it learned to detect certain types of
synthesis artifacts (the ones present in ASVspoof) and is blind to others (A10, neural
vocoders). For the project's stated goal of detecting AI voice cloning broadly, this is
a meaningful gap.
What our project actually demonstrates
1. Wav2Vec 2.0 features work for deepfake detection. Pretrained speech
representations carry strong anti-spoofing signal. With minimal fine-tuning (15% of the
model), we match or beat published neural baselines on the standard ASVspoof benchmarks.
This validates the transfer-learning approach.
2. Single-corpus training has real limits — and we measured exactly where.
The A10 blind spot reveals a fundamental challenge: when a synthesis system produces
speech that is acoustically indistinguishable from real speech (even to humans),
acoustic-feature-based detection reaches its theoretical limit. The WaveFake results
show that cross-family generalization requires cross-family training data. Both findings
are concrete, measured, and reproducible.
3. The path forward is clear. Universal AI voice cloning detection
requires multi-corpus, multi-family training — combining ASVspoof, WaveFake, and newer
datasets covering the latest synthesis methods. This project establishes the baseline
that such future work would build on, with measured evidence showing exactly where the
current approach succeeds and where it falls short.
We chose to include the failures (A10, WaveFake) rather than hide them because honest
evaluation is more valuable than inflated numbers. A detector that reports 5.55% EER
with known blind spots is more useful than one that reports 5.55% EER and pretends it
works on everything.
"Treat this as a research demonstration of how Wav2Vec features behave for deepfake detection,
not a security tool. If you need to verify whether a real-world recording is a deepfake, no
single model — including this one — should be trusted as the final answer."
""")
# ============================================================
# TAB 4: TECHNICAL
# ============================================================
with gr.Tab("Under the hood", id=3):
gr.Markdown("## Architecture")
gr.HTML("""
""")
gr.Markdown("## Two-stage training rationale")
gr.HTML("""
""")
with gr.Row():
    gr.HTML("""
    Stage 1: frozen backbone, head only
    Train only the linear classification head, keeping all 95M Wav2Vec parameters frozen.
    This proves that pretrained Wav2Vec representations already carry strong anti-spoofing signal.
    Result: 10.09% dev EER
    with just 1,538 trainable parameters.
    """)
    gr.HTML("""
    Stage 2: top 2 layers unfrozen
    Unfreeze top 2 transformer layers + final LayerNorm. Lower LR from 1e-3 to 1e-5
    with 10% warmup + linear decay. Enable mixed precision (fp16) for speed.
    Result: 0.69% dev EER
    a 93% relative error reduction with 14.18M trainable params (15% of model).
    """)
gr.Markdown("## Key design decisions")
gr.Markdown("""
- **Class-weighted cross-entropy** to handle 9:1 spoof:bonafide imbalance (bonafide=4.92, spoof=0.56)
- **4-second windowing with 50% overlap** to handle clips of varying length
- **Mean aggregation** over per-window scores produces final utterance prediction
- **Mixed precision training** reduced wall-clock time from ~6h to 2h 56m on T4
""")
gr.Markdown("## Limitations (honest disclosure)")
gr.HTML("""
WaveFake out-of-domain generalization is poor (~29% EER on LJSpeech vocoders).
The model learned ASVspoof-specific synthesis artifacts, not universal vocoder detection.
Future work: train on a mixed corpus including pure vocoder samples.
Codec sensitivity: GSM and PSTN telephone codecs degrade EER by ~6 percentage points.
Codec augmentation during training would likely close this gap.
A10 attack family is consistently challenging (15.54% EER on this attack alone).
This is a stable model weakness across both 2019 and 2021 evaluations.
Not a production deepfake detector. Real-world deepfakes use synthesis methods this
model has never seen. Use this as a research demonstration, not for security-critical decisions.
""")
gr.Markdown("## Source and citations")
gr.Markdown("""
**Source code, training notebooks, full evaluation results:**
[github.com/Saracasm/deepfake-audio-detection](https://github.com/Saracasm/deepfake-audio-detection)
**Model weights and card:**
[huggingface.co/Sara1708/deepfake-audio-wav2vec2](https://huggingface.co/Sara1708/deepfake-audio-wav2vec2)
### Datasets used
- ASVspoof 2019 LA — Wang et al., 2020
- ASVspoof 2021 LA — Yamagishi et al., 2021
- WaveFake — Frank & Schönherr, 2021
### Backbone model
- Wav2Vec 2.0 Base — Baevski et al., 2020 (Facebook AI Research)
""")
# Wire up the CTA button to switch to the Detector tab
cta_btn.click(fn=lambda: gr.Tabs(selected=1), outputs=tabs)
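# Illustrative sketch, not part of the app: a sanity check of the training
# figures quoted in the "Under the hood" tab. Assumptions of this sketch: the
# class weights follow the common "balanced" formula total / (n_classes * count),
# the ASVspoof 2019 LA training partition has 2,580 bonafide and 22,800 spoof
# utterances (counts not stated in this file), and the classification head is a
# single linear layer from the 768-wide Wav2Vec 2.0 Base output to 2 classes.
# `balanced_class_weights` is a hypothetical name.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count) per class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Assumed training-set counts (not taken from this file).
weights = balanced_class_weights({"bonafide": 2580, "spoof": 22800})

# Stage 1: a linear head has hidden_size * n_classes weights plus n_classes biases.
hidden_size, n_classes = 768, 2
head_params = hidden_size * n_classes + n_classes

# Stage 1 -> Stage 2: relative reduction in dev EER (values quoted in the tab).
stage1_eer, stage2_eer = 10.09, 0.69
relative_reduction = (stage1_eer - stage2_eer) / stage1_eer

# Share of the 95M backbone parameters that Stage 2 trains.
trainable_fraction = 14.18e6 / 95e6
```

# Under these assumptions the weights round to 4.92 and 0.56, the head has 1,538
# parameters, the relative reduction is about 93%, and the trainable share is
# about 15%, all consistent with the figures quoted in the tab.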
if __name__ == "__main__":
    demo.launch()