---
language: en
tags:
- deepfake-detection
- computer-vision
- image-classification
- xception
- pytorch
- binary-classification
- forensics
license: mit
datasets:
- xhlulu/140k-real-and-fake-faces
---

# Deepfake Detector — Xception CNN

A binary image classifier (real vs fake face) built **from scratch** in PyTorch using the Xception architecture. Trained on 100k images, achieving **99.36% validation accuracy**. This model is the CNN backbone of a larger RAG-powered forensic analysis system that pairs predictions with explanations grounded in peer-reviewed deepfake detection research.

---

## Model Architecture

The full Xception architecture was reimplemented from scratch in PyTorch — no pretrained weights, no transfer learning. The architecture follows Chollet (2017) and consists of three flows:

- **Entry flow**: Two standard convolutions followed by three residual blocks using depthwise separable convolutions with max pooling, progressively downsampling from 299x299 to 19x19 while increasing depth from 3 to 728 channels.
- **Middle flow**: Eight repeated residual blocks at 728 channels with no spatial downsampling — the bulk of the model's representational capacity.
- **Exit flow**: One residual block expanding from 728 to 1024 channels, two additional separable convolutions expanding to 1536 and 2048 channels, global average pooling, and a fully connected classification head.
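
The spatial sizes quoted above can be checked with the standard convolution output-size formula. The sketch below is a sanity check, not code from this repo: it assumes the layer hyperparameters of Chollet (2017) as found in common PyTorch ports (unpadded 3x3 convolutions in the stem, 3x3 max pooling with stride 2 and padding 1 in each downsampling block). The helper names `conv_out` and `trace_xception_spatial` are hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution or pooling window along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_xception_spatial(size: int = 299) -> list[tuple[str, int]]:
    """Trace the feature-map side length through the downsampling layers.

    Hyperparameters are assumed from the original paper, not read from this repo.
    """
    trace = [("input", size)]
    size = conv_out(size, kernel=3, stride=2)  # stem conv 1, no padding
    trace.append(("entry conv1", size))
    size = conv_out(size, kernel=3, stride=1)  # stem conv 2, no padding
    trace.append(("entry conv2", size))
    for i in range(1, 4):  # three entry-flow residual blocks, each ending in max pooling
        size = conv_out(size, kernel=3, stride=2, padding=1)
        trace.append((f"entry block{i}", size))
    trace.append(("middle flow (8 blocks)", size))  # no spatial downsampling
    size = conv_out(size, kernel=3, stride=2, padding=1)  # exit-flow residual block
    trace.append(("exit block", size))
    return trace


for name, side in trace_xception_spatial():
    print(f"{name:<24} {side}x{side}")
```

Under these assumptions the side length goes 299, 149, 147, 74, 37, 19 through the entry flow, stays at 19 through the middle flow, and drops to 10 before global average pooling.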

Depthwise separable convolutions factorize standard convolutions into a depthwise spatial filter per channel followed by a pointwise 1x1 convolution for channel mixing, significantly reducing parameter count while maintaining representational power.
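
To make the parameter saving concrete, here is a minimal sketch of a depthwise separable convolution in PyTorch, compared against a standard 3x3 convolution at the middle-flow width of 728 channels. The class name `SeparableConv2d` and the bias-free layout are assumptions for illustration; the layer in `models/xception.py` may be written differently.

```python
import torch
import torch.nn as nn


class SeparableConv2d(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # groups=in_channels gives each input channel its own spatial filter.
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            padding=kernel_size // 2, groups=in_channels, bias=False,
        )
        # The 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


def count_params(module: nn.Module) -> int:
    """Total number of learnable weights in a module."""
    return sum(p.numel() for p in module.parameters())


separable = SeparableConv2d(728, 728)
standard = nn.Conv2d(728, 728, kernel_size=3, padding=1, bias=False)

print(count_params(separable))  # 728*9 + 728*728 = 536536
print(count_params(standard))   # 728*728*9 = 4769856
```

At this width the separable layer uses roughly one ninth of the weights of the standard convolution while producing an output of the same shape.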

- **Total parameters**: ~20M
- **Input size**: 299x299x3
- **Output**: logits for 2 classes, fake (index 0) and real (index 1); apply softmax to obtain probabilities

![Architecture](architecture.jpg)

---

## Training

| Parameter | Value |
|---|---|
| Dataset | 140k Real and Fake Faces |
| Train set | 100,000 images (50k real, 50k fake) |
| Validation set | 20,000 images (10k real, 10k fake) |
| Test set | 20,000 images (10k real, 10k fake) |
| Epochs | 10 total (training resumed from a checkpoint at epoch 7) |
| Batch size | 32 |
| Optimizer | Adam |
| Loss | CrossEntropyLoss |
| Hardware | Kaggle T4 GPU with mixed precision |
| Normalization | mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5] |
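
The settings in the table map onto a fairly standard PyTorch training loop. The sketch below shows one epoch with Adam, `CrossEntropyLoss`, and automatic mixed precision. It is not the training script used for this model: the learning rate is an assumed value (the table does not list one), and a tiny linear model with random tensors stands in for Xception and the real `DataLoader` so the snippet runs on its own.

```python
import torch
import torch.nn as nn


def train_one_epoch(model, loader, optimizer, criterion, scaler, device):
    """Run one epoch and return (mean loss, accuracy) over the loader."""
    model.train()
    use_amp = device.type == "cuda"  # autocast to float16 only on GPU
    total_loss, correct, seen = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type=device.type, enabled=use_amp):
            logits = model(images)
            loss = criterion(logits, labels)
        # GradScaler scales the loss to avoid float16 gradient underflow;
        # with enabled=False it passes values through unchanged.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss / seen, correct / seen


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Stand-in model and data; swap in Xception(num_classes=2) and a real DataLoader.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2)).to(device)
loader = [(torch.randn(4, 3, 8, 8), torch.randint(0, 2, (4,))) for _ in range(3)]
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumed value
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")

loss, acc = train_one_epoch(model, loader, optimizer, criterion, scaler, device)
print(f"loss={loss:.4f} acc={acc:.2%}")
```

On a CPU-only machine both autocast and the gradient scaler are disabled, so the same loop runs in full precision.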

### Results

| Split | Loss | Accuracy |
|---|---|---|
| Train | 0.0098 | 99.63% |
| Validation | 0.0172 | 99.36% |

![Training Curves](training_curves.png)

---

## Training Data

[140k Real and Fake Faces](https://www.kaggle.com/datasets/xhlulu/140k-real-and-fake-faces) by xhlulu on Kaggle.

- **Real faces**: 70,000 images from the [Flickr-Faces-HQ (FFHQ)](https://github.com/NVlabs/ffhq-dataset) dataset collected by NVIDIA.
- **Fake faces**: 70,000 images sampled from the 1 Million Fake Faces dataset, generated using StyleGAN.
- The dataset ships all images at 256x256 px; preprocessing upsamples them to 299x299 before they reach the model.

---

## Usage

```python
import torch
from torchvision import transforms
from PIL import Image
from huggingface_hub import hf_hub_download

# The Xception class must be available — copy models/xception.py from the project repo
from models.xception import Xception

# Download weights
weights_path = hf_hub_download(
    repo_id="RamadhanZome/deepfake-xception",
    filename="best_xception.pth"
)

# Load model
model = Xception(num_classes=2)
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Preprocessing — must match training exactly
transform = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Predict
image = Image.open("face.jpg").convert("RGB")
input_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    output = model(input_tensor)
    probs = torch.softmax(output, dim=1)
    confidence, idx = torch.max(probs, dim=1)

labels = ["fake", "real"]
print(f"{labels[idx.item()]} ({confidence.item() * 100:.2f}%)")
```

---

## RAG-Powered Explanation System

This model is deployed as part of a larger forensic analysis pipeline combining CNN predictions with Retrieval-Augmented Generation (RAG). When an image is classified, the system:

1. Retrieves the most relevant chunks from a FAISS index built over 10 peer-reviewed deepfake detection papers
2. Constructs a prompt combining the prediction, confidence score, and retrieved research context
3. Sends the prompt to Llama 3.3 70B via Groq to generate a human-readable forensic explanation grounded in the literature

Knowledge base includes: FaceForensics++, Xception (Chollet 2017), RAG (Lewis et al. 2020), FreqNet, Deepfakes and Beyond survey, Deepfake Detection Reliability Survey, and others.
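
Steps 1 and 2 can be sketched without external services. The example below is illustrative only and makes several substitutions: a toy hashed bag-of-words embedding replaces the real text encoder, brute-force cosine similarity in NumPy replaces the FAISS index, and the chunks and prompt wording are hypothetical. Step 3, the call to the hosted LLM, is left out.

```python
import hashlib

import numpy as np


def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words embedding (stand-in for a real sentence encoder)."""
    vec = np.zeros(dim, dtype=np.float32)
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks with the highest cosine similarity to the query."""
    matrix = np.stack([embed(c) for c in chunks])  # a FAISS index would hold these
    scores = matrix @ embed(query)
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]


def build_prompt(label: str, confidence: float, context: list[str]) -> str:
    """Combine the prediction, its confidence, and retrieved context into one prompt."""
    sources = "\n".join(f"- {c}" for c in context)
    return (
        f"The classifier labelled this face image as {label.upper()} "
        f"with {confidence:.1%} confidence.\n"
        f"Research context:\n{sources}\n"
        "Explain the prediction using only the context above."
    )


# Hypothetical chunks; the real index is built from the papers listed above.
chunks = [
    "GAN generated faces often show frequency domain artifacts from upsampling",
    "Retrieval augmented generation conditions a language model on retrieved passages",
    "Depthwise separable convolutions reduce parameters in the Xception network",
]
context = retrieve("frequency artifacts in GAN generated fake faces", chunks, k=2)
prompt = build_prompt("fake", 0.9871, context)
print(prompt)
```

Restricting the explanation to retrieved passages is what keeps the generated text tied to the literature rather than to the language model's own guesses.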

---

## Limitations

- Trained exclusively on StyleGAN-generated fakes — generalization to other generation methods (FaceSwap, diffusion-based) is not guaranteed.
- Performance may degrade on images that have been heavily compressed or resized before inference.
- Designed for still face images — not evaluated on video frames or non-face content.
- May be vulnerable to adversarial attacks as noted in Carlini & Farid (2020).
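
A quick way to probe the compression sensitivity noted above is to re-encode an input at several JPEG quality levels and compare the model's outputs. The helper below is a hypothetical sketch (the name `jpeg_recompress` and the quality levels are assumptions, and no such evaluation is reported for this model); a flat-color image stands in for a real face so the snippet runs on its own.

```python
import io

from PIL import Image


def jpeg_recompress(image: Image.Image, quality: int) -> Image.Image:
    """Round-trip an image through in-memory JPEG encoding at the given quality."""
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer).convert("RGB")


# Stand-in image; use Image.open("face.jpg") in practice.
original = Image.new("RGB", (256, 256), color=(120, 90, 60))
variants = {q: jpeg_recompress(original, q) for q in (90, 50, 10)}
for quality, degraded in variants.items():
    print(quality, degraded.size, degraded.mode)
```

Each variant can then be passed through the `transform` and `model` from the Usage section to see how the predicted probabilities shift as quality drops.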

---

## Citation

```bibtex
@inproceedings{chollet2017xception,
  title={Xception: Deep Learning with Depthwise Separable Convolutions},
  author={Chollet, François},
  booktitle={CVPR},
  year={2017}
}

@misc{140kfaces,
  author={xhlulu},
  title={140k Real and Fake Faces},
  year={2020},
  publisher={Kaggle},
  howpublished={https://www.kaggle.com/datasets/xhlulu/140k-real-and-fake-faces}
}
```

---

## Author

**Ramadhan Zome**

GitHub: [RamadhanAdam](https://github.com/RamadhanAdam) | HuggingFace: [RamadhanZome](https://huggingface.co/RamadhanZome)