---
language: en
tags:
- deepfake-detection
- computer-vision
- image-classification
- xception
- pytorch
- binary-classification
- forensics
license: mit
datasets:
- xhlulu/140k-real-and-fake-faces
---

# Deepfake Detector: Xception CNN

A binary image classifier (real vs fake face) built **from scratch** in PyTorch using the Xception architecture. Trained on 100k images, achieving **99.36% validation accuracy**.

This model is the CNN backbone of a larger RAG-powered forensic analysis system that pairs predictions with explanations grounded in peer-reviewed deepfake detection research.

---

## Model Architecture

The full Xception architecture was reimplemented from scratch in PyTorch, with no pretrained weights and no transfer learning. The architecture follows Chollet (2017) and consists of three flows:

- **Entry flow**: Two standard convolutions followed by three residual blocks using depthwise separable convolutions with max pooling, progressively downsampling from 299x299 to 19x19 while increasing depth from 3 to 728 channels.
- **Middle flow**: Eight repeated residual blocks at 728 channels with no spatial downsampling. This is the bulk of the model's representational capacity.
- **Exit flow**: One residual block expanding from 728 to 1024 channels, two additional separable convolutions expanding to 1536 and 2048 channels, global average pooling, and a fully connected classification head.

Depthwise separable convolutions factorize standard convolutions into a depthwise spatial filter per channel followed by a pointwise 1x1 convolution for channel mixing, significantly reducing parameter count while maintaining representational power.
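
To make the parameter saving concrete, here is a minimal sketch of a depthwise separable convolution at the middle-flow width of 728 channels, compared against a standard 3x3 convolution of the same shape. The class name `SeparableConv2d`, the helper `count_params`, and the bias-free layers are illustrative assumptions; the project's own `models/xception.py` may be organized differently.

```python
import torch
import torch.nn as nn


class SeparableConv2d(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution (illustrative)."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # groups=in_channels gives every input channel its own spatial filter
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            padding=padding, groups=in_channels, bias=False,
        )
        # The 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())


separable = SeparableConv2d(728, 728)
standard = nn.Conv2d(728, 728, kernel_size=3, padding=1, bias=False)

separable_params = count_params(separable)  # 728*9 + 728*728
standard_params = count_params(standard)    # 728*728*9

# A 19x19 feature map, as in the middle flow; padding=1 preserves spatial size
x = torch.randn(1, 728, 19, 19)
with torch.no_grad():
    out_shape = tuple(separable(x).shape)

print(separable_params, standard_params, out_shape)
# 536536 4769856 (1, 728, 19, 19)
```

At this width the separable layer uses roughly nine times fewer parameters than the standard convolution while producing an output of the same shape.
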

- **Total parameters**: ~20M
- **Input size**: 299x299x3
- **Output**: 2-class softmax with fake at index 0 and real at index 1

![Architecture](architecture.jpg)

---

## Training

| Parameter | Value |
|---|---|
| Dataset | 140k Real and Fake Faces |
| Train set | 100,000 images (50k real, 50k fake) |
| Validation set | 20,000 images (10k real, 10k fake) |
| Test set | 20,000 images (10k real, 10k fake) |
| Epochs | 10 (resumed from epoch 7) |
| Batch size | 32 |
| Optimizer | Adam |
| Loss | CrossEntropyLoss |
| Hardware | Kaggle T4 GPU with mixed precision |
| Normalization | mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5] |

### Results

| Split | Loss | Accuracy |
|---|---|---|
| Train | 0.0098 | 99.63% |
| Validation | 0.0172 | 99.36% |

![Training Curves](training_curves.png)

---

## Training Data

[140k Real and Fake Faces](https://www.kaggle.com/datasets/xhlulu/140k-real-and-fake-faces) by xhlulu on Kaggle.

- **Real faces**: 70,000 images from the [Flickr-Faces-HQ (FFHQ)](https://github.com/NVlabs/ffhq-dataset) dataset collected by Nvidia.
- **Fake faces**: 70,000 images sampled from the 1 Million Fake Faces dataset, generated using StyleGAN.
- All images resized to 256x256px, then resized to 299x299 during training preprocessing.

---

## Usage

```python
import torch
from torchvision import transforms
from PIL import Image
from huggingface_hub import hf_hub_download

# The Xception class must be available: copy models/xception.py from the project repo
from models.xception import Xception

# Download weights
weights_path = hf_hub_download(
    repo_id="RamadhanZome/deepfake-xception",
    filename="best_xception.pth"
)

# Load model
model = Xception(num_classes=2)
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Preprocessing: must match training exactly
transform = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Predict
image = Image.open("face.jpg").convert("RGB")
input_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    output = model(input_tensor)
    probs = torch.softmax(output, dim=1)
    confidence, idx = torch.max(probs, dim=1)

labels = ["fake", "real"]
print(f"{labels[idx.item()]} ({confidence.item() * 100:.2f}%)")
```

---

## RAG-Powered Explanation System

This model is deployed as part of a larger forensic analysis pipeline combining CNN predictions with Retrieval-Augmented Generation (RAG). When an image is classified, the system:

1. Retrieves the most relevant chunks from a FAISS index built over 10 peer-reviewed deepfake detection papers
2. Constructs a prompt combining the prediction, confidence score, and retrieved research context
3. Sends the prompt to Llama 3.3 70B via Groq to generate a human-readable forensic explanation grounded in the literature

Knowledge base includes: FaceForensics++, Xception (Chollet 2017), RAG (Lewis et al. 2020), FreqNet, Deepfakes and Beyond survey, Deepfake Detection Reliability Survey, and others.

---

## Limitations

- Trained exclusively on StyleGAN-generated fakes, so generalization to other generation methods (FaceSwap, diffusion-based) is not guaranteed.
- Performance may degrade on images that have been heavily compressed or resized before inference.
- Designed for still face images; not evaluated on video frames or non-face content.
- May be vulnerable to adversarial attacks as noted in Carlini & Farid (2020).

---

## Citation

```
@inproceedings{chollet2017xception,
  title={Xception: Deep Learning with Depthwise Separable Convolutions},
  author={Chollet, François},
  booktitle={CVPR},
  year={2017}
}

@misc{140kfaces,
  author={xhlulu},
  title={140k Real and Fake Faces},
  year={2020},
  publisher={Kaggle},
  howpublished={https://www.kaggle.com/datasets/xhlulu/140k-real-and-fake-faces}
}
```

---

## Author

**Ramadhan Zome**

GitHub: [RamadhanAdam](https://github.com/RamadhanAdam) | HuggingFace: [RamadhanZome](https://huggingface.co/RamadhanZome)