---
language: py
tags:
- vulnerability-detection
- code-security
- codebert
- python
- CWE-89
- CWE-78
- CWE-79
- CWE-352
- CWE-94
- CWE-22
- CWE-601
datasets:
- vudenc
license: cc-by-4.0
---

# PyGuard V4: Python Vulnerability Detector

## Model Description

PyGuard V4 is a fine-tuned Microsoft CodeBERT model for detecting security vulnerabilities in Python code. It builds on VUDENC (Wartschinski et al., 2022), replacing that system's Word2Vec embeddings and LSTM classifier with CodeBERT.

## Performance vs VUDENC

| Metric    | VUDENC (LSTM) | PyGuard V4 (CodeBERT) | Improvement   |
|-----------|---------------|-----------------------|---------------|
| Precision | 82-96%        | 100.00%               | +4 to +18 pp  |
| Recall    | 78-87%        | 100.00%               | +13 to +22 pp |
| F1 Score  | 80-90%        | 100.00%               | +10 to +20 pp |
| Accuracy  | N/A           | 100.00%               | N/A           |

Improvements are absolute differences in percentage points (pp).

## Training Dataset

- **Source:** VUDENC dataset (Wartschinski et al., 2022)
- **DOI:** 10.5281/zenodo.3559841
- **Paper:** Information and Software Technology, vol. 144, 2022
- **Total samples:** 2,457 (1,228 vulnerable + 1,229 safe)
- **Split:** 80% train, 10% validation, 10% test

## Vulnerabilities Detected (7 CWEs)

- CWE-89: SQL Injection
- CWE-78: Command Injection
- CWE-79: Cross-Site Scripting (XSS)
- CWE-352: Cross-Site Request Forgery (CSRF)
- CWE-94: Remote Code Execution
- CWE-22: Path Disclosure/Traversal
- CWE-601: Open Redirect

## Architecture

- Base model: `microsoft/codebert-base`
- Classification head: `Linear(768, 2)` with `Dropout(0.3)`
- Pooling: mean pooling over the last hidden state
- Max sequence length: 256 tokens

## Citation

```bibtex
@article{wartschinski2022vudenc,
  title={VUDENC: Vulnerability Detection with Deep Learning on a Natural Codebase for Python},
  author={Wartschinski, Laura and Noller, Yannic and Vogel, Thomas and Kehrer, Timo and Grunske, Lars},
  journal={Information and Software Technology},
  volume={144},
  year={2022}
}
```
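
## Example: Split and Metrics Sketch

The Training Dataset section gives only the class counts and the 80/10/10 ratio, and the performance table gives only the final metrics. The sketch below shows one way such a split and such metrics could be computed with the standard library alone. The per-class (stratified) splitting, the seed, the rounding rule, and the helper names `stratified_split` and `binary_metrics` are all assumptions made for illustration and may differ from what was actually used to train and evaluate PyGuard.

```python
import random


def stratified_split(labels, train_frac=0.8, val_frac=0.1, seed=42):
    """Return (train, val, test) index lists, splitting each class separately.

    Hypothetical helper: the card states only the 80/10/10 ratio, so the
    stratification, seed, and rounding used here are assumptions.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * train_frac)  # rounds down
        n_val = int(len(indices) * val_frac)      # rounds down
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]         # remainder goes to test
    return train, val, test


def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1, and accuracy for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": correct / len(pairs),
    }


# Placeholder labels matching the class counts in this card:
# 1 = vulnerable (1,228 samples), 0 = safe (1,229 samples).
labels = [1] * 1228 + [0] * 1229
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because each class is rounded down separately, the validation and test sets come out slightly unequal in size. With a held-out set of only a few hundred samples, each misclassification shifts the reported metrics by a visible fraction of a percentage point.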