AI Authorship Signals Are Review Prompts, Not Verdicts
Tags: ai-detection, code-review, software-engineering, writing-analysis, security

I built a small Hugging Face dataset after running into the same problem from two sides. AI-assisted code can look clean before it has been tested. AI-written English can look polished before it says anything specific.

The dataset is here: https://huggingface.co/datasets/yava-code/ai-authorship-signals-2026

It is intentionally small: 10 rows, JSONL, source-backed. Each row has a signal, why it matters, a risk level, and a review action.

Why I avoided building a detector

I did not want another binary "AI or human" tool. That framing breaks down quickly. A developer can edit generated code. A non-native English writer can be falsely flagged by AI-writing detectors. A human can write generic docs. A model can produce useful code that still has a security bug.

For hiring, code review, and public writing, the useful question is more practical:

What should I inspect next?

What tends to expose AI-assisted code

The strongest signals are not magic words. The useful review signals are closer to normal engineering hygiene:

- new dependencies that were not checked against the official registry
- code that handles the happy path but misses permission, validation, or boundary cases
- decorative comments that restate obvious code
- generic structure that does not match the surrounding codebase
- snippets that pass syntax checks but carry security weaknesses

The dataset includes OpenSSF guidance on hallucinated dependencies and supply-chain risk. It also references empirical work on security weaknesses in AI-generated code snippets and code-stylometry work that studies function-level and class-level generated code.

The review action is simple: verify dependencies, run static analysis, add failing edge-case tests, and remove comments that do not explain real tradeoffs.
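
The "verify dependencies" step can be partly mechanized. Here is a minimal sketch, assuming a pip-style requirements file and a hand-maintained set of package names a reviewer has already confirmed in the official index. The names `REVIEWED_PACKAGES`, `normalize`, and `unreviewed_dependencies` are hypothetical and not part of the dataset, and the sample package `fast-json-utilz` is made up for illustration. The script does not contact any registry; it only narrows down which names a human still has to look up.

```python
import re

# Hypothetical allowlist: packages a reviewer has already confirmed by hand.
REVIEWED_PACKAGES = {"requests", "numpy", "pydantic"}

# Matches the leading package name of a requirements-style line.
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name: str) -> str:
    """Lowercase and collapse runs of '-', '_', '.' (PEP 503 name normalization)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def unreviewed_dependencies(requirements_text: str, reviewed: set[str]) -> list[str]:
    """Return normalized dependency names that are not in the reviewed set."""
    reviewed_norm = {normalize(n) for n in reviewed}
    flagged: list[str] = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        match = _NAME_RE.match(line)
        if match:
            name = normalize(match.group(1))
            if name not in reviewed_norm and name not in flagged:
                flagged.append(name)
    return flagged


if __name__ == "__main__":
    sample = """
    requests==2.32.3
    numpy>=1.26
    fast-json-utilz==0.1.0  # made-up name for illustration
    """
    for name in unreviewed_dependencies(sample, REVIEWED_PACKAGES):
        print(f"check by hand in the official registry: {name}")
```

An allowlist is a deliberate choice here: a name that merely exists in a registry is not necessarily the package the author meant, so the sketch treats anything unreviewed as a prompt for inspection, not as a verdict.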
What tends to expose AI-written English

For English writing, I focused on signals that are useful during editing:

- overused AI-era vocabulary
- body sections that stay too smooth and generic
- low variation across paragraphs
- missing concrete artifacts such as file names, metrics, error messages, screenshots, or links

The PubMed vocabulary study is useful here because it looks at vocabulary shifts after ChatGPT became common. It does not prove a single sentence is AI-written, but it gives a practical list of words and patterns that now carry a synthetic feel in professional writing.

The Stanford HAI piece is the caution label: detector scores can be biased against non-native English writers. That is why the dataset includes detector bias as a high-risk signal.

Dataset format

Each row looks like this:

    {
      "id": "code-dependency-hallucination",
      "domain": "code",
      "signal": "Generated code may introduce nonexistent or obscure dependencies.",
      "why_it_matters": "OpenSSF guidance highlights hallucinated package names and slopsquatting as supply-chain risks for AI coding assistants.",
      "review_action": "Verify every new dependency in the official package index, prefer standard-library or well-known packages, and pin versions where appropriate.",
      "risk_level": "high",
      "source_ids": ["openssf-ai-code-assistant"]
    }

Files:

- signals.jsonl
- sources.json
- README.md

How I would use it

In code review:

- scan new dependencies
- check tests around edge cases
- inspect comments that explain nothing
- ask for security review where generated code touches auth, input parsing, file paths, network calls, or secrets

In writing review:

- edit the middle section first
- replace filler with project facts
- add one artifact link
- add one constraint
- add one result
- avoid treating detector output as evidence by itself

Related work

I grouped the dataset with my small-model and deployment artifacts here: https://huggingface.co/collections/yava-code/applied-small-ai-portfolio-6a304c83f9f1d089a28c101b

This collection includes a
small EuroSAT classifier, a Gradio Space, the AI-authorship dataset, and compact coding model work.

Sources

- Automatic Detection of LLM-Generated Code: https://arxiv.org/html/2409.01382v2
- Security Weaknesses of Copilot-Generated Code: https://arxiv.org/html/2310.02059v4
- OpenSSF AI code assistant guide: https://best.openssf.org/Security-Focused-Guide-for-AI-Code-Assistant-Instructions.html
- Stanford HAI detector bias note: https://hai.stanford.edu/news/ai-detectors-biased-against-non-native-english-writers
- PubMed vocabulary study: https://pmejournal.org/articles/10.5334/pme.1929
- AI writing segment study: https://arxiv.org/html/2501.19301v2
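
Appendix: reading the rows

As a closing sketch, rows in the format shown under "Dataset format" can be turned into a per-domain review checklist with the standard library alone. The field names come from the sample row above; that every row in signals.jsonl carries all of them is an assumption, which is why the parser checks. The function names `parse_rows` and `review_checklist` are hypothetical helpers, not part of the dataset.

```python
import json
from collections import defaultdict

# Field names taken from the sample row; assumed to be present on every row.
REQUIRED_FIELDS = {
    "id", "domain", "signal", "why_it_matters",
    "review_action", "risk_level", "source_ids",
}

# The sample row from the article, serialized as one JSONL line.
SAMPLE_ROW = {
    "id": "code-dependency-hallucination",
    "domain": "code",
    "signal": "Generated code may introduce nonexistent or obscure dependencies.",
    "why_it_matters": "OpenSSF guidance highlights hallucinated package names "
                      "and slopsquatting as supply-chain risks for AI coding assistants.",
    "review_action": "Verify every new dependency in the official package index, "
                     "prefer standard-library or well-known packages, "
                     "and pin versions where appropriate.",
    "risk_level": "high",
    "source_ids": ["openssf-ai-code-assistant"],
}
SAMPLE_JSONL = json.dumps(SAMPLE_ROW) + "\n"


def parse_rows(jsonl_text: str) -> list[dict]:
    """Parse JSONL text (one JSON object per line) and check the expected fields."""
    rows = []
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            raise ValueError(f"line {lineno} is missing fields: {sorted(missing)}")
        rows.append(row)
    return rows


def review_checklist(rows: list[dict], domain: str) -> dict[str, list[str]]:
    """Group the review actions for one domain by risk level."""
    checklist = defaultdict(list)
    for row in rows:
        if row["domain"] == domain:
            checklist[row["risk_level"]].append(row["review_action"])
    return dict(checklist)


if __name__ == "__main__":
    rows = parse_rows(SAMPLE_JSONL)
    for risk, actions in review_checklist(rows, "code").items():
        for action in actions:
            print(f"[{risk}] {action}")
```

To run it against the real file, replace `SAMPLE_JSONL` with the contents of a local copy of signals.jsonl, for example `open("signals.jsonl", encoding="utf-8").read()`.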