---
base_model: Qwen/Qwen3-1.7B
datasets:
- open-r1/DAPO-Math-17k-Processed
- mandarjoshi/trivia_qa
language:
- en
license: apache-2.0
pipeline_tag: text-classification
library_name: transformers
tags:
- gnosis
- sft
- qwen3
- triviaqa
- dapo-math
- correctness-detection
- dapo
- hallucination
- selfevaluation
- reward-model
pretty_name: Gnosis — Qwen3-1.7B (Self-Awareness Correctness Head)
---

# Gnosis — Qwen3-1.7B (Self-Awareness Correctness Head)

Gnosis is a lightweight self-awareness head that attaches to a **frozen** LLM and predicts a **scalar correctness probability** for a generated response. It reads the backbone’s internal signals, namely **hidden-state features (latent dynamics)** and **attention-map patterns**, to learn reliable hallucination / error cues directly from the model.

- **Paper:** [Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits](https://huggingface.co/papers/2512.20578)
- **Project code & instructions:** [GitHub Repository](https://github.com/Amirhosein-gh98/Gnosis)

## Why it matters

- **Strong verifier signal without a large external reward model** (no RM routing / no judge LLM calls).
- **~1000× smaller** than ~8B reward-model verifiers (**~5M params vs ~8B**).
- **~100× faster** than routing through an ~8B reward model.
- **Early error detection**: can flag likely errors before generation finishes.

## Evaluated backbones & benchmarks (from the paper)

- **Backbones**: Qwen3 family + OpenAI gpt-oss-20B.
- **Benchmarks**: Math-Reasoning (AMC12 2022/2023, AIME 2024/2025, HMMT Feb 2025), Open-Domain QA (18k held-out TriviaQA), Academic Knowledge Reasoning (MMLU-Pro).

## Training data

Mixed **math + trivia** training corpus:

- **Math**: English portion of DAPO-Math-17k (~14k).
- **Trivia**: 40k subsample from TriviaQA training set.

## Usage (inference)

This repo requires the **local Transformers fork with Gnosis integrated** (see the GitHub repo instructions).
After installing it, run:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.demo import build_chat_prompt, generate_with_hf, correctness_prob

GNOSIS_MODEL_ID = "AmirhoseinGH/Gnosis-Qwen3-1.7B-Hybrid"

# Load the tokenizer and the backbone with the Gnosis head (needs the Gnosis-enabled Transformers fork).
tokenizer = AutoTokenizer.from_pretrained(GNOSIS_MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    GNOSIS_MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).cuda().eval()

# Build the chat-formatted prompt.
prompt = build_chat_prompt(
    tokenizer,
    question="How many r's are in strawberry?",
    system_prompt="Please reason step by step, and put your final answer within \\boxed{}.",
)

# Generate an answer, then score the full prompt + answer text.
answer = generate_with_hf(model, tokenizer, prompt, torch.device("cuda"), max_new_tokens=2048)
p_correct = correctness_prob(model, tokenizer, prompt + answer, torch.device("cuda"))

print("Answer: ", answer)
print("Gnosis correctness probability:", f"{p_correct:.4f}")
```

## Citation

```bibtex
@misc{ghasemabadi2025llmspredictfailures,
      title={Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits},
      author={Amirhosein Ghasemabadi and Di Niu},
      year={2025},
      eprint={2512.20578},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
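
## Acting on the score (sketch)

The correctness probability from the usage snippet is a plain float, so it can gate or rank answers. The following is a minimal sketch, assuming several sampled answers have already been scored (for example with `correctness_prob`). The helper `select_answer` and the `0.5` threshold are hypothetical and are not part of the Gnosis repository; a suitable threshold would need to be calibrated on held-out data.

```python
from typing import List, Optional, Tuple


def select_answer(
    candidates: List[Tuple[str, float]],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> Optional[str]:
    """Return the answer with the highest correctness probability.

    `candidates` is a list of (answer_text, p_correct) pairs.
    Returns None when the list is empty or when even the best
    candidate scores below `threshold`.
    """
    if not candidates:
        return None
    best_answer, best_prob = max(candidates, key=lambda pair: pair[1])
    return best_answer if best_prob >= threshold else None


if __name__ == "__main__":
    # Illustrative scores only, not real model outputs.
    scored = [("2", 0.12), ("3", 0.91)]
    print(select_answer(scored))
```

Returning `None` when no candidate clears the threshold lets the caller abstain or resample instead of emitting an answer that is likely wrong.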