---
base_model: Qwen/Qwen3-1.7B
datasets:
- open-r1/DAPO-Math-17k-Processed
- mandarjoshi/trivia_qa
language:
- en
license: apache-2.0
pipeline_tag: text-classification
library_name: transformers
tags:
- gnosis
- sft
- qwen3
- triviaqa
- dapo-math
- correctness-detection
- dapo
- hallucination
- selfevaluation
- reward-model
pretty_name: Gnosis — Qwen3-1.7B (Self-Awareness Correctness Head)
---

# Gnosis — Qwen3-1.7B (Self-Awareness Correctness Head)

Gnosis is a lightweight self-awareness head that attaches to a **frozen** LLM and predicts a **scalar correctness probability** for a generated response. It reads the backbone’s internal signals, namely **hidden-state features (latent dynamics)** and **attention-map patterns**, and learns hallucination and error cues directly from them.

- **Paper:** [Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits](https://huggingface.co/papers/2512.20578)
- **Project code & instructions:** [GitHub Repository](https://github.com/Amirhosein-gh98/Gnosis)

## Why it matters
- **Strong verifier signal without a large external reward model** (no RM routing / no judge LLM calls).
- **~1000× smaller** than ~8B reward-model verifiers (**~5M params vs ~8B**).
- **~100× faster** than routing through an ~8B reward model.
- **Early error detection**: can flag likely errors before generation finishes.
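
One way to act on an early-detection signal is sketched below. It is a sketch under stated assumptions, not the procedure from the paper: it assumes the correctness probability has already been computed at several checkpoints during generation, and the helper `first_flag_index`, the `threshold`, and the `patience` values are all hypothetical.

```python
from typing import Iterable, Optional


def first_flag_index(
    probs: Iterable[float], threshold: float = 0.3, patience: int = 2
) -> Optional[int]:
    """Return the index of the check at which a response is flagged, or None.

    A response is flagged once the correctness probability has stayed below
    `threshold` for `patience` consecutive checks. Both values are
    illustrative assumptions, not settings from the paper.
    """
    below = 0
    for i, p in enumerate(probs):
        # Count consecutive low-probability checks; reset on any recovery.
        below = below + 1 if p < threshold else 0
        if below >= patience:
            return i
    return None


# Hypothetical scores taken at successive points of a partial generation.
scores = [0.62, 0.41, 0.28, 0.22, 0.19]
flag_at = first_flag_index(scores)
print("Flagged at check:", flag_at)
```

Requiring several consecutive low scores trades a slightly later flag for fewer false alarms on a single noisy estimate.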

## Evaluated backbones & benchmarks (from the paper)
- **Backbones**: Qwen3 family + OpenAI gpt-oss-20B.
- **Benchmarks**: Math-Reasoning (AMC12 2022/2023, AIME 2024/2025, HMMT Feb 2025), Open-Domain QA (18k held-out TriviaQA), Academic Knowledge Reasoning (MMLU-Pro).

## Training data
Mixed **math + trivia** training corpus:
- **Math**: English portion of DAPO-Math-17k (~14k).
- **Trivia**: 40k subsample from the TriviaQA training set.
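
As a rough illustration of the mixture above, the snippet below computes each source's share of the combined corpus and shows a seeded subsample. The counts are the approximate figures from the list; the helper names `mixture_shares` and `subsample` and the seed are hypothetical and not taken from the project code.

```python
import random
from typing import Dict, List, Sequence


def mixture_shares(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each source's fraction of the combined corpus."""
    total = sum(sizes.values())
    return {name: n / total for name, n in sizes.items()}


def subsample(items: Sequence, k: int, seed: int = 0) -> List:
    """Draw k items without replacement, reproducibly for a given seed."""
    return random.Random(seed).sample(list(items), k)


# Approximate example counts from the list above.
sizes = {"math": 14_000, "trivia": 40_000}
shares = mixture_shares(sizes)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Placeholder integers stand in for real examples.
picked = subsample(range(100), 10)
```

With these approximate counts, math makes up about a quarter of the mixed corpus and trivia the remaining three quarters.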

## Usage (inference)
This repo requires the **local Transformers fork with Gnosis integrated** (see the GitHub repo instructions). After installing it, run:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.demo import build_chat_prompt, generate_with_hf, correctness_prob

GNOSIS_MODEL_ID = "AmirhoseinGH/Gnosis-Qwen3-1.7B-Hybrid"

tokenizer = AutoTokenizer.from_pretrained(GNOSIS_MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    GNOSIS_MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()

prompt = build_chat_prompt(
    tokenizer,
    question="How many r's are in strawberry?",
    system_prompt="Please reason step by step, and put your final answer within \\boxed{}.",
)

answer = generate_with_hf(model, tokenizer, prompt, torch.device("cuda"), max_new_tokens=2048)
p_correct = correctness_prob(model, tokenizer, prompt + answer, torch.device("cuda"))

print("Answer:\n", answer)
print("Gnosis correctness probability:", f"{p_correct:.4f}")
```
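
The correctness probability can also serve as a verifier signal across several sampled answers. The sketch below shows one possible use, selecting the highest-scoring candidate and abstaining when even that score is low. It is an assumption about downstream usage rather than a documented feature: `pick_best`, `min_prob`, and the stub scores are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def pick_best(
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
    min_prob: float = 0.5,
) -> Tuple[Optional[str], float]:
    """Return (best candidate, its score), or (None, score) to abstain.

    `score_fn` maps a candidate answer to a correctness probability.
    `min_prob` is an illustrative cutoff, not a recommended value.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    scored = [(score_fn(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score < min_prob:
        return None, best_score
    return best, best_score


# Stub scores stand in for real model calls so the sketch runs anywhere.
stub_scores = {"answer A": 0.35, "answer B": 0.81, "answer C": 0.64}
best, p = pick_best(list(stub_scores), stub_scores.get)
print(best, f"{p:.2f}")
```

With the model loaded as in the snippet above, `score_fn` could wrap `correctness_prob`, for example `lambda ans: correctness_prob(model, tokenizer, prompt + ans, torch.device("cuda"))`.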

## Citation

```bibtex
@misc{ghasemabadi2025llmspredictfailures,
      title={Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits}, 
      author={Amirhosein Ghasemabadi and Di Niu},
      year={2025},
      eprint={2512.20578},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```