---
license: apache-2.0
base_model: Octen/Octen-Embedding-8B
library_name: transformers
tags:
  - embeddings
  - quantized
  - w4a16
  - auto-round
  - auto-gptq
---

# Octen-Embedding-8B W4A16

This repo contains a W4A16 quantized version of `Octen/Octen-Embedding-8B` in the validated `auto-round-auto-gptq` format.

## Quantization

| Item | Value |
|---|---|
| Base model | `Octen/Octen-Embedding-8B` |
| Quantization | W4A16, 4-bit weights / 16-bit activations |
| Tooling | AutoRound 0.12.2, transformers 5.6.2, torch 2.6.0+cu124 |
| Calibration | 8 samples, seqlen 512, 200 iterations, float32 tuning |
| Quantized size | 8.1 GB, 2 shards |
| Base size | 15.0 GB |
| Compression | ~1.9x |
| Embedding dim | 4096 |
| Layers quantized | 252/253; `lm_head` skipped |
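
For readers who want to attempt a similar run, the settings in the table map roughly onto an AutoRound call like the one below. This is a minimal sketch, not the script used for this repo. The helper name `quantize_w4a16`, the `QUANT_SETTINGS` dict, and the mapping of the card's `auto-round-auto-gptq` label to the `auto_round:auto_gptq` export format are assumptions. Values the table does not state (such as group size) are left at library defaults, and argument names may differ between AutoRound releases.

```python
# Settings taken from the table above; anything not listed there (for example
# group size) is left at the library default rather than guessed.
QUANT_SETTINGS = {
    "bits": 4,      # W4: 4-bit weights, activations stay 16-bit
    "nsamples": 8,  # calibration samples
    "seqlen": 512,  # calibration sequence length
    "iters": 200,   # tuning iterations
}

# Assumption: the card's "auto-round-auto-gptq" label corresponds to this
# AutoRound export format string.
EXPORT_FORMAT = "auto_round:auto_gptq"


def quantize_w4a16(model, tokenizer, output_dir):
    """Run AutoRound on an already-loaded model and save the result.

    Hypothetical helper: how the base model was loaded for the original run
    is not recorded in this card, so loading is left to the caller.
    """
    # Imported lazily so the settings can be inspected without auto-round installed.
    from auto_round import AutoRound

    autoround = AutoRound(model, tokenizer, **QUANT_SETTINGS)
    autoround.quantize_and_save(output_dir, format=EXPORT_FORMAT)
```

Check the AutoRound documentation for the installed version before relying on these argument names.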

## Validation vs base model

Evaluation used a small retrieval set of 5 query-document pairs, last-token pooling, L2 normalization, and cosine similarity.

| Metric | Base | W4A16 | Delta |
|---|---:|---:|---:|
| Recall@1 | 0.8 | 1.0 | +0.2 |
| Recall@5 | 1.0 | 1.0 | 0.0 |
| Mean query cosine, base vs quant | — | 0.9840 | — |
| Mean doc cosine, base vs quant | — | 0.9820 | — |

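Assessment: this model passed all validation gates, with mean cosine similarity to the base model above 0.98 for both queries and documents and no retrieval degradation on the validation set. With only 5 pairs, the +0.2 Recall@1 delta corresponds to a single query and should not be read as an improvement over the base model.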
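
The Recall@k rows can be reproduced along the following lines. This is a minimal pure-Python sketch that assumes query `i` is paired with document `i`. The function names and the toy 2-D vectors are invented for illustration and are not the validation data.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def recall_at_k(query_embs, doc_embs, k):
    """Fraction of queries whose paired document (same index) ranks in the top k."""
    docs = [l2_normalize(d) for d in doc_embs]
    hits = 0
    for i, q in enumerate(query_embs):
        q = l2_normalize(q)
        # For unit vectors, cosine similarity is just the dot product.
        scores = [sum(a * b for a, b in zip(q, d)) for d in docs]
        top_k = sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(query_embs)


# Toy vectors, invented for illustration only.
queries = [[1.0, 0.0], [0.0, 1.0], [0.2, 1.0]]
docs = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]
r1 = recall_at_k(queries, docs, k=1)
r2 = recall_at_k(queries, docs, k=2)
```

In this toy set `r1` is 2/3, because the third query ranks the second document above its own, and `r2` is 1.0.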

See `validation-8b-auto-round-auto-gptq.json` for the raw metrics.
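
The "mean cosine, base vs quant" rows compare the embedding each model produces for the same text. A minimal sketch of that computation follows, assuming both models have already embedded the same inputs in the same order. The function names and toy vectors are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(base_embs, quant_embs):
    """Average cosine between base and quantized embeddings of the same text."""
    sims = [cosine(b, q) for b, q in zip(base_embs, quant_embs)]
    return sum(sims) / len(sims)


# Toy vectors, invented for illustration only.
base = [[1.0, 0.0], [0.0, 2.0]]
quant = [[1.0, 0.0], [1.0, 1.0]]  # first text unchanged, second rotated 45 degrees
mean_sim = mean_pairwise_cosine(base, quant)
```

A value near 1.0 means quantization barely changed the direction of the embeddings. It is a different check from retrieval quality, which depends on how embeddings rank relative to each other.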

## RTX 3060 smoke test

This quantized model was loaded and run on an RTX 3060 12GB GPU.

| Result | Value |
|---|---:|
| VRAM after load | 4.53 GB |
| Single short-query forward pass | ~0.9 s (smoke test); ~612 ms (later benchmark) |
| Output shape (batch, tokens, hidden) | `[1, 4, 4096]` |
| Embeddings | Valid normalized vectors; no NaNs observed |
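
The "valid normalized vectors; no NaNs" check in the last row can be expressed as a small helper. This is a sketch under assumptions: `check_embedding` and its tolerance are hypothetical, not the smoke-test script itself. It takes a plain sequence of floats, for example `emb[0].tolist()` from the usage snippet.

```python
import math


def check_embedding(vec, tol=1e-2):
    """Return (ok, reason) for one embedding given as a sequence of floats.

    The tolerance is loose on purpose: float16 outputs rarely have a norm
    of exactly 1.0 after normalization.
    """
    if not all(math.isfinite(x) for x in vec):
        return False, "contains NaN or inf"
    norm = math.sqrt(sum(x * x for x in vec))
    if abs(norm - 1.0) > tol:
        return False, f"norm {norm:.4f} is not close to 1"
    return True, "ok"


# Toy inputs, invented for illustration only.
good = check_embedding([0.6, 0.8])
bad_nan = check_embedding([float("nan"), 1.0])
bad_norm = check_embedding([3.0, 4.0])
```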

## Recommended usage

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "groxaxo/octen-embedding-8b-w4a16"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
).cuda().eval()

texts = ["how to implement binary search"]
tokens = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
tokens = {k: v.cuda() for k, v in tokens.items()}

with torch.no_grad():
    out = model(**tokens)

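# Last-token pooling: take each sequence's final non-padding token, so the
# result is correct whether the tokenizer pads on the left or the right.
mask = tokens["attention_mask"]
if bool(mask[:, -1].all()):  # left padding, or no padding at all
    pooled = out.last_hidden_state[:, -1, :]
else:  # right padding: index the last real token of each row
    last = mask.sum(dim=1) - 1
    rows = torch.arange(mask.size(0), device=mask.device)
    pooled = out.last_hidden_state[rows, last]

emb = torch.nn.functional.normalize(pooled, p=2, dim=-1)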
```

Note: this model card records local validation and smoke-test results only. For production use, evaluate on your own retrieval distribution.