---
license: apache-2.0
base_model: Octen/Octen-Embedding-4B
library_name: transformers
tags:
  - embeddings
  - quantized
  - w4a16
  - auto-round
  - auto-gptq
---

# Octen-Embedding-4B W4A16

This repo contains a W4A16 quantized version of `Octen/Octen-Embedding-4B` in the validated `auto-round-auto-gptq` format.

## Quantization

| Item | Value |
|---|---|
| Base model | `Octen/Octen-Embedding-4B` |
| Quantization | W4A16, 4-bit weights / 16-bit activations |
| Tooling | AutoRound 0.12.2, transformers 5.6.2, torch 2.6.0+cu124 |
| Calibration | 8 samples, seqlen 512, 200 iterations, float32 tuning |
| Quantized size | 3.3 GB |
| Base size | 7.6 GB |
| Compression | ~2.3x |
| Embedding dim | 2560 |
| Layers quantized | 252/253; `lm_head` skipped due shape divisibility |

## Validation vs base model

Evaluation used a small retrieval set of 5 query-document pairs, last-token pooling, L2 normalization, and cosine similarity.

| Metric | Base | W4A16 | Delta |
|---|---:|---:|---:|
| Recall@1 | 1.0 | 1.0 | 0.0 |
| Recall@5 | 1.0 | 1.0 | 0.0 |
| Mean query cosine, base vs quant | — | 0.9783 | — |
| Mean doc cosine, base vs quant | — | 0.9777 | — |

Assessment: retrieval did not degrade on the validation set. The embedding cosine is marginally below a strict 0.98 threshold, but Recall@1/Recall@5 were unchanged. A prior validation run showed query cosine 0.9799 and doc cosine 0.9790.

See `validation-4b-auto-round-auto-gptq.json` for the raw metrics.

## RTX 3060 smoke test

This quantized model was loaded and run on an RTX 3060 12GB GPU.

| Result | Value |
|---|---:|
| VRAM after load | 2.50 GB |
| Single short-query forward pass | 0.6s smoke test; later benchmark ~329ms |
| Output shape | `[1, 4, 2560]` |
| Embeddings | Valid normalized vectors; no NaNs observed |

## Recommended usage

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "groxaxo/octen-embedding-4b-w4a16"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
).cuda().eval()

texts = ["how to implement binary search"]
tokens = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
tokens = {k: v.cuda() for k, v in tokens.items()}

with torch.no_grad():
    out = model(**tokens)

emb = torch.nn.functional.normalize(out.last_hidden_state[:, -1, :], p=2, dim=-1)
```

Note: the model card records local validation and smoke-test results. For production use, evaluate on your own retrieval distribution.