--- license: apache-2.0 base_model: Octen/Octen-Embedding-4B library_name: transformers tags: - embeddings - quantized - w4a16 - auto-round - auto-gptq --- # Octen-Embedding-4B W4A16 This repo contains a W4A16 quantized version of `Octen/Octen-Embedding-4B` in the validated `auto-round-auto-gptq` format. ## Quantization | Item | Value | |---|---| | Base model | `Octen/Octen-Embedding-4B` | | Quantization | W4A16, 4-bit weights / 16-bit activations | | Tooling | AutoRound 0.12.2, transformers 5.6.2, torch 2.6.0+cu124 | | Calibration | 8 samples, seqlen 512, 200 iterations, float32 tuning | | Quantized size | 3.3 GB | | Base size | 7.6 GB | | Compression | ~2.3x | | Embedding dim | 2560 | | Layers quantized | 252/253; `lm_head` skipped due shape divisibility | ## Validation vs base model Evaluation used a small retrieval set of 5 query-document pairs, last-token pooling, L2 normalization, and cosine similarity. | Metric | Base | W4A16 | Delta | |---|---:|---:|---:| | Recall@1 | 1.0 | 1.0 | 0.0 | | Recall@5 | 1.0 | 1.0 | 0.0 | | Mean query cosine, base vs quant | — | 0.9783 | — | | Mean doc cosine, base vs quant | — | 0.9777 | — | Assessment: retrieval did not degrade on the validation set. The embedding cosine is marginally below a strict 0.98 threshold, but Recall@1/Recall@5 were unchanged. A prior validation run showed query cosine 0.9799 and doc cosine 0.9790. See `validation-4b-auto-round-auto-gptq.json` for the raw metrics. ## RTX 3060 smoke test This quantized model was loaded and run on an RTX 3060 12GB GPU. | Result | Value | |---|---:| | VRAM after load | 2.50 GB | | Single short-query forward pass | 0.6s smoke test; later benchmark ~329ms | | Output shape | `[1, 4, 2560]` | | Embeddings | Valid normalized vectors; no NaNs observed | ## Recommended usage ```python import torch from transformers import AutoModel, AutoTokenizer model_id = "groxaxo/octen-embedding-4b-w4a16" tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) model = AutoModel.from_pretrained( model_id, trust_remote_code=True, torch_dtype=torch.float16, ).cuda().eval() texts = ["how to implement binary search"] tokens = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt") tokens = {k: v.cuda() for k, v in tokens.items()} with torch.no_grad(): out = model(**tokens) emb = torch.nn.functional.normalize(out.last_hidden_state[:, -1, :], p=2, dim=-1) ``` Note: the model card records local validation and smoke-test results. For production use, evaluate on your own retrieval distribution.