---
license: apache-2.0
base_model: Octen/Octen-Embedding-8B
library_name: transformers
tags:
- embeddings
- quantized
- w4a16
- auto-round
- auto-gptq
---

# Octen-Embedding-8B W4A16

This repo contains a W4A16 quantized version of `Octen/Octen-Embedding-8B` in the validated `auto-round-auto-gptq` format.

## Quantization

| Item | Value |
|---|---|
| Base model | `Octen/Octen-Embedding-8B` |
| Quantization | W4A16, 4-bit weights / 16-bit activations |
| Tooling | AutoRound 0.12.2, transformers 5.6.2, torch 2.6.0+cu124 |
| Calibration | 8 samples, seqlen 512, 200 iterations, float32 tuning |
| Quantized size | 8.1 GB, 2 shards |
| Base size | 15.0 GB |
| Compression | ~1.9x |
| Embedding dim | 4096 |
| Layers quantized | 252/253; `lm_head` skipped |

## Validation vs base model

Evaluation used a small retrieval set of 5 query-document pairs, last-token pooling, L2 normalization, and cosine similarity.

| Metric | Base | W4A16 | Delta |
|---|---:|---:|---:|
| Recall@1 | 0.8 | 1.0 | +0.2 |
| Recall@5 | 1.0 | 1.0 | 0.0 |
| Mean query cosine, base vs quant | n/a | 0.9840 | n/a |
| Mean doc cosine, base vs quant | n/a | 0.9820 | n/a |

Assessment: this model passed all validation gates cleanly, with >0.98 mean cosine to the base model and no retrieval degradation on the validation set. See `validation-8b-auto-round-auto-gptq.json` for the raw metrics.

## RTX 3060 smoke test

This quantized model was loaded and run on an RTX 3060 12GB GPU.

| Result | Value |
|---|---:|
| VRAM after load | 4.53 GB |
| Single short-query forward pass | 0.9 s in the smoke test; ~612 ms in a later benchmark |
| Output shape | `[1, 4, 4096]` |
| Embeddings | Valid normalized vectors; no NaNs observed |

## Recommended usage

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "groxaxo/octen-embedding-8b-w4a16"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
).cuda().eval()

texts = ["how to implement binary search"]
tokens = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
# Move the tokenized inputs to the same device as the model.
tokens = {k: v.cuda() for k, v in tokens.items()}

with torch.no_grad():
    out = model(**tokens)

# Last-token pooling followed by L2 normalization.
# Index -1 is the final real token only for a single input or a left-padded batch.
emb = torch.nn.functional.normalize(out.last_hidden_state[:, -1, :], p=2, dim=-1)
```

Note: this model card records local validation and smoke-test results only. For production use, evaluate on your own retrieval distribution.
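
As a starting point for that evaluation, the two metrics reported in the validation table (Recall@k and mean paired cosine) can be sketched in plain Python. This is a minimal sketch, not the validation script used for this repo: the function names and the toy vectors are illustrative assumptions, and it assumes query `i` is paired with document `i`. In practice the inputs would be embeddings produced by the base and quantized models.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity, computed as the dot product of L2-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def recall_at_k(query_embs, doc_embs, k):
    """Fraction of queries whose paired document (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(query_embs):
        scores = [cosine(query, doc) for doc in doc_embs]
        ranked = sorted(range(len(doc_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


def mean_paired_cosine(embs_a, embs_b):
    """Mean cosine between matching rows, e.g. base vs quantized embeddings."""
    return sum(cosine(a, b) for a, b in zip(embs_a, embs_b)) / len(embs_a)


# Toy 2-D vectors standing in for real embeddings (illustrative only).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
docs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]

recall_1 = recall_at_k(queries, docs, k=1)
recall_3 = recall_at_k(queries, docs, k=3)
self_cosine = mean_paired_cosine(queries, queries)
print(recall_1, recall_3, self_cosine)
```

To compare the quantized model against the base model, call `recall_at_k` once per model on the same query and document texts, and call `mean_paired_cosine` on the two models' embeddings of the same texts. With a small evaluation set, each query moves Recall@1 by a large step, so a larger set drawn from your own data gives a more reliable comparison.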