zandzpider/Qwen3-30B-A3B-abliterated-erotic-autoround-int4

4-bit precision

Not-For-All-Audiences


zandzpider committed on Nov 14, 2025

Commit 481e49d · verified · 1 Parent(s): 91162ca

Create README.md

Files changed (1)

README.md +57 -0

README.md ADDED

	@@ -0,0 +1,57 @@

+---
+  license: other
+  base_model: Ewere/Qwen3-30B-A3B-abliterated-erotic
+  tags:
+    - quantized
+    - autoround
+    - int4
+    - qwen3
+    - vllm
+  quantization:
+    method: autoround
+    bits: 4
+---
+# Qwen3-30B AutoRound Int4
+4-bit AutoRound quantization of Qwen3-30B-A3B-abliterated-erotic.
+## Quantization Details
+- Method: AutoRound (SignRound optimization)
+- Bits: 4-bit (W4A16 symmetric)
+- Group Size: 128
+- Calibration: 512 samples from NeelNanda/pile-10k
+- Iterations: 200 (light mode)
+## Model Size
+- Original (FP16): ~60GB
+- Quantized (Int4): ~17GB
+- Compression: ~3.5x (60GB / 17GB)
+## Usage with vLLM
+```python
+from vllm import LLM, SamplingParams
+llm = LLM(
+    model="zandzpider/Qwen3-30B-A3B-abliterated-erotic-autoround-int4",
+    quantization="auto-round",
+    dtype="float16",
+    gpu_memory_utilization=0.9,
+    max_model_len=16384,
+    trust_remote_code=True,
+)
+sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
+outputs = llm.generate(["Your prompt here"], sampling_params)
+```
+## Hardware Requirements
+- VRAM: 18-20GB on a single GPU, or split across 2x RTX 3090 with tensor parallelism
+- Inference Speed: 35-55 tokens/sec on 2x RTX 3090
+## Base Model
+Based on https://huggingface.co/Ewere/Qwen3-30B-A3B-abliterated-erotic
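
The size figures in the README's "Model Size" section can be sanity-checked with a little arithmetic. The sketch below is not part of the commit. It assumes a total parameter count of about 30.5B (`PARAMS`) and one FP16 scale per group of 128 weights (`SCALE_BITS`); neither figure is stated in the README, and the helper names are hypothetical. The estimate covers only uniformly quantized weights, so it should land a little under the reported ~17GB if some layers (for example embeddings or the output head) are kept in higher precision.

```python
# Rough size estimate for group-wise 4-bit weight-only quantization.
# Values marked "assumed" do not come from the README.

PARAMS = 30.5e9        # assumed total parameter count of the base model
GROUP_SIZE = 128       # from the README
WEIGHT_BITS = 4        # from the README
SCALE_BITS = 16        # assumed: one FP16 scale per group (symmetric, no zero-point)

README_FP16_GB = 60    # reported in the README
README_INT4_GB = 17    # reported in the README


def fp16_size_gb(params: float) -> float:
    """Size in GB (10^9 bytes) at 2 bytes per parameter."""
    return params * 2 / 1e9


def int4_size_gb(params: float,
                 weight_bits: int = WEIGHT_BITS,
                 scale_bits: int = SCALE_BITS,
                 group_size: int = GROUP_SIZE) -> float:
    """Size in GB when every weight is quantized, including per-group scale overhead."""
    bits_per_weight = weight_bits + scale_bits / group_size
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    print(f"FP16 estimate: {fp16_size_gb(PARAMS):.1f} GB")
    print(f"Int4 estimate: {int4_size_gb(PARAMS):.1f} GB")
    print(f"README ratio:  {README_FP16_GB / README_INT4_GB:.2f}x")
```

Under these assumptions the per-group scale adds 16/128 = 0.125 bits per weight, so the effective width is 4.125 bits. The reported 60GB / 17GB ratio is slightly lower than the idealized estimate, which is consistent with some tensors remaining unquantized.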