zandzpider committed on
Commit 481e49d · verified · 1 Parent(s): 91162ca

Create README.md

Files changed (1)
  1. README.md +57 -0
README.md ADDED
@@ -0,0 +1,57 @@
+ ---
+ license: other
+ base_model: Ewere/Qwen3-30B-A3B-abliterated-erotic
+ tags:
+ - quantized
+ - autoround
+ - int4
+ - qwen3
+ - vllm
+ quantization:
+   method: autoround
+   bits: 4
+ ---
+
+ # Qwen3-30B AutoRound Int4
+
+ 4-bit AutoRound quantization of Qwen3-30B-A3B-abliterated-erotic.
+
+ ## Quantization Details
+
+ - Method: AutoRound (SignRound optimization)
+ - Bits: 4-bit (W4A16 symmetric)
+ - Group Size: 128
+ - Calibration: 512 samples from NeelNanda/pile-10k
+ - Iterations: 200 (light mode)
+
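The settings listed above map onto the `auto-round` Python package roughly as follows. This is a minimal sketch, not the script that produced this checkpoint: the `quantize` helper, the output directory, and the `auto_round` export format are assumptions made for illustration, and running it requires the full FP16 weights of the base model.

```python
# Hypothetical reproduction sketch; the original quantization script
# is not part of this README.
BASE_MODEL = "Ewere/Qwen3-30B-A3B-abliterated-erotic"

# Settings taken from the "Quantization Details" list.
AUTOROUND_KWARGS = {
    "bits": 4,                        # 4-bit weights (W4A16)
    "group_size": 128,                # one scale per 128 weights
    "sym": True,                      # symmetric quantization
    "nsamples": 512,                  # calibration samples
    "iters": 200,                     # tuning iterations per block
    "dataset": "NeelNanda/pile-10k",  # calibration dataset
}


def quantize(output_dir: str = "./Qwen3-30B-AutoRound-Int4") -> None:
    """Quantize the base model and save it (output_dir is an assumed path)."""
    # Imported lazily so the settings can be inspected without heavy dependencies.
    from auto_round import AutoRound
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

    autoround = AutoRound(model, tokenizer, **AUTOROUND_KWARGS)
    # Assumption: export in the native auto_round format.
    autoround.quantize_and_save(output_dir, format="auto_round")
```

Calling `quantize()` starts the actual run; keeping the settings in a plain dictionary makes it easy to compare them against the list above before committing GPU time.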
+ ## Model Size
+
+ - Original (FP16): ~60GB
+ - Quantized (Int4): ~17GB
+ - Compression: ~3.5x
+
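The size figures above can be sanity-checked with a little arithmetic. The sketch below uses only the rounded numbers from this README and a simplified storage model, which is an assumption: each group of 128 4-bit weights carries one FP16 scale and nothing else.

```python
GB = 1e9  # decimal gigabytes, matching the rounded figures above

fp16_size_gb = 60.0  # "Original (FP16)" from the list above
int4_size_gb = 17.0  # "Quantized (Int4)" from the list above
group_size = 128     # from "Quantization Details"

# FP16 stores 2 bytes per weight, so the size implies the parameter count.
n_params = fp16_size_gb * GB / 2

# Assumption: 4 bits per weight plus one 16-bit scale shared by each group.
bits_per_weight = 4 + 16 / group_size

# Size if every single weight were stored this way.
ideal_int4_gb = n_params * bits_per_weight / 8 / GB
compression = fp16_size_gb / int4_size_gb

print(f"parameters implied by FP16 size: {n_params / 1e9:.0f}B")
print(f"bits per quantized weight:       {bits_per_weight}")
print(f"size if every weight were int4:  {ideal_int4_gb:.1f} GB")
print(f"reported compression ratio:      {compression:.2f}x")
```

Under these assumptions the all-int4 size comes out at about 15.5 GB and the ratio at about 3.53x. The gap to the reported ~17GB would be consistent with some tensors being kept in 16-bit precision, though this README does not say which ones.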
+ ## Usage with vLLM
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(
+     # Placeholder repo id: replace with the actual model repository.
+     model="yourusername/Qwen3-30B-AutoRound-Int4",
+     quantization="auto-round",
+     dtype="float16",
+     gpu_memory_utilization=0.9,
+     max_model_len=16384,
+     trust_remote_code=True,
+ )
+
+ sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
+ outputs = llm.generate(["Your prompt here"], sampling_params)
+
+ # Print the generated text for each prompt.
+ for output in outputs:
+     print(output.outputs[0].text)
+ ```
+
+ ## Hardware Requirements
+
+ - VRAM: 18-20GB (single GPU) or 2x RTX 3090 with tensor parallelism
+ - Inference Speed: 35-55 tokens/sec on 2x RTX 3090
+
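For the 2x RTX 3090 setup, the usage example can be extended with vLLM's `tensor_parallel_size` argument, and throughput can be measured directly. This is a rough sketch under stated assumptions: `ENGINE_KWARGS`, `tokens_per_second`, and `benchmark` are hypothetical names, the repo id is the placeholder from the usage example, and the timing includes prompt processing, so it will not exactly match the decode speed quoted above.

```python
import time

# Placeholder repo id copied from the usage example; replace with the real one.
MODEL_ID = "yourusername/Qwen3-30B-AutoRound-Int4"

# Same engine settings as the usage example, plus tensor parallelism.
ENGINE_KWARGS = {
    "model": MODEL_ID,
    "quantization": "auto-round",
    "dtype": "float16",
    "gpu_memory_utilization": 0.9,
    "max_model_len": 16384,
    "trust_remote_code": True,
    "tensor_parallel_size": 2,  # shard the model across 2 GPUs
}


def tokens_per_second(n_generated_tokens: int, elapsed_s: float) -> float:
    """Throughput: generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_generated_tokens / elapsed_s


def benchmark(prompt: str = "Your prompt here", max_tokens: int = 512) -> float:
    """Generate once and return tokens/sec (requires vLLM and two GPUs)."""
    from vllm import LLM, SamplingParams  # imported lazily: heavy, GPU-only

    llm = LLM(**ENGINE_KWARGS)
    params = SamplingParams(temperature=0.7, max_tokens=max_tokens)

    start = time.perf_counter()
    outputs = llm.generate([prompt], params)
    elapsed = time.perf_counter() - start

    # Count the tokens actually generated across all completions.
    n_tokens = sum(len(c.token_ids) for o in outputs for c in o.outputs)
    return tokens_per_second(n_tokens, elapsed)
```

A single short prompt gives a noisy number; averaging several runs after a warm-up generation should give a figure that is easier to compare with the range listed above.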
+ ## Base Model
+
+ Based on https://huggingface.co/Ewere/Qwen3-30B-A3B-abliterated-erotic