---
license: apache-2.0
base_model: Qwen/Qwen3-8B-Base
tags:
- pruning
- layer-pruning
- laco
- compressed
- qwen3
- llm
- efficient
library_name: transformers
pipeline_tag: text-generation
language:
- en
- zh
- multilingual
datasets:
- wikipedia
model-index:
- name: Qwen3-8B-LaCo-Pruned
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: PIQA
      type: piqa
    metrics:
    - type: accuracy_norm
      value: 71.38
      name: Accuracy (Normalized)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - type: accuracy_norm
      value: 61.98
      name: Accuracy (Normalized)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BoolQ
      type: boolq
    metrics:
    - type: accuracy
      value: 64.95
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: WinoGrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 62.83
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ARC-Challenge
      type: arc_challenge
    metrics:
    - type: accuracy_norm
      value: 36.09
      name: Accuracy (Normalized)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ARC-Easy
      type: arc_easy
    metrics:
    - type: accuracy_norm
      value: 58.04
      name: Accuracy (Normalized)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - type: accuracy
      value: 31.30
      name: Accuracy (5-shot)
---

# Qwen3-8B-LaCo-Pruned

This model is a **layer-pruned** version of [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) using the [LaCo (Layer Collapse)](https://arxiv.org/abs/2402.11187) structured pruning method.
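
For readers unfamiliar with LaCo, the core merge step can be sketched roughly as follows. This is a toy illustration based on our reading of the LaCo paper, not the code used to produce this model: the merged layer is the base layer plus the parameter differences of the layers that follow it, i.e. `theta*_l = theta_l + sum_k (theta_{l+k} - theta_l)`, after which the following layers are dropped. The names `Layer`, `collapse_layers`, and `num_following` are hypothetical, and plain Python lists stand in for weight tensors. The full method additionally accepts a merge only if the merged model's outputs stay close enough (by cosine similarity) to the original's, which this sketch omits.

```python
from typing import Dict, List

# Toy stand-in for a transformer layer: parameter name -> flattened weights.
Layer = Dict[str, List[float]]


def collapse_layers(layers: List[Layer], base: int, num_following: int) -> List[Layer]:
    """Merge `num_following` layers after `base` into `base`, then drop them.

    Computes theta*_base = theta_base + sum_k (theta_{base+k} - theta_base).
    The input list is not modified.
    """
    # Clamp so we never read past the last layer.
    num_following = min(num_following, len(layers) - base - 1)
    merged = {name: list(values) for name, values in layers[base].items()}
    for k in range(1, num_following + 1):
        for name, values in layers[base + k].items():
            base_values = layers[base][name]
            # Add this layer's difference from the base layer.
            merged[name] = [
                m + (v - b) for m, v, b in zip(merged[name], values, base_values)
            ]
    # Keep layers before `base`, the merged layer, and layers after the merged span.
    return layers[:base] + [merged] + layers[base + 1 + num_following:]


toy = [{"w": [1.0]}, {"w": [2.0]}, {"w": [4.0]}, {"w": [8.0]}]
collapsed = collapse_layers(toy, base=1, num_following=2)
print(len(collapsed), collapsed[1]["w"])  # 2 [10.0]
```

In the toy run, layers 2 and 3 are folded into layer 1: 2 + (4 - 2) + (8 - 2) = 10, and the stack shrinks from four layers to two.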

## Model Summary

| Attribute | Value |
|-----------|-------|
| **Base Model** | [Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) |
| **Pruning Method** | LaCo (Layer Collapse) |
| **Original Layers** | 36 |
| **Pruned Layers** | 30 |
| **Layers Removed** | 6 |
| **Compression** | 16.7% |

## Key Results

This model achieves **16.7% compression** while retaining:

- **~90% of physical reasoning** (PIQA)
- **~94% of commonsense reasoning** (WinoGrande)
- **~79% of common sense completion** (HellaSwag)
- **~41% of factual knowledge** (MMLU)

This is a **raw pruned model without post-training**. Fine-tuning can further recover lost capabilities.

---

## Benchmark Results (Before Post-Training)

**Note:** All benchmarks below are evaluated on the pruned model **without any post-training or fine-tuning**. These results represent the raw performance after pruning only. Post-training is expected to improve these scores, particularly on knowledge-intensive tasks like MMLU.

### Comparison with Original Qwen3-8B-Base

| Benchmark | Original | Pruned | Retention |
|-----------|----------|--------|-----------|
| **PIQA** (acc_norm) | 79.54% | 71.38% | 89.7% |
| **WinoGrande** | 67.0% | 62.83% | 93.8% |
| **ARC-Challenge** (acc_norm) | 42.0% | 36.09% | 85.9% |
| **ARC-Easy** (acc_norm) | 72.0% | 58.04% | 80.6% |
| **HellaSwag** (acc_norm) | 78.55% | 61.98% | 78.9% |
| **BoolQ** | 83.09% | 64.95% | 78.2% |
| **MMLU** (5-shot) | 76.89% | 31.30% | 40.7% |

*Original scores from [Qwen3 Technical Report](https://arxiv.org/abs/2505.09388). Retention is the pruned score divided by the original score.*

### Benchmark Interpretation

| Capability | Benchmarks | Retention | Status |
|------------|------------|-----------|--------|
| Physical Reasoning | PIQA | 89.7% | Excellent |
| Commonsense Reasoning | WinoGrande | 93.8% | Excellent |
| Basic Reasoning | ARC-Challenge | 85.9% | Good |
| Reading Comprehension | BoolQ | 78.2% | Good |
| Common Sense | HellaSwag | 78.9% | Good |
| Factual Knowledge | MMLU | 40.7% | Degraded |

---

## The "Knowledge Cliff"

Our experiments reveal a critical finding: **factual knowledge collapses catastrophically between 16.7% and 22.2% compression**.

| Compression | Layers | MMLU | Status |
|-------------|--------|------|--------|
| **16.7%** | **30** | **31.30%** | Partial retention |
| 22.2% | 28 | 25.89% | Random chance |
| 27.8% | 26 | 25.12% | Random chance |

While reasoning capabilities degrade gradually with compression, factual knowledge encoded in specific layers is lost abruptly when those layers are removed. MMLU is four-way multiple choice, so scores near 25% are at chance level.

---

## Intended Use

This model is suitable for:

- **Research** on model compression and efficiency
- **Fine-tuning base** for domain-specific applications
- **Inference optimization** where speed/memory matters
- **Applications prioritizing reasoning over factual recall**

## Limitations

**Important:** This is a raw pruned model without post-training.

| Use Case | Recommendation |
|----------|----------------|
| Physical/commonsense reasoning | Recommended |
| Reading comprehension | Recommended |
| General text understanding | Recommended |
| Factual question answering | Fine-tune first |
| Knowledge-intensive tasks | Fine-tune first |

---

## Pruning Details

### LaCo Hyperparameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| MERGE_LAYERS (C) | 3 | Layers merged per operation |
| LOWEST_LAY (L) | 4 | Minimum layer index for merging |
| HIGHEST_LAY (H) | 28 | Maximum layer index for merging |
| INTERVAL (I) | 2 | Minimum gap between merge points |
| THRESHOLD (T) | 0.85 | Cosine similarity threshold |
| MAX_COMPRESSION | 20% | Maximum allowed compression |

### Pruning Statistics

| Metric | Value |
|--------|-------|
| Successful Merges | 3 |
| Rejected Merges | 0 |
| Total Iterations | 4 |
| Final Compression | 16.7% |

---

## Usage

### Basic Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mercity/Qwen3-8B-LaCo-Pruned"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# Text generation
prompt = "The process of photosynthesis"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With 4-bit Quantization (Further Compression)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit weights with float16 compute (requires the bitsandbytes package)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "Mercity/Qwen3-8B-LaCo-Pruned",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)
```

---

## Recovery Recommendations

To improve factual knowledge after pruning:

### LoRA Fine-tuning (Recommended)

```python
from peft import LoraConfig, get_peft_model

# `model` is the pruned model loaded as in the usage examples above
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
# Fine-tune on OpenOrca, Alpaca, or domain-specific data
```

**Expected recovery:** MMLU could reach 45-55% with fine-tuning.
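
To put that recovery estimate on the same scale as the Retention column in the benchmark table, the small helper below converts an absolute score into a percentage of the original model's score. It is only an arithmetic sketch: the function name `retention` is hypothetical, the MMLU figures are the ones reported in this card, and the 45% and 55% targets are the estimate stated above, not measured results.

```python
def retention(pruned_score: float, original_score: float) -> float:
    """Return the pruned score as a percentage of the original score."""
    if original_score <= 0:
        raise ValueError("original_score must be positive")
    return 100.0 * pruned_score / original_score


MMLU_ORIGINAL = 76.89  # Qwen3-8B-Base, as reported in this card
MMLU_PRUNED = 31.30    # this model, before any fine-tuning

print(round(retention(MMLU_PRUNED, MMLU_ORIGINAL), 1))  # 40.7

# Retention implied by the estimated post-fine-tuning range (not measured)
for target in (45.0, 55.0):
    print(target, round(retention(target, MMLU_ORIGINAL), 1))
```

Under these numbers, reaching 45-55% MMLU would correspond to roughly 58.5-71.5% retention, compared with 40.7% for the raw pruned model.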

---

## Technical Specifications

| Attribute | Value |
|-----------|-------|
| Architecture | Transformer decoder-only |
| Layers | 30 |
| Hidden Size | 4096 |
| Attention Heads (Q) | 32 |
| Attention Heads (KV) | 8 (GQA) |
| Intermediate Size | 12288 |
| Vocabulary Size | 151,669 |
| Max Context Length | 32,768 tokens |
| Precision | bfloat16 |

---

## Citation

If you use this model, please cite the original LaCo paper and Qwen3:

```bibtex
@article{yang2024laco,
  title={LaCo: Large Language Model Pruning via Layer Collapse},
  author={Yang, Yifei and Cao, Zouying and Zhao, Hai},
  journal={arXiv preprint arXiv:2402.11187},
  year={2024}
}

@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}
```

## References

- [LaCo Paper](https://arxiv.org/abs/2402.11187)
- [LaCo Official Implementation](https://github.com/yangyifei729/LaCo)
- [Qwen3 Technical Report](https://arxiv.org/abs/2505.09388)
- [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)

## License

Apache 2.0 (same as base Qwen3 model)

## Acknowledgments

- Qwen Team for the excellent Qwen3-8B-Base model
- LaCo authors for the pruning methodology
- Hugging Face for model hosting