---
license: apache-2.0
base_model: Qwen/Qwen3-8B-Base
tags:
  - pruning
  - layer-pruning
  - laco
  - compressed
  - qwen3
  - llm
  - efficient
library_name: transformers
pipeline_tag: text-generation
language:
  - en
  - zh
  - multilingual
datasets:
  - wikipedia
model-index:
  - name: Qwen3-8B-LaCo-Pruned
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: PIQA
          type: piqa
        metrics:
          - type: accuracy_norm
            value: 71.38
            name: Accuracy (Normalized)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag
          type: hellaswag
        metrics:
          - type: accuracy_norm
            value: 61.98
            name: Accuracy (Normalized)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BoolQ
          type: boolq
        metrics:
          - type: accuracy
            value: 64.95
            name: Accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: WinoGrande
          type: winogrande
        metrics:
          - type: accuracy
            value: 62.83
            name: Accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: ARC-Challenge
          type: arc_challenge
        metrics:
          - type: accuracy_norm
            value: 36.09
            name: Accuracy (Normalized)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: ARC-Easy
          type: arc_easy
        metrics:
          - type: accuracy_norm
            value: 58.04
            name: Accuracy (Normalized)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU
          type: mmlu
        metrics:
          - type: accuracy
            value: 31.30
            name: Accuracy (5-shot)
---

# Qwen3-8B-LaCo-Pruned

This model is a **layer-pruned** version of [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) using the [LaCo (Layer Collapse)](https://arxiv.org/abs/2402.11187) structured pruning method.

## Model Summary

| Attribute | Value |
|-----------|-------|
| **Base Model** | [Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) |
| **Pruning Method** | LaCo (Layer Collapse) |
| **Original Layers** | 36 |
| **Layers After Pruning** | 30 |
| **Layers Removed** | 6 |
| **Compression** | 16.7% (6 of 36 layers) |
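
The compression figure in this card is measured in decoder layers, not parameters. A minimal sketch of that arithmetic, using only the layer counts reported in this card (the helper name `layer_compression` is illustrative, not part of any library):

```python
ORIGINAL_LAYERS = 36  # decoder layers in Qwen3-8B-Base, per the table above


def layer_compression(remaining_layers: int, original_layers: int = ORIGINAL_LAYERS) -> float:
    """Return the percentage of decoder layers that were removed."""
    removed = original_layers - remaining_layers
    return 100.0 * removed / original_layers


# 30 layers remain in this model, so 6 of 36 layers were removed
print(f"{layer_compression(30):.1f}%")
```

The same helper reproduces the 22.2% and 27.8% figures quoted for the 28-layer and 26-layer variants discussed later in this card.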

## Key Results

This model removes **16.7% of the decoder layers** while retaining the following share of the base model's benchmark scores:
- **~90% on physical reasoning** (PIQA)
- **~94% on commonsense reasoning** (WinoGrande)
- **~79% on commonsense sentence completion** (HellaSwag)
- **~41% on factual knowledge** (MMLU)

This is a **raw pruned model without post-training**. Fine-tuning may recover some of the lost capability.

---

## Benchmark Results (Before Post-Training)

**Note:** All benchmarks below are evaluated on the pruned model **without any post-training or fine-tuning**. These results represent the raw performance after pruning only. Post-training is expected to improve these scores, particularly on knowledge-intensive tasks like MMLU.

### Comparison with Original Qwen3-8B-Base

| Benchmark | Original | Pruned | Retention |
|-----------|----------|--------|-----------|
| **PIQA** (acc_norm) | 79.54% | 71.38% | 89.7% |
| **WinoGrande** | 67.0% | 62.83% | 93.8% |
| **ARC-Challenge** (acc_norm) | 42.0% | 36.09% | 85.9% |
| **ARC-Easy** (acc_norm) | 72.0% | 58.04% | 80.6% |
| **HellaSwag** (acc_norm) | 78.55% | 61.98% | 78.9% |
| **BoolQ** | 83.09% | 64.95% | 78.2% |
| **MMLU** (5-shot) | 76.89% | 31.30% | 40.7% |

*Original scores from [Qwen3 Technical Report](https://arxiv.org/abs/2505.09388)*
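
The Retention column is the pruned score divided by the original score. A short sketch that recomputes it from the values in the table above (the dictionary layout and the `retention` helper are illustrative):

```python
# (original %, pruned %) pairs copied from the comparison table
scores = {
    "PIQA": (79.54, 71.38),
    "WinoGrande": (67.0, 62.83),
    "ARC-Challenge": (42.0, 36.09),
    "ARC-Easy": (72.0, 58.04),
    "HellaSwag": (78.55, 61.98),
    "BoolQ": (83.09, 64.95),
    "MMLU": (76.89, 31.30),
}


def retention(original: float, pruned: float) -> float:
    """Return the pruned score as a percentage of the original score."""
    return 100.0 * pruned / original


for name, (original, pruned) in scores.items():
    print(f"{name:<14} {retention(original, pruned):5.1f}%")
```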

### Benchmark Interpretation

| Capability | Benchmarks | Retention | Status |
|------------|------------|-----------|--------|
| Physical Reasoning | PIQA | 89.7% | Excellent |
| Commonsense Reasoning | WinoGrande | 93.8% | Excellent |
| Basic Reasoning | ARC-Challenge | 85.9% | Good |
| Reading Comprehension | BoolQ | 78.2% | Good |
| Common Sense | HellaSwag | 78.9% | Good |
| Factual Knowledge | MMLU | 40.7% | Degraded |

---

## The "Knowledge Cliff"

Our experiments reveal a critical finding: **factual knowledge collapses between 16.7% and 22.2% compression** (30 and 28 layers), where MMLU falls to chance level.

| Compression | Layers | MMLU | Status |
|-------------|--------|------|--------|
| **16.7%** | **30** | **31.30%** | Partial retention |
| 22.2% | 28 | 25.89% | Random chance |
| 27.8% | 26 | 25.12% | Random chance |

While reasoning capabilities degrade gradually with compression, MMLU drops to chance once two more layers are removed. This suggests that factual knowledge depends on specific layers and is lost abruptly when those layers are removed.
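
Because MMLU is four-way multiple choice, a random guesser scores about 25%, so raw retention overstates how much knowledge survives. A rough sketch of a chance-adjusted view, using the MMLU numbers from this card. The 25% floor is the only value not taken from the tables, and the adjustment formula is one common convention rather than something measured in the original experiments:

```python
CHANCE = 25.0          # assumed floor for 4-way multiple choice
ORIGINAL_MMLU = 76.89  # base model score from the comparison table

# remaining layers -> MMLU accuracy (%), from the table above
mmlu_by_layers = {30: 31.30, 28: 25.89, 26: 25.12}


def chance_adjusted_retention(pruned: float,
                              original: float = ORIGINAL_MMLU,
                              chance: float = CHANCE) -> float:
    """Share of the above-chance accuracy that survives pruning, in percent."""
    return 100.0 * (pruned - chance) / (original - chance)


for layers, accuracy in mmlu_by_layers.items():
    print(f"{layers} layers: {chance_adjusted_retention(accuracy):.1f}% of above-chance accuracy")
```

Under this view the 30-layer model keeps roughly 12% of the base model's above-chance MMLU accuracy, and the 28-layer and 26-layer variants keep almost none.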

---

## Intended Use

This model is suitable for:
- **Research** on model compression and efficiency
- **Fine-tuning base** for domain-specific applications
- **Inference optimization** where speed/memory matters
- **Applications prioritizing reasoning over factual recall**

## Limitations

**Important:** This is a raw pruned model without post-training.

| Use Case | Recommendation |
|----------|----------------|
| Physical/commonsense reasoning | Recommended |
| Reading comprehension | Recommended |
| General text understanding | Recommended |
| Factual question answering | Fine-tune first |
| Knowledge-intensive tasks | Fine-tune first |

---

## Pruning Details

### LaCo Hyperparameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| MERGE_LAYERS (C) | 3 | Layers merged per operation |
| LOWEST_LAY (L) | 4 | Minimum layer index for merging |
| HIGHEST_LAY (H) | 28 | Maximum layer index for merging |
| INTERVAL (I) | 2 | Minimum gap between merge points |
| THRESHOLD (T) | 0.85 | Cosine similarity threshold |
| MAX_COMPRESSION | 20% | Maximum allowed compression |
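
The THRESHOLD parameter gates each candidate merge: a merge is kept only if the merged model's output representations stay close to the original model's. A minimal sketch of such a check, assuming each calibration sentence is summarized by a single representation vector and that similarities are averaged across sentences. The function names and the plain-list vectors are illustrative and are not taken from the LaCo implementation:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def accept_merge(original_reprs, merged_reprs, threshold=0.85):
    """Accept a merge if the mean cosine similarity meets the threshold.

    `original_reprs` and `merged_reprs` hold one vector per calibration
    sentence, produced by the original and the merged model respectively.
    """
    similarities = [
        cosine_similarity(o, m) for o, m in zip(original_reprs, merged_reprs)
    ]
    return sum(similarities) / len(similarities) >= threshold


# Toy vectors: nearly parallel representations pass, orthogonal ones fail
print(accept_merge([[1.0, 0.0]], [[0.9, 0.1]]))
print(accept_merge([[1.0, 0.0]], [[0.0, 1.0]]))
```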

### Pruning Statistics

| Metric | Value |
|--------|-------|
| Successful Merges | 3 |
| Rejected Merges | 0 |
| Total Iterations | 4 |
| Final Compression | 16.7% |
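
These numbers are consistent with each other: collapsing 3 layers into 1 removes 2 layers, so 3 successful merges remove 6. The sketch below illustrates the merge step as described in the LaCo paper, where the parameter differences of the following layers are added onto the first layer of the group. It works on a toy model with plain Python lists; the `merge_layers` helper and the merge points are hypothetical and are not the ones used to produce this model:

```python
def merge_layers(layers, base, count):
    """Collapse `count` consecutive layers starting at index `base` into one.

    Each layer is a dict mapping a parameter name to a flat list of floats.
    The merged layer is the base layer plus the difference between each
    following layer and the base layer.
    """
    anchor = layers[base]
    merged = {name: list(values) for name, values in anchor.items()}
    for follower in layers[base + 1 : base + count]:
        for name, values in follower.items():
            merged[name] = [
                m + (f - a) for m, f, a in zip(merged[name], values, anchor[name])
            ]
    return layers[:base] + [merged] + layers[base + count :]


# Toy model: 36 layers with one scalar weight each
layers = [{"w": [float(i)]} for i in range(36)]

# Three merges of 3 layers each (MERGE_LAYERS = 3); merge points are made up
for base in (4, 10, 20):
    layers = merge_layers(layers, base, count=3)

print(len(layers))  # 36 - 3 * 2 layers remain
```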

---

## Usage

### Basic Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mercity/Qwen3-8B-LaCo-Pruned"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# Text generation
prompt = "The process of photosynthesis"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With 4-bit Quantization (Further Compression)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Mercity/Qwen3-8B-LaCo-Pruned"

# NF4 4-bit weights with float16 compute (requires the bitsandbytes package)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
```

---

## Recovery Recommendations

To improve factual knowledge after pruning:

### LoRA Fine-tuning (Recommended)

```python
from peft import LoraConfig, get_peft_model

# `model` is the pruned model loaded in the Usage section
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Fine-tune on OpenOrca, Alpaca, or domain-specific data
```

**Expected recovery:** MMLU could reach 45-55% with fine-tuning. This range is an estimate rather than a reported measurement.
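
To gauge the size of this adapter, the trainable parameter count can be estimated by hand: LoRA adds `r * (d_in + d_out)` parameters per adapted linear layer. The sketch below uses the dimensions from the Technical Specifications table and assumes a per-head dimension of 128, which this card does not state:

```python
R = 32                # LoRA rank from the config above
LAYERS = 30           # decoder layers after pruning
HIDDEN = 4096
INTERMEDIATE = 12288
HEAD_DIM = 128        # assumption: not listed in this card
Q_OUT = 32 * HEAD_DIM   # 32 query heads
KV_OUT = 8 * HEAD_DIM   # 8 key/value heads (GQA)

# (input width, output width) of each adapted projection
shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapter is two low-rank matrices: (d_in x r) and (r x d_out)
per_layer = sum(R * (d_in + d_out) for d_in, d_out in shapes.values())
total = per_layer * LAYERS

print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the adapter has roughly 73 million trainable parameters, a small fraction of the full model.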

---

## Technical Specifications

| Attribute | Value |
|-----------|-------|
| Architecture | Transformer decoder-only |
| Layers | 30 |
| Hidden Size | 4096 |
| Attention Heads (Q) | 32 |
| Attention Heads (KV) | 8 (GQA) |
| Intermediate Size | 12288 |
| Vocabulary Size | 151,669 |
| Max Context Length | 32,768 tokens |
| Precision | bfloat16 |
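
These dimensions allow a back-of-the-envelope estimate of how layer pruning translates into parameter and memory savings. The sketch below is an approximation under stated assumptions: a per-head dimension of 128, untied input and output embeddings, no bias terms, and normalization weights ignored. None of these are stated in this card, so treat the output as a rough guide rather than an exact count:

```python
HIDDEN, INTERMEDIATE, VOCAB = 4096, 12288, 151_669
Q_HEADS, KV_HEADS = 32, 8
HEAD_DIM = 128  # assumption: not listed in the table above


def layer_params() -> int:
    """Approximate weights in one decoder layer (norm weights ignored)."""
    q_and_o = 2 * HIDDEN * Q_HEADS * HEAD_DIM
    k_and_v = 2 * HIDDEN * KV_HEADS * HEAD_DIM
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up, and down projections
    return q_and_o + k_and_v + mlp


def total_params(num_layers: int) -> int:
    """Approximate model size, assuming untied input/output embeddings."""
    embeddings = 2 * VOCAB * HIDDEN
    return num_layers * layer_params() + embeddings


original, pruned = total_params(36), total_params(30)
reduction = 100.0 * (original - pruned) / original

print(f"approx. parameters: {original / 1e9:.2f}B -> {pruned / 1e9:.2f}B")
print(f"approx. parameter reduction: {reduction:.1f}%")
print(f"approx. bfloat16 weight size: {pruned * 2 / 1e9:.1f} GB")  # 2 bytes per weight
```

Under these assumptions, removing 6 of 36 layers cuts roughly 14% of the parameters, somewhat less than the 16.7% layer-level figure, because the embedding matrices are untouched by layer pruning.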

---

## Citation

If you use this model, please cite the original LaCo paper and Qwen3:

```bibtex
@article{yang2024laco,
  title={LaCo: Large Language Model Pruning via Layer Collapse},
  author={Yang, Yifei and Cao, Zouying and Zhao, Hai},
  journal={arXiv preprint arXiv:2402.11187},
  year={2024}
}

@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}
```

## References

- [LaCo Paper](https://arxiv.org/abs/2402.11187)
- [LaCo Official Implementation](https://github.com/yangyifei729/LaCo)
- [Qwen3 Technical Report](https://arxiv.org/abs/2505.09388)
- [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)

## License

Apache 2.0 (same as base Qwen3 model)

## Acknowledgments

- Qwen Team for the excellent Qwen3-8B-Base model
- LaCo authors for the pruning methodology
- Hugging Face for model hosting