# 🔧 Quantization & Memory-Optimization für HF Spaces

## Problem: Memory Limit Überschritten

Zephyr-7B braucht normalerweise **15GB VRAM** → HF Spaces haben nur ~8-16GB (je nach Plan).

## Lösung: Quantization

**Quantization = Modell komprimieren ohne großen Quality-Loss**

| Methode | Memory | Quality | Speed | Code |
|---------|--------|---------|-------|------|
| **Original (fp16)** | 15GB | 100% | ⚡⚡⚡ | `app.py` (auto) |
| **8-bit (bitsandbytes)** | 7-8GB | 98% | ⚡⚡ | `load_in_8bit=True` |
| **4-bit AWQ** | 5GB | 95% | ⚡⚡ | `TheBloke/.../AWQ` |
| **4-bit GGUF** | 3.5GB | 95% | ⚡ | `llama-cpp-python` |

---

## 🚀 AUTO-SELECTION (EASIEST!)

Dein Backend (`app.py` & `app_fastapi.py`) **wählt automatisch**:

```python
# Auto-Detection - wird mit jedem Start neu evaluiert!
if gpu_memory >= 20GB:
    → HuggingFaceH4/zephyr-7b-beta     # Original
elif gpu_memory >= 10GB:
    → TheBloke/zephyr-7B-beta-AWQ      # 4-bit
else:
    → TheBloke/zephyr-7B-beta-GGUF     # Kompressest
```

**Einfach deployen auf HF Spaces, der Rest passiert automatisch!** ✅

---

## 🎯 Manuelle Konfiguration

### Option 1: Environment Variable

```bash
# Auf HF Spaces: Repository Secrets setzen
# Dann in app.py:
MODEL_NAME = os.getenv("MODEL_NAME", "auto")
```

### Option 2: Direkt in Code

```python
# app.py ODER app_fastapi.py

# Nutze diese Modelle direkt:
MODEL_NAME = "TheBloke/zephyr-7B-beta-AWQ"      # 4-bit quantized
MODEL_NAME = "TheBloke/zephyr-7B-beta-GGUF"     # 4-bit GGUF
MODEL_NAME = "HuggingFaceH4/zephyr-7b-alpha"    # Schnellere Alternative
```

### Option 3: 8-bit Quantization

```python
# In load_model_optimized():
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_8bit_compute_dtype=torch.float16,
    bnb_8bit_use_double_quant=True,  # Extra Kompression
)
```

---

## 📊 Quality Vergleich

```
Original (fp16)           |████████████████| 100% Quality
8-bit (bitsandbytes)      |███████████████ | 98% Quality (~50% Memory gespart)
AWQ 4-bit                 |████████████    | 95% Quality (~67% Memory gespart)
GGUF 4-bit                |████████████    | 95% Quality (~77% Memory gespart)
```

---

## 🔌 Warum InferenceClient NICHT funktioniert

```python
# ❌ FALSCH:
from huggingface_hub import InferenceClient
client = InferenceClient("HuggingFaceH4/zephyr-7b-beta")
# Das macht API-Calls, kostet Credits, ist zu langsam für HF Spaces
```

**Besser:** Modell lokal laden mit Quantization! ✅

---

## 🐛 Troubleshooting

### "Out of Memory" Error

→ Auto-Select macht nicht, was du willst?

```python
# Erzwinge kleinstes Modell:
MODEL_NAME = "TheBloke/zephyr-7B-beta-GGUF"
```

### "module not found: bitsandbytes"

```bash
# Installiere:
pip install bitsandbytes
```

### GGUF Support (falls nötig)

```bash
# Uncomment in requirements.txt:
# llama-cpp-python==0.2.45

# Dann uncomment in app.py:
# MODEL_NAME = "TheBloke/zephyr-7B-beta-GGUF"
```

---

## 📈 Performance Tipps

### 1. Caching aktivieren
```bash
# Modelle cachen sich automatisch nach erstem Download
# (~20GB auf Disk, aber schneller beim Restart)
```

### 2. Batch Processing
```python
# Mehrere Requests parallel verarbeiten
# → FastAPI macht das automatisch!
```

### 3. Quantized Models
```bash
# Nutze bereits quantisierte Modelle
# → Kein On-the-fly Quantization nötig!
```

---

## 🎓 Welche Kombination für MEIN Setup?

| Szenario | Empfehlung |
|----------|------------|
| **HF Spaces Standard** | Auto-Select (default) |
| **HF Spaces mit Prob.** | `TheBloke/.../AWQ` |
| **Lokal mit RTX 3090** | Original (fp16) |
| **Lokal mit 8GB GPU** | 8-bit oder AWQ |
| **CPU only** | `zephyr-7b-alpha` |

---

## 🚀 Quick Start für Memory-Fehler

1. **Ändere nichts, deploye einfach** - Auto-Select macht's!
2. **Wenn immer noch Fehler**: Setze `MODEL_NAME = "TheBloke/zephyr-7B-beta-AWQ"`
3. **Wenn immer noch Fehler**: Setze `MODEL_NAME = "HuggingFaceH4/zephyr-7b-alpha"`

**Dein Backend ist bereits optimiert!** 🎯

---

Fragen? Guck in `app.py` oder `app_fastapi.py` - die haben ausführliche Comments!