---
license: mit
base_model: stepfun-ai/Step1X-Edit
tags:
  - depth-estimation
  - normal-estimation
  - quantized
  - int8
---

# FE2E INT8 (Pre-quantized for CPU)

Pre-quantized INT8 model for [FE2E](https://github.com/AMAP-ML/FE2E) (CVPR 2026), which estimates monocular depth and surface normals from a single image.

**Demo Space:** [WeReCooking2/FE2E-CPU](https://huggingface.co/spaces/WeReCooking2/FE2E-CPU)

## Files

| File | Size | Description |
|------|------|-------------|
| `dit_int8_full.pt` | 12.4 GB | Step1X-Edit DiT (12.4B params) + LDRN LoRA merged, dynamic INT8 quantized |
| `vae_full.pt` | 335 MB | AutoEncoder, FP32 |

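Both files are saved with `torch.save(model)` (the full pickled model, not a `state_dict`), so loading needs `weights_only=False` and the classes that define the models must be importable. Load with `torch.load(..., mmap=True)` to avoid doubling memory use.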

## How it was made

1. Loaded FP8 base model (`step1x-edit-i1258.safetensors`) on GPU
2. Cast to FP32 on CPU
3. Merged LDRN LoRA in full precision
4. Applied `torch.quantization.quantize_dynamic` (INT8 on all `nn.Linear` layers)
5. Saved full model with `torch.save(model)`
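
The merge-then-quantize steps above can be sketched on a toy model. This is a rough illustration, not the script used to build these files: the two-layer network, the LoRA factors `lora_A` and `lora_B`, and the `rank` and `scale` values are all hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the DiT (hypothetical; not the real architecture), in FP32 on CPU.
model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 8)).float().eval()

# Hypothetical LoRA factors for the first layer: delta_W = scale * (B @ A).
rank, scale = 4, 1.0
lora_A = torch.randn(rank, 16) * 0.01  # shape (rank, in_features)
lora_B = torch.randn(32, rank) * 0.01  # shape (out_features, rank)

# Merge in full precision first, so the INT8 weights already include the LoRA delta.
with torch.no_grad():
    model[0].weight += scale * (lora_B @ lora_A)

# Dynamic quantization: nn.Linear weights are stored as INT8,
# activations are quantized on the fly at inference time.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Compare the quantized model against the FP32 model on random input.
x = torch.randn(2, 16)
with torch.no_grad():
    out = qmodel(x)
    max_err = (model(x) - out).abs().max().item()
print(f"max abs difference vs FP32: {max_err:.4f}")
```

Merging before quantizing matters because a LoRA delta cannot be added directly to packed INT8 weights without dequantizing them first.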

## Usage

```python
import torch

dit = torch.load("dit_int8_full.pt", map_location="cpu", weights_only=False, mmap=True)
vae = torch.load("vae_full.pt", map_location="cpu", weights_only=False, mmap=True)
```

Requires ~12 GB RAM with mmap loading.
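
As a rough sanity check on that figure, the weight footprint can be estimated from the parameter count in the Files table. This sketch counts weights only and ignores quantization scales, biases, any layers left unquantized, and activation memory, so treat the result as an approximation.

```python
PARAMS = 12.4e9  # DiT parameter count from the Files table


def weight_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


int8_gb = weight_gb(PARAMS, 1)  # INT8: 1 byte per weight
fp32_gb = weight_gb(PARAMS, 4)  # FP32: 4 bytes per weight
print(f"INT8: {int8_gb:.1f} GB, FP32: {fp32_gb:.1f} GB")
```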

## Performance

| Platform | Time per image |
|----------|---------------|
| GPU (RTX 5090, FP8 original) | ~2s |
| CPU (HF free Space, INT8) | ~29 min (768x1024) |

Inference uses a single denoising step and outputs the depth map and the surface normal map together.

> No ONNX export: the PyTorch Dynamo-based ONNX exporter produces a broken graph (100% NaN output).

## Credits

- [FE2E](https://github.com/AMAP-ML/FE2E) (CVPR 2026)
- [Step1X-Edit](https://github.com/stepfun-ai/Step1X-Edit) base model
- [rkfg/Step1X-Edit-FP8](https://huggingface.co/rkfg/Step1X-Edit-FP8) FP8 quantization