---
license: mit
base_model: stepfun-ai/Step1X-Edit
tags:
- depth-estimation
- normal-estimation
- quantized
- int8
---

# FE2E INT8 (Pre-quantized for CPU)

Pre-quantized INT8 model for [FE2E](https://github.com/AMAP-ML/FE2E) (CVPR 2026): monocular depth and surface normal estimation from a single image.

**Demo Space:** [WeReCooking2/FE2E-CPU](https://huggingface.co/spaces/WeReCooking2/FE2E-CPU)

## Files

| File | Size | Description |
|------|------|-------------|
| `dit_int8_full.pt` | 12.4 GB | Step1X-Edit DiT (12.4B params) with the LDRN LoRA merged, dynamic INT8 quantized |
| `vae_full.pt` | 335 MB | AutoEncoder, FP32 |

Both files are saved with `torch.save(model)`, so each contains the full pickled model rather than a `state_dict`. Load them with `torch.load(..., mmap=True)` to avoid doubling memory use.

## How it was made

1. Loaded FP32 base model (`step1x-edit-i1258.safetensors`) on GPU
2. Cast to FP32 on CPU
3. Merged LDRN LoRA in full precision
4. Applied `torch.quantization.quantize_dynamic` (INT8 on all `nn.Linear` layers)
5. Saved full model with `torch.save(model)`

## Usage

```python
import torch

# Both files are full pickled models, so weights_only=False is required.
# mmap=True maps the checkpoint from disk instead of reading it all into RAM first.
dit = torch.load("dit_int8_full.pt", map_location="cpu", weights_only=False, mmap=True)
vae = torch.load("vae_full.pt", map_location="cpu", weights_only=False, mmap=True)
```

Requires ~12 GB RAM with mmap loading. Because the checkpoints are pickled model objects, the model class definitions must be importable when `torch.load` runs, and `weights_only=False` can execute arbitrary code during unpickling, so load only files you trust.

## Performance

| Platform | Time per image |
|----------|---------------|
| GPU (RTX 5090, FP8 original) | ~2 s |
| CPU (HF free Space, INT8) | ~29 min (768x1024) |

Inference uses a single denoising step and outputs both the depth map and the surface normal map simultaneously.

> No ONNX: the PyTorch dynamo exporter produces a broken graph (100% NaN output).

## Credits

- [FE2E](https://github.com/AMAP-ML/FE2E) (CVPR 2026)
- [Step1X-Edit](https://github.com/stepfun-ai/Step1X-Edit) base model
- [rkfg/Step1X-Edit-FP8](https://huggingface.co/rkfg/Step1X-Edit-FP8) FP8 quantization
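
## Size estimate

As a rough sanity check on the sizes quoted in the Files table, the sketch below estimates weight storage from the parameter count alone. It is back-of-the-envelope arithmetic, not a measurement: the helper `weight_size_gb` and the bytes-per-parameter table are illustrative names introduced here, and the estimate ignores quantization scales, any layers left unquantized, and activation memory.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp8": 1, "int8": 1}


def weight_size_gb(num_params: float, dtype: str) -> float:
    """Return approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter count quoted for the DiT in the Files table.
DIT_PARAMS = 12.4e9

for dtype in ("fp32", "fp8", "int8"):
    print(f"{dtype}: ~{weight_size_gb(DIT_PARAMS, dtype):.1f} GB")
```

Under these assumptions the INT8 figure lines up with the 12.4 GB file size, while the same weights in FP32 would need roughly four times as much space.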