# ERNIE Image Turbo NF4 - Core E Generation Bundle

This folder is the local generation bundle for ERNIE Image Turbo under the 16GB VRAM rule.

## Active Mode

Baseline generation is prepared for:

```text
use_pe = false
num_inference_steps = 4-8
guidance_scale = 1.0
transformer NF4 compute = torch.bfloat16
dense BNB linear fallback = enabled
```

## Active Transformer Path

The NF4 transformer bundle loads, but the native BNB 4-bit matmul path on the 16GB Quadro RTX 5000 stack collapses into a near-constant prediction field. Core_E repairs the NF4 lane with the following settings:

```text
CORE_E_TRANSFORMER_QUANT=nf4
CORE_E_NF4_DENSE_LINEAR=1
```

This keeps the packed NF4 weights resident, but dequantizes each quantized linear layer at forward time and runs it as a dense GPU matmul. The result is slower than the native 4-bit kernel path but restores semantic output.
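The dequantize-then-dense-matmul idea can be illustrated with a toy, pure-Python sketch. It is not the Core_E or BitsAndBytes implementation: the uniform `CODEBOOK`, the high-nibble-first packing order, and the helper names are all assumptions made for illustration (the real NF4 codebook is non-uniform, and the real path runs on GPU tensors).

```python
from typing import List, Sequence

# Hypothetical 16-entry codebook, evenly spaced in [-1, 1].
# The real NF4 codebook uses different, non-uniform values.
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]


def unpack_nibbles(packed: bytes) -> List[int]:
    """Split each byte into two 4-bit indices (high nibble first, by assumption)."""
    indices = []
    for byte in packed:
        indices.append(byte >> 4)
        indices.append(byte & 0x0F)
    return indices


def dequantize(packed: bytes, absmax: Sequence[float], block_size: int) -> List[float]:
    """Expand packed 4-bit weights into dense floats using per-block scales."""
    indices = unpack_nibbles(packed)
    return [CODEBOOK[q] * absmax[i // block_size] for i, q in enumerate(indices)]


def dense_linear(x: Sequence[float], packed: bytes, absmax: Sequence[float],
                 block_size: int, out_features: int, in_features: int) -> List[float]:
    """Compute y = W @ x, where W is dequantized only for this call."""
    # The dense copy is temporary; the packed bytes stay as the stored weights.
    w = dequantize(packed, absmax, block_size)
    return [
        sum(w[row * in_features + col] * x[col] for col in range(in_features))
        for row in range(out_features)
    ]


# A 2x2 weight stored in two bytes, with one scale for the whole block.
y = dense_linear([1.0, 3.0], bytes([0x0F, 0xF0]), [2.0], 4, 2, 2)
```

The stored weights stay at 4 bits per value, and the cost moves to forward time, where every call pays for the dequantization plus a full-precision matmul. That trade-off is why the repaired lane is slower than a native 4-bit kernel.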

The previous alternate semantic path used an int8 transformer load from:

```text
models/Ernie/transformer
```

while continuing to use the NF4 text encoder from this bundle. That folder is optional; the active repaired lane is the NF4 transformer in this bundle.

## NVFP4 / FP8 Checkpoint Guard

The downloaded single-file ERNIE checkpoint under:

```text
models/Ernie-nvfp4/ernie-nvfp8.safetensors
```

is not BitsAndBytes NF4. Its safetensors metadata marks the transformer linears as:

```text
format = nvfp4
weights = uint8
scales = float8_e4m3fn + float32
```

That format needs a Comfy-style quantized tensor runtime and native NVFP4 kernels. On this Quadro RTX 5000 / SM75 machine it is not a valid generation path. Core_E now accepts the request names:

```text
CORE_E_TRANSFORMER_QUANT=nvfp4
CORE_E_TRANSFORMER_QUANT=nvfp8
CORE_E_TRANSFORMER_QUANT=fp8
```

only to inspect the checkpoint and fail cleanly with a hardware/runtime explanation. It does not route the file through the BNB NF4 loader, because that would be a false load path and can produce invalid output.
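A minimal sketch of this inspect-then-fail guard follows, using only the standard library. The function names, the `UnsupportedQuantError` class, the `nf4` default, and the error wording are hypothetical and are not Core_E's actual code. The header layout it reads (an 8-byte little-endian length followed by a JSON header with an optional `__metadata__` entry) is the standard safetensors file layout.

```python
import json
import os
import struct
from collections import Counter

# Request names that are inspected and then rejected, per the list above.
GUARDED_REQUESTS = {"nvfp4", "nvfp8", "fp8"}


class UnsupportedQuantError(RuntimeError):
    """Raised when a requested quant format cannot run on this machine."""


def read_safetensors_header(path):
    """Return (metadata, tensor_entries) from a safetensors file header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian u64 length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    metadata = header.pop("__metadata__", {})
    return metadata, header


def resolve_transformer_quant(checkpoint_path, env=None):
    """Return "nf4" for the NF4 lane, or fail cleanly for guarded formats."""
    env = os.environ if env is None else env
    # Defaulting to "nf4" is an assumption of this sketch.
    request = env.get("CORE_E_TRANSFORMER_QUANT", "nf4").strip().lower()
    if request == "nf4":
        return "nf4"
    if request in GUARDED_REQUESTS:
        # Inspect only: summarize tensor dtypes, never load weights.
        _, tensors = read_safetensors_header(checkpoint_path)
        dtypes = Counter(entry["dtype"] for entry in tensors.values())
        raise UnsupportedQuantError(
            f"{request!r} requested, but {checkpoint_path} is not BitsAndBytes "
            f"NF4 (tensor dtypes: {dict(dtypes)}); it needs a quantized tensor "
            "runtime and native kernels that this GPU does not provide."
        )
    raise ValueError(f"unknown CORE_E_TRANSFORMER_QUANT value: {request!r}")
```

Raising a dedicated exception, rather than falling through to the NF4 loader, keeps a mislabeled request from ever producing output.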

To disable the NF4 dense linear fallback for diagnostics, set:

```text
CORE_E_NF4_DENSE_LINEAR=0
```

For faster draft strikes, Core_E now honors ERNIE step counts below 8 instead of silently clamping them up to 8. In smoke tests, 4-step 512px output remained coherent and ran at roughly half the 8-step denoise time. Use 8 steps for the official Turbo baseline and 4-6 steps for draft iteration.

Warm ERNIE strikes also skip redundant post-strike CUDA cache evacuation by default. To re-enable the evacuation, set the following, and do so only when debugging memory fragmentation:

```text
CORE_E_EMPTY_CACHE_AFTER_STRIKE=1
```
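As a rough sketch of how the two boolean switches above might be read, the snippet below uses the defaults this README describes (dense fallback on, post-strike cache evacuation off). The helper names are hypothetical, and treating any value other than `0` or `1` as "use the default" is an assumption, not documented Core_E behavior.

```python
import os


def env_flag(name, default, env=None):
    """Read a 0/1 environment switch, falling back to a default."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    value = raw.strip()
    if value == "1":
        return True
    if value == "0":
        return False
    # Unrecognized values keep the default (an assumption of this sketch).
    return default


def read_runtime_flags(env=None):
    """Collect the diagnostic switches with the defaults described above."""
    return {
        "nf4_dense_linear": env_flag("CORE_E_NF4_DENSE_LINEAR", True, env),
        "empty_cache_after_strike": env_flag(
            "CORE_E_EMPTY_CACHE_AFTER_STRIKE", False, env
        ),
    }
```

Passing a plain dict as `env` makes the defaults easy to check without touching the real process environment.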

## GPU-Only Runtime Rule

Core_E generation keeps the text encoder, transformer, denoising timesteps, latents, and VAE decode on CUDA. The request seed now uses a CUDA generator, and timestep batches stay as CUDA tensors to avoid per-step CPU scalar syncs.

One scheduler setup input remains host-side because Diffusers' `FlowMatchEulerDiscreteScheduler.set_timesteps(sigmas=...)` converts the supplied sigma list through NumPy before placing scheduler timesteps on the requested CUDA device.

The manifold does not CPU-park or vault `ernie-turbo` on model switches; it is either resident on GPU for warm strikes or fully released. The only expected CPU work in a normal strike is string tokenization and the final tensor-to-PNG transfer required to save a PIL image.

The active `model_index.json` intentionally sets `pe` and `pe_tokenizer` to null. The official model index is preserved as:

```text
model_index.official.json
```

Use the official index later only after downloading and validating the PE prompt enhancer.
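Before a run, a small check can confirm that the active index really has the prompt enhancer disabled. This is a sketch, not part of Core_E: the function name is hypothetical, and it accepts either a bare `null` or a `[null, null]` pair, since the exact null encoding used in this bundle's index is an assumption here.

```python
import json


def pe_disabled(index_path):
    """Return True if both `pe` and `pe_tokenizer` are null in the index."""
    with open(index_path, encoding="utf-8") as f:
        index = json.load(f)

    def is_null(entry):
        # Accept a bare null, or a non-empty list whose items are all null.
        if entry is None:
            return True
        return (
            isinstance(entry, list)
            and len(entry) > 0
            and all(item is None for item in entry)
        )

    # A missing key is treated the same as an explicit null.
    return all(is_null(index.get(key)) for key in ("pe", "pe_tokenizer"))
```

Calling `pe_disabled("model_index.json")` from the bundle folder would be expected to return `True` for the active index, and `False` for `model_index.official.json` if that file declares the PE components.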

## Components

```text
transformer/      BNB NF4, BF16 compute, dense linear fallback, ~3.86 GiB packed
models/Ernie/transformer
                  BNB INT8 runtime load, ~7.86 GiB (optional alternate if restored)
text_encoder/     BNB NF4, FP16 compute, ~2.41 GiB
vae/              FP16 source VAE, ~160 MiB
tokenizer/        official ERNIE Turbo tokenizer
scheduler/        official FlowMatchEulerDiscreteScheduler config
```
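The sizes listed above can be summed to estimate resident weight memory per lane. This is simple arithmetic on the approximate figures in the table. It covers weights only and leaves out activations, latents, the temporary dense copies created by the dequantized fallback, and CUDA allocator overhead, so real peak usage will be higher.

```python
MIB_PER_GIB = 1024

# Approximate sizes from the component table, in GiB.
COMPONENTS_GIB = {
    "transformer_nf4": 3.86,
    "transformer_int8": 7.86,
    "text_encoder_nf4": 2.41,
    "vae_fp16": 160 / MIB_PER_GIB,
}


def resident_weights_gib(transformer_key):
    """Sum the chosen transformer, the text encoder, and the VAE weights."""
    return (
        COMPONENTS_GIB[transformer_key]
        + COMPONENTS_GIB["text_encoder_nf4"]
        + COMPONENTS_GIB["vae_fp16"]
    )


nf4_total = resident_weights_gib("transformer_nf4")
int8_total = resident_weights_gib("transformer_int8")
print(f"NF4 lane weights:  {nf4_total:.2f} GiB")
print(f"INT8 lane weights: {int8_total:.2f} GiB")
```

On these figures the NF4 lane holds roughly 6.4 GiB of weights and the optional INT8 alternate roughly 10.4 GiB.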

## Verified Loads

The following components loaded successfully from this folder:

```text
tokenizer: TokenizersBackend, vocab_size=131072
scheduler: FlowMatchEulerDiscreteScheduler
vae: AutoencoderKLFlux2, latent_channels=32
text_encoder: Mistral3Model, ~2.41 GiB VRAM
transformer: ErnieImageTransformer2DModel, ~3.86 GiB packed NF4 VRAM
```

Known non-blocking warnings:

```text
Mistral3 YaRN rope key: llama_4_scaling_beta is ignored
ERNIE transformer config keys: lora_rank/use_lora are ignored by the current vendored class
```