# ERNIE Image Turbo NF4 - Core E Generation Bundle

This folder is the local generation bundle for ERNIE Image Turbo under the 16GB VRAM rule.

## Active Mode

Baseline generation is prepared for:

```text
use_pe = false
num_inference_steps = 4-8
guidance_scale = 1.0
transformer NF4 compute = torch.bfloat16
dense BNB linear fallback = enabled
```

## Active Transformer Path

The NF4 transformer bundle loads, but the native BNB 4-bit matmul path on the 16GB Quadro RTX 5000 stack collapses into a near-constant prediction field. Core_E repairs the NF4 lane by:

```text
CORE_E_TRANSFORMER_QUANT=nf4
CORE_E_NF4_DENSE_LINEAR=1
```

This keeps the packed NF4 weights resident, but dequantizes quantized linear layers into dense GPU matmuls at forward time. The result is slower than the native 4-bit kernel path but restores semantic output.

The previous alternate semantic path used an int8 transformer load from:

```text
models/Ernie/transformer
```

while continuing to use the NF4 text encoder from this bundle. That folder is optional; the active repaired lane is the NF4 transformer in this bundle.

## NVFP4 / FP8 Checkpoint Guard

The downloaded single-file ERNIE checkpoint under:

```text
models/Ernie-nvfp4/ernie-nvfp8.safetensors
```

is not BitsAndBytes NF4. Its safetensors metadata marks the transformer linears as:

```text
format = nvfp4
weights = uint8
scales = float8_e4m3fn + float32
```

That format needs a Comfy-style quantized tensor runtime and native NVFP4 kernels. On this Quadro RTX 5000 / SM75 machine it is not a valid generation path. Core_E now accepts the request names:

```text
CORE_E_TRANSFORMER_QUANT=nvfp4
CORE_E_TRANSFORMER_QUANT=nvfp8
CORE_E_TRANSFORMER_QUANT=fp8
```

only to inspect the checkpoint and fail cleanly with a hardware/runtime explanation. It does not route the file through the BNB NF4 loader, because that would be a false load path and can produce invalid output.

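
The checkpoint inspection described above can be approximated with the standard library alone, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header. The sketch below is not Core_E's actual guard; the function names are hypothetical, and where exactly this checkpoint stores its `format` marker is not stated here, so the code simply returns whatever the optional `__metadata__` entry contains and counts tensor dtypes.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize_checkpoint(header):
    """Return (free-form metadata dict, Counter of tensor dtype strings)."""
    metadata = header.get("__metadata__") or {}
    dtypes = Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )
    return metadata, dtypes


def has_float8_tensors(dtypes):
    """Heuristic (assumption): float8 tensors suggest an FP8/NVFP4-style
    checkpoint that should not be sent through a BNB NF4 loader."""
    return any(dtype.startswith("F8_") for dtype in dtypes)
```

A guard built on this would call `read_safetensors_header` on the requested file and refuse to continue when `has_float8_tensors` returns true, reporting the dtype counts in the error message.
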
To disable dense fallback for diagnostics:

```text
CORE_E_NF4_DENSE_LINEAR=0
```

For faster draft strikes, Core_E now honors ERNIE step counts below 8 instead of silently flooring them. In smoke tests, 4-step 512px output remained coherent and ran at roughly half the 8-step denoise time. Use 8 steps for the official Turbo baseline and 4-6 steps for draft iteration.

Warm ERNIE strikes also skip redundant post-strike CUDA cache evacuation by default. Set this only when debugging memory fragmentation:

```text
CORE_E_EMPTY_CACHE_AFTER_STRIKE=1
```

## GPU-Only Runtime Rule

Core_E generation keeps the text encoder, transformer, denoising timesteps, latents, and VAE decode on CUDA. The request seed now uses a CUDA generator, and timestep batches stay as CUDA tensors to avoid per-step CPU scalar syncs. One scheduler setup input remains host-side because Diffusers' `FlowMatchEulerDiscreteScheduler.set_timesteps(sigmas=...)` converts the supplied sigma list through NumPy before placing scheduler timesteps on the requested CUDA device.

The manifold does not CPU-park or vault `ernie-turbo` on model switches; it is either resident on GPU for warm strikes or fully released. The only expected CPU work in a normal strike is string tokenization and the final tensor-to-PNG transfer required to save a PIL image.

The active `model_index.json` intentionally sets `pe` and `pe_tokenizer` to null. The official model index is preserved as:

```text
model_index.official.json
```

Use the official index later only after downloading and validating the PE prompt enhancer.

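
Before switching indexes, it can help to confirm which one is active. The following is a minimal sketch, not part of Core_E: `pe_disabled` is a hypothetical helper, and it assumes a disabled component appears either as a plain `null` or as a list of nulls such as `[null, null]` (the form Diffusers commonly writes for absent components). A missing key is also treated as disabled.

```python
import json
from pathlib import Path


def _is_null_component(value):
    """True for None or for a list/tuple whose items are all None."""
    if value is None:
        return True
    return isinstance(value, (list, tuple)) and all(item is None for item in value)


def pe_disabled(model_index_path):
    """Return True when both `pe` and `pe_tokenizer` are nulled in the index."""
    index = json.loads(Path(model_index_path).read_text(encoding="utf-8"))
    # dict.get returns None for a missing key, which counts as disabled here.
    return _is_null_component(index.get("pe")) and _is_null_component(
        index.get("pe_tokenizer")
    )
```

Calling `pe_disabled("model_index.json")` from the bundle folder should return `True` for the active index described above.
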
## Components

```text
transformer/               BNB NF4, BF16 compute, dense linear fallback, ~3.86 GiB packed
models/Ernie/transformer   BNB INT8 runtime load, ~7.86 GiB (optional alternate if restored)
text_encoder/              BNB NF4, FP16 compute, ~2.41 GiB
vae/                       FP16 source VAE, ~160 MiB
tokenizer/                 official ERNIE Turbo tokenizer
scheduler/                 official FlowMatchEulerDiscreteScheduler config
```

## Verified Loads

The following components loaded successfully from this folder:

```text
tokenizer: TokenizersBackend, vocab_size=131072
scheduler: FlowMatchEulerDiscreteScheduler
vae: AutoencoderKLFlux2, latent_channels=32
text_encoder: Mistral3Model, ~2.41 GiB VRAM
transformer: ErnieImageTransformer2DModel, ~3.86 GiB packed NF4 VRAM
```

Known non-blocking warnings:

```text
Mistral3 YaRN rope key: llama_4_scaling_beta is ignored
ERNIE transformer config keys: lora_rank/use_lora are ignored by the current vendored class
```
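
The sizes listed under Components can be added up for a rough resident-weight estimate against the 16GB rule. This is only an arithmetic sketch: it counts packed weights, not activations, dequantized temporaries from the dense fallback, or CUDA context overhead, and it treats the 16GB budget as 16 GiB for illustration (an assumption). The helper name is hypothetical.

```python
# Packed component sizes in GiB, taken from the Components list.
NF4_LANE = {
    "transformer (NF4)": 3.86,
    "text_encoder (NF4)": 2.41,
    "vae (FP16)": 160 / 1024,  # ~160 MiB expressed in GiB
}

# Optional alternate lane: INT8 transformer with the same text encoder and VAE.
INT8_LANE = dict(NF4_LANE)
del INT8_LANE["transformer (NF4)"]
INT8_LANE["transformer (INT8)"] = 7.86

VRAM_BUDGET_GIB = 16.0  # assumption: the 16GB rule read as 16 GiB


def resident_weights_gib(components):
    """Sum the packed weight sizes; excludes activations and runtime overhead."""
    return sum(components.values())


if __name__ == "__main__":
    for name, lane in (("NF4", NF4_LANE), ("INT8", INT8_LANE)):
        total = resident_weights_gib(lane)
        headroom = VRAM_BUDGET_GIB - total
        print(f"{name} lane: ~{total:.2f} GiB weights, ~{headroom:.2f} GiB headroom")
```

The remaining headroom is what the dense fallback's temporary dequantized matrices, latents, and VAE decode have to fit into, so the weight total alone does not guarantee a strike stays under the limit.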