---
library_name: diffusers
pipeline_tag: image-to-image
tags:
- remote-sensing
- image-editing
- stable-diffusion
---

# RSEdit: Text-Guided Image Editing for Remote Sensing

RSEdit is a unified framework that adapts pretrained text-to-image diffusion models into instruction-following editors for remote sensing (RS) imagery. By addressing the gap in RS world knowledge and the misalignment in conditioning schemes, RSEdit achieves precise, physically coherent edits while preserving geospatial content across scenarios such as urban growth, disaster impacts, and seasonal shifts.

[[Paper](https://huggingface.co/papers/2603.13708)] [[Code](https://github.com/Bili-Sakura/RSEdit-Preview)] [[Project Page](https://bili-sakura.github.io/RSEdit-Preview/)]

## RSEdit-UNet Text Encoder Ablation Models

This repository contains the UNet-based ablation models (text encoder variants) for RSEdit. These models use the standard InstructPix2Pix pipeline structure.

### Quick Start

To generate an edited image with a pretrained RSEdit UNet ablation model, use the `diffusers` library:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline, UNet2DConditionModel

# Example: DGTRS-CLIP-ViT-L-14 ablation model
# Each variant directory is self-contained with all components
checkpoint_path = "BiliSakura/RSEdit-UNet-text-ablation"
variant = "DGTRS-CLIP-ViT-L-14"

# Load pipeline from checkpoint
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    checkpoint_path,
    subfolder=variant,
    torch_dtype=torch.bfloat16,
    safety_checker=None,
    requires_safety_checker=False,
)

# Optional: Override UNet with trained EMA weights if specifically required.
# `subfolder` is used so the path resolves for both a Hub repo ID and a local directory.
# pipe.unet = UNet2DConditionModel.from_pretrained(
#     checkpoint_path,
#     subfolder=f"{variant}/checkpoint-30000/unet_ema",
#     torch_dtype=torch.bfloat16,
# )

pipe = pipe.to("cuda")

# Load source satellite image
source_image = Image.open("satellite_image.png").convert("RGB")

# Edit with instruction
prompt = "Flood the coastal area"
edited_image = pipe(
    prompt=prompt,
    image=source_image,
    num_inference_steps=50,
    guidance_scale=7.5,
    image_guidance_scale=1.5,
).images[0]

# Save result
edited_image.save("edited_image.png")
```

## Model Structure

Each ablation model directory is self-contained and includes:

- `text_encoder/`: The specific text encoder variant (e.g., CLIP, DGTRS).
- `tokenizer/`: Associated tokenizer.
- `vae/`: VAE component.
- `scheduler/`: PNDM scheduler.
- `unet/`: Base UNet weights.
- `checkpoint-30000/unet_ema/`: Trained UNet EMA weights optimized for RS editing.

## Citation

If you find this work useful, please cite:

```bibtex
@misc{zhenyuan2026rsedittextguidedimageediting,
  title={RSEdit: Text-Guided Image Editing for Remote Sensing},
  author={Chen Zhenyuan and Zhang Zechuan and Zhang Feng},
  year={2026},
  eprint={2603.13708},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.13708},
}
```

## Acknowledgments

This project builds upon [Diffusers](https://github.com/huggingface/diffusers), [Accelerate](https://github.com/huggingface/accelerate), and [Stable Diffusion](https://github.com/Stability-AI/generative-models).
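
## Verifying a Local Copy

The per-variant directory layout listed under Model Structure can be checked on a downloaded copy of the repository before loading a pipeline. The following is a minimal sketch, assuming the repository has been downloaded to a local directory; the helper name `missing_components` and the example path are hypothetical and not part of RSEdit or `diffusers`.

```python
from pathlib import Path

# Sub-directories listed under "Model Structure" for each variant.
EXPECTED_COMPONENTS = (
    "text_encoder",
    "tokenizer",
    "vae",
    "scheduler",
    "unet",
    "checkpoint-30000/unet_ema",
)


def missing_components(root, variant):
    """Return the expected sub-directories that are absent under root/variant."""
    variant_dir = Path(root) / variant
    return [
        name for name in EXPECTED_COMPONENTS
        if not (variant_dir / name).is_dir()
    ]


if __name__ == "__main__":
    # Hypothetical local path; replace it with the actual download location.
    print(missing_components("./RSEdit-UNet-text-ablation", "DGTRS-CLIP-ViT-L-14"))
```

An empty list means every listed component directory is present for that variant. The check only looks at directory names and does not validate the weights themselves.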