InstanceControl: Controllable Complex Image Generation without Instance Labeling
Abstract
InstanceControl enables multi-instance image generation by using vision-language models to establish instance-level correspondences between text prompts and visual conditions, while employing adaptive mask refinement for improved accuracy.
Controllable image generation methods, such as ControlNet, have demonstrated a remarkable capacity to introduce visual conditions (e.g., depth maps) to guide image generation. However, these methods often struggle with complex multi-instance scenes, frequently leading to attribute confusion among instances. While recent approaches attempt to mitigate this via manual instance labeling, such labeling is labor-intensive. In this paper, we propose InstanceControl, a novel multi-instance controllable generation method that eliminates the need for instance labeling. We identify the primary bottleneck in existing methods as the inability to accurately associate instance descriptions with their corresponding regions within visual conditions. To address this, we leverage a Vision-Language Model (VLM) to establish instance-level correspondences between text prompts and visual conditions. Specifically, the VLM automatically parses instance descriptions from the text prompts and simultaneously predicts instance masks based on the visual conditions. Furthermore, since the predicted masks may contain noise, we introduce an adaptive mask refinement strategy that dynamically refines these instance masks during the generation process. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods, achieving superior fidelity and precise instance-level control.
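To make the idea of instance-level correspondence concrete, here is a minimal sketch in plain Python. It pairs each parsed instance description with a region mask and updates that mask once per generation step. The abstract does not specify the refinement rule or which signal drives it, so everything here is an assumption for illustration: the `Instance` class, the `refine_mask` and `refine_over_steps` helpers, the use of an attention map as the refinement signal, and the `alpha` and `threshold` parameters are hypothetical and are not the paper's method. The VLM parsing and mask prediction steps are not modeled; the example starts from a hand-written mask.

```python
from dataclasses import dataclass
from typing import List

Mask = List[List[float]]  # H x W grid of values in [0, 1]


@dataclass
class Instance:
    """One parsed instance: its text description plus a region mask."""
    description: str
    mask: Mask


def refine_mask(mask: Mask, attention: Mask, alpha: float = 0.5,
                threshold: float = 0.6) -> Mask:
    """Blend a noisy mask with an attention map, then binarize.

    Placeholder rule (an assumption, not the paper's strategy):
    a pixel is kept when alpha * mask + (1 - alpha) * attention >= threshold.
    """
    return [
        [1.0 if alpha * m + (1.0 - alpha) * a >= threshold else 0.0
         for m, a in zip(mask_row, attn_row)]
        for mask_row, attn_row in zip(mask, attention)
    ]


def refine_over_steps(instance: Instance,
                      attention_per_step: List[Mask]) -> Instance:
    """Update the instance mask once per (hypothetical) generation step."""
    mask = instance.mask
    for attention in attention_per_step:
        mask = refine_mask(mask, attention)
    return Instance(instance.description, mask)


# Toy example: a 2x3 mask whose top-right pixel is spurious.
cat = Instance("a ginger cat", [[1.0, 1.0, 1.0],
                                [0.0, 0.0, 0.0]])
attention = [[0.9, 0.8, 0.1],
             [0.2, 0.1, 0.0]]
refined = refine_over_steps(cat, [attention, attention])
print(refined.mask)  # [[1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```

In this toy run the top-right pixel is dropped because its attention value is low, while the two well-supported pixels survive. Any real refinement strategy would depend on details the abstract does not give.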