InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion

* equal contribution
1KAIST AI    2SK Telecom
Demo examples: Louis Vuitton Bag and Chanel Ballerina Flats in Wicked; Woody and Books in Harry Potter; Doritos and Mug in Mean Girls.

Abstract

Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present a new VOI framework, InsertAnywhere, that achieves geometrically consistent object placement and appearance-faithful video synthesis. InsertAnywhere begins with a 4D-aware mask generation module that reconstructs the scene’s geometry and propagates user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object-removal dataset into triplets of object-removed video, object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research methods and commercial models.

Overall Pipeline

Method overview
InsertAnywhere is a two-stage VOI framework that uses 4D scene reconstruction to generate a user-controllable, geometrically consistent mask, which then conditions a diffusion model fine-tuned on ROSE++ to synthesize illumination-aware object insertions.
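The geometric first stage can be illustrated with a minimal sketch: given per-frame camera parameters and depth maps from a 4D reconstruction, a user-placed set of 3D object points is projected into every frame and occlusion-tested against scene depth, yielding a temporally consistent mask. All names here (`project_points`, `propagate_mask`), the point-set object representation, and the depth margin `eps` are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def project_points(points_w, K, R, t):
    """Project world-space 3D points into a camera with intrinsics K and pose (R, t)."""
    cam = R @ points_w.T + t[:, None]       # world -> camera coordinates, shape (3, N)
    z = cam[2]                              # depth of each point in this camera
    uv = (K @ cam)[:2] / z                  # perspective division -> pixel coordinates
    return uv.T, z

def propagate_mask(object_points_w, cameras, depth_maps, eps=0.05):
    """For each frame, project the object's 3D points and keep only those that
    pass a depth-based occlusion test (a simplification of a 4D-aware mask module)."""
    masks = []
    for (K, R, t), depth in zip(cameras, depth_maps):
        uv, z = project_points(object_points_w, K, R, t)
        H, W = depth.shape
        mask = np.zeros((H, W), dtype=bool)
        for (u, v), zi in zip(uv, z):
            ui, vi = int(round(u)), int(round(v))
            # Visible only if inside the frame and not behind reconstructed scene depth.
            if 0 <= ui < W and 0 <= vi < H and zi <= depth[vi, ui] + eps:
                mask[vi, ui] = True
        masks.append(mask)
    return masks
```

Because visibility is re-evaluated against each frame's depth map, the same 3D placement naturally produces occlusion-consistent masks as the camera moves.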

ROSE++ Dataset

ROSE++ dataset overview
ROSE++ is a synthetic, illumination-aware dataset designed to support supervised video object insertion (VOI) by providing all four essential components: object-removed videos, object-present videos, object masks, and clean reference object images. By enriching the original ROSE dataset with a VLM-based object retrieval process, ROSE++ delivers context-consistent, high-quality reference images, making it well suited for VOI training.
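A minimal sketch of how ROSE++ samples could be assembled: each ROSE (object-present, object-removed, mask) triple is paired with a reference image chosen by a VLM-based retrieval step. The `ROSEPlusSample` fields, `build_samples`, and the `retrieve_reference` callback are hypothetical names for illustration, not the released data schema.

```python
from dataclasses import dataclass

@dataclass
class ROSEPlusSample:
    """One ROSE++ training sample (field names are illustrative)."""
    removed_video: str     # object-removed clip, the diffusion model's input scene
    present_video: str     # original object-present clip, the supervision target
    mask_video: str        # per-frame object masks
    reference_image: str   # VLM-retrieved, context-consistent reference of the object

def build_samples(rose_triples, retrieve_reference):
    """Attach a VLM-retrieved reference image to each ROSE (present, removed, mask) triple."""
    samples = []
    for present, removed, mask in rose_triples:
        ref = retrieve_reference(present)  # e.g., a VLM selects a clean object image
        samples.append(ROSEPlusSample(removed, present, mask, ref))
    return samples
```

The resulting samples give the supervised objective everything at once: the scene without the object, the ground-truth scene with it, where it sits, and what it should look like.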

Qualitative Comparison

Each example shows Source, Pika Pro, KlingAI, and Ours side by side.

1. "Using the context of the video, seamlessly place the image into the empty space of the bed."
2. "Using the context of the video, seamlessly place the image into the empty space of the kitchen table."
3. "Using the context of the video, seamlessly place the image in front of the left wall."
4. "Using the context of the video, seamlessly place the image into the empty space on the table in front of the mirror."

Quantitative Results

Quantitative results. Our method consistently outperforms prior work across all quantitative metrics, achieving the highest subject consistency, background preservation, motion smoothness, and imaging quality on our VOIBench benchmark.


Ablation Study

Ablation Qualitative Results. Camera-only masking fails to preserve the original scene under occlusions. Adding our 4D geometry-aware mask resolves most occlusion issues but still lacks strong object fidelity. First-frame inpainting improves identity consistency, yet temporal artifacts remain. With ROSE++ fine-tuning, the model produces natural lighting and shadows, and combining all components delivers the best geometric accuracy and fidelity.


Ablation Quantitative Results. Each component incrementally improves occlusion handling, object fidelity, and temporal stability, with earlier configurations suffering from identity drift and background distortion. Our full model combines all strengths to produce the most consistent and visually coherent insertions across the entire video sequence.
