Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present a new VOI framework, InsertAnywhere, that achieves geometrically consistent object placement and appearance-faithful video synthesis. InsertAnywhere begins with a 4D-aware mask generation module that reconstructs the scene’s geometry and propagates user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building on this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations, such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object-removal dataset into triplets of an object-removed video, an object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing academic and commercial models.
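The supervised training signal described above pairs each object-removed clip with its object-present counterpart and a reference image. A minimal sketch of such a triplet, assuming illustrative field names rather than the released ROSE++ schema, might look like:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of a ROSE++-style training triplet.
# Field names are illustrative assumptions, not the dataset's actual schema.
@dataclass
class InsertionTriplet:
    object_removed_frames: List[str]  # frame paths of the object-removed clip
    object_present_frames: List[str]  # frame paths of the object-present clip
    reference_image: str              # VLM-generated reference of the object

def build_triplet(removed, present, reference):
    """Pair the two clips frame-by-frame and attach the reference image."""
    if len(removed) != len(present):
        raise ValueError("both clips must have the same number of frames")
    return InsertionTriplet(list(removed), list(present), reference)

triplet = build_triplet(
    [f"removed/{i:04d}.png" for i in range(16)],
    [f"present/{i:04d}.png" for i in range(16)],
    "reference/object.png",
)
```

In practice the object-removed clip would serve as model input, with the object-present clip as the supervision target and the reference image conditioning the object's appearance.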
Quantitative results. Our method consistently outperforms prior work across all quantitative metrics, achieving the highest subject consistency, background preservation, motion smoothness, and imaging quality on our VOIBench benchmark.
Ablation Qualitative Results. Camera-only masking fails to preserve the original scene under occlusions. Adding our 4D geometry-aware mask resolves most occlusion issues, but object fidelity remains weak. First-frame inpainting improves identity consistency, yet temporal artifacts remain. With ROSE++ fine-tuning, the model produces natural lighting and shadows, and combining all components delivers the best geometric accuracy and fidelity.
Ablation Quantitative Results. Each component incrementally improves occlusion handling, object fidelity, and temporal stability, whereas earlier configurations suffer from identity drift and background distortion. Our full model combines these strengths to produce the most consistent and visually coherent insertions across the entire video sequence.