OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting

Yongsheng Yu1*, Ziyun Zeng1*, Haitian Zheng2, Jiebo Luo1
1 University of Rochester, 2 Adobe Research
{yyu90,zzeng24}@ur.rochester.edu, hazheng@adobe.com, jluo@cs.rochester.edu

arXiv:2503.08677v2 [cs.CV] 12 Mar 2025

Figure 1. Illustration of OmniPaint for object-oriented editing, including realistic object removal (left) and generative object insertion (right). Masked regions are shown as semi-transparent overlays. In removal cases, the × marks the target object and its physical effects, such as reflections, with the right column showing the results. In insertion cases, the reference object (inset) is placed into the scene, indicated by a green arrow. Note that for model input, masked regions are fully removed rather than semi-transparent.

Abstract

Diffusion-based generative models have revolutionized object-oriented image editing, yet their deployment in realistic object removal and insertion remains hampered by challenges such as the intricate interplay of physical effects and insufficient paired training data. In this work, we introduce OmniPaint, a unified framework that re-conceptualizes object removal and insertion as interdependent processes rather than isolated tasks. Leveraging a pre-trained diffusion prior along with a progressive training pipeline comprising initial paired sample optimization and subsequent large-scale unpaired refinement via CycleFlow, OmniPaint achieves precise foreground elimination and seamless object insertion while faithfully preserving scene geometry and intrinsic properties. Furthermore, our novel CFD metric offers a robust, reference-free evaluation of context consistency and object hallucination, establishing a new benchmark for high-fidelity image editing. Project page: https://yeates.github.io/OmniPaint-Page/.

1. Introduction

Object-oriented image editing has evolved from simple pixel-level adjustments to complex scene manipulation tasks, including object removal [7, 43, 50, 58] and insertion [5, 38, 47]. Classic approaches for object removal/insertion in images have followed two distinct technical routes without intersection, such as object harmonization [2, 46] and image completion [22, 28]. Recent advances in large diffusion-based generative models [19, 28] have broadened the horizons of object-oriented editing, enabling not only high-fidelity inpainting of masked regions [6, 37, 58] but also creative synthesis of new objects seamlessly integrated into existing images [5, 33, 34, 38]. These models further allow manual manipulation of object attributes and appearances through text prompts or reference images, demonstrating unique industrial value for visual content modification and creation.
Despite the transformative potential of diffusion-based models, their application to general object editing presents unique challenges. The first challenge lies in the path dependency on large-scale paired real-world datasets [43, 44] or synthetic datasets [14, 20, 21]. For specific tasks like object insertion, achieving correct geometric alignment and realistic integration requires not only high-quality synthesis but also a deep understanding of complex physical effects like shadows, reflections, and occlusions. Insufficient paired training samples may lead to models lacking identity consistency or failing to integrate objects with realistic physical effects [43]. The second challenge involves ensuring reliable object removal that not only eliminates unwanted foreground elements but also maintains background continuity and prevents the unintended introduction of artifacts or hallucinated objects [7], which is particularly problematic given the lack of robust evaluation metrics for flagging ghost elements generated by large models' random hallucinations.

Figure 2. Visualization of CFD metric assessment for object removal, comparing the masked image, LaMa, FLUX-Inpainting, and OmniPaint (Ours), with per-result ReMOVE and CFD scores. The segmentation results are obtained using SAM [17] with refinement, with purple masks for background, orange masks for segments fully within the original mask, and unmasked for those extending beyond the original mask. Note that the orange masked regions correspond to hallucinated objects. A higher ReMOVE [4] score is better, while a lower CFD score is preferable. In these cases, ReMOVE scores are too similar to indicate removal success, while the CFD score offers a clearer distinction.

These limitations have led to separate modeling of object removal and insertion, whether text-driven [10, 38, 50, 54] or mask-guided [7, 33, 34, 43, 58]. However, deploying large generative models simultaneously across different editing subtasks (e.g., removal and insertion) that currently employ completely different technical implementations risks potential conflicts and increased costs.

To address these challenges, we propose OmniPaint, a framework that reconceptualizes object removal and insertion as interdependent tasks rather than isolated subtasks. Leveraging pre-trained diffusion priors (employing FLUX [19] in this work), we optimize LoRA [12] parameters on a small collection of real-world paired samples while enabling easy task switching via learnable text embeddings. For realistic object removal, our model achieves semantic elimination of masked foreground elements while removing their physical effects. For object insertion, we go beyond simple blending to achieve harmonious synthesis respecting scene geometry and reference identity through our proposed CycleFlow mechanism. By incorporating well-trained removal parameters into insertion training, we enable the utilization of large-scale unpaired samples, significantly reducing dependence on massive real paired datasets.

A key innovation is our Context-Aware Feature Deviation (CFD) score, a specialized no-reference metric for object removal. As illustrated in Fig. 2, it evaluates object hallucinations and context coherence, setting a new standard for realistic object-oriented editing. Experiments demonstrate OmniPaint's significant improvements in both interdependent editing tasks: the model better handles complex physical effects like shadows and reflections during removal while achieving seamless background reconstruction, and generates more natural geometric alignment and illumination consistency during insertion. Ablation studies reveal that omitting CycleFlow prevents full utilization of unpaired data, leading to deficiencies in identity consistency and physical effect generation. In summary,
• We propose a diffusion-based solution for object removal and insertion that maintains physical and geometric consistency, including physical effects such as shadows and reflections.
• We introduce a progressive training pipeline, where the proposed CycleFlow technique enables unpaired post-training, minimizing reliance on paired data.
• We further develop CFD, a novel no-reference metric for object removal quality based on hallucination detection and context coherence assessment.

2. Related Works

2.1. Image Inpainting

The task of image inpainting, specifically filling in missing pixels within masked regions, has been extensively studied in the literature. End-to-end learning approaches [22, 35, 52], which aim for pixel-wise fidelity, produce blurry or repetitive patterns when dealing with large masks or complex backgrounds. Methods leveraging pre-trained generative models, such as GANs [51, 53, 56] and diffusion models [1, 41], use these priors to generate realistic content for missing regions. More recently, methods built on text-to-image models [28] have enabled controllable inpainting [37, 43, 58], allowing guided synthesis within masked regions.

Key Differences from Conventional Inpainting. Our approach departs from traditional inpainting paradigms in two fundamental ways:
• Standard inpainting reconstructs masked images to match the original, while we explicitly model object removal and insertion as distinct yet interdependent processes.
• Traditional inpainting fills holes, while our approach may adjust surrounding content for seamless integration.
Metrics      Context Coh.   Obj. Hal.
FID          ✘              ✘
ReMOVE       ✓              ✘
CFD (ours)   ✓              ✓

Figure 3. Illustration of the proposed CFD metric for evaluating object removal quality. Left: We apply SAM to segment the inpainted image into object masks and classify them into nested (Ω_{M^n}) and overlapping (Ω_{M^o}) masks. Middle: The context coherence term measures the feature deviation between the inpainted region (Ω_M) and its surrounding background (Ω_{B\M}) in the DINOv2 feature space. Right: The hallucination penalty is computed by comparing deep features of detected nested objects (Ω_{M^n}) with their adjacent overlapping masks (Ω_{M^o}) to assess whether unwanted object-like structures have emerged.

2.2. Realistic Object Removal

Realistic object removal aims to eliminate foreground semantics while ensuring seamless background blending and preventing object hallucination. Existing methods fall into two categories: text-driven and mask-guided. Text-driven approaches [10, 50, 54] specify objects for removal via instructions but are constrained by text embedding performance [25, 48], particularly in handling multiple objects and attribute understanding. Mask-guided methods [6, 7, 43, 58] provide more precise control. Recent advances, such as MagicEraser [20], generate removal data by shifting objects within an image, while SmartEraser [14] synthesizes a million-sample dataset using alpha blending. RORem [21] leverages synthetic datasets like MULAN [40] for training, though synthetic data limit a model's ability to replicate realistic object effects.

We introduce the CFD score, a reference-free metric designed exclusively for object removal, to evaluate object hallucination and context coherence. This enables a more effective assessment of object removal techniques.

2.3. Generative Object Insertion

Object insertion aims to seamlessly integrate new objects into existing scenes. Early methods focused on harmonization and blending [2, 39, 46] but struggled with complex physical effects like shadows and reflections. Recent approaches leverage real-world datasets [43], synthetic blender-generated data [42], or test-time tuning [6] to improve object–background interactions, yet they remain limited in geometric alignment modeling. Diffusion models offer a promising alternative, often using object-specific embeddings from CLIP [33, 47] or DINOv2 [5, 34] to preserve identity, attributes, and texture. Unlike these existing works, our approach builds on FLUX without additional feature extractors.

Concurrent methods [32, 44] tackle generative object insertion along the direction of text-driven subject generation. ObjectMate [44] constructs millions of paired samples covering multiple reference subjects. In contrast, we focus on single-subject image-driven insertion, ensuring subject alignment and effect integration using only 3K real-world paired samples as training data, followed by CycleFlow unpaired post-training. This approach significantly alleviates the requirement for large paired datasets.

3. Preliminaries

Flow Matching (FM) [23] is a generative modeling framework that learns a velocity field u_t(z_t) to map a source distribution p_0 to a target distribution p_1 through a time-dependent flow. The goal of FM is to train a neural network θ whose prediction u_t^θ(z_t) approximates the velocity field u_t(z_t). This is achieved by minimizing the Flow Matching loss, defined as:

\[ \mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t,\, z_t \sim p_t} \big[ \| u_t^{\theta}(z_t) - u_t(z_t) \|^2 \big], \quad (1) \]

where z_t ∼ p_t and t ∼ U[0, 1]. Directly optimizing L_FM(θ) is intractable due to the complexity of estimating the ground-truth velocity field u_t(z_t) for arbitrary p_t.

To simplify optimization, the Conditional Flow Matching (CFM) framework introduces conditional distributions p_{t|1}(z | z_1) = N(z | t z_1, (1 − t)^2 I), which focuses on paths conditioned on target samples z_1 = Z_1. The velocity field under this conditional setting is analytically given by:

\[ u_t(z \mid z_1) = \frac{z_1 - z}{1 - t}. \quad (2) \]
The conditional probability path p_{t|1}(z | z_1) follows a linear interpolation:

\[ z_t = t z_1 + (1 - t) z_0, \qquad z_t = Z_t \sim p_{t|1}(\cdot \mid z_1), \quad (3) \]

where z_0 = Z_0 ∼ p_0 and z_1 = Z_1 ∼ p_1. Using this formulation, the Conditional Flow Matching loss is defined as:

\[ \mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\, z_0 \sim p_0,\, z_1 \sim p_1} \big[ \| u_t^{\theta}(z_t) - u_t(z_t \mid z_1) \|^2 \big]. \quad (4) \]

This loss avoids the need to estimate u_t(z_t) directly by leveraging the known form of u_t(z | z_1).
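To make the objective concrete, the following is a minimal PyTorch sketch of Eqs. (2)–(4) under the linear path of Eq. (3). The velocity-network interface model(z_t, t, cond) is a placeholder assumption for illustration, not OmniPaint's actual API.

```python
import torch

def cfm_loss(model, z1, cond, eps: float = 1e-5):
    """Conditional Flow Matching loss (Eq. 4) on the linear path of Eq. 3.

    model(z_t, t, cond) is assumed to return the predicted velocity u_t^theta(z_t).
    z1 is a batch of target latents; cond is the conditioning input.
    """
    b = z1.shape[0]
    t = torch.rand(b, device=z1.device)              # t ~ U[0, 1]
    z0 = torch.randn_like(z1)                        # z_0 ~ p_0 = N(0, I)
    t_ = t.view(b, *([1] * (z1.dim() - 1)))          # broadcast t over latent dims
    zt = t_ * z1 + (1.0 - t_) * z0                   # Eq. 3: z_t = t z_1 + (1 - t) z_0
    target = (z1 - zt) / (1.0 - t_ + eps)            # Eq. 2: u_t(z_t | z_1); equals z_1 - z_0 on this path
    pred = model(zt, t, cond)                        # u_t^theta(z_t)
    return ((pred - target) ** 2).mean()             # Eq. 4
```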
4. Methodology

We frame image inpainting as a dual-path, object-oriented process that consists of two key directions: object removal and object insertion. Given an image I ∈ R^{H×W×3} and a binary mask M ∈ {0, 1}^{H×W} denoting the edited region (where M_ij = 1 indicates masked pixels), our model operates on the masked input X = I ⊙ (1 − M) to facilitate targeted modifications. The object removal pathway suppresses semantic traces within M, ensuring smooth boundary transitions while preventing unintended artifacts or hallucinations. Meanwhile, the object insertion pathway integrates a new object O ∈ R^{H'×W'×3} (H' < H, W' < W), maintaining global coherence and context-aware realism.

4.1. The OmniPaint Framework

OmniPaint builds upon FLUX-1.dev [19], a diffusion-based architecture featuring a Multi-Modal Diffusion Transformer (MM-DiT) [8] backbone. While preserving FLUX's strong text-to-image priors, we introduce the image-conditioning mechanisms used in [36], tailored for object-aware editing.

Masked Image Conditioning. The model refines Gaussian noise z_0 = Z_0 ∼ p_0 towards z_1, using the masked image X as a denoising guide for object removal and insertion. We leverage the existing FLUX networks, including its VAE encoder and 2 × 2 patchify layer, to map X into a shared feature space, yielding the conditioned token sequence z_c^X.

Reference Object Conditioning. For object insertion, the model conditions on both the masked image and a reference object image O. To preserve object identity while minimizing background interference, we preprocess O with Carvekit [31] for background removal before resizing it to match X's spatial dimensions. The reference object undergoes the same latent encoding and patchification as the masked image, producing a corresponding latent sequence z_c^O. The final condition token is obtained by concatenating both sequences along the token dimension: z_c = [z_c^X; z_c^O].

Prompt-Free Adaptive Control. Given the highly image-conditioned nature of our task, textual prompts may introduce ambiguity. To mitigate this, we adopt a prompt-free adaptive control mechanism, replacing text embeddings with learnable task-specific parameters. Specifically, we introduce two trainable vectors:

\[ \tau_{\text{removal}},\ \tau_{\text{insertion}} \sim \mathcal{N}(0, I), \quad (5) \]

initialized from the embedding of an empty string and optimized separately for each task. Inference switches between removal and insertion via embedding selection.

For computational efficiency, we freeze the FLUX backbone and perform Parameter-Efficient Fine-Tuning (PEFT), optimizing two LoRA [12] parameter sets, ϕ and θ, for object removal and insertion, respectively.
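As a concrete illustration of how the condition tokens and task embeddings in Sec. 4.1 fit together, here is a schematic sketch. The helpers encode_latents, patchify, and the random initialization of the task vectors are simplifying assumptions; the paper initializes the task embeddings from the empty-string embedding and uses the FLUX VAE and patchify layer.

```python
import torch
import torch.nn as nn

class TaskEmbeddings(nn.Module):
    """Learnable task vectors tau_removal / tau_insertion (Eq. 5), replacing text prompts."""
    def __init__(self, dim: int):
        super().__init__()
        # Placeholder random init; the paper initializes from the empty-string embedding.
        self.removal = nn.Parameter(torch.randn(1, dim))
        self.insertion = nn.Parameter(torch.randn(1, dim))

    def forward(self, task: str) -> torch.Tensor:
        return self.removal if task == "removal" else self.insertion

def build_condition_tokens(masked_image, ref_object, encode_latents, patchify, task: str):
    """Condition tokens: z_c = [z_c^X ; z_c^O] for insertion, z_c^X alone for removal."""
    z_x = patchify(encode_latents(masked_image))       # z_c^X from the masked image X
    if task == "insertion" and ref_object is not None:
        z_o = patchify(encode_latents(ref_object))     # z_c^O from the background-free reference O
        return torch.cat([z_x, z_o], dim=1)            # concatenate along the token dimension
    return z_x
```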
4.2. Data Collection and Mask Augmentation

We collect a dataset of 3,300 real-world paired samples captured across diverse indoor and outdoor environments, encompassing various physical effects such as shadows, specular reflections, optical distortions, and occlusions (see Appendix for examples). Each triplet ⟨I, I_removed, M⟩ is meticulously annotated to ensure high quality.

To enhance model robustness against diverse mask variations, we apply distinct augmentation strategies for object removal and insertion. For removal, we introduce segmentation noise via morphological transformations, randomly applying dilation or erosion with configurable parameters. Imprecise masks are simulated by perturbing boundaries and adding or removing geometric shapes (e.g., circles, rectangles). Augmented examples and the effectiveness analysis are provided in the Appendix. For object insertion, since explicit object detection is not required, we simplify mask augmentation by expanding segmentation masks to their bounding boxes or convex hulls, ensuring adaptability to various reference object formats. Reference object image augmentation follows prior work [34].
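As an illustration of the removal-mask augmentation described above, here is a small OpenCV/NumPy sketch. The kernel sizes, iteration counts, and shape-jitter ranges are arbitrary placeholder choices, not the paper's actual settings.

```python
import cv2
import numpy as np

def augment_removal_mask(mask: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Simulate imprecise user masks: random dilation/erosion plus boundary shape jitter.

    mask: binary uint8 array (H, W) with 1 for the edited region.
    """
    out = mask.copy()
    # Morphological noise: randomly dilate or erode with a random elliptical kernel.
    k = int(rng.integers(3, 16))
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
    if rng.random() < 0.5:
        out = cv2.dilate(out, kernel, iterations=1)
    else:
        out = cv2.erode(out, kernel, iterations=1)
    # Boundary perturbation: add or remove a few random circles/rectangles near the mask.
    ys, xs = np.nonzero(out)
    if len(ys) > 0:
        for _ in range(int(rng.integers(1, 4))):
            i = int(rng.integers(len(ys)))
            cy, cx = int(ys[i]), int(xs[i])
            value = 1 if rng.random() < 0.5 else 0
            if rng.random() < 0.5:
                cv2.circle(out, (cx, cy), int(rng.integers(5, 30)), value, -1)
            else:
                w, h = int(rng.integers(10, 50)), int(rng.integers(10, 50))
                cv2.rectangle(out, (cx, cy), (cx + w, cy + h), value, -1)
    return out
```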
4.3. Training Pipeline

In our experiments, we observe that the current training data are insufficient to maintain reference identity for object insertion, as shown in Fig. 7(b) and Table A in the Appendix. Bootstrapping paired data via trained models, akin to ObjectDrop [43], is a straightforward solution but requires a reliable filtering mechanism, which remains an open challenge.

Fortunately, object insertion and object removal are mathematically complementary inverse problems (i.e., each can be viewed as inverting the other). Inspired by cycle-consistency approaches [45, 57], we propose utilizing unpaired data rather than relying on paired augmentations. In particular, we utilize large-scale object segmentation datasets, which lack explicit removal pairs, to enhance object insertion. This section presents our three-phase training pipeline: (1) inpainting pretext training, (2) paired warmup, and (3) CycleFlow unpaired post-training.

4.3.1. Inpainting Pretext Training

To endow our model with basic inpainting abilities, we first fine-tune it on a pretext inpainting task, initializing θ and ϕ for later stages.
Using a mask generator [35], we apply random masks to the LAION dataset [30] and train the model to reconstruct missing regions by minimizing a CFM loss,

\[ \mathcal{L}_{\text{pretext}}(\theta, \phi \mid z_t, z_c^X) = \mathbb{E}_{t,\, z_0,\, z_1} \big[ \| u_t^{\theta,\phi}(z_t, z_c^X) - u_t(z_t \mid z_1) \|^2 \big], \quad (6) \]

where z_1 = Z_1 ∼ p_1(I), which enforces the model to complete the masked region so that the entire image approximates I. We show in the appendix that pretext training benefits object editing performance.

4.3.2. Paired Warmup

Next, we leverage our 3,000 paired samples for real-world object insertion and removal training. In the Paired Warmup stage, θ and ϕ are trained separately, enabling effect-aware object removal (e.g., removing reflections and shadows) and insertion with effect integration.

For insertion, z_1 is drawn from Z_1 ∼ p_1(I), where I means images retaining the foreground object. We optimize the following objective by modifying Equation 4:

\[ \mathcal{L}_{\text{warmup}}(\theta \mid z_t, z_c, \tau) = \mathbb{E}_{t,\, z_0,\, z_1} \big[ \| u_t^{\theta}(z_t, z_c, \tau) - u_t(z_t \mid z_1) \|^2 \big], \quad (7) \]

where z_c = [z_c^X; z_c^O] represents the conditioning token sequence, concatenating masked image and object identity features, and τ denotes the corresponding task-specific embedding.

For removal, z_1 is sampled from Z_1 ∼ p_1(I_removed), where I_removed means images with the foreground object physically removed. Given conditioning on z_c^X, the optimization objective becomes:

\[ \mathcal{L}_{\text{warmup}}(\phi \mid z_t, z_c^X, \tau) = \mathbb{E}_{t,\, z_0,\, z_1} \big[ \| u_t^{\phi}(z_t, z_c^X, \tau) - u_t(z_t \mid z_1) \|^2 \big]. \quad (8) \]

In practice, we assume a linear interpolation path for computational efficiency [24], setting u_t(z_t | z_1) = (z_1 − z_0) in both objectives. This warmup stage enhances object removal, effectively handling reflections and shadows (Fig. 6). However, with only 3,000 paired samples, it struggles to maintain reference identity in object insertion (Fig. 7(b)).

Figure 4. Illustration of CycleFlow. The mapping F removes the object, predicting an estimated target z'_1, while G reinserts the object, generating the estimated target ẑ_1. Cycle consistency is enforced by ensuring G reconstructs the original latent z_1 from the effect-removal output. Dashed arrows indicate the cycle loss supervision.

4.3.3. CycleFlow Unpaired Post-Training

To enhance training for object insertion, we leverage large-scale object segmentation datasets, including COCO-Stuff [3] and HQSeg [16], as unpaired data sources. These datasets provide foreground object masks, enabling us to easily construct the model's conditioning inputs X and O. We continue tuning θ using the same objective as in Equation 7 on this larger dataset, improving identity preservation, as shown in Fig. 7(b). The case where γ = 0 corresponds to training solely with Equation 7. However, these segmentation datasets lack annotations for object effects, such as shadows and reflections, meaning that the masked image input X still retains these effects. This suppresses the model's ability to synthesize realistic object effects, making insertions appear more like copy-paste operations of the reference object, as observed in the γ = 0 case of Fig. 7(b).

To overcome this limitation, we use our well-trained removal parameters ϕ, which even at NFE = 1 remove object effects (see Fig. 7(a)). Leveraging ϕ as a preprocessing step enables insertion training on latents with effects removed. Thus, we introduce the CycleFlow mechanism, comprising two mappings: F (removal direction) and G (insertion direction). These mappings predict the velocity field at z_t, estimating their target samples z_1 = Z_1, Z_1 ∼ p_1:

\[ F:\ z'_1 \leftarrow z_t - u_t^{\phi}(z_t, z_c^X, \tau_{\text{removal}}) \cdot t, \quad (9) \]

\[ G:\ \hat{z}_1 \leftarrow z'_t - u_t^{\theta}(z'_t, z_c, \tau_{\text{insertion}}) \cdot t, \quad (10) \]

where z'_1 and ẑ_1 denote the estimated target samples for removal and insertion, respectively. Here, we also rely on the u_t(z_t | z_1) = (z_1 − z_0) linear interpolation setting [24]. As illustrated in Fig. 4, we design a Remove-Insert cycle, ensuring that reinserting a removed object approximately restores its original latent representation:

\[ z_1 \rightarrow z_t \rightarrow F(z_t) \rightarrow z'_t \rightarrow G(z'_t) \approx z_1. \quad (11) \]

To enforce this cycle consistency, we define a Cycle Loss:

\[ \mathcal{L}_{\text{cycle}}(\theta) = \mathbb{E}_{t,\, z_t} \Big[ \big\| G_{\theta}\big(\lfloor F(z_t) \rfloor\big) - z_1 \big\|^2 \Big], \quad (12) \]

where ⌊·⌋ denotes a gradient truncation operator, treating its output as a constant during backpropagation to fix parameters ϕ. During CycleFlow post-training, we optimize an overall loss, L_warmup(θ | z_t, z_c, τ_insertion) + γ L_cycle, on unpaired training data, where γ controls the strength of cycle consistency (analyzed in Sec. 5.5).
Empirically, this work focuses solely on CycleFlow for object insertion, as the warmup stage alone suffices for removal.
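To illustrate the Remove-Insert cycle of Eqs. (9)–(12), here is a schematic PyTorch sketch. The functions u_theta and u_phi stand in for the insertion and removal velocity networks, and the re-noising step that produces z'_t from F(z_t) is an assumption, since that detail is not fully specified in the text above; this is not the actual training code.

```python
import torch

def cycleflow_losses(u_theta, u_phi, z1, z_cX, z_c, tau_rm, tau_in, gamma: float = 1.5):
    """Overall CycleFlow objective: L_warmup(theta) + gamma * L_cycle (Eqs. 7, 9-12)."""
    b = z1.shape[0]
    t = torch.rand(b, device=z1.device).view(b, *([1] * (z1.dim() - 1)))
    z0 = torch.randn_like(z1)
    zt = t * z1 + (1.0 - t) * z0                               # noised latent (Eq. 3)

    # Insertion warmup term (Eq. 7) with the linear-path target z_1 - z_0.
    loss_warmup = ((u_theta(zt, t, z_c, tau_in) - (z1 - z0)) ** 2).mean()

    # F: one-step removal estimate (Eq. 9); phi is frozen, treated as a constant (the gradient truncation).
    with torch.no_grad():
        z1_removed = zt - u_phi(zt, t, z_cX, tau_rm) * t

    # Assumed re-noising of the removal estimate, then G reinserts the object (Eq. 10).
    zt_prime = t * z1_removed + (1.0 - t) * torch.randn_like(z1)
    z1_hat = zt_prime - u_theta(zt_prime, t, z_c, tau_in) * t

    loss_cycle = ((z1_hat - z1) ** 2).mean()                   # Eq. 12
    return loss_warmup + gamma * loss_cycle
```

The gamma default of 1.5 mirrors the value the paper reports as its best setting in Sec. 5.5.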
4.4. Context-Aware Feature Deviation (CFD) Score

We introduce the Context-Aware Feature Deviation (CFD) score to quantitatively assess object removal performance. As illustrated in Fig. 3, CFD comprises two components: a hallucination penalty that detects and penalizes unwanted object-like structures emerging in the removed region, and a context coherence term that evaluates how well the inpainted region blends with the surrounding background.

Hallucination Penalty. Given an object mask M, let Ω_M = {(i, j) | M_ij = 1} denote the pixels of the removed region. Define B = bbox(M) as its bounding box. After removal, we aim to identify whether the synthesized content introduces spurious object-like structures.

We apply the off-the-shelf SAM-ViT-H [17] model to segment the image into masks {M_k}_{k=1}^{K}. Focusing on masks near M, we categorize them as:
• Nested masks, M^n = {M_k^n | Ω_{M_k^n} ⊆ Ω_M}, entirely contained within the removed region.
• Overlapping masks, M^o = {M_k^o | Ω_{M_k^o} ∩ Ω_M ≠ ∅, Ω_{M_k^o} ⊄ Ω_M}, partially overlapping Ω_M but extending beyond it.

A naive hallucination penalty would simply count nested masks, but some may arise from segmentation noise. Instead, we leverage deep feature similarity to assess whether a mask plausibly integrates into its context. To refine segmentation, we merge overlapping masks adjacent to any M_i^n ∈ M^n:

\[ \mathcal{M}^{\text{paired}} = \Big\{ (M_i^n, M_i) \;\Big|\; M_i = \bigcup_{\text{adj}(M_j^o,\, M_i^n)} M_j^o \Big\}, \quad (13) \]

where M_j^o ∈ M^o denotes an overlapping mask, and adj(M_j^o, M_i^n) = 1 if the masks share a boundary pixel or their one-pixel dilations overlap.

The hallucination penalty is then defined as:

\[ d_{\text{hallucination}} = \sum_{(M_i^n, M_i) \in \mathcal{M}^{\text{paired}}} \omega_i \cdot \big( 1 - f(\Omega_{M_i^n})^{\top} f(\Omega_{M_i}) \big), \quad (14) \]

where ω_i = |Ω_{M_i^n}| / Σ_{M^n} |Ω_{M^n}| weights the contribution of each nested mask. Feature embeddings f(Ω) are extracted from the pre-trained vision model DINOv2 [26].

Context Coherence. Even when d_hallucination = 0 (i.e., no nested objects are detected), the inpainted content may still not align with the surrounding background. To quantify this structural consistency, we compute the feature deviation:

\[ d_{\text{context}} = 1 - f(\Omega_M)^{\top} f(\Omega_{B \setminus M}), \quad (15) \]

where B \ M denotes the bounding box excluding the masked region.

Final CFD Metric. The final CFD score is computed as:

\[ \text{CFD} = d_{\text{context}} + d_{\text{hallucination}}. \quad (16) \]

A lower CFD signifies better removal quality: minimal hallucination and seamless contextual blending.
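A compact sketch of how Eqs. (13)–(16) could be scored once SAM masks and a DINOv2 feature extractor are available. The embed(region) helper, the mask data structures, and the adjacency test are simplified assumptions for illustration, not the released evaluation code.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def adjacent(m1: np.ndarray, m2: np.ndarray) -> bool:
    """adj(., .): true if the masks touch, tested via a one-pixel dilation overlap."""
    return bool((binary_dilation(m1) & m2).any())

def cfd_score(removal_mask, sam_masks, embed):
    """CFD = d_context + d_hallucination (Eqs. 13-16).

    removal_mask: boolean (H, W) array of the removed region Omega_M.
    sam_masks: list of boolean (H, W) arrays from SAM on the inpainted image.
    embed(region_mask) -> 1D unit-norm DINOv2 feature for the pixels in region_mask.
    """
    # Split SAM masks into nested and overlapping sets relative to the removal mask.
    nested = [m for m in sam_masks if m.any() and (m & ~removal_mask).sum() == 0]
    overlap = [m for m in sam_masks if (m & removal_mask).any() and (m & ~removal_mask).any()]

    # Hallucination penalty (Eqs. 13-14): compare each nested mask to its merged neighbors.
    total_nested = sum(m.sum() for m in nested) or 1
    d_hal = 0.0
    for mn in nested:
        merged = np.zeros_like(removal_mask)
        for mo in overlap:
            if adjacent(mo, mn):
                merged |= mo
        if merged.any():
            w = mn.sum() / total_nested
            d_hal += w * (1.0 - cosine(embed(mn), embed(merged)))

    # Context coherence (Eq. 15): inpainted region vs. its bounding-box surroundings.
    ys, xs = np.nonzero(removal_mask)
    bbox = np.zeros_like(removal_mask)
    bbox[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = True
    d_ctx = 1.0 - cosine(embed(removal_mask), embed(bbox & ~removal_mask))

    return d_ctx + d_hal
```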
Figure 5. Qualitative comparison on object insertion. Given masked images and reference object images (top row), we compare results from AnyDoor [5], IMPRINT [34], and OmniPaint (Ours).

5. Experiments

5.1. CFD Analysis

We perform qualitative analyses to determine whether our CFD score effectively captures both contextual coherence and hallucination artifacts, thereby offering a more reliable evaluation of object removal quality compared to existing metrics such as ReMOVE [4]. As illustrated in Fig. 2, FLUX-Inpainting [37] generates conspicuous hallucinations (phantom objects like ships, human figures, or floating canisters) yet still attains high ReMOVE scores. In contrast, CFD effectively penalizes these hallucinations by using SAM to segment the inpainted region and by examining feature-level discrepancies within nested and overlapping masks. Similarly, while LaMa [35] interpolates background textures in the masked area, its limited generative prior often leads to ghostly artifacts due to insufficient object effect detection. Conversely, our OmniPaint demonstrates superior removal fidelity by completely eliminating target objects without introducing unwanted artifacts, as reflected by its significantly lower CFD scores.
Figure 6. Qualitative comparison of object removal in challenging scenarios. Top: Simultaneous removal of objects and glass reflections. Middle: Shadow-free removal under real-world lighting. Bottom: Occlusion-robust inpainting, reconstructing background objects without distortion. Columns show the masked image followed by results from FreeCompose [6], PowerPaint [58], CLIPAway [7], FLUX-Inpainting [37], and OmniPaint (Ours).

Table 1. Quantitative results on our 300-sample removal test set.
Method            FID ↓    CMMD ↓  CFD ↓   ReMOVE ↑  PSNR ↑   SSIM ↑  LPIPS ↓
LaMa              105.10   0.3729  0.3531  0.7311    20.8632  0.8278  0.1353
MAT               147.37   0.6646  0.5104  0.6162    18.2229  0.7845  0.1900
SD-Inpainting     153.13   0.3997  0.4874  0.6234    18.8760  0.6932  0.1830
FLUX-Inpainting   132.60   0.3257  0.4609  0.6765    20.8560  0.8002  0.1451
CLIPAway          115.72   0.2919  0.5242  0.7396    19.5259  0.7085  0.1641
PowerPaint        103.61   0.2182  0.4031  0.8013    19.4559  0.7102  0.1428
FreeCompose       88.77    0.1790  0.3743  0.8654    21.2729  0.7320  0.1182
OmniPaint (Ours)  51.66    0.0473  0.2619  0.8610    23.0797  0.8135  0.0738

Table 2. Quantitative results on the 1000-sample RORD test set.
Method            FID ↓    CMMD ↓  CFD ↓   ReMOVE ↑  PSNR ↑   SSIM ↑  LPIPS ↓
LaMa              49.20    0.4897  0.4660  0.8321    19.2941  0.5571  0.1075
MAT               86.33    0.8689  0.7723  0.7070    20.3080  0.7815  0.1429
SD-Inpainting     75.31    0.4733  0.6648  0.8227    19.8308  0.6233  0.1235
FLUX-Inpainting   62.24    0.3805  0.6077  0.8461    21.9159  0.7769  0.0975
CLIPAway          49.07    0.4569  0.5442  0.8696    20.3077  0.6055  0.1132
PowerPaint        42.65    0.4599  0.4128  0.8933    20.1832  0.6066  0.0968
FreeCompose       46.37    0.5125  0.5215  0.9008    20.5678  0.6152  0.1090
OmniPaint (Ours)  19.17    0.2239  0.3682  0.9053    23.2334  0.7867  0.0424

By concurrently quantifying both the emergence of unwanted objects and contextual alignment, CFD aligns closely with human visual perception. These findings substantiate CFD as a robust evaluation metric that helps ensure that object removal not only achieves seamless blending but also minimizes erroneous content hallucination.

5.2. Experimental Settings

For removal, we compare against the end-to-end inpainting models MAT [22] and LaMa [35], the diffusion-based SD-Inpaint [28], and FLUX-Inpainting [37] to ensure a fair backbone comparison. Additionally, we include recent open-source object removal methods CLIPAway [7], PowerPaint [58], and FreeCompose [6]. Experiments are conducted on two benchmarks: a test set of 300 real-world object removal cases we captured, resized to 512² for testing, and the RORD [29] dataset with 1,000 paired samples at their original 540 × 960 resolution, both providing ground truth from physically removed objects. We report PSNR, SSIM, perceptual similarity metrics (FID [11], CMMD [13], LPIPS [55]), and object removal-specific metrics, including ReMOVE [4] and our CFD score.

For object insertion, we compare against Paint-by-Example (PbE) [47], ObjectStitch [33], FreeCompose [6], AnyDoor [5], and IMPRINT [34]. Since ObjectStitch and IMPRINT do not have public implementations, we obtain official code, checkpoints, and test sets from the authors. Our insertion benchmark consists of 565 samples at 512² resolution, combining the IMPRINT test set with real-world cases we captured. Each sample includes a background image, a reference object image, and a binary mask. Reference images are preprocessed using CarveKit [31] for background removal. To evaluate identity consistency, we measure feature similarity between the inserted object and its reference counterpart using CUTE [18], CLIP-I [27], DINOv2 [26], and DreamSim [9], with the latter being more aligned with human perception. Beyond local identity preservation, we assess overall image quality using the no-reference metrics MUSIQ [15] and MANIQA [49].

For fairness, we apply the same image-mask pairs across all baselines and use official implementations with their default hyperparameters, such as inference step counts. For OmniPaint, we employ the Euler Discrete Scheduler [8] during inference and set the number of inference steps to 28 for primary quantitative and qualitative experiments. Additional implementation details are provided in the Appendix.
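For readers unfamiliar with these identity metrics, CLIP-I-style scores are typically the cosine similarity between image embeddings of the inserted object and its reference. The sketch below is one plausible way to compute such a score with Hugging Face Transformers; the CLIP checkpoint and the cropping protocol are assumptions, since the paper does not specify them here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed backbone; the paper does not state which CLIP variant is used.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_i(inserted_crop: Image.Image, reference: Image.Image) -> float:
    """Cosine similarity of CLIP image embeddings (a CLIP-I-style identity score)."""
    inputs = processor(images=[inserted_crop, reference], return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float((feats[0] * feats[1]).sum())
```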
5.3. Evaluation of Object Removal Performance

We evaluate OmniPaint on realistic object removal, comparing against inpainting and object removal methods. As shown in Table 1 and Table 2, OmniPaint consistently outperforms prior approaches across all datasets, achieving the lowest FID [11], CMMD [13], LPIPS [55], and CFD while maintaining high PSNR, SSIM, and ReMOVE [4] scores. These results highlight its ability to remove objects while preserving structural and perceptual fidelity, effectively suppressing object hallucination.
7
Table 3. Quantitative comparison of object insertion methods. The first four columns measure object identity preservation; the last two measure overall image quality.
Method         CLIP-I ↑  DINOv2 ↑  CUTE ↑   DreamSim ↓  MUSIQ ↑  MANIQA ↑
PbE            84.1265   50.0008   65.1053  0.3806      70.26    0.5088
ObjectStitch   86.4506   59.6560   74.0478  0.3245      68.87    0.4755
FreeCompose    88.1679   76.0085   82.8641  0.2134      66.67    0.4775
AnyDoor        89.2610   76.9560   85.2566  0.2208      69.28    0.4593
IMPRINT        90.6258   76.8940   86.1511  0.1854      68.72    0.4711
OmniPaint      92.2693   84.3738   90.2936  0.1557      70.59    0.5209
Fig. 6 provides a visual comparison in challenging real-world cases. In the first row, OmniPaint successfully removes both objects and their glass reflections, a failure case for all baselines. The second row highlights OmniPaint's ability to eliminate shadows under natural lighting, where other methods leave residual artifacts. The third row demonstrates robust inpainting in occlusion scenarios, ensuring seamless background reconstruction without distortion. By effectively handling reflections, shadows, and occlusions, OmniPaint surpasses prior methods in generating coherent and realistic object removal results.

5.4. Evaluation of Object Insertion Performance

We evaluate OmniPaint on object insertion, comparing it with advanced methods. As shown in Table 3, OmniPaint achieves the highest scores across all object identity preservation metrics, including CLIP-I [27], DINOv2 [26], CUTE [18], and DreamSim [9], demonstrating superior alignment with the reference object. Additionally, it outperforms all baselines in overall image quality, as measured by MUSIQ [15] and MANIQA [49], indicating better perceptual realism and seamless integration.

Fig. 5 presents visual comparisons. Given a masked input and a reference object, OmniPaint generates inserted objects with more accurate shape, texture, and lighting consistency. In contrast, other methods struggle with identity distortion, incorrect shading, or noticeable blending artifacts. Notably, OmniPaint preserves fine details while ensuring the inserted object naturally aligns with scene geometry and illumination. By maintaining high-fidelity identity preservation and improving perceptual quality, OmniPaint sets a new standard for realistic object insertion.

Figure 7. Impact of inference steps and cycle loss weights. (a) Removal (top) and insertion (bottom) results across different neural function evaluations (NFE = 1, 4, 18, 28). (b) Insertion results with varying cycle loss weights (input, warmup, γ = 0.0, γ = 1.5, γ = 3.0), with OmniPaint defaulting to γ = 1.5.

5.5. Hyperparameter Analysis

Cycle Loss Weight. We analyze the impact of the cycle loss weight γ on object insertion by comparing results across different values in Fig. 7(b). Lower γ values (e.g., γ = 0) result in weak physical effect synthesis, as the unpaired training data (COCO-Stuff [3] and HQSeg [16]) lack segmentation of object effects such as shadows and reflections. This limits the model's ability to learn effect generation, as insertion training relies on input images that already contain these effects. Increasing γ enhances effect synthesis. At γ = 1.5, OmniPaint achieves the optimal balance, effectively learning from unpaired data while preserving realistic effect synthesis. However, further increasing γ to 3.0 over-relaxes effect generation, leading to unnatural artifacts like exaggerated shadows.

Neural Function Evaluation. We analyze the impact of neural function evaluations (NFE) on object removal and insertion, as illustrated in Fig. 7(a). Lower NFE values, such as 1 or 4, lead to noticeable blurring, especially within masked regions. Interestingly, for removal tasks, even NFE = 1 effectively eliminates the object and its associated effects. At NFE = 18, objects are removed cleanly without residual artifacts, while inserted objects exhibit high fidelity with realistic shading and reflections. Further increasing NFE to 28 yields only marginal gains, indicating diminishing returns. Nonetheless, we set NFE = 28 as the default to ensure optimal visual quality.
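The NFE discussed above is simply the number of integration steps used to turn the learned velocity field into a sample. A bare-bones Euler sampler over the flow path (ignoring the actual scheduler's timestep shifting and other details, and assuming a model(z_t, t, cond, tau) velocity interface) might look as follows.

```python
import torch

@torch.no_grad()
def euler_sample(model, cond, tau, shape, nfe: int = 28, device: str = "cuda"):
    """Integrate the learned velocity field from noise (t=0) to data (t=1) in `nfe` Euler steps.

    model(z_t, t, cond, tau) is assumed to predict the velocity pointing toward the data.
    """
    z = torch.randn(shape, device=device)                   # z_0 ~ N(0, I)
    ts = torch.linspace(0.0, 1.0, nfe + 1, device=device)   # uniform grid; real schedulers may shift it
    for i in range(nfe):
        t = torch.full((shape[0],), float(ts[i]), device=device)
        dt = ts[i + 1] - ts[i]
        z = z + model(z, t, cond, tau) * dt                 # Euler step: z_{t+dt} = z_t + u * dt
    return z
```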
6. Conclusion

We present OmniPaint for object-oriented image editing that reconceptualizes object removal and insertion as interdependent tasks. By leveraging a pre-trained diffusion prior and a progressive training pipeline comprising initial paired sample optimization and subsequent large-scale unpaired refinement via CycleFlow, OmniPaint achieves precise foreground elimination and seamless object integration while preserving scene geometry and other intrinsic properties. Extensive experiments demonstrate that OmniPaint effectively suppresses object hallucination and mitigates artifacts, with the novel CFD metric providing a robust, reference-free assessment of contextual consistency.
References

[1] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. TOG, 2023.
[2] Samaneh Azadi, Deepak Pathak, Sayna Ebrahimi, and Trevor Darrell. Compositional gan: Learning image-conditional binary composition. IJCV, 2020.
[3] Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In CVPR, 2018.
[4] Aditya Chandrasekar, Goirik Chakrabarty, Jai Bardhan, Ramya Hebbalaguppe, and Prathosh AP. Remove: A reference-free metric for object erasure. In CVPR Workshops, 2024.
[5] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. In CVPR, 2024.
[6] Zhekai Chen, Wen Wang, Zhen Yang, Zeqing Yuan, Hao Chen, and Chunhua Shen. Freecompose: Generic zero-shot image composition with diffusion prior. In ECCV, 2024.
[7] Yiğit Ekin, Ahmet Burak Yildirim, Erdem Eren Caglar, Aykut Erdem, Erkut Erdem, and Aysegul Dundar. CLIPAway: Harmonizing focused embeddings for removing objects via diffusion models. In NeurIPS, 2024.
[8] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
[9] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In NeurIPS, 2023.
[10] Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. In ICLR, 2024.
[11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
[12] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022.
[13] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking FID: towards a better evaluation metric for image generation. In CVPR, 2024.
[14] Longtao Jiang, Zhendong Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Lei Shi, Dong Chen, and Houqiang Li. Smarteraser: Remove anything from images using masked-region guidance. arXiv preprint arXiv:2501.08279, 2025.
[15] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. MUSIQ: multi-scale image quality transformer. In ICCV, 2021.
[16] Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment anything in high quality. In NeurIPS, 2023.
[17] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023.
[18] Klemen Kotar, Stephen Tian, Hong-Xing Yu, Dan Yamins, and Jiajun Wu. Are these the same apple? comparing images based on object intrinsics. In NeurIPS, 2023.
[19] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
[20] Fan Li, Zixiao Zhang, Yi Huang, Jianzhuang Liu, Renjing Pei, Bin Shao, and Songcen Xu. Magiceraser: Erasing any objects via semantics-aware control. In ECCV, 2024.
[21] Ruibin Li, Tao Yang, Song Guo, and Lei Zhang. Rorem: Training a robust object remover with human-in-the-loop. arXiv e-prints, 2025.
[22] Wenbo Li, Zhe Lin, Kun Zhou, Lu Qi, Yi Wang, and Jiaya Jia. Mat: Mask-aware transformer for large hole image inpainting. In CVPR, 2022.
[23] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.
[24] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023.
[25] Pablo Marcos-Manchón, Roberto Alcover-Couso, Juan C. SanMiguel, and Jose M. Martínez. Open-vocabulary attention maps with token optimization for semantic segmentation in diffusion models. In CVPR, 2024.
[26] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision. TMLR, 2024.
[27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[29] Min-Cheol Sagong, Yoon-Jae Yeo, Seung-Won Jung, and Sung-Jea Ko. RORD: A real-world object removal dataset. In BMVC, 2022.
[30] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: an open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
[31] Nikita Selin. Carvekit: Automated high-quality background removal framework. https://github.com/OPHoperHPO/image-background-remove-tool, 2023.
[32] Chaehun Shin, Jooyoung Choi, Heeseung Kim, and Sungroh Yoon. Large-scale text-to-image model with inpainting is a zero-shot subject-driven image generator. CoRR, abs/2411.15466, 2024.
[33] Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian L. Price, Jianming Zhang, Soo Ye Kim, and Daniel G. Aliaga. Objectstitch: Object compositing with diffusion model. In CVPR, 2023.
[34] Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian L. Price, Jianming Zhang, Soo Ye Kim, He Zhang, Wei Xiong, and Daniel G. Aliaga. IMPRINT: generative object compositing by learning identity-preserving representation. In CVPR, 2024.
[35] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. In WACV, 2022.
[36] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098, 2024.
[37] Alimama Creative Team. Flux-controlnet-inpainting. https://github.com/alimama-creative/FLUX-Controlnet-Inpainting, 2024.
[38] Yoad Tewel, Rinon Gal, Dvir Samuel, Yuval Atzmon, Lior Wolf, and Gal Chechik. Add-it: Training-free object insertion in images with pretrained diffusion models. CoRR, abs/2411.07232, 2024.
[39] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Deep image harmonization. In CVPR, 2017.
[40] Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, and Sarah Parisot. Mulan: A multi layer annotated dataset for controllable text-to-image generation. In CVPR, 2024.
[41] Hao Wang, Yongsheng Yu, Tiejian Luo, Heng Fan, and Libo Zhang. Magic: Multi-modality guided image completion. In ICLR, 2024.
[42] Tianyu Wang, Jianming Zhang, Haitian Zheng, Zhihong Ding, Scott Cohen, Zhe L. Lin, Wei Xiong, Chi-Wing Fu, Luis Figueroa, and Soo Ye Kim. Metashadow: Object-centered shadow detection, removal, and synthesis. CoRR, abs/2412.02635, 2024.
[43] Daniel Winter, Matan Cohen, Shlomi Fruchter, Yael Pritch, Alex Rav-Acha, and Yedid Hoshen. Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion. In ECCV, 2024.
[44] Daniel Winter, Asaf Shul, Matan Cohen, Dana Berman, Yael Pritch, Alex Rav-Acha, and Yedid Hoshen. Objectmate: A recurrence prior for object insertion and subject-driven generation. CoRR, abs/2412.08645, 2024.
[45] Chen Henry Wu and Fernando De la Torre. A latent space of stochastic diffusion models for zero-shot image editing and guidance. In ICCV, 2023.
[46] Huikai Wu, Shuai Zheng, Junge Zhang, and Kaiqi Huang. gp-gan: Towards realistic high-resolution image blending. In ACM MM, 2019.
[47] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In CVPR, 2023.
[48] Danni Yang, Ruohan Dong, Jiayi Ji, Yiwei Ma, Haowei Wang, Xiaoshuai Sun, and Rongrong Ji. Exploring phrase-level grounding with text-to-image diffusion model. In ECCV, 2024.
[49] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: multi-dimension attention network for no-reference image quality assessment. In CVPR Workshops, 2022.
[50] Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem, Aykut Erdem, and Aysegul Dundar. Inst-inpaint: Instructing to remove objects with diffusion models. arXiv preprint arXiv:2304.03246, 2023.
[51] Ahmet Burak Yildirim, Hamza Pehlivan, Bahri Batuhan Bilecen, and Aysegul Dundar. Diverse inpainting and editing with gan inversion. In ICCV, 2023.
[52] Yongsheng Yu, Dawei Du, Libo Zhang, and Tiejian Luo. Unbiased multi-modality guidance for image inpainting. In ECCV, 2022.
[53] Yongsheng Yu, Libo Zhang, Heng Fan, and Tiejian Luo. High-fidelity image inpainting with gan inversion. In ECCV, 2022.
[54] Yongsheng Yu, Ziyun Zeng, Hang Hua, Jianlong Fu, and Jiebo Luo. Promptfix: You prompt and we fix the photo. In NeurIPS, 2024.
[55] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[56] Haitian Zheng, Zhe Lin, Jingwan Lu, Scott Cohen, Eli Shechtman, Connelly Barnes, Jianming Zhang, Ning Xu, Sohrab Amirghodsi, and Jiebo Luo. Image inpainting with cascaded modulation gan and object-aware training. In ECCV, 2022.
[57] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
[58] Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. In ECCV, 2024.
