Tags: inpainting, editing, flux-fill, lanpaint, z-image

Inpaint Any Image with Any Model

LanPaint brings training-free inpainting to every model in modl. Remove people, change expressions, swap objects — no dedicated inpaint model required.

Mar 18, 2026 10 min read

Inpainting replaces a specific region of an image while preserving everything else. You provide a mask (white pixels = regenerate, black = keep) and a prompt describing what fills the gap. The best inpainting is invisible — you shouldn’t be able to tell the image was edited.

modl supports two main approaches. Flux Fill is a dedicated inpainting model trained for the task. LanPaint is a training-free algorithm from a recent paper (Zheng et al., arXiv:2502.03491) that lets any standard generation model do inpainting. LanPaint is the more interesting story: it means Z-Image, which has no inpaint pipeline, can now do surgical image editing.

LanPaint: inpainting without an inpainting model

LanPaint (Langevin Painting) is a training-free inpainting algorithm published in 2025. Instead of requiring a model trained specifically for inpainting, it modifies the standard denoising loop: at each step, it runs inner Langevin dynamics iterations that balance two forces — the prompt guiding the masked region toward new content, and a BiG score (Bidirectional Guided) anchoring the unmasked region to the original image.

The result: generation models that have no inpaint pipeline — like Z-Image and Flux Klein — can now inpaint. No special weights, no fine-tuning.

  # LanPaint uses the standard Z-Image model — no inpaint model needed   
  $ modl generate "empty plaza walkway, Eiffel Tower, sunny day" \     
       --base z-image --inpaint lanpaint \  
       --init-image eiffel.png --mask person-mask.png \  
       --steps 30 --seed 42  

Here’s what it can do. Starting with a tourist photo at the Eiffel Tower:

Eiffel Tower scene with a woman taking a selfie in the foreground and tourists visible

Original image. Generated with Z-Image (30 steps). A woman in the foreground and tourists on the right.

Step 1: Find the people with `modl vision ground`

  $ modl vision ground "person" eiffel.png     
     ✓ Found 6 person instance(s)  
     person: bbox=[267, 705, 449, 999]   # foreground woman  
     person: bbox=[697, 729, 847, 999]   # right tourist  
     person: bbox=[103, 745, 165, 920]   # left walker  
     ...  

modl vision ground uses a vision-language model (Qwen3-VL) to locate objects by description. It returns bounding boxes in normalized coordinates, so rescale them to pixel coordinates before turning them into a mask.
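The rescaling step can be sketched in a few lines of Python. This is a minimal sketch, not modl's own conversion: it assumes the normalized boxes sit on a 0-1000 scale (an inference from the values shown above, none of which exceed 999) and uses a hypothetical 1152x896 image. The helper names `to_pixel_bbox` and `to_bbox_flag` are made up for illustration.

```python
def to_pixel_bbox(bbox, width, height, scale=1000):
    """Rescale an [x1, y1, x2, y2] box from a 0..scale grid to pixels.

    ASSUMPTION: the 0-1000 scale is inferred from the sample output,
    not confirmed by the guide.
    """
    def conv(value, size):
        # Integer round-half-up, then clamp to the image bounds.
        px = (value * size + scale // 2) // scale
        return max(0, min(size, px))

    x1, y1, x2, y2 = bbox
    return (conv(x1, width), conv(y1, height),
            conv(x2, width), conv(y2, height))


def to_bbox_flag(bbox):
    """Format a pixel box as the comma-separated string used by --bbox."""
    return ",".join(str(v) for v in bbox)


# The "right tourist" detection from the output above, on an assumed canvas.
right_tourist = to_pixel_bbox((697, 729, 847, 999), width=1152, height=896)
print(to_bbox_flag(right_tourist))
```

Under these assumptions the result lands close to, but not exactly on, the pixel box used for the right tourist in Step 2, so the guide's own conversion may round differently or come from a separate detection run.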

Step 2: Create masks from the bounding boxes

  # Mask a single person (right tourist)   
  $ modl process segment eiffel.png --method bbox --bbox 803,652,977,896 --expand 15     
      
  # Or mask ALL detected people at once   
  $ modl process segment eiffel.png --method bbox --bbox 309,631,515,896 --expand 15  # foreground     
  $ modl process segment eiffel.png --method bbox --bbox 803,652,977,896 --expand 15  # right     
  # ... merge masks into one   

Single person mask (5.1%)

Eiffel Tower scene with red overlay on the right tourist

All people mask (14.8%)

Eiffel Tower scene with red overlay on all foreground people

Left: mask from a single ground detection. Right: merged mask covering all detected people. Both generated by modl vision ground + segment.
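The merge step and the coverage percentages can be illustrated with a small pure-Python sketch. It is not how modl merges masks internally. It assumes a 1152x896 canvas (an inferred size, not stated in the guide) and treats `--expand 15` as growing each box by 15 pixels on every side, which is also an assumption. The functions `expand_box`, `union_mask`, and `coverage` are hypothetical helpers.

```python
def expand_box(box, expand, width, height):
    """Grow a pixel box by `expand` on each side, clamped to the image."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - expand), max(0, y1 - expand),
            min(width, x2 + expand), min(height, y2 + expand))


def union_mask(width, height, boxes, expand=0):
    """Rasterize boxes into one mask: 255 = regenerate, 0 = keep."""
    mask = [bytearray(width) for _ in range(height)]
    for box in boxes:
        x1, y1, x2, y2 = expand_box(box, expand, width, height)
        for y in range(y1, y2):
            # Overlapping boxes simply overwrite 255 with 255 (a union).
            mask[y][x1:x2] = b"\xff" * max(0, x2 - x1)
    return mask


def coverage(mask):
    """Fraction of mask pixels that are white (to be regenerated)."""
    total = sum(len(row) for row in mask)
    white = sum(row.count(255) for row in mask)
    return white / total


# Pixel boxes taken from the Step 2 commands above.
foreground = (309, 631, 515, 896)
right = (803, 652, 977, 896)

single = union_mask(1152, 896, [right], expand=15)
both = union_mask(1152, 896, [foreground, right], expand=15)
print(f"single: {coverage(single):.1%}, two boxes: {coverage(both):.1%}")
```

On the assumed canvas the single-box mask comes out at roughly 5.1%, in line with the single-person figure above. The two-box union is smaller than the 14.8% reported for all people because only two of the six detections are listed in the commands.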

Step 3: Inpaint

  $ modl generate "empty plaza, Eiffel Tower, cobblestone walkway, sunny day" \     
       --base z-image --inpaint lanpaint \  
       --init-image eiffel.png --mask all-people-mask.png \  
       --steps 30 --seed 42  

Original (6 people)

Eiffel Tower with tourists in foreground

LanPaint — all people removed

Same Eiffel Tower scene with all foreground people removed, plaza reconstructed

LanPaint with Z-Image removed all foreground people and reconstructed the cobblestone plaza, grass, and walkway. 14.8% of the image regenerated using a model with no inpaint training.

Flux Fill handles the same task:

Original

Flux Fill — all people removed

Eiffel Tower with people removed by Flux Fill

Flux Fill on the same mask. Also clean — the cobblestone pattern and perspective are well reconstructed.

Both methods handle the multi-person removal well, and LanPaint does it with a standard Z-Image model that has zero inpaint training. That is the breakthrough: a model you can already generate with can now inpaint too, without separate inpaint weights.

Seed variation

Like all generation, LanPaint’s output varies with the seed. Here’s the same edit at two different seeds:

Seed 42

Seed 123

Same mask, same params, different seed. Seed 42 is cleaner; seed 123 has some artifacts on the left side.

This is normal variance, the same as you would see generating multiple images with any model. Try a few seeds and pick the best.
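Trying several seeds is easy to script. The sketch below only builds and prints the command lines, using the flags shown earlier in this guide; the helper name `lanpaint_command` and the file names are placeholders. Uncommenting the `subprocess.run` line would execute each command, which assumes `modl` is on your PATH.

```python
import shlex
import subprocess  # only needed if you uncomment the run() call below


def lanpaint_command(prompt, init_image, mask, seed, base="z-image", steps=30):
    """Build the argv list for one LanPaint run (flags as used in this guide)."""
    return [
        "modl", "generate", prompt,
        "--base", base, "--inpaint", "lanpaint",
        "--init-image", init_image, "--mask", mask,
        "--steps", str(steps), "--seed", str(seed),
    ]


prompt = "empty plaza, Eiffel Tower, cobblestone walkway, sunny day"
for seed in (42, 123):
    cmd = lanpaint_command(prompt, "eiffel.png", "all-people-mask.png", seed)
    print(shlex.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually generate
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain commas or quotes.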

How LanPaint works:

At each denoising step, LanPaint runs 5 inner Langevin iterations. Each iteration computes a BiG (Bidirectional Guided) score that does two things simultaneously: (1) pulls the masked region toward the prompt’s predicted clean image, and (2) anchors the unmasked region to the original photo. The balance between these forces is controlled by lambda — higher lambda means stronger reference anchoring.
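To make the two forces concrete, here is a deliberately simplified toy on a four-value "image". It is not the BiG score from the paper and not modl's implementation. It is a plain Langevin-style update in which masked values drift toward a stand-in for the model's predicted clean image and unmasked values drift toward the original with strength `lam`. Every name here (`toy_inner_iterations`, `predict_clean`, `noise_scale`) is hypothetical.

```python
import math
import random


def toy_inner_iterations(x, original, mask, predict_clean, lam=4.0,
                         step=0.1, n_inner=5, noise_scale=1.0, rng=None):
    """Run n_inner toy Langevin-style updates on a flat list of values.

    mask[i] == 1 means "regenerate", 0 means "keep". This is an
    illustration of the two competing drifts only, NOT the paper's update.
    """
    rng = rng or random.Random(0)
    for _ in range(n_inner):
        target = predict_clean(x)  # stand-in for the prompt-guided prediction
        updated = []
        for xi, oi, mi, ti in zip(x, original, mask, target):
            if mi:
                drift = ti - xi            # pull masked value toward the prompt
            else:
                drift = lam * (oi - xi)    # anchor kept value; larger lam = stronger
            noise = noise_scale * math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
            updated.append(xi + step * drift + noise)
        x = updated
    return x


original = [0.2, 0.2, 0.9, 0.9]
mask = [0, 0, 1, 1]                        # regenerate the last two values


def prompt_target(values):
    # Hypothetical "prompt": every pixel should become 0.5.
    return [0.5] * len(values)


x = [0.0] * 4
for _ in range(30):                        # 30 outer steps, 5 inner iterations each
    x = toy_inner_iterations(x, original, mask, prompt_target, noise_scale=0.0)
print([round(v, 3) for v in x])
```

With the noise switched off, the kept values settle on the original and the masked values settle on the target, which is the balance the paragraph above describes. A real sampler keeps the noise term and ties it to the denoising schedule.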

Flux Fill: the precision tool

Flux Fill Dev is purpose-built for inpainting — a 384-channel input model trained specifically on masked regions. It excels at small, precise edits where boundary blending matters most.

Change a facial expression

The most impressive small-mask demo: 4.1% of the image, dramatic visible change.

Original

Young man in navy blazer with neutral expression

Mask (mouth area)

Same portrait with red overlay on mouth and chin area

Original portrait and the mask — just the mouth and chin (4.1% of the image).

  $ modl generate "a warm natural smile, white teeth showing slightly" \     
       --base flux-fill-dev \  
       --init-image portrait.png --mask mouth-mask.png \  
       --steps 28 --seed 42  

Flux Fill

LanPaint (Z-Image)

Same mask, same prompt, both from the neutral original. Flux Fill produced a wider smile. LanPaint converged on a subtler expression.

Swap an object

Replace the coffee mug on a desk with a candle. The mask was generated with modl vision ground "coffee mug" — properly covering the entire mug:

Original

Minimalist desk with laptop, succulent, books, and ceramic coffee mug

Mask from modl vision ground

Same desk with red overlay covering the coffee mug

Original desk and the mask from modl vision ground (coffee mug query) — covers the full mug (5.8% of image).

Flux Fill

Desk with mug replaced by a tall candle in dark ceramic vessel

LanPaint (Z-Image)

Desk with mug replaced by a small tea light candle in a ceramic dish

Both replaced the mug cleanly. Flux Fill generated a tall candle in a dark vessel. LanPaint produced a smaller tea light in a shallow dish. Wood grain is continuous in both.

When to use which

| Scenario | Method | Why |
| --- | --- | --- |
| Small precise edit (<5%) | Flux Fill | Best boundary blending, dedicated architecture |
| Person/object removal (5-15%) | LanPaint | More aggressive: actually removes rather than preserving |
| Expression/face edit | Flux Fill | Face coherence from dedicated training |
| Model without inpaint pipeline | LanPaint | Z-Image, Klein: no special inpaint weights needed |
| Style-matched edits | LanPaint | Uses the same model as your generation, so the aesthetic stays consistent |

Tip:

Auto-routing: If you pass --mask with a model, modl picks the best method automatically. Flux 1 models route to Flux Fill when installed. Klein and Z-Image models that lack standard inpainting route to LanPaint. You can override with --inpaint lanpaint or --inpaint standard.
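The routing rule reads naturally as a small decision function. The sketch below restates only what the tip says. The family labels (`"flux1"`, `"klein"`, `"z-image"`) and the function name are hypothetical, and cases the guide does not describe return `None` rather than guessing.

```python
def pick_inpaint_method(family, flux_fill_installed=False, override="auto"):
    """Return the inpaint method implied by the tip above, or None if unstated.

    `family` uses made-up labels for illustration; modl's real model IDs
    and internal routing may differ.
    """
    if override in ("lanpaint", "standard"):
        return override                      # explicit --inpaint wins
    if override != "auto":
        raise ValueError(f"unknown --inpaint value: {override!r}")
    if family == "flux1":
        # The guide only covers the case where Flux Fill is installed.
        return "flux-fill" if flux_fill_installed else None
    if family in ("klein", "z-image"):
        return "lanpaint"
    return None                              # not described in this guide


print(pick_inpaint_method("z-image"))
print(pick_inpaint_method("flux1", flux_fill_installed=True))
print(pick_inpaint_method("flux1", override="lanpaint"))
```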

Avoid distilled models:

Distilled/turbo models (Z-Image Turbo, Flux Schnell) produce lower quality with LanPaint. Distillation compresses the denoising schedule, and LanPaint’s algorithm relies on the score function that distillation breaks. Z-Image base gives the best LanPaint results.

Creating masks

Three approaches:

  # Auto-segment with SAM   
  $ modl process segment photo.png --method sam --point 400,500     
      
  # Text-grounded detection   
  $ modl vision ground "the person on the right" photo.png     
      
  # Manual: draw in any image editor, white = replace, black = keep   

For the detect → segment → inpaint pipeline, see the Image Primitives guide.

Lessons from testing

Things we learned building this guide:

LanPaint can handle larger masks than Flux Fill. At 15% mask coverage, LanPaint successfully removed a person while Flux Fill reconstructed them. The Langevin dynamics actively push toward the prompt rather than preserving nearby content.
Flux Fill excels at tiny, precise edits. Expression changes, small object swaps, blemish removal — anywhere the mask is under 5% and boundary blending is critical.
The prompt should describe what fills the gap. “Empty plaza, Eiffel Tower, sunny day” works better than “remove the person” — the model needs to know what to generate, not what to delete.
Mask placement matters more than mask shape. Rough ellipses work fine. But if the mask only partially covers an object, Flux Fill may keep it rather than removing it.
LanPaint supports Z-Image (base + turbo) and Flux Klein (4B and 9B). Each model needs a small adapter matching its output convention. Z-Image base gives the best results.

LanPaint across models

The same inpainting task, removing all foreground people from the Eiffel Tower scene, across four different models. Same mask, same prompt, same seed:

Z-Image (30 steps)

LanPaint person removal with Z-Image base model

Klein 9B (20 steps)

LanPaint person removal with Flux Klein 9B

Z-Image (non-distilled, bf16) vs Klein 9B (distilled, fp8). Z-Image produces a more natural reconstruction. Klein adds bollards and slightly different paving — its edit-style architecture interprets the scene differently.

Klein 4B (20 steps)

Z-Image Turbo (8 steps)

LanPaint person removal with Z-Image Turbo

Klein 4B (distilled, 4B params) adds a bus and bollards — its edit-style architecture interprets the prompt more creatively. Z-Image Turbo (distilled, 8 steps) is the fastest but adds phantom objects.

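| Model | Type | Steps | Quality | Notes |
| --- | --- | --- | --- | --- |
| Z-Image | Non-distilled | 30 | Best | Recommended for LanPaint |
| Klein 9B | Distilled (fp8) | 20 | Good | Edit-style, different aesthetic |
| Klein 4B | Distilled (fp8) | 20 | Good | Smallest model, fast |
| Z-Image Turbo | Distilled | 8 | Fair | Fastest but adds artifacts |
| Flux Fill | Dedicated inpaint | 28 | Excellent | Not LanPaint; separate model |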

Reference

Quick Reference

Flux Fill (small precise edits):

modl generate "what fills the gap" \
  --base flux-fill-dev \
  --init-image photo.png --mask mask.png \
  --steps 28

LanPaint (Z-Image, Klein — no inpaint model needed):

modl generate "what fills the gap" \
  --base z-image --inpaint lanpaint \
  --init-image photo.png --mask mask.png \
  --steps 30

Supported LanPaint models:

Z-Image (recommended — best quality)
Z-Image Turbo (fast, lower quality)
Flux Klein 4B and 9B (edit-style, different aesthetic)

The --inpaint flag:

--inpaint lanpaint: force LanPaint (auto-selected for models without standard inpaint)
--inpaint standard: force standard diffusers/Flux Fill inpainting
--inpaint auto: (default) picks the best method for the model

Paper: arXiv:2502.03491
