
Which Model Should I Use?

Understand the six ways modl creates images and compare all 13 supported models by size, speed, quality, and VRAM, with side-by-side generated samples.

Mar 15, 2026 · 10 min read

modl supports 13 models across 6 families. Each model has different strengths — some are fast, some are precise, some can edit, some can train. This guide helps you pick the right one, with real side-by-side comparisons.

Six ways to create images

Before choosing a model, understand what modl can do. There are six creation modes, and not every model supports all of them.

1. Text to image (txt2img)

The simplest mode. Describe what you want, get an image.

$ modl generate "a cabin in the mountains at golden hour, \
cinematic lighting, wide angle"

Every model supports this. It’s the default mode when you run modl generate.

2. Image to image (img2img)

Start from an existing image and transform it. The AI uses your image as a starting point — keeping the composition but changing the style, colors, or content.

$ modl generate "oil painting, impressionist style" \
--init-image photo.png --strength 0.7

The --strength parameter controls how much the output differs from the input. At 0.3 it’s a subtle filter; at 0.9 it’s almost a new image.
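Conceptually, strength controls how far into the denoising schedule the model starts from your image. A common formulation (an illustrative sketch of the general img2img technique, not modl's actual internals) runs only the last fraction of the steps:

```python
def img2img_steps_run(total_steps: int, strength: float) -> int:
    """Illustrative only: img2img pipelines commonly run just the last
    total_steps * strength denoising steps, starting from a noised copy
    of the input image. Lower strength = fewer steps run = output stays
    closer to the input."""
    strength = min(max(strength, 0.0), 1.0)
    return int(total_steps * strength)

img2img_steps_run(28, 0.3)  # 8 steps run: a subtle filter
img2img_steps_run(28, 0.9)  # 25 steps run: almost a new image
```

This is why very low strength values barely change the image: most of the denoising schedule is skipped entirely.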

Supported: flux-dev, flux-schnell, chroma, z-image, z-image-turbo, sdxl, sd-1.5

3. Inpainting

Paint over part of an image with a mask, then describe what should replace it. The AI fills in only the masked region while keeping everything else intact.

# Replace an object — mask the area, describe the replacement
$ modl generate "a vase of sunflowers" \
--init-image room.png --mask mask.png
 
# Klein 9b inpainting via LanPaint (auto-selected)
$ modl generate "a vase of sunflowers" \
--base flux2-klein-9b --init-image room.png --mask mask.png

You can create masks with modl process segment (automatic) or any image editor (manual). White pixels = replace, black pixels = keep.
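The mask convention is simple enough to sketch in a few lines. A minimal pure-Python illustration of a white-on-black rectangular mask (the bounding-box coordinates here are made up):

```python
def make_mask(width: int, height: int, bbox: tuple) -> list:
    """Build an inpainting mask as a 2D grid of pixel values:
    255 (white) = replace this pixel, 0 (black) = keep it."""
    x0, y0, x1, y1 = bbox
    return [
        [255 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]

mask = make_mask(8, 8, (2, 2, 6, 6))  # white 4x4 square in the middle
```

Tools like modl process segment produce exactly this kind of single-channel image, just at full resolution and usually with feathered edges.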

Standard inpaint: flux-dev, flux-schnell, flux-fill-dev, chroma, z-image, z-image-turbo, sdxl, sd-1.5

LanPaint inpaint: z-image, z-image-turbo, flux2-klein-4b, flux2-klein-9b (auto-selected for models without standard inpaint)

Tip:

Flux Fill Dev is a dedicated inpainting model with the best edge blending. LanPaint is a training-free algorithm that lets any supported model inpaint; modl auto-selects it for Klein models. Use --inpaint lanpaint to force it on models like Z-Image that support both methods.

4. Edit (instruction-based)

Describe a change in natural language. No mask needed — the model figures out what to modify.

$ modl edit "add sunglasses" --image portrait.png \
--base flux2-klein-4b
 
$ modl edit "change the background to a beach" \
--image product.png --base qwen-image-edit

This is the most intuitive mode. You just say what you want changed, and the model handles the spatial reasoning. Klein 4B does it in 4 steps; Qwen Image Edit takes 50 steps but handles complex multi-region edits.

Supported: flux2-klein-4b, flux2-klein-9b, qwen-image-edit

5. ControlNet (structural control)

Extract a structural map (edges, depth, pose) from any image, then generate a completely new image that follows that structure.

# Extract edges from a sneaker photo
$ modl process preprocess canny sneaker.png
 
# Generate a crystal shoe with the same silhouette
$ modl generate "crystal shoe, magical, dark background" \
--controlnet sneaker_canny.png --base z-image-turbo

ControlNet is a two-step workflow: preprocess then generate. The preprocessing is model-agnostic — the same depth map works with any model that supports ControlNet.

Supported: flux-dev, flux-schnell, z-image, z-image-turbo, qwen-image, sdxl

Tip:

See the ControlNet guide for the full breakdown of preprocessing methods, strength tuning, and model-specific VRAM requirements.

6. Style reference

Feed a reference image and the model adopts its visual style — colors, composition patterns, artistic feel — while generating new content from your prompt.

$ modl generate "a mountain landscape" \
--style-ref monet_painting.png --style-strength 0.6 \
--base flux-dev

Two mechanisms: IP-Adapter on generate (Flux Dev, SDXL) and multi-image edit (Klein). Klein’s approach is through modl edit — pass the reference as a second --image.

IP-Adapter (--style-ref): flux-dev, sdxl

Multi-image edit (--image x2): flux2-klein-4b, flux2-klein-9b


Side-by-side: same prompt, different models

All images below were generated with the same prompt and seed (42) across six models. No cherry-picking — these are the raw results.

Portrait

“close-up portrait of an elderly man with deep wrinkles, silver beard, piercing blue eyes, natural window light, shallow depth of field, photorealistic”

1 Chroma (40 steps)
Portrait generated with Chroma
2 Flux Dev (28 steps)
Portrait generated with Flux Dev
3 Qwen Image (25 steps)
Portrait generated with Qwen Image
4 Klein 4B (4 steps)
Portrait generated with Klein 4B
5 Z-Image Turbo (8 steps)
Portrait generated with Z-Image Turbo
6 SDXL (30 steps)
Portrait generated with SDXL

Same prompt, same seed. Notice the differences in skin detail, lighting interpretation, and overall aesthetic. Klein 4B produces this in 4 steps — the others take 8-40.

Landscape

“vast mountain valley at sunrise, fog rolling between peaks, river reflecting golden light, pine forests, cinematic landscape photography”

1 Chroma
Landscape generated with Chroma
2 Flux Dev
Landscape generated with Flux Dev
3 Qwen Image
Landscape generated with Qwen Image
4 Klein 4B
Landscape generated with Klein 4B
5 Z-Image Turbo
Landscape generated with Z-Image Turbo
6 SDXL
Landscape generated with SDXL

Landscape at 16:9. Each model interprets 'cinematic' differently — from painterly to photographic.

Product photography

“premium leather watch on dark marble surface, dramatic side lighting, product photography, sharp focus, dark background, commercial quality”

1 Chroma
Product photo generated with Chroma
2 Flux Dev
Product photo generated with Flux Dev
3 Qwen Image
Product photo generated with Qwen Image
4 Klein 4B
Product photo generated with Klein 4B
5 Z-Image Turbo
Product photo generated with Z-Image Turbo
6 SDXL
Product photo generated with SDXL

Product shots test material rendering, lighting accuracy, and detail. The watch's fine textures (gears, hands) push into the VAE's resolution limits; this is where quality differences between models become most visible.

Illustration

“a fox reading a book under a giant mushroom in an enchanted forest, watercolor illustration, storybook art style, warm colors, whimsical”

1 Chroma
Illustration generated with Chroma
2 Flux Dev
Illustration generated with Flux Dev
3 Qwen Image
Illustration generated with Qwen Image
4 Klein 4B
Illustration generated with Klein 4B
5 Z-Image Turbo
Illustration generated with Z-Image Turbo
6 SDXL
Illustration generated with SDXL

Artistic prompts reveal each model's default aesthetic. Some lean photographic even when asked for watercolor, others embrace the style fully.

Text rendering

“a neon sign that reads ‘OPEN 24/7’ hanging in a rainy window, cyberpunk aesthetic, reflections, moody atmosphere”

1 Chroma
Text rendering with Chroma
2 Flux Dev
Text rendering with Flux Dev
3 Klein 4B
Text rendering with Klein 4B
4 Z-Image Turbo
Text rendering with Z-Image Turbo
5 Qwen Image
Text rendering with Qwen Image
6 SDXL
Text rendering with SDXL

Text rendering is the hardest test. Most models scramble letters. Qwen Image is the only model specifically trained for readable text — compare its 'OPEN 24/7' to the others.


The models

2025 models (current generation)

Model            Params  VRAM (fp8)  Steps  Quality  Speed
flux2-klein-4b   9B      ~10 GB      4      3/5      5/5
flux2-klein-9b   18B     ~16 GB      4      4/5      4/5
z-image-turbo    11B     ~14 GB      8      3/5      4/5
z-image          11B     ~14 GB      20     4/5      2/5
chroma           14B     ~16 GB      40     4/5      2/5
flux2-dev        46B     ~35 GB      28     5/5      1/5
qwen-image       27B     ~30 GB      25     5/5      2/5
qwen-image-edit  27B     ~30 GB      50     5/5      1/5

2024 models (proven, large ecosystem)

Model          Params  VRAM (fp8)  Steps  Quality  Speed
flux-dev       17B     ~20 GB      28     4/5      2/5
flux-schnell   17B     ~20 GB      4      3/5      5/5
flux-fill-dev  17B     ~20 GB      50     5/5      2/5

Legacy (2022-2023)

Model   Params  VRAM (fp8)  Steps  Quality  Speed
sdxl    3.7B    ~5 GB       30     3/5      2/5
sd-1.5  1.1B    ~3 GB       30     2/5      4/5

Param counts:

“Params” includes both the image model (transformer/UNet) and text encoder(s). The image model is the larger part — it’s what determines quality. Text encoders handle prompt understanding and are shared across model families.


Capability matrix

Model            txt2img  img2img  inpaint   edit  ControlNet  Style ref   Train
flux2-klein-4b   yes      --       LanPaint  yes   --          via edit    yes
flux2-klein-9b   yes      --       LanPaint  yes   --          via edit    yes
flux2-dev        yes      --       --        --    --          --          yes
chroma           yes      yes      yes       --    --          --          yes
z-image-turbo    yes      yes      yes       --    yes         --          yes
z-image          yes      yes      yes       --    yes         --          yes
qwen-image       yes      --       --        --    yes         --          yes
qwen-image-edit  --       --       --        yes   --          --          --
flux-dev         yes      yes      yes       --    yes         IP-Adapter  yes
flux-schnell     yes      yes      yes       --    yes         --          yes
flux-fill-dev    --       --       yes       --    --          --          --
sdxl             yes      yes      yes       --    yes         IP-Adapter  yes
sd-1.5           yes      yes      yes       --    --          --          yes

Tip:

A "--" means the mode isn't supported for that model. modl validates this before running: if you try modl edit --base z-image, it'll tell you which models support editing and suggest one.


Decision guide

“I just want to generate images”

Start with flux2-klein-9b. It’s the best balance of quality, speed (4 steps), and VRAM (~16GB fp8). If you’re on a 12GB card, use flux2-klein-4b or z-image-turbo.

“I need to edit/modify existing images”

Two options depending on what you’re doing:

  • Replace a region (remove an object, swap a background) → use inpainting with flux-fill-dev (best edges), flux-dev, or flux2-klein-9b (uses LanPaint automatically)
  • Transform the whole image (add sunglasses, change time of day) → use editing with flux2-klein-4b (fast, 4 steps) or qwen-image-edit (slower, more precise)

“I want structural control”

Use ControlNet with z-image-turbo (best ControlNet support, fits on 24GB with Union 2.1). Or use edit mode with flux2-klein-4b for ControlNet-like results without extra weights — see the structural editing guide.

“I’m training LoRAs”

Most models support training. Best choices:

  • flux-dev — best ecosystem, most community LoRAs for reference
  • flux2-klein-4b — fastest training and inference, good for rapid iteration
  • z-image / z-image-turbo — strong quality, smaller model = faster training
  • sdxl — if you need community LoRA compatibility

“I need text in images”

qwen-image is the only model with real text rendering capability (Chinese and English). All other models struggle with text — check the comparison above.

“I want open-source (Apache 2.0)”

chroma is the only Apache 2.0 model in the lineup. It's a Flux fork with an 8.9B-parameter image model (14B total with text encoders) that supports negative prompts, img2img, and inpainting. Strong quality at 40 steps.

“I have limited VRAM”

VRAM   Best choices
24GB+  Any model (fp8)
16GB   flux2-klein-9b, chroma, z-image-turbo
12GB   flux2-klein-4b, z-image-turbo (fp8)
8GB    sdxl, sd-1.5
4GB    sd-1.5

Powerful combinations

The real power of modl is chaining commands. Here are workflows that combine multiple modes and models.

Generate → score → pick the best

# Generate 4 variations
$ modl generate "product photo on marble" --count 4 --base z-image-turbo
 
# Score them all
$ modl vision score ~/.modl/outputs/2026-03-15/*.png --json
[{"file": "001.png", "score": 7.2}, {"file": "002.png", "score": 5.8},
{"file": "003.png", "score": 8.1}, {"file": "004.png", "score": 6.4}]
 
# Upscale the best one to 4x
$ modl process upscale 003.png --scale 4
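If you'd rather script the "pick the best" step than eyeball the JSON, the scored output can be sorted in a few lines of Python (the file names and scores below mirror the example output above; the JSON shape is taken from it):

```python
import json

# Example output from `modl vision score ... --json`, copied from above.
scores_json = """[
  {"file": "001.png", "score": 7.2}, {"file": "002.png", "score": 5.8},
  {"file": "003.png", "score": 8.1}, {"file": "004.png", "score": 6.4}
]"""

# Pick the file with the highest score.
best = max(json.loads(scores_json), key=lambda item: item["score"])
print(best["file"])  # 003.png
```

Piping the JSON into a script like this makes the generate-score-upscale loop fully automatable.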

Find → mask → replace (object swap)

# Find all coffee cups in a cafe photo
$ modl vision ground "coffee cup" cafe.png --json
{"objects": [{"label": "coffee cup", "bbox": [120, 340, 280, 500]}]}
 
# Create a mask from the bounding box
$ modl process segment cafe.png --bbox 120,340,280,500 --expand 20 --feather 10
 
# Replace with wine glasses
$ modl generate "elegant wine glass, same lighting" \
--init-image cafe.png --mask cafe_mask.png --base flux-fill-dev
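The --expand flag above grows the bounding box before masking so the replacement has breathing room at the edges. The arithmetic is just padding clamped to the image bounds; a sketch of the idea (not modl's implementation):

```python
def expand_bbox(bbox: tuple, pad: int, img_w: int, img_h: int) -> tuple:
    """Grow an (x0, y0, x1, y1) box by `pad` pixels on every side,
    clamped so it never leaves the image."""
    x0, y0, x1, y1 = bbox
    return (max(0, x0 - pad), max(0, y0 - pad),
            min(img_w, x1 + pad), min(img_h, y1 + pad))

# The coffee-cup box from the example, padded by 20px in a 1024x1024 image:
expand_bbox((120, 340, 280, 500), 20, 1024, 1024)  # (100, 320, 300, 520)
```

The --feather flag then softens the expanded mask's edges so the inpainted region blends instead of showing a hard seam.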

Preprocess → generate → restore → upscale (production pipeline)

# Extract depth from a portrait
$ modl process preprocess depth portrait.png
 
# Generate anime version with same spatial layout
$ modl generate "anime character, studio ghibli style" \
--controlnet portrait_depth.png --base z-image-turbo
 
# Fix any face artifacts
$ modl process face-restore output.png
 
# Upscale to print resolution
$ modl process upscale output_restored.png --scale 4

Train → generate → compare (LoRA iteration)

# Train a LoRA from 20 photos
$ modl dataset create my-dog --from ~/photos/
$ modl dataset caption my-dog
$ modl train --dataset my-dog --base flux-dev --name dog-v1
 
# Generate with the LoRA
$ modl generate "OHWX sitting on a throne" --lora dog-v1
 
# Compare to reference photo (CLIP similarity)
$ modl vision compare ~/photos/best.png output.png --json
{"similarity": 0.78}
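CLIP similarity is cosine similarity between the two images' embedding vectors. A toy sketch with hand-made 3-dimensional vectors (real CLIP embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors:
    1.0 means identical direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # ≈ 1.0 (identical)
```

A score like the 0.78 above means the LoRA output points strongly in the same embedding direction as the reference photo, without being a pixel-level copy.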

What’s next

Explore from here

  1. Getting Started — Install modl and generate your first image
  2. ControlNet guide — Structural control with canny, depth, pose
  3. Structural Editing — ControlNet-like results without ControlNet weights
  4. Train a Style LoRA — Teach a model your visual style
  5. Image Primitives — Score, detect, segment, restore, upscale
  6. VL Primitives — Ground, describe, and tag with vision-language models

Ready to try?

If you haven’t installed modl yet, start with Getting Started. Already running? Try Shape Control with ControlNet to see what structural control can do.