
Shape Control with ControlNet

Turn sketches into photos, swap materials, transfer compositions — use structural control to tell the AI where everything goes while changing everything else.

Mar 14, 2026 · 12 min read

You have a product photo of a sneaker. You want to see it as a crystal sculpture — same exact shape, completely different material. Without ControlNet, the AI invents its own shoe shape. With ControlNet, you lock the silhouette and change everything else.

[Figure: (1) the original sneaker product photo, (2) its Canny edge map, (3) with ControlNet: a crystal ice shoe with the exact same shape, (4) without ControlNet: the AI invents its own shoe shape.]

Same prompt, same seed. With ControlNet the crystal shoe follows the sneaker's exact silhouette. Without it, the AI creates whatever shape it wants.

Two commands:

$ modl process preprocess canny sneaker.png
sneaker.png → sneaker_canny.png
 
$ modl generate "a shoe made of glowing blue crystal and ice, \
magical artifact, dark background, fantasy" \
--controlnet sneaker_canny.png --base z-image-turbo
✓ Generated 1 image(s)

More examples

ControlNet shines when the structural control and the creative prompt are dramatically different. Here are three patterns that show its power:

Portrait → Anime character

Extract soft edges from a real photo, generate as a completely different art style:

[Figure: soft edges extracted from a portrait photo, and the anime character generated with the same facial structure.]

The soft edges preserve the face structure, pose, and hair outline. The model fills in anime-style rendering while following the exact composition.

$ modl process preprocess softedge portrait.png
$ modl generate "anime character portrait, studio ghibli art style, \
cel shading, colorful hair, vibrant, illustration" \
--controlnet portrait_softedge.png --cn-type hed \
--cn-strength 0.6 --base z-image-turbo

Depth map → Completely different scene

A depth map captures spatial layout without any visual detail. Use it to transfer the 3D composition of one scene onto something entirely unrelated:

[Figure: depth map from a cafe scene (brighter areas are closer), and the underwater coral reef generated with the cafe's spatial layout.]

The cafe's depth map — table in the foreground, people in the middle, background behind — becomes an underwater scene with the same spatial arrangement.

$ modl process preprocess depth cafe.png
$ modl generate "underwater coral reef, tropical fish, \
sunlight rays through water, vibrant colors" \
--controlnet cafe_depth.png --cn-strength 0.9 \
--size 16:9 --base z-image-turbo

Scribble → Product photo

Extract a rough scribble (like a hand drawing) and generate a photorealistic product from it:

[Figure: binary scribble lines extracted from the sneaker (a rough sketch), and the photorealistic leather sneaker generated from that shape.]

The scribble provides a loose shape guide. At 0.4 strength, the model follows the sneaker outline while adding realistic leather materials and studio lighting — without the sketch lines bleeding through.

$ modl process preprocess scribble sneaker.png
$ modl generate "professional product photo of a premium sneaker, \
dark leather, studio lighting, white background, photorealistic" \
--controlnet sneaker_scribble.png --cn-type scribble \
--cn-strength 0.4 --base z-image-turbo

How it works

ControlNet is a two-step process:

  1. Preprocess — Extract a structural map (edges, depth, pose) from any image
  2. Generate — Feed that map to the model alongside your prompt

The preprocessing step is model-agnostic — the same depth map works with Z-Image, Flux, SDXL, or Qwen-Image. The generation step uses model-specific ControlNet weights.
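The structural maps themselves are ordinary images. As a rough conceptual sketch of what an edge-based preprocessor computes (not modl's actual implementation; real Canny adds Gaussian blur, non-maximum suppression, and hysteresis thresholding), here is a minimal gradient-magnitude edge detector:

```python
# Conceptual sketch only: mark pixels where brightness changes sharply
# between neighbours. This is the core idea behind edge-based control maps.
def edge_map(gray, threshold=50):
    """gray: 2D list of 0-255 ints. Returns 2D list of 0/255 ints."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]  # horizontal gradient
            gy = gray[y + 1][x] - gray[y - 1][x]  # vertical gradient
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                out[y][x] = 255                   # edge pixel
    return out

# A dark square on a light background: edges appear only at the boundary.
img = [[200] * 8 for _ in range(8)]
for y in range(2, 6):
    for x in range(2, 6):
        img[y][x] = 30
edges = edge_map(img)
```

The output is a black image with white lines, exactly the kind of file the generation step consumes alongside your prompt.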

Design principle:

Preprocessing is always explicit and separate from generation. You inspect every intermediate artifact before generating. No magic, no surprises.

Tip:

Match aspect ratios. If your source image is landscape, use --size 16:9 when generating. A landscape depth map squeezed into a square output will distort the spatial layout. The control image and generation size should have similar proportions.
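Picking a matching size is just a matter of comparing ratios numerically. A hypothetical helper, with an assumed preset list (modl's actual size options may differ):

```python
# Hypothetical helper: pick the --size preset whose aspect ratio is
# closest to the control image's. The preset list is an assumption for
# illustration, not modl's documented option set.
PRESETS = {"1:1": 1.0, "4:3": 4 / 3, "3:4": 3 / 4,
           "16:9": 16 / 9, "9:16": 9 / 16}

def closest_size(width, height):
    ratio = width / height
    return min(PRESETS, key=lambda name: abs(PRESETS[name] - ratio))

closest_size(1920, 1080)  # landscape source -> "16:9"
```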

Preprocessing methods

modl process preprocess extracts structural maps from any image. Each method captures different information, and that changes how the model interprets the same prompt.

Here’s the same sneaker through four methods, all generating “crystal shoe, dark background, fantasy” at the default strength (0.6), seed 42:

[Figure: Canny edges (hard, precise) and their result, a crystal shoe with a clean silhouette; softedge/HED (smooth, organic edges) and a slightly softer result.]

Canny and softedge both follow the sneaker outline closely. Canny gives sharper edges, softedge is slightly more organic.

[Figure: the depth map (3D spatial layout) and its chunkier-soled result; the binary scribble (thick rough lines) and the loosest, beefier result.]

Depth follows the 3D volume with slightly different proportions. Scribble gives the loosest interpretation — the model follows the general silhouette while taking more creative freedom with details.

[Figure: clean lineart (detailed, thin lines) and its result, a precise shape similar to canny.]

Lineart produces clean, uniform lines — less noisy than canny on complex textures. Use with --cn-type canny.

$ modl process preprocess canny photo.png # hard edges (no model needed, fast)
$ modl process preprocess softedge photo.png # soft edges (HED model)
$ modl process preprocess depth photo.png # depth map (Depth Anything V2)
$ modl process preprocess scribble photo.png # binary line drawing (HED + threshold)
$ modl process preprocess lineart photo.png # clean line art
$ modl process preprocess normal photo.png # surface normals (from depth)
$ modl process preprocess pose photo.png # body skeleton (DWPose, needs a person)

Output files follow the convention {stem}_{method}.png.
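The naming convention is simple enough to reproduce in a script. A small sketch of it (the helper name is ours, not a modl API):

```python
from pathlib import Path

# The {stem}_{method}.png convention described above, as a helper.
def output_name(input_path, method):
    p = Path(input_path)
    return p.with_name(f"{p.stem}_{method}.png")

output_name("sneaker.png", "canny")  # -> sneaker_canny.png
```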

Choosing a method

| Method | Best for | Strictness | Watch out |
| --- | --- | --- | --- |
| canny | Product silhouettes, architecture, clean outlines | Strict | Too noisy on detailed textures (skin, fur). Use softedge instead. |
| softedge | Portraits, organic shapes, natural scenes | Moderate | Pass --cn-type hed when generating (the Z-Image controlnet has no softedge mode). |
| depth | Scene layout, 3D composition, style transfer | Loose | Match --size to the source image aspect ratio or the depth map gets distorted. |
| scribble | Rough shape guidance, creative interpretation | Loose | Use low strength (0.3-0.4); higher values bleed the binary sketch lines into the output. |
| pose | Human figures, body positioning | Moderate | Needs a visible body. Won't work on close-up portraits or objects. |
| lineart | Illustration references, detailed line drawings | Moderate | Similar to canny but with smoother, more uniform lines. |
| normal | Surface-aware generation, material transfer | Moderate | Best for objects with clear 3D surfaces. |
Tip:

Canny for hard-edged objects (products, architecture). Softedge for anything organic (faces, nature, characters). Depth for spatial layout when you want to change the content entirely. Scribble when you want the model to interpret a rough shape freely.
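That rule of thumb can be encoded as a simple lookup. A hypothetical helper (the subject categories are illustrative, not part of modl):

```python
# Hypothetical helper encoding the rule of thumb above.
METHOD_FOR_SUBJECT = {
    "product": "canny",        # hard-edged objects
    "architecture": "canny",
    "portrait": "softedge",    # organic shapes
    "nature": "softedge",
    "scene_layout": "depth",   # keep composition, change content entirely
    "rough_sketch": "scribble",  # loose, creative interpretation
}

def pick_method(subject):
    # Default to canny when the subject type is unknown.
    return METHOD_FOR_SUBJECT.get(subject, "canny")
```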

Batch processing

$ modl process preprocess depth ./photos/
✓ Preprocessed 12 image(s)
 
$ modl process preprocess canny ./photos/ --output ./edges/

Control parameters

| Flag | Default | Description |
| --- | --- | --- |
| --controlnet | -- | Path to control image (repeatable, max 2) |
| --cn-type | auto | canny, depth, pose, softedge, scribble, hed, mlsd, gray |
| --cn-strength | 0.6 | How strongly the control image influences output (0.0-1.5) |
| --cn-end | 0.8 | Stop applying control at this fraction of total steps |

Auto-detection

If your file follows {name}_{type}.png naming, modl vision detects the type automatically:

$ modl generate "a castle" --controlnet castle_depth.png --base z-image-turbo
CN: castle_depth.png (type: depth, strength: 0.6, end: 0.8)
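The detection logic amounts to checking the last underscore-separated chunk of the filename against the known control types. A sketch of that rule (not modl's actual code):

```python
from pathlib import Path

# Sketch of {name}_{type}.png auto-detection. The recognized types
# mirror the --cn-type list above; "auto" means no match was found.
KNOWN_TYPES = {"canny", "depth", "pose", "softedge",
               "scribble", "hed", "mlsd", "gray"}

def detect_cn_type(path):
    stem = Path(path).stem             # "castle_depth"
    suffix = stem.rsplit("_", 1)[-1]   # "depth"
    return suffix if suffix in KNOWN_TYPES else "auto"

detect_cn_type("castle_depth.png")  # -> "depth"
```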

Tuning strength

Strength controls how closely the output follows the control image. Same prompt, same seed, same canny edges — only --cn-strength changes:

[Figure: the canny edge map, then results at strength 0.3 (a chunky boot shape), 0.5 (sneaker shape clearly followed), and 0.6 (exact silhouette preserved).]

Crystal sneaker at three strength levels. At 0.3 the model invents its own shape. At 0.5-0.6 it follows the canny edges while maintaining good material rendering.

  • 0.3-0.4 — Loose guidance. Good for scribbles and depth maps where you want creative interpretation without control artifacts bleeding through.
  • 0.5-0.7 — Balanced (default 0.6). Follows the structure while letting the model fill in details naturally. Best starting point for most uses.
  • 0.8-1.2 — Strict adherence. Useful for canny edges on clean subjects (products, architecture). Can over-constrain on noisy or organic inputs.
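Together with --cn-end, strength can be pictured as a per-step weight schedule. A minimal sketch, assuming control is applied at full strength until the cn-end fraction of steps and then dropped (modl's internal scheduling may differ):

```python
# Sketch of how --cn-strength and --cn-end could combine per denoising
# step: apply the control signal at the given strength until the step
# fraction reaches cn_end, then drop it so the final steps refine freely.
def control_weight(step, total_steps, strength=0.6, cn_end=0.8):
    fraction = step / total_steps
    return strength if fraction < cn_end else 0.0

# With z-image-turbo's 8 steps at the defaults, control applies to the
# first 7 steps and is released for the last one.
schedule = [control_weight(s, 8) for s in range(8)]
```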

Model support

| Model | ControlNet | Supported types | Notes |
| --- | --- | --- | --- |
| z-image-turbo | Union 2.1 | canny, depth, pose, hed, mlsd, scribble, gray | Best option for 24GB. 8 steps. |
| flux-dev | Union Pro 2.0 | canny, depth, pose, softedge, gray | 28 steps, high quality |
| qwen-image | Union | canny, depth, pose, softedge | 30 steps. Quality reduced with GGUF quantization. |
| sdxl | Union | canny, depth, pose, softedge, tile, scribble, hed, mlsd, normal | Most types supported |
| flux-2-klein | -- | -- | No ControlNet yet. Use flux-dev instead. |

VRAM guide

ControlNet loads additional weights alongside the base model. On consumer GPUs, the variant matters:

| GPU | Base model | ControlNet | Fits? | Peak VRAM |
| --- | --- | --- | --- | --- |
| 24 GB | z-image-turbo (auto fp8) | 2.1 lite (2GB) | Yes | ~16 GB |
| 24 GB | z-image-turbo (auto fp8) | 2.1 full (6GB) | Yes | ~18.5 GB |
| 24 GB | qwen-image (GGUF q5) | Union (3.5GB) | Yes* | ~23 GB |

The Z-Image-Turbo ControlNet Union 2.1 comes in two sizes:

  • Lite (2GB, 3 control layers) — Natural results, fits on 24GB GPUs. Recommended starting point.
  • Full (6GB, 15 control layers) — Strongest control, best detail preservation. Fits on 24GB with automatic text encoder offloading.

# Lite variant (recommended starting point)
$ modl pull z-image-turbo-controlnet-union-2.1 --variant lite-8steps
 
# Full variant (stronger control, still fits on 24GB)
$ modl pull z-image-turbo-controlnet-union-2.1 --variant full-8steps

Both variants fit on 24GB GPUs. When the full controlnet is active, modl automatically converts the base transformer to fp8 (~5.7GB instead of ~11.5GB) and offloads the text encoder to CPU during denoising. Peak VRAM usage is about 18.5GB with the full variant and 16GB with lite.

Tip:

Start with the lite variant. If you need more precise structural control, upgrade to full. The full variant preserves more detail and gives tighter shape adherence, but both produce good results at the default strength.

JSON output

Both commands support --json for scripting and agent pipelines:

$ modl process preprocess depth photo.png --json
{"method":"depth","processed":1,"outputs":[{"input":"photo.png",
"output":"photo_depth.png","resolution":[1024,768]}]}
 
$ modl generate "test" --controlnet photo_depth.png \
--base z-image-turbo --json
{"status":"completed","images":["~/.modl/outputs/2026-03-14/001.png"]}
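In a script, that JSON can be parsed and chained directly into the next step. A sketch using the preprocess output shown above (the chaining logic itself is hypothetical):

```python
import json

# The preprocess --json output from the example above.
preprocess_out = (
    '{"method":"depth","processed":1,"outputs":[{"input":"photo.png",'
    '"output":"photo_depth.png","resolution":[1024,768]}]}'
)

result = json.loads(preprocess_out)
# Collect the generated control maps to feed to --controlnet next.
control_images = [o["output"] for o in result["outputs"]]
```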

Tips & gotchas

Things we learned the hard way so you don’t have to:

Strength is the most important parameter. The default (0.6) works for most cases, but different methods need different ranges. Canny and softedge work well at 0.5-0.7. Depth and scribble fall apart below 0.5 — they don’t carry enough structural information at low strength. Scribble above 0.6 bleeds binary sketch lines into the output.

Match your prompt to the control shape. If your scribble looks like a sneaker, don’t prompt for “a boot.” The model will fight between the control signal (low shoe) and the prompt (tall boot), producing a compromise that looks wrong. Work with the shape, not against it.

Canny is too noisy for detailed textures. On faces, skin, fur, or foliage, canny produces a dense mess of edges that overwhelms the model. Use softedge (HED) instead — it captures structure without the noise. Pass --cn-type hed since the Z-Image controlnet doesn’t have a softedge mode.

Match --size to your source aspect ratio. A landscape depth map squeezed into a square output distorts the spatial layout. Use --size 16:9 for landscape sources, --size 9:16 for portrait.

ControlNet metadata is saved. Every generated image stores the control type, strength, end value, and source filename in the PNG metadata. Use modl info <image> or any EXIF viewer to see the exact settings that produced an image.

Quick reference

  • modl process preprocess canny|depth|pose|softedge|scribble|lineart|normal <image>
  • modl generate "prompt" --controlnet <control_image> --base <model>
  • --cn-type — auto-detected from filename or set explicitly
  • --cn-strength 0.6 — lower = looser, higher = stricter
  • --cn-end 0.8 — stop control early for more creative freedom in final details
  • Z-Image Turbo + CN fits on 24GB (lite ~16GB, full ~18.5GB peak)