trainingcharacterloraz-imagekleinflux

Train a Character LoRA — From 24 Photos to Infinite Scenes

I trained my Pomeranian on 5 models in one day. Real benchmarks, real failures, and the one setting that fixed Z-Image.

Mar 25, 2026 15 min read

TL;DR

Train a LoRA from 24 phone photos in one command. Works for people, pets, and products. Klein 4B gives the best quality/speed balance (85 min on a 4090). Use --class-word with the specific breed or type — “pomeranian” not “dog”. If using Z-Image, the prodigy optimizer is essential — adamw doesn’t converge. Full benchmarks and failure analysis below.

The goal

I wanted to train my Pomeranian across every viable model in a single day — same 24 photos, same trigger word, same GPU — and find out which combination actually produces the best character LoRA. Along the way, 8 of my training runs failed. Every failure taught me something.

This guide walks through the method, the results, and the mistakes. If you just want the settings, skip to the cheat sheet. If you want to understand why those settings work, read on.

You’ll need an NVIDIA GPU with 10+ GB VRAM (16-24 GB recommended), 15-30 photos of your subject, and modl v0.2.8+ installed (curl -fsSL https://modl.run/install.sh | sh). Under the hood, modl wraps ai-toolkit for training — if you prefer to use ai-toolkit directly, all the settings and findings here still apply.

12 AI-generated scenes of a Pomeranian from the same LoRA

The payoff: 12 scenes generated from a single LoRA trained on 24 phone photos. Space, throne, cherry blossoms, surfing, renaissance, detective, sunflowers, snow, road trip, mountain, library.

Works for products too:

The same technique applies to product photography and ecommerce. Train on 20-30 photos of your product and generate it in any setting — lifestyle shots, studio lighting, seasonal campaigns.

Step 1: Prepare your dataset

The dataset is where most character LoRAs succeed or fail. Getting the training settings right matters far less than having good, varied input photos.

Collect images

Aim for 20-30 images. Quality and variety matter more than quantity.

Angles: front, side, three-quarter, close-up, full body
Lighting: indoor, outdoor, natural, studio, dramatic
Backgrounds: varied (not all against the same wall)
Expressions: smiling, neutral, serious
Clothing: different outfits (prevents the LoRA from baking in one look)

Tip:

20 well-varied photos outperform 100 similar selfies. The model needs to see the subject from enough angles and contexts to separate identity from environment.

12 of the 24 training photos — varied poses, lighting, backgrounds

My training dataset (12 of 24 shown). Phone photos with varied poses, indoor/outdoor, different lighting. This is all the model needs.

Create and caption

  # Create dataset from your photos   
  $ modl dataset create my-character --from ~/photos/subject/     
     ✓ 24 images → ~/.modl/datasets/my-character/  
      
  # Resize to training resolution   
  $ modl dataset resize my-character     
     ✓ 24 images resized to 1024px  
      
  # Caption with Qwen3-VL (~5s per image, ~2 min for 24 photos)   
  $ modl dataset caption my-character --model qwen     
     ✓ 24/24 captions written  

Why Qwen3-VL? Florence-2 is faster (~1s/image) but hallucinates on people — it invents emotions, scenarios, and backstories. I captioned a person dataset with Florence-2 and got “sentenced to life in prison” on a photo of a TV host. Qwen3-VL produces accurate descriptions and doesn’t fabricate.

Captions should describe everything except the subject’s inherent identity. The model learns identity from the images; captions teach it everything else.

Good: “A fluffy, light-brown Pomeranian sits on a tiled floor, looking up at the camera. Behind it, a pet bed and wooden furniture are visible.”

Too vague: “A cute dog.” — model can’t separate background from subject.

Too long: 200+ words — gets truncated during training, wasting the detail.

Keep captions to 1-2 sentences, 30-50 words.

Trigger words go in training, not captions:

Don’t add trigger words (like OHWX) to your caption files. The training pipeline injects them automatically. Adding them manually doubles them.

Step 2: Choose your model and rank

I held everything constant across all five runs — same 24 images, same trigger word (OHWX), same class word, same captioning (Qwen3-VL). The only variable was the base model. All benchmarks are from a single RTX 4090.

SDXL~7 GB28 minGoodFast iteration, proven ecosystem

Z-Image~17 GB2 hrsExcellentBest skin texture and realism

Klein 4B~15 GB85 minVery goodBest quality/speed balance

Klein 9B~13 GB (quantized)3.3 hrsVery goodHigher capacity, slow on 24GB

Flux Schnell~20 GB64 minGood4-step fast inference

All times from an RTX 4090 with a 24-image dataset. On a 3060 (12 GB), expect 2-3x longer. Klein 4B and SDXL still fit comfortably on 12 GB. Klein 9B requires quantization and won’t run below 13 GB.

5-model comparison grid — SDXL, Klein 4B, Klein 9B, Schnell, Z-Image across 3 prompts

Same 24 photos, same trigger, best checkpoint from each model. Three prompts (identity, close-up, beach) across five models. Z-Image has the best skin/fur realism, Klein 4B the best quality/speed balance.

Choosing rank: 16 vs 32

Rank controls how much the LoRA can learn. This is a training decision, not an inference one — pick before you start.

File size (Klein 4B)~44 MB~88 MB

Best forSimple subjects, fast iterationComplex features, tattoos, fine detail

ConvergenceCan oscillate overfit/underfitMore stable, smoother training

Training speedBaseline~10% slower

When to useFirst attempt, small datasetsIf r16 is inconsistent, complex subjects

Klein 4B rank and class word comparison — r16+dog vs r32+dog vs r32+pomeranian

Klein 4B at step 2000, three configurations. Left: rank 16 with 'dog' (44 MB). Middle: rank 32 with 'dog' (89 MB). Right: rank 32 with 'pomeranian' (89 MB). The jump from left to right shows rank and class word working together.

If rank 16 gives you inconsistent likeness, rank 32 is the fix.

Base vs. distilled (Klein)

Klein models come in two variants. This matters because the same --base flag does different things depending on whether you’re training or generating:

modl train --base flux2-klein-4b → downloads the base (undistilled) variant. Slower inference (~50 steps), but the LoRA learns better.
modl generate --base flux2-klein-4b → uses the distilled variant. Fast (4 steps). LoRAs trained on base transfer to it at strength 1.0-1.5.

modl picks the right variant for each task. If using ai-toolkit directly, make sure you point to the base repo for training — the distilled repo doesn’t include the base transformer weights.

Tip:

If likeness seems weak during generation with a Klein LoRA, try --lora-strength 1.3. LoRAs trained on base sometimes need a slight strength bump on distilled.

Z-Image: the prodigy discovery

This was the single most important finding from the whole experiment.

Z-Image Base has a known issue with adamw8bit — it often fails to converge for character training. Hundreds of Reddit users spent months and real money failing to train Z-Image character LoRAs. The fix was one setting change: switch to the prodigy optimizer, which auto-tunes the learning rate internally.

adamw8bit vs prodigy optimizer comparison on Z-Image Base

Same dataset, same model, same step count. Top row (adamw8bit): wrong breed at every step — never converges. Bottom row (prodigy): correct breed from step 600, excellent likeness by step 1200. One setting change.

modl now defaults to prodigy for Z-Image Base. Because prodigy converges roughly twice as fast, Z-Image character training peaks around 1500-2000 steps instead of the 3000+ that adamw needed (and usually still failed at).

Step 3: Train

With the dataset ready and model chosen, training is one command. modl selects steps, learning rate, optimizer, and rank based on the model and dataset size.

  $ modl train \     
       --dataset my-character \  
       --base flux2-klein-4b \  
       --name my-character-v1 \  
       --trigger OHWX \  
       --class-word pomeranian \  
       --lora-type character  
      
   → Training started — job-my-character-v1-20260325  
   ⠁ [████████░░░░░░░░░░░░░░░░░░░░░░] 800/2000 steps loss: 0.0823  

What the parameters mean

--stepsAuto per modelTotal training iterations. For Klein 4B with 24 images: ~2000

--rank16LoRA capacity. 16 for simple subjects, 32 for complex (see above)

--lrAuto per modelLearning rate. 5e-5 for Klein 4B, 1.0 for Z-Image prodigy, 1e-4 for SDXL

--triggerOHWXWord that activates the LoRA in prompts

--class-word(none)Category word. Be specific: 'pomeranian' not 'dog'

--presetstandardquick/standard/advanced — adjusts steps and rank together

Always use --class-word, and be specific:

Using “dog” works but can produce the wrong breed or size. Use --class-word pomeranian and the model only needs to learn which Pomeranian — it already knows the breed. This was one of the clearest improvements in my tests.

Class word comparison grid — dog vs pomeranian across 4 steps

Klein 4B, same dataset. Top: 'dog' — generic, inconsistent size. Bottom: 'pomeranian' — correct breed from step 0. The class word gives the model a head start it never loses.

Step 4: Read your samples

During training, modl generates sample images every ~200 steps using 10 hardcoded test prompts. These test identity, generalization, style diversity, and bleed. They’re the best way to know when training is done — or when something is wrong.

Klein 4B training progression grid — 3 prompts across 4 steps

Klein 4B (rank 32, pomeranian): three prompts tracked across 2,000 steps. Identity locks in by step 600, refines through step 2000.

What to look for

Steps 0-500: Vague resemblance starts appearing. If nothing is recognizable, check your dataset.

Steps 500-1500: Identity strengthens. You should recognize the subject in most samples. With prodigy on Z-Image, you may already have excellent results at step 900.

Steps 1500+: For adamw models (Klein, SDXL, Schnell), best results typically land in this range. With prodigy (Z-Image), 1500+ may already be diminishing returns — check earlier checkpoints.

Don’t overtrain

More steps isn’t always better. With prodigy on Z-Image, I got excellent results at step 900. Training to 3000 didn’t damage it — no overfitting — but it didn’t meaningfully improve either. That was 2 extra hours of GPU time for nothing. Your mileage may vary with different datasets, but check your samples as they come in and stop when you’re happy.

Tip:

modl saves checkpoints every ~200 steps. If step 1200 looks great but step 2000 is overcooked, use the earlier checkpoint. You don’t have to use the final one.

The bleed check

The last two sample prompts don’t include your trigger word:

“a portrait of a woman smiling”
“a golden retriever in a garden”

These are hardcoded by modl. When training a dog LoRA, the retriever prompt is the meaningful check — if that retriever starts looking like your dog, the LoRA is bleeding into unrelated prompts. The woman prompt confirms the LoRA isn’t leaking across completely unrelated subjects.

Bleed check grid — step 0 vs step 3000, trigger vs non-trigger prompts

Top: step 0 (untrained). Bottom: step 3000 (fully trained). Left: trigger active — correct identity. Middle and right: no trigger — completely different subjects. Zero bleed after 3000 steps.

Step 5: Use your LoRA

$ modl generate 'a photo of OHWX pomeranian at the beach, sunset' \

--lora my-character-v1

Tip:

Klein trick: Combine a trained LoRA with a reference image for even better results. Klein supports multi-image reference natively — a weaker LoRA (strength 0.8) plus a good reference photo can beat a strong LoRA alone.

Recommended settings by model

Quick reference

SDXL1500161e-4adamw8bitFast, reliable baseline

Z-Image1500-200016-32auto (1.0)prodigyBest realism. prodigy is essential

Klein 4B200016-325e-5adamw8bitLR auto-clamped. Sensitive above 5e-5

Klein 9B2000161e-4adamw8bitQuantized on 24GB: 13 GB, 6.1 s/step

Flux Schnell1500164e-4adamw8bit4-step fast inference

What I learned the hard way

Eight of my runs failed before I got this right. I’m listing these in order of how likely you are to hit them — the first two affect everyone, the rest are edge cases that modl now handles automatically.

Z-Image + adamw8bit = no convergence. This is the big one. I trained Z-Image with adamw8bit and got the wrong breed at every step count. The community spent months on this. Switching to prodigy fixed it immediately — correct identity from step 600. modl now defaults to prodigy for Z-Image Base.

--class-word dog produced the wrong breed. The model sometimes generated a larger dog — it learned the identity but not the body type. Using --class-word pomeranian fixed the proportions immediately. Be as specific as your subject allows.

Florence-2 hallucinated on people. I captioned a person dataset with Florence-2 and got “sentenced to life in prison” and “just won a boxing match” — completely fabricated. Qwen3-VL fixed it instantly. Florence-2 is fine for objects and scenes, not for people.

Stale GPU memory killed 3 runs in a row. I queued Klein 4B → Klein 9B → Z-Image. All three OOMed because a previous modl serve worker was holding 14.7 GB of GPU memory. Fix: check nvidia-smi before training and kill stale processes.

Klein 9B OOMed without quantization. The 9B base model at bf16 plus the Qwen3-8B text encoder exceeds 23 GB — past the 4090’s limit. With quantization enabled: 13 GB, plenty of room, but 2.5x slower (6.1 s/step). Klein 4B is often just as good and much faster.

Klein trained on the wrong HuggingFace repo. Both Klein 4B and 9B crashed with a 404 — ai-toolkit downloaded from the distilled repo instead of the base repo. If using ai-toolkit directly, double-check the repo URL points to the base variant.

Qwen3-8B tokenizer wasn’t cached. Klein 9B crashed because ai-toolkit needed the full tokenizer directory. Make sure config.json, tokenizer.json, etc. are present — not just the safetensors weights.

Resume used the wrong checkpoint. --resume maxi-zimage.safetensors (final file, no step suffix) defaulted to step 0, restarting from scratch. Fix: always use numbered checkpoints like my-lora_000001350.safetensors so the trainer knows where to resume from.

Troubleshooting

Samples look nothing like my subject — Check your dataset. Are the images varied enough? Are captions accurate? Try fewer, higher-quality images (15 good photos > 50 bad ones).

Samples oscillate between overfit and underfit — Increase rank from 16 to 32. More capacity = more stable convergence.

Z-Image doesn’t converge — Make sure prodigy is the optimizer (modl sets this by default). If you manually set --optimizer adamw8bit, switch to prodigy.

Klein LoRA looks distorted — Klein 4B is sensitive to learning rate. modl clamps to 5e-5 automatically. If you overrode --lr, lower it.

“OHWX” appears as text in images — Text-capable models sometimes render the trigger word literally. Cosmetic — doesn’t affect training.

OOM during training — Close other GPU apps. Check nvidia-smi for stale processes. Klein 9B needs quantization on 24GB (modl enables this automatically).

Cheat sheet

Everything above, compressed into one card. Pin it or screenshot it.

Open full-size version for sharing →

Character LoRA — Quick Reference

RTX 4090 benchmarks · 24-image dataset · modl v0.2.8+

modl

Dataset

Images20–30 varied

CaptionsQwen3-VL (~5s/image)

Caption style1–2 sentences, natural language

Trigger wordTraining config only

Class wordBe specific (pomeranian, not dog)

Training

StepsAuto (modl calculates per model)

Rank16 (simple) / 32 (complex)

Z-Image optimizerprodigy (essential)

Klein trickTrain base → infer distilled

EMAOFF (slows training, minimal benefit for LoRAs)

SDXL
7 GB · 28 min

Klein 4B ★
15 GB · 85 min

Klein 9B
13 GB qnt · 3.3h

Schnell
20 GB · 64 min

Z-Image
17 GB · 2 hrs

modl.run/guides/train-character-lora

What’s next

The biggest surprise was the prodigy optimizer. One setting change turned Z-Image from “doesn’t work for characters” into the best realism I’ve tested. The class word specificity was the second — “pomeranian” vs “dog” is a free quality boost that costs nothing.

Things I’d explore next: LoKr instead of LoRA (community reports even better character capture on Z-Image), training at 512px resolution first then multi-res (some users swear by it), and pushing these LoRAs to the hub for cloud inference.

If you train a character LoRA and find something these benchmarks missed — a better optimizer setting, a model that surprised you, a failure mode I didn’t hit — I’d like to hear about it. Open an issue at github.com/modl-org/modl or share your results on Reddit.

Once you have a LoRA you’re happy with, modl push lora my-character-v1 --name my-character uploads it to the hub. For style training, see the Style LoRA guide.