
Train Your First Style LoRA

Go from a folder of kids' drawings to a working SDXL style LoRA — dataset creation, captioning strategy, training, and testing.

Mar 3, 2026 · 12 min read

What you’ll build

By the end of this guide you’ll have a style LoRA trained on SDXL that turns any prompt into children’s crayon-style art. The same workflow applies to any visual style — watercolor, pixel art, ink drawings, whatever you have images of.

Step 0 — Base SDXL: mountain landscape generated with base SDXL, no LoRA.
Step 14k — With LoRA: the same prompt with the trained style LoRA applied.

Same prompt, same seed. The LoRA transforms the output into the children's drawing style from your dataset.

We’re using a real dataset: 47 kids’ drawings sorted by emotion (happy, sad, angry, fearful). The entire process takes about 20 minutes of active work plus training time.

Style vs. Subject LoRA:

A style LoRA learns a visual aesthetic (line quality, color palette, texture) rather than a specific object or person. This changes how you caption, how many steps you need, and what rank to use.

Prerequisites

  • modl installed — curl -fsSL https://modl.run/install.sh | sh
  • A base model — we’ll use SDXL: modl pull sdxl-base-1.0
  • GPU with 12+ GB VRAM — modl auto-quantizes for 24GB cards, works on 12GB with reduced batch size
  • 30–100 images representing the style you want to capture

1. Create a dataset

Point modl at a folder of images. It copies them into the managed dataset directory, normalizes formats, and sets up the structure for captioning.

$ modl dataset create kids-art --from ~/drawings/
✓ 47 images → ~/.modl/datasets/kids-art/

Your dataset is now at ~/.modl/datasets/kids-art/ with all images ready for captioning.

Tip:

For style LoRAs, variety matters more than quantity. 30–60 diverse images showing different subjects in the same style work better than 200 images of the same thing. Our 47 images cover faces, houses, vehicles, families, and landscapes — all drawn in crayon by kids.

Kids art dataset — 47 children's drawings showing houses, families, faces, and vehicles in crayon style

The kids-art dataset: 47 drawings across four emotion categories — happy, sad, angry, fearful. Diverse subjects, consistent style.
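If your filenames encode categories the way this dataset does (the happy_/sad_/angry_/fearful_ prefixes above), a few lines of Python can confirm the dataset is reasonably balanced before you caption it. This is a minimal sketch that assumes that underscore-prefix naming convention; adapt the prefixes and extensions to your own files.

```python
from collections import Counter
from pathlib import Path

def category_counts(dataset_dir, exts=(".png", ".jpg", ".jpeg")):
    """Tally images by the category prefix in their filenames."""
    counts = Counter()
    for img in Path(dataset_dir).iterdir():
        if img.suffix.lower() in exts:
            counts[img.stem.split("_")[0]] += 1  # "happy_h1" -> "happy"
    return counts
```

Run it against `~/.modl/datasets/kids-art/` (or your source folder) and check that no single category dominates — variety across subjects and categories is what makes a style LoRA generalize.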

2. Caption your images

Captioning is where style LoRAs diverge from subject LoRAs. With a subject LoRA, you describe what the subject looks like. With a style LoRA, you describe what’s in the image without mentioning the style itself.

$ modl dataset caption kids-art --style
▸ Captioning 47 images with style-aware prompts...
✓ 47/47 captions written

The --style flag tells the captioner to describe content, not appearance. Here’s what the captions look like:

happy_h1.txt
A family standing in front of a house
happy_h3.txt
House with a rainbow in the sky
angry_a3.txt
A farm scene with a tractor and a house
sad_s22.txt
A woman’s face with blue eyes
happy_h5.txt
A family with a rainbow and a house
angry_a9.txt
A robot with a speech bubble

Notice: no mention of “crayon”, “kid drawing”, “childlike”, or “pencil art”. Just the content. This is the key insight.

Why captions matter so much

This is the most important concept in style LoRA training. The caption tells the model what the text encoder should already handle vs. what the LoRA needs to learn. Get this wrong and your LoRA either doesn’t work or only works in narrow scenarios.

❌ Captions with style words

"A crayon pencil drawing of a family standing in front of a house"
"A kid's drawing of a house with a rainbow, childlike art style"
LoRA results trained with style words in captions — style only activates with specific prompt words

Results from a LoRA trained with style descriptors in the captions. The style only shows up when you explicitly prompt for 'kid drawing' or 'crayon art'.

What happens

  • The model attributes the visual style to the words “crayon”, “kid’s drawing”
  • The LoRA has less to learn — the text encoder already “explains” the look
  • Result: the LoRA only activates strongly when you use those exact style words in your prompt
  • Without the style words, the LoRA barely changes the output
  • You’ve trained a “caption helper”, not a style layer

✓ Good: Captions without style words

"A family standing in front of a house"
"House with a rainbow in the sky"
LoRA results trained with content-only captions — style applies universally to any prompt

Results from a LoRA trained with content-only captions. The kids' art style applies to every prompt — even concepts nowhere in the training data.

What happens

  • There’s a gap between what the caption describes (normal scene) and what the image looks like (crayon art)
  • The LoRA must learn the visual difference — that’s the style
  • The style becomes the LoRA’s “default lens” for everything
  • Any prompt gets the style treatment — “a cyberpunk city” becomes crayon cyberpunk
  • The trigger word (OHWX) gives you an on/off switch

The rule:

If the style is in the caption, the LoRA doesn’t learn it. If the style is not in the caption, the LoRA has to learn it. The gap between “what the text says” and “what the image shows” is exactly what the LoRA learns to fill.

This is why the --style flag exists in modl dataset caption — it instructs the vision model to describe content only, stripping out any style descriptors. You can manually review and edit the .txt files alongside each image if you want finer control.
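Even with --style, it's worth auditing the generated .txt files before you commit to a training run. Below is a minimal Python sketch that flags captions containing style words; the STYLE_WORDS set is an illustrative starting point for this crayon dataset, not an exhaustive filter — extend it for your own style.

```python
import re
from pathlib import Path

# Words that would leak the style into the text encoder (illustrative list).
STYLE_WORDS = {"crayon", "drawing", "childlike", "sketch", "doodle", "pencil", "kid"}

def audit_captions(dataset_dir):
    """Return {caption filename: [offending words]} for captions that mention the style."""
    offenders = {}
    for txt in Path(dataset_dir).glob("*.txt"):
        words = set(re.findall(r"[a-z']+", txt.read_text().lower()))
        hits = sorted(words & STYLE_WORDS)
        if hits:
            offenders[txt.name] = hits
    return offenders
```

An empty result means every caption describes content only; any hits are files worth hand-editing before training.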

3. Train the LoRA

With the dataset captioned, training is a single command. modl handles the Python runtime, ai-toolkit configuration, and VRAM optimization.

$ modl train --dataset kids-art --base sdxl-base-1.0 --type style --name kids-art-v1
▸ Base model: SDXL Base 1.0
▸ LoRA type: style
▸ Preset: standard
▸ Images: 47 → Steps: 7990 Rank: 64 LR: 1e-4
▸ Trigger word: OHWX
▸ Quantize: yes (24GB VRAM detected)
 
▸ Step 1000/7990 loss=0.089 lr=1.0e-4
▸ Step 2000/7990 loss=0.064 lr=1.0e-4
▸ Step 4000/7990 loss=0.048 lr=1.0e-4
▸ Step 7990/7990 loss=0.042 lr=1.0e-4
✓ Saved kids-art-v1.safetensors (86 MiB)
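If you want to plot the loss curve later, the progress lines above are easy to parse. A minimal sketch — the line format is copied from the run output above, so adjust the regex if modl's output differs on your version:

```python
import re

# Sample lines in the format printed by the training run above.
LOG = """\
Step 1000/7990 loss=0.089 lr=1.0e-4
Step 2000/7990 loss=0.064 lr=1.0e-4
Step 4000/7990 loss=0.048 lr=1.0e-4
Step 7990/7990 loss=0.042 lr=1.0e-4
"""

def parse_loss(log_text):
    """Extract (step, loss) pairs from training progress lines."""
    pat = re.compile(r"Step (\d+)/\d+ loss=([\d.]+)")
    return [(int(step), float(loss)) for step, loss in pat.findall(log_text)]

points = parse_loss(LOG)
assert points[0] == (1000, 0.089) and points[-1] == (7990, 0.042)
```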

Let’s break down what modl chose and why.

Understanding the training presets

modl uses opinionated presets to calculate training parameters. For this run (47 images, style type, SDXL, Standard preset), the math works out as:

| Parameter | Value | Why |
| --- | --- | --- |
| Steps | 7,990 | 170 steps/image × 47 images = 7,990. Standard style uses ~170 steps/img for heterogeneous datasets. |
| Rank | 64 | Standard style preset. Higher rank = more capacity to represent complex styles. Quick uses 32, Advanced uses 128. |
| Learning rate | 1e-4 | Standard for SDXL/Flux LoRA training. Consistent across presets. |
| Optimizer | AdamW 8-bit | Memory-efficient optimizer. Cuts VRAM usage with negligible quality loss. |
| Quantize | Yes | Loads the base model in 8-bit. Required for 24 GB VRAM, optional for 48 GB+. |

Preset comparison for style LoRAs (SDXL/Flux)

| Preset | Steps/img | Rank | Step range |
| --- | --- | --- | --- |
| Quick | 100 | 32 | 3,000 – 8,000 |
| Standard | 170 | 64 | 4,000 – 20,000 |
| Advanced | 200 | 128 | 6,000 – 30,000 |

When to use each:

Start with Standard. If you’re experimenting and want fast iteration, use Quick (add --preset quick). Advanced is for large, diverse datasets where you need the model to absorb more variation.
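The preset arithmetic is simple enough to sketch in a few lines of Python. The numbers below are taken from the tables in this guide; modl's actual internals may compute things differently.

```python
# Preset values for style LoRAs on SDXL/Flux, as documented in this guide.
STYLE_PRESETS = {
    "quick":    {"steps_per_img": 100, "rank": 32},
    "standard": {"steps_per_img": 170, "rank": 64},
    "advanced": {"steps_per_img": 200, "rank": 128},
}

def plan_training(num_images, preset="standard"):
    """Estimate total steps and rank for a style LoRA run."""
    p = STYLE_PRESETS[preset]
    return {"steps": p["steps_per_img"] * num_images, "rank": p["rank"], "lr": 1e-4}

print(plan_training(47))  # the run above: 170 * 47 = 7,990 steps at rank 64
```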

Steps/image varies by architecture

The 170 steps/img number is for SDXL and Flux. Other architectures converge differently:

| Model | Steps/img | LR | Rank | Notes |
| --- | --- | --- | --- | --- |
| SDXL | 150–200 | 1e-4 | 64 | UNet + dual text encoders, mature architecture |
| Flux | 150–200 | 1e-4 | 64 | Similar capacity to SDXL at the same rank |
| Z-Image | 25–75 | 1e-4 max | 16–32 | Distilled model, converges ~3–5× faster per step |
| Chroma | ~same as Flux | 1e-4 | 64 | Same backbone as Flux |
| Qwen-Image | unknown | 2e-4 | 16 | VLM-based; uses literal captions, no trigger word |

4. Preview your results

Generate a quick preview to see what your LoRA learned.

$ modl generate "a castle on a hill with a dragon flying overhead" --lora kids-art-v1
▸ Loading sdxl-base-1.0 + kids-art-v1...
▸ Generating ████████████████ 28/28 steps
✓ Generated 1 image:
~/.modl/outputs/2026-03-03/001.png

The prompt describes a scene the model has never seen in the training data — but the output should still have the crayon art style. That’s how you know the LoRA captured style rather than content.

5. Evaluate your LoRA

A trained LoRA is a starting point. Now test it systematically.

Find the sweet spot strength

The LoRA weight controls how strongly the style applies. Try the same prompt at different weights:

  • 0.3 — Subtle texture. A hint of the style; good for blending with photorealism.
  • 0.6 — Balanced. Clear style without losing composition control.
  • 1.2 — Overdriven. May produce repetitive artifacts; tests the ceiling.

You’ll probably find 0.6–0.9 is the most usable range.
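Conceptually, the weight is the scale applied to the low-rank update that gets merged into each base weight matrix: at 0 you have the untouched base model, at 1.0 the full trained delta. A toy numpy sketch — shapes are illustrative, not SDXL's, and real implementations typically also fold in an alpha/rank factor:

```python
import numpy as np

d, rank = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))     # frozen base weight matrix
A = rng.normal(size=(rank, d))  # trained low-rank LoRA factors
B = rng.normal(size=(d, rank))

def apply_lora(W, A, B, scale):
    # scale=0 -> base model unchanged; scale=1.0 -> full-strength style delta
    return W + scale * (B @ A)

assert np.allclose(apply_lora(W, A, B, 0.0), W)
# A 0.6 weight moves the weights proportionally less than full strength:
delta_06 = np.linalg.norm(apply_lora(W, A, B, 0.6) - W)
delta_10 = np.linalg.norm(apply_lora(W, A, B, 1.0) - W)
assert delta_06 < delta_10
```

This is why intermediate weights blend rather than switch: the update is linear in the scale, so 0.6 literally applies 60% of the learned style delta.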

Check for overfitting

Test prompts that are far from your training data. If they still get the style while keeping the concept readable, your LoRA is solid:

"a cyberpunk city at night"
"an astronaut riding a horse"
"a medieval castle interior"
"a dragon made of stained glass"

Good sign: The concept is clearly identifiable but rendered in crayon style. Lines are wobbly, colors are bright and flat, textures look hand-drawn.

Bad sign: Every output looks the same regardless of prompt, faces/shapes collapse into similar blobs, or the LoRA only works with specific prompt words.

Stress-test composition

Try prompts with clear structural requirements:

"a person holding a sign that says HELLO"
"a map of an island with labels"
"a room with a table, chair, window, lamp"
"a comic panel with two characters talking"

Style LoRAs (especially children’s art) often break text rendering and precise layout. Seeing how it breaks tells you what to improve — better captioning, more dataset variety, or adding ControlNet for structure.
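One low-tech way to run the whole grid is to build one modl command per prompt and paste the list into your shell. The sketch below only uses the generate flags shown in this guide; shlex.quote keeps multi-word prompts shell-safe.

```python
import shlex

# Stress-test prompts from the section above.
PROMPTS = [
    "a person holding a sign that says HELLO",
    "a map of an island with labels",
    "a room with a table, chair, window, lamp",
    "a comic panel with two characters talking",
]

def build_commands(lora="kids-art-v1"):
    """Build one shell-safe `modl generate` invocation per stress-test prompt."""
    return [f"modl generate {shlex.quote(p)} --lora {lora}" for p in PROMPTS]

for cmd in build_commands():
    print(cmd)
```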

6. Experiments to learn more

Once your first LoRA works, these experiments teach you the most about how training parameters affect output.

Find where training should’ve stopped

Style typically locks in around ~60–70% of total steps. The later steps add refinement but risk making outputs same-y. For our 8k run, that’s around steps 4,800–5,600.

Try re-training with fewer steps and compare:

$ modl train --dataset kids-art --base sdxl-base-1.0 --type style --name kids-art-early --preset quick
▸ Images: 47 → Steps: 4700 Rank: 32 LR: 1e-4

Often the earlier checkpoint is more flexible — less overfit, works with more diverse prompts.

Two variants to compare

These two re-trains teach the most about rank and steps tradeoffs:

Variant A — More flexible

  • Rank: 32
  • Steps: 6,000–8,000
  • LR: 8e-5

Lower capacity, earlier stopping. Produces subtler, more adaptable style that blends well with other LoRAs and detailed prompts.

Variant B — Stronger stylization

  • Rank: 64
  • Steps: 8,000–10,000
  • LR: 1e-4

More capacity, longer training. Stronger style takeover, but may lose flexibility on compositionally complex prompts.

Mix with another LoRA

A good test of whether your LoRA acts as a “true style layer” — combine it with a subject LoRA:

  • Your kids LoRA at 0.6–0.9 weight
  • A subject LoRA (cats, spaceships, architecture) at 0.2–0.5 weight

If the subject comes through clearly but rendered in kids’ style, you have a clean style LoRA. If everything gets overridden, try lowering the weight or re-training at lower rank.

Quick reference

The full workflow

# 1. Create dataset from your images
$ modl dataset create my-style --from ~/my-images/
 
# 2. Caption for style (content only, no style words)
$ modl dataset caption my-style --style
 
# 3. Train
$ modl train --dataset my-style --base sdxl-base-1.0 --type style --name my-style-v1
 
# 4. Generate
$ modl generate "your prompt here" --lora my-style-v1

Key rules for style LoRAs

  1. Never put the style in captions. Use --style flag or manually write content-only descriptions.
  2. Diverse subjects, consistent style. 30–60 images showing different things in the same style beat 200 images of the same subject.
  3. Start with Standard preset. 170 steps/img at rank 64 with 1e-4 LR is a solid baseline for SDXL/Flux.
  4. Test at 0.6–0.9 weight. Full weight (1.0) is often too strong for style LoRAs.
  5. Test far-from-dataset prompts. The best style LoRAs transform any subject, not just ones similar to training data.

Next steps

Once you’ve trained and tested your style LoRA, try training on Flux for comparison — same steps/img ratio, often cleaner results on complex compositions. Or try Z-Image for much faster iteration (3-5× faster convergence).