
Datasets

Create, caption, and manage training datasets for LoRA training with modl

Mar 25, 2026 · 8 min read

Quick start

Two commands to go from a folder of images to a captioned dataset ready for training:

# Create a dataset from your images
$ modl dataset create my-dataset --from ~/my-images/
✓ 25 images → ~/.modl/datasets/my-dataset/
 
# Caption every image
$ modl dataset caption my-dataset
✓ 25/25 captions written

Or do it all in one shot with prepare — it creates, resizes, and captions in a single pipeline:

$ modl dataset prepare my-dataset --from ~/my-images/
✓ 25 images imported
✓ 25 images resized to 1024px
✓ 25/25 captions written

Your dataset is now ready for modl train.

What’s on disk

Datasets live at ~/.modl/datasets/<name>/. There’s no database — it’s all filesystem. Each image gets a paired .txt caption file with the same name:

$ ls ~/.modl/datasets/my-dataset/
photo_001.jpg
photo_001.txt
photo_002.png
photo_002.txt
photo_003.jpg
photo_003.txt
...

The .txt file contains the caption — one line of text that describes the image. You can edit these by hand at any time. No rebuild step, no recompilation. Just edit the text file and train.

Plain files, no lock-in:

Because datasets are just images + text files in a folder, you can use them with any training tool. There’s no proprietary format. Move them, copy them, version them with git — whatever works for you.

If you organize images into subfolders, the folder name becomes a tag prefix in the caption. This is useful for style datasets organized by category (e.g., happy/, sad/, landscape/).

Captioning models

modl ships three captioning backends. The right choice depends on your dataset type and GPU.

Model       VRAM     Speed   Quality  Best for
Florence-2  ~1.5 GB  Fast    Generic  Style datasets (with --style)
BLIP-2      ~6 GB    Medium  Good     General use
Qwen3-VL    ~4 GB    Medium  Best     Character datasets

# Use a specific captioning model
$ modl dataset caption my-dataset --model qwen
$ modl dataset caption my-dataset --model florence-2
$ modl dataset caption my-dataset --model blip-2

Florence-2 is the fastest option and uses minimal VRAM. Good enough for style datasets where you’re using the --style flag anyway. One warning: it hallucinates narratives on photos of people — it will invent emotions, relationships, and backstories that aren’t there. Don’t use it for character datasets.

BLIP-2 is a solid middle ground. Slightly better accuracy than Florence-2, especially for scene descriptions. Uses more VRAM.

Qwen3-VL produces the best captions overall. It follows instructions well, produces accurate and concise descriptions, and doesn’t hallucinate on people. This is the recommended choice for character/subject LoRAs where caption accuracy directly affects training quality.

Tip:

For character datasets, always use Qwen3-VL. The accuracy difference matters — a hallucinated caption teaches the LoRA the wrong thing.

Character LoRA datasets

Character LoRAs teach a model to render a specific person, animal, or object. Caption accuracy is critical here — you’re teaching the model what this subject looks like.

Image selection

Aim for 15-30 images. Vary these across your set:

  • Poses (front, side, three-quarter, full body, close-up)
  • Lighting (natural, studio, outdoor, indoor)
  • Backgrounds (plain, varied environments)
  • Expressions (neutral, smiling, serious)
  • Clothing (different outfits prevent the LoRA from baking in one look)

Captioning

Use Qwen3-VL for character datasets. It describes what it actually sees without inventing details.

$ modl dataset caption my-character --model qwen

Trigger words go in training, not captions:

Don’t put trigger words (like OHWX) in your caption files. The ai-toolkit training pipeline injects trigger words automatically during training. If you put them in captions manually, they’ll be doubled.

Augment with face-crop

Small dataset? Use face-crop to augment it. It detects faces in your images and creates tightly cropped close-up versions, effectively giving the model more face detail to learn from.

# Create face crops from existing images
$ modl dataset face-crop my-character
✓ Detected 22 faces in 18 images
✓ 22 cropped images added to dataset
 
# Control crop padding (default 1.8 = head+shoulders)
$ modl dataset face-crop my-character --padding 2.5 # upper body
 
# Add trigger/class words to face crop captions
$ modl dataset face-crop my-character --trigger OHWX --class-word dog

The --padding option is a bbox expansion multiplier: 1.0 = tight face, 1.8 = head and shoulders (default), 2.5 = upper body. The --trigger and --class-word options are used in generated captions for the cropped images.
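
The padding math itself is easy to reason about: scale the detected face box around its center by the multiplier, then clamp to the image edges. A sketch of that geometry (an illustration of the concept, not modl's actual implementation):

```python
def expand_bbox(x, y, w, h, padding, img_w, img_h):
    """Scale a face box (x, y, w, h) by `padding` about its center, clamped to the image."""
    cx, cy = x + w / 2, y + h / 2          # box center
    new_w, new_h = w * padding, h * padding
    left = max(0, cx - new_w / 2)
    top = max(0, cy - new_h / 2)
    right = min(img_w, cx + new_w / 2)
    bottom = min(img_h, cy + new_h / 2)
    return left, top, right, bottom
```

With padding 1.0 the crop is the detected face itself; at 1.8 the box grows 80% in each dimension, which is roughly head and shoulders.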

Style LoRA datasets

Style LoRAs learn a visual aesthetic — line quality, color palette, texture, rendering approach. The dataset strategy is fundamentally different from character LoRAs.

Image selection

Aim for 50-200 images. The key principle: consistent style, varied content.

  • All images should share the same visual style
  • Subjects should be as diverse as possible (people, objects, landscapes, abstract)
  • More variety in content = more flexible LoRA

Captioning with --style

The --style flag is critical for style datasets. It tells the captioner to describe what’s in the image without mentioning how it looks. This forces the LoRA to learn the visual gap between “normal” caption and “stylized” image.

$ modl dataset caption my-style --style
✓ 85/85 captions written (style mode)

Florence-2 with --style is fine for style datasets. When you have 100+ images, speed matters more than per-caption perfection, and the --style flag does the heavy lifting.

Tip:

For the full explanation of why style captions should omit style words, see the Train Your First Style LoRA guide — the “Why captions matter so much” section covers this in detail.

Subfolder organization

If your style images have natural categories, organize them into subfolders. The folder name becomes a tag prefix in the caption:

$ ls ~/.modl/datasets/kids-art/
happy/
sad/
angry/
fearful/

This gives the model category awareness during training — it learns both the overall style and the emotional subcategories.

Command reference

Dataset commands

# Create a dataset from a folder of images
$ modl dataset create <name> --from <path>
 
# One-shot: create + resize + caption
$ modl dataset prepare <name> --from <path>
 
# Caption all uncaptioned images in a dataset
$ modl dataset caption <name>
$ modl dataset caption <name> --style # style mode (content only)
$ modl dataset caption <name> --model qwen # specific model
$ modl dataset caption <name> --overwrite # re-caption all
 
# Auto-tag images with structured labels
$ modl dataset tag <name> # florence-2 default
$ modl dataset tag <name> --model wd-tagger # anime-focused
 
# Resize images to training resolution
$ modl dataset resize <name> # default: 1024px
$ modl dataset resize <name> --resolution 512
 
# Generate face crops for character datasets
$ modl dataset face-crop <name>
$ modl dataset face-crop <name> --padding 1.8 --trigger OHWX
 
# Validate dataset (check for issues)
$ modl dataset validate <name>
 
# List all datasets
$ modl dataset ls
 
# Delete a dataset
$ modl dataset rm <name>

Best practices

Image quality matters. Remove blurry, watermarked, or duplicate images before training. Every bad image dilutes the dataset signal. A small, clean dataset beats a large, noisy one.

Caption quality over quantity. 20 well-captioned images produce better results than 100 poorly captioned ones. Spot-check your captions — open a few .txt files alongside the images and make sure the descriptions are accurate.

Always resize to training resolution. Mixed resolutions slow down training and waste VRAM on downscaling. The default is 1024px, which matches SDXL and Flux native resolution.

$ modl dataset resize my-dataset
✓ 25 images resized to 1024px
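
The resize target applies to the long side: a 4000×3000 photo scales to 1024×768, preserving aspect ratio. A sketch of that arithmetic (the never-upscale and rounding behavior are assumptions, not modl's documented policy):

```python
def resize_dims(w, h, target=1024):
    """Scale (w, h) so the long side equals `target`, preserving aspect ratio."""
    if max(w, h) <= target:
        return w, h  # already small enough; assume no upscaling
    scale = target / max(w, h)
    return round(w * scale), round(h * scale)
```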

Review before training. After captioning, open the dataset folder and check 5-10 image-caption pairs. Look for hallucinated details, missed subjects, or style words that leaked into style-mode captions.

Dataset size guidelines

LoRA Type  Images  Notes
Character  15–30   Vary poses, lighting, backgrounds, clothing
Style      50–200  Consistent style, diverse subjects and compositions

Fewer images with good variety and accurate captions will always outperform a larger dataset of repetitive or poorly described images. When in doubt, curate harder.

Next steps

With your dataset ready, head to Train Your First Style LoRA for a full training walkthrough. Or jump straight to modl train --dataset your-dataset --base z-image --name my-lora-v1 and see what happens.