
Datasets

Create, caption, and manage training datasets for LoRA training with modl

Mar 25, 2026

Quick start

Two commands to go from a folder of images to a captioned dataset ready for training:

# Create a dataset from your images
$ modl dataset create my-dataset --from ~/my-images/
✓ 25 images → ~/.modl/datasets/my-dataset/
 
# Caption every image
$ modl dataset caption my-dataset
✓ 25/25 captions written

Or do it all in one shot with prepare — it creates, resizes, and captions in a single pipeline:

$ modl dataset prepare my-dataset --from ~/my-images/
✓ 25 images imported
✓ 25 images resized to 1024px
✓ 25/25 captions written

The prepare command accepts the same flags as its individual steps:

Flag          Default  Description
--model       qwen     Captioning model: qwen, florence-2, blip
--style       off      Style-mode captioning (describe content, not aesthetics)
--resolution  1024     Resize longest edge to this value in pixels
--overwrite   off      Re-caption images that already have .txt files

Your dataset is now ready for modl train.

What’s on disk

Datasets live at ~/.modl/datasets/<name>/. There’s no database — it’s all filesystem. Each image gets a paired .txt caption file with the same name:

$ ls ~/.modl/datasets/my-dataset/
photo_001.jpg
photo_001.txt
photo_002.png
photo_002.txt
photo_003.jpg
photo_003.txt
...

The .txt file contains the caption — one line of text that describes the image. You can edit these by hand at any time. No rebuild step, no re-indexing. Just edit the text file and train.

If you run caption on a dataset that already has captions, existing .txt files are skipped — only uncaptioned images get new captions. Use --overwrite to re-caption everything.

When to edit captions manually: After auto-captioning, spot-check 5-10 files. Fix hallucinated details, remove leaked style descriptions, and correct any misidentified subjects. Run modl dataset validate to check caption coverage, then open any uncaptioned or short-caption files first.
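Because captions are plain text, bulk fixes are shell one-liners. As a sketch (assumes GNU sed; strip_phrase is a hypothetical helper, not a modl command):

```shell
# Strip a leaked phrase from every caption file in a dataset directory (in place)
strip_phrase() {
  local phrase="$1" dir="$2"
  sed -i "s/${phrase}//g" "$dir"/*.txt
}

# Example: remove a leaked style description from a style dataset
# strip_phrase "watercolor painting of " ~/.modl/datasets/my-style
```

The same approach works for fixing a consistently misidentified subject across the whole dataset.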

Plain files, no lock-in:

Because datasets are just images + text files in a folder, you can use them with any training tool. There’s no proprietary format. Move them, copy them, version them with git — whatever works for you.

If you organize images into subfolders, the folder name becomes a tag prefix in the caption. This is useful for style datasets organized by category (e.g., happy/, sad/, landscape/).

Captioning models

modl ships three captioning backends. The right choice depends on your dataset type and GPU.

Model       VRAM     Speed   Quality  Best for
Florence-2  ~1.5 GB  Fast    Generic  Style datasets (with --style)
BLIP        ~6 GB    Medium  Good     General use
Qwen3-VL    ~4 GB    Medium  Best     Character datasets

# Use a specific captioning model
$ modl dataset caption my-dataset --model qwen
$ modl dataset caption my-dataset --model florence-2
$ modl dataset caption my-dataset --model blip

Florence-2 is the fastest option and uses minimal VRAM. Good enough for style datasets where you’re using the --style flag anyway. One warning: it hallucinates narratives on photos of people — it will invent emotions, relationships, and backstories that aren’t there. Don’t use it for character datasets.

BLIP (Salesforce BLIP-2) is a solid middle ground. Slightly better accuracy than Florence-2, especially for scene descriptions. Uses more VRAM.

Qwen3-VL produces the best captions overall. It follows instructions well, produces accurate and concise descriptions, and doesn’t hallucinate on people. It also uses less VRAM than BLIP (~4 GB vs ~6 GB) — this isn’t a tradeoff, it’s just the better default unless you need Florence-2’s speed. This is the recommended choice for character/subject LoRAs where caption accuracy directly affects training quality.

Same image, three models

To see the difference yourself, here’s the same photo captioned by each model:

# Florence-2
The image shows a man sitting at a table with a plate of food in front of
him, surrounded by photo frames and other objects. In the background, there
is a wall with a window and a door.
 
# BLIP
a man sitting in front of a television with a picture of him on it
 
# Qwen3-VL
A man with blonde hair and a beard, wearing a dark jacket, is seated and
gesturing with his right hand, appearing to speak or explain something. He
is in a studio setting with a glass wall and a framed portrait visible in
the background.

Florence-2 hallucinated a plate of food and a door that aren’t there. BLIP got the general scene but missed the details. Qwen3-VL nailed the specifics — blonde hair, beard, dark jacket, gesturing, studio setting — the kind of precision that matters when training a LoRA on someone’s face.

Tip:

For character datasets, always use Qwen3-VL. The accuracy difference matters — a hallucinated caption teaches the LoRA the wrong thing.

Character LoRA datasets

Character LoRAs teach a model to render a specific person, animal, or object. Caption accuracy is critical here — you’re teaching the model what this subject looks like.

Image selection

Aim for 15-30 images. Vary these across your set:

  • Poses (front, side, three-quarter, full body, close-up)
  • Lighting (natural, studio, outdoor, indoor)
  • Backgrounds (plain, varied environments)
  • Expressions (neutral, smiling, serious)
  • Clothing (different outfits prevent the LoRA from baking in one look)

Captioning

Use Qwen3-VL for character datasets. It describes what it actually sees without inventing details.

$ modl dataset caption my-character --model qwen

Trigger words go in training, not captions:

Don’t put trigger words (like OHWX) in your caption files. The ai-toolkit training pipeline injects trigger words automatically during training. If you put them in captions manually, they’ll be doubled.

Augment with face-crop

Small dataset? Use face-crop to augment it. It detects faces in your images and creates tightly cropped close-up versions, effectively giving the model more face detail to learn from.

# Create face crops from existing images
$ modl dataset face-crop my-character
✓ Detected 22 faces in 18 images
✓ 22 cropped images added to dataset
 
# Control crop padding (default 1.8 = head+shoulders)
$ modl dataset face-crop my-character --padding 2.5 # upper body
 
# Add trigger/class words to face crop captions
$ modl dataset face-crop my-character --trigger OHWX --class-word dog

The --padding option is a bbox expansion multiplier: 1.0 = tight face, 1.8 = head and shoulders (default), 2.5 = upper body. The --trigger and --class-word options are injected into the generated captions for the cropped images:

# Original caption (photo_001.txt)
A man with a beard and short dark hair, wearing a dark suit and shirt,
sits at a desk in a studio setting.
 
# Generated face-crop caption (photo_001_facecrop_0.txt)
a close-up photo of OHWX man, A man with a beard and short dark hair,
wearing a dark suit and shirt, sits at a desk in a studio setting.

The face crop prepends "a close-up photo of [trigger] [class-word], " to the original caption. The filename gets a _facecrop_0 suffix.
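The padding expansion itself is simple arithmetic. The sketch below shows one plausible interpretation (center-anchored scaling, clamped to the image bounds), not modl's actual implementation; expand_bbox is a hypothetical helper, and the multiplier is passed as an integer percentage to keep the shell arithmetic integral:

```shell
# Expand a face bbox (x y w h) around its center by pad_pct/100,
# then clamp the result to the image dimensions.
expand_bbox() {
  local x=$1 y=$2 w=$3 h=$4 pad_pct=$5 img_w=$6 img_h=$7
  local new_w=$(( w * pad_pct / 100 ))
  local new_h=$(( h * pad_pct / 100 ))
  local new_x=$(( x + w / 2 - new_w / 2 ))
  local new_y=$(( y + h / 2 - new_h / 2 ))
  (( new_x < 0 )) && new_x=0
  (( new_y < 0 )) && new_y=0
  (( new_x + new_w > img_w )) && new_w=$(( img_w - new_x ))
  (( new_y + new_h > img_h )) && new_h=$(( img_h - new_y ))
  echo "$new_x $new_y $new_w $new_h"
}

# A 100x100 face at (100, 100) with --padding 1.8 in a 1024x1024 image:
# expand_bbox 100 100 100 100 180 1024 1024  ->  60 60 180 180
```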

Style LoRA datasets

Style LoRAs learn a visual aesthetic — line quality, color palette, texture, rendering approach. The dataset strategy differs significantly from character LoRAs — you need more images, different captioning, and the training dynamics change because the model is learning a global transformation rather than a specific subject.

Image selection

Aim for 50-200 images. The key principle: consistent style, varied content.

  • All images should share the same visual style
  • Subjects should be as diverse as possible (people, objects, landscapes, abstract)
  • More variety in content = more flexible LoRA

Captioning with --style

The --style flag is critical for style datasets. It tells the captioner to describe what’s in the image without mentioning how it looks. This forces the LoRA to learn the visual gap between “normal” caption and “stylized” image.

$ modl dataset caption my-style --style
✓ 85/85 captions written (style mode)

Florence-2 with --style is fine for style datasets. When you have 100+ images, speed matters more than per-caption perfection, and the --style flag does the heavy lifting.

Tip:

For the full explanation of why style captions should omit style words, see the Train Your First Style LoRA guide — the “Why captions matter so much” section covers this in detail.

Subfolder organization

If your style images have natural categories, organize them into subfolders. The folder name becomes a tag prefix in the caption:

$ ls ~/.modl/datasets/kids-art/
happy/
sad/
angry/
fearful/

This gives the model category awareness during training — it learns both the overall style and the emotional subcategories.
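Since subfolders are ordinary directories, you can check the category breakdown with standard tools before training; a sketch (category_counts is a hypothetical helper, assuming .jpg/.png images):

```shell
# Print "<category>: <image count>" for each subfolder of a dataset
category_counts() {
  local root="$1" d n
  for d in "$root"/*/; do
    [ -d "$d" ] || continue
    n=$(find "$d" -maxdepth 1 \( -name '*.jpg' -o -name '*.png' \) | wc -l)
    printf '%s: %s\n' "$(basename "$d")" "$((n))"
  done
}

# category_counts ~/.modl/datasets/kids-art
```

A badly skewed breakdown (say, 80 happy/ images and 5 sad/) is worth fixing before training, for the same reason repetitive poses hurt character datasets.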

Tagging

Tagging is different from captioning. Captions are natural-language descriptions (“a woman sitting at a café table”). Tags are structured labels (“1girl, sitting, café, red_dress, indoor”) — the format used by booru-style training datasets and many anime/illustration models.

$ modl dataset tag my-dataset
✓ 25/25 images tagged (florence-2)

By default, tag uses Florence-2 for general-purpose tagging. For anime or illustration datasets, use WD Tagger — it produces the danbooru-style tags that anime models expect:

$ modl dataset tag my-dataset --model wd-tagger
✓ 25/25 images tagged (wd-tagger)

Tags are written to the same .txt files as captions. If you want both tags and captions, caption first, then tag with --append to add tags as a comma-separated suffix.

Tip:

Most users should stick with captioning. Tagging is mainly useful for anime-style LoRAs trained on models that were originally trained on tagged data.

Command reference

Create & import

$ modl dataset create <name> --from <path>
$ modl dataset prepare <name> --from <path> # create + resize + caption

Caption & tag

$ modl dataset caption <name> # default: qwen
$ modl dataset caption <name> --model florence-2 # faster, less accurate
$ modl dataset caption <name> --style # style mode (content only)
$ modl dataset caption <name> --overwrite # re-caption all
$ modl dataset tag <name> # structured labels (florence-2)
$ modl dataset tag <name> --model wd-tagger # anime/booru tags

Transform

$ modl dataset resize <name> # default: 1024px longest edge
$ modl dataset resize <name> --resolution 512
$ modl dataset face-crop <name> # generate face close-ups
$ modl dataset face-crop <name> --padding 2.5 --trigger OHWX --class-word dog

Manage

$ modl dataset validate <name> # check for issues
$ modl dataset ls # list all datasets
$ modl dataset rm <name> # delete a dataset

Best practices

Image quality matters. Remove blurry, watermarked, or duplicate images before training. Every bad image dilutes the dataset signal. A small, clean dataset beats a large, noisy one.

Caption quality over quantity. 20 well-captioned images produce better results than 100 poorly captioned ones. Spot-check your captions — open a few .txt files alongside the images and make sure the descriptions are accurate.

Always resize to training resolution. Mixed resolutions slow down training and waste VRAM on downscaling. The default is 1024px, which matches SDXL and Flux native resolution. The resize command fits images to the longest edge — a 1920×1280 landscape becomes 1024×683, preserving the original aspect ratio without cropping or padding.

$ modl dataset resize my-dataset
✓ 25 images resized to 1024px
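The fit math is worth seeing once. This sketch reproduces the longest-edge computation in plain integer arithmetic (fit_longest_edge is a hypothetical illustration, not part of modl; rounding to nearest is an assumption):

```shell
# Scale (w h) so the longest edge equals target, preserving aspect ratio.
# Images already at or below target are left unchanged.
fit_longest_edge() {
  local w=$1 h=$2 target=$3
  local long=$(( w > h ? w : h ))
  if (( long <= target )); then
    echo "$w $h"
    return
  fi
  # Multiply before dividing, adding long/2 to round to nearest
  echo "$(( (w * target + long / 2) / long )) $(( (h * target + long / 2) / long ))"
}

# fit_longest_edge 1920 1280 1024  ->  1024 683
```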

Review before training. After captioning, open the dataset folder and check 5-10 image-caption pairs. Look for hallucinated details, missed subjects, or style words that leaked into style-mode captions.

Run validate before training. The validate command checks your dataset and reports caption coverage — how many images have matching .txt files. It’s a quick sanity check before you start a training run.

$ modl dataset validate my-dataset
✓ Dataset is valid
Images: 25
Captions: 23 / 25 (92%)

If coverage is below 100%, find the uncaptioned images and either caption them with --overwrite or remove them from the dataset.
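Finding them is a filesystem walk, since every caption is just a sibling .txt file; a sketch (uncaptioned is a hypothetical helper, assuming .jpg/.png images):

```shell
# Print every image in a dataset directory that lacks a matching .txt caption
uncaptioned() {
  local dir="$1" img
  for img in "$dir"/*.jpg "$dir"/*.png; do
    [ -e "$img" ] || continue
    [ -f "${img%.*}.txt" ] || echo "$img"
  done
}

# uncaptioned ~/.modl/datasets/my-dataset
```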

Common dataset problems

These are the failure modes that show up most often in training:

  • Repetitive poses or angles — if 20 of your 25 photos are front-facing headshots, the LoRA will struggle with any other angle. Diversity matters more than quantity.
  • Inconsistent resolution — mixing 4K photos with 800px screenshots creates uneven training signal. Always resize first.
  • Style words in style-mode captions — if your style captions say “watercolor painting of a cat” instead of “a cat,” the LoRA learns nothing because the caption already describes the style. Check that --style was used.
  • Near-duplicate images — slightly different crops of the same shot look like variety but teach the model to memorize one scene. Remove perceptual duplicates, not just exact matches.
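Exact byte-for-byte duplicates are easy to catch with checksums (perceptual near-duplicates need an image-hashing tool, which is out of scope here). A sketch assuming GNU coreutils:

```shell
# Print groups of files whose contents are byte-identical.
# -w32 compares only the 32-character MD5 prefix of each line.
exact_dupes() {
  find "$1" -maxdepth 1 -type f \( -name '*.jpg' -o -name '*.png' \) \
    -exec md5sum {} + | sort | uniq -w32 --all-repeated
}

# exact_dupes ~/.modl/datasets/my-dataset
```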

Expanding small datasets

If you have fewer than 15 images for a character LoRA, face-crop is the first tool to reach for — it creates close-up variations from existing photos. Beyond that, you can use modl edit to create synthetic variations while preserving the subject:

# Generate a background variation from an existing photo
$ modl edit photo_001.jpg --prompt "same person standing in a park" --output park_variant.jpg

This works best when your originals are high quality but few — synthetic augmentation amplifies whatever signal is already there, noise included. Horizontal flips are free augmentation for symmetric subjects (skip them for text or logos).

Next steps

With your dataset ready, head to Train Your First Style LoRA for a full training walkthrough. Or jump straight to modl train --dataset your-dataset --base z-image --name my-lora-v1 and iterate from there.