Quick start
Two commands to go from a folder of images to a captioned dataset ready for training:
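A minimal sketch, assuming a create subcommand that imports from a folder (the subcommand name and the --from and --model flags are assumptions; the qwen3-vl backend name follows the models described below):

```bash
# Import a folder of images into a named dataset, then caption it.
# Subcommand and flag spellings are illustrative, not authoritative.
modl dataset create mychar --from ./photos
modl dataset caption mychar --model qwen3-vl
```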
Or do it all in one shot with prepare — it creates, resizes, and captions in a single pipeline:
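Something like (same caveats on flag names):

```bash
# create + resize + caption in a single pipeline
modl dataset prepare mychar --from ./photos
```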
The prepare command accepts the same flags as its individual steps:
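For example, a style dataset run might pass the resize target and captioning options straight through prepare (the --size and --model flag names are assumptions; --style is documented below):

```bash
modl dataset prepare mystyle --from ./art --size 1024 --model florence-2 --style
```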
Your dataset is now ready for modl train.
What’s on disk
Datasets live at ~/.modl/datasets/<name>/. There’s no database — it’s all filesystem. Each image gets a paired .txt caption file with the same name:
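An illustrative layout for a dataset named mychar:

```
~/.modl/datasets/mychar/
├── img_001.png
├── img_001.txt
├── img_002.png
└── img_002.txt
```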
The .txt file contains the caption — one line of text that describes the image. You can edit these by hand at any time. No rebuild step, no re-indexing. Just edit the text file and train.
If you run caption on a dataset that already has captions, existing .txt files are skipped — only uncaptioned images get new captions. Use --overwrite to re-caption everything.
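For example (dataset name illustrative):

```bash
modl dataset caption mychar               # fills in missing .txt files only
modl dataset caption mychar --overwrite   # re-captions every image
```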
When to edit captions manually: After auto-captioning, spot-check 5-10 files. Fix hallucinated details, remove leaked style descriptions, and correct any misidentified subjects. Run modl dataset validate to check caption coverage, then open any uncaptioned or short-caption files first.
Because datasets are just images + text files in a folder, you can use them with any training tool. There’s no proprietary format. Move them, copy them, version them with git — whatever works for you.
If you organize images into subfolders, the folder name becomes a tag prefix in the caption. This is useful for style datasets organized by category (e.g., happy/, sad/, landscape/).
Captioning models
modl ships three captioning backends. The right choice depends on your dataset type and GPU.
Florence-2 is the fastest option and uses minimal VRAM. Good enough for style datasets where you’re using the --style flag anyway. One warning: it hallucinates narratives on photos of people — it will invent emotions, relationships, and backstories that aren’t there. Don’t use it for character datasets.
BLIP (Salesforce BLIP-2) is a solid middle ground: slightly better accuracy than Florence-2, especially for scene descriptions, at the cost of more VRAM (~6 GB).
Qwen3-VL produces the best captions overall. It follows instructions well, produces accurate and concise descriptions, and doesn’t hallucinate on people. It also uses less VRAM than BLIP (~4 GB vs ~6 GB) — this isn’t a tradeoff, it’s just the better default unless you need Florence-2’s speed. This is the recommended choice for character/subject LoRAs where caption accuracy directly affects training quality.
Same image, three models
To see the difference yourself, here’s the same photo captioned by each model:
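The original outputs aren't reproduced here; these reconstructions are only meant to illustrate the differences called out below:

```
Florence-2: a man standing in a room, with a plate of food on a table and a door behind him
BLIP:       a man standing and talking in a room
Qwen3-VL:   a man with blonde hair and a beard, wearing a dark jacket, gesturing with one hand while speaking in a studio
```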
Florence-2 hallucinated a plate of food and a door that aren’t there. BLIP got the general scene but missed the details. Qwen3-VL nailed the specifics — blonde hair, beard, dark jacket, gesturing, studio setting — the kind of precision that matters when training a LoRA on someone’s face.
For character datasets, always use Qwen3-VL. The accuracy difference matters — a hallucinated caption teaches the LoRA the wrong thing.
Character LoRA datasets
Character LoRAs teach a model to render a specific person, animal, or object. Caption accuracy is critical here — you’re teaching the model what this subject looks like.
Image selection
Aim for 15-30 images. Vary these across your set:
- Poses (front, side, three-quarter, full body, close-up)
- Lighting (natural, studio, outdoor, indoor)
- Backgrounds (plain, varied environments)
- Expressions (neutral, smiling, serious)
- Clothing (different outfits prevent the LoRA from baking in one look)
Captioning
Use Qwen3-VL for character datasets. It describes what it actually sees without inventing details.
Don’t put trigger words (like OHWX) in your caption files. The ai-toolkit training pipeline injects trigger words automatically during training. If you put them in captions manually, they’ll be doubled.
Augment with face-crop
Small dataset? Use face-crop to augment it. It detects faces in your images and creates tightly cropped close-up versions, effectively giving the model more face detail to learn from.
The --padding option is a bounding-box expansion multiplier: 1.0 = tight face, 1.8 = head and shoulders (default), 2.5 = upper body. The values of --trigger and --class-word are injected into the generated captions for the cropped images:
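A sketch using the options described above (dataset name and trigger/class values illustrative):

```bash
modl dataset face-crop mychar --padding 1.8 --trigger ohwx --class-word woman
```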
Each face crop prepends “a close-up photo of [trigger] [class-word], ” to the original caption, and the cropped file’s name gets a _facecrop_0 suffix.
Style LoRA datasets
Style LoRAs learn a visual aesthetic — line quality, color palette, texture, rendering approach. The dataset strategy differs significantly from character LoRAs — you need more images, different captioning, and the training dynamics change because the model is learning a global transformation rather than a specific subject.
Image selection
Aim for 50-200 images. The key principle: consistent style, varied content.
- All images should share the same visual style
- Subjects should be as diverse as possible (people, objects, landscapes, abstract)
- More variety in content = more flexible LoRA
Captioning with --style
The --style flag is critical for style datasets. It tells the captioner to describe what’s in the image without mentioning how it looks. This forces the LoRA to learn the visual gap between “normal” caption and “stylized” image.
Florence-2 with --style is fine for style datasets. When you have 100+ images, speed matters more than per-caption perfection, and the --style flag does the heavy lifting.
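A typical invocation (the --model flag name is an assumption; --style is the documented flag):

```bash
modl dataset caption mystyle --model florence-2 --style
```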
For the full explanation of why style captions should omit style words, see the Train Your First Style LoRA guide — the “Why captions matter so much” section covers this in detail.
Subfolder organization
If your style images have natural categories, organize them into subfolders. The folder name becomes a tag prefix in the caption:
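An illustrative layout and the captions it produces (filenames and caption text invented for the example; the tag prefix comes from the folder name):

```
~/.modl/datasets/mystyle/
├── happy/
│   └── img_001.png   →  "happy, a child running through a field"
├── sad/
│   └── img_014.png   →  "sad, an empty chair by a rain-streaked window"
└── landscape/
    └── img_027.png   →  "landscape, mountains at dusk"
```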
This gives the model category awareness during training — it learns both the overall style and the emotional subcategories.
Tagging
Tagging is different from captioning. Captions are natural-language descriptions (“a woman sitting at a café table”). Tags are structured labels (“1girl, sitting, café, red_dress, indoor”) — the format used by booru-style training datasets and many anime/illustration models.
By default, tag uses Florence-2 for general-purpose tagging. For anime or illustration datasets, use WD Tagger — it produces the danbooru-style tags that anime models expect:
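A sketch, assuming the backend is selected with a --model flag (the flag name and the wd-tagger identifier are assumptions):

```bash
modl dataset tag myanime --model wd-tagger
```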
Tags are written to the same .txt files as captions. If you want both tags and captions, caption first, then tag with --append to add tags as a comma-separated suffix.
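So a combined run might look like this (flag names assumed as above; --append is the documented flag):

```bash
modl dataset caption myanime --model qwen3-vl         # natural-language captions first
modl dataset tag myanime --model wd-tagger --append   # tags appended as a comma-separated suffix
```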
Most users should stick with captioning. Tagging is mainly useful for anime-style LoRAs trained on models that were originally trained on tagged data.
Command reference
Create & import
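Reconstructed from the commands described on this page; exact spellings may differ:

```bash
modl dataset create <name> --from <folder>    # import images (syntax assumed)
modl dataset prepare <name> --from <folder>   # create + resize + caption
```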
Caption & tag
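Likewise reconstructed; bracketed flags are optional and --model is an assumption:

```bash
modl dataset caption <name> [--model <backend>] [--style] [--overwrite]
modl dataset tag <name> [--model <backend>] [--append]
```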
Transform
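As described in the sections above (--size is an assumption):

```bash
modl dataset resize <name> [--size 1024]
modl dataset face-crop <name> [--padding N] [--trigger T] [--class-word C]
```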
Manage
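The one management command this page documents:

```bash
modl dataset validate <name>   # reports caption coverage
```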
Best practices
Image quality matters. Remove blurry, watermarked, or duplicate images before training. Every bad image dilutes the dataset signal. A small, clean dataset beats a large, noisy one.
Caption quality over quantity. 20 well-captioned images produce better results than 100 poorly captioned ones. Spot-check your captions — open a few .txt files alongside the images and make sure the descriptions are accurate.
Always resize to training resolution. Mixed resolutions slow down training and waste VRAM on downscaling. The default is 1024px, which matches SDXL and Flux native resolution. The resize command fits images to the longest edge — a 1920×1280 landscape becomes 1024×683, preserving the original aspect ratio without cropping or padding.
Review before training. After captioning, open the dataset folder and check 5-10 image-caption pairs. Look for hallucinated details, missed subjects, or style words that leaked into style-mode captions.
Run validate before training. The validate command checks your dataset and reports caption coverage — how many images have matching .txt files. It’s a quick sanity check before you start a training run.
If coverage is below 100%, run caption again (existing captions are skipped, so only the missing ones are filled in) or remove the uncaptioned images from the dataset.
Common dataset problems
These are the failure modes that show up most often in training:
- Repetitive poses or angles — if 20 of your 25 photos are front-facing headshots, the LoRA will struggle with any other angle. Diversity matters more than quantity.
- Inconsistent resolution — mixing 4K photos with 800px screenshots creates uneven training signal. Always resize first.
- Style words in style-mode captions — if your style captions say "watercolor painting of a cat" instead of "a cat," the LoRA learns nothing because the caption already describes the style. Check that --style was used.
- Near-duplicate images — slightly different crops of the same shot look like variety but teach the model to memorize one scene. Remove perceptual duplicates, not just exact matches.
Expanding small datasets
If you have fewer than 15 images for a character LoRA, face-crop is the first tool to reach for — it creates close-up variations from existing photos. Beyond that, you can use modl edit to create synthetic variations while preserving the subject:
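A hypothetical sketch; the edit subcommand's flags aren't documented on this page, so treat every option here as invented for illustration:

```bash
# Generate a variation of one image while keeping the subject intact
modl edit mychar/img_003.png --prompt "same person, outdoor background" \
  --output mychar/img_003_outdoor.png
```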
This works best when your originals are high quality but few — synthetic augmentation amplifies whatever signal is already there, noise included. Horizontal flips are free augmentation for symmetric subjects (skip them for text or logos).