Quick start
Two commands to go from a folder of images to a captioned dataset ready for training:
Or do it all in one shot with prepare — it creates, resizes, and captions in a single pipeline:
Your dataset is now ready for modl train.
What’s on disk
Datasets live at ~/.modl/datasets/<name>/. There’s no database — it’s all filesystem. Each image gets a paired .txt caption file with the same name:
The .txt file contains the caption — one line of text that describes the image. You can edit these by hand at any time. No rebuild step, no recompilation. Just edit the text file and train.
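Because the pairing is just a filename match, it can be sketched in a few lines of Python (the file names and extensions here are illustrative, not a spec of the tool's internals):

```python
from pathlib import Path

def caption_pairs(dataset_dir):
    """Yield (image, caption) path pairs; the caption shares the image's stem."""
    exts = {".png", ".jpg", ".jpeg", ".webp"}
    for img in sorted(Path(dataset_dir).iterdir()):
        if img.suffix.lower() in exts:
            txt = img.with_suffix(".txt")
            yield img, txt if txt.exists() else None
```

Editing a caption is just rewriting the .txt file; the next training run picks it up automatically.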
Because datasets are just images + text files in a folder, you can use them with any training tool. There’s no proprietary format. Move them, copy them, version them with git — whatever works for you.
If you organize images into subfolders, the folder name becomes a tag prefix in the caption. This is useful for style datasets organized by category (e.g., happy/, sad/, landscape/).
Captioning models
modl ships three captioning backends. The right choice depends on your dataset type and GPU.
Florence-2 is the fastest option and uses minimal VRAM. Good enough for style datasets where you’re using the --style flag anyway. One warning: it hallucinates narratives on photos of people — it will invent emotions, relationships, and backstories that aren’t there. Don’t use it for character datasets.
BLIP-2 is a solid middle ground. Slightly better accuracy than Florence-2, especially for scene descriptions. Uses more VRAM.
Qwen3-VL produces the best captions overall. It follows instructions well, produces accurate and concise descriptions, and doesn’t hallucinate on people. This is the recommended choice for character/subject LoRAs where caption accuracy directly affects training quality.
For character datasets, always use Qwen3-VL. The accuracy difference matters — a hallucinated caption teaches the LoRA the wrong thing.
Character LoRA datasets
Character LoRAs teach a model to render a specific person, animal, or object. Caption accuracy is critical here — you’re teaching the model what this subject looks like.
Image selection
Aim for 15-30 images, varied across these dimensions:
- Poses (front, side, three-quarter, full body, close-up)
- Lighting (natural, studio, outdoor, indoor)
- Backgrounds (plain, varied environments)
- Expressions (neutral, smiling, serious)
- Clothing (different outfits prevent the LoRA from baking in one look)
Captioning
Use Qwen3-VL for character datasets. It describes what it actually sees without inventing details.
Don’t put trigger words (like OHWX) in your caption files. The ai-toolkit training pipeline injects trigger words automatically during training. If you put them in captions manually, they’ll be doubled.
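A minimal sketch of why manual triggers get doubled. The injection step is an assumption about how a pipeline like ai-toolkit behaves at caption load time, and prepend_trigger is a hypothetical helper, not its actual API:

```python
def prepend_trigger(caption, trigger):
    """Prepend the trigger word, as a training pipeline might at load time."""
    return f"{trigger}, {caption}"

# Caption file without a trigger: injected exactly once.
print(prepend_trigger("a woman in a red coat", "OHWX"))
# → OHWX, a woman in a red coat

# Caption file that already contains the trigger: now doubled.
print(prepend_trigger("OHWX, a woman in a red coat", "OHWX"))
# → OHWX, OHWX, a woman in a red coat
```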
Augment with face-crop
Small dataset? Use face-crop to augment it. It detects faces in your images and creates tightly cropped close-up versions, effectively giving the model more face detail to learn from.
The --padding option is a bbox expansion multiplier: 1.0 = tight face, 1.8 = head and shoulders (default), 2.5 = upper body. The --trigger and --class-word options are used in generated captions for the cropped images.
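The padding math can be sketched as expanding the detected face box around its center (the function and (x, y, w, h) box format are illustrative, not the tool's internals; a real crop would also clamp to the image bounds):

```python
def expand_bbox(x, y, w, h, padding=1.8):
    """Scale a face bounding box about its center by the padding multiplier."""
    cx, cy = x + w / 2, y + h / 2
    nw, nh = w * padding, h * padding
    return cx - nw / 2, cy - nh / 2, nw, nh

# A 50x50 face box at (100, 100), doubled:
print(expand_bbox(100, 100, 50, 50, padding=2.0))
# → (75.0, 75.0, 100.0, 100.0)
```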
Style LoRA datasets
Style LoRAs learn a visual aesthetic — line quality, color palette, texture, rendering approach. The dataset strategy is fundamentally different from character LoRAs.
Image selection
Aim for 50-200 images. The key principle: consistent style, varied content.
- All images should share the same visual style
- Subjects should be as diverse as possible (people, objects, landscapes, abstract)
- More variety in content = more flexible LoRA
Captioning with --style
The --style flag is critical for style datasets. It tells the captioner to describe what’s in the image without mentioning how it looks. This forces the LoRA to learn the visual gap between “normal” caption and “stylized” image.
Florence-2 with --style is fine for style datasets. When you have 100+ images, speed matters more than per-caption perfection, and the --style flag does the heavy lifting.
For the full explanation of why style captions should omit style words, see the Train Your First Style LoRA guide — the “Why captions matter so much” section covers this in detail.
Subfolder organization
If your style images have natural categories, organize them into subfolders. The folder name becomes a tag prefix in the caption:
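As a sketch of the prefixing (the paths and caption text are illustrative), the tag amounts to joining the image's parent folder name onto the generated caption:

```python
from pathlib import Path

def tag_prefixed_caption(image_path, caption):
    """Prefix a caption with the image's parent folder name as a tag."""
    tag = Path(image_path).parent.name
    return f"{tag}, {caption}" if tag else caption

print(tag_prefixed_caption("dataset/happy/img001.png", "a person laughing in the rain"))
# → happy, a person laughing in the rain
```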
This gives the model category awareness during training: it learns the overall style along with each folder's subcategory.
Command reference
Dataset commands
Best practices
Image quality matters. Remove blurry, watermarked, or duplicate images before training. Every bad image dilutes the dataset signal. A small, clean dataset beats a large, noisy one.
Caption quality over quantity. 20 well-captioned images produce better results than 100 poorly captioned ones. Spot-check your captions — open a few .txt files alongside the images and make sure the descriptions are accurate.
Always resize to training resolution. Mixed resolutions slow down training and waste VRAM on downscaling. The default is 1024px, which matches SDXL and Flux native resolution.
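The resize target can be sketched as scaling the long side to 1024 while preserving aspect ratio (a hypothetical helper; the CLI's exact resampling and upscaling behavior may differ):

```python
def resize_dims(w, h, target=1024):
    """Scale so the longer side equals target, preserving aspect ratio."""
    scale = target / max(w, h)
    return round(w * scale), round(h * scale)

print(resize_dims(3000, 2000))  # → (1024, 683)
```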
Review before training. After captioning, open the dataset folder and check 5-10 image-caption pairs. Look for hallucinated details, missed subjects, or style words that leaked into style-mode captions.
Dataset size guidelines
Fewer images with good variety and accurate captions will always outperform a larger dataset of repetitive or poorly described images. When in doubt, curate harder.