primitives · vision-language · grounding · captioning

Caption, Tag & Detect Objects

Find objects by name, generate captions, and auto-tag images using Qwen3-VL. Three commands that bridge language and image understanding.

Mar 14, 2026 · 6 min read

The Upscale, Restore & Score guide covers commands that work on pixels and coordinates — score, detect, segment, upscale, remove-bg. They’re powerful but they don’t understand language. You can’t say “find the coffee cup” — you need to know the coordinates.

These three commands close that gap. They use Qwen3-VL, a vision-language model that understands both images and text:

  • modl vision ground — find objects by name, get bounding boxes
  • modl vision describe — generate natural language captions
  • modl vision tag — automatically label images with tags

All three support --model qwen3-vl-2b (fast, 4GB VRAM, default) or --model qwen3-vl-8b (higher quality, 16GB VRAM).

modl vision ground

What it does: Finds objects by name. Give it a text query and an image — it returns bounding boxes for every matching object.

$ modl vision ground "cup" cafe.webp --json
{"detections":[{"objects":[
{"label":"cup","bbox":[530,506,642,597]},
{"label":"cup","bbox":[768,514,879,601]}
],"object_count":2}],"total_objects":2}
Cafe scene with red bounding boxes around both coffee cups found by modl vision ground

modl vision ground found both cups from a text query. Unlike SAM, which needs pixel coordinates, ground understands what 'cup' means.
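The --json output is easy to consume programmatically. A minimal Python sketch — the JSON shape is taken from the output above; the flattening helper is ours, not part of modl:

```python
import json

def extract_bboxes(ground_json: str) -> list[list[int]]:
    """Flatten `modl vision ground --json` output into a list of
    [x1, y1, x2, y2] boxes."""
    data = json.loads(ground_json)
    return [obj["bbox"]
            for det in data["detections"]
            for obj in det["objects"]]

# The exact output from the cafe example above:
output = ('{"detections":[{"objects":['
          '{"label":"cup","bbox":[530,506,642,597]},'
          '{"label":"cup","bbox":[768,514,879,601]}'
          '],"object_count":2}],"total_objects":2}')
print(extract_bboxes(output))  # → [[530, 506, 642, 597], [768, 514, 879, 601]]
```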

The ground → segment → inpaint pipeline

This is the key workflow that ground enables. Let’s walk through replacing the coffee cups with glasses of wine — entirely from the command line.

Step 1 — Find the objects. modl vision ground takes a text query and returns bounding boxes. The query "cup" finds both cups, while "coffee cup" finds only one:

$ modl vision ground "cup" cafe.webp --json
{"detections":[{"objects":[
{"label":"cup","bbox":[530,506,642,597]},
{"label":"cup","bbox":[768,514,879,601]}
],"object_count":2}],"total_objects":2}

Step 2 — Create the mask. Ground returns tight bounding boxes around the cup bodies — but not the saucers underneath. For inpainting, the mask needs to cover everything you want replaced, with padding for the model to blend. We merge both bboxes into one region and expand:

$ modl process segment cafe.webp --method bbox --bbox 530,506,879,601 --expand 50
✓ Mask saved to cafe_mask.png
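The merged bbox 530,506,879,601 is just the union of the two boxes ground returned, and --expand grows it outward, clamped to the image (here 1344×768). The arithmetic, sketched in Python — the helper names are ours, not modl's:

```python
def union_bbox(boxes):
    """Smallest box covering every [x1, y1, x2, y2] box."""
    return [min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes)]

def expand_bbox(box, pad, width, height):
    """Grow a box by pad pixels on every side, clamped to the image."""
    x1, y1, x2, y2 = box
    return [max(0, x1 - pad), max(0, y1 - pad),
            min(width, x2 + pad), min(height, y2 + pad)]

cups = [[530, 506, 642, 597], [768, 514, 879, 601]]
merged = union_bbox(cups)                     # [530, 506, 879, 601]
padded = expand_bbox(merged, 50, 1344, 768)   # [480, 456, 929, 651]
```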

Step 3 — Inpaint. Pass the original image, the mask, and a prompt. Use --size 1344x768 to match the original dimensions:

$ modl generate "two glasses of red wine on a clean cafe table, nothing else" \
--init-image cafe.webp --mask cafe_mask.png --size 1344x768
↳ Using flux-fill-dev for inpainting
✓ Generated 1 image(s)
  1. Original — the untouched cafe scene
  2. Ground — bounding boxes around both coffee cups found by modl vision ground
  3. Mask — expanded, feathered mask over the merged cup region
  4. Inpainted — both coffee cups replaced with glasses of red wine

Three commands: ground finds objects by name, segment creates a padded mask from the merged bboxes, generate fills the region.

What we learned building this pipeline

Getting clean inpainting results took several iterations. Here’s what we found:

Ground bboxes are tighter than you’d expect. The bbox wraps the visible cup body — not the saucer, not the shadow. If you expand just the cup bbox and inpaint, the saucer bleeds through underneath because it was never fully masked. For clean replacement, the mask must cover the entire object including its base.

Query wording changes what ground finds. "cup" found both cups. "coffee cup" found only one. Shorter, more generic queries cast a wider net. Start broad and narrow down.

--expand is a balancing act. Too small (30px) and the model doesn’t have room — it just regenerates a similar cup. Too large (60px) and the feathered edge bleeds into the neighboring cup, causing the model to replace both. The Gaussian blur extends well beyond the expand value, so nearby objects can be affected even when they seem safely outside the rectangle.
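The bleed is easy to see in one dimension: a hard mask blurred with even a modest kernel has nonzero weight well past its original edge. A toy NumPy illustration, with a box blur standing in for the Gaussian:

```python
import numpy as np

mask = np.zeros(200)
mask[80:120] = 1.0                      # hard mask: a 40px object

kernel = np.ones(21) / 21               # 21px box blur (stand-in for Gaussian)
feathered = np.convolve(mask, kernel, mode="same")

edge = np.flatnonzero(feathered)[0]     # first pixel the feather touches
print(edge)  # 70 — the feather now reaches 10px past the original edge at 80
```

So even a conservative blur pushes mask influence well beyond the rectangle you asked for, which is exactly how a neighboring cup ends up inside the inpainted region.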

Merging bboxes works better than single-object masks when objects are close together. Rather than fighting with expand values to isolate one cup, it’s simpler to merge both bounding boxes into one mask region and replace everything at once.

Descriptive prompts fix artifacts. The model fills the entire masked region, not just where the object was. With a wide mask and a vague prompt like "two glasses of wine", it hallucinated stacked plates to fill the empty table area. Adding "on a clean cafe table, nothing else" told the model to leave the rest empty — same mask, clean result. When inpainting goes wrong, try improving the prompt before shrinking the mask.

Flux Fill Dev produces the best results but needs room. It’s a dedicated inpainting model with 384 input channels — run modl pull flux-fill-dev to install it. Modl automatically routes to it when a mask is provided.

An AI agent can run this entire pipeline autonomously — it just needs a text instruction like “replace the coffee cups with wine glasses.”
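A sketch of that automation in Python, chaining the three commands via subprocess. The commands, flags, and the cafe_mask.png naming convention follow the walkthrough above; the helpers themselves (sh, replace_objects) are hypothetical, and the run parameter is injectable so the logic can be exercised without modl installed:

```python
import json
import subprocess

def sh(cmd):
    """Run a command and return its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def replace_objects(query, image, prompt, pad=50, run=sh):
    # 1. Ground: find every object matching the text query.
    found = json.loads(run(["modl", "vision", "ground", query, image, "--json"]))
    boxes = [o["bbox"] for d in found["detections"] for o in d["objects"]]
    # 2. Segment: one padded mask over the merged region.
    x1 = min(b[0] for b in boxes); y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes); y2 = max(b[3] for b in boxes)
    run(["modl", "process", "segment", image, "--method", "bbox",
         "--bbox", f"{x1},{y1},{x2},{y2}", "--expand", str(pad)])
    # 3. Inpaint: modl routes to flux-fill-dev when a mask is provided.
    mask = image.rsplit(".", 1)[0] + "_mask.png"
    run(["modl", "generate", prompt, "--init-image", image, "--mask", mask])
```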

modl vision describe

What it does: Generates a natural language description of an image. Three detail levels for different use cases.

Outdoor cafe scene — two people sitting at a small round table with coffee cups

The cafe reference image used for all describe examples below.

$ modl vision describe cafe.png --detail brief
cafe.png:
A woman and a man are sitting at an outdoor café table, engaged
in conversation, with two coffee cups placed between them.
$ modl vision describe cafe.png --detail detailed
cafe.png:
The image depicts an outdoor café scene with two individuals
sitting at a small round table. The setting appears to be a city
street, as evidenced by the blurred background showing parked
cars and buildings.
 
The person on the left is a woman with long, dark hair, wearing
a dark coat over a light-colored top. She is seated on a metal
chair and is looking towards the person on the right. On the
table in front of her is a white cup with a brown interior,
likely containing a beverage.
 
The person on the right is a man with short, curly hair, wearing
a dark jacket. He is facing the woman but is turned slightly
away from the camera, so his face is not fully visible.
$ modl vision describe cafe.png --detail verbose
cafe.png:
The image depicts an outdoor café scene with two individuals
seated at a small round table.
 
Objects Present:
1. Two Individuals:
- A woman on the left side is facing the man across from her.
- The man is on the right side, facing away from the camera.
- Both are dressed in dark-colored clothing.
2. Table:
- The table is small and round, typical of outdoor café tables.
- Two coffee cups are placed on the table.
3. Background:
- Shows a street with parked cars and buildings.
4. Lighting:
- Daytime, likely morning or afternoon, soft shadows.
5. Mood:
- Calm and relaxed, casual social interaction.
6. Color:
- Muted browns and grays complementing the urban setting.

Detail levels:

  • brief — one sentence, good for dataset captioning
  • detailed — structured paragraphs with people, objects, setting (default)
  • verbose — everything visible including colors, spatial relationships, mood

Use cases

Dataset captioning — modl dataset caption --model qwen uses describe under the hood for higher-quality captions than Florence-2 or BLIP. Better captions = better LoRAs.

Agent understanding — let an AI agent “see” what was generated and decide if it matches the prompt.

Compare output to intent — describe the generated image and check if it matches what you asked for.

Tip: For dataset captioning, use --detail brief to get concise factual descriptions. For agent workflows, use --detail detailed to give the agent enough context to make decisions.
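The compare-to-intent check can be as simple as keyword overlap between the prompt and the caption. A toy heuristic — not a modl feature, and a real agent would more likely ask the model itself to judge:

```python
def matches_intent(prompt: str, caption: str, threshold: float = 0.5) -> bool:
    """Crude check: do most content words from the prompt
    appear in the generated caption?"""
    stop = {"a", "an", "the", "of", "on", "in", "with", "and"}
    wanted = {w.strip(".,").lower() for w in prompt.split()} - stop
    seen = {w.strip(".,").lower() for w in caption.split()}
    return not wanted or len(wanted & seen) / len(wanted) >= threshold

matches_intent("two glasses of red wine",
               "Two glasses of red wine sit on a cafe table")  # True
matches_intent("two glasses of red wine",
               "A coffee cup on a table")                      # False
```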

modl vision tag

What it does: Automatically labels an image with relevant tags — objects, concepts, mood, setting.

$ modl vision tag cafe.png
→ Tagging image(s) [qwen3-vl-2b]...
✓ Tagged 1/1 images
 
cafe.png: woman, man, coffee, outdoor cafe, street, sidewalk,
cars, buildings, conversation, casual attire, relaxed atmosphere
$ modl vision tag cafe.png --max-tags 5
cafe.png: woman, man, coffee, outdoor cafe, street

Use --max-tags 5 to limit the count. Useful for:

  • Organizing outputs — auto-tag your generation gallery
  • Filtering batches — find all images with “person” or “landscape”
  • Search indexes — build searchable metadata for large output collections
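Filtering a batch is a one-liner once the output is parsed. A Python sketch assuming the "file: tag, tag, ..." output format shown above (the second file and its tags are made-up sample data):

```python
def parse_tags(output: str) -> dict[str, set[str]]:
    """Parse 'file: tag, tag, ...' lines into {file: {tags}}."""
    tags = {}
    for line in output.splitlines():
        if ":" in line:
            name, rest = line.split(":", 1)
            tags[name.strip()] = {t.strip() for t in rest.split(",")}
    return tags

batch = parse_tags("cafe.png: woman, man, coffee, outdoor cafe, street\n"
                   "hills.png: landscape, mountains, sunrise")
people = [f for f, t in batch.items() if {"woman", "man"} & t]
print(people)  # → ['cafe.png']
```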

Choosing a model size

All three commands default to qwen3-vl-2b (2B params, ~4GB VRAM). For higher accuracy, use --model qwen3-vl-8b (8B params, ~16GB VRAM):

# Fast (default): good for most tasks
$ modl vision ground "coffee cup" cafe.png
 
# Quality: better for complex scenes or fine-grained detection
$ modl vision ground "coffee cup" cafe.png --model qwen3-vl-8b

When to use 8B:

  • Complex scenes with many small objects
  • Queries that require fine-grained understanding (“the second cup from the left”)
  • Detailed captions where accuracy matters (dataset captioning for production LoRAs)

When 2B is enough:

  • Simple object queries (“person”, “table”, “car”)
  • Brief captions
  • Tagging
  • Any time inference speed matters more than perfect accuracy

Quick reference

modl vision ground — find objects by name

modl vision describe — image captioning

modl vision tag — automatic tagging