Tags: primitives, vision-language, grounding, captioning

Caption, Tag & Detect Objects

Find objects by name, generate captions, and auto-tag images using Qwen3-VL. Three commands that bridge language and image understanding.

Mar 14, 2026 6 min read

The Upscale, Restore & Score guide covers commands that work on pixels and coordinates — score, detect, segment, upscale, remove-bg. They’re powerful but they don’t understand language. You can’t say “find the coffee cup” — you need to know the coordinates.

These three commands close that gap. They use Qwen3-VL, a vision-language model that understands both images and text:

modl vision ground — find objects by name, get bounding boxes
modl vision describe — generate natural language captions
modl vision tag — automatically label images with tags

All three support --model qwen3-vl-2b (fast, 4GB VRAM, default) or --model qwen3-vl-8b (higher quality, 16GB VRAM).

modl vision ground

What it does: Finds objects by name. Give it a text query and an image — it returns bounding boxes for every matching object.

  $ modl vision ground "cup" cafe.webp --json     
     {"detections":[{"objects":[  
       {"label":"cup","bbox":[530,506,642,597]},  
       {"label":"cup","bbox":[768,514,879,601]}  
     ],"object_count":2}],"total_objects":2}  

Figure: cafe scene with red bounding boxes around both coffee cups found by modl vision ground.

modl vision ground found both cups from a text query. Unlike SAM, which needs pixel coordinates, ground understands what 'cup' means.
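To use these detections in a script, the JSON can be parsed with the standard library. The following is a minimal sketch rather than part of modl itself: the sample string is copied from the output above, `extract_bboxes` is a hypothetical helper name, and the `[x1, y1, x2, y2]` pixel ordering of `bbox` is an assumption inferred from the values shown.

```python
import json

# Sample copied from the `modl vision ground "cup" cafe.webp --json` output above.
GROUND_JSON = """
{"detections":[{"objects":[
  {"label":"cup","bbox":[530,506,642,597]},
  {"label":"cup","bbox":[768,514,879,601]}
],"object_count":2}],"total_objects":2}
"""

def extract_bboxes(raw, label=None):
    """Collect bbox lists from ground's JSON, optionally keeping one label only.

    Assumption: each bbox is [x1, y1, x2, y2] in pixels.
    """
    data = json.loads(raw)
    boxes = []
    for detection in data.get("detections", []):
        for obj in detection.get("objects", []):
            if label is None or obj.get("label") == label:
                boxes.append(obj["bbox"])
    return boxes

print(extract_bboxes(GROUND_JSON, label="cup"))
# [[530, 506, 642, 597], [768, 514, 879, 601]]
```

Filtering by label is optional here, since a single query returns a single label, but it keeps the helper usable if several outputs are concatenated.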

The ground → segment → inpaint pipeline

This is the key workflow that ground enables. Let’s walk through replacing the coffee cups with glasses of wine — entirely from the command line.

Step 1: Find the objects. modl vision ground takes a text query and returns bounding boxes. The query "cup" found both cups, while "coffee cup" found only one:

  $ modl vision ground "cup" cafe.webp --json     
     {"detections":[{"objects":[  
       {"label":"cup","bbox":[530,506,642,597]},  
       {"label":"cup","bbox":[768,514,879,601]}  
     ],"object_count":2}],"total_objects":2}  

Step 2: Create the mask. Ground returns tight bounding boxes around the cup bodies, but not the saucers underneath. For inpainting, the mask needs to cover everything you want replaced, plus padding so the model can blend. We merge both bboxes into one region (the smallest box containing both: 530,506,879,601) and expand it by 50px:

$ modl process segment cafe.webp --method bbox --bbox 530,506,879,601 --expand 50

✓ Mask saved to cafe_mask.png
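The merged region passed to --bbox is the union of the two boxes: the minimum of the top-left corners and the maximum of the bottom-right corners. Here is a small sketch of that arithmetic, assuming `[x1, y1, x2, y2]` pixel boxes. `merge_bboxes` and `expand_bbox` are illustrative helpers, not modl functions, and `expand_bbox` only models the padded rectangle, not the feathering that segment applies.

```python
def merge_bboxes(boxes):
    """Smallest (x1, y1, x2, y2) box that contains every input box."""
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))

def expand_bbox(box, pad, width, height):
    """Grow a box by `pad` pixels per side, clamped to the image bounds."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - pad), max(0, y1 - pad),
            min(width, x2 + pad), min(height, y2 + pad))

cups = [(530, 506, 642, 597), (768, 514, 879, 601)]
merged = merge_bboxes(cups)
print(",".join(str(v) for v in merged))     # 530,506,879,601
print(expand_bbox(merged, 50, 1344, 768))   # (480, 456, 929, 651)
```

The first printed line is the value used for --bbox above. The second shows roughly which rectangle a 50px expansion covers on the 1344x768 image.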

Step 3: Inpaint. Pass the original image (--init-image), the mask (--mask), and a prompt. Use --size 1344x768 to match the original dimensions:

  $ modl generate "two glasses of red wine on a clean cafe table, nothing else" \     
       --init-image cafe.webp --mask cafe_mask.png --size 1344x768  
     ↳ Using flux-fill-dev for inpainting  
     ✓ Generated 1 image(s)  

Figure, four panels: (1) Original. (2) Ground: bounding boxes around both coffee cups found by modl vision ground. (3) Mask: expanded feathered mask over the merged cup region. (4) Inpainted: both coffee cups replaced with glasses of red wine.

Three commands: ground finds objects by name, segment creates a padded mask from the merged bboxes, generate fills the region.
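For scripting, the segment and generate steps can be assembled as argument lists. The sketch below only builds and prints the commands, using the flags shown in this walkthrough; it does not run modl. The function names are hypothetical, and the mask filename `cafe_mask.png` is taken from the segment output above rather than from any documented naming rule.

```python
import shlex

def segment_cmd(image, bbox, expand):
    """argv for the bbox-based mask step shown in Step 2."""
    return ["modl", "process", "segment", image, "--method", "bbox",
            "--bbox", ",".join(str(v) for v in bbox),
            "--expand", str(expand)]

def inpaint_cmd(prompt, image, mask, size):
    """argv for the inpainting step shown in Step 3."""
    return ["modl", "generate", prompt,
            "--init-image", image, "--mask", mask, "--size", size]

steps = [
    segment_cmd("cafe.webp", (530, 506, 879, 601), 50),
    inpaint_cmd("two glasses of red wine on a clean cafe table, nothing else",
                "cafe.webp", "cafe_mask.png", "1344x768"),
]

for argv in steps:
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Each list could be handed to `subprocess.run(argv, check=True)` on a machine where modl is installed. Keeping the commands as lists avoids shell-quoting problems with prompts that contain commas or quotes.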

What we learned building this pipeline

Getting clean inpainting results took several iterations. Here’s what we found:

Ground bboxes are tighter than you’d expect. The bbox wraps the visible cup body — not the saucer, not the shadow. If you expand just the cup bbox and inpaint, the saucer bleeds through underneath because it was never fully masked. For clean replacement, the mask must cover the entire object including its base.

Query wording changes what ground finds. "cup" found both cups. "coffee cup" found only one. Shorter, more generic queries cast a wider net. Start broad and narrow down.

--expand is a balancing act. Too small (30px) and the model doesn’t have room — it just regenerates a similar cup. Too large (60px) and the feathered edge bleeds into the neighboring cup, causing the model to replace both. The Gaussian blur extends well beyond the expand value, so nearby objects can be affected even when they seem safely outside the rectangle.

Merging bboxes works better than single-object masks when objects are close together. Rather than fighting with expand values to isolate one cup, it’s simpler to merge both bounding boxes into one mask region and replace everything at once.
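One way to decide between isolating an object and merging is to compare the gap between boxes with how far the mask is likely to reach. The heuristic below is only a sketch under stated assumptions: boxes are `[x1, y1, x2, y2]` and sit side by side, and `feather_margin` is a hypothetical allowance for the soft edge beyond the expanded rectangle. The 70px value is chosen for illustration only; the actual blur radius is not given in this guide.

```python
def horizontal_gap(a, b):
    """Empty pixels between two boxes along x (0 if they overlap in x)."""
    left, right = (a, b) if a[0] <= b[0] else (b, a)
    return max(0, right[0] - left[2])

def should_merge(a, b, expand, feather_margin):
    """True when an expanded, feathered mask on one box would reach the other.

    Assumption: the soft edge extends about `feather_margin` px past the
    expanded rectangle. This is an illustrative guess, not a modl parameter.
    """
    return horizontal_gap(a, b) <= expand + feather_margin

cup_left = (530, 506, 642, 597)
cup_right = (768, 514, 879, 601)

print(horizontal_gap(cup_left, cup_right))                                # 126
print(should_merge(cup_left, cup_right, expand=60, feather_margin=70))   # True
print(should_merge(cup_left, cup_right, expand=30, feather_margin=70))   # False
```

The two cups are 126px apart, so a 60px expansion plus a wide soft edge can plausibly touch the neighbor, which is consistent with the bleed described above.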

Descriptive prompts fix artifacts. The model fills the entire masked region, not just where the object was. With a wide mask and a vague prompt like "two glasses of wine", it hallucinated stacked plates to fill the empty table area. Adding "on a clean cafe table, nothing else" told the model to leave the rest empty — same mask, clean result. When inpainting goes wrong, try improving the prompt before shrinking the mask.

Flux Fill Dev produces the best results but needs room. It’s a dedicated inpainting model with 384 input channels — run modl pull flux-fill-dev to install it. Modl automatically routes to it when a mask is provided.

An AI agent can run this entire pipeline autonomously — it just needs a text instruction like “replace the coffee cups with wine glasses.”

modl vision describe

What it does: Generates a natural language description of an image. Three detail levels for different use cases.

Figure: outdoor cafe scene with two people sitting at a small round table with coffee cups.

The cafe reference image used for all describe examples below.

  $ modl vision describe cafe.png --detail brief     
     cafe.png:  
       A woman and a man are sitting at an outdoor café table, engaged  
       in conversation, with two coffee cups placed between them.  

  $ modl vision describe cafe.png --detail detailed     
     cafe.png:  
       The image depicts an outdoor café scene with two individuals  
       sitting at a small round table. The setting appears to be a city  
       street, as evidenced by the blurred background showing parked  
       cars and buildings.  
      
       The person on the left is a woman with long, dark hair, wearing  
       a dark coat over a light-colored top. She is seated on a metal  
       chair and is looking towards the person on the right. On the  
       table in front of her is a white cup with a brown interior,  
       likely containing a beverage.  
      
       The person on the right is a man with short, curly hair, wearing  
       a dark jacket. He is facing the woman but is turned slightly  
       away from the camera, so his face is not fully visible.  

  $ modl vision describe cafe.png --detail verbose     
     cafe.png:  
       The image depicts an outdoor café scene with two individuals  
       seated at a small round table.  
      
       Objects Present:  
       1. Two Individuals:  
       - A woman on the left side is facing the man across from her.  
       - The man is on the right side, facing away from the camera.  
       - Both are dressed in dark-colored clothing.  
       2. Table:  
       - The table is small and round, typical of outdoor café tables.  
       - Two coffee cups are placed on the table.  
       3. Background:  
       - Shows a street with parked cars and buildings.  
       4. Lighting:  
       - Daytime, likely morning or afternoon, soft shadows.  
       5. Mood:  
       - Calm and relaxed, casual social interaction.  
       6. Color:  
       - Muted browns and grays complementing the urban setting.  

Detail levels:

brief — one sentence, good for dataset captioning
detailed — structured paragraphs with people, objects, setting (default)
verbose — everything visible including colors, spatial relationships, mood

Use cases

Dataset captioning — modl dataset caption --model qwen uses describe under the hood for higher quality captions than Florence-2 or BLIP. Better captions = better LoRAs.

Agent understanding — let an AI agent “see” what was generated and decide if it matches the prompt.

Compare output to intent — describe the generated image and check if it matches what you asked for.

Tip: For dataset captioning, use --detail brief to get concise factual descriptions. For agent workflows, use --detail detailed to give the agent enough context to make decisions.
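A very simple form of the "compare output to intent" check is to look for required terms in the caption. This is a rough sketch, not how modl or any agent actually evaluates images. `missing_terms` is a hypothetical helper, the caption is the brief description from the example above, and plain word matching will miss synonyms and paraphrases.

```python
import re

def missing_terms(caption, required_terms):
    """Return the required terms that do not occur in the caption.

    Matching is case-insensitive and on whole words or phrases.
    """
    text = caption.lower()
    missing = []
    for term in required_terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if not re.search(pattern, text):
            missing.append(term)
    return missing

caption = ("A woman and a man are sitting at an outdoor café table, engaged "
           "in conversation, with two coffee cups placed between them.")

print(missing_terms(caption, ["woman", "coffee cups"]))  # []
print(missing_terms(caption, ["wine"]))                  # ['wine']
```

An empty list means every expected term was mentioned. A non-empty list is a cue to regenerate or to inspect the image more closely.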

modl vision tag

What it does: Automatically labels an image with relevant tags — objects, concepts, mood, setting.

  $ modl vision tag cafe.png     
   → Tagging image(s) [qwen3-vl-2b]...  
   ✓ Tagged 1/1 images  
      
     cafe.png: woman, man, coffee, outdoor cafe, street, sidewalk,  
       cars, buildings, conversation, casual attire, relaxed atmosphere  

$ modl vision tag cafe.png --max-tags 5

cafe.png: woman, man, coffee, outdoor cafe, street

Use --max-tags to cap the number of tags returned (5 in the example above). Tagging is useful for:

Organizing outputs — auto-tag your generation gallery
Filtering batches — find all images with “person” or “landscape”
Search indexes — build searchable metadata for large output collections
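Filtering a batch by tag can be sketched by parsing the text output into a filename-to-tags mapping. This assumes one `filename: tag, tag, ...` line per image, as in the --max-tags example above; longer tag lists that wrap across lines would need to be joined first. The second entry, `park.png`, is made up for illustration, and both function names are hypothetical.

```python
def parse_tag_line(line):
    """Split 'name: tag, tag' into (name, [tags])."""
    filename, _, tags = line.partition(": ")
    return filename.strip(), [t.strip() for t in tags.split(",") if t.strip()]

def filter_by_tag(lines, wanted):
    """Return filenames whose tag list contains `wanted` (case-insensitive)."""
    wanted = wanted.lower()
    matches = []
    for line in lines:
        filename, tags = parse_tag_line(line)
        if wanted in (t.lower() for t in tags):
            matches.append(filename)
    return matches

output_lines = [
    "cafe.png: woman, man, coffee, outdoor cafe, street",  # from the example above
    "park.png: tree, bench, landscape",                    # hypothetical second image
]

print(parse_tag_line(output_lines[0]))
print(filter_by_tag(output_lines, "landscape"))  # ['park.png']
```

The same mapping could be written to a JSON file to serve as a small search index.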

Choosing a model size

All three commands default to qwen3-vl-2b (2B params, ~4GB VRAM). For higher accuracy, use --model qwen3-vl-8b (8B params, ~16GB VRAM):

  # Fast (default): good for most tasks   
  $ modl vision ground "coffee cup" cafe.png     
      
  # Quality: better for complex scenes or fine-grained detection   
  $ modl vision ground "coffee cup" cafe.png --model qwen3-vl-8b     

When to use 8B:

Complex scenes with many small objects
Queries that require fine-grained understanding (“the second cup from the left”)
Detailed captions where accuracy matters (dataset captioning for production LoRAs)
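The guidance above can be captured in a small helper for scripts that run on mixed hardware. This is only a sketch: `pick_model` is a hypothetical function, the 16GB threshold comes from the approximate VRAM figure quoted in this guide, and how free VRAM is measured is left to the caller.

```python
def pick_model(free_vram_gb, needs_fine_detail=False):
    """Choose a --model value from the two options described in this guide.

    Assumption: the 8B model needs roughly 16GB of VRAM, per the figures above.
    """
    if needs_fine_detail and free_vram_gb >= 16:
        return "qwen3-vl-8b"
    return "qwen3-vl-2b"

print(pick_model(24, needs_fine_detail=True))   # qwen3-vl-8b
print(pick_model(8, needs_fine_detail=True))    # qwen3-vl-2b
print(pick_model(24))                           # qwen3-vl-2b
```

Falling back to the 2B model when memory is short keeps a batch job running, at the cost of the accuracy the 8B model would provide.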


Quick reference

modl vision ground — find objects by name

modl vision describe — image captioning

modl vision tag — automatic tagging