← All Guides
kleinmulti-referenceeditstyle-refcharacterflux2

Multi-Reference Editing with Klein 9B

Pass up to 4 reference images to Klein 9B and blend people, places, and styles in a single generation. Real experiments, findings, and tips.

Mar 28, 2026 15 min read

Klein 9B can take up to 4 reference images and blend them into a single generation. No training required, no IP-Adapter setup — it’s built into the model. Pass a person, a place, and a style, and get something like this:

Oil painting of a woman on a rooftop terrace at sunset, blending three reference images

Three references in, one image out: person + rooftop location + oil painting style.

This guide documents real experiments — what works, what doesn’t, and when to use references vs LoRA.

How it works

Klein 9B was trained with native multi-reference support. When you pass images via --image, the model conditions on all of them during generation. This isn’t IP-Adapter or ControlNet bolted on after training — it’s part of the architecture.

The image parameter in Flux2KleinPipeline accepts a list of PIL images. The model VAE-encodes each reference and feeds them to the transformer alongside the noise. BFL’s spec says Klein supports up to 4 reference images.

In modl, you use modl edit with multiple --image flags:

$ modl edit "prompt describing what you want" \
--image ref1.png --image ref2.png --image ref3.png \
--base flux2-klein-9b

Each --image adds another reference. The prompt tells the model what to do with them.

Your first multi-reference edit

Start with a single reference — a portrait — and see how adding a second changes the output.

Portrait reference

Reference: portrait

1-ref result

Result: Klein invents the cafe

$ modl edit "a woman sitting in a cozy cafe reading a book" \
--image ref-person-1.webp --base flux2-klein-9b --seed 42

The model kept the woman’s appearance but invented its own cafe. Now add a second reference — the actual cafe you want:

Cafe reference

Added reference: the cafe

2-ref result

Result: cafe now matches the reference

$ modl edit "a woman sitting in a cozy cafe reading a book" \
--image ref-person-1.webp --image ref-cafe.webp \
--base flux2-klein-9b --seed 42

Same plants, same window, same warm light — both the person and the environment are now grounded in real images.

The prompt is the conductor

The most important lesson: your prompt tells the model how to compose the references. Throwing in images without describing how they relate produces mediocre results. Each reference needs a role in the scene you’re describing.

Mistake 1: references without prompt context

If you add a rooftop reference but your prompt just says “a woman in a cafe”, the rooftop gets ignored — the model follows the prompt.

2 refs, prompt says cafe

✔ Person + cafe: prompt matches refs

3 refs, prompt still says cafe

✘ Added rooftop ref, prompt still says “cafe” — ignored

The rooftop reference did nothing because the prompt didn’t mention it. The model has no idea what to do with an unrelated image.

Mistake 2: spatially conflicting references

The prompt needs to mention all references — but there’s a second trap. References need to be spatially coherent. Our indoor cafe has walls and windows. The rooftop is open air. Even with a good prompt, these fight each other:

Indoor cafe + rooftop conflict

✘ Indoor cafe + rooftop: walls vs open air conflict

Outdoor terrace + rooftop harmony

✔ Outdoor terrace + rooftop: coherent, clean blend

The fix: replace the indoor cafe with an outdoor cafe terrace that agrees with the rooftop on being open air.

$ modl edit "a woman sitting at a rooftop cafe terrace at sunset, \
city skyline in the background, string lights, open air" \
--image ref-person-1.webp --image ref-outdoor-cafe.webp \
--image ref-rooftop.webp --base flux2-klein-9b --seed 42

String lights and wooden table from the terrace, skyline and railing from the rooftop, curly-haired woman from the portrait — no conflict, no compromise.

Scaling to 4 references

Add a beach and describe how it fits: a rooftop cafe overlooking the ocean.

$ modl edit "a woman at a rooftop cafe terrace overlooking \
a tropical beach at sunset, palm trees, ocean view, \
string lights, warm golden light, reading a book" \
--image ref-person-1.webp --image ref-outdoor-cafe.webp \
--image ref-rooftop.webp --image ref-beach.webp \
--base flux2-klein-9b --seed 42
Rooftop cafe overlooking tropical beach at sunset

Four coherent references: cafe table and string lights, rooftop railing, palm trees and ocean, woman reading. Every reference is visible.

All four references are clearly present — terrace furniture, railing, palm trees and turquoise water, sunset. The prompt gave each one a spatial role (“rooftop cafe terrace overlooking a tropical beach”) and the references don’t contradict each other.

Another composition: cafe-library

References that share an indoor context work together naturally:

$ modl edit "a woman sitting in a cafe that doubles as a library, \
bookshelves lining the walls, cozy reading nook, \
warm lamp light, coffee on the table" \
--image ref-person-1.webp --image ref-cafe.webp \
--image ref-library.webp --base flux2-klein-9b --seed 42
Woman in a book cafe with floor-to-ceiling shelves

Indoor cafe + library: both references agree on being indoors, so the 'book cafe' blends perfectly.

Two rules for multi-reference:

1. The prompt composes. Describe how references relate: “a rooftop cafe overlooking a beach” gives every reference a spatial role. References the prompt doesn’t mention have little to no effect.

2. References need coherence. Indoor cafe + outdoor rooftop = conflict. Outdoor terrace + rooftop = harmony. Pick references that agree on the basics (indoor/outdoor, lighting, scale) and let the prompt handle the rest.

Does order matter?

We tested three orderings of the same references (person + cafe):

Person first, cafe second

Person → Cafe

Cafe first, person second

Cafe → Person

Library, cafe, person

Library → Cafe → Person

Finding:

Order has minimal impact. All three orderings produced nearly identical results. The model treats references as a set, not a sequence — don’t worry about which image goes first.

Style transfer with references

Pass a style reference alongside a person to transform the visual language while keeping character identity.

Oil painting

References
Oil painting style reference
Result
Person rendered as oil painting

Person reference + oil painting reference → impressionist portrait.

$ modl edit "an oil painting of a woman in a garden, impressionist brushstrokes" \
--image ref-person-1.webp --image ref-style-oil.webp \
--base flux2-klein-9b --seed 42

Anime

References
Anime style reference
Result
Person rendered in anime style

Person reference + anime reference → Ghibli-style illustration with wavy brown hair.

Pixel art

Pixel art is where multi-reference really shines. The style reference locks in the aesthetic, and whatever you combine it with gets the 16-bit treatment.

Woman in pixel art castle scene

Person + pixel ref: castle scene

Woman in pixel art cyberpunk city

Person + pixel ref: cyberpunk rain

Pomeranian RPG warrior in pixel art style

Maxi (Pomeranian reference) + pixel art ref: an RPG warrior with sword and shield. The fur color and face shape carry over.

$ modl edit "a pomeranian warrior in a pixel art fantasy game, \
16-bit retro RPG style, sword and shield, green meadow" \
--image ref-maxi.webp --image ref-style-pixel.webp \
--base flux2-klein-9b --seed 42

The style reference dominates the visual language while the person/subject reference contributes identity cues (hair color, face shape, fur pattern). The prompt reinforces the style to get the strongest transfer.

Tip:

Reinforce the style in your prompt. “An oil painting of…” or “pixel art of…” helps the model commit to the style instead of blending it halfway with photorealism.

Character consistency across scenes

One of the most practical uses: take a single portrait and place that character in completely different environments.

1 Reference
Portrait reference
2 Rooftop
Same woman on a rooftop at sunset
3 Library
Same woman browsing books in a library
4 Beach
Same woman walking on a tropical beach

One reference portrait → three completely different scenes. Hair, face, and build stay consistent.

$ modl edit "a woman standing on a rooftop at sunset, city skyline" \
--image ref-person-1.webp --base flux2-klein-9b --seed 42
 
$ modl edit "a woman browsing books in a cozy library" \
--image ref-person-1.webp --base flux2-klein-9b --seed 42
 
$ modl edit "a woman walking along a tropical beach at sunset" \
--image ref-person-1.webp --base flux2-klein-9b --seed 42

Curly brown hair, similar face structure, even the clothing style carries over across wildly different scenes — all from a single reference photo, no training.

LoRA vs reference: the comparison

We compared three approaches using Maxi (a Pomeranian with a trained Klein 9B LoRA) for the same prompt.

Maxi reference photo

Reference photo

Maxi generated with LoRA only

LoRA only

Maxi generated with reference only

Reference only

Maxi generated with both LoRA and reference

LoRA + reference

# LoRA only
$ modl generate "photo of ohwx pomeranian in a sunny park" \
--lora maxi-klein-9b --base flux2-klein-9b --seed 42
 
# Reference only
$ modl edit "a pomeranian in a sunny park" \
--image ref-maxi.webp --base flux2-klein-9b --seed 42
 
# Both
$ modl edit "photo of ohwx pomeranian in a sunny park" \
--image ref-maxi.webp --lora maxi-klein-9b \
--base flux2-klein-9b --seed 42

All three produced strong Pomeranian results. Here’s a second scene to confirm:

Maxi with santa hat, LoRA only

LoRA only — Christmas scene

Maxi with santa hat, reference only

Reference only — Christmas scene

Finding:

References are surprisingly effective without any training. For many use cases — especially one-off generations or rapid prototyping — a reference photo gets you 80% of the way to a trained LoRA without the time investment.

MethodSetup timeConsistencyBest for
Reference0 (instant)GoodOne-off edits, quick experiments, style transfer
LoRA15-30 min trainingVery goodRepeated use of same subject, precise identity
Both15-30 min trainingExcellentMaximum fidelity when you need it

Mix & match: combining reference types

The real power is combining different types of references. Each image contributes a different dimension.

Each image contributes a different dimension — who, where, and how it looks:

Woman on rooftop

Person + location (2 refs)

Oil painting of woman

Person + style (2 refs)

The triple combo: person + location + style (3 refs)

Oil painting of woman on rooftop at sunset

Person + rooftop + oil painting style. Three references, each contributing a different dimension.

$ modl edit "an oil painting of a woman on a rooftop at sunset" \
--image ref-person-1.webp \
--image ref-rooftop.webp \
--image ref-style-oil.webp \
--base flux2-klein-9b --seed 42

This is the sweet spot: three references, each contributing a distinct dimension (who, where, how it looks). The model blends them cleanly because they don’t conflict.

Multiple characters in one scene

References also work for putting multiple distinct characters in the same frame — no LoRA needed. Pass one reference per character, describe the scene in the prompt.

$ modl edit "a woman with curly brown hair sitting on a kitchen floor with a pomeranian dog and a tiny orange tabby kitten, warm evening light, Pixar 3D animated style" \
--image ref-person.webp --image ref-dog.webp --image ref-kitten.webp \
--base flux2-klein-9b --seed 42 --size 16:9
3 characters ref-only, seed 42

Seed 42 — all 3 characters

3 characters ref-only, seed 77

Seed 77 — all 3 characters

3 characters ref-only, seed 111

Seed 111 — all 3 characters

2 characters ref-only, park scene

2-character scene — park bench

4 out of 4 seeds produced all 3 characters, clearly distinct. The woman keeps her curly brown hair, the Pomeranian stays a Pomeranian, and the kitten remains an orange tabby with blue eyes.

Why this works better than LoRA + prompt:

The illustrated storybook guide documented how a LoRA dominates attention and causes weaker characters to vanish — the kitten disappeared in 4 out of 6 seeds. With ref-only (no LoRA), references don’t compete for attention the same way. Each image contributes equally. The trade-off: ref-only characters are less precisely locked than LoRA characters. For storybooks or projects where you need the exact same face across dozens of generations, a LoRA is still better. For one-off scenes with multiple characters, ref-only is simpler and more reliable.

When references conflict

What happens when you pass two competing style references — an oil painting and an anime illustration?

Oil painting reference

Ref: oil painting

Anime reference

Ref: anime

Blended result

Result: hybrid blend

Klein handles conflicting references gracefully — it finds a middle ground rather than producing chaos. In this case, a painterly illustration with soft colors that borrows from both styles. This can actually be useful when you want a style that doesn’t exist in any single reference.

Tips and best practices

Tip:

Use high-quality references. The model can only work with what you give it. Clean, well-lit reference photos produce better results than noisy or compressed images.

What works well:

  • 2-3 references is the sweet spot — strong influence from each without dilution
  • Different types of references (person + place + style) complement each other
  • Prompt reinforcement — describe the style/scene in the prompt to help the model commit
  • Same seed for comparing different reference combinations

What to watch out for:

  • 5+ references — the model starts ignoring extras past 4
  • Conflicting references — the model won’t fail, but results are unpredictable blends
  • Prompt vs reference conflict — if your prompt says “cafe” but your reference is a beach, the prompt usually wins
  • Color shift — a known Klein issue where edited images have slightly different color temperature than references. Multiple seeds can help find one with less shift.

Klein 9B-KV: faster multi-reference

Tip:

Klein 9B-KV is not yet available in modl — support is in progress (PR #66). The section below explains what it is and why it matters.

BFL recently released Klein 9B-KV (Flux2KleinKVPipeline), a variant that caches reference image KV pairs across denoising steps. The result: up to 2.5x faster inference for multi-reference editing.

The speedup is most noticeable when:

  • Using 2+ reference images
  • Generating multiple variations with the same references
  • Building interactive editing pipelines

Same model quality, same outputs, just faster.

Quick reference

Multi-Reference Editing Commands

# Single reference
modl edit "prompt" --image ref.webp --base flux2-klein-9b

# Multiple references
modl edit "prompt" \
    --image person.webp \
    --image location.webp \
    --image style.webp \
    --base flux2-klein-9b

# With LoRA + reference
modl edit "prompt" \
    --image ref.webp \
    --lora my-lora \
    --base flux2-klein-9b

# Control output size
modl edit "prompt" \
    --image ref.webp \
    --base flux2-klein-9b \
    --size 1024x1024

Limits: Up to 4 references (BFL spec). 5 may work, 6+ will be ignored.

Models: flux2-klein-9b (recommended), flux2-klein-4b (faster, lower quality).

Steps: 4 (default, distilled model).

Why native multi-reference matters

A year ago, putting a specific person in a specific place with a specific style required a stack of adapters — IP-Adapter for the face, ControlNet for the pose, a style LoRA for the aesthetic, all wired together in a node graph. Each adapter added latency, VRAM, and failure modes. Getting two characters in the same frame meant doubling the adapter stack and hoping they didn’t conflict.

Klein changes this. Multi-reference editing is baked into the model architecture, not bolted on after training. The model was trained to accept multiple images as conditioning — it understands what to do with a person photo, a location photo, and a style reference in the same call. No adapters, no nodes, no glue code.

This is the direction the field is moving. Qwen baked in image understanding natively. Klein baked in multi-reference editing natively. The capabilities that used to require plugin ecosystems are becoming part of the model itself. That means less infrastructure to maintain, fewer things to break, and simpler workflows that are easier to automate — which is why an AI agent can orchestrate a full storybook pipeline with just modl edit and modl vision describe, no node graph required.

The best tool for the job a year from now probably won’t be the one with the most plugins. It’ll be the one that picked the right models and got out of the way.

Related guides: