Add multimodal support (ExLlamaV3) by Katehuuh · Pull Request #7174 · oobabooga/text-generation-webui · GitHub

Conversation

Katehuuh
Contributor

@Katehuuh Katehuuh commented Aug 5, 2025

Checklist:

Adds multimodal (vision) support for the new ExLlamaV3 loader, with image input through both the /v1/chat/completions and /v1/completions endpoints. Follows the patterns of PR #7027.

  • Integrate with the OpenAI-compatible API
  • Integrate with the UI and chat history
  • Add Native ExLlamaV3 multimodal
Quick Test

Tested in the WebUI with mistralai/Mistral-Small-3.2-24B-Instruct-2506 (and gemma-3-4b-it), the multimodal models used in exllamav3/examples/multimodal.py:

import requests
import base64

image_url = "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/cat.png?raw=true"
image_data = requests.get(image_url).content
image_b64 = base64.b64encode(image_data).decode('utf-8')

endpoints = [
    ("http://127.0.0.1:5000/v1/completions", "text"),
    ("http://127.0.0.1:5000/v1/chat/completions", "message")
]

payload = {
    "messages": [{
        "role": "user", 
        "content": [
            {"type": "text", "text": "What animal is this?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        ]
    }],
    "max_tokens": 50,
    "temperature": 0.7
}

for url, key in endpoints:
    print(f"\n--- {url.split('/')[-1]} ---")
    try:
        response = requests.post(url, json=payload, timeout=30)
        if response.status_code == 200:
            choice = response.json()['choices'][0]
            if key == 'message':
                content = choice.get('message', {}).get('content', '')
            else:
                content = choice.get('text', '')
            print(f"[OK] {content}")
        else:
            print(f"[FAIL] {response.status_code}")
    except Exception as e:
        print(f"[ERROR] {e}")

Notable Files Modified

  • modules/exllamav3.py - Added native ExLlamaV3 loader with vision component loading; MMEmbedding handling preserves token ID consistency (see the sketch after this list):
    vision_model = Model.from_config(config, component="vision")
    image_embeddings = vision_model.get_image_embeddings(tokenizer=tokenizer, image=pil_image)
  • extensions/openai/multimodal.py - Multimodal utilities
  • extensions/openai/completions.py - Unified image processing for both API endpoints
  • modules/chat.py - Image attachment handling
  • modules/ui_chat.py - Image upload support
  • modules/html_generator.py - Image preview rendering, WebUI integration
  • docs/12 - OpenAI API.md - Multimodal examples
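
For orientation, here is a minimal sketch of how the vision component and the MMEmbedding text aliases fit together. Model.from_config(config, component="vision") and get_image_embeddings(...) are taken from the PR and from turboderp's comments below; the imports and the Job/embeddings plumbing are assumptions based on exllamav3's multimodal example and may not match the merged code exactly:

from exllamav3 import Model, Job  # assumes config, tokenizer, generator are already set up by the loader
from PIL import Image

# Load the vision component alongside the text model (call shown in the PR).
vision_model = Model.from_config(config, component="vision")
vision_model.load()

# Encode one image into an MMEmbedding; its text_alias is a unique placeholder
# string that the tokenizer maps to a stable set of token IDs.
pil_image = Image.open("cat.png")
emb = vision_model.get_image_embeddings(tokenizer=tokenizer, image=pil_image)

# Splice the alias into the prompt exactly like ordinary text.
prompt = emb.text_alias + "\nWhat animal is this?"
input_ids = tokenizer.encode(prompt)

# Assumption: the embedding object is passed along with the job so the generator
# substitutes the real image features at the aliased token positions.
job = Job(input_ids=input_ids, max_new_tokens=50, embeddings=[emb])
generator.enqueue(job)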

oobabooga and others added 26 commits June 19, 2025 19:42
- Parses the request body to separate text from base64-encoded images.
- Utilizes the `multimodal` utility to generate image embeddings.
- Conditionally routes the request to a multimodal generation function when images are detected.
- Stores raw image data in the chat history for follow-up context.
Single-line command with a red 10x10 px image:
curl -X POST http://127.0.0.1:5000/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"What color is this image?\"}, {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAIAAAACUFjqAAAAEklEQVR4nGP8z4APMOGVHbHSAEEsAROxCnMTAAAAAElFTkSuQmCC\"}}]}]}"
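
For illustration, a minimal sketch of the parsing step described above, i.e. separating text parts from base64-encoded images in an OpenAI-style content list (hypothetical helper, not the PR's exact code):

import base64

def split_content(content):
    # Hypothetical helper: accepts either a plain string or an OpenAI-style
    # list of {"type": "text" | "image_url", ...} parts.
    if isinstance(content, str):
        return content, []
    text_parts, images = [], []
    for part in content:
        if part.get("type") == "text":
            text_parts.append(part["text"])
        elif part.get("type") == "image_url":
            url = part["image_url"]["url"]
            if url.startswith("data:"):
                # "data:image/png;base64,<payload>" -> raw image bytes
                images.append(base64.b64decode(url.split(",", 1)[1]))
    return "\n".join(text_parts), images
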
@oobabooga
Owner

Nice, amazing that you added a new exllamav3 loader as well. I intend to test and merge this soon.

@oobabooga oobabooga changed the base branch from main to dev August 8, 2025 20:34
@oobabooga
Owner

My changes were:

  • Refactor the PR to move the ExLlamav3-related code to modules/exllamav3.py, to keep things more modular.
  • Fix setting --loader exllamav3 through the command-line

The functionality should be 100% identical to the original; if not, let me know. I have reproduced your tests and everything worked after the changes.

Thanks to this PR, adding the llama.cpp multimodal should be easy if ggml-org/llama.cpp#15108 gets merged.

@oobabooga oobabooga merged commit 88127f4 into oobabooga:dev Aug 9, 2025
@oobabooga
Owner

I have reviewed each sampling parameter here and added some things like sampler priority

544c3a7

Maybe something is still missing; contributions are welcome.

@Katehuuh
Contributor Author

Katehuuh commented Aug 9, 2025

Nothing critical, but I'm passing along @turboderp's review comments.

Line 29:
Resetting the allocator shouldn't ever be needed. It just produces serial numbers, and the only requirement is that they're unique from image to image, so that a MM embedding can be represented either by a text string passed to the tokenizer or by a unique set of token IDs.
Line 83:
If you need to check if ExLlamaV3 supports vision for a given model, you can test for "vision" in config.model_classes.
Line 121:
Without going into what the state dictionary manages, I assume it just holds cached MMEmbedding objects(?), which is fine if that's the case. They can persist indefinitely and only use system memory, but do note that if they're also kept indefinitely you'll run out of RAM eventually. Ideally there should be some sort of cache that keeps the most recently used 50 images or whatever.

Crucially, there is no mechanism in ExLlama for identifying when you might be encoding the same image twice. So if you do this

prompt1 = vision_model.get_image_embeddings(tokenizer, image).text_alias + very_long_context_string
input_ids1 = tokenizer.encode(prompt1)

prompt2 = vision_model.get_image_embeddings(tokenizer, image).text_alias + very_long_context_string
input_ids2 = tokenizer.encode(prompt2)

Then input_ids1 and input_ids2 will not be identical, and since the two tokenized contexts diverge at the beginning you're not reusing any of the cache for very_long_context_string. So some caching is needed between chat rounds, at least, if you want to be able to insert images arbitrarily and not take a big performance hit.

Also note that if it suits the design you can keep a session-global store of image embeddings and add all of them to any given job. The generator will only care about the ones that are actually referenced either by their placeholder strings or by the mapped token IDs.
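
A minimal sketch of the bounded cache suggested under Line 121, keyed by the raw image bytes so a re-sent image reuses the same MMEmbedding (and therefore the same text alias and token IDs); names are hypothetical and this is not part of the PR:

import hashlib
from collections import OrderedDict

class ImageEmbeddingCache:
    def __init__(self, max_items=50):
        # Keep only the N most recently used embeddings to bound system memory use.
        self.max_items = max_items
        self._store = OrderedDict()

    def get(self, image_bytes, make_embedding):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)     # mark as most recently used
            return self._store[key]
        emb = make_embedding()               # e.g. vision_model.get_image_embeddings(...)
        self._store[key] = emb
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict the least recently used entry
        return emb
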
Line 178:
Using ComboSampler is fine here. It optimizes away the sampling steps that have no-op values. Though, I would never recommend a top-P value of 1.0. This gives you an always non-zero probability of sampling a completely wrong token. Stick to 0.95 at most, if you're doing sampling at all.
Line 238:
This looks like it's left over from the V2 implementation. V3's generate function takes a Sampler as an argument rather than a list of sampler settings.
Line 256:
Not sure where encode and decode are called from, but if you need consistency with a chat context then they should also forward the relevant image embeddings.
Line 272:
You should also call vision_model.unload() here. At the moment it doesn't make a difference, but I can't guarantee there won't ever be any unmanaged resources to clean up (like child processes and shared memory buffers in TP models).

@oobabooga
Owner

A detail I have noticed is that sometimes the context of a previous conversation leaks in the next one, and the outputs become garbage. There is probably some small bug in how the generator is implemented in the new exllamav3.py. I couldn't figure out what the issue is.

@Katehuuh
Contributor Author

"context of previous conversation leaks"

The problem doesn't look like it's in the generator queue:

finally: # Line 358:
    self.generator.clear_queue() # The queue is already cleared at the end to prevent job retention.

So maybe the context leak bug is due to self.cache (KV cache) retaining tokens from the previous conversation.
To fix, we could apply a minimal patch in modules/exllamav3.py:

def generate_with_streaming(self, prompt, state):
    """
    Generate text with streaming using native ExLlamaV3 API
    """
+   # Optional defensive measure: clear any pending jobs from the previous generation
+   if hasattr(self, 'generator') and self.generator:
+       self.generator.clear_queue()
+
+   # Clear KV cache for new conversations
+   if state.get('history', {}).get('internal', []) == []:
+       if hasattr(self, 'cache') and self.cache:
+           self.cache.clear()
+           logger.debug("Cache cleared for new conversation")

... # Optional (around line ~410):
    if hasattr(self, 'vision_model') and self.vision_model is not None:
        try:
+           self.vision_model.unload() # Per turboderp's review: call unload() before del; helps with tensor-parallel instances.
            del self.vision_model # Safe either way; del only affects the current process.

@oobabooga
Owner

I don't remember when I made this change, but I think that

self.generator.clear_queue()

fixed the context leakage issue.
