Add multimodal support (ExLlamaV3) by Katehuuh · Pull Request #7174 · oobabooga/text-generation-webui · GitHub

Conversation

Katehuuh
Contributor

@Katehuuh Katehuuh commented Aug 5, 2025

Checklist:

Adds multimodal (vision) support for the new ExLlamaV3 loader, with image input through both the /v1/chat/completions and /v1/completions endpoints. Follows the patterns of PR #7027.

  • Integrate with the OpenAI-compatible API
  • Integrate with the UI and chat history
  • Add Native ExLlamaV3 multimodal
Quick Test

Tested in the WebUI with mistralai/Mistral-Small-3.2-24B-Instruct-2506 (and gemma-3-4b-it), the multimodal models used in exllamav3/examples/multimodal.py:

import requests
import base64

image_url = "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/cat.png?raw=true"
image_data = requests.get(image_url).content
image_b64 = base64.b64encode(image_data).decode('utf-8')

endpoints = [
    ("http://127.0.0.1:5000/v1/completions", "text"),
    ("http://127.0.0.1:5000/v1/chat/completions", "message")
]

payload = {
    "messages": [{
        "role": "user", 
        "content": [
            {"type": "text", "text": "What animal is this?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        ]
    }],
    "max_tokens": 50,
    "temperature": 0.7
}

for url, key in endpoints:
    print(f"\n--- {url.split('/')[-1]} ---")
    try:
        response = requests.post(url, json=payload, timeout=30)
        if response.status_code == 200:
            choice = response.json()['choices'][0]
            if key == 'message':
                content = choice.get('message', {}).get('content', '')
            else:
                content = choice.get('text', '')
            print(f"[OK] {content}")
        else:
            print(f"[FAIL] {response.status_code}")
    except Exception as e:
        print(f"[ERROR] {e}")

Notable Files Modified

  • modules/exllamav3.py - Added native ExLlamaV3 loader with vision component loading; MMEmbedding handling preserves token ID consistency (see the sketch after this list):
    vision_model = Model.from_config(config, component="vision")
    image_embeddings = vision_model.get_image_embeddings(tokenizer=tokenizer, image=pil_image)
  • extensions/openai/multimodal.py - Multimodal utilities
  • extensions/openai/completions.py - Unified image processing for both API endpoints
  • modules/chat.py - Image attachment handling
  • modules/ui_chat.py - Image upload support
  • modules/html_generator.py - Image preview rendering, WebUI integration
  • docs/12 - OpenAI API.md - Multimodal examples
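
For orientation, here is a minimal sketch of how the vision component and the MMEmbedding text aliases fit together. Model.from_config(config, component="vision") and get_image_embeddings(...) are taken from the PR and from turboderp's comments below; the imports and the Job/embeddings plumbing are assumptions based on exllamav3's multimodal example and may not match the merged code exactly:

from exllamav3 import Model, Job  # assumes config, tokenizer, generator are already set up by the loader
from PIL import Image

# Load the vision component alongside the text model (call shown in the PR).
vision_model = Model.from_config(config, component="vision")
vision_model.load()

# Encode one image into an MMEmbedding; its text_alias is a unique placeholder
# string that the tokenizer maps to a stable set of token IDs.
pil_image = Image.open("cat.png")
emb = vision_model.get_image_embeddings(tokenizer=tokenizer, image=pil_image)

# Splice the alias into the prompt exactly like ordinary text.
prompt = emb.text_alias + "\nWhat animal is this?"
input_ids = tokenizer.encode(prompt)

# Assumption: the embedding object is passed along with the job so the generator
# substitutes the real image features at the aliased token positions.
job = Job(input_ids=input_ids, max_new_tokens=50, embeddings=[emb])
generator.enqueue(job)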

oobabooga and others added 26 commits June 19, 2025 19:42
- Parses the request body to separate text from base64-encoded images.
- Utilizes the `multimodal` utility to generate image embeddings.
- Conditionally routes the request to a multimodal generation function when images are detected.
- Stores raw image data in the chat history for follow-up context.
Single-line command with a red 10x10 px image:
curl -X POST http://127.0.0.1:5000/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"What color is this image?\"}, {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAIAAAACUFjqAAAAEklEQVR4nGP8z4APMOGVHbHSAEEsAROxCnMTAAAAAElFTkSuQmCC\"}}]}]}"
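
For illustration, a minimal sketch of the parsing step described above, i.e. separating text parts from base64-encoded images in an OpenAI-style content list (hypothetical helper, not the PR's exact code):

import base64

def split_content(content):
    # Hypothetical helper: accepts either a plain string or an OpenAI-style
    # list of {"type": "text" | "image_url", ...} parts.
    if isinstance(content, str):
        return content, []
    text_parts, images = [], []
    for part in content:
        if part.get("type") == "text":
            text_parts.append(part["text"])
        elif part.get("type") == "image_url":
            url = part["image_url"]["url"]
            if url.startswith("data:"):
                # "data:image/png;base64,<payload>" -> raw image bytes
                images.append(base64.b64decode(url.split(",", 1)[1]))
    return "\n".join(text_parts), images
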
@oobabooga
Owner

Nice, amazing that you added a new exllamav3 loader as well. I intend to test and merge this soon.

@oobabooga oobabooga changed the base branch from main to dev August 8, 2025 20:34
@oobabooga
Owner

My changes were:

  • Refactor the PR to move the ExLlamav3-related code to modules/exllamav3.py, to keep things more modular.
  • Fix setting --loader exllamav3 through the command-line

The functionality should be 100% identical to the original; if not, let me know. I have reproduced your tests and everything worked after the changes.

Thanks to this PR, adding the llama.cpp multimodal should be easy if ggml-org/llama.cpp#15108 gets merged.

@oobabooga oobabooga merged commit 88127f4 into oobabooga:dev Aug 9, 2025
@oobabooga
Owner

I have reviewed each sampling parameter here and added some things like sampler priority

544c3a7

Maybe something is still missing; contributions are welcome.

@Katehuuh
Contributor Author

Katehuuh commented Aug 9, 2025

Nothing critical, but I'm passing along @turboderp's review comments.

Line 29:
Resetting the allocator shouldn't ever be needed. It just produces serial numbers, and the only requirement is that they're unique from image to image, so that a MM embedding can be represented either by a text string passed to the tokenizer or by a unique set of token IDs.
Line 83:
If you need to check if ExLlamaV3 supports vision for a given model, you can test for "vision" in config.model_classes.
Line 121:
Without going into what the state dictionary manages, I assume it just holds cached MMEmbedding objects(?), which is fine if that's the case. They can persist indefinitely and only use system memory, but do note that if they're also kept indefinitely you'll run out of RAM eventually. Ideally there should be some sort of cache that keeps the most recently used 50 images or whatever.

Crucially, there is no mechanism in ExLlama for identifying when you might be encoding the same image twice. So if you do this

prompt1 = vision_model.get_image_embeddings(tokenizer, image).text_alias + very_long_context_string
input_ids1 = tokenizer.encode(prompt1)

prompt2 = vision_model.get_image_embeddings(tokenizer, image).text_alias + very_long_context_string
input_ids2 = tokenizer.encode(prompt2)

Then input_ids1 and input_ids2 will not be identical, and since the two tokenized contexts diverge at the beginning you're not reusing any of the cache for very_long_context_string. So some caching is needed between chat rounds, at least, if you want to be able to insert images arbitrarily and not take a big performance hit.

Also note that if it suits the design you can keep a session-global store of image embeddings and add all of them to any given job. The generator will only care about the ones that are actually referenced either by their placeholder strings or by the mapped token IDs.
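
A minimal sketch of the bounded cache suggested under Line 121, keyed by the raw image bytes so a re-sent image reuses the same MMEmbedding (and therefore the same text alias and token IDs); names are hypothetical and this is not part of the PR:

import hashlib
from collections import OrderedDict

class ImageEmbeddingCache:
    def __init__(self, max_items=50):
        # Keep only the N most recently used embeddings to bound system memory use.
        self.max_items = max_items
        self._store = OrderedDict()

    def get(self, image_bytes, make_embedding):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)     # mark as most recently used
            return self._store[key]
        emb = make_embedding()               # e.g. vision_model.get_image_embeddings(...)
        self._store[key] = emb
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict the least recently used entry
        return emb
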
Line 178:
Using ComboSampler is fine here. It optimizes away the sampling steps that have no-op values. Though, I would never recommend a top-P value of 1.0. This gives you an always non-zero probability of sampling a completely wrong token. Stick to 0.95 at most, if you're doing sampling at all.
Line 238:
This looks like it's left over from the V2 implementation. V3's generate function takes a Sampler as an argument rather than a list of sampler settings.
Line 256:
Not sure where encode and decode are called from, but if you need consistency with a chat context then they should also forward the relevant image embeddings.
Line 272:
You should also call vision_model.unload() here. At the moment it doesn't make a difference, but I can't guarantee there won't ever be any unmanaged resources to clean up (like child processes and shared memory buffers in TP models).

@oobabooga
Owner

A detail I have noticed is that sometimes the context of a previous conversation leaks in the next one, and the outputs become garbage. There is probably some small bug in how the generator is implemented in the new exllamav3.py. I couldn't figure out what the issue is.

@Katehuuh
Contributor Author

"context of previous conversation leaks"

The problem doesn't look like it's in the generator queue:

finally: # Line 358:
    self.generator.clear_queue() # The queue is already cleared at the end to prevent job retention.

So maybe the context leak bug is due to self.cache (KV cache) retaining tokens from the previous conversation.
To fix, we could apply a minimal patch in modules/exllamav3.py:

def generate_with_streaming(self, prompt, state):
    """
    Generate text with streaming using native ExLlamaV3 API
    """
+   # Optional defensive measure: clear any pending jobs from the previous generation
+   if hasattr(self, 'generator') and self.generator:
+       self.generator.clear_queue()
+
+   # Clear KV cache for new conversations
+   if state.get('history', {}).get('internal', []) == []:
+       if hasattr(self, 'cache') and self.cache:
+           self.cache.clear()
+           logger.debug("Cache cleared for new conversation")

... # Optional (around line ~410):
    if hasattr(self, 'vision_model') and self.vision_model is not None:
        try:
+           self.vision_model.unload() # Per turboderp's review: call unload() before del; helps with tensor-parallel instances.
            del self.vision_model # Safe either way; del only affects the current process.

@oobabooga
Owner

I don't remember when I made this change, but I think that

self.generator.clear_queue()

fixed the context leakage issue.
