Add multimodal support (ExLlamaV3) #7174
Conversation
- Parses the request body to separate text from base64-encoded images.
- Utilizes the `multimodal` utility to generate image embeddings.
- Conditionally routes the request to a multimodal generation function when images are detected.
- Stores raw image data in the chat history for follow-up context.
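As a rough sketch of that separation step (not the PR's actual code; the helper name `split_text_and_images` is an assumption, and the message structure follows the OpenAI chat format):

```python
import base64
import io

from PIL import Image


def split_text_and_images(messages):
    """Separate plain text from base64-encoded images in OpenAI-style messages."""
    text_parts, images = [], []
    for message in messages:
        content = message.get("content", "")
        if isinstance(content, str):
            text_parts.append(content)
            continue
        for part in content:  # content is a list of typed parts
            if part.get("type") == "text":
                text_parts.append(part["text"])
            elif part.get("type") == "image_url":
                url = part["image_url"]["url"]
                if url.startswith("data:"):
                    # "data:image/png;base64,<payload>" -> decoded PIL image
                    b64_data = url.split(",", 1)[1]
                    images.append(Image.open(io.BytesIO(base64.b64decode(b64_data))))
    return "\n".join(text_parts), images
```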
Single-line command with a red 10x10 px image:

```
curl -X POST http://127.0.0.1:5000/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"What color is this image?\"}, {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAIAAAACUFjqAAAAEklEQVR4nGP8z4APMOGVHbHSAEEsAROxCnMTAAAAAElFTkSuQmCC\"}}]}]}"
```
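For reference, the same request sent from Python with the `requests` library (same endpoint and payload as the curl command above):

```python
import requests

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What color is this image?"},
            {"type": "image_url", "image_url": {
                "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAIAAAACUFjqAAAAEklEQVR4nGP8z4APMOGVHbHSAEEsAROxCnMTAAAAAElFTkSuQmCC"
            }},
        ],
    }],
}

response = requests.post("http://127.0.0.1:5000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```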
…holders in prompts
Nice, amazing that you added a new exllamav3 loader as well. I intend to test and merge this soon.
This reverts commit bfc87b4.
My changes were:
The functionality should be 100% identical to the original; if not, let me know. I have reproduced your tests and everything worked after the changes. Thanks to this PR, adding llama.cpp multimodal support should be easy if ggml-org/llama.cpp#15108 gets merged.
I have reviewed each sampling parameter here and added some things like sampler priority. Maybe something is missing; contributions are welcome.
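For context, a hedged example of a chat request carrying sampling parameters; `sampler_priority` is assumed here to accept a list of sampler names, and the exact set of fields the loader honors may differ:

```python
import requests

payload = {
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    # Illustrative only: order in which samplers are applied
    "sampler_priority": ["top_k", "top_p", "temperature"],
}

response = requests.post("http://127.0.0.1:5000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```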
Nothing critical, but I'm passing along @turboderp's review comments.
A detail I have noticed is that sometimes the context of a previous conversation leaks into the next one, and the outputs become garbage. There is probably some small bug in how the generator is implemented in the new exllamav3.py. I couldn't figure out what the issue is.
The problem doesn't look like it's in the generator queue:

```python
finally:  # modules/exllamav3.py, line 358
    self.generator.clear_queue()  # Already clears the queue at the end to prevent job retention
```

So maybe the context leak bug is in `generate_with_streaming`. A possible fix:

```diff
 def generate_with_streaming(self, prompt, state):
     """
     Generate text with streaming using native ExLlamaV3 API
     """
+    # Optional defensive step: clear any pending jobs from the previous generation
+    if hasattr(self, 'generator') and self.generator:
+        self.generator.clear_queue()
+
+    # Clear the KV cache for new conversations
+    if state.get('history', {}).get('internal', []) == []:
+        if hasattr(self, 'cache') and self.cache:
+            self.cache.clear()
+            logger.debug("Cache cleared for new conversation")
     ...
```

Optionally (around line ~410):

```diff
 if hasattr(self, 'vision_model') and self.vision_model is not None:
     try:
+        self.vision_model.unload()  # Per turboderp's review: unload before del; helps for tensor-parallel instances
         del self.vision_model  # del only affects the current process
```
I don't remember when I made this change, but I think that `modules/exllamav3.py` line 358 (commit cb00db1) fixed the context leakage issue.
Checklist:
Adds multimodal (vision) support for the new ExLlamaV3 loader, with image input through both `/v1/chat/completions` and `/v1/completions` endpoints. Follows PR #7027 patterns. I know you like …

Quick Test

WebUI tested with `mistralai/Mistral-Small-3.2-24B-Instruct-2506` (and `gemma-3-4b-it`), the multimodal models used in exllamav3/examples/multimodal.py.

Notable Files Modified
- `modules/exllamav3.py` - Added native ExLlamaV3 loader with vision component loading; MMEmbedding handling preserves token ID consistency
- `extensions/openai/multimodal.py` - Multimodal utilities
- `extensions/openai/completions.py` - Unified image processing for both API endpoints
- `modules/chat.py` - Image attachment handling
- `modules/ui_chat.py` - Image upload support
- `modules/html_generator.py` - Image preview rendering, WebUI integration
- `docs/12 - OpenAI API.md` - Multimodal examples (see the usage sketch after this list)
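A usage sketch in the spirit of those docs, using the official `openai` Python client against the local server (the model name and image payload are placeholders, not values from this PR):

```python
from openai import OpenAI

# The local server does not check the API key, but the client requires one
client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="sk-no-key-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder; the currently loaded model is used
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }],
)
print(response.choices[0].message.content)
```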