Releases: oobabooga/text-generation-webui
v3.15
Changes
- Log an error when a llama-server request exceeds the context size (#7263). Thanks, @mamei16.
- Make --trust-remote-code immutable from the UI/API for better security.
Bug fixes
- Fix metadata leaking into branched chats.
- Fix "continue" missing an initial space in chat-instruct/chat modes.
- Fix resuming incomplete downloads after HF moved to Xet.
- Revert exllamav3_hf changes in v3.14 that made it output gibberish.
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/f9fb33f2630b4b4ba9081ce9c0c921f8cd8ba4eb.
- Update exllamav3 to 0.0.10.
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.14
Changes
- Better handle multi-GPU setups when using Transformers with bitsandbytes (`load-in-8bit` and `load-in-4bit`).
- Implement the `/v1/internal/logits` endpoint for the `exllamav3` and `exllamav3_hf` loaders (see the sketch after this list).
- Make profile picture uploading safer.
- Add `fla` to the requirements for Exllamav3 to support `qwen3-next` models.
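
The new logits endpoint can be exercised with a small script like the one below. This is a minimal sketch, assuming the server is running on the default API port (5000) and that the request body accepts optional `use_samplers` and `top_logits` fields in addition to `prompt`; check the API documentation for the exact schema.

```python
# Minimal sketch of querying /v1/internal/logits with an exllamav3 model loaded.
# Field names beyond "prompt" (use_samplers, top_logits) are assumptions.
import requests

url = "http://127.0.0.1:5000/v1/internal/logits"
payload = {
    "prompt": "The capital of France is",
    "use_samplers": False,   # raw logits rather than post-sampling probabilities (assumed)
    "top_logits": 10,        # number of top tokens to return (assumed)
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()

# The response is expected to map candidate tokens to their scores.
for token, score in response.json().items():
    print(f"{token!r}: {score}")
```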
Bug fixes
- Fix an issue with loading certain chat histories in Instruct mode. Thanks, @Remowylliams.
- Fix portable builds for macOS x86 missing llama.cpp binaries (#7238). Thanks, @IonoclastBrigham.
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/d00cbea63c671cd85a57adaa50abf60b3b87d86f.
- Update transformers to 4.57.
- Update exllamav3 to 0.0.7.
- Update bitsandbytes to 0.48.
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.13
Bug fixes
- Don't use `$ $` for LaTeX, only `$$ $$`, to avoid broken rendering of text like `apples cost $1, oranges cost $2`
- Fix exllamav3 ignoring the stop button
- Fix a transformers issue when using --bf16 and Flash Attention 2 (#7217). Thanks, @stevenxdavis.
- Fix x86_64 macOS portable builds containing arm64 files
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/7f766929ca8e8e01dcceb1c526ee584f7e5e1408
- Update transformers to 4.56
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.12
Changes
- Characters can now think in `chat-instruct` mode! This was possible thanks to many simplifications and improvements to jinja2 template handling.
- Add support for the Seed-OSS-36B-Instruct template.
- Better handle the growth of the chat input textarea (before/after screenshots in the original release).
- Make the `--model` flag work with absolute paths for gguf models, like `--model /tmp/gemma-3-270m-it-IQ4_NL.gguf`
- Make venv portable installs work with Python 3.13
- Optimize LaTeX rendering during streaming for long replies
- Give streaming instruct messages more vertical space
- Preload the instruct and chat fonts for smoother startup
- Improve right sidebar borders in light mode
- Remove the `--flash-attn` flag (it's always on now in llama.cpp)
- Suppress "Attempted to select a non-interactive or hidden tab" console warnings, reducing the UI CPU usage during streaming
- Statically link the MSVC runtime to remove the Visual C++ Redistributable dependency on Windows for the llama.cpp binaries
- Make the llama.cpp terminal output with `--verbose` less verbose
Bug fixes
- llama.cpp: Fix stderr deadlock while loading some models
- llama.cpp: Fix obtaining the maximum sequence length for GPT-OSS
- Fix the UI failing to launch if the Notebook prompt is too long
- Fix LaTeX rendering for equations with asterisks
- Fix italic and quote colors in headings
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/9961d244f2df6baf40af2f1ddc0927f8d91578c8
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.11
Changes
- Add the Tensor Parallelism option to the ExLlamav3/ExLlamav3_HF loaders through the `--enable-tp` and `--tp-backend` options.
- Set multimodal status during model loading instead of checking on every generation (#7199). Thanks, @altoiddealer.
- Improve the multimodal API examples slightly.
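
For reference, a multimodal request against the OpenAI-compatible endpoint might look like the sketch below. It assumes the API accepts the usual `image_url` content parts with a base64 data URL and that the server is listening on the default port; consult the multimodal tutorial and the bundled API examples for the authoritative format.

```python
# Minimal sketch of a multimodal chat completion request (format assumed to
# follow the OpenAI image_url content-part convention).
import base64
import requests

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
    "max_tokens": 512,
}

response = requests.post("http://127.0.0.1:5000/v1/chat/completions", json=payload, timeout=120)
print(response.json()["choices"][0]["message"]["content"])
```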
Bug fixes
- Make web search functional again
- mtmd: Fix a bug when "include past attachments" is unchecked
- Fix code blocks having an extra empty line in the UI
Backend updates
- Update llama.cpp to ggml-org/llama.cpp@6d7f111
- Update ExLlamaV3 to 0.0.6
- Update flash-attention to 2.8.3
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.10 - Multimodal support!
See the Multimodal Tutorial

Changes
- Add multimodal support to the UI and API
- With the llama.cpp loader (#7027). This was possible thanks to PR ggml-org/llama.cpp#15108 to llama.cpp. Thanks @65a.
- With ExLlamaV3 through a new ExLlamaV3 loader (#7174). Thanks @Katehuuh.
- Add speculative decoding to the new ExLlamaV3 loader.
- Use ExLlamav3 instead of ExLlamav3_HF by default for EXL3 models, since it supports multimodal and speculative decoding.
- Support loading chat templates from `chat_template.json` files (EXL3/EXL2/Transformers models)
- Default max_tokens to 512 in the API instead of 16
- Better organize the right sidebar in the UI
- llama.cpp: Pass `--swa-full` to llama-server when `streaming-llm` is checked to make it work for models with SWA.
Bug fixes
- Fix getting the ctx-size for newer EXL3/EXL2/Transformers models
- Fix the exllamav2 loader ignoring add_bos_token
- Fix the color of italic text in chat messages
- Fix edit window and buttons in Messenger theme (#7100). Thanks @mykeehu.
Backend updates
- Bump llama.cpp to ggml-org/llama.cpp@f4586ee
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.9.1
Changes
- Several improvements to the GPT-OSS template handling. Special actions like "Continue" and "Impersonate" now work correctly.
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/5fd160bbd9d70b94b5b11b0001fd7f477005e4a0
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.9
Experimental GPT-OSS support!
I have had some success with the GGUF models under
https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main
https://huggingface.co/ggml-org/gpt-oss-120b-GGUF/tree/main
It may be necessary to re-download these models in the coming days if bugs are found, so make sure to recheck those pages.
Changes
- Add a new Reasoning effort UI element in the chat tab, with `low`, `medium`, and `high` options for GPT-OSS
- Support standalone .jinja chat templates -- this makes it possible to load GPT-OSS through Transformers (see the sketch after this list)
- Make web search functional with thinking models
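
A standalone chat template is just a jinja2 file rendered against the message list. The sketch below is purely illustrative: the template text and file layout are generic placeholders, not the actual GPT-OSS template.

```python
# Illustrative rendering of a standalone .jinja chat template with jinja2.
# The template body below is a placeholder, not the GPT-OSS template.
from jinja2 import Template

template_text = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>{{ message['content'] }}<|end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>{% endif %}"
)

prompt = Template(template_text).render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    add_generation_prompt=True,
)
print(prompt)
```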
Bug fixes
- Fix an edge case in chat history loading that caused a crash (closes #7155)
- Handle both int and str types in grammar char processing (fixes a rare crash when using grammar)
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/fd1234cb468935ea087d6929b2487926c3afff4b
- Update Transformers to 4.55 (adds GPT-OSS support)
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.8
Changes
- Replace `use_flash_attention_2`/`use_eager_attention` with a unified `attn_implementation` option in the Transformers loader (see the sketch after this list)
- Ignore `add_bos_token` in instruct prompts and let the jinja2 template decide
- Add a "None" option for the speculative decoding model
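
On the Transformers side, the unified option corresponds to the `attn_implementation` argument of `from_pretrained`, roughly as sketched below (the model id is a placeholder).

```python
# Sketch of what the unified attn_implementation option maps to in Transformers:
# a single from_pretrained argument instead of separate flash-attention/eager flags.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",             # placeholder model id
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # or "sdpa" / "eager"
)
```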
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/90083283ec254fa8d33897746dea229aee401b37
- Update Transformers to 4.53
- Also update bitsandbytes/Accelerate/PEFT to the latest versions
- Update ExLlamaV3 to 0.0.5
- Update ExLlamaV2 to 0.3.2
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.7.1
Changes
- Chat tab improvements:
  - Move the 'Enable thinking' checkbox from the Parameters tab to the right sidebar
  - Keep the last chat message visible as the input area grows
  - Optimize chat scrolling again (I think that will be the last time -- it's really responsive now)
  - Replace 'Generate' with 'Send' in the main button
- Support installing user extensions in `user_data/extensions/` for convenience (a minimal sketch follows this list)
- Small UI optimizations and style improvements
- Block model and session backend events in `--multi-user` mode (#7098). Thanks @Alidr79
- One-click installer: Use miniforge instead of miniconda to avoid Anaconda licensing issues for organizations with 200+ people
- Standardize margins and paddings across all chat styles (new in 3.7.1)
- Update the keyboard shortcuts documentation (new in 3.7.1)
- docs: Add Mirostat Explanation (#7128). Thanks @Cats1337. (new in 3.7.1)
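
A user extension is a folder under `user_data/extensions/` containing a `script.py`. The sketch below is a minimal, assumed layout with a single `output_modifier` hook; the exact hook names and signatures depend on your version, so check the extensions documentation.

```python
# user_data/extensions/my_extension/script.py
# Minimal sketch of a user extension; hook names/signatures are assumptions
# based on the extensions documentation.

params = {
    "display_name": "My Extension",  # assumed optional metadata
}

def output_modifier(string, state, is_chat=False):
    """Append a marker to every model reply (illustrative only)."""
    return string + "\n\n-- processed by my_extension"
```

It could then be enabled with the `--extensions my_extension` flag or from the Session tab.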
Bug fixes
- Fix the DuckDuckGo search
- Fix scrolling during streaming when thinking blocks are present
- Fix chat history getting lost if the UI is inactive for a long time
- Fix chat sidebars toggle buttons disappearing (#7106). Thanks @philipp-classen
- Fix autoscroll after initial fonts loading
- Handle either a missing `<think>` start tag or a missing `</think>` end tag (#7102). Thanks @zombiegreedo
- Fix custom stopping strings being reset when switching models
- Fix navigation icons temporarily hiding when switching message versions (new in 3.7.1)
- Revert "Keep the last chat message visible as the input area grows", as it was very glitchy (new in 3.7.1)
Backend updates
- Bump llama.cpp to ggml-org/llama.cpp@6491d6e
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.