Releases: oobabooga/text-generation-webui
v3.15
Changes
- Log an error when a llama-server request exceeds the context size (#7263). Thanks, @mamei16.
- Make --trust-remote-code immutable from the UI/API for better security.
Bug fixes
- Fix metadata leaking into branched chats.
- Fix "continue" missing an initial space in chat-instruct/chat modes.
- Fix resuming incomplete downloads after HF moved to Xet.
- Revert exllamav3_hf changes in v3.14 that made it output gibberish.
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/f9fb33f2630b4b4ba9081ce9c0c921f8cd8ba4eb.
- Update exllamav3 to 0.0.10.
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.14
Changes
- Better handle multi-GPU setups when using Transformers with bitsandbytes (`load-in-8bit` and `load-in-4bit`).
- Implement the `/v1/internal/logits` endpoint for the `exllamav3` and `exllamav3_hf` loaders (see the sketch after this list).
- Make profile picture uploading safer.
- Add `fla` to the requirements for Exllamav3 to support `qwen3-next` models.
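
The new logits endpoint can be exercised with a small script like the one below. This is a minimal sketch, assuming the server is running on the default API port (5000) and that the request body accepts optional `use_samplers` and `top_logits` fields in addition to `prompt`; check the API documentation for the exact schema.

```python
# Minimal sketch of querying /v1/internal/logits with an exllamav3 model loaded.
# Field names beyond "prompt" (use_samplers, top_logits) are assumptions.
import requests

url = "http://127.0.0.1:5000/v1/internal/logits"
payload = {
    "prompt": "The capital of France is",
    "use_samplers": False,   # raw logits rather than post-sampling probabilities (assumed)
    "top_logits": 10,        # number of top tokens to return (assumed)
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()

# The response is expected to map candidate tokens to their scores.
for token, score in response.json().items():
    print(f"{token!r}: {score}")
```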
Bug fixes
- Fix an issue with loading certain chat histories in Instruct mode. Thanks, @Remowylliams.
- Fix portable builds for macOS x86 missing llama.cpp binaries (#7238). Thanks, @IonoclastBrigham.
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/d00cbea63c671cd85a57adaa50abf60b3b87d86f.
- Update transformers to 4.57.
- Update exllamav3 to 0.0.7.
- Update bitsandbytes to 0.48.
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.13
Bug fixes
- Don't use `$ $` for LaTeX, only `$$ $$`, to avoid broken rendering of text like `apples cost $1, oranges cost $2`
- Fix exllamav3 ignoring the stop button
- Fix a transformers issue when using --bf16 and Flash Attention 2 (#7217). Thanks, @stevenxdavis.
- Fix x86_64 macOS portable builds containing arm64 files
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/7f766929ca8e8e01dcceb1c526ee584f7e5e1408
- Update transformers to 4.56
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.12
Changes
- Characters can now think in `chat-instruct` mode! This was possible thanks to many simplifications and improvements to jinja2 template handling.
- Add support for the Seed-OSS-36B-Instruct template.
- Better handle the growth of the chat input textarea (before/after screenshots in the original release).
- Make the `--model` flag work with absolute paths for gguf models, like `--model /tmp/gemma-3-270m-it-IQ4_NL.gguf`
- Make venv portable installs work with Python 3.13
- Optimize LaTeX rendering during streaming for long replies
- Give streaming instruct messages more vertical space
- Preload the instruct and chat fonts for smoother startup
- Improve right sidebar borders in light mode
- Remove the `--flash-attn` flag (it's always on now in llama.cpp)
- Suppress "Attempted to select a non-interactive or hidden tab" console warnings, reducing the UI CPU usage during streaming
- Statically link the MSVC runtime to remove the Visual C++ Redistributable dependency on Windows for the llama.cpp binaries
- Make the llama.cpp terminal output with `--verbose` less verbose
Bug fixes
- llama.cpp: Fix stderr deadlock while loading some models
- llama.cpp: Fix obtaining the maximum sequence length for GPT-OSS
- Fix the UI failing to launch if the Notebook prompt is too long
- Fix LaTeX rendering for equations with asterisks
- Fix italic and quote colors in headings
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/9961d244f2df6baf40af2f1ddc0927f8d91578c8
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.11
Changes
- Add the Tensor Parallelism option to the ExLlamav3/ExLlamav3_HF loaders through the `--enable-tp` and `--tp-backend` options.
- Set multimodal status during model loading instead of checking on every generation (#7199). Thanks, @altoiddealer.
- Improve the multimodal API examples slightly.
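
For reference, a multimodal request against the OpenAI-compatible endpoint might look like the sketch below. It assumes the API accepts the usual `image_url` content parts with a base64 data URL and that the server is listening on the default port; consult the multimodal tutorial and the bundled API examples for the authoritative format.

```python
# Minimal sketch of a multimodal chat completion request (format assumed to
# follow the OpenAI image_url content-part convention).
import base64
import requests

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
    "max_tokens": 512,
}

response = requests.post("http://127.0.0.1:5000/v1/chat/completions", json=payload, timeout=120)
print(response.json()["choices"][0]["message"]["content"])
```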
Bug fixes
- Make web search functional again
- mtmd: Fix a bug when "include past attachments" is unchecked
- Fix code blocks having an extra empty line in the UI
Backend updates
- Update llama.cpp to ggml-org/llama.cpp@6d7f111
- Update ExLlamaV3 to 0.0.6
- Update flash-attention to 2.8.3
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.10 - Multimodal support!
See the Multimodal Tutorial

Changes
- Add multimodal support to the UI and API
- With the llama.cpp loader (#7027). This was possible thanks to PR ggml-org/llama.cpp#15108 to llama.cpp. Thanks @65a.
- With ExLlamaV3 through a new ExLlamaV3 loader (#7174). Thanks @Katehuuh.
- Add speculative decoding to the new ExLlamaV3 loader.
- Use ExLlamav3 instead of ExLlamav3_HF by default for EXL3 models, since it supports multimodal and speculative decoding.
- Support loading chat templates from `chat_template.json` files (EXL3/EXL2/Transformers models)
- Default max_tokens to 512 in the API instead of 16
- Better organize the right sidebar in the UI
- llama.cpp: Pass `--swa-full` to llama-server when `streaming-llm` is checked to make it work for models with SWA.
Bug fixes
- Fix getting the ctx-size for newer EXL3/EXL2/Transformers models
- Fix the exllamav2 loader ignoring add_bos_token
- Fix the color of italic text in chat messages
- Fix edit window and buttons in Messenger theme (#7100). Thanks @mykeehu.
Backend updates
- Bump llama.cpp to ggml-org/llama.cpp@f4586ee
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.9.1
Changes
- Several improvements to the GPT-OSS template handling. Special actions like "Continue" and "Impersonate" now work correctly.
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/5fd160bbd9d70b94b5b11b0001fd7f477005e4a0
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.9
Experimental GPT-OSS support!
I have had some success with the GGUF models under
https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main
https://huggingface.co/ggml-org/gpt-oss-120b-GGUF/tree/main
It may be necessary to re-download these models in the coming days if bugs are found, so make sure to recheck those pages.
Changes
- Add a new Reasoning effort UI element in the chat tab, with `low`, `medium`, and `high` options for GPT-OSS
- Support standalone .jinja chat templates -- this makes it possible to load GPT-OSS through Transformers (see the sketch after this list)
- Make web search functional with thinking models
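
A standalone chat template is just a jinja2 file rendered against the message list. The sketch below is purely illustrative: the template text and file layout are generic placeholders, not the actual GPT-OSS template.

```python
# Illustrative rendering of a standalone .jinja chat template with jinja2.
# The template body below is a placeholder, not the GPT-OSS template.
from jinja2 import Template

template_text = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>{{ message['content'] }}<|end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>{% endif %}"
)

prompt = Template(template_text).render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    add_generation_prompt=True,
)
print(prompt)
```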
Bug fixes
- Fix an edge case in chat history loading that caused a crash (closes #7155)
- Handle both int and str types in grammar char processing (fixes a rare crash when using grammar)
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/fd1234cb468935ea087d6929b2487926c3afff4b
- Update Transformers to 4.55 (adds GPT-OSS support)
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.8
Changes
- Replace `use_flash_attention_2`/`use_eager_attention` with a unified `attn_implementation` option in the Transformers loader (see the sketch after this list)
- Ignore `add_bos_token` in instruct prompts and let the jinja2 template decide
- Add a "None" option for the speculative decoding model
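
On the Transformers side, the unified option corresponds to the `attn_implementation` argument of `from_pretrained`, roughly as sketched below (the model id is a placeholder).

```python
# Sketch of what the unified attn_implementation option maps to in Transformers:
# a single from_pretrained argument instead of separate flash-attention/eager flags.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",             # placeholder model id
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # or "sdpa" / "eager"
)
```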
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/90083283ec254fa8d33897746dea229aee401b37
- Update Transformers to 4.53
- Also update bitsandbytes/Accelerate/PEFT to the latest versions
- Update ExLlamaV3 to 0.0.5
- Update ExLlamaV2 to 0.3.2
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.7.1
Changes
- Chat tab improvements:
  - Move the 'Enable thinking' checkbox from the Parameters tab to the right sidebar
  - Keep the last chat message visible as the input area grows
  - Optimize chat scrolling again (I think that will be the last time -- it's really responsive now)
  - Replace 'Generate' with 'Send' in the main button
- Support installing user extensions in `user_data/extensions/` for convenience (a minimal sketch follows this list)
- Small UI optimizations and style improvements
- Block model and session backend events in `--multi-user` mode (#7098). Thanks @Alidr79
- One-click installer: Use miniforge instead of miniconda to avoid Anaconda licensing issues for organizations with 200+ people
- Standardize margins and paddings across all chat styles (new in 3.7.1)
- Update the keyboard shortcuts documentation (new in 3.7.1)
- docs: Add Mirostat Explanation (#7128). Thanks @Cats1337. (new in 3.7.1)
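
A user extension is a folder under `user_data/extensions/` containing a `script.py`. The sketch below is a minimal, assumed layout with a single `output_modifier` hook; the exact hook names and signatures depend on your version, so check the extensions documentation.

```python
# user_data/extensions/my_extension/script.py
# Minimal sketch of a user extension; hook names/signatures are assumptions
# based on the extensions documentation.

params = {
    "display_name": "My Extension",  # assumed optional metadata
}

def output_modifier(string, state, is_chat=False):
    """Append a marker to every model reply (illustrative only)."""
    return string + "\n\n-- processed by my_extension"
```

It could then be enabled with the `--extensions my_extension` flag or from the Session tab.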
Bug fixes
- Fix the DuckDuckGo search
- Fix scrolling during streaming when thinking blocks are present
- Fix chat history getting lost if the UI is inactive for a long time
- Fix chat sidebars toggle buttons disappearing (#7106). Thanks @philipp-classen
- Fix autoscroll after initial fonts loading
- Handle either a missing `<think>` start tag or a missing `</think>` end tag (#7102). Thanks @zombiegreedo
- Fix custom stopping strings being reset when switching models
- Fix navigation icons temporarily hiding when switching message versions (new in 3.7.1)
- Revert "Keep the last chat message visible as the input area grows", as it was very glitchy (new in 3.7.1)
Backend updates
- Bump llama.cpp to ggml-org/llama.cpp@6491d6e
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.