[feat] Add TensorRT-Engine Qwen3 (dense) model support by gkswns0531 · Pull Request #5650 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@gkswns0531
Contributor

@gkswns0531 gkswns0531 commented Jul 1, 2025

  • Add Qwen3ForCausalLM mapping in models/__init__.py
  • Update config.py to support Qwen3 architecture with qwen3/qwen3_moe types
  • Add Qwen3 weight conversion logic in convert.py
  • Implement Qwen3-specific model modifications in model.py
  • Support attention_bias=False and qk_layernorm=True for Qwen3
  • Enable FP16 and FP8 quantization for Qwen3

Tested with Qwen3-1.7B model successfully. (Support for the Qwen3 MoE architecture will be added in a future update)
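
To make the attention-related options above concrete, here is a minimal illustrative sketch of the intended behavior. This is not the exact code from the PR; the option names (attention_bias, qk_layernorm) simply follow the bullet list above.

# Illustrative sketch only (not the merged implementation): how the
# qwen3/qwen3_moe types map to the attention options listed above.
def qwen3_attention_options(qwen_type: str) -> dict:
    is_qwen3 = qwen_type in ("qwen3", "qwen3_moe")
    return {
        "attention_bias": not is_qwen3,  # Qwen3 models have no attention bias
        "qk_layernorm": is_qwen3,        # Qwen3 applies RMSNorm to the Q/K projections
    }

# Example: qwen3_attention_options("qwen3") -> {'attention_bias': False, 'qk_layernorm': True}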

Dear Maintainers,

I would like to kindly submit a pull request that adds support for building the Qwen3 model as a TensorRT engine.
I would be truly grateful if you could take the time to review it at your convenience.

Thank you very much for your consideration.

Reference:
vLLM vs TensorRT-LLM latency (Qwen3-1.7B, L4 GPU, 2048 input_length, 3072 max_length, batch 1)

FP16

  • trt-llm (tensorrt): TTFT: 25.7ms & TPS: 60.9
  • vllm: TTFT: 25.0ms & TPS: 59.2

FP8

  • trt-llm: TTFT: 18.3ms & TPS: 104.9
  • vllm: TTFT: 20.6ms & TPS: 80.2

Related to #5673

@juney-nvidia juney-nvidia added Community want to contribute PRs initiated from Community Community Engagement help/insights needed from community labels Jul 1, 2025
@gkswns0531 gkswns0531 changed the title feat: Add Qwen3 model support feat: Add TensorRT-Engine Qwen3 model support Jul 2, 2025
@gkswns0531 gkswns0531 changed the title feat: Add TensorRT-Engine Qwen3 model support [feat] Add TensorRT-Engine Qwen3 model support Jul 2, 2025
- Add Qwen3ForCausalLM mapping in models/__init__.py
- Update config.py to support Qwen3 architecture with qwen3/qwen3_moe types
- Add Qwen3 weight conversion logic in convert.py
- Implement Qwen3-specific model modifications in model.py
- Support attention_bias=False and qk_layernorm=True for Qwen3
- Enable FP16 and FP8 quantization for Qwen3

Tested with Qwen3-1.7B model successfully.
World's first TensorRT-LLM Qwen3 implementation.

Signed-off-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>
Signed-off-by: Hanjun Cho <46752251+gkswns0531@users.noreply.github.com>
@gkswns0531
Contributor Author

Dear @juney-nvidia,

I hope you're doing well. I wanted to politely follow up on this PR, as it has been a few days since it was opened and I thought it might have been missed.
I'd greatly appreciate it if a reviewer could be assigned or the workflow approved when you have a chance.
Please feel free to let me know if there's anything I should adjust. Thank you very much for your time and attention.

@PhamGiaMinh

Hello, I tried your commit and ran python3 convert_checkpoint.py to convert the checkpoint, but I got the following error:

Traceback (most recent call last):
File "/mnt/giaminh/TensorRT-LLM/examples/models/core/llama/convert_checkpoint.py", line 591, in
main()
File "/mnt/giaminh/TensorRT-LLM/examples/models/core/llama/convert_checkpoint.py", line 583, in main
convert_and_save_hf(args)
File "/mnt/giaminh/TensorRT-LLM/examples/models/core/llama/convert_checkpoint.py", line 524, in convert_and_save_hf
execute(args.workers, [convert_and_save_rank] * world_size, args)
File "/mnt/giaminh/TensorRT-LLM/examples/models/core/llama/convert_checkpoint.py", line 531, in execute
f(args, rank)
File "/mnt/giaminh/TensorRT-LLM/examples/models/core/llama/convert_checkpoint.py", line 504, in convert_and_save_rank
llama = LLaMAForCausalLM.from_hugging_face(
File "/home/osint/.local/lib/python3.10/site-packages/tensorrt_llm/models/llama/model.py", line 464, in from_hugging_face
config = LLaMAConfig.from_hugging_face(hf_config_or_dir,
File "/home/osint/.local/lib/python3.10/site-packages/tensorrt_llm/models/llama/config.py", line 188, in from_hugging_face
moe_config.validate()
File "/home/osint/.local/lib/python3.10/site-packages/tensorrt_llm/layers/moe.py", line 125, in validate
raise ValueError(
ValueError: Both or neither MoeConfig's num_experts and top_k must be set to 0

@gkswns0531
Contributor Author

gkswns0531 commented Jul 7, 2025

@PhamGiaMinh
Thank you for taking the time to try reproducing the issue.

The error you're encountering happens because the script you're using is located under the 'llama' directory:

"TensorRT-LLM/examples/models/core/llama/convert_checkpoint.py"

This script is intended for LLaMA models. If you're trying to convert a Qwen model, please use the correct script located here:

"TensorRT-LLM/examples/models/core/qwen/convert_checkpoint.py"

Here’s the recommended command:

python3 ./TensorRT-LLM/examples/models/core/qwen/convert_checkpoint.py \
  --model_dir <path_to_local_huggingface_qwen_model> \
  --output_dir <path_to_save_converted_checkpoint> \
  --dtype float16 \
  --tp_size 1 \
  --pp_size 1 \
  --workers 1
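
After conversion, a typical engine build step looks roughly like the following. This is only a sketch: <path_to_engine_dir> is a placeholder, and the flags shown are common trtllm-build options that also appear later in this thread; adjust them for your model and GPU.

trtllm-build --checkpoint_dir <path_to_save_converted_checkpoint> \
  --output_dir <path_to_engine_dir> \
  --gemm_plugin auto \
  --max_batch_size 1 \
  --max_input_len 2048 \
  --max_seq_len 3072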

If you run into any other errors or have questions about the engine build commands, feel free to leave a comment anytime.

@SimengLiu-nv SimengLiu-nv requested a review from byshiue July 7, 2025 20:36
@byshiue
Collaborator

byshiue commented Jul 8, 2025

/bot run

@byshiue
Collaborator

byshiue commented Jul 8, 2025

/bot run

@PhamGiaMinh

@gkswns0531 Dear Sir,

I have downloaded the updated script, but when I run the following command:
python3 convert_checkpoint.py --model_dir /mnt/giaminh/models_QWEN/Qwen3-30B-A3B --output_dir /mnt/giaminh/models_QWEN/Qwen3-30B-A3B-ckpt
I still get the following error:

2025-07-08 03:14:44.233721: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-07-08 03:14:44.254316: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory...
...
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 1.0.0rc0
1.0.0rc0
Traceback (most recent call last):
File "/mnt/giaminh/TensorRT-LLM/examples/models/core/qwen/convert_checkpoint.py", line 342, in
main()
File "/mnt/giaminh/TensorRT-LLM/examples/models/core/qwen/convert_checkpoint.py", line 334, in main
convert_and_save_hf(args)
File "/mnt/giaminh/TensorRT-LLM/examples/models/core/qwen/convert_checkpoint.py", line 290, in convert_and_save_hf
execute(args.workers, [convert_and_save_rank] * world_size, args)
File "/mnt/giaminh/TensorRT-LLM/examples/models/core/qwen/convert_checkpoint.py", line 297, in execute
f(args, rank)
File "/mnt/giaminh/TensorRT-LLM/examples/models/core/qwen/convert_checkpoint.py", line 278, in convert_and_save_rank
qwen = QWenForCausalLM.from_hugging_face(model_dir,
File "/home/osint/.local/lib/python3.10/site-packages/tensorrt_llm/models/qwen/model.py", line 308, in from_hugging_face
config = QWenConfig.from_hugging_face(hf_config_or_dir,
File "/home/osint/.local/lib/python3.10/site-packages/tensorrt_llm/models/qwen/config.py", line 109, in from_hugging_face
assert qwen_type in valid_types, f"Unsupported Qwen type: {qwen_type}, only {valid_types} are acceptable."
AssertionError: Unsupported Qwen type: qwen3_moe, only ('qwen', 'qwen2', 'qwen2_moe', 'qwen2_llava_onevision', 'qwen2_vl', 'qwen2_audio') are acceptable.

It seems like qwen3_moe is not supported in QWenConfig.from_hugging_face().

Could you please advise how to convert the Qwen3-30B-A3B checkpoint using TensorRT-LLM? Is there any planned support for qwen3_moe, or any workaround I could try?

Thank you in advance!

@gkswns0531
Contributor Author

@PhamGiaMinh

If you take a look at my PR, support for qwen3 and qwen3_moe has been added to TensorRT-LLM/tensorrt_llm/models/qwen/config.py. However, your traceback points to an installed package under /home/osint/.local/..., so it seems the change is not reflected in the tensorrt_llm you are actually running, which is likely causing the error. Could you please check whether the changes have been properly applied to your installation?
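
For reference, after this PR the qwen_type check in tensorrt_llm/models/qwen/config.py should accept the new types, roughly along these lines (a sketch based on the assertion in your traceback; the exact tuple in the merged code may differ):

valid_types = ('qwen', 'qwen2', 'qwen2_moe', 'qwen2_llava_onevision',
               'qwen2_vl', 'qwen2_audio', 'qwen3', 'qwen3_moe')
assert qwen_type in valid_types, \
    f"Unsupported Qwen type: {qwen_type}, only {valid_types} are acceptable."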

@PhamGiaMinh

Dear sir @gkswns0531 ,

Thank you for your recent update. I tried converting the model using your latest commit, but encountered a runtime error.

Here is the command I ran:
python3 convert_checkpoint.py \
  --model_dir /mnt/giaminh/models_QWEN/Qwen3-30B-A3B \
  --output_dir /mnt/giaminh/models_QWEN/Qwen3-30B-A3B-ckpt \
  --dtype bfloat16

And here is the full traceback:
File "/mnt/giaminh/TensorRT-LLM/examples/models/core/qwen/convert_checkpoint.py", line 278, in convert_and_save_rank
qwen = QWenForCausalLM.from_hugging_face(model_dir,
File "/home/osint/.local/lib/python3.10/site-packages/tensorrt_llm/models/qwen/model.py", line 457, in from_hugging_face
loader.generate_tllm_weights(model, arg_dict)
File "/home/osint/.local/lib/python3.10/site-packages/tensorrt_llm/models/model_weights_loader.py", line 400, in generate_tllm_weights
self.load(tllm_key,
File "/home/osint/.local/lib/python3.10/site-packages/tensorrt_llm/models/model_weights_loader.py", line 311, in load
v = sub_module.postprocess(tllm_key, v, **postprocess_kwargs)
File "/home/osint/.local/lib/python3.10/site-packages/tensorrt_llm/layers/moe.py", line 650, in postprocess
weights = stack_weights(tllm_key, weights)
File "/home/osint/.local/lib/python3.10/site-packages/tensorrt_llm/layers/moe.py", line 530, in stack_weights
torch.stack(weights[:len(weights) // 2]),
TypeError: expected Tensor as element 0 in argument 0, but got NoneType

I double-checked that the model files were downloaded properly from Hugging Face (Qwen3-30B-A3B), including config and tokenizer files.

Could you please help confirm:

Is Qwen3-30B-A3B officially supported in this branch?

Is there any missing preprocessing step or expected model structure for the MoE experts?

Do I need to manually convert experts separately?

Thanks in advance for your support!

@gkswns0531
Contributor Author

gkswns0531 commented Jul 8, 2025

@PhamGiaMinh

Thank you very much for pointing this out.

You’re absolutely right — the MoE architecture required additional handling, and I’ve almost completed the necessary changes.

However, I’d like to check with the reviewer whether these updates should be included in the current PR, or if it would be more appropriate to submit them in a separate one.

Thank you again for your feedback.

@byshiue
Collaborator

byshiue commented Jul 8, 2025

@gkswns0531 I am fine with merging this PR first, but we should emphasize that this PR supports the dense model and plan to support the MoE model in another PR.

@gkswns0531 gkswns0531 changed the title [feat] Add TensorRT-Engine Qwen3 model support [feat] Add TensorRT-Engine Qwen3 (dense) model support Jul 8, 2025
@gkswns0531
Contributor Author

@byshiue Thank you!
I have updated the PR title and description to indicate that this PR supports only the dense model, and MoE will be supported in a future update.

@byshiue
Collaborator

byshiue commented Jul 8, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11274 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11274 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #8341 completed with status: 'FAILURE'

@Dramilio

Dramilio commented Jul 8, 2025

@gkswns0531 Hi, thanks for your contribution. I can confirm this works for Qwen3-14B, but I got the error below for Qwen3-32B (FP8) when using your convert_checkpoint.py.

Command:
python /app/tensorrt_llm/examples/models/core/qwen/convert_checkpoint.py \
  --model_dir /engines/hf_model \
  --dtype float16 \
  --output_dir /engines/model_checkpoint_quant

Result:
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/app/tensorrt_llm/examples/models/core/qwen/convert_checkpoint.py", line 342, in
main()
File "/app/tensorrt_llm/examples/models/core/qwen/convert_checkpoint.py", line 334, in main
convert_and_save_hf(args)
File "/app/tensorrt_llm/examples/models/core/qwen/convert_checkpoint.py", line 290, in convert_and_save_hf
execute(args.workers, [convert_and_save_rank] * world_size, args)
File "/app/tensorrt_llm/examples/models/core/qwen/convert_checkpoint.py", line 297, in execute
f(args, rank)
File "/app/tensorrt_llm/examples/models/core/qwen/convert_checkpoint.py", line 278, in convert_and_save_rank
qwen = QWenForCausalLM.from_hugging_face(model_dir,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/qwen/model.py", line 456, in from_hugging_face
loader.generate_tllm_weights(model, arg_dict)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 402, in generate_tllm_weights
self.fill(tllm_weights)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 387, in fill
raise ValueError(
ValueError: Parameter transformer.layers.0.attention.qkv.weight has invalid shape torch.Size([10240, 5120]) compared with expected shape (6400, 5120). Auto padding failed.

@gkswns0531
Contributor Author

gkswns0531 commented Jul 8, 2025

@Dramilio Thank you for reporting this issue and providing the detailed error information.

You're right about the QKV weight shape mismatch. After analyzing the error, I can confirm that this is an existing issue with the base Qwen model support in TensorRT-LLM.

The problem occurs in config.py line 111, where head_dim is calculated as:

head_dim = hf_config.hidden_size // hf_config.num_attention_heads

However, for the Qwen3-32B model, head_dim is explicitly defined in config.json as 128, not 80 (5120 / 64). Earlier Qwen models follow the conventional calculation, but starting from Qwen3-32B there is a slight difference. This causes the QKV weight size mismatch:

  • Expected by TensorRT-LLM: (64 + 2 * 8) * 80 = 6400
  • Actual model weights: (64 + 2 * 8) * 128 = 10240

I believe the proper fix would be to read head_dim directly from the config when available:

head_dim = getattr(hf_config, 'head_dim', hf_config.hidden_size // hf_config.num_attention_heads)

This would make the implementation more robust and compatible with models that have non-standard head dimensions.
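
To make the shape math concrete, here is a small standalone sketch using the Qwen3-32B values from this thread (the SimpleNamespace merely stands in for the Hugging Face config object):

from types import SimpleNamespace

# Qwen3-32B values taken from this discussion; in the real code hf_config
# is the Hugging Face model config.
hf_config = SimpleNamespace(hidden_size=5120, num_attention_heads=64,
                            num_key_value_heads=8, head_dim=128)

# Proposed fallback: prefer the explicit head_dim, otherwise derive it as before.
head_dim = getattr(hf_config, 'head_dim',
                   hf_config.hidden_size // hf_config.num_attention_heads)
qkv_rows = (hf_config.num_attention_heads +
            2 * hf_config.num_key_value_heads) * head_dim
print(head_dim, qkv_rows)  # 128 10240 -> matches the actual QKV weight shape (10240, 5120)
# The old derivation (5120 // 64 = 80) gives 6400 rows, hence the reported mismatch.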

Since this affects the base qwen model support, I think it would be best to address this in a separate fix PR after this current PR is merged.

Thanks again for the thorough testing and reporting.

@byshiue
Collaborator

byshiue commented Jul 9, 2025

@gkswns0531 The CI fails due to the pre-commit checks:

[2025-07-08T09:28:03.465Z] + python3 -u scripts/release_check.py

[2025-07-08T09:28:03.465Z] Running command: pip3 install pre-commit

[2025-07-08T09:28:05.358Z] Running command: pip3 install bandit==1.7.7

[2025-07-08T09:28:08.636Z] Running command: pre-commit install

[2025-07-08T09:28:08.636Z] Running command: pre-commit run -a --show-diff-on-failure

[2025-07-08T09:38:00.337Z] Failing command output:

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/pycqa/isort.

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/Lucas-C/pre-commit-hooks.git.

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/google/yapf.

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.

[2025-07-08T09:38:00.337Z] [WARNING] repo `https://github.com/pre-commit/pre-commit-hooks` uses deprecated stage names (commit, push) which will be removed in a future version.  Hint: often `pre-commit autoupdate --repo https://github.com/pre-commit/pre-commit-hooks` will fix this.  if it does not -- consider reporting an issue to that repo.

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/PyCQA/autoflake.

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/PyCQA/autoflake:tomli.

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/pre-commit/mirrors-clang-format.

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/cheshirekow/cmake-format-precommit.

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/codespell-project/codespell.

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/codespell-project/codespell:tomli.

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/astral-sh/ruff-pre-commit.

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/executablebooks/mdformat.

[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/executablebooks/mdformat:mdformat_frontmatter.

[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/pycqa/isort.

[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.

[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...

[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/Lucas-C/pre-commit-hooks.git.

[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.

[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...

[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/google/yapf.

[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.

[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...

[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.

[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.

[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...

[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/PyCQA/autoflake.

[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.

[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...

[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/pre-commit/mirrors-clang-format.

[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.

[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...

[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/cheshirekow/cmake-format-precommit.

[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.

[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...

[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/codespell-project/codespell.

[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.

[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...

[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/astral-sh/ruff-pre-commit.

[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.

[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...

[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/executablebooks/mdformat.

[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.

[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...

[2025-07-08T09:38:00.337Z] isort....................................................................Failed

[2025-07-08T09:38:00.337Z] - hook id: isort

[2025-07-08T09:38:00.337Z] - files were modified by this hook

[2025-07-08T09:38:00.337Z] 

[2025-07-08T09:38:00.337Z] Fixing /home/jenkins/agent/workspace/LLM/main/L0_MergeRequest_PR/llm/tensorrt_llm/models/qwen/model.py

[2025-07-08T09:38:00.337Z] Skipped 139 files

[2025-07-08T09:38:00.337Z] 

[2025-07-08T09:38:00.337Z] CRLF end-lines remover...................................................Passed

[2025-07-08T09:38:00.337Z] yapf.....................................................................Failed

[2025-07-08T09:38:00.337Z] - hook id: yapf

[2025-07-08T09:38:00.337Z] - files were modified by this hook

[2025-07-08T09:38:00.337Z] check for added large files..............................................Passed

[2025-07-08T09:38:00.337Z] check for merge conflicts................................................Passed

[2025-07-08T09:38:00.337Z] check for broken symlinks............................(no files to check)Skipped

[2025-07-08T09:38:00.337Z] detect private key.......................................................Passed

[2025-07-08T09:38:00.337Z] fix end of files.........................................................Passed

[2025-07-08T09:38:00.337Z] check yaml...............................................................Passed

[2025-07-08T09:38:00.337Z] trim trailing whitespace.................................................Passed

[2025-07-08T09:38:00.337Z] check toml...............................................................Passed

[2025-07-08T09:38:00.337Z] mixed line ending........................................................Passed

[2025-07-08T09:38:00.337Z] debug statements (python)................................................Passed

[2025-07-08T09:38:00.337Z] check json...........................................(no files to check)Skipped

[2025-07-08T09:38:00.337Z] autoflake................................................................Passed

[2025-07-08T09:38:00.337Z] clang-format.............................................................Passed

[2025-07-08T09:38:00.337Z] cmake-format.............................................................Passed

[2025-07-08T09:38:00.337Z] codespell................................................................Passed

[2025-07-08T09:38:00.337Z] ruff.....................................................................Passed

[2025-07-08T09:38:00.337Z] ruff-format..............................................................Passed

[2025-07-08T09:38:00.337Z] mdformat.................................................................Passed

[2025-07-08T09:38:00.337Z] pre-commit hook(s) made changes.

[2025-07-08T09:38:00.337Z] If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.

[2025-07-08T09:38:00.337Z] To run `pre-commit` as part of git workflow, use `pre-commit install`.

[2025-07-08T09:38:00.337Z] All changes made by hooks:

[2025-07-08T09:38:00.337Z] diff --git a/tensorrt_llm/models/qwen/config.py b/tensorrt_llm/models/qwen/config.py

[2025-07-08T09:38:00.337Z] index 2e6f0a7..47d1e15 100644

[2025-07-08T09:38:00.337Z] --- a/tensorrt_llm/models/qwen/config.py

[2025-07-08T09:38:00.337Z] +++ b/tensorrt_llm/models/qwen/config.py

[2025-07-08T09:38:00.337Z] @@ -114,7 +114,7 @@ class QWenConfig(PretrainedConfig):

[2025-07-08T09:38:00.337Z]          hidden_act = getattr(hf_config, "hidden_act", "silu")

[2025-07-08T09:38:00.337Z]          if qwen_type == "qwen2_moe":

[2025-07-08T09:38:00.337Z]              hidden_act = "swiglu"

[2025-07-08T09:38:00.337Z] -        

[2025-07-08T09:38:00.337Z] +

[2025-07-08T09:38:00.337Z]          # Qwen3 models have no attention bias, while legacy models have bias

[2025-07-08T09:38:00.337Z]          if qwen_type in ('qwen3', 'qwen3_moe'):

[2025-07-08T09:38:00.337Z]              attn_bias = False  # Qwen3 models have no attn bias

[2025-07-08T09:38:00.337Z] diff --git a/tensorrt_llm/models/qwen/convert.py b/tensorrt_llm/models/qwen/convert.py

[2025-07-08T09:38:00.337Z] index ccc8fc9..dc2bc35 100644

[2025-07-08T09:38:00.337Z] --- a/tensorrt_llm/models/qwen/convert.py

[2025-07-08T09:38:00.337Z] +++ b/tensorrt_llm/models/qwen/convert.py

[2025-07-08T09:38:00.337Z] @@ -658,20 +658,30 @@ def convert_hf_qwen(hf_model,

[2025-07-08T09:38:00.337Z]          # Qwen3: Add q_norm and k_norm weight conversion

[2025-07-08T09:38:00.337Z]          if qwen_type in ('qwen3', 'qwen3_moe'):

[2025-07-08T09:38:00.337Z]              # Process q_norm.weight

[2025-07-08T09:38:00.337Z] -            q_norm_weight = get_weight(model_params, prefix + key_list[0] + 'q_norm', dtype)

[2025-07-08T09:38:00.337Z] +            q_norm_weight = get_weight(model_params,

[2025-07-08T09:38:00.337Z] +                                       prefix + key_list[0] + 'q_norm', dtype)

[2025-07-08T09:38:00.337Z]              weights.update(

[2025-07-08T09:38:00.337Z] -                get_tllm_linear_weight(q_norm_weight, tllm_prex + 'attention.q_layernorm.',

[2025-07-08T09:38:00.337Z] -                                       None, False,  # LayerNorm should not be quantized

[2025-07-08T09:38:00.337Z] -                                       plugin_weight_only_quant_type, dtype,

[2025-07-08T09:38:00.337Z] -                                       use_gemm_woq_plugin))

[2025-07-08T09:38:00.337Z] -            

[2025-07-08T09:38:00.337Z] -            # Process k_norm.weight  

[2025-07-08T09:38:00.337Z] -            k_norm_weight = get_weight(model_params, prefix + key_list[0] + 'k_norm', dtype)

[2025-07-08T09:38:00.337Z] +                get_tllm_linear_weight(

[2025-07-08T09:38:00.337Z] +                    q_norm_weight,

[2025-07-08T09:38:00.337Z] +                    tllm_prex + 'attention.q_layernorm.',

[2025-07-08T09:38:00.337Z] +                    None,

[2025-07-08T09:38:00.337Z] +                    False,  # LayerNorm should not be quantized

[2025-07-08T09:38:00.337Z] +                    plugin_weight_only_quant_type,

[2025-07-08T09:38:00.337Z] +                    dtype,

[2025-07-08T09:38:00.337Z] +                    use_gemm_woq_plugin))

[2025-07-08T09:38:00.337Z] +

[2025-07-08T09:38:00.337Z] +            # Process k_norm.weight

[2025-07-08T09:38:00.337Z] +            k_norm_weight = get_weight(model_params,

[2025-07-08T09:38:00.337Z] +                                       prefix + key_list[0] + 'k_norm', dtype)

[2025-07-08T09:38:00.337Z]              weights.update(

[2025-07-08T09:38:00.337Z] -                get_tllm_linear_weight(k_norm_weight, tllm_prex + 'attention.k_layernorm.',

[2025-07-08T09:38:00.337Z] -                                       None, False,  # LayerNorm should not be quantized

[2025-07-08T09:38:00.337Z] -                                       plugin_weight_only_quant_type, dtype,

[2025-07-08T09:38:00.337Z] -                                       use_gemm_woq_plugin))

[2025-07-08T09:38:00.337Z] +                get_tllm_linear_weight(

[2025-07-08T09:38:00.337Z] +                    k_norm_weight,

[2025-07-08T09:38:00.337Z] +                    tllm_prex + 'attention.k_layernorm.',

[2025-07-08T09:38:00.337Z] +                    None,

[2025-07-08T09:38:00.337Z] +                    False,  # LayerNorm should not be quantized

[2025-07-08T09:38:00.337Z] +                    plugin_weight_only_quant_type,

[2025-07-08T09:38:00.337Z] +                    dtype,

[2025-07-08T09:38:00.337Z] +                    use_gemm_woq_plugin))

[2025-07-08T09:38:00.337Z]  

[2025-07-08T09:38:00.337Z]          if qwen_type == "qwen2_moe" and moe_config and moe_config.has_moe():

[2025-07-08T09:38:00.337Z]  

[2025-07-08T09:38:00.337Z] diff --git a/tensorrt_llm/models/qwen/model.py b/tensorrt_llm/models/qwen/model.py

[2025-07-08T09:38:00.337Z] index 16d748d..0fb003a 100644

[2025-07-08T09:38:00.337Z] --- a/tensorrt_llm/models/qwen/model.py

[2025-07-08T09:38:00.337Z] +++ b/tensorrt_llm/models/qwen/model.py

[2025-07-08T09:38:00.337Z] @@ -21,7 +21,7 @@ import torch

[2025-07-08T09:38:00.337Z]  from tqdm import tqdm

[2025-07-08T09:38:00.337Z]  

[2025-07-08T09:38:00.337Z]  from ..._utils import pad_vocab_size

[2025-07-08T09:38:00.337Z] -from ...functional import Tensor, recv, send, LayerNormType

[2025-07-08T09:38:00.337Z] +from ...functional import LayerNormType, Tensor, recv, send

[2025-07-08T09:38:00.337Z]  from ...layers import (MOE, Attention, AttentionMaskType, ColumnLinear,

[2025-07-08T09:38:00.337Z]                         Embedding, GatedMLP, RmsNorm, SharedMoE)

[2025-07-08T09:38:00.337Z]  from ...layers.moe import MOEWeightWrapper

[2025-07-08T09:38:00.337Z] @@ -38,6 +38,7 @@ from .config import QWenConfig

[2025-07-08T09:38:00.337Z]  from .convert import (load_hf_qwen, load_weights_from_hf_gptq_model,

[2025-07-08T09:38:00.337Z]                        load_weights_from_hf_model)

[2025-07-08T09:38:00.337Z]  

[2025-07-08T09:38:00.337Z] +

[2025-07-08T09:38:00.337Z]  class QWenDecoderLayer(Module):

[2025-07-08T09:38:00.337Z]  

[2025-07-08T09:38:00.337Z]      def __init__(self, config: QWenConfig, layer_idx: int):

[2025-07-08T09:38:00.337Z] @@ -57,7 +58,7 @@ class QWenDecoderLayer(Module):

[2025-07-08T09:38:00.337Z]          local_layer_idx = layer_idx - layers_range[0]

[2025-07-08T09:38:00.337Z]          # Qwen3: Enable qk_layernorm for Q/K normalization (similar to Gemma3)

[2025-07-08T09:38:00.337Z]          qk_layernorm = config.qwen_type in ('qwen3', 'qwen3_moe')

[2025-07-08T09:38:00.337Z] -        

[2025-07-08T09:38:00.337Z] +

[2025-07-08T09:38:00.337Z]          self.attention = Attention(

[2025-07-08T09:38:00.337Z]              local_layer_idx=local_layer_idx,

[2025-07-08T09:38:00.337Z]              hidden_size=config.hidden_size,

[2025-07-08T09:38:00.337Z] @@ -83,7 +84,8 @@ class QWenDecoderLayer(Module):

[2025-07-08T09:38:00.337Z]              dense_bias=False,

[2025-07-08T09:38:00.337Z]              # Qwen3: Add Q/K layer normalization

[2025-07-08T09:38:00.337Z]              qk_layernorm=qk_layernorm,

[2025-07-08T09:38:00.337Z] -            layernorm_type=LayerNormType.RmsNorm if qk_layernorm else LayerNormType.LayerNorm)

[2025-07-08T09:38:00.337Z] +            layernorm_type=LayerNormType.RmsNorm

[2025-07-08T09:38:00.337Z] +            if qk_layernorm else LayerNormType.LayerNorm)

[2025-07-08T09:38:00.337Z]  

[2025-07-08T09:38:00.337Z]          if config.moe.has_moe():

[2025-07-08T09:38:00.337Z]              mlp_kwargs = {'moe_config': config.moe, 'mapping': config.mapping}

[2025-07-08T09:38:00.337Z] 

[2025-07-08T09:38:00.337Z] 

[2025-07-08T09:38:00.337Z] Error: pre-commit checks failed

[2025-07-08T09:38:00.337Z] Please refer to our coding style guidelines at: https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#coding-style to fix this issue

[2025-07-08T09:38:00.337Z] + git restore .

[2025-07-08T09:38:00.337Z] + false

Can you run the following commands to fix it?

pre-commit install
pre-commit run -a --show-diff-on-failure

@ccys-a11y

@gkswns0531
Hello, thanks for your contribution. I can confirm this works for Qwen3-14B, but I got the error below with a blockwise-quantized Qwen3-14B.

The quantization script is as follows:

from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer
import torch
import os

model_path = "/root/lf-Qwen3-14B"
output_dir = "/root/lf-Qwen3-14B-fp8-block-wise-fromT-b32"

os.makedirs(output_dir, exist_ok=True)

fp8_config = FineGrainedFP8Config(
    activation_scheme="dynamic",
    weight_block_size=(32, 32),
    modules_to_not_convert=["lm_head"],
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    quantization_config=fp8_config,
    device_map=None,
)
model.save_pretrained(output_dir)
When running the attached script, the output becomes garbled:
quickstart_advanced_qwen.txt
ERROR

Could you please help confirm:
Is blockwise quantized Qwen3-14B officially supported in this branch?

Signed-off-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>
@gkswns0531
Contributor Author

@byshiue
I’ve run pre-commit locally and pushed the updated code accordingly. Please let me know if there’s anything else I should adjust. Thank you!

@byshiue
Collaborator

byshiue commented Jul 10, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11474 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11474 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8490 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@byshiue byshiue merged commit 6490a27 into NVIDIA:main Jul 10, 2025
2 checks passed
@byshiue
Collaborator

byshiue commented Jul 10, 2025

Merge this PR. Thank you for the contribution.

@gkswns0531
Contributor Author

@byshiue

Thank you very much for merging the PR!
I’ll follow up with additional PRs to handle models with slightly different structures such as Qwen3-32B, as well as support for the MoE architecture.

zhou-yuxin pushed a commit to zhou-yuxin/TensorRT-LLM that referenced this pull request Jul 15, 2025
Signed-off-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>
Signed-off-by: Hanjun Cho <46752251+gkswns0531@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>
Signed-off-by: Yuxin <yuxinz@nvidia.com>
@Shruti-db

@gkswns0531 would it be possible for you to share the benchmarking code for both vllm and TRT-LLM? Thanks!

@gkswns0531
Contributor Author

@Shruti-db

Please refer to the repository below.
The code I used to compare Qwen3 is a slightly modified local version of the one in that repository,
but I believe you should be able to run comparisons without significant issues using the code from the repo.
Try it out with the repository first, and if it doesn’t work, feel free to ping me.

https://github.com/gkswns0531/qwen2.5_engine_compare

@Shruti-db

thanks for sharing! @gkswns0531

@Shruti-db

Shruti-db commented Jul 21, 2025

I'm trying to run Qwen3-4B but am getting the error below.
I'm running it on an L40S.

ubuntu@ip-172-31-3-85:~/qwen2.5_engine_compare$ python3 build_qwen_engine.py --quantization fp8 --model_name qwen3-4b
Building TensorRT-LLM engine with FP8 quantization
Model: qwen3-4b

Building TensorRT-LLM engine...
Quantization type: FP8
Checkpoint directory: ./qwen3-4b_checkpoints_fp8
Engine directory: ./qwen3-4b_engine_fp8

TRTLLM-BUILD COMMAND:

Command: trtllm-build --checkpoint_dir ./qwen3-4b_checkpoints_fp8
Active Parameters:
--output_dir = ./qwen3-4b_engine_fp8
--max_batch_size = 1
--max_input_len = 2048
--max_seq_len = 3072
--max_beam_width = 1
--max_num_tokens = 8192
--opt_num_tokens = 3072
--kv_cache_type = paged
--tokens_per_block = 128
--gpt_attention_plugin = auto
--gemm_plugin = fp8
--fp8_rowwise_gemm_plugin = auto
--nccl_plugin = auto
--context_fmha = enable
--use_paged_context_fmha = enable
--use_fp8_context_fmha = enable
--norm_quant_fusion = enable
--reduce_fusion = enable
--use_fused_mlp = enable
--user_buffer = enable
--remove_input_padding = enable
--workers = 4
--profiling_verbosity = detailed
--log_level = info

:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-07-21 17:54:55] INFO config.py:54: PyTorch version 2.7.1 available.
2025-07-21 17:54:58,889 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 1.0.0rc3
[07/21/2025-17:54:58] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set gemm_plugin to fp8.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to auto.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set lora_plugin to None.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set dora_plugin to False.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set moe_plugin to auto.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set gemm_allreduce_plugin to None.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set context_fmha to True.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set remove_input_padding to True.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set norm_quant_fusion to True.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set reduce_fusion to True.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set user_buffer to True.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set tokens_per_block to 128.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set use_fp8_context_fmha to True.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set fuse_fp4_quant to False.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set multiple_profiles to False.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set paged_state to True.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set streamingllm to False.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set use_fused_mlp to True.
[07/21/2025-17:54:58] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.producer = {'name': 'modelopt', 'version': '0.31.0'}
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.share_embedding_table = False
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.residual_mlp = False
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.bias = False
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.rotary_pct = 1.0
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.rank = 0
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.decoder = qwen
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.rmsnorm = True
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.lm_head_bias = False
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen3
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = True
[07/21/2025-17:54:58] [TRT-LLM] [W] Implicitly setting QWenConfig.model_type = qwen
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 789, in load
param.value = weights[name]
File "/home/ubuntu/.local/lib/python3.10/site-packages/tensorrt_llm/parameter.py", line 222, in value
assert v.shape == self.shape,
AssertionError: The value updated is not the same shape as the original. Updated: (6144, 2560), original: (3840, 2560)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/ubuntu/.local/bin/trtllm-build", line 8, in
sys.exit(main())
File "/home/ubuntu/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 626, in main
parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
File "/home/ubuntu/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 419, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/home/ubuntu/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 384, in build_and_save
engine = build_model(build_config,
File "/home/ubuntu/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 361, in build_model
model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
File "/home/ubuntu/.local/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 754, in from_checkpoint
model.load(weights, from_pruned=is_checkpoint_pruned)
File "/home/ubuntu/.local/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 791, in load
raise RuntimeError(
RuntimeError: Encounter error 'The value updated is not the same shape as the original. Updated: (6144, 2560), original: (3840, 2560)' for parameter 'transformer.layers.0.attention.qkv.weight'
Traceback (most recent call last):
File "/home/ubuntu/qwen2.5_engine_compare/build_qwen_engine.py", line 258, in
main()
File "/home/ubuntu/qwen2.5_engine_compare/build_qwen_engine.py", line 254, in main
success = build_trtllm_engine(quantization=args.quantization)
File "/home/ubuntu/qwen2.5_engine_compare/build_qwen_engine.py", line 229, in build_trtllm_engine
result = subprocess.run(cmd, check=True)
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['trtllm-build', '--checkpoint_dir', './qwen3-4b_checkpoints_fp8', '--output_dir', './qwen3-4b_engine_fp8', '--max_batch_size', '1', '--max_input_len', '2048', '--max_seq_len', '3072', '--max_beam_width', '1', '--max_num_tokens', '8192', '--opt_num_tokens', '3072', '--kv_cache_type', 'paged', '--tokens_per_block', '128', '--gpt_attention_plugin', 'auto', '--gemm_plugin', 'fp8', '--fp8_rowwise_gemm_plugin', 'auto', '--nccl_plugin', 'auto', '--context_fmha', 'enable', '--use_paged_context_fmha', 'enable', '--use_fp8_context_fmha', 'enable', '--norm_quant_fusion', 'enable', '--reduce_fusion', 'enable', '--use_fused_mlp', 'enable', '--user_buffer', 'enable', '--remove_input_padding', 'enable', '--workers', '4', '--profiling_verbosity', 'detailed', '--log_level', 'info']' returned non-zero exit status 1.

@gkswns0531
Contributor Author

@Shruti-db
This error occurs because the method for calculating head_dim has changed slightly starting from the Qwen3 model. A pull request to fix this is currently under review. Thank you.
PR: #5913

@Shruti-db

Thanks @gkswns0531, I patched in your fix and it works now!

@michaelroyzen

Thank you for the amazing contribution @gkswns0531. Do you have any updates on MoE support?

Best,

Michael

solrex pushed a commit to solrex/TensorRT-LLM that referenced this pull request Sep 10, 2025
Signed-off-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>
Signed-off-by: Hanjun Cho <46752251+gkswns0531@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>