[feat] Add TensorRT-Engine Qwen3 (dense) model support #5650
Conversation
- Add Qwen3ForCausalLM mapping in models/__init__.py
- Update config.py to support the Qwen3 architecture with qwen3/qwen3_moe types
- Add Qwen3 weight conversion logic in convert.py
- Implement Qwen3-specific model modifications in model.py
- Support attention_bias=False and qk_layernorm=True for Qwen3
- Enable FP16 and FP8 quantization for Qwen3

Tested successfully with the Qwen3-1.7B model. World's first TensorRT-LLM Qwen3 implementation.

Signed-off-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>
Signed-off-by: Hanjun Cho <46752251+gkswns0531@users.noreply.github.com>
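To make the attention_bias/qk_layernorm switches concrete, here is a condensed sketch of the logic this PR adds, distilled from the config/model diff shown later in this thread (not a verbatim excerpt of the source):

```python
def qwen3_attention_switches(qwen_type: str):
    """Return (attn_bias, qk_layernorm) for a given qwen_type."""
    # Qwen3 models drop the attention bias that earlier Qwen models used.
    attn_bias = qwen_type not in ('qwen3', 'qwen3_moe')
    # Qwen3 applies RMSNorm to the Q/K projections (similar to Gemma3).
    qk_layernorm = qwen_type in ('qwen3', 'qwen3_moe')
    return attn_bias, qk_layernorm

assert qwen3_attention_switches('qwen3') == (False, True)
assert qwen3_attention_switches('qwen2') == (True, False)
```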
Dear @juney-nvidia, I hope you're doing well. I just wanted to very politely follow up on this PR, as it's been a few days since it was opened and I thought it might have been missed.
Hello sir, I tried your commit and used python3 convert_checkpoint.py to convert, but I got an error: Traceback (most recent call last):
@PhamGiaMinh The error you're encountering happens because the script you're using is located under the 'llama' directory: "TensorRT-LLM/examples/models/core/llama/convert_checkpoint.py". That script is intended for LLaMA models. If you're trying to convert a Qwen model, please use the correct script located here: "TensorRT-LLM/examples/models/core/qwen/convert_checkpoint.py". Here's the recommended command:

python3 ./TensorRT-LLM/examples/models/core/qwen/convert_checkpoint.py \
    --model_dir <path_to_local_huggingface_qwen_model> \
    --output_dir <path_to_save_converted_checkpoint> \
    --dtype float16 \
    --tp_size 1 \
    --pp_size 1 \
    --workers 1

If you run into any other errors or have questions about the engine build commands, feel free to leave a comment anytime.
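For the next step, a typical engine build from the converted checkpoint looks like the following; this is an illustrative invocation, not the thread author's exact command, so adjust the paths and sequence limits to your setup:

```bash
trtllm-build \
    --checkpoint_dir <path_to_save_converted_checkpoint> \
    --output_dir <path_to_save_engine> \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_seq_len 3072
```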
/bot run

/bot run
@gkswns0531 Dear Sir, I have downloaded the updated script, but when I run the following command:

2025-07-08 03:14:44.233721: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered

It seems like qwen3_moe is not supported in QWenConfig.from_hugging_face(). Could you please advise how to convert the Qwen3-30B-A3B checkpoint using TensorRT-LLM? Is there any planned support for qwen3_moe, or any workaround I could try? Thank you in advance!
If you take a look at my PR, support for qwen3 and qwen3_moe has been added to TensorRT-LLM/tensorrt_llm/models/qwen/config.py. However, it seems that this change is not being reflected, which is likely causing the error. Could you please check if the changes have been properly updated?
Dear sir @gkswns0531, thank you for your recent update. I tried converting the model using your latest commit, but encountered a runtime error. Here is the command I ran: And here is the full traceback: I double-checked that the model files were downloaded properly from Hugging Face (Qwen3-30B-A3B), including config and tokenizer files. Could you please help confirm:

- Is Qwen3-30B-A3B officially supported in this branch?
- Is there any missing preprocessing step or expected model structure for the MoE experts?
- Do I need to manually convert experts separately?

Thanks in advance for your support!
Thank you very much for pointing this out. You're absolutely right, the MoE architecture required additional handling, and I've almost completed the necessary changes. However, I'd like to check with the reviewer whether these updates should be included in the current PR, or if it would be more appropriate to submit them in a separate one. Thank you again for your feedback.
@gkswns0531 I am fine with merging this PR first. But we can emphasize that we support the dense model in this PR, and plan to support the MoE model in another PR.
@byshiue Thank you!

/bot run

PR_Github #11274 [ run ] triggered by Bot

PR_Github #11274 [ run ] completed with state
@gkswns0531 Hi, thanks for your contribution. I confirm this works for Qwen3-14B, but I got this error for Qwen3-32B-FP8 when using your convert_checkpoint.py. Command: Result: Traceback (most recent call last):
@Dramilio Thank you for reporting this issue and providing the detailed error information. You're right about the QKV weight shape mismatch. After analyzing the error, I can confirm this is an existing issue with the base Qwen model support in TensorRT-LLM. The problem occurs in `head_dim = hf_config.hidden_size // hf_config.num_attention_heads`. However, for the Qwen3-32B model, the config declares an explicit `head_dim` that differs from this computed value.

I believe the proper fix would be to read `head_dim = getattr(hf_config, 'head_dim', hf_config.hidden_size // hf_config.num_attention_heads)`. This would make the implementation more robust and compatible with models that have non-standard head dimensions. Since this affects the base Qwen model support, I think it would be best to address this in a separate fix PR after this current PR is merged. Thanks again for the thorough testing and reporting.
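For concreteness, the mismatch can be illustrated with a small self-contained snippet; the Qwen3-32B numbers below are taken from its Hugging Face config, and the stub class exists only for illustration:

```python
class HFConfigStub:
    """Stub mimicking the relevant fields of Qwen3-32B's Hugging Face config."""
    hidden_size = 5120
    num_attention_heads = 64
    head_dim = 128  # explicitly declared, not hidden_size // num_attention_heads

hf_config = HFConfigStub()

# Naive derivation: 5120 // 64 == 80, which mismatches the real head_dim of 128.
naive = hf_config.hidden_size // hf_config.num_attention_heads

# Proposed robust read: prefer the explicit attribute, fall back to the derived value.
head_dim = getattr(hf_config, 'head_dim',
                   hf_config.hidden_size // hf_config.num_attention_heads)

assert naive == 80 and head_dim == 128
```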
@gkswns0531 The CI fails due to the pre-commit checks:

[2025-07-08T09:28:03.465Z] + python3 -u scripts/release_check.py
[2025-07-08T09:28:03.465Z] Running command: pip3 install pre-commit
[2025-07-08T09:28:05.358Z] Running command: pip3 install bandit==1.7.7
[2025-07-08T09:28:08.636Z] Running command: pre-commit install
[2025-07-08T09:28:08.636Z] Running command: pre-commit run -a --show-diff-on-failure
[2025-07-08T09:38:00.337Z] Failing command output:
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/pycqa/isort.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/Lucas-C/pre-commit-hooks.git.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/google/yapf.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.
[2025-07-08T09:38:00.337Z] [WARNING] repo `https://github.com/pre-commit/pre-commit-hooks` uses deprecated stage names (commit, push) which will be removed in a future version. Hint: often `pre-commit autoupdate --repo https://github.com/pre-commit/pre-commit-hooks` will fix this. if it does not -- consider reporting an issue to that repo.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/PyCQA/autoflake.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/PyCQA/autoflake:tomli.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/pre-commit/mirrors-clang-format.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/cheshirekow/cmake-format-precommit.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/codespell-project/codespell.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/codespell-project/codespell:tomli.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/astral-sh/ruff-pre-commit.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/executablebooks/mdformat.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/executablebooks/mdformat:mdformat_frontmatter.
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/pycqa/isort.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/Lucas-C/pre-commit-hooks.git.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/google/yapf.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/PyCQA/autoflake.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/pre-commit/mirrors-clang-format.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/cheshirekow/cmake-format-precommit.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/codespell-project/codespell.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/astral-sh/ruff-pre-commit.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/executablebooks/mdformat.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] isort....................................................................Failed
[2025-07-08T09:38:00.337Z] - hook id: isort
[2025-07-08T09:38:00.337Z] - files were modified by this hook
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] Fixing /home/jenkins/agent/workspace/LLM/main/L0_MergeRequest_PR/llm/tensorrt_llm/models/qwen/model.py
[2025-07-08T09:38:00.337Z] Skipped 139 files
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] CRLF end-lines remover...................................................Passed
[2025-07-08T09:38:00.337Z] yapf.....................................................................Failed
[2025-07-08T09:38:00.337Z] - hook id: yapf
[2025-07-08T09:38:00.337Z] - files were modified by this hook
[2025-07-08T09:38:00.337Z] check for added large files..............................................Passed
[2025-07-08T09:38:00.337Z] check for merge conflicts................................................Passed
[2025-07-08T09:38:00.337Z] check for broken symlinks............................(no files to check)Skipped
[2025-07-08T09:38:00.337Z] detect private key.......................................................Passed
[2025-07-08T09:38:00.337Z] fix end of files.........................................................Passed
[2025-07-08T09:38:00.337Z] check yaml...............................................................Passed
[2025-07-08T09:38:00.337Z] trim trailing whitespace.................................................Passed
[2025-07-08T09:38:00.337Z] check toml...............................................................Passed
[2025-07-08T09:38:00.337Z] mixed line ending........................................................Passed
[2025-07-08T09:38:00.337Z] debug statements (python)................................................Passed
[2025-07-08T09:38:00.337Z] check json...........................................(no files to check)Skipped
[2025-07-08T09:38:00.337Z] autoflake................................................................Passed
[2025-07-08T09:38:00.337Z] clang-format.............................................................Passed
[2025-07-08T09:38:00.337Z] cmake-format.............................................................Passed
[2025-07-08T09:38:00.337Z] codespell................................................................Passed
[2025-07-08T09:38:00.337Z] ruff.....................................................................Passed
[2025-07-08T09:38:00.337Z] ruff-format..............................................................Passed
[2025-07-08T09:38:00.337Z] mdformat.................................................................Passed
[2025-07-08T09:38:00.337Z] pre-commit hook(s) made changes.
[2025-07-08T09:38:00.337Z] If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
[2025-07-08T09:38:00.337Z] To run `pre-commit` as part of git workflow, use `pre-commit install`.
[2025-07-08T09:38:00.337Z] All changes made by hooks:
[2025-07-08T09:38:00.337Z] diff --git a/tensorrt_llm/models/qwen/config.py b/tensorrt_llm/models/qwen/config.py
[2025-07-08T09:38:00.337Z] index 2e6f0a7..47d1e15 100644
[2025-07-08T09:38:00.337Z] --- a/tensorrt_llm/models/qwen/config.py
[2025-07-08T09:38:00.337Z] +++ b/tensorrt_llm/models/qwen/config.py
[2025-07-08T09:38:00.337Z] @@ -114,7 +114,7 @@ class QWenConfig(PretrainedConfig):
[2025-07-08T09:38:00.337Z] hidden_act = getattr(hf_config, "hidden_act", "silu")
[2025-07-08T09:38:00.337Z] if qwen_type == "qwen2_moe":
[2025-07-08T09:38:00.337Z] hidden_act = "swiglu"
[2025-07-08T09:38:00.337Z] -
[2025-07-08T09:38:00.337Z] +
[2025-07-08T09:38:00.337Z] # Qwen3 models have no attention bias, while legacy models have bias
[2025-07-08T09:38:00.337Z] if qwen_type in ('qwen3', 'qwen3_moe'):
[2025-07-08T09:38:00.337Z] attn_bias = False # Qwen3 models have no attn bias
[2025-07-08T09:38:00.337Z] diff --git a/tensorrt_llm/models/qwen/convert.py b/tensorrt_llm/models/qwen/convert.py
[2025-07-08T09:38:00.337Z] index ccc8fc9..dc2bc35 100644
[2025-07-08T09:38:00.337Z] --- a/tensorrt_llm/models/qwen/convert.py
[2025-07-08T09:38:00.337Z] +++ b/tensorrt_llm/models/qwen/convert.py
[2025-07-08T09:38:00.337Z] @@ -658,20 +658,30 @@ def convert_hf_qwen(hf_model,
[2025-07-08T09:38:00.337Z] # Qwen3: Add q_norm and k_norm weight conversion
[2025-07-08T09:38:00.337Z] if qwen_type in ('qwen3', 'qwen3_moe'):
[2025-07-08T09:38:00.337Z] # Process q_norm.weight
[2025-07-08T09:38:00.337Z] - q_norm_weight = get_weight(model_params, prefix + key_list[0] + 'q_norm', dtype)
[2025-07-08T09:38:00.337Z] + q_norm_weight = get_weight(model_params,
[2025-07-08T09:38:00.337Z] + prefix + key_list[0] + 'q_norm', dtype)
[2025-07-08T09:38:00.337Z] weights.update(
[2025-07-08T09:38:00.337Z] - get_tllm_linear_weight(q_norm_weight, tllm_prex + 'attention.q_layernorm.',
[2025-07-08T09:38:00.337Z] - None, False, # LayerNorm should not be quantized
[2025-07-08T09:38:00.337Z] - plugin_weight_only_quant_type, dtype,
[2025-07-08T09:38:00.337Z] - use_gemm_woq_plugin))
[2025-07-08T09:38:00.337Z] -
[2025-07-08T09:38:00.337Z] - # Process k_norm.weight
[2025-07-08T09:38:00.337Z] - k_norm_weight = get_weight(model_params, prefix + key_list[0] + 'k_norm', dtype)
[2025-07-08T09:38:00.337Z] + get_tllm_linear_weight(
[2025-07-08T09:38:00.337Z] + q_norm_weight,
[2025-07-08T09:38:00.337Z] + tllm_prex + 'attention.q_layernorm.',
[2025-07-08T09:38:00.337Z] + None,
[2025-07-08T09:38:00.337Z] + False, # LayerNorm should not be quantized
[2025-07-08T09:38:00.337Z] + plugin_weight_only_quant_type,
[2025-07-08T09:38:00.337Z] + dtype,
[2025-07-08T09:38:00.337Z] + use_gemm_woq_plugin))
[2025-07-08T09:38:00.337Z] +
[2025-07-08T09:38:00.337Z] + # Process k_norm.weight
[2025-07-08T09:38:00.337Z] + k_norm_weight = get_weight(model_params,
[2025-07-08T09:38:00.337Z] + prefix + key_list[0] + 'k_norm', dtype)
[2025-07-08T09:38:00.337Z] weights.update(
[2025-07-08T09:38:00.337Z] - get_tllm_linear_weight(k_norm_weight, tllm_prex + 'attention.k_layernorm.',
[2025-07-08T09:38:00.337Z] - None, False, # LayerNorm should not be quantized
[2025-07-08T09:38:00.337Z] - plugin_weight_only_quant_type, dtype,
[2025-07-08T09:38:00.337Z] - use_gemm_woq_plugin))
[2025-07-08T09:38:00.337Z] + get_tllm_linear_weight(
[2025-07-08T09:38:00.337Z] + k_norm_weight,
[2025-07-08T09:38:00.337Z] + tllm_prex + 'attention.k_layernorm.',
[2025-07-08T09:38:00.337Z] + None,
[2025-07-08T09:38:00.337Z] + False, # LayerNorm should not be quantized
[2025-07-08T09:38:00.337Z] + plugin_weight_only_quant_type,
[2025-07-08T09:38:00.337Z] + dtype,
[2025-07-08T09:38:00.337Z] + use_gemm_woq_plugin))
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] if qwen_type == "qwen2_moe" and moe_config and moe_config.has_moe():
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] diff --git a/tensorrt_llm/models/qwen/model.py b/tensorrt_llm/models/qwen/model.py
[2025-07-08T09:38:00.337Z] index 16d748d..0fb003a 100644
[2025-07-08T09:38:00.337Z] --- a/tensorrt_llm/models/qwen/model.py
[2025-07-08T09:38:00.337Z] +++ b/tensorrt_llm/models/qwen/model.py
[2025-07-08T09:38:00.337Z] @@ -21,7 +21,7 @@ import torch
[2025-07-08T09:38:00.337Z] from tqdm import tqdm
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] from ..._utils import pad_vocab_size
[2025-07-08T09:38:00.337Z] -from ...functional import Tensor, recv, send, LayerNormType
[2025-07-08T09:38:00.337Z] +from ...functional import LayerNormType, Tensor, recv, send
[2025-07-08T09:38:00.337Z] from ...layers import (MOE, Attention, AttentionMaskType, ColumnLinear,
[2025-07-08T09:38:00.337Z] Embedding, GatedMLP, RmsNorm, SharedMoE)
[2025-07-08T09:38:00.337Z] from ...layers.moe import MOEWeightWrapper
[2025-07-08T09:38:00.337Z] @@ -38,6 +38,7 @@ from .config import QWenConfig
[2025-07-08T09:38:00.337Z] from .convert import (load_hf_qwen, load_weights_from_hf_gptq_model,
[2025-07-08T09:38:00.337Z] load_weights_from_hf_model)
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] +
[2025-07-08T09:38:00.337Z] class QWenDecoderLayer(Module):
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] def __init__(self, config: QWenConfig, layer_idx: int):
[2025-07-08T09:38:00.337Z] @@ -57,7 +58,7 @@ class QWenDecoderLayer(Module):
[2025-07-08T09:38:00.337Z] local_layer_idx = layer_idx - layers_range[0]
[2025-07-08T09:38:00.337Z] # Qwen3: Enable qk_layernorm for Q/K normalization (similar to Gemma3)
[2025-07-08T09:38:00.337Z] qk_layernorm = config.qwen_type in ('qwen3', 'qwen3_moe')
[2025-07-08T09:38:00.337Z] -
[2025-07-08T09:38:00.337Z] +
[2025-07-08T09:38:00.337Z] self.attention = Attention(
[2025-07-08T09:38:00.337Z] local_layer_idx=local_layer_idx,
[2025-07-08T09:38:00.337Z] hidden_size=config.hidden_size,
[2025-07-08T09:38:00.337Z] @@ -83,7 +84,8 @@ class QWenDecoderLayer(Module):
[2025-07-08T09:38:00.337Z] dense_bias=False,
[2025-07-08T09:38:00.337Z] # Qwen3: Add Q/K layer normalization
[2025-07-08T09:38:00.337Z] qk_layernorm=qk_layernorm,
[2025-07-08T09:38:00.337Z] - layernorm_type=LayerNormType.RmsNorm if qk_layernorm else LayerNormType.LayerNorm)
[2025-07-08T09:38:00.337Z] + layernorm_type=LayerNormType.RmsNorm
[2025-07-08T09:38:00.337Z] + if qk_layernorm else LayerNormType.LayerNorm)
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] if config.moe.has_moe():
[2025-07-08T09:38:00.337Z] mlp_kwargs = {'moe_config': config.moe, 'mapping': config.mapping}
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] Error: pre-commit checks failed
[2025-07-08T09:38:00.337Z] Please refer to our coding style guidelines at: https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#coding-style to fix this issue
[2025-07-08T09:38:00.337Z] + git restore .
[2025-07-08T09:38:00.337Z] + false

Can you run the following scripts to fix it?

pre-commit install
pre-commit run -a --show-diff-on-failure
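Since the failing hooks (isort, yapf) rewrite files in place, the usual resolution is to run them locally and commit whatever they change; a typical sequence (the commit message is just an example):

```bash
pre-commit install
pre-commit run --all-files   # lets isort/yapf rewrite the files locally
git add -u                   # stage the files the hooks modified
git commit -s -m "fix: apply pre-commit formatting"
```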
@gkswns0531 The quantization script is as follows:

os.makedirs(output_dir, exist_ok=True)
fp8_config = FineGrainedFP8Config(
model = AutoModelForCausalLM.from_pretrained(

Could you please help confirm:
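The script fragments above are truncated; as a rough reconstruction of the standard transformers fine-grained FP8 pattern they appear to follow (the model id, paths, and config arguments are placeholders, not the poster's actual values):

```python
import os

from transformers import AutoModelForCausalLM, FineGrainedFP8Config

output_dir = "./qwen3-fp8"  # placeholder output path
os.makedirs(output_dir, exist_ok=True)

# Fine-grained FP8 quantization config; defaults assumed here.
fp8_config = FineGrainedFP8Config()

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B",               # placeholder model id
    quantization_config=fp8_config,
    device_map="auto",
)
model.save_pretrained(output_dir)
```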
@byshiue

/bot run

PR_Github #11474 [ run ] triggered by Bot

PR_Github #11474 [ run ] completed with state
Merge this PR. Thank you for the contribution.

Thank you very much for merging the PR!
@gkswns0531 would it be possible for you to share the benchmarking code for both vLLM and TRT-LLM? Thanks!

Please refer to the repository below.

thanks for sharing! @gkswns0531
I'm trying to run Qwen3-4B but am getting this error:

ubuntu@ip-172-31-3-85:~/qwen2.5_engine_compare$ python3 build_qwen_engine.py --quantization fp8 --model_name qwen3-4b
@Shruti-db

Thanks @gkswns0531, I patched your fix. Works now!
Thank you for the amazing contribution @gkswns0531. Do you have any updates on MoE support? Best, Michael
Tested successfully with the Qwen3-1.7B model. (Support for the Qwen3 MoE architecture will be added in a future update.)
Dear Maintainers,
I would like to kindly submit a pull request that adds support for building the Qwen3 model as a TensorRT engine.
I would be truly grateful if you could take the time to review it at your convenience.
Thank you very much for your consideration.
ref)
vLLM vs TensorRT-LLM latency (Qwen3-1.7B, L4 GPU, 2048 input_length, 3072 max_length, batch 1)
FP16
FP8
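The exact harness lives in the repository referenced in the comments above; as a rough sketch of how the vLLM side of such a latency measurement could look (the model id and token counts mirror the setup line above, everything else is an assumption):

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-1.7B", dtype="float16")
# 2048 input tokens, 3072 max total length -> up to 1024 generated tokens.
params = SamplingParams(max_tokens=1024, temperature=0.0)
prompt = "hello " * 2048  # crude stand-in for a ~2048-token prompt

start = time.perf_counter()
llm.generate([prompt], params)
print(f"end-to-end latency: {time.perf_counter() - start:.2f} s")
```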
Related to #5673