[feat] Add TensorRT-Engine Qwen3 (dense) model support #5650
Conversation
- Add Qwen3ForCausalLM mapping in models/__init__.py
- Update config.py to support the Qwen3 architecture with qwen3/qwen3_moe types
- Add Qwen3 weight conversion logic in convert.py
- Implement Qwen3-specific model modifications in model.py
- Support attention_bias=False and qk_layernorm=True for Qwen3
- Enable FP16 and FP8 quantization for Qwen3

Tested successfully with the Qwen3-1.7B model. World's first TensorRT-LLM Qwen3 implementation.

Signed-off-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>
Signed-off-by: Hanjun Cho <46752251+gkswns0531@users.noreply.github.com>
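To make the attention_bias/qk_layernorm switches concrete, here is a condensed sketch of the logic this PR adds, distilled from the config/model diff shown later in this thread (not a verbatim excerpt of the source):

```python
def qwen3_attention_switches(qwen_type: str):
    """Return (attn_bias, qk_layernorm) for a given qwen_type."""
    # Qwen3 models drop the attention bias that earlier Qwen models used.
    attn_bias = qwen_type not in ('qwen3', 'qwen3_moe')
    # Qwen3 applies RMSNorm to the Q/K projections (similar to Gemma3).
    qk_layernorm = qwen_type in ('qwen3', 'qwen3_moe')
    return attn_bias, qk_layernorm

assert qwen3_attention_switches('qwen3') == (False, True)
assert qwen3_attention_switches('qwen2') == (True, False)
```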
Dear @juney-nvidia, I hope you're doing well. I just wanted to very politely follow up on this PR, as it's been a few days since it was opened and I thought it might have been missed.
Hello sir, I tried your commit and used python3 convert_checkpoint.py to convert, but I got an error: Traceback (most recent call last):
@PhamGiaMinh The error you're encountering happens because the script you're using is located under the 'llama' directory: "TensorRT-LLM/examples/models/core/llama/convert_checkpoint.py". That script is intended for LLaMA models. If you're trying to convert a Qwen model, please use the correct script located here: "TensorRT-LLM/examples/models/core/qwen/convert_checkpoint.py". Here's the recommended command:

python3 ./TensorRT-LLM/examples/models/core/qwen/convert_checkpoint.py \
    --model_dir <path_to_local_huggingface_qwen_model> \
    --output_dir <path_to_save_converted_checkpoint> \
    --dtype float16 \
    --tp_size 1 \
    --pp_size 1 \
    --workers 1

If you run into any other errors or have questions about the engine build commands, feel free to leave a comment anytime.
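For the next step, a typical engine build from the converted checkpoint looks like the following; this is an illustrative invocation, not the thread author's exact command, so adjust the paths and sequence limits to your setup:

```bash
trtllm-build \
    --checkpoint_dir <path_to_save_converted_checkpoint> \
    --output_dir <path_to_save_engine> \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_seq_len 3072
```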
/bot run

/bot run
@gkswns0531 Dear Sir, I have downloaded the updated script, but when I run the following command:

2025-07-08 03:14:44.233721: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered

It seems like qwen3_moe is not supported in QWenConfig.from_hugging_face(). Could you please advise how to convert the Qwen3-30B-A3B checkpoint using TensorRT-LLM? Is there any planned support for qwen3_moe, or any workaround I could try? Thank you in advance!
If you take a look at my PR, support for qwen3 and qwen3_moe has been added to TensorRT-LLM/tensorrt_llm/models/qwen/config.py. However, it seems that this change is not being reflected, which is likely causing the error. Could you please check if the changes have been properly updated?
Dear sir @gkswns0531, thank you for your recent update. I tried converting the model using your latest commit, but encountered a runtime error. Here is the command I ran: And here is the full traceback: I double-checked that the model files were downloaded properly from Hugging Face (Qwen3-30B-A3B), including config and tokenizer files. Could you please help confirm:

- Is Qwen3-30B-A3B officially supported in this branch?
- Is there any missing preprocessing step or expected model structure for the MoE experts?
- Do I need to manually convert experts separately?

Thanks in advance for your support!
Thank you very much for pointing this out. You're absolutely right, the MoE architecture required additional handling, and I've almost completed the necessary changes. However, I'd like to check with the reviewer whether these updates should be included in the current PR, or if it would be more appropriate to submit them in a separate one. Thank you again for your feedback.
@gkswns0531 I am fine with merging this PR first. But we can emphasize that we support the dense model in this PR, and plan to support the MoE model in another PR.
@byshiue Thank you!

/bot run

PR_Github #11274 [ run ] triggered by Bot

PR_Github #11274 [ run ] completed with state
@gkswns0531 Hi, thanks for your contribution. I confirm this works for Qwen3-14B, but I got this error for Qwen3-32B-FP8 when using your convert_checkpoint.py. Command: Result: Traceback (most recent call last):
@Dramilio Thank you for reporting this issue and providing the detailed error information. You're right about the QKV weight shape mismatch. After analyzing the error, I can confirm this is an existing issue with the base Qwen model support in TensorRT-LLM. The problem occurs in `head_dim = hf_config.hidden_size // hf_config.num_attention_heads`. However, for the Qwen3-32B model, the config declares an explicit `head_dim` that differs from this computed value.

I believe the proper fix would be to read `head_dim = getattr(hf_config, 'head_dim', hf_config.hidden_size // hf_config.num_attention_heads)`. This would make the implementation more robust and compatible with models that have non-standard head dimensions. Since this affects the base Qwen model support, I think it would be best to address this in a separate fix PR after this current PR is merged. Thanks again for the thorough testing and reporting.
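For concreteness, the mismatch can be illustrated with a small self-contained snippet; the Qwen3-32B numbers below are taken from its Hugging Face config, and the stub class exists only for illustration:

```python
class HFConfigStub:
    """Stub mimicking the relevant fields of Qwen3-32B's Hugging Face config."""
    hidden_size = 5120
    num_attention_heads = 64
    head_dim = 128  # explicitly declared, not hidden_size // num_attention_heads

hf_config = HFConfigStub()

# Naive derivation: 5120 // 64 == 80, which mismatches the real head_dim of 128.
naive = hf_config.hidden_size // hf_config.num_attention_heads

# Proposed robust read: prefer the explicit attribute, fall back to the derived value.
head_dim = getattr(hf_config, 'head_dim',
                   hf_config.hidden_size // hf_config.num_attention_heads)

assert naive == 80 and head_dim == 128
```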
@gkswns0531 The CI fails due to the pre-commit checks:

[2025-07-08T09:28:03.465Z] + python3 -u scripts/release_check.py
[2025-07-08T09:28:03.465Z] Running command: pip3 install pre-commit
[2025-07-08T09:28:05.358Z] Running command: pip3 install bandit==1.7.7
[2025-07-08T09:28:08.636Z] Running command: pre-commit install
[2025-07-08T09:28:08.636Z] Running command: pre-commit run -a --show-diff-on-failure
[2025-07-08T09:38:00.337Z] Failing command output:
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/pycqa/isort.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/Lucas-C/pre-commit-hooks.git.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/google/yapf.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.
[2025-07-08T09:38:00.337Z] [WARNING] repo `https://github.com/pre-commit/pre-commit-hooks` uses deprecated stage names (commit, push) which will be removed in a future version. Hint: often `pre-commit autoupdate --repo https://github.com/pre-commit/pre-commit-hooks` will fix this. if it does not -- consider reporting an issue to that repo.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/PyCQA/autoflake.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/PyCQA/autoflake:tomli.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/pre-commit/mirrors-clang-format.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/cheshirekow/cmake-format-precommit.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/codespell-project/codespell.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/codespell-project/codespell:tomli.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/astral-sh/ruff-pre-commit.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/executablebooks/mdformat.
[2025-07-08T09:38:00.337Z] [INFO] Initializing environment for https://github.com/executablebooks/mdformat:mdformat_frontmatter.
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/pycqa/isort.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/Lucas-C/pre-commit-hooks.git.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/google/yapf.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/PyCQA/autoflake.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/pre-commit/mirrors-clang-format.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/cheshirekow/cmake-format-precommit.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/codespell-project/codespell.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/astral-sh/ruff-pre-commit.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] [INFO] Installing environment for https://github.com/executablebooks/mdformat.
[2025-07-08T09:38:00.337Z] [INFO] Once installed this environment will be reused.
[2025-07-08T09:38:00.337Z] [INFO] This may take a few minutes...
[2025-07-08T09:38:00.337Z] isort....................................................................Failed
[2025-07-08T09:38:00.337Z] - hook id: isort
[2025-07-08T09:38:00.337Z] - files were modified by this hook
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] Fixing /home/jenkins/agent/workspace/LLM/main/L0_MergeRequest_PR/llm/tensorrt_llm/models/qwen/model.py
[2025-07-08T09:38:00.337Z] Skipped 139 files
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] CRLF end-lines remover...................................................Passed
[2025-07-08T09:38:00.337Z] yapf.....................................................................Failed
[2025-07-08T09:38:00.337Z] - hook id: yapf
[2025-07-08T09:38:00.337Z] - files were modified by this hook
[2025-07-08T09:38:00.337Z] check for added large files..............................................Passed
[2025-07-08T09:38:00.337Z] check for merge conflicts................................................Passed
[2025-07-08T09:38:00.337Z] check for broken symlinks............................(no files to check)Skipped
[2025-07-08T09:38:00.337Z] detect private key.......................................................Passed
[2025-07-08T09:38:00.337Z] fix end of files.........................................................Passed
[2025-07-08T09:38:00.337Z] check yaml...............................................................Passed
[2025-07-08T09:38:00.337Z] trim trailing whitespace.................................................Passed
[2025-07-08T09:38:00.337Z] check toml...............................................................Passed
[2025-07-08T09:38:00.337Z] mixed line ending........................................................Passed
[2025-07-08T09:38:00.337Z] debug statements (python)................................................Passed
[2025-07-08T09:38:00.337Z] check json...........................................(no files to check)Skipped
[2025-07-08T09:38:00.337Z] autoflake................................................................Passed
[2025-07-08T09:38:00.337Z] clang-format.............................................................Passed
[2025-07-08T09:38:00.337Z] cmake-format.............................................................Passed
[2025-07-08T09:38:00.337Z] codespell................................................................Passed
[2025-07-08T09:38:00.337Z] ruff.....................................................................Passed
[2025-07-08T09:38:00.337Z] ruff-format..............................................................Passed
[2025-07-08T09:38:00.337Z] mdformat.................................................................Passed
[2025-07-08T09:38:00.337Z] pre-commit hook(s) made changes.
[2025-07-08T09:38:00.337Z] If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
[2025-07-08T09:38:00.337Z] To run `pre-commit` as part of git workflow, use `pre-commit install`.
[2025-07-08T09:38:00.337Z] All changes made by hooks:
[2025-07-08T09:38:00.337Z] diff --git a/tensorrt_llm/models/qwen/config.py b/tensorrt_llm/models/qwen/config.py
[2025-07-08T09:38:00.337Z] index 2e6f0a7..47d1e15 100644
[2025-07-08T09:38:00.337Z] --- a/tensorrt_llm/models/qwen/config.py
[2025-07-08T09:38:00.337Z] +++ b/tensorrt_llm/models/qwen/config.py
[2025-07-08T09:38:00.337Z] @@ -114,7 +114,7 @@ class QWenConfig(PretrainedConfig):
[2025-07-08T09:38:00.337Z] hidden_act = getattr(hf_config, "hidden_act", "silu")
[2025-07-08T09:38:00.337Z] if qwen_type == "qwen2_moe":
[2025-07-08T09:38:00.337Z] hidden_act = "swiglu"
[2025-07-08T09:38:00.337Z] -
[2025-07-08T09:38:00.337Z] +
[2025-07-08T09:38:00.337Z] # Qwen3 models have no attention bias, while legacy models have bias
[2025-07-08T09:38:00.337Z] if qwen_type in ('qwen3', 'qwen3_moe'):
[2025-07-08T09:38:00.337Z] attn_bias = False # Qwen3 models have no attn bias
[2025-07-08T09:38:00.337Z] diff --git a/tensorrt_llm/models/qwen/convert.py b/tensorrt_llm/models/qwen/convert.py
[2025-07-08T09:38:00.337Z] index ccc8fc9..dc2bc35 100644
[2025-07-08T09:38:00.337Z] --- a/tensorrt_llm/models/qwen/convert.py
[2025-07-08T09:38:00.337Z] +++ b/tensorrt_llm/models/qwen/convert.py
[2025-07-08T09:38:00.337Z] @@ -658,20 +658,30 @@ def convert_hf_qwen(hf_model,
[2025-07-08T09:38:00.337Z] # Qwen3: Add q_norm and k_norm weight conversion
[2025-07-08T09:38:00.337Z] if qwen_type in ('qwen3', 'qwen3_moe'):
[2025-07-08T09:38:00.337Z] # Process q_norm.weight
[2025-07-08T09:38:00.337Z] - q_norm_weight = get_weight(model_params, prefix + key_list[0] + 'q_norm', dtype)
[2025-07-08T09:38:00.337Z] + q_norm_weight = get_weight(model_params,
[2025-07-08T09:38:00.337Z] + prefix + key_list[0] + 'q_norm', dtype)
[2025-07-08T09:38:00.337Z] weights.update(
[2025-07-08T09:38:00.337Z] - get_tllm_linear_weight(q_norm_weight, tllm_prex + 'attention.q_layernorm.',
[2025-07-08T09:38:00.337Z] - None, False, # LayerNorm should not be quantized
[2025-07-08T09:38:00.337Z] - plugin_weight_only_quant_type, dtype,
[2025-07-08T09:38:00.337Z] - use_gemm_woq_plugin))
[2025-07-08T09:38:00.337Z] -
[2025-07-08T09:38:00.337Z] - # Process k_norm.weight
[2025-07-08T09:38:00.337Z] - k_norm_weight = get_weight(model_params, prefix + key_list[0] + 'k_norm', dtype)
[2025-07-08T09:38:00.337Z] + get_tllm_linear_weight(
[2025-07-08T09:38:00.337Z] + q_norm_weight,
[2025-07-08T09:38:00.337Z] + tllm_prex + 'attention.q_layernorm.',
[2025-07-08T09:38:00.337Z] + None,
[2025-07-08T09:38:00.337Z] + False, # LayerNorm should not be quantized
[2025-07-08T09:38:00.337Z] + plugin_weight_only_quant_type,
[2025-07-08T09:38:00.337Z] + dtype,
[2025-07-08T09:38:00.337Z] + use_gemm_woq_plugin))
[2025-07-08T09:38:00.337Z] +
[2025-07-08T09:38:00.337Z] + # Process k_norm.weight
[2025-07-08T09:38:00.337Z] + k_norm_weight = get_weight(model_params,
[2025-07-08T09:38:00.337Z] + prefix + key_list[0] + 'k_norm', dtype)
[2025-07-08T09:38:00.337Z] weights.update(
[2025-07-08T09:38:00.337Z] - get_tllm_linear_weight(k_norm_weight, tllm_prex + 'attention.k_layernorm.',
[2025-07-08T09:38:00.337Z] - None, False, # LayerNorm should not be quantized
[2025-07-08T09:38:00.337Z] - plugin_weight_only_quant_type, dtype,
[2025-07-08T09:38:00.337Z] - use_gemm_woq_plugin))
[2025-07-08T09:38:00.337Z] + get_tllm_linear_weight(
[2025-07-08T09:38:00.337Z] + k_norm_weight,
[2025-07-08T09:38:00.337Z] + tllm_prex + 'attention.k_layernorm.',
[2025-07-08T09:38:00.337Z] + None,
[2025-07-08T09:38:00.337Z] + False, # LayerNorm should not be quantized
[2025-07-08T09:38:00.337Z] + plugin_weight_only_quant_type,
[2025-07-08T09:38:00.337Z] + dtype,
[2025-07-08T09:38:00.337Z] + use_gemm_woq_plugin))
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] if qwen_type == "qwen2_moe" and moe_config and moe_config.has_moe():
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] diff --git a/tensorrt_llm/models/qwen/model.py b/tensorrt_llm/models/qwen/model.py
[2025-07-08T09:38:00.337Z] index 16d748d..0fb003a 100644
[2025-07-08T09:38:00.337Z] --- a/tensorrt_llm/models/qwen/model.py
[2025-07-08T09:38:00.337Z] +++ b/tensorrt_llm/models/qwen/model.py
[2025-07-08T09:38:00.337Z] @@ -21,7 +21,7 @@ import torch
[2025-07-08T09:38:00.337Z] from tqdm import tqdm
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] from ..._utils import pad_vocab_size
[2025-07-08T09:38:00.337Z] -from ...functional import Tensor, recv, send, LayerNormType
[2025-07-08T09:38:00.337Z] +from ...functional import LayerNormType, Tensor, recv, send
[2025-07-08T09:38:00.337Z] from ...layers import (MOE, Attention, AttentionMaskType, ColumnLinear,
[2025-07-08T09:38:00.337Z] Embedding, GatedMLP, RmsNorm, SharedMoE)
[2025-07-08T09:38:00.337Z] from ...layers.moe import MOEWeightWrapper
[2025-07-08T09:38:00.337Z] @@ -38,6 +38,7 @@ from .config import QWenConfig
[2025-07-08T09:38:00.337Z] from .convert import (load_hf_qwen, load_weights_from_hf_gptq_model,
[2025-07-08T09:38:00.337Z] load_weights_from_hf_model)
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] +
[2025-07-08T09:38:00.337Z] class QWenDecoderLayer(Module):
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] def __init__(self, config: QWenConfig, layer_idx: int):
[2025-07-08T09:38:00.337Z] @@ -57,7 +58,7 @@ class QWenDecoderLayer(Module):
[2025-07-08T09:38:00.337Z] local_layer_idx = layer_idx - layers_range[0]
[2025-07-08T09:38:00.337Z] # Qwen3: Enable qk_layernorm for Q/K normalization (similar to Gemma3)
[2025-07-08T09:38:00.337Z] qk_layernorm = config.qwen_type in ('qwen3', 'qwen3_moe')
[2025-07-08T09:38:00.337Z] -
[2025-07-08T09:38:00.337Z] +
[2025-07-08T09:38:00.337Z] self.attention = Attention(
[2025-07-08T09:38:00.337Z] local_layer_idx=local_layer_idx,
[2025-07-08T09:38:00.337Z] hidden_size=config.hidden_size,
[2025-07-08T09:38:00.337Z] @@ -83,7 +84,8 @@ class QWenDecoderLayer(Module):
[2025-07-08T09:38:00.337Z] dense_bias=False,
[2025-07-08T09:38:00.337Z] # Qwen3: Add Q/K layer normalization
[2025-07-08T09:38:00.337Z] qk_layernorm=qk_layernorm,
[2025-07-08T09:38:00.337Z] - layernorm_type=LayerNormType.RmsNorm if qk_layernorm else LayerNormType.LayerNorm)
[2025-07-08T09:38:00.337Z] + layernorm_type=LayerNormType.RmsNorm
[2025-07-08T09:38:00.337Z] + if qk_layernorm else LayerNormType.LayerNorm)
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] if config.moe.has_moe():
[2025-07-08T09:38:00.337Z] mlp_kwargs = {'moe_config': config.moe, 'mapping': config.mapping}
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z]
[2025-07-08T09:38:00.337Z] Error: pre-commit checks failed
[2025-07-08T09:38:00.337Z] Please refer to our coding style guidelines at: https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#coding-style to fix this issue
[2025-07-08T09:38:00.337Z] + git restore .
[2025-07-08T09:38:00.337Z] + false

Can you run the following scripts to fix it?

pre-commit install
pre-commit run -a --show-diff-on-failure
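Since the failing hooks (isort, yapf) rewrite files in place, the usual resolution is to run them locally and commit whatever they change; a typical sequence (the commit message is just an example):

```bash
pre-commit install
pre-commit run --all-files   # lets isort/yapf rewrite the files locally
git add -u                   # stage the files the hooks modified
git commit -s -m "fix: apply pre-commit formatting"
```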
@gkswns0531 The quantization script is as follows:

os.makedirs(output_dir, exist_ok=True)
fp8_config = FineGrainedFP8Config(
model = AutoModelForCausalLM.from_pretrained(

Could you please help confirm:
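The script fragments above are truncated; as a rough reconstruction of the standard transformers fine-grained FP8 pattern they appear to follow (the model id, paths, and config arguments are placeholders, not the poster's actual values):

```python
import os

from transformers import AutoModelForCausalLM, FineGrainedFP8Config

output_dir = "./qwen3-fp8"  # placeholder output path
os.makedirs(output_dir, exist_ok=True)

# Fine-grained FP8 quantization config; defaults assumed here.
fp8_config = FineGrainedFP8Config()

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B",               # placeholder model id
    quantization_config=fp8_config,
    device_map="auto",
)
model.save_pretrained(output_dir)
```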
@byshiue

/bot run

PR_Github #11474 [ run ] triggered by Bot

PR_Github #11474 [ run ] completed with state
Merge this PR. Thank you for the contribution.

Thank you very much for merging the PR!
@gkswns0531 would it be possible for you to share the benchmarking code for both vLLM and TRT-LLM? Thanks!

Please refer to the repository below.

thanks for sharing! @gkswns0531
I'm trying to run Qwen3-4B but am getting this error:

ubuntu@ip-172-31-3-85:~/qwen2.5_engine_compare$ python3 build_qwen_engine.py --quantization fp8 --model_name qwen3-4b
@Shruti-db

Thanks @gkswns0531, I patched your fix. Works now!
Thank you for the amazing contribution @gkswns0531. Do you have any updates on MoE support? Best, Michael
Tested successfully with the Qwen3-1.7B model. (Support for the Qwen3 MoE architecture will be added in a future update.)
Dear Maintainers,
I would like to kindly submit a pull request that adds support for building the Qwen3 model as a TensorRT engine.
I would be truly grateful if you could take the time to review it at your convenience.
Thank you very much for your consideration.
ref)
vLLM vs TensorRT-LLM latency (Qwen3-1.7B, L4 GPU, 2048 input_length, 3072 max_length, batch 1)
FP16
FP8
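The exact harness lives in the repository referenced in the comments above; as a rough sketch of how the vLLM side of such a latency measurement could look (the model id and token counts mirror the setup line above, everything else is an assumption):

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-1.7B", dtype="float16")
# 2048 input tokens, 3072 max total length -> up to 1024 generated tokens.
params = SamplingParams(max_tokens=1024, temperature=0.0)
prompt = "hello " * 2048  # crude stand-in for a ~2048-token prompt

start = time.perf_counter()
llm.generate([prompt], params)
print(f"end-to-end latency: {time.perf_counter() - start:.2f} s")
```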
Related to #5673