NVIDIA/TensorRT-LLM

[AutoDeploy] Arch2: Model Support: VLM, Long-Context, and Linear Attention #4593

Description

@sugunav14

A running epic to track expanding model coverage beyond attention-based text-to-text models.

Imminent features and discussion items include:

  • Add support for logit soft capping, as used in Gemini, Grok, Gemma-2, etc. The attention system currently fails for google/gemma-2-27b-it during warm-up. See the soft-capping sketch after this list.
  • Consider linear attention and other state-space model approaches (see the linear-attention sketch below).
  • Long-context and seqlen-dependent attention masking (sliding window, chunked attention, ...); see the mask sketch below.
  • VLM and other many-to-text models
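
For reference, a minimal sketch of logit soft capping (not AutoDeploy's actual kernel): the attention logits are squashed with `tanh` so they stay in `[-cap, cap]`. The catch for fused attention backends is that the cap must be applied between the QK^T matmul and the softmax, so a vanilla fused-softmax kernel cannot express it.

```python
import torch

def soft_cap(scores: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    """Bound attention logits to [-cap, cap] via tanh (Gemma-2-style)."""
    return cap * torch.tanh(scores / cap)

# Illustrative use inside plain SDPA-style attention:
q = torch.randn(1, 8, 16, 64)                 # (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 16, 64)
scores = (q @ k.transpose(-2, -1)) / 64**0.5  # raw logits
probs = torch.softmax(soft_cap(scores), dim=-1)
```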
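For the linear-attention item, a minimal causal sketch in the style of Katharopoulos et al. (elu+1 feature map); names and shapes here are illustrative, not AutoDeploy APIs. It shows why these models need a recurrent-state cache instead of a KV cache: the output at each step depends only on prefix sums, not on all past keys and values.

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v):
    """O(n) causal attention: out_i = phi(q_i) S_i / (phi(q_i) z_i), where
    S_i and z_i are prefix sums of phi(k_j) v_j^T and phi(k_j)."""
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0              # positive feature map
    # Prefix sums materialized per position for clarity; a production kernel
    # carries only the latest (d, e) state and (d,) normalizer.
    kv = torch.einsum("bhsd,bhse->bhsde", k, v).cumsum(dim=2)
    z = k.cumsum(dim=2)
    num = torch.einsum("bhsd,bhsde->bhse", q, kv)
    den = torch.einsum("bhsd,bhsd->bhs", q, z).clamp_min(1e-6)
    return num / den.unsqueeze(-1)
```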
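And for the masking item, minimal boolean-mask constructions for sliding-window and chunked causal attention. The point is that both masks depend on the sequence length (plus window/chunk size), so they have to be built per request at runtime rather than baked into the compiled graph.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where query i may attend key j: causal and within `window`."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (i - j < window)

def chunked_causal_mask(seq_len: int, chunk: int) -> torch.Tensor:
    """Causal attention restricted to the query's own chunk."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (i // chunk == j // chunk)
```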

Sub-issues

Metadata

Labels

AutoDeploy&lt;NV&gt; (AutoDeploy Backend)

Status

In progress

Relationships

None yet

Development

No branches or pull requests
