A running epic to track work on increasing model coverage beyond attention-based text-to-text models.
Some imminent features / discussions this entails:
- Add support for handling logit soft capping (used in Gemini, Grok, Gemma-2, etc.). The attention system currently fails for google/gemma-2-27b-it. ("warm-up") See the soft-capping sketch after this list.
- Consider linear attention and other state-space model approaches (see the linear-attention sketch after this list)
- Long-context and seqlen-dependent attention masking (sliding window, chunked attention, ...); see the masking sketch after this list
- VLM and other many-to-text models
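
Logit soft capping itself is a small change: raw logits are squashed with a scaled tanh before the softmax. A minimal sketch in PyTorch, assuming a per-model `soft_cap` hyperparameter (Gemma-2 uses 50.0 for attention logits); the function name is illustrative, not an existing API:

```python
import torch

def soft_cap_logits(logits: torch.Tensor, soft_cap: float) -> torch.Tensor:
    """Squash raw logits into (-soft_cap, soft_cap) with a scaled tanh."""
    return soft_cap * torch.tanh(logits / soft_cap)

# Applied to attention scores just before the softmax.
scores = torch.randn(2, 8, 16, 16)              # (batch, heads, q_len, k_len)
probs = torch.softmax(soft_cap_logits(scores, soft_cap=50.0), dim=-1)
```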
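For the linear-attention discussion, the key difference from softmax attention is that a running state replaces the full score matrix, so cost grows linearly with sequence length. A rough sketch of the causal recurrence, with an illustrative feature map and no claim about which variant would be adopted:

```python
import torch
import torch.nn.functional as F

def phi(x: torch.Tensor) -> torch.Tensor:
    return F.elu(x) + 1.0                       # positive feature map (illustrative choice)

def linear_attention(q, k, v):
    """Causal linear attention: carry a (d_k x d_v) state instead of an O(n^2) score matrix."""
    q, k = phi(q), phi(k)
    state = q.new_zeros(q.size(-1), v.size(-1)) # running sum of outer(k_t, v_t)
    norm = q.new_zeros(q.size(-1))              # running sum of k_t for normalization
    out = []
    for t in range(q.size(0)):
        state = state + torch.outer(k[t], v[t])
        norm = norm + k[t]
        out.append((state.T @ q[t]) / (norm @ q[t] + 1e-6))
    return torch.stack(out)

q, k, v = (torch.randn(16, 32) for _ in range(3))  # (seq_len, dim)
y = linear_attention(q, k, v)                       # (seq_len, d_v)
```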
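For seqlen-dependent masking, a sliding-window mask restricts each query to the most recent `window` key positions on top of the usual causal constraint. A minimal sketch with illustrative names and shapes:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: True where query position i may attend to key position j."""
    i = torch.arange(seq_len).unsqueeze(1)      # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)      # key positions, row vector
    return (j <= i) & (i - j < window)          # causal AND within the window

mask = sliding_window_mask(seq_len=8, window=4)
scores = torch.randn(8, 8).masked_fill(~mask, float("-inf"))
probs = torch.softmax(scores, dim=-1)
```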