Background
Currently, the quantization transformation is applied at the beginning of the transformation pipeline. It rewrites standard linear ops into quantized ops (e.g., a fused quant_linear). While this works, it couples the transformation passes to the quantization backends: downstream passes (e.g., pattern matching, weight fusion, sharding) must all be quantization-aware.
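As a simplified, framework-agnostic sketch of the coupling problem (the op names and pass functions here are hypothetical, not the actual pipeline API): once linear nodes are rewritten into quant_linear early on, every later pass must match both the plain and the quantized op name.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal stand-in for a graph node (hypothetical, not the real IR)."""
    op: str
    meta: dict = field(default_factory=dict)

def quantize_pass(graph):
    # Early rewrite: linear -> quant_linear (illustrative op names).
    for node in graph:
        if node.op == "linear":
            node.op = "quant_linear"

def fusion_pass(graph):
    # Downstream pass is forced to be quantization-aware:
    # it must match BOTH op names or it silently misses quantized nodes.
    return [n for n in graph if n.op in ("linear", "quant_linear")]

graph = [Node("linear"), Node("attention"), Node("linear")]
quantize_pass(graph)
matched = fusion_pass(graph)
```

Every new quantized op variant multiplies the match patterns each downstream pass must carry, which is the coupling this proposal aims to remove.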
Goals
- Introduce an annotation-based approach that preserves quantization metadata across passes, unifying and simplifying how quantized ops are handled in the transformation pipeline.
- Leverage the InferenceOptimizer transformation system (staged passes, config inheritance) to modularize quantization-aware logic.
- Consider alternatives to node.meta for preserving quantization info.
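A minimal sketch of the annotation-based idea (node and pass names are hypothetical): op names stay standard, quantization info lives in node.meta, downstream passes match only the standard op, and a final lowering stage consumes the annotation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal stand-in for a graph node (hypothetical, not the real IR)."""
    op: str
    meta: dict = field(default_factory=dict)

def annotate_quantization(graph, scheme="fp8"):
    # Annotate instead of rewrite: op names stay standard, so
    # downstream passes need no quantization-specific matching.
    for node in graph:
        if node.op == "linear":
            node.meta["quant"] = {"scheme": scheme}

def fusion_pass(graph):
    # Unchanged downstream pass: matches only the standard op name.
    return [n for n in graph if n.op == "linear"]

def lower_quantized(graph):
    # Final stage: consume the annotation and rewrite to backend ops.
    for node in graph:
        if "quant" in node.meta:
            node.op = "quant_linear"

graph = [Node("linear"), Node("attention")]
annotate_quantization(graph)
fused = fusion_pass(graph)   # sees a plain "linear" node
lower_quantized(graph)       # only now does quant_linear appear
```

The key property is that quantization awareness is confined to the annotation pass and the lowering pass; everything in between stays backend-agnostic.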
Exploration
- Configurable quantization backend options, similar to the attention backend configuration.
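One possible shape for a configurable quantization backend, mirroring how an attention backend can be selected by a config key (the registry, backend names, and config key below are all hypothetical, shown only to illustrate the idea):

```python
# Hypothetical backend registry; names and config keys are illustrative only.
QUANT_BACKENDS = {}

def register_quant_backend(name):
    """Decorator registering a lowering function under a backend name."""
    def deco(fn):
        QUANT_BACKENDS[name] = fn
        return fn
    return deco

@register_quant_backend("fp8")
def fp8_lowering(node_op):
    # Map an annotated standard op to an fp8 backend op name.
    return f"fp8_{node_op}"

@register_quant_backend("int4_awq")
def int4_awq_lowering(node_op):
    return f"int4_awq_{node_op}"

def lower(node_op, config):
    # Config-driven selection, analogous to an attention-backend switch:
    # the pipeline stays generic; only the lowering step is backend-specific.
    backend = QUANT_BACKENDS[config["quant_backend"]]
    return backend(node_op)
```

Usage would look like `lower("linear", {"quant_backend": "fp8"})`, so swapping backends is a config change rather than a pipeline change.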