Background
Currently, the quantization transformation is applied at the beginning of the transformation pipeline. It rewrites standard linear ops into quantized ops (e.g., a fused quant_linear). While this works, it couples the transformation passes to the quantization backends: downstream passes (e.g., pattern matching, weight fusion, sharding) must all be quantization-aware.
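As a simplified, framework-agnostic sketch of the coupling problem (the op names and pass functions here are hypothetical, not the actual pipeline API): once linear nodes are rewritten into quant_linear early on, every later pass must match both the plain and the quantized op name.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal stand-in for a graph node (hypothetical, not the real IR)."""
    op: str
    meta: dict = field(default_factory=dict)

def quantize_pass(graph):
    # Early rewrite: linear -> quant_linear (illustrative op names).
    for node in graph:
        if node.op == "linear":
            node.op = "quant_linear"

def fusion_pass(graph):
    # Downstream pass is forced to be quantization-aware:
    # it must match BOTH op names or it silently misses quantized nodes.
    return [n for n in graph if n.op in ("linear", "quant_linear")]

graph = [Node("linear"), Node("attention"), Node("linear")]
quantize_pass(graph)
matched = fusion_pass(graph)
```

Every new quantized op variant multiplies the match patterns each downstream pass must carry, which is the coupling this proposal aims to remove.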
Goals
- Introduce an annotation-based approach that preserves quantization metadata across passes, unifying and simplifying how quantized ops are handled in the transformation pipeline.
- Leverage the InferenceOptimizer transformation system (staged passes, config inheritance) to modularize quantization-aware logic.
- Consider alternatives to node.meta for preserving quantization info.
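A minimal sketch of the annotation-based idea (node and pass names are hypothetical): op names stay standard, quantization info lives in node.meta, downstream passes match only the standard op, and a final lowering stage consumes the annotation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal stand-in for a graph node (hypothetical, not the real IR)."""
    op: str
    meta: dict = field(default_factory=dict)

def annotate_quantization(graph, scheme="fp8"):
    # Annotate instead of rewrite: op names stay standard, so
    # downstream passes need no quantization-specific matching.
    for node in graph:
        if node.op == "linear":
            node.meta["quant"] = {"scheme": scheme}

def fusion_pass(graph):
    # Unchanged downstream pass: matches only the standard op name.
    return [n for n in graph if n.op == "linear"]

def lower_quantized(graph):
    # Final stage: consume the annotation and rewrite to backend ops.
    for node in graph:
        if "quant" in node.meta:
            node.op = "quant_linear"

graph = [Node("linear"), Node("attention")]
annotate_quantization(graph)
fused = fusion_pass(graph)   # sees a plain "linear" node
lower_quantized(graph)       # only now does quant_linear appear
```

The key property is that quantization awareness is confined to the annotation pass and the lowering pass; everything in between stays backend-agnostic.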
Exploration
- Configurable quantization backend options, similar to the attention backend configuration.
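One possible shape for a configurable quantization backend, mirroring how an attention backend can be selected by a config key (the registry, backend names, and config key below are all hypothetical, shown only to illustrate the idea):

```python
# Hypothetical backend registry; names and config keys are illustrative only.
QUANT_BACKENDS = {}

def register_quant_backend(name):
    """Decorator registering a lowering function under a backend name."""
    def deco(fn):
        QUANT_BACKENDS[name] = fn
        return fn
    return deco

@register_quant_backend("fp8")
def fp8_lowering(node_op):
    # Map an annotated standard op to an fp8 backend op name.
    return f"fp8_{node_op}"

@register_quant_backend("int4_awq")
def int4_awq_lowering(node_op):
    return f"int4_awq_{node_op}"

def lower(node_op, config):
    # Config-driven selection, analogous to an attention-backend switch:
    # the pipeline stays generic; only the lowering step is backend-specific.
    backend = QUANT_BACKENDS[config["quant_backend"]]
    return backend(node_op)
```

Usage would look like `lower("linear", {"quant_backend": "fp8"})`, so swapping backends is a config change rather than a pipeline change.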