Refactor AWQ support by julian-q · Pull Request #926 · vllm-project/vllm · GitHub

Conversation

@julian-q
Contributor

@julian-q julian-q commented Aug 31, 2023

This PR is intended to build on top of and be merged into #762. It streamlines the addition of AWQ weight quantization support for new models and makes that support easier to maintain. Posting here as a draft for review.

  1. Minimizes the dependency on llm-awq code
  2. Integrates quantized linear layers into the existing tensor parallel layers (the AWQ weight layout these layers consume is sketched below)
  3. Decomposes the quantized weight loading logic where possible
  4. Supports tensor parallelism

Special thanks to @robirv938, @ri938, and @casper-hansen for making lots of progress on AWQ support.
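
For readers new to the checkpoint layout, here is a minimal reference sketch of how the AWQ-style 4-bit tensors (qweight, qzeros, scales) could be dequantized into a dense weight. This is not code from this PR (the PR calls a fused CUDA GEMM kernel and never materializes the dense matrix), and it assumes a sequential nibble packing order and group_size=128, whereas the real AWQ kernels use an interleaved order.

```python
import torch


def dequantize_awq(
    qweight: torch.Tensor,  # int32, [in_features, out_features // 8]
    qzeros: torch.Tensor,   # int32, [in_features // group_size, out_features // 8]
    scales: torch.Tensor,   # float16, [in_features // group_size, out_features]
    group_size: int = 128,
) -> torch.Tensor:
    """Unpack 4-bit weights and apply per-group zero points and scales (reference only)."""
    shifts = torch.arange(0, 32, 4, device=qweight.device)  # 8 nibbles per int32

    # Unpack along the output dimension: [in, out // 8, 8] -> [in, out].
    weights = ((qweight.unsqueeze(-1) >> shifts) & 0xF).reshape(qweight.shape[0], -1).float()
    zeros = ((qzeros.unsqueeze(-1) >> shifts) & 0xF).reshape(qzeros.shape[0], -1).float()

    # Broadcast the per-group zeros and scales over each group of input rows.
    zeros = zeros.repeat_interleave(group_size, dim=0)
    scales = scales.float().repeat_interleave(group_size, dim=0)

    # Result is [in_features, out_features]; a reference forward pass is then x @ result.
    return (weights - zeros) * scales
```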

@casper-hansen
Contributor

@julian-q Great work, thank you for this. Just one thing: the commit f5e8d15 that I created was only a work in progress for MPT models, so I think we have to revert it once this PR is ready to be merged.

Have you tested models other than LLaMA? I do want to test this with MPT models, as they are my go-to since they are Apache 2.0 licensed.

Member

@zhuohan123 zhuohan123 left a comment


Thanks for your contribution, Julian! I left some review comments. Please also make sure to format your code with format.sh.

Comment on lines +135 to +140
Member


Is there a way for us to directly tell whether a model is quantized from the model's checkpoint or config, without asking the user to pass in an argument?

Contributor Author


Good question. The config doesn't include quantization info, but maybe we can check for the names of quantized parameters, e.g. qweight, qzeros, and scales.

Contributor


> Good question. The config doesn't include quantization info, but maybe we can check for the names of quantized parameters, e.g. qweight, qzeros, and scales.

AutoAWQ creates a quant_config.json. If you do not want to rely on this config, you can do what @julian-q said, but beware of the potential overhead associated with loading the state dict and looking for module names.
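
For illustration, a minimal sketch (not from this PR) of the parameter-name check discussed above, assuming a .safetensors checkpoint; because safetensors keeps tensor names in the file header, the check does not have to load the full state dict. The helper name is made up.

```python
from safetensors import safe_open

AWQ_PARAM_SUFFIXES = ("qweight", "qzeros", "scales")


def checkpoint_looks_quantized(checkpoint_path: str) -> bool:
    """Return True if any tensor name in the .safetensors file ends with an AWQ-style suffix."""
    with safe_open(checkpoint_path, framework="pt") as f:
        return any(name.endswith(AWQ_PARAM_SUFFIXES) for name in f.keys())
```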

Contributor Author

@julian-q julian-q Sep 5, 2023


AutoAWQ looks really cool! @casper-hansen
Yes, loading the state dict is not ideal, so this could be a good solution.

Member


Maybe not to be fixed in this PR: I feel that passing quant_config into every class of every model is too much overhead. We should find some other way to seamlessly support quantization across different models.
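
For illustration only, one alternative would be to register the quantization config once and let a small factory decide which linear implementation to build, so individual model classes never receive quant_config. This is a hypothetical sketch of that idea, not what vLLM ultimately implemented.

```python
from typing import Any, Dict, Optional

import torch.nn as nn

_QUANT_CONFIG: Optional[Dict[str, Any]] = None  # hypothetical process-wide setting


def set_quant_config(config: Optional[Dict[str, Any]]) -> None:
    """Called once by the engine before the model is constructed."""
    global _QUANT_CONFIG
    _QUANT_CONFIG = config


class QuantizedLinear(nn.Module):
    """Stand-in for an AWQ-backed linear layer (the real one holds qweight/qzeros/scales)."""

    def __init__(self, in_features: int, out_features: int, **quant_kwargs: Any) -> None:
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.quant_kwargs = quant_kwargs


def make_linear(in_features: int, out_features: int) -> nn.Module:
    """Build a quantized linear layer if a quant config was registered, else a plain nn.Linear."""
    if _QUANT_CONFIG is not None:
        return QuantizedLinear(in_features, out_features, **_QUANT_CONFIG)
    return nn.Linear(in_features, out_features, bias=False)
```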

Contributor


I have created AutoAWQ, which automatically saves a quant_config.json and loads it from Hugging Face.
https://github.com/casper-hansen/AutoAWQ
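
A hedged sketch of how a loader could consume the quant_config.json that AutoAWQ writes next to the weights; the key names in the usage comment (w_bit, q_group_size) reflect my understanding of AutoAWQ's format at the time and are an assumption, not a stable schema.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def load_quant_config(model_dir: str) -> Optional[Dict[str, Any]]:
    """Return the parsed quant config if the model directory contains one, else None."""
    config_path = Path(model_dir) / "quant_config.json"
    if not config_path.is_file():
        return None
    return json.loads(config_path.read_text())


# Example (keys are assumptions): load_quant_config("./llama-2-7b-awq") -> {"w_bit": 4, "q_group_size": 128, ...}
```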

@julian-q julian-q marked this pull request as ready for review September 5, 2023 05:44
@WoosukKwon
Collaborator

@julian-q We recently merged several PRs. Could you resolve the merge conflicts? Sorry for the inconvenience.

@casper-hansen
Contributor

casper-hansen commented Sep 8, 2023

There is now a new kernel in AWQ that is supposedly faster than the previous version. The format of the weights is now also slightly different.

Is this something we want to pivot to for this PR in order to see if it can improve throughput? It should be as easy as copying over the GEMV kernel and the new WQLinear module, quantizing a new model, and running throughput tests.

EDIT: After further evaluation, this new GEMV kernel seems to struggle with large contexts and increasing batch sizes. My conclusion is that the GEMM kernel already implemented in this PR will have the best performance.

Kernel: https://github.com/mit-han-lab/llm-awq/blob/main/awq/kernels/csrc/quantization/gemv_cuda.cu
New module: https://github.com/mit-han-lab/llm-awq/blob/main/awq/quantize/qmodule.py
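
For context, a rough throughput check along the lines described above could look like the sketch below. The model name and token budget are placeholders, and the quantization argument is the one vLLM's AWQ support eventually exposed, not something available in this draft.

```python
import time

from vllm import LLM, SamplingParams

# Placeholder AWQ checkpoint; swap in whichever quantized model is being evaluated.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
prompts = ["Hello, my name is"] * 64            # batch the requests to stress the GEMM path
params = SamplingParams(max_tokens=128, temperature=0.0)

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s")
```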

@casper-hansen
Contributor

@julian-q Could you please look into merging the main branch into your pull request, or do you need some help with this? I would love to have AWQ support out in vLLM soon :)

@WoosukKwon
Collaborator

For faster integration, I took over this PR. The PR will be polished and merged very soon. Thanks.

@WoosukKwon
Collaborator

@julian-q Closed as #1032 is now merged. Thanks again for the great PR! I learned a lot from it.

@WoosukKwon WoosukKwon closed this Sep 16, 2023
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Apr 23, 2025
This PR is identical to vllm-project#705, except that it's from a branch within
vllm-fork instead of a branch from an external forked repo.

The reason for this is that there are some issues with merging an external
branch; see Jira
[SW-218309](https://jira.habana-labs.com/browse/SW-218309). We have
consulted folks in SW-218309, and they suggested creating an internal branch
to trigger the Jenkins tests correctly, so I am creating this PR to
speed up the merge process.

cc @yma11 @michalkuligowski @jkaniecki

---------

Signed-off-by: yan ma <yan.ma@intel.com>
Signed-off-by: zhouyu5 <yu.zhou@intel.com>
Co-authored-by: yan ma <yan.ma@intel.com>
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>