Refactor AWQ support #926
Conversation
Add awq improvements
Merge linear layers
More improvements awq
…d_awq_quant_support
@julian-q Great work, thank you for this. Just one thing: the commit f5e8d15 that I created was only a work in progress for MPT models, so I think we have to revert it once this is ready to be merged. Have you tested models other than LLaMA? I do want to test this with MPT models, as they are my go-to since they are Apache 2.0 licensed.
Thanks for your contribution, Julian! I left some review comments. Also, please make sure to format your code with format.sh.
vllm/engine/arg_utils.py
Outdated
Is there a way for us to directly tell whether a model is quantized from the model's checkpoint or config, without asking the user to pass in an argument?
Good question. The config doesn't include quantization info, but maybe we can check for the names of quantized parameters, e.g. qweight, qzeros, and scales.
> Good question. The config doesn't include quantization info, but maybe we can check for the names of quantized parameters, e.g. qweight, qzeros, and scales.
AutoAWQ creates a quant_config.json. If you do not want to rely on this config, you can do what @julian-q said, but beware of the potential overhead associated with loading the state dict and looking for module names.
AutoAWQ looks really cool! @casper-hansen
Yes, loading the state dict is not ideal, so this could be a good solution.
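Below is a minimal sketch of the two detection strategies discussed in this thread, assuming a local checkpoint directory of `.bin` shards. The function and constant names are hypothetical, not vLLM's actual API.

```python
# Illustrative sketch only: detect AWQ quantization from a checkpoint directory
# without a user-supplied flag. Names here are hypothetical, not vLLM API.
import os

import torch

AWQ_PARAM_SUFFIXES = ("qweight", "qzeros", "scales")


def detect_awq_quantization(model_dir: str) -> bool:
    # Preferred: AutoAWQ saves a quant_config.json next to the weights.
    if os.path.exists(os.path.join(model_dir, "quant_config.json")):
        return True
    # Fallback: peek at parameter names in the checkpoint shards. This loads
    # the state dicts, which is the overhead mentioned above.
    for fname in os.listdir(model_dir):
        if fname.endswith(".bin"):
            state_dict = torch.load(
                os.path.join(model_dir, fname), map_location="cpu")
            if any(name.endswith(AWQ_PARAM_SUFFIXES) for name in state_dict):
                return True
    return False
```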
vllm/model_executor/models/llama.py
Outdated
Maybe not something to fix in this PR: I feel that passing quant_config into every class of every model is too big an overhead. We should find some other way to seamlessly support quantization across different models.
I have created AutoAWQ, which automatically saves a quant_config.json and loads it from Hugging Face.
https://github.com/casper-hansen/AutoAWQ
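For reference, here is a hedged sketch of how such a file could be consumed on the engine side, so that quantization settings are resolved once at load time rather than supplied by the user. The JSON keys (`w_bit`, `q_group_size`, `zero_point`) are assumptions about AutoAWQ's output and may differ across versions.

```python
# Hypothetical sketch: read an AutoAWQ-style quant_config.json once at engine
# start-up. The key names below are assumptions and may vary by AutoAWQ version.
import json
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class AWQConfig:
    weight_bits: int
    group_size: int
    zero_point: bool


def load_awq_config(model_dir: str) -> Optional[AWQConfig]:
    path = os.path.join(model_dir, "quant_config.json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        cfg = json.load(f)
    return AWQConfig(
        weight_bits=cfg.get("w_bit", 4),
        group_size=cfg.get("q_group_size", 128),
        zero_point=cfg.get("zero_point", True),
    )
```

Resolving the config once like this would also help with the concern above about threading quant_config through every model class, since model code could receive a single, already-parsed object (or None).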
@julian-q We recently merged several PRs. Could you resolve the merge conflicts? Sorry for the inconvenience.
EDIT: After further evaluation, this new GEMV kernel seems to struggle with large contexts and increasing batch sizes. My conclusion is that the GEMM kernel already implemented in this PR will have the best performance. Kernel: https://github.com/mit-han-lab/llm-awq/blob/main/awq/kernels/csrc/quantization/gemv_cuda.cu
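To make the shape argument concrete: with a single decode token the product is matrix-vector shaped, so each weight element is read once and used once, while prefill or large batches make it matrix-matrix shaped and each dequantized weight element is reused across all rows. A plain-PyTorch illustration of the two regimes (not the AWQ CUDA kernels referenced above):

```python
# Plain PyTorch, shapes only; the real AWQ GEMV/GEMM CUDA kernels are separate.
import torch

K, N = 4096, 4096
weight = torch.randn(N, K)          # stands in for a dequantized AWQ weight

x_decode = torch.randn(1, K)        # one token: GEMV-shaped, no weight reuse
x_prefill = torch.randn(256, K)     # long context / large batch: GEMM-shaped

y_decode = x_decode @ weight.t()    # (1, N)
y_prefill = x_prefill @ weight.t()  # (256, N): each weight element reused 256 times
```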
@julian-q Could you please look into merging the main branch into your pull request, or do you need some help with this? I would love to have AWQ support out in vLLM soon :)
For faster integration, I took over this PR. It will be polished and merged very soon. Thanks.
This PR is identical to vllm-project#705, except that it comes from a branch within vllm-fork instead of a branch in an external forked repo. The reason is that there are some issues with merging external branches; see Jira [SW-218309](https://jira.habana-labs.com/browse/SW-218309). We consulted the folks in SW-218309, and they suggested creating an internal branch to trigger the Jenkins tests correctly, so I am creating this PR to speed up the merge process. cc @yma11 @michalkuligowski @jkaniecki --------- Signed-off-by: yan ma <yan.ma@intel.com> Signed-off-by: zhouyu5 <yu.zhou@intel.com> Co-authored-by: yan ma <yan.ma@intel.com> Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
This PR is intended to build on top of and be merged into #762, streamlining the development of AWQ weight quantization support for new models and making it easier to maintain. Posting here as a draft for review.
llm-awq code
Special thanks to @robirv938, @ri938, and @casper-hansen for making lots of progress on AWQ support.
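For readers unfamiliar with the qweight, qzeros, and scales tensors discussed above, here is a simplified sketch of group-wise 4-bit dequantization. It assumes plain sequential nibble packing along the input dimension, which is not the interleaved layout the real AWQ kernels use; it is only meant to show what the three tensors represent.

```python
# Simplified sketch of group-wise 4-bit dequantization. The real AWQ kernels use
# an interleaved int32 packing layout and fuse dequantization into the matmul on
# GPU; this version only illustrates the roles of qweight, qzeros, and scales.
import torch


def dequantize_4bit(qweight: torch.Tensor,  # (K // 8, N) int32, 8 nibbles per int32
                    qzeros: torch.Tensor,   # (K // group_size // 8, N) int32
                    scales: torch.Tensor,   # (K // group_size, N) float
                    group_size: int = 128) -> torch.Tensor:
    shifts = torch.arange(0, 32, 4, device=qweight.device)      # 8 nibble offsets
    # Unpack 8 four-bit values from every int32 along the input (K) dimension.
    w = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF   # (K // 8, 8, N)
    w = w.reshape(-1, qweight.shape[-1])                        # (K, N)
    z = (qzeros.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF
    z = z.reshape(-1, qzeros.shape[-1])                         # (K // group_size, N)
    # Every group of `group_size` input rows shares one scale and one zero point.
    z = z.repeat_interleave(group_size, dim=0)                  # (K, N)
    s = scales.repeat_interleave(group_size, dim=0)             # (K, N)
    return (w.float() - z.float()) * s                          # dequantized (K, N)
```

With a weight dequantized this way, the layer's forward pass is just `x @ dequantize_4bit(qweight, qzeros, scales)` for an input of shape (num_tokens, K); the point of the GEMM/GEMV kernels in this PR is to avoid materializing that full-precision weight.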