Refactor AWQ support #926
Conversation
Add awq improvements
Merge linear layers
More improvements awq
…d_awq_quant_support
@julian-q Great work, thank you for this. Just one thing: the commit f5e8d15 that I created was only a work in progress for MPT models, so I think we have to revert it once this is ready to be merged. Have you tested models other than LLaMA? I do want to test this with MPT models, as they are my go-to since they are Apache 2.0 licensed.
Thanks for your contribution, Julian! I left some review comments. Also, please make sure to format your code with format.sh.
vllm/engine/arg_utils.py
Outdated
Is there a way for us to directly tell whether a model is quantized from the model's checkpoint or config, without asking the user to pass in an argument?
Good question. The config doesn't include quantization info, but maybe we can check for the names of quantized parameters, e.g. qweight, qzeros, and scales.
> Good question. The config doesn't include quantization info, but maybe we can check for the names of quantized parameters, e.g. qweight, qzeros, and scales.
AutoAWQ creates a quant_config.json. If you do not want to rely on this config, you can do what @julian-q said, but beware of the potential overhead associated with loading the state dict and looking for module names.
AutoAWQ looks really cool! @casper-hansen
Yes, loading the state dict is not ideal, so this could be a good solution.
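Below is a minimal sketch of the two detection strategies discussed in this thread, assuming a local checkpoint directory of `.bin` shards. The function and constant names are hypothetical, not vLLM's actual API.

```python
# Illustrative sketch only: detect AWQ quantization from a checkpoint directory
# without a user-supplied flag. Names here are hypothetical, not vLLM API.
import os

import torch

AWQ_PARAM_SUFFIXES = ("qweight", "qzeros", "scales")


def detect_awq_quantization(model_dir: str) -> bool:
    # Preferred: AutoAWQ saves a quant_config.json next to the weights.
    if os.path.exists(os.path.join(model_dir, "quant_config.json")):
        return True
    # Fallback: peek at parameter names in the checkpoint shards. This loads
    # the state dicts, which is the overhead mentioned above.
    for fname in os.listdir(model_dir):
        if fname.endswith(".bin"):
            state_dict = torch.load(
                os.path.join(model_dir, fname), map_location="cpu")
            if any(name.endswith(AWQ_PARAM_SUFFIXES) for name in state_dict):
                return True
    return False
```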
vllm/model_executor/models/llama.py
Outdated
Maybe not something to fix in this PR: I feel that passing quant_config into every class of every model is too big an overhead. We should find some other way to seamlessly support quantization across different models.
I have created AutoAWQ, which automatically saves a quant_config.json and loads it from Hugging Face.
https://github.com/casper-hansen/AutoAWQ
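For reference, here is a hedged sketch of how such a file could be consumed on the engine side, so that quantization settings are resolved once at load time rather than supplied by the user. The JSON keys (`w_bit`, `q_group_size`, `zero_point`) are assumptions about AutoAWQ's output and may differ across versions.

```python
# Hypothetical sketch: read an AutoAWQ-style quant_config.json once at engine
# start-up. The key names below are assumptions and may vary by AutoAWQ version.
import json
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class AWQConfig:
    weight_bits: int
    group_size: int
    zero_point: bool


def load_awq_config(model_dir: str) -> Optional[AWQConfig]:
    path = os.path.join(model_dir, "quant_config.json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        cfg = json.load(f)
    return AWQConfig(
        weight_bits=cfg.get("w_bit", 4),
        group_size=cfg.get("q_group_size", 128),
        zero_point=cfg.get("zero_point", True),
    )
```

Resolving the config once like this would also help with the concern above about threading quant_config through every model class, since model code could receive a single, already-parsed object (or None).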
@julian-q We recently merged several PRs. Could you resolve the merge conflicts? Sorry for the inconvenience.
EDIT: After further evaluation, this new GEMV kernel seems to struggle with large contexts and increasing batch sizes. My conclusion is that the GEMM kernel already implemented in this PR will have the best performance. Kernel: https://github.com/mit-han-lab/llm-awq/blob/main/awq/kernels/csrc/quantization/gemv_cuda.cu
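To make the shape argument concrete: with a single decode token the product is matrix-vector shaped, so each weight element is read once and used once, while prefill or large batches make it matrix-matrix shaped and each dequantized weight element is reused across all rows. A plain-PyTorch illustration of the two regimes (not the AWQ CUDA kernels referenced above):

```python
# Plain PyTorch, shapes only; the real AWQ GEMV/GEMM CUDA kernels are separate.
import torch

K, N = 4096, 4096
weight = torch.randn(N, K)          # stands in for a dequantized AWQ weight

x_decode = torch.randn(1, K)        # one token: GEMV-shaped, no weight reuse
x_prefill = torch.randn(256, K)     # long context / large batch: GEMM-shaped

y_decode = x_decode @ weight.t()    # (1, N)
y_prefill = x_prefill @ weight.t()  # (256, N): each weight element reused 256 times
```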
@julian-q Could you please look into merging the main branch into your pull request, or do you need some help with this? I would love to have AWQ support out in vLLM soon :)
For faster integration, I took over this PR. It will be polished and merged very soon. Thanks.
This PR is identical to vllm-project#705, except that it comes from a branch within vllm-fork instead of a branch in an external forked repo. The reason is that there are some issues with merging external branches; see Jira [SW-218309](https://jira.habana-labs.com/browse/SW-218309). We consulted the folks in SW-218309, and they suggested creating an internal branch to trigger the Jenkins tests correctly, so I am creating this PR to speed up the merge process. cc @yma11 @michalkuligowski @jkaniecki --------- Signed-off-by: yan ma <yan.ma@intel.com> Signed-off-by: zhouyu5 <yu.zhou@intel.com> Co-authored-by: yan ma <yan.ma@intel.com> Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
This PR is intended to build on top of and be merged into #762, streamlining the development of AWQ weight quantization support for new models and making it easier to maintain. Posting here as a draft for review.
llm-awq code
Special thanks to @robirv938, @ri938, and @casper-hansen for making lots of progress on AWQ support.
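For readers unfamiliar with the qweight, qzeros, and scales tensors discussed above, here is a simplified sketch of group-wise 4-bit dequantization. It assumes plain sequential nibble packing along the input dimension, which is not the interleaved layout the real AWQ kernels use; it is only meant to show what the three tensors represent.

```python
# Simplified sketch of group-wise 4-bit dequantization. The real AWQ kernels use
# an interleaved int32 packing layout and fuse dequantization into the matmul on
# GPU; this version only illustrates the roles of qweight, qzeros, and scales.
import torch


def dequantize_4bit(qweight: torch.Tensor,  # (K // 8, N) int32, 8 nibbles per int32
                    qzeros: torch.Tensor,   # (K // group_size // 8, N) int32
                    scales: torch.Tensor,   # (K // group_size, N) float
                    group_size: int = 128) -> torch.Tensor:
    shifts = torch.arange(0, 32, 4, device=qweight.device)      # 8 nibble offsets
    # Unpack 8 four-bit values from every int32 along the input (K) dimension.
    w = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF   # (K // 8, 8, N)
    w = w.reshape(-1, qweight.shape[-1])                        # (K, N)
    z = (qzeros.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF
    z = z.reshape(-1, qzeros.shape[-1])                         # (K // group_size, N)
    # Every group of `group_size` input rows shares one scale and one zero point.
    z = z.repeat_interleave(group_size, dim=0)                  # (K, N)
    s = scales.repeat_interleave(group_size, dim=0)             # (K, N)
    return (w.float() - z.float()) * s                          # dequantized (K, N)
```

With a weight dequantized this way, the layer's forward pass is just `x @ dequantize_4bit(qweight, qzeros, scales)` for an input of shape (num_tokens, K); the point of the GEMM/GEMV kernels in this PR is to avoid materializing that full-precision weight.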