Allow Int4WeightOnlyQuantizer to set different dtype for scales_and_zeros #479
As titled. Currently, `Int4WeightOnlyQuantizer` is hardcoded to return `scales_and_zeros` with dtype `torch.bfloat16`. This PR adds a `dtype` argument to the flow so that a different dtype can be used.
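To illustrate why the dtype of `scales_and_zeros` matters, here is a minimal pure-Python sketch, not the torchao implementation. The function `groupwise_scales_and_zeros`, its `dtype` parameter, and the asymmetric 4-bit convention it uses are assumptions made for illustration; the two rounding helpers emulate `float16` and `bfloat16` storage with the standard library only.

```python
import struct


def round_to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bfloat16(x: float) -> float:
    """Round to bfloat16: keep the top 16 bits of the float32 encoding,
    using round-to-nearest-even on the discarded lower 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Hypothetical dtype names mapped to the emulated casts above.
CASTS = {"float16": round_to_float16, "bfloat16": round_to_bfloat16}


def groupwise_scales_and_zeros(weights, groupsize, dtype="bfloat16"):
    """Compute one (scale, zero) pair per group of `groupsize` weights.

    Assumed convention: 4-bit asymmetric quantization with 16 levels,
    scale = (max - min) / 15 and zero = min + scale * 8.
    The default dtype mirrors the previously hardcoded bfloat16 behaviour.
    """
    cast = CASTS[dtype]
    result = []
    for start in range(0, len(weights), groupsize):
        group = weights[start:start + groupsize]
        lo, hi = min(group), max(group)
        scale = max(hi - lo, 1e-6) / 15
        zero = lo + scale * 8
        result.append((cast(scale), cast(zero)))
    return result


weights = [-1.0, 0.5, 0.25, 2.0]
print(groupwise_scales_and_zeros(weights, groupsize=4, dtype="float16"))
print(groupwise_scales_and_zeros(weights, groupsize=4, dtype="bfloat16"))
```

The two calls return slightly different values for the same weights, which is why a caller running in `float16` may want the quantization parameters stored in the matching dtype instead of `bfloat16`.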
Dr. CI: ✅ No failures as of commit f3c320a with merge base a35a1cd. Artifacts and rendered test results: hud.pytorch.org/pr/pytorch/ao/479
    super().__init__()
    self.padding = not _check_linear_int4_k(in_features, groupsize, inner_k_tiles)
    if self.padding:
        from model import find_multiple
I don't think there's a module called `model`.
Thanks, I think this is a relic of when gptq was more deeply coupled with gpt-fast.
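One way to remove the dependency on a `model` module would be to define the helper locally. The sketch below assumes `find_multiple(n, k)` is meant to round `n` up to the next multiple of `k`, which is what the padding branch suggests; the alignment value of 1024 in the usage lines is a hypothetical choice for illustration, and this is not necessarily how the PR resolved the import.

```python
def find_multiple(n: int, k: int) -> int:
    """Return the smallest multiple of k that is greater than or equal to n."""
    if n % k == 0:
        return n
    return n + k - (n % k)


# Hypothetical usage: pad in_features up to an assumed alignment of 1024.
in_features = 4097
padded_in_features = find_multiple(in_features, 1024)
print(padded_in_features)
```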
This seems fine to merge, although I do worry that most of our gptq tests are disabled right now in
Mostly looks fine, but FYI we don't really have anyone maintaining the gptq example, so if there's a use case for it, please let me know.
I'm migrating torchchat to use these APIs, to be prepared for shared kernels across ET and PyTorch eager/compile.
readme update
* Update quantize.py to use torchao Quantizers
Summary:
Remove duplicate code for Int4WeightOnlyQuantizer and
Int8DynActInt4WeightQuantizer and use torchao API.
Test Plan:
```
python torchchat.py generate llama2 --quantize '{"linear:int4": {"groupsize": 256}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"cpu"}}' --prompt "Once upon a time," --max-new-tokens 256
python torchchat.py generate llama2 --quantize '{"linear:a8w4dq": {"groupsize": 256}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"cpu"}}' --prompt "Once upon a time," --max-new-tokens 256
```
* Fix import
* Install torchao from gh
* Explain import
* Fix dependencies
* Test ao PR pytorch#479
* Update torchao hash
* Update torchao pin
* Fix scheduler bf16/fp16 mix error
* Incorporate torchao changes
* update hash
* Fix GPU CI job
* More fix
* Fix executorch CI job
* Use quant api for int4 weight only quantization
* Fix
* Fix again
* Fix 3
* Fix 4
* Try something
* debug
* Only migrate 8a4w
---------
Co-authored-by: Jack Zhang <jackzhxng@fb.com>
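
For reference, the `--quantize` argument used in the test plan of the commit above is a JSON object with one section per concern. The short sketch below only parses that string with the standard library to show its shape; it does not call torchchat, and the variable names are illustrative.

```python
import json

# The int4 configuration string from the test plan, copied verbatim.
quantize_arg = (
    '{"linear:int4": {"groupsize": 256}, '
    '"precision": {"dtype":"float16"}, '
    '"executor":{"accelerator":"cpu"}}'
)

config = json.loads(quantize_arg)

# Each top-level key names a section; each value holds that section's options.
for section, options in config.items():
    print(section, options)
```

Here the `precision` section requests `float16`, which is the case where a hardcoded `bfloat16` for `scales_and_zeros` would mix dtypes.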