Closed
Labels: triaged (Issue has been triaged by maintainers)
Description
Hi, I'm playing around with quantizing a 70B model. Even with 4x A100s (80 GB each) it is OOM'ing; is this normal? It seems to be splitting the model weights across the GPUs correctly, but I'm not sure whether the inference/calibration memory is also being split or whether it all lands on the first GPU. Is there a general guide mapping model size to the VRAM needed to quantize it to FP8, or some way to make the quantization code use all of the available GPU memory?
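To see whether memory is actually landing on all four GPUs, I can drop a quick check like this into the quantization script (just my own debugging sketch with plain PyTorch, not part of the TRT-LLM example; torch.cuda only reports the current process, so it has to run inside the script rather than standalone):

```python
import torch

def report_gpu_memory(tag=""):
    """Print per-GPU memory as seen by this PyTorch process."""
    for i in range(torch.cuda.device_count()):
        total = torch.cuda.get_device_properties(i).total_memory
        allocated = torch.cuda.memory_allocated(i)
        reserved = torch.cuda.memory_reserved(i)
        print(f"[{tag}] GPU {i}: "
              f"allocated {allocated / 1e9:.1f} GB / "
              f"reserved {reserved / 1e9:.1f} GB / "
              f"total {total / 1e9:.1f} GB")

# e.g. call report_gpu_memory("after model load") and
# report_gpu_memory("during calibration") inside the quantize script
```

Watching nvidia-smi from another shell shows roughly the same picture from outside the process.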
To reproduce:
- Build TRT-LLM into a Docker container
- Install the AMMO requirements from the documentation
- Run the default 70B quantization command from examples/llama
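For context, my rough back-of-envelope math (assuming the calibration pass holds the weights in fp16 at 2 bytes per parameter, which may not be exactly what the script does) suggests the weights alone should fit if they were split evenly:

```python
# Rough estimate only; assumes fp16 weights (2 bytes per parameter).
# Calibration activations and the quantized copy of the weights come on top.
params = 70e9          # 70B parameters
bytes_per_param = 2    # fp16
weights_gb = params * bytes_per_param / 1e9
print(f"fp16 weights alone: ~{weights_gb:.0f} GB")                  # ~140 GB
print(f"per GPU over 4x A100 80GB: ~{weights_gb / 4:.0f} GB each")  # ~35 GB
```

If that held, ~35 GB of weights per card leaves plenty of headroom on 80 GB GPUs, which is why I suspect either the calibration activations or everything ending up on one GPU.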