Llama 2 FP8 quantization OOM · Issue #288 · NVIDIA/TensorRT-LLM · GitHub

Llama 2 FP8 quantization OOM #288

@0xymoro

Description


Hi, I'm playing around with quantizing a 70B model to FP8. Even with 4x A100s (80 GB each) it OOMs. Is this normal? It appears to split the model weights across the GPUs correctly, but I'm not sure whether it also spreads the calibration/inference memory or just uses the first GPU. Is there a general guide mapping model size to the VRAM needed to quantize it to FP8, or some way for the quantization code to use all available GPU memory?
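
For rough intuition on the model-size-to-VRAM question, here is a back-of-the-envelope estimate. The numbers are illustrative assumptions, not figures from the TensorRT-LLM documentation:

```python
# Rough VRAM estimate for FP8 PTQ calibration of Llama 2 70B.
# All numbers are approximations for illustration, not measured values.

params = 70e9          # Llama 2 70B parameter count
bytes_fp16 = 2         # calibration runs the model in fp16/bf16

weights_gib = params * bytes_fp16 / 1024**3   # ~130 GiB of weights alone

gpus = 4
per_gpu_gib = 80
total_gib = gpus * per_gpu_gib                # 320 GiB aggregate

# Even if the weights shard evenly (~33 GiB per GPU), calibration also needs
# headroom for activations, the calibration batch, and the quantizer's
# statistics buffers, so per-GPU headroom matters more than the aggregate.
print(f"fp16 weights: ~{weights_gib:.0f} GiB, aggregate VRAM: {total_gib} GiB")
```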

To reproduce:

  1. Build TRT-LLM into a docker container
  2. Install the ammo requirements from the documentation
  3. Run the default 70B quantize command from examples/llama (see the sketch after this list)
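
One possible workaround for single-GPU calibration memory pressure is to shard the fp16 checkpoint across all visible GPUs when it is loaded for calibration. This is a minimal sketch assuming a Hugging Face checkpoint and the accelerate `device_map="auto"` loader; whether the examples/llama quantize script accepts a model loaded this way is an assumption, and the model path is hypothetical:

```python
# Minimal sketch: shard the fp16 checkpoint across all visible GPUs for
# calibration instead of letting it land entirely on GPU 0.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "meta-llama/Llama-2-70b-hf"  # hypothetical local or HF path

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,
    device_map="auto",        # let accelerate place layers across all 4 A100s
    low_cpu_mem_usage=True,
)

# A short forward pass stands in for calibration; activations stay on
# whichever GPU owns each layer.
inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)
```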

Labels: triaged (Issue has been triaged by maintainers)