Feature request
There is a new and interesting paper from Google Research that promises 2-3x speedups of LLM inference by running two models in parallel. The core idea is to use a faster, lower-quality model that approximates the target model to draft multiple tokens, and then to verify those drafted tokens with the target model.
E.g., sample from LLaMA 7B quickly, then use LLaMA 70B to check the samples.
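For concreteness, here is a minimal, self-contained sketch of the idea in Python. It uses a simplified greedy-verification variant rather than the paper's full rejection-sampling scheme: the draft model proposes a few tokens, the target model scores them all in a single forward pass, and the longest agreeing prefix plus one target-chosen token is kept. The model names, the `speculative_generate` helper, and the value of `K` are illustrative placeholders, not a proposal for TGI's internals.

```python
# Minimal sketch of speculative decoding (greedy-verification variant).
# Placeholder models: in practice this would be e.g. LLaMA 7B drafting for
# LLaMA 70B; draft and target must share a tokenizer/vocabulary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT_ID = "gpt2"         # placeholder for the small approximation model
TARGET_ID = "gpt2-large"  # placeholder for the large target model
K = 4                     # tokens drafted per iteration

tokenizer = AutoTokenizer.from_pretrained(TARGET_ID)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_ID).eval()
target = AutoModelForCausalLM.from_pretrained(TARGET_ID).eval()


@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 64) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    start = ids.shape[1]
    while ids.shape[1] - start < max_new_tokens:
        # 1) Draft up to K tokens cheaply with the small model (greedy here).
        drafted = draft.generate(
            ids, max_new_tokens=K, do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
        proposed = drafted[:, ids.shape[1]:]          # (1, k), k <= K
        k = proposed.shape[1]
        # 2) Score prompt + drafted tokens with the target model in ONE pass.
        logits = target(drafted).logits               # (1, len + k, vocab)
        # Greedy target predictions for the k drafted positions plus one extra.
        preds = logits[:, ids.shape[1] - 1:, :].argmax(dim=-1)  # (1, k + 1)
        # 3) Accept the longest prefix where the target agrees with the draft.
        agree = (proposed == preds[:, :k]).long().squeeze(0)
        n_accept = int(agree.cumprod(dim=0).sum().item())
        # 4) Append the accepted tokens plus one token chosen by the target
        #    itself, so each iteration produces at least one new token.
        ids = torch.cat(
            [ids, proposed[:, :n_accept], preds[:, n_accept:n_accept + 1]],
            dim=-1,
        )
    return tokenizer.decode(ids[0, start:], skip_special_tokens=True)


print(speculative_generate("Speculative decoding works by"))
```

The speedup comes from the target model running one forward pass per batch of drafted tokens instead of one pass per generated token, while the accepted output still matches what the target model would have produced.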
Motivation
Adding this kind of support would make LLM sampling much faster.
I have considered an alternative implementation: running two copies of TGI plus a new web server implementing speculative decoding in the same Kubernetes pod/VM/server. However, the overhead of HTTP calls between all these containers would likely erase a significant portion of the inference-speed gains.
Core challenges:
- Adding this feature would require making TGI more generic so that it can run multiple models at once. We would need to make sure this does not degrade performance or reliability for the single-model use case.
- Running several models in one container would make GPU selection trickier, but, again, this should not be insurmountable.
- We would need to add some new options to the public API, which will require careful thought (see the hedged sketch after this list).
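Purely as an illustration of the kind of option I mean, and not a concrete API proposal, a new parameter on the existing `/generate` endpoint could look roughly like this; the `speculate` field is hypothetical and does not exist in TGI today:

```python
# Hypothetical client-side sketch: "speculate" is a made-up parameter standing
# in for whatever option the maintainers choose (e.g. a draft model id or a
# number of draft tokens per step). It is NOT part of TGI's current API.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is speculative decoding?",
        "parameters": {
            "max_new_tokens": 128,
            "speculate": 4,  # hypothetical option
        },
    },
)
print(resp.json())
```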
Your contribution
Presuming the maintainers are happy to add this feature, I would start work and implement it. This would probably take the form of several PRs, as the change would be significant.