Feature request
There is a new and interesting paper from Google Research that promises 2-3x speedups of LLM inference by running two models in parallel. The core idea is to use a faster, lower-quality model that approximates the target model to draft multiple tokens, and then to verify those drafted tokens with the target model.
E.g., sample from LLaMA 7B quickly, then use LLaMA 70B to check the samples.
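For concreteness, here is a minimal, self-contained sketch of the idea in Python. It uses a simplified greedy-verification variant rather than the paper's full rejection-sampling scheme: the draft model proposes a few tokens, the target model scores them all in a single forward pass, and the longest agreeing prefix plus one target-chosen token is kept. The model names, the `speculative_generate` helper, and the value of `K` are illustrative placeholders, not a proposal for TGI's internals.

```python
# Minimal sketch of speculative decoding (greedy-verification variant).
# Placeholder models: in practice this would be e.g. LLaMA 7B drafting for
# LLaMA 70B; draft and target must share a tokenizer/vocabulary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT_ID = "gpt2"         # placeholder for the small approximation model
TARGET_ID = "gpt2-large"  # placeholder for the large target model
K = 4                     # tokens drafted per iteration

tokenizer = AutoTokenizer.from_pretrained(TARGET_ID)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_ID).eval()
target = AutoModelForCausalLM.from_pretrained(TARGET_ID).eval()


@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 64) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    start = ids.shape[1]
    while ids.shape[1] - start < max_new_tokens:
        # 1) Draft up to K tokens cheaply with the small model (greedy here).
        drafted = draft.generate(
            ids, max_new_tokens=K, do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
        proposed = drafted[:, ids.shape[1]:]          # (1, k), k <= K
        k = proposed.shape[1]
        # 2) Score prompt + drafted tokens with the target model in ONE pass.
        logits = target(drafted).logits               # (1, len + k, vocab)
        # Greedy target predictions for the k drafted positions plus one extra.
        preds = logits[:, ids.shape[1] - 1:, :].argmax(dim=-1)  # (1, k + 1)
        # 3) Accept the longest prefix where the target agrees with the draft.
        agree = (proposed == preds[:, :k]).long().squeeze(0)
        n_accept = int(agree.cumprod(dim=0).sum().item())
        # 4) Append the accepted tokens plus one token chosen by the target
        #    itself, so each iteration produces at least one new token.
        ids = torch.cat(
            [ids, proposed[:, :n_accept], preds[:, n_accept:n_accept + 1]],
            dim=-1,
        )
    return tokenizer.decode(ids[0, start:], skip_special_tokens=True)


print(speculative_generate("Speculative decoding works by"))
```

The speedup comes from the target model running one forward pass per batch of drafted tokens instead of one pass per generated token, while the accepted output still matches what the target model would have produced.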
Motivation
Adding this kind of support would make LLM sampling much faster.
I have considered an alternative implementation: running two copies of TGI plus a new web server implementing speculative decoding in the same Kubernetes pod/VM/server. However, the overhead of HTTP calls between all these containers would likely erase a significant portion of the inference-speed gains.
Core challenges:
- Adding this feature would require making TGI more generic so that it can run multiple models at once. We would need to make sure this does not degrade performance or reliability for the single-model use case.
- Running several models in one container would make GPU selection trickier, but, again, this should not be insurmountable.
- We would need to add some new options to the public API, which will require careful thought (see the hedged sketch after this list).
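Purely as an illustration of the kind of option I mean, and not a concrete API proposal, a new parameter on the existing `/generate` endpoint could look roughly like this; the `speculate` field is hypothetical and does not exist in TGI today:

```python
# Hypothetical client-side sketch: "speculate" is a made-up parameter standing
# in for whatever option the maintainers choose (e.g. a draft model id or a
# number of draft tokens per step). It is NOT part of TGI's current API.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is speculative decoding?",
        "parameters": {
            "max_new_tokens": 128,
            "speculate": 4,  # hypothetical option
        },
    },
)
print(resp.json())
```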
Your contribution
Presuming the maintainers are happy to add this feature, I would start work and implement it. This would probably take the form of several PRs, as the change would be significant.