Add support for Speculative Decoding #729

@OliverFM

Feature request

There is a new and interesting paper from Google Research that promises 2-3x speedups of LLM inference by running two models in parallel. The core idea is to use a faster, lower-quality model that approximates the target model to sample several tokens ahead, and then to check those samples with the target model; a rough sketch is included below.
E.g. sample from LLaMA 7B quickly, then use LLaMA 70B to check the samples.
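
For reference, here is a minimal sketch of one speculative decoding step, assuming a greedy acceptance rule for brevity (the paper uses a rejection-sampling rule that preserves the target distribution exactly). The OPT checkpoints and the draft length `k` are illustrative stand-ins for a LLaMA 7B / 70B pair, not a proposed TGI interface:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small/large pair sharing a tokenizer; swap in LLaMA 7B / 70B in practice.
DRAFT, TARGET = "facebook/opt-125m", "facebook/opt-1.3b"

tok = AutoTokenizer.from_pretrained(TARGET)
draft = AutoModelForCausalLM.from_pretrained(DRAFT).eval()
target = AutoModelForCausalLM.from_pretrained(TARGET).eval()

@torch.no_grad()
def speculative_step(input_ids, k=4):
    """Propose k tokens with the draft model, then verify them all with a
    single forward pass of the target model."""
    # 1. Draft model autoregressively proposes k candidate tokens (k cheap passes).
    proposal = draft.generate(
        input_ids, max_new_tokens=k, do_sample=False,
        pad_token_id=tok.eos_token_id,
    )[:, input_ids.shape[1]:]

    # 2. One target forward pass scores every candidate position in parallel.
    full = torch.cat([input_ids, proposal], dim=1)
    logits = target(full).logits
    # Target's greedy choice at each candidate position.
    verify = logits[:, input_ids.shape[1] - 1 : -1, :].argmax(dim=-1)

    # 3. Accept the longest prefix on which draft and target agree.
    matches = (verify == proposal)[0].long()
    n_accept = int(matches.cumprod(dim=0).sum())
    accepted = proposal[:, :n_accept]

    if n_accept == k:
        # Every draft token accepted: the target's last logits give one more token for free.
        next_tok = logits[:, -1:, :].argmax(dim=-1)
    else:
        # First mismatch: replace the rejected draft token with the target's choice.
        next_tok = verify[:, n_accept : n_accept + 1]
    return torch.cat([input_ids, accepted, next_tok], dim=1)

ids = tok("Speculative decoding works by", return_tensors="pt").input_ids
for _ in range(8):  # each step emits between 1 and k + 1 tokens
    ids = speculative_step(ids)
print(tok.decode(ids[0], skip_special_tokens=True))
```

Each step costs k cheap draft passes plus one target pass but can emit up to k + 1 tokens, which is where the claimed 2-3x speedup comes from when the draft model agrees with the target most of the time.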

Motivation

Adding this kind of support would make LLM sampling much faster.

I have considered an alternative implementation in which I run two copies of TGI plus a new web server implementing speculative decoding in the same Kubernetes pod/VM/server. However, the added overhead of running HTTP between all of these containers would likely erase a significant portion of the gains in inference speed.

Core challenges:

  1. Adding this feature would require making TGI more generic, so that one can run multiple models at once. We would need to make sure that this does not degrade performance or reliability for the single-model use case.
  2. Running multiple models in one container would make GPU selection trickier, but again, this should not be insurmountable.
  3. We would need to add some new options to the public API; this will require careful thought (a hypothetical request shape is sketched after this list).
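
As a purely hypothetical illustration of point 3 (the field names below are invented for this sketch and are not part of the current TGI API), the existing `/generate` endpoint could accept an opt-in draft model and a speculative token count:

```python
import requests

# "draft_model" and "speculative_tokens" are assumed, illustrative parameter names.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is speculative decoding?",
        "parameters": {
            "max_new_tokens": 128,
            "draft_model": "path/to/draft-model",  # hypothetical option
            "speculative_tokens": 4,               # hypothetical option
        },
    },
)
print(resp.json()["generated_text"])
```

Keeping such options strictly opt-in would leave the single-model path untouched, which also addresses point 1.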

Your contribution

Presuming the maintainers are happy to add this feature, I would start work on implementing it. This would probably take the form of several PRs, as the change would be significant.
