Llama/GPTNeoX: add RoPE scaling #24653

Conversation
(Of course, tests are missing. Proper validation of whether the feature is working as expected is also missing. I'll add them if we decide to move forward with this feature!)
relevant if `config.is_decoder=True`.
tie_word_embeddings(`bool`, *optional*, defaults to `False`):
    Whether to tie weight embeddings
rope_scaling (`Dict`, *optional*):
(changes in open_llama are copy/paste)
The documentation is not available anymore as the PR was closed or merged.
Having this in transformers would be excellent! I've uploaded a bunch of fp16 and GPTQ repos to HF using @jquesnelle's trust_remote_code Llama modelling patch that implements RoPE scaling using @kaiokendev's method, and I know there are quite a number of people using those already; I've had a few requests to put out more. Even more people are using RoPE scaling outside of transformers via the ExLlama GPTQ implementation. So there's a great deal of appetite for this feature amongst users, understandably.
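For readers who haven't seen it, here is a minimal sketch of the linear interpolation idea behind kaiokendev's method, assuming a standard rotary-embedding setup (the names `dim`, `base` and `scaling_factor` are illustrative, not this PR's exact API): positions are divided by the scaling factor before the sin/cos tables are built, so a longer sequence is squeezed into the position range the model saw during pre-training.

```python
import torch

def build_rope_cache(seq_len, dim, base=10000.0, scaling_factor=1.0, device="cpu"):
    # Standard RoPE inverse frequencies
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device).float() / dim))
    # Linear scaling: compress positions so seq_len positions span the
    # pre-trained range (scaling_factor=1.0 recovers vanilla RoPE)
    t = torch.arange(seq_len, device=device).float() / scaling_factor
    freqs = torch.outer(t, inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()
```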
Could this also be applied to GPT-J models?

@versae yes, it can :) The code needs to be modified there as well, but the concept can be applied to any model with rotary position embeddings.
def _set_cos_sin_cache(self, seq_len, device, dtype):
    self.max_seq_len_cached = seq_len

if self.scaling_name == "dynamic":
For dynamic, we want to do the scaling only when seq_len > max_position_embeddings (i.e. when we're going past the model's pre-trained length). My original code did this by having the scaling only in the forward() code that re-calculated the frequency cache when seq_len > self.max_seq_len_cached, but not in __init__. Since this code has now been deduplicated (makes sense!), I think this needs to be

`if self.scaling_name == "dynamic" and seq_len > self.max_position_embeddings:`
@jquesnelle that's a great catch, and completely missed in my attempts to unify and refactor the cos/sin cache!
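For reference, here is a minimal sketch of what the recompute-on-overflow path being discussed could look like inside the rotary embedding module, with assumed attribute names (`self.base`, `self.dim`, `self.scaling_factor`, `self.max_position_embeddings`, `self.scaling_name`); the base rescaling follows the dynamic NTK recipe and only kicks in past the pre-trained length.

```python
def _set_cos_sin_cache(self, seq_len, device, dtype):
    self.max_seq_len_cached = seq_len
    base = self.base
    if self.scaling_name == "dynamic" and seq_len > self.max_position_embeddings:
        # Stretch the rotary base so the frequencies cover the longer sequence
        base = self.base * (
            (self.scaling_factor * seq_len / self.max_position_embeddings)
            - (self.scaling_factor - 1)
        ) ** (self.dim / (self.dim - 2))

    inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2, device=device).float() / self.dim))
    t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
    freqs = torch.outer(t, inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)
    self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
    self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
```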
Thank you for your work! Just letting you know that I've improved the NTK-aware method in this PR: jquesnelle/yarn#1. It decreases non-finetuned PPL even further (preliminary testing shows 4.9 -> 4.6 PPL at 8192 context size) and should theoretically significantly improve a finetune's convergence/stability compared to the previous NTK-aware method. Also, because the alpha hyperparameter was difficult to use when predicting the effective context size (alpha=4 gave something close to ~6400 context size instead of 8192), that problem was fixed and it is now changed to a "scale" factor, which can be used the same way as the "scale" in linear RoPE scaling (e.g. for LLaMA, scale=2 is 4096 and scale=4 is 8192). I hope this improved method might also be considered one day, as it is one more step towards extending context size for all LLMs! 🚀
As long as this can be used with all existing checkpoints, I have no problem adding this as a new feature. I'm not too sure about the API you picked as a dictionary, though. Since there are only two arguments, I would just have used two arguments, as a dict can be complicated. Or, if we expect more arguments to arrive, have a dataclass to hold them and accept both dict/dataclass at init (though this can also be done later, since we are warning that this is an experimental feature subject to changes).
And as was mentioned in the comments, this should also be added to other models with Rotary Embeddings :-)
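A hypothetical sketch of the dataclass alternative mentioned above (the names and fields are illustrative, not part of this PR): accept either a plain dict (as stored in `config.json`) or the dataclass at init, and normalize to the dataclass internally.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class RopeScalingConfig:
    type: str = "linear"   # e.g. "linear" or "dynamic"
    factor: float = 1.0    # > 1.0 extends the usable context window

def resolve_rope_scaling(
    value: Optional[Union[dict, RopeScalingConfig]]
) -> Optional[RopeScalingConfig]:
    # Accept both a raw dict (e.g. deserialized from config.json) and the dataclass
    if value is None or isinstance(value, RopeScalingConfig):
        return value
    return RopeScalingConfig(**value)
```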
tie_word_embeddings(`bool`, *optional*, defaults to `False`):
    Whether to tie weight embeddings
rope_scaling (`Dict`, *optional*):
    Experimental feature -- dictionary containing the scaling configuration for the RoPE embeddings. Currently
I would put "Experimental feature" as a warning at the end of the docstring.
This is an experimental feature, subject to breaking API changes in future versions.
Hey @bloc97 @jquesnelle 👋 Looking at your recent PR (this one) -- am I right in saying that

I'm trying to determine how to integrate and document the goodies, while keeping the diff size manageable 🤗
The technique also seems to work out-of-the-box with GPTNeoX models 🔥 With the latest commit, running the script below

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b-deduped")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-1.4b-deduped",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    rope_scaling={"type": "dynamic", "factor": 2.0},
)

prompt = ...  # see PR header for the prompt, >5k tokens
question = "Question: What is the paper about?"
inputs = tokenizer(prompt + question, return_tensors="pt").to("cuda")
print(inputs.input_ids.shape)

gen_out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(gen_out)[0])
```

gets us [generation output omitted]. Without the `rope_scaling` argument, [output omitted]. This is particularly amazing, since we're talking about a 1.4B model 👀

(cc @bloc97 @jquesnelle, you may be interested in this finding)
@amyeroberts @sgugger I'd like to request a review from you two, since this is an unusual post hoc modeling change PR. A few key points:
LGTM! As said before, still not a fan of the API but since you are planning to address this in a follow-up PR, that works for me.
(For brevity, I'll refer to the new NTK-By-Parts method as NTKv2.)

NTKv2 is an improved version of NTK. We found that NTK did not perform well when fine-tuned; the reason for this was that the resulting embedding matrix still contained some extrapolated, out-of-population values that the model had not seen during training. Dynamic NTK hid this by continually scaling the hyperparameter as the context grows. NTKv2 is instead parameterized by a `scale` factor, like linear scaling.

In the repository there is also now a Dynamic NTKv2, which is the same idea as the previous dynamic method, i.e. scale the hyperparameter relative to the ratio between the current context length and the model's original trained context length, while using the original embedding values when under the native pre-trained length. This also beats Dynamic NTK in the no-fine-tuning scenario.

In the above graph, LLongMA are the fine-tuned OpenLLaMA models we've released, trained on 1B extra tokens (v2 is still in the process of training).
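A minimal sketch of the dynamic idea described above, with a hypothetical helper name (and deliberately simplified; the real formulas also fold in a user-set factor): the effective scale stays at 1.0 while under the pre-trained length, so the original embedding values are used, and grows with the ratio of current to pre-trained context length beyond that.

```python
def effective_scale(seq_len: int, max_position_embeddings: int) -> float:
    # Under the native pre-trained length: leave RoPE untouched.
    # Past it: grow the scale with the current/pre-trained length ratio.
    return max(1.0, seq_len / max_position_embeddings)
```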
Unfortunately no. I understand these different methods can get unwieldy quickly, but NTKv2 appears to be strictly better than the original NTK -- I would potentially just advocate replacing the original NTK with this, but that could also be done in a follow-up PR; the results this gives you are already Very Good (TM). FWIW the LLongMA models use the exact modeling code here to maintain compatibility without needing `trust_remote_code`.
Really nice! 🤩 Thanks for adding ❤️
+1 on @sgugger's comment on using a dataclass, also happy for it to be done in a follow-up.
@bloc97 @jquesnelle thank you for your input -- and excited to hear about the performance of NTK-By-Parts! Based on your comments, I will:
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
I have a question similar to this. The graph showing dynamic scaling in this reddit post shows that the perplexity of the model with dynamic scaling is the same as that of the model without scaling up to a 2048-token length (of course, this must be because the base value did not change before 2048 tokens). This got me thinking: if I first generate with a long context (say 4096 tokens), the base value would change accordingly (to around 35000). Then, if I next generate with a short context like 1024 tokens, the
I have the same concern. With dynamic scaling, the sin and cos perhaps should not be cached.
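To make the concern concrete, here is a minimal sketch (assumed attribute names) of why a grow-only cache is awkward for dynamic scaling: the cache is only rebuilt when the sequence gets longer, so after one long generation the rescaled sin/cos values are silently reused for later, shorter prompts.

```python
def forward(self, x, seq_len):
    # Grow-only cache: rebuilt only when seq_len exceeds what is cached.
    # With dynamic scaling, a previous long prompt leaves a rescaled base in the
    # cache, which then also applies to later short prompts.
    if seq_len > self.max_seq_len_cached:
        self._set_cos_sin_cache(seq_len, device=x.device, dtype=x.dtype)
    # One possible fix (not what this PR does): also rebuild when dropping back
    # under the pre-trained length, so short prompts see the original embeddings.
    # elif seq_len <= self.max_position_embeddings < self.max_seq_len_cached:
    #     self._set_cos_sin_cache(self.max_position_embeddings, device=x.device, dtype=x.dtype)
    return (
        self.cos_cached[:seq_len].to(dtype=x.dtype),
        self.sin_cached[:seq_len].to(dtype=x.dtype),
    )
```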
Hi, here is my test code. I modified it from https://huggingface.co/docs/transformers/perplexity .
@guozhiyao Nothing immediately comes to mind; it could even be a model "feature" (looking at the plot for the original model, which also has the periodicity). Would you be able to a) run the same script for LLaMA and b) repeat your experiment using the script @jquesnelle used (this one)? a) should rule out model-specific issues and b) should rule out code-specific issues.
@gante Thanks a lot. It was solved by using the code.

@airaria I had the same problem, not only

There is a precision difference between the
# What does this PR do?

- Adds RoPE NTK scaling. Done because #529 was closed. Took some code from huggingface/transformers#24653.
- `--rope-scaling` and `--rope-factor` are added separately. I considered having a single one and parsing something like ("linear:4.0", or "dynamic") but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server).

Fixes #512
Init
fix: cleanup
Add load testing
Refactored gRPC interface
Added validation logic
ValidationError was not correctly handled
Use axum
feat: Docker image
feat: Add AML deployment
Update aml deployment
feat: Improve error handling
feat: Add arguments to CLI
v0.1.0
fix(validation): Fix error messages
feat(router): Add max_waiting_tokens
Create LICENSE (#2)
feat(server): Use safetensors
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
feat(client): Simplify sharded logic
feat(server): Support bitsandbytes
feat(server): Support all AutoModelForCausalLM on a best effort basis
feat: Use json formatter by default in docker image
fix(models): Revert buggy support for AutoModel
feat(server): Support generic AutoModelForCausalLM
feat(server): Support AutoModelForSeq2SeqLM
feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard
feat(server): Improved doc
fix(server): Fix Transformers fork version
feat(server): Clarify CausalLMBatch concatenate method
feat(rust): Update to 1.65
fix(router): Fix HTTP status codes
fix(readme): Typo
fix(router): Handle tokenizer errors
feat(server): Support Galactica (#4)
fix(batching): Avoid theoretical hang in batcher loop (#5)
- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute
Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>
feat(server): Add model tests (#6)
fix(server): Only pad to multiple of 8 on GPUs
feat: Support stop sequences (#7)
feat: Return logprobs (#8)
feat(launcher): Add integration tests (#9)
fix(server): Fix stop sequences (#11)
fix(server): Check for device type correctly when determining initial padding (#16)
AFAIK there is no torch device type called "gpu".
fix(router): Include special tokens when tokenizing (#14)
There's currently a discrepancy in the tokenization between the router
and python server code. The latter includes special tokens but the former
does not.
This results in a token count mismatch for seq2seq models such as mt0
where the tokenizer emits an EOS token at the end.
This in turn results in some unexpected/incorrect output, in particular
when batch concatenation is involved, because the python code uses the
input length passed from the router for each row.
As far as I can tell, it is better to include this token in the encoder
`input_ids`, so I guess it's best to just adjust on the router side.
feat(router): Add const parameters to validation logic (#15)
I noticed some opportunity to collapse some of the logic, in case you
are interested.
fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13)
Fixes #12 in the easiest way I could think of.
feat(launcher): Log server stdout (#19)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
fix(server): Minor refactorization using new_zeros (#24)
- Fix some type hints, in particular base tokenizer class
- Make use of `tensor.new_zero/empty` methods
- Simplify env var string parsing in launcher
fix(router): Obey max batch size (#23)
feat(server): Support SantaCoder (#26)
fix(server): Fix position ids (#28)
feat(docker): Make the image compatible with api-inference (#29)
fix(docker): fix api-inference deployment (#30)
fix(router): fix api-inference deployment (#31)
fix(dockerfile): fix docker build (#32)
feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33)
feat(router): Remove second lock from batcher hot path (#27)
@njhill
feat: Support sampling seeding (#37)
Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>
feat: Add token streaming using ServerSideEvents support (#36)
Add token streaming using ServerSideEvents (SSE).
The signature of the SSE events is:
```rust
struct Details {
finish_reason: String,
generated_tokens: u32,
seed: Option<u64>,
}
struct StreamResponse {
token: Token,
generated_text: Option<String>,
details: Option<Details>,
}
struct ErrorResponse {
error: String,
}
```
Revert "feat: Add token streaming using ServerSideEvents support" (#40)
Reverts huggingface/text-generation-inference#36
fix(server): fix seeding on gpu (#42)
fix(server): fix seeding with multiple shards (#44)
feat: Add token streaming using ServerSideEvents support (#41)
fix(server): fix quantization for sharded models (#45)
feat(server): Support GPT-Neox (#39)
feat(ci): Docker build and push (#46)
feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48)
feat(server): support repetition penalty (#47)
feat(server): allow the server to use a local weight cache (#49)
fix(server): allow greedy repetition penalty (#51)
feat(router): use background task to manage request queue (#52)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
breaking(router): modify /generate API to only return generated text (#50)
@njhill, @yk FYI
generated_text was concatenated to the user prompt for legacy reasons. We
want to remove this behaviour as we don't think it is useful and even
detrimental to usability.
We also remove the unused Vec.
feat(router): refactor API and add openAPI schemas (#53)
feat(docs): Clarify installation steps (#54)
Adds some bits for first-time users (like me 😄 )
feat(ci): push to AML registry (#56)
fix(server): better handling of inference mode (#57)
V0.2.1 (#58)
feat(server): support t5 (#59)
fix(docker): increase shm size (#60)
fixed SSE naming (#61)
https://en.wikipedia.org/wiki/Server-sent_events
feat: add distributed tracing (#62)
feat: add safetensors conversion (#63)
feat(server): improve download logging (#66)
feat(launcher): add disable_custom_kernels arg (#67)
feat(router): add max_total_tokens and empty_input validation (#68)
closes #65
fix(launcher): copy current env vars to subprocesses (#70)
closes #69
feat(router): add prometheus metrics scrape endpoint (#71)
v0.3.0 (#72)
feat(router): add cors allow origin options (#73)
feat(server): enable hf-transfer (#76)
fix(server): remove position_ids from galactica forward (#82)
closes #80
feat(server): pre-allocate max attention mask (#75)
v0.3.1 (#84)
feat(server): add special token bool (#85)
fix(docs): fix openapi schema (#86)
fix(server): fix token_is_special (#87)
feat(router): add legacy route for api-inference support (#88)
feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89)
feat(router): add api-inference headers (#91)
feat(server): add logits watermark (#90)
feat(server): update to hf_transfer==0.1.2 (#93)
feat(ci): improve CI speed (#94)
fix(launcher): add router parameters to launcher (#95)
feat(server): fix transformers commit (#96)
v0.3.2 (#97)
fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100)
feat: allow local models (#101)
closes #99
feat: add supported models (#102)
feat(clients): Python client (#103)
fix(server): fix galactica batch (#106)
closes #105
feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107)
feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108)
fix(python-client): stream not set on the sync client (#109)
fix(server): fix index out of range for watermarking (#110)
feat: support typical sampling (#114)
closes #112
fix(server): do not warp prefill logits (#116)
feat(router): support left truncation (#115)
closes #111
feat(router): add best_of parameter (#117)
feat(python-client): add new parameters (#118)
v0.4.0 (#119)
feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122)
fix(server): revert gpt-neox optims (#123)
fix(server): add position ids to neox (#126)
fix(server): use server tokenizer as gt (#128)
fix(python-client): relax dependencies (#129)
feat(python-client): add cookies to Client constructors and requests (#132)
I have a use case where we need to pass cookies (for auth reasons) to an
internally hosted server.
Note: I couldn't get the client tests to pass - do you need to have an
HF token?
```python
FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid
```
feat(ci): add ci paths (#134)
feat: Add note about NVIDIA drivers (#64)
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
feat(python-client): release v0.4.0 (#135)
feat(python-client): add CI (#136)
feat(server): flash neoX (#133)
fix(server): fix flash-neox scores warping (#137)
feat(server): cleanup flash neox loading (#139)
v0.4.1 (#140)
fix(server): Avoid using try/except to determine kind of AutoModel (#142)
feat(server): Add mypy-protobuf (#141)
Generates .pyi files for protobuf stubs which provide strong typing
information. Very helpful for IDE auto-completion, etc.
feat(server): clear cache on error (#143)
feat(server): reduce mlp and attn in one op for flash neox (#145)
feat: aws sagemaker compatible image (#147)
The only difference is that now it pushes to
registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:...
instead of
registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-...
---------
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
fix(ci): fix sagemaker action (#148)
feat(benchmark): tui based benchmarking tool (#149)
fix(server): fix flash neox rotary embeddings (#150)
v0.4.2 (#151)
v0.4.3 (#152)
feat(server): flash santacoder (#153)
docs(readme): provide link Logits Warper README (#154)
fix(server): fix escape characters in stop sequence (#155)
feat(docker): improve flash_attention caching (#160)
feat(launcher): allow disabling hf_transfer (#161)
fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162)
fix(router): use buckets for metrics histograms (#163)
feat(router): make router input validation optional (#164)
feat(server): add flash attention llama (#144)
feat(server): support OPT models (#55)
OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.
v0.5.0 (#168)
feat(server): optimize decode for sane tokenizers (#170)
feat(server): support sharded santacoder (#167)
fix(launcher): revert change on shard errors (#173)
fix(ci): fix CVE in github-slug-action (#174)
feat(ci): add image signing with cosign (#175)
feat(ci): add Trivy and scan docker image (#178)
feat(ci): use large runners (#179)
feat(ci): faster scanning (#180)
fix(ci): fix ci permissions (#181)
fea(dockerfile): better layer caching (#159)
fix(ci): fix cosign error (#183)
fix(docker): fix docker image (#184)
fix(docker): fix image (#185)
fix(docker): revert dockerfile changes (#186)
fix(docker): fix docker image dependencies (#187)
fix(router): fix truncation (#190)
closes #189
feat(python-client): get list of currently deployed tgi models using the inference API (#191)
feat(router): add info route (#196)
close #125
feat(server): support quantization for flash models (#200)
closes #197
feat(server): check cuda capability when importing flash models (#201)
close #198
fix(server): fix hf_transfer issue with private repos (#203)
fix(docker): remove unused dependencies (#205)
fix(router): add auth token to get model info (#207)
feat(router): add git sha to info route (#208)
feat(router): drop requests when client closes the channel (#202)
fix(ci): fix sha in docker image (#212)
feat(server): flash attention past key value optimizations (#213)
feat(router): add device and dtype info (#215)
fix(server): fix past key values logic (#216)
@njhill fyi
fix(server): cleanup new flash past_key_values logic (#217)
fix(server): fix flash causal (#218)
fix(server): fix flash causal (#219)
fix(server): fix flash batch filtering (#220)
misc: update to rust 1.69 (#221)
v0.6.0 (#222)
feat(server): reduce memory requirement (#214)
chore(server): update huggingface-hub (#227)
feat(router): use number of tokens in batch as input for dynamic batching (#226)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
feat(router): add endpoint info to /info route (#228)
chore(server): update safetensors version (#235)
fix(python-client): add auth headers to is supported requests (#234)
Starting some routing tests. (#233)
fix(benchmarking): fix benchmarking tool
chore(launcher): refactor logic (#242)
Hopefully it's cleaner
feat(router): add tests to validation (#237)
feat(router): new healthcheck that skips the queue (#244)
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
fix(server): fix reshaping of bloom past_key_values in concatenate() (#252)
Introduced in #214
Fixes #249
fix(server): Small tidy of code from recent changes (#251)
remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter()
chore(server): update transformers (#250)
feat(server): add watermarking tests (#248)
feat(docker): add nvidia env vars (#255)
doc(launcher): add more docs to the `launcher` itself and link in the README (#257)
feat(benchmark): add support for private tokenizers (#262)
Adding docs on how dynamic batching works. (#258)
This PR starts the minimal possible amount of explanation I could think
of. It tries to explain how dynamic batching occurs, the interactions
with past key values and ignores the padding problem.
Maybe some drawings could help too but I kept it to text for now.
chore(github): add templates (#264)
fix(server): fix typo in tokenizers decode (#269)
closes #268
feat(server): support hf endpoint weight layout (#266)
fix(launcher): pass weights cache override to the download process (#274)
closes #273
fix(launcher): handle hub branches (#278)
fix(server): Removes the parallelism in file convertion (during download) (#275)
feat(launcher): Improve error message when download process fails. (#276)
fix(server): fix convert (#284)
chore: add `flash-attention` to docker ignore (#287)
Without the ignore rule, flash-attention gets included when building docker locally
(where the local dirs might have the flash-attention folder).
fea(server): decrease convert RAM requirements (#286)
fix(dockerfile): fix nvidia env vars (#297)
Fixes #291
feat(router): Adding response schema for compat_generate (#292)
feat(docker): add benchmarking tool to docker image (#298)
fix(docker): fix docker build (#299)
feat(server): optim flash causal lm decode_token (#285)
fix(docker): fix nvidia env vars (#305)
fix(docker): remove nvidia require cuda env (#310)
feat(server): shard token decode (#303)
feat(server): use float16 (#304)
fix(docker): remove CUDA_VERSION
feat(server): use cuda graph in logits warping (#302)
fix(server): fix multinomial implem in Sampling
feat(server): GPTQ quantization (step1) (#277)
Changes only the type from `bool` to `Option<Enum>` pretty much
everywhere.
- Use `Optional[str]` in Python (easier to manage than importing type
everywhere). Except for the cli to get proper validation
- Updated all models to handle gracefully new values. (Error out if
unknown value, or gptq since not implemented).
chore(docker): use nvidia base image (#318)
fix(docker): remove quantize default
fix(docker): use ubuntu20.04
Hotfixes for santacoder/bigcode. (#294)
Hotfixes:
- Uses `model_type`=`gpt_bigcode` for more general usage.
- Hotfixes linked lm_head vs wte_embedding (safetensors files do not
contain the key, correctly so when the file is sharded, whereas pytorch
copies the tensor)
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Lifting check_unitialized. (#325)
Lifting check_unitialized.
Removing dead variables. (#327)
feat(ci): custom gpu runners (#328)
Single place for TP layers + Dropout Layer Norm + FastLinear (#329)
feat: add snapshot testing (#282)
feat(integration-tests): improve comparison and health checks (#336)
fix(server): fix decode token (#334)
Fixes #333
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
fix: set MODEL_ID in sagemaker-entrypoint script (#343)
feat(server): Support BLOOMChat-176B (#348) (#351)
@njhill,
temporary workaround to be able to run our CI as secrets are not
available to runners run by external contributors. I will ask around to
see if there is a better way.
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
fix(server): fix init for flash causal lm (#352)
Fixes #347
fix(server): t5 cannot run in f16 (#356)
Fix #349
fix(ci): fix security group (#359)
Switch security group used for ci
(open outbound rules)
Signed-off-by: Raphael <oOraph@users.noreply.github.com>
Co-authored-by: Raphael <oOraph@users.noreply.github.com>
feat: add nightly load testing (#358)
chore(sever): update requirements (#357)
Fixes #338
feat(server): support fp16 for t5 (#360)
Fixes #349
feat(server): do not use device_map auto on single GPU (#362)
feat(server): support trust_remote_code (#363)
feat(router): log input/ouput at debug level (#364)
@njhill FYI
v0.7.0 (#353)
feat: decrease IPC proto size (#367)
Closes #307 #308
feat(benchmarker): add summary tables (#368)
feat(server): support vectorized warpers in flash causal lm (#317)
Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>
Fix issue when load AutoModelForSeq2SeqLM model (#370)
fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES
fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES
fix(server): fix quantization
feat(server): support RefinedWeb models (#379)
v0.8.0
increase health checks
feat(server): add retry on download (#384)
fix(server): fix bnb quantization for CausalLM models (#385)
v0.8.1
fix(server): fix has_position_ids (#395)
Fix #389
feat(server): remove trust_remote_code requirement for falcon models (#396)
feat(server): load santacoder/starcoder models with safetensors (#393)
Fix #366
v0.8.2
feat(sagemaker): add trust remote code to entrypoint (#394)
feat(launcher): parse oom signal (#404)
feat(server): only compute prefill logprobs when asked (#406)
Close #288
feat(server): batch tokenization for flash causal lm (#411)
chore: update openapi schema
feat(server): Rework model loading (#344)
Reworked the loading logic. Idea is to use cleaner loading code:
- Remove need for `no_init_weights`
- Remove all weird `bnb_linear` and `load_weights` and
`post_load_weights`.
New code layout:
- New class `Weights` in charge of handling loading the weights from
multiple files into appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code to know what kind of
sharding we need + eventual `all_reduce`. They do not inherit from
Linear, but they contain some kind of Linear instead
- the contained linear can be either FastLinear, BnbLinear or GPTQ
Linear next
- All modeling code is explicitly made for sharding; the process group is
just no-ops for non-sharded code (removes a lot of test cases)
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
feat(server): optimize dist ops (#434)
docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441)
It fixes a typo in the comment sections referencing the environment
variable `CUDA_VISIBLE_DEVICES`. No misspelled references to this
variable were found in code logic that would lead to undefined behaviour or
bugs. This PR is not expected to perform any code logic modification.
fix(makefile): Fix typo and use POSIX comparison in the makefile (#443)
This PR fixes:
- The usage of non posix comparison which may fail depending on the
shell used (`=` will always work, `==` only with bash)
- Typo in the env variable name displayed in the error message
`BUILD_EXTENSION` instead of `BUILD_EXTENSIONS`
<!-- Remove if not applicable -->
Fixes #422
feat(server): pre-allocate past key values for flash causal LM (#412)
feat(router): add ngrok integration (#453)
feat(server): improve flash attention import errors (#465)
@lewtun, is this enough?
Closes #458
Closes #456
fix(server): fix warpers on CPU (#472)
Closes #471
fix(server): Fixing T5 in case the names are mixed up. (#475)
feat(server): Update convert logic. (#483)
Should be more robust to shared tensors (ok when using
`from_pretrained`). But it forces us to add new checks in our loading
code (since the chosen key to keep might be different from
`transformers`).
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
feat(server): Adding new ignore_rule for conversion. (#485)
fix(router): add timeout on flume sends (#488)
feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438)
Let's start discussing implementation.
- Need to expose the quantization scripts (either included here or add
doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).
Currently it means that every place we use `get_{tensor|sharded}` to
check for quantization.
My idea is to reintegrate as much as possible into `utils/layer.py` by
expanding `load_multi` to be a bit more generic.
This might require some thinking, but ultimately the
`qweight,qzeros,scales,g_idx` should be in a single place, and
independent of bias presence.
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
fix(server): Do not init process group if already initialized (#388)
feat(router): add header option to disable buffering for the generate_stream response (#498)
generate_stream endpoint response stream.
Problem: If a model is run behind a proxy server such as nginx that has
buffering enabled then the response stream from generate_stream gets
aggregated into a single response which basically disables streaming.
Instead of getting a chunked response where each token is presented over
time the response presents everything all at once.
Solution: This change adds the `X-Accel-Buffering` http header which
disables buffering for the generate_stream response, allowing the
response to stream properly.
feat(server): add paged attention to flash models (#516)
Closes #478
feat(router): arg validation (#519)
feat: Add the option to force another dtype than `f16`. (#513)
fix(launcher): fix issue where launcher does not properly report shard failures (#522)
v0.9.0 (#525)
feat(server): Add Non flash MPT. (#514)
This adds a non flash version of MPT.
Flash is harder because we need to create a bias ready cuda kernel of
flash attention.
Fixes
https://github.com/huggingface/text-generation-inference/issues/361
Fixes
https://github.com/huggingface/text-generation-inference/issues/491
Fixes
https://github.com/huggingface/text-generation-inference/issues/290
fix: Update server/Makefile to include Makefile-vllm (#520)
For consistency and ease of use (you can just run `make` to install vllm
without any extra steps).
docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462)
fix(server): Handle loading from local files for MPT (#534)
This PR allows the MPT model to be loaded from local files. Without this
change, an exception will be thrown by the `hf_hub_download` function if
`model_id` is a local path.
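A minimal sketch of the fallback described above (the helper name is hypothetical; the real change lives in TGI's MPT loading code): read the config directly from disk when `model_id` is a local directory, and only call `hf_hub_download` for hub repos.
```python
import json
import os
from typing import Optional

from huggingface_hub import hf_hub_download

def load_config(model_id: str, revision: Optional[str] = None) -> dict:
    if os.path.isdir(model_id):
        # local checkout: read the file directly instead of hitting the hub
        path = os.path.join(model_id, "config.json")
    else:
        path = hf_hub_download(model_id, filename="config.json", revision=revision)
    with open(path) as f:
        return json.load(f)
```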
fix(server): avoid errors for very small top_p values (#544)
See https://github.com/huggingface/transformers/pull/24111
I didn't add validation to the `__init__` method since it's not done for
other values/warpers.
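For context, a minimal sketch of a nucleus (top-p) filter that always keeps at least one token, which is the failure mode very small `top_p` values can trigger; this is an illustration, not TGI's warper code.
```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float, min_tokens_to_keep: int = 1) -> torch.Tensor:
    sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
    cumulative_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    # mark tokens outside the nucleus, shifting by one so the token that
    # crosses the threshold is still kept
    to_remove = cumulative_probs > top_p
    to_remove[..., 1:] = to_remove[..., :-1].clone()
    to_remove[..., 0] = False
    # guard: never mask every token, even for tiny top_p values
    to_remove[..., :min_tokens_to_keep] = False
    mask = to_remove.scatter(-1, sorted_indices, to_remove)
    return logits.masked_fill(mask, float("-inf"))
```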
feat(server): use latest flash attention commit (#543)
@njhill FYI
feat(router): add argument for hostname in router (#545) (#550)
In title. Adds argument `--hostname` in router to support something like
`--hostname ::`. Tested with
```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health' # failed before this commit
```
Trigger CI
---------
Co-authored-by: Phil Chen <philchen2000@gmail.com>
fix(server): decrease memory fragmentation (#557)
v0.9.1 (#558)
fix(server): harden the weights choice to save on disk. (#561)
- Look at the `transformers` base class to check for
`_key_to_ignore_on_load_missing` or `_tied_weights`, which are the
standard attributes used to select the keys NOT to save on disk (since they
are ignored).
- Modified safetensors code (to be reflected in safetensors itself, even if it's
an internal function).
- Will not work for trust_remote_code=True repos (like santacoder).
Should help with :
https://github.com/huggingface/text-generation-inference/issues/555
and : https://github.com/huggingface/text-generation-inference/pull/501
and https://github.com/huggingface/text-generation-inference/issues/556
and
https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593
feat: better errors for warmup and TP (#575)
Close #571
fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579)
Fixes #555
feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580)
Some models are already converted and do not have those values in the
file; this enables users to use them with less friction.
Went for a purely env-based approach because adding flags would end up (imo) very
tedious to maintain. There's a lot of sanitization to do: those flags
would be errors if not used in conjunction with `--quantize gptq`,
and the flags would need to exist in the launcher and the server and be passed
throughout all function calls.
This PR is intended as an easy escape hatch, not the de facto method to
use GPTQ in TGI.
Fixes #500
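A minimal sketch of the env-based escape hatch, under the assumption that both variables must be integers (the helper name is hypothetical); note that a missing variable raises `KeyError` while a malformed one raises `ValueError`, which is exactly the kind of distinction the follow-up fixes below deal with.
```python
import os
from typing import Tuple

def gptq_params_from_env() -> Tuple[int, int]:
    try:
        bits = int(os.environ["GPTQ_BITS"])
        groupsize = int(os.environ["GPTQ_GROUPSIZE"])
    except KeyError as err:
        raise RuntimeError(f"Missing environment variable: {err}") from err
    except ValueError as err:
        raise RuntimeError(f"GPTQ env vars must be integers: {err}") from err
    return bits, groupsize
```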
chore: migrate ci region for more availability. (#581)
fix(server): T5 weights names. (#582)
Fixes #541
fix(server): Adding logger import to t5_modeling.py (#585)
The logger is referenced during the apex import but is never imported,
causing a NameError.
fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590)
This fixes a typo and extends the GPTQ_BITS environment variables
through to the second method, which requires the same logic. Please let
me know if there's anything I've misunderstood in this change.
Thanks @Narsil for the original fix.
feat(server): Implements sharding for non divisible `vocab_size`. (#583)
- The code is relatively easy (just disable the checks on Embedding and
Head).
This cannot be done in the same easy fashion for hidden_dim/head_dim:
it's relatively easy on some models (classic MHA), but it would make the
other models (MQA) much more complex, and GPTQ quantization is another quite
hairy piece of code.
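A minimal sketch of the idea (shapes and helper name are assumptions, not TGI's actual code): instead of asserting that `vocab_size % world_size == 0`, give every shard a ceil-divided block and let the last shard be smaller.
```python
import torch

def shard_vocab(weight: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    vocab_size = weight.shape[0]
    block = (vocab_size + world_size - 1) // world_size  # ceil division
    start = rank * block
    stop = min(start + block, vocab_size)
    # the last shard may be smaller than `block` when vocab_size is not divisible
    return weight[start:stop]
```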
feat(server): empty cache on errors
GPTQ Env vars: catch correct type of error (#596)
When passing in environment variables like gptq_bits, we still get
errors thrown from TGI because the try/except block is catching the wrong
type of error. This PR aims to fix that.
@Narsil - let me know if this is how you want this formatted. My Python
is a little shaky, so I hope this syntax is correct.
feat(launcher): add arg validation and drop subprocess (#595)
feat(router): explicit warning if revision is not set (#608)
docs: README: Add logo + baseline (#611)

fix(server): blacklist local files (#609)
Close #589 #602
v0.9.2 (#616)
fix(server): empty_cache when stopped
fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621)
fea(launcher): debug logs (#623)
feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587)
The quantization script is reworked so it's still universal (not llama-specific)
but works on more configurations (no need for 2 GPUs, less RAM usage).
Still need to investigate the potential differences in quantization
results.
feat(server): flash attention v2 (#624)
feat(server): add support for llamav2 (#633)
v0.9.3 (#634)
fix(server): fix llamav2 config (#635)
feat(server): auto max_batch_total_tokens for flash att models (#630)
feat(router): ngrok edge (#642)
docs: Update README.md (#639)
docs: Update README.md (#643)
Add trust_remote_code to quantize script (#647)
Fixes a bug that appeared with MR #587 fixing issue #552.
See the discussion in #552.
With MR #587 the trust_remote_code variable is not passed to
AutoModelForCausalLM, even though it is present in the function signature. This
prevents models like falcon from being quantized, because trust_remote_code
is required. This MR fixes the issue.
fix(server): llama v2 GPTQ (#648)
As per title & reported
https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5
Test it:
```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```
&
```
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
-H 'Content-Type: application/json'
```
fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661)
fix(server): use mem_get_info to get kv cache size (#664)
Close
https://github.com/huggingface/text-generation-inference/issues/649
Close
https://github.com/huggingface/text-generation-inference/issues/651
Close
https://github.com/huggingface/text-generation-inference/issues/653
Close #636
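A rough sketch of the sizing logic (the block arithmetic and safety fraction are assumptions, not TGI's exact formula): `torch.cuda.mem_get_info` reports the memory that is actually free on the device, which is a better basis for the KV cache size than total capacity.
```python
import torch

def num_kv_cache_blocks(block_bytes: int, memory_fraction: float = 0.9) -> int:
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    # reserve a safety margin, then see how many cache blocks fit
    return int(free_bytes * memory_fraction) // block_bytes
```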
feat(server): Add exllama GPTQ CUDA kernel support #553 (#666)
Just trying to get the integration tests to pass.
---------
Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>
Directly load GPTBigCode to specified device (#618)
This PR directly loads GPTBigCode to the specified device, avoiding moving the
model between devices.
feat(server): add local prom and health routes if running w/ ngrok
feat: add cuda memory fraction (#659)
Close #673
fix(server): fix exllama buffers (#689)
Close #683
feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671)
- The current PR is not great because we're side-stepping
`Weights.__init__`, but Weights shouldn't require anything related
to the config or the model_id, as it aims to be a simple wrapper
over multi-file loading.
- The ideal solution would be to use something like a Rust enum
```
enum Quantize {
    Bitsandbytes(Bitsandbytes),
    GPTQ { bits: usize, groupsize: usize },
}
```
and pass that around during load. Unfortunately we don't
have access to this, so for now, side-stepping seems easier.
- Re-enabling groupsize<0 with exllama (confirmed it works.)
Helps #601
In next steps we should make sure our quantization script uses that
format and make it standard.
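A minimal sketch of reading the file (the field names follow the common GPTQ `quantize_config.json` layout and are an assumption here, not TGI's final loading code):
```python
import json
from typing import Optional, Tuple

from huggingface_hub import hf_hub_download

def load_gptq_params(model_id: str, revision: Optional[str] = None) -> Tuple[int, int]:
    path = hf_hub_download(model_id, filename="quantize_config.json", revision=revision)
    with open(path) as f:
        cfg = json.load(f)
    # typical GPTQ config fields written by quantization scripts
    return cfg["bits"], cfg["group_size"]
```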
docs(README): update readme
fix(server): fix quantization python requirements (#708)
fix(server): fix missing datasets in quantize
feat(server): support new falcon config (#712)
v0.9.4 (#713)
Add section about TGI on other AI hardware accelerators in README (#715)
As per title.
docs: Add hardware section to TOC in README (#721)
feat(server): update vllm version (#723)
chore: update license to HFOIL (#725)
v1.0.0 (#727)
Local gptq support. (#738)
Redoes #719
Fix typing in `Model.generate_token` (#733)
This PR fixes a minor type annotation issue in the signature of
`Model.generate_token`.
All existing overrides of `Model.generate_token` return
`Tuple[List[Generation], Optional[B]]`:
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591
I suspect that back in 017a2a8c when `GeneratedText` and `Generation`
were separated, the function signature was not updated.
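For reference, a sketch of the corrected annotation (the class skeleton is trimmed down to the signature in question; `Generation` is left as a forward reference):
```python
from typing import Generic, List, Optional, Tuple, TypeVar

B = TypeVar("B")  # batch type, bound by each concrete model

class Model(Generic[B]):
    def generate_token(self, batch: B) -> Tuple[List["Generation"], Optional[B]]:
        # every existing override returns the generations plus the
        # (possibly exhausted, hence Optional) next batch
        raise NotImplementedError
```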
CC @OlivierDehaene
Adding Rope scaling. (#741)
- Adds RoPE NTK scaling.
Done because
https://github.com/huggingface/text-generation-inference/pull/529 was
closed.
Took some code from
https://github.com/huggingface/transformers/pull/24653
- `--rope-scaling` and `--rope-factor` are added separately. I
considered having a single flag and parsing something like "linear:4.0"
or "dynamic", but decided against
it because it would push more parsing+validation a bit everywhere (both
in the launcher and the server).
Fixes #512
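As a rough sketch of the two modes (the base-rescaling formula comes from the dynamic NTK discussion; the function name and the exact wiring into the attention layers are assumptions): "linear" stretches the position indices by `factor`, while "dynamic" rescales the rope base once the sequence exceeds the trained length.
```python
import torch

def rope_angles(dim, base, factor, scaling, seq_len, max_position_embeddings):
    if scaling == "dynamic" and seq_len > max_position_embeddings:
        # NTK-aware rescaling of the base when going past the trained length
        base = base * ((factor * seq_len / max_position_embeddings) - (factor - 1)) ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    t = torch.arange(seq_len, dtype=torch.float32)
    if scaling == "linear":
        t = t / factor  # stretch positions so a longer context maps onto the trained range
    return torch.outer(t, inv_freq)  # angles used to build the cos/sin cache
```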
chore: fix typo in mpt_modeling.py (#737)
Fixed typo.
implemetation -> implementation
Init
fix: cleanup
Add load testing
Refactored gRPC interface
Added validation logic
ValidationError was not correctly handled
Use axum
feat: Docker image
feat: Add AML deployment
Update aml deployment
feat: Improve error handling
feat: Add arguments to CLI
v0.1.0
fix(validation): Fix error messages
feat(router): Add max_waiting_tokens
Create LICENSE (#2)
feat(server): Use safetensors
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
feat(client): Simplify sharded logic
feat(server): Support bitsandbytes
feat(server): Support all AutoModelForCausalLM on a best effort basis
feat: Use json formatter by default in docker image
fix(models): Revert buggy support for AutoModel
feat(server): Support generic AutoModelForCausalLM
feat(server): Support AutoModelForSeq2SeqLM
feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard
feat(server): Improved doc
fix(server): Fix Transformers fork version
feat(server): Clarify CausalLMBatch concatenate method
feat(rust): Update to 1.65
fix(router): Fix HTTP status codes
fix(readme): Typo
fix(router): Handle tokenizer errors
feat(server): Support Galactica (#4)
fix(batching): Avoid theoretical hang in batcher loop (#5)
- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute
Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>
feat(server): Add model tests (#6)
fix(server): Only pad to multiple of 8 on GPUs
feat: Support stop sequences (#7)
feat: Return logprobs (#8)
feat(launcher): Add integration tests (#9)
fix(server): Fix stop sequences (#11)
fix(server): Check for device type correctly when determining initial padding (#16)
AFAIK there is no torch device type called "gpu".
fix(router): Include special tokens when tokenizing (#14)
There's currently a discrepancy in the tokenization between the router
and python server code. The latter includes special tokens but the former
does not.
This results in a token count mismatch for seq2seq models such as mt0
where the tokenizer emits an EOS token at the end.
This in turn results in some unexpected/incorrect output, in particular
when batch concatenation is involved, because the python code uses the
input length passed from the router for each row.
As far as I can tell, it is better to include this token in the encoder
`input_ids`, so I guess it's best to just adjust on the router side.
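A small illustration of the mismatch (the model id is chosen only as an example of a seq2seq tokenizer that appends EOS):
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/mt0-base")
with_special = tok("Hello world", add_special_tokens=True)["input_ids"]
without_special = tok("Hello world", add_special_tokens=False)["input_ids"]
# the EOS appended by the tokenizer makes the two counts differ by one,
# which is exactly the router/server discrepancy described above
print(len(with_special), len(without_special))
```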
feat(router): Add const parameters to validation logic (#15)
I noticed some opportunity to collapse some of the logic, in case you
are interested.
fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13)
Fixes #12 in the easiest way I could think of.
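For reference, the decode flag in question (the transformers keyword is spelled `clean_up_tokenization_spaces`; the model id is just an example):
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok("Hello , world !")["input_ids"]
print(tok.decode(ids, clean_up_tokenization_spaces=False))  # 'Hello , world !'
print(tok.decode(ids, clean_up_tokenization_spaces=True))   # spaces before punctuation stripped
```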
feat(launcher): Log server stdout (#19)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
fix(server): Minor refactorization using new_zeros (#24)
- Fix some type hints, in particular base tokenizer class
- Make use of `tensor.new_zero/empty` methods
- Simplify env var string parsing in launcher
fix(router): Obey max batch size (#23)
feat(server): Support SantaCoder (#26)
fix(server): Fix position ids (#28)
feat(docker): Make the image compatible with api-inference (#29)
fix(docker): fix api-inference deployment (#30)
fix(router): fix api-inference deployment (#31)
fix(dockerfile): fix docker build (#32)
feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33)
feat(router): Remove second lock from batcher hot path (#27)
@njhill
feat: Support sampling seeding (#37)
Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>
feat: Add token streaming using ServerSideEvents support (#36)
Add token streaming using ServerSideEvents (SSE).
The signature of the SSE events is:
```rust
struct Details {
finish_reason: String,
generated_tokens: u32,
seed: Option<u64>,
}
struct StreamResponse {
token: Token,
generated_text: Option<String>,
details: Option<Details>,
}
struct ErrorResponse {
error: String,
}
```
Revert "feat: Add token streaming using ServerSideEvents support" (#40)
Reverts huggingface/text-generation-inference#36
fix(server): fix seeding on gpu (#42)
fix(server): fix seeding with multiple shards (#44)
feat: Add token streaming using ServerSideEvents support (#41)
fix(server): fix quantization for sharded models (#45)
feat(server): Support GPT-Neox (#39)
feat(ci): Docker build and push (#46)
feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48)
feat(server): support repetition penalty (#47)
feat(server): allow the server to use a local weight cache (#49)
fix(server): allow greedy repetition penalty (#51)
feat(router): use background task to manage request queue (#52)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
breaking(router): modify /generate API to only return generated text (#50)
@njhill, @yk FYI
generated_text was concatenated to the user prompt for legacy reasons. We
want to remove this behaviour as we don't think it is useful and may even be
detrimental to usability.
We also remove the unused Vec.
feat(router): refactor API and add openAPI schemas (#53)
feat(docs): Clarify installation steps (#54)
Adds some bits for first-time users (like me 😄 )
feat(ci): push to AML registry (#56)
fix(server): better handling of inference mode (#57)
V0.2.1 (#58)
feat(server): support t5 (#59)
fix(docker): increase shm size (#60)
fixed SSE naming (#61)
https://en.wikipedia.org/wiki/Server-sent_events
feat: add distributed tracing (#62)
feat: add safetensors conversion (#63)
feat(server): improve download logging (#66)
feat(launcher): add disable_custom_kernels arg (#67)
feat(router): add max_total_tokens and empty_input validation (#68)
closes #65
fix(launcher): copy current env vars to subprocesses (#70)
closes #69
feat(router): add prometheus metrics scrape endpoint (#71)
v0.3.0 (#72)
feat(router): add cors allow origin options (#73)
feat(server): enable hf-transfer (#76)
fix(server): remove position_ids from galactica forward (#82)
closes #80
feat(server): pre-allocate max attention mask (#75)
v0.3.1 (#84)
feat(server): add special token bool (#85)
fix(docs): fix openapi schema (#86)
fix(server): fix token_is_special (#87)
feat(router): add legacy route for api-inference support (#88)
feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89)
feat(router): add api-inference headers (#91)
feat(server): add logits watermark (#90)
feat(server): update to hf_transfer==0.1.2 (#93)
feat(ci): improve CI speed (#94)
fix(launcher): add router parameters to launcher (#95)
feat(server): fix transformers commit (#96)
v0.3.2 (#97)
fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100)
feat: allow local models (#101)
closes #99
feat: add supported models (#102)
feat(clients): Python client (#103)
fix(server): fix galactica batch (#106)
closes #105
feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107)
feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108)
fix(python-client): stream not set on the sync client (#109)
fix(server): fix index out of range for watermarking (#110)
feat: support typical sampling (#114)
closes #112
fix(server): do not warp prefill logits (#116)
feat(router): support left truncation (#115)
closes #111
feat(router): add best_of parameter (#117)
feat(python-client): add new parameters (#118)
v0.4.0 (#119)
feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122)
fix(server): revert gpt-neox optims (#123)
fix(server): add position ids to neox (#126)
fix(server): use server tokenizer as gt (#128)
fix(python-client): relax dependencies (#129)
feat(python-client): add cookies to Client constructors and requests (#132)
I have a use case where we need to pass cookies (for auth reasons) to an
internally hosted server.
Note: I couldn't get the client tests to pass - do you need to have an
HF token?
```python
FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid
```
feat(ci): add ci paths (#134)
feat: Add note about NVIDIA drivers (#64)
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
feat(python-client): release v0.4.0 (#135)
feat(python-client): add CI (#136)
feat(server): flash neoX (#133)
fix(server): fix flash-neox scores warping (#137)
feat(server): cleanup flash neox loading (#139)
v0.4.1 (#140)
fix(server): Avoid using try/except to determine kind of AutoModel (#142)
feat(server): Add mypy-protobuf (#141)
Generates .pyi files for protobuf stubs which provide strong typing
information. Very helpful for IDE auto-completion, etc.
feat(server): clear cache on error (#143)
feat(server): reduce mlp and attn in one op for flash neox (#145)
feat: aws sagemaker compatible image (#147)
The only difference is that now it pushes to
registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:...
instead of
registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-...
---------
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
fix(ci): fix sagemaker action (#148)
feat(benchmark): tui based benchmarking tool (#149)
fix(server): fix flash neox rotary embeddings (#150)
v0.4.2 (#151)
v0.4.3 (#152)
feat(server): flash santacoder (#153)
docs(readme): provide link Logits Warper README (#154)
fix(server): fix escape characters in stop sequence (#155)
feat(docker): improve flash_attention caching (#160)
feat(launcher): allow disabling hf_transfer (#161)
fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162)
fix(router): use buckets for metrics histograms (#163)
feat(router): make router input validation optional (#164)
feat(server): add flash attention llama (#144)
feat(server): support OPT models (#55)
OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.
v0.5.0 (#168)
feat(server): optimize decode for sane tokenizers (#170)
feat(server): support sharded santacoder (#167)
fix(launcher): revert change on shard errors (#173)
fix(ci): fix CVE in github-slug-action (#174)
feat(ci): add image signing with cosign (#175)
feat(ci): add Trivy and scan docker image (#178)
feat(ci): use large runners (#179)
feat(ci): faster scanning (#180)
fix(ci): fix ci permissions (#181)
fea(dockerfile): better layer caching (#159)
fix(ci): fix cosign error (#183)
fix(docker): fix docker image (#184)
fix(docker): fix image (#185)
fix(docker): revert dockerfile changes (#186)
fix(docker): fix docker image dependencies (#187)
fix(router): fix truncation (#190)
closes #189
feat(python-client): get list of currently deployed tgi models using the inference API (#191)
feat(router): add info route (#196)
close #125
feat(server): support quantization for flash models (#200)
closes #197
feat(server): check cuda capability when importing flash models (#201)
close #198
fix(server): fix hf_transfer issue with private repos (#203)
fix(docker): remove unused dependencies (#205)
fix(router): add auth token to get model info (#207)
feat(router): add git sha to info route (#208)
feat(router): drop requests when client closes the channel (#202)
fix(ci): fix sha in docker image (#212)
feat(server): flash attention past key value optimizations (#213)
feat(router): add device and dtype info (#215)
fix(server): fix past key values logic (#216)
@njhill fyi
fix(server): cleanup new flash past_key_values logic (#217)
fix(server): fix flash causal (#218)
fix(server): fix flash causal (#219)
fix(server): fix flash batch filtering (#220)
misc: update to rust 1.69 (#221)
v0.6.0 (#222)
feat(server): reduce memory requirement (#214)
chore(server): update huggingface-hub (#227)
feat(router): use number of tokens in batch as input for dynamic batching (#226)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
feat(router): add endpoint info to /info route (#228)
chore(server): update safetensors version (#235)
fix(python-client): add auth headers to is supported requests (#234)
Starting some routing tests. (#233)
fix(benchmarking): fix benchmarking tool
chore(launcher): refactor logic (#242)
Hopefully it's cleaner
feat(router): add tests to validation (#237)
feat(router): new healthcheck that skips the queue (#244)
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
fix(server): fix reshaping of bloom past_key_values in concatenate() (#252)
Introduced in #214
Fixes #249
fix(server): Small tidy of code from recent changes (#251)
remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter()
chore(server): update transformers (#250)
feat(server): add watermarking tests (#248)
feat(docker): add nvidia env vars (#255)
doc(launcher): add more docs to the `launcher` itself and link in the README (#257)
feat(benchmark): add support for private tokenizers (#262)
Adding docs on how dynamic batching works. (#258)
This PR contains the minimal possible amount of explanation I could think
of. It tries to explain how dynamic batching occurs and its interactions
with past key values, and ignores the padding problem.
Maybe some drawings could help too but I kept it to text for now.
chore(github): add templates (#264)
fix(server): fix typo in tokenizers decode (#269)
closes #268
feat(server): support hf endpoint weight layout (#266)
fix(launcher): pass weights cache override to the download process (#274)
closes #273
fix(launcher): handle hub branches (#278)
fix(server): Removes the parallelism in file conversion (during download) (#275)
feat(launcher): Improve error message when download process fails. (#276)
fix(server): fix convert (#284)
chore: add `flash-attention` to docker ignore (#287)
So it is not included when building the Docker image locally
(where the local dirs might have the flash-attention folder).
fea(server): decrease convert RAM requirements (#286)
fix(dockerfile): fix nvidia env vars (#297)
Fixes #291
feat(router): Adding response schema for compat_generate (#292)
feat(docker): add benchmarking tool to docker image (#298)
fix(docker): fix docker build (#299)
feat(server): optim flash causal lm decode_token (#285)
fix(docker): fix nvidia env vars (#305)
fix(docker): remove nvidia require cuda env (#310)
feat(server): shard token decode (#303)
feat(server): use float16 (#304)
fix(docker): remove CUDA_VERSION
feat(server): use cuda graph in logits warping (#302)
fix(server): fix multinomial implem in Sampling
feat(server): GPTQ quantization (step1) (#277)
Changes only the type from `bool` to `Option<Enum>` pretty much
everywhere.
- Use `Optional[str]` in Python (easier to manage than importing the type
everywhere), except for the CLI, to get proper validation.
- Updated all models to handle new values gracefully (error out on an
unknown value, or on gptq since it is not implemented yet).
chore(docker): use nvidia base image (#318)
fix(docker): remove quantize default
fix(docker): use ubuntu20.04
Hotfixes for santacoder/bigcode. (#294)
Hotfixes:
- Uses `model_type`=`gpt_bigcode` for more general usage.
- Hotfixes the linked lm_head vs wte embedding (safetensors files do not
contain the key when the file is correctly sharded, whereas pytorch
copies the tensor); a sketch of the fallback is shown after this entry.
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
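A hedged sketch of the lm_head/wte fallback mentioned above (the key names follow the GPT-BigCode layout; the `weights` accessor and its error behaviour are assumptions): when the sharded safetensors file omits the tied `lm_head.weight`, reuse the word-token embedding it is tied to.
```python
def load_lm_head(weights):
    try:
        return weights.get_tensor("lm_head.weight")
    except (KeyError, RuntimeError):
        # sharded safetensors checkpoints drop the tied copy of the weight,
        # so fall back to the word-token embedding it is tied to
        return weights.get_tensor("transformer.wte.weight")
```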
Lifting check_unitialized. (#325)
Removing dead variables. (#327)
feat(ci): custom gpu runners (#328)
Single place for TP layers + Dropout Layer Norm + FastLinear (#329)
feat: add snapshot testing (#282)
feat(integration-tests): improve comparison and health checks (#336)
fix(server): fix decode token (#334)
Fixes #333
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
fix: set MODEL_ID in sagemaker-entrypoint script (#343)
feat(server): Support BLOOMChat-176B (#348) (#351)
@njhill,
temporary workaround to be able to run our CI as secrets are not
available to runners run by external contributors. I will ask around to
see if there is a better way.
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
fix(server): fix init for flash causal lm (#352)
Fixes #347
fix(server): t5 cannot run in f16 (#356)
Fix #349
fix(ci): fix security group (#359)
Switch security group used for ci
(open outbound rules)
Signed-off-by: Raphael <oOraph@users.noreply.github.com>
Co-authored-by: Raphael <oOraph@users.noreply.github.com>
feat: add nightly load testing (#358)
chore(sever): update requirements (#357)
Fixes #338
feat(server): support fp16 for t5 (#360)
Fixes #349
feat(server): do not use device_map auto on single GPU (#362)
feat(server): support trust_remote_code (#363)
feat(router): log input/output at debug level (#364)
@njhill FYI
v0.7.0 (#353)
feat: decrease IPC proto size (#367)
Closes #307 #308
feat(benchmarker): add summary tables (#368)
feat(server): support vectorized warpers in flash causal lm (#317)
Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>
Fix issue when load AutoModelForSeq2SeqLM model (#370)
fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES
fix(server): fix quantization
feat(server): support RefinedWeb models (#379)
v0.8.0
increase health checks
feat(server): add retry on download (#384)
fix(server): fix bnb quantization for CausalLM models (#385)
v0.8.1
fix(server): fix has_position_ids (#395)
Fix #389
feat(server): remove trust_remote_code requirement for falcon models (#396)
feat(server): load santacoder/starcoder models with safetensors (#393)
Fix #366
v0.8.2
feat(sagemaker): add trust remote code to entrypoint (#394)
feat(launcher): parse oom signal (#404)
feat(server): only compute prefill logprobs when asked (#406)
Close #288
feat(server): batch tokenization for flash causal lm (#411)
chore: update openapi schema
feat(server): Rework model loading (#344)
Reworked the loading logic. The idea is to use cleaner loading code:
- Remove the need for `no_init_weights`
- Remove all the weird `bnb_linear`, `load_weights` and
`post_load_weights`.
New code layout:
- New class `Weights` in charge of loading the weights from
multiple files into appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code to know what kind of
sharding we need + an eventual `all_reduce`. They do not inherit from
Linear, but they contain some kind of Linear instead
- The contained linear can be either FastLinear, BnbLinear or GPTQ
Linear next.
- All modeling code is explicitly written for sharding; the process group is
just a no-op for non-sharded code (removes a lot of test cases)
A minimal sketch of the `Weights` idea is shown after this entry.

---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
feat(server): optimize dist ops (#434)
docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441)
It fixes a typo in the comments referencing the environment
variable `CUDA_VISIBLE_DEVICES`. No misspelled references to this
variable were found in the code itself, so there is no undefined behaviour
or bug; this PR does not modify any code logic.
fix(makefile): Fix typo and use POSIX comparison in the makefile (#443)
This PR fixes:
- The usage of a non-POSIX comparison, which may fail depending on the
shell used (`=` always works, `==` only with bash)
- A typo in the env variable name displayed in the error message:
`BUILD_EXTENSION` instead of `BUILD_EXTENSIONS`
Fixes #422
feat(server): pre-allocate past key values for flash causal LM (#412)
feat(router): add ngrok integration (#453)
feat(server): improve flash attention import errors (#465)
@lewtun, is this enough?
Closes #458
Closes #456
fix(server): fix warpers on CPU (#472)
Closes #471
fix(server): Fixing T5 in case the names are mixed up. (#475)
feat(server): Update convert logic. (#483)
Should be more robust to shared tensors (OK when using
`from_pretrained`), but it forces us to add new checks in our loading
code, since the chosen key to keep might be different from
`transformers`.
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
feat(server): Adding new ignore_rule for conversion. (#485)
fix(router): add timeout on flume sends (#488)
feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438)
Let's start discussing implementation.
- Need to expose the quantization scripts (either included here or add
docs on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).
Currently it means checking for quantization in every place we use
`get_{tensor|sharded}`.
My idea is to reintegrate as much as possible into `utils/layer.py` by
expanding `load_multi` to be a bit more generic.
This might require some thinking, but ultimately
`qweight,qzeros,scales,g_idx` should live in a single place, independent
of bias presence (a sketch of such a helper follows below).
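A hedged sketch of the kind of helper meant here, with a hypothetical name and assuming a `weights` object that exposes `get_tensor`/`get_sharded` as in the loading rework above:
```python
# Hedged sketch: keep qweight/qzeros/scales/g_idx in a single place,
# independent of bias presence. Not the actual TGI helper.
from typing import Optional


def get_linear_weights(weights, prefix: str, quantize: Optional[str]):
    if quantize == "gptq":
        qweight = weights.get_sharded(f"{prefix}.qweight", dim=1)
        qzeros = weights.get_sharded(f"{prefix}.qzeros", dim=1)
        scales = weights.get_sharded(f"{prefix}.scales", dim=1)
        g_idx = weights.get_tensor(f"{prefix}.g_idx")
        return "gptq", (qweight, qzeros, scales, g_idx)
    # Non-quantized path: a single dense weight tensor.
    return "dense", weights.get_sharded(f"{prefix}.weight", dim=0)
```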
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
fix(server): Do not init process group if already initialized (#388)
feat(router): add header option to disable buffering for the generate_stream response (#498)
Adds a header option to disable buffering for the generate_stream endpoint response stream.
Problem: If a model is run behind a proxy server such as nginx that has
buffering enabled then the response stream from generate_stream gets
aggregated into a single response which basically disables streaming.
Instead of getting a chunked response where each token is presented over
time the response presents everything all at once.
Solution: This change adds the `X-Accel-Buffering` http header which
disables buffering for the generate_stream response, allowing the
response to stream properly.
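The TGI router is Rust/axum, so the following is only an illustration of the header trick in Python (FastAPI/Starlette), not TGI code:
```python
# Illustration only: the fix amounts to setting the `X-Accel-Buffering: no`
# header on the SSE response so a proxy such as nginx does not buffer it.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def token_stream():
    for token in ["Hello", " ", "world"]:
        yield f"data: {token}\n\n"


@app.get("/generate_stream")
async def generate_stream():
    return StreamingResponse(
        token_stream(),
        media_type="text/event-stream",
        headers={"X-Accel-Buffering": "no"},  # tell nginx not to buffer the stream
    )
```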
feat(server): add paged attention to flash models (#516)
Closes #478
feat(router): arg validation (#519)
feat: Add the option to force another dtype than `f16`. (#513)
fix(launcher): fix issue where launcher does not properly report shard failures (#522)
v0.9.0 (#525)
feat(server): Add Non flash MPT. (#514)
This adds a non-flash version of MPT.
Flash is harder because we would need a bias-ready CUDA kernel for
flash attention.
Fixes
https://github.com/huggingface/text-generation-inference/issues/361
Fixes
https://github.com/huggingface/text-generation-inference/issues/491
Fixes
https://github.com/huggingface/text-generation-inference/issues/290
fix: Update server/Makefile to include Makefile-vllm (#520)
For consistency and ease of use (you can just run `make` to install vllm
without any extra steps).
docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462)
fix(server): Handle loading from local files for MPT (#534)
This PR allows the MPT model to be loaded from local files. Without this
change, an exception will be thrown by `hf_hub_download` function if
`model_id` is a local path.
fix(server): avoid errors for very small top_p values (#544)
See https://github.com/huggingface/transformers/pull/24111
I didn't add validation to the `__init__` method since it's not done for
other values/warpers.
feat(server): use latest flash attention commit (#543)
@njhill FYI
feat(router): add argument for hostname in router (#545) (#550)
In title. Adds argument `--hostname` in router to support something like
`--hostname ::`. Tested with
```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health' # failed before this commit
```
Trigger CI
---------
Co-authored-by: Phil Chen <philchen2000@gmail.com>
fix(server): decrease memory fragmentation (#557)
v0.9.1 (#558)
fix(server): harden the weights choice to save on disk. (#561)
- Look at the `transformers` base class to check for
`_key_to_ignore_on_load_missing` or `_tied_weights`, which are the
standard attributes used to select the keys NOT to save on disk (since
they are ignored); a sketch of this key selection follows below
- Modified safetensors code (to be reflected in safetensors even if it's
an internal function).
- Will not work for trust_remote_code=True repos (like santacoder).
Should help with:
https://github.com/huggingface/text-generation-inference/issues/555,
https://github.com/huggingface/text-generation-inference/pull/501,
https://github.com/huggingface/text-generation-inference/issues/556,
and
https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593
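A hedged sketch of that key selection; attribute names vary across `transformers` versions, so treat the lookups as illustrative:
```python
# Hedged sketch: drop tied/ignored keys before writing safetensors to disk.
import re


def keys_to_save(model):
    patterns = list(getattr(model, "_keys_to_ignore_on_load_missing", None) or [])
    patterns += list(getattr(model, "_tied_weights_keys", None) or [])
    keep = []
    for name in model.state_dict():
        if any(re.search(pattern, name) for pattern in patterns):
            continue  # tied or ignored: safe not to write to disk
        keep.append(name)
    return keep
```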
feat: better errors for warmup and TP (#575)
Close #571
fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579)
Fixes #555
feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580)
Some models are already converted and do not have those values in the
file; this enables users to use them with less friction.
Went with a purely env-based approach because adding flags would end up
(imo) very tedious to maintain. There's a lot of sanitation to do: those
flags would be errors if not used in conjunction with `--quantize gptq`,
and the flags would need to exist in the launcher and be passed by the
server throughout all function calls.
This PR is intended as an easy escape hatch, not the de facto method to
use gptq in TGI (a minimal sketch of reading the env vars follows below).
Fixes #500
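A minimal sketch of that escape hatch, assuming only the two environment variables named above:
```python
# Minimal sketch of the env-var escape hatch described above.
import os
from typing import Tuple


def gptq_params_from_env() -> Tuple[int, int]:
    try:
        bits = int(os.environ["GPTQ_BITS"])
        groupsize = int(os.environ["GPTQ_GROUPSIZE"])
    except KeyError as err:
        raise RuntimeError(
            f"{err.args[0]} must be set when the checkpoint carries no quantization metadata"
        ) from err
    return bits, groupsize
```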
chore: migrate ci region for more availability. (#581)
fix(server): T5 weights names. (#582)
Fixes #541
fix(server): Adding logger import to t5_modeling.py (#585)
The logger is referenced during the apex import but is never imported
itself, causing a NameError
fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590)
This fixes a typo and extends the GPTQ_BITS environment variable
handling through to the second method, which requires the same logic.
Please let me know if there's anything I've misunderstood in this change.
Thanks @Narsil for the original fix.
feat(server): Implements sharding for non divisible `vocab_size`. (#583)
- The code is relatively easy (just disable the divisibility checks on
the Embedding and the Head); a sketch of the shard bounds follows below.
This cannot be done in the same easy fashion for hidden_dim/head_dim.
It's relatively easy on some models (classic MHA), but it would make the
other models (MQA) much more complex, and GPTQ quantization yet another
quite hairy piece of code.
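An illustrative shard-bound computation (not the exact TGI code): each rank takes a block of rows and the last rank simply gets fewer, so no divisibility check is needed.
```python
# Shard bounds for a non-divisible vocab_size.
from typing import Tuple


def shard_bounds(vocab_size: int, world_size: int, rank: int) -> Tuple[int, int]:
    block = (vocab_size + world_size - 1) // world_size  # ceil division
    start = rank * block
    stop = min(start + block, vocab_size)
    return start, stop


# vocab_size=32003, world_size=4 -> (0, 8001), (8001, 16002), (16002, 24003), (24003, 32003)
```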
feat(server): empty cache on errors
GPTQ Env vars: catch correct type of error (#596)
When passing in environment variables like gptq_bits, we still get
errors thrown from TGI because the try/catch block is catching the wrong
type of error. This PR aims to fix that.
@Narsil - let me know if this is how you want this formatted. My Python
is a little shaky, so I hope this syntax is correct.
feat(launcher): add arg validation and drop subprocess (#595)
feat(router): explicit warning if revision is not set (#608)
docs: README: Add logo + baseline (#611)

fix(server): blacklist local files (#609)
Close #589 #602
v0.9.2 (#616)
fix(server): empty_cache when stopped
fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621)
fea(launcher): debug logs (#623)
feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587)
Reworked the quantization script so it's still universal (not llama
specific) but works on more configurations (no need for 2 GPUs, less RAM
usage).
Still need to investigate the potential differences in quantization
results.
feat(server): flash attention v2 (#624)
feat(server): add support for llamav2 (#633)
v0.9.3 (#634)
fix(server): fix llamav2 config (#635)
feat(server): auto max_batch_total_tokens for flash att models (#630)
feat(router): ngrok edge (#642)
docs: Update README.md (#639)
docs: Update README.md (#643)
Add trust_remote_code to quantize script (#647)
Fixes a bug that appeared with MR #587 (which fixed issue #552).
See the discussion in #552.
With MR #587 the trust_remote_code variable is not passed to
AutoModelForCausalLM, although it is present in the function signature.
This prevents models like falcon from being quantized, because
trust_remote_code is required. This MR fixes the issue.
@Narsil
fix(server): llama v2 GPTQ (#648)
As per title & reported
https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5
Test it:
```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```
&
```
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
-H 'Content-Type: application/json'
```
fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661)
fix(server): use mem_get_info to get kv cache size (#664)
Close
https://github.com/huggingface/text-generation-inference/issues/649
Close
https://github.com/huggingface/text-generation-inference/issues/651
Close
https://github.com/huggingface/text-generation-inference/issues/653
Close #636
feat(server): Add exllama GPTQ CUDA kernel support #553 (#666)
Just trying to get the integration tests to pass.
---------
Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>
Directly load GPTBigCode to specified device (#618)
This PR loads GPTBigCode directly to the specified device, avoiding
moving the model between devices.
feat(server): add local prom and health routes if running w/ ngrok
feat: add cuda memory fraction (#659)
Close #673
fix(server): fix exllama buffers (#689)
Close #683
feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671)
- The current PR is not great because we're side-stepping
`Weights.__init__`, but Weights shouldn't require anything related
to the config or the model_id, as it aims to be a simple wrapper
over multi-file loading.
- The ideal solution would be to use something like a Rust enum
```
enum Quantize {
    Bitsandbytes(Bitsandbytes),
    Gptq { bits: usize, groupsize: usize },
}
```
and pass that around during load. Unfortunately we don't
have access to this, so for now, side-stepping seems easier.
- Re-enabling groupsize<0 with exllama (confirmed it works).
Helps #601
In the next steps we should make sure our quantization script uses that
format and make it standard (a sketch of the config-file lookup follows below).
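A hedged sketch of the lookup, assuming the common AutoGPTQ-style `quantize_config.json` fields (`bits`, `group_size`); TGI's actual handling may differ:
```python
# Prefer quantize_config.json when it exists, otherwise fall back to the
# older env-var escape hatch.
import json
import os
from pathlib import Path
from typing import Tuple


def load_gptq_params(model_path: str) -> Tuple[int, int]:
    config_file = Path(model_path) / "quantize_config.json"
    if config_file.exists():
        data = json.loads(config_file.read_text())
        return int(data["bits"]), int(data["group_size"])
    return int(os.environ["GPTQ_BITS"]), int(os.environ["GPTQ_GROUPSIZE"])
```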
docs(README): update readme
fix(server): fix quantization python requirements (#708)
fix(server): fix missing datasets in quantize
feat(server): support new falcon config (#712)
v0.9.4 (#713)
Add section about TGI on other AI hardware accelerators in README (#715)
As per title.
docs: Add hardware section to TOC in README (#721)
feat(server): update vllm version (#723)
chore: update license to HFOIL (#725)
v1.0.0 (#727)
Local gptq support. (#738)
Redoes #719
Fix typing in `Model.generate_token` (#733)
This PR fixes a minor type annotation issue in the signature of
`Model.generate_token`.
All existing overrides of `Model.generate_token` return
`Tuple[List[Generation], Optional[B]]`:
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591
I suspect that back in 017a2a8c when `GeneratedText` and `Generation`
were separated, the function signature was not updated.
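The corrected signature, sketched with placeholder types:
```python
# Sketch of the corrected annotation only; Generation stands in for
# text_generation_server.models.types.Generation.
from typing import Generic, List, Optional, Tuple, TypeVar

B = TypeVar("B")  # the concrete Batch type of each model


class Generation:  # placeholder for the real dataclass
    ...


class Model(Generic[B]):
    def generate_token(self, batch: B) -> Tuple[List[Generation], Optional[B]]:
        raise NotImplementedError
```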
CC @OlivierDehaene
Adding Rope scaling. (#741)
- Adds RoPE NTK scaling.
Done because
https://github.com/huggingface/text-generation-inference/pull/529 was
closed.
Took some code from
https://github.com/huggingface/transformers/pull/24653
- `--rope-scaling` and `--rope-factor` are added separately. I
considered having a single flag and parsing something like "linear:4.0"
or "dynamic", but decided against
it because it would push more parsing+validation a bit everywhere (both
in the launcher and the server). A sketch of the two scaling modes
follows below.
Fixes #512
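A minimal sketch of the two scaling modes, using the commonly published linear and dynamic-NTK formulas; the TGI kernels and flag plumbing differ in detail:
```python
# Linear scaling divides the positions by the factor; dynamic (NTK) scaling
# rewrites the rotary base once the sequence grows past the trained length.
import torch


def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))


def scaled_cos_sin(dim, seq_len, max_position_embeddings, scaling=None, factor=1.0):
    base = 10000.0
    if scaling == "dynamic" and seq_len > max_position_embeddings:
        # NTK-aware: grow the base so the lowest frequencies stretch to seq_len.
        base = base * ((factor * seq_len / max_position_embeddings) - (factor - 1)) ** (
            dim / (dim - 2)
        )
    inv_freq = rope_inv_freq(dim, base)
    t = torch.arange(seq_len, dtype=torch.float32)
    if scaling == "linear":
        t = t / factor  # positions are simply compressed by the factor
    freqs = torch.outer(t, inv_freq)
    return torch.cos(freqs), torch.sin(freqs)
```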
chore: fix typo in mpt_modeling.py (#737)
Fixed typo.
implemetation -> implementation
Init
fix: cleanup
Add load testing
Refactored gRPC interface
Added validation logic
ValidationError was not correctly handled
Use axum
feat: Docker image
feat: Add AML deployment
Update aml deployment
feat: Improve error handling
feat: Add arguments to CLI
v0.1.0
fix(validation): Fix error messages
feat(router): Add max_waiting_tokens
Create LICENSE (#2)
feat(server): Use safetensors
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
feat(client): Simplify sharded logic
feat(server): Support bitsandbytes
feat(server): Support all AutoModelForCausalLM on a best effort basis
feat: Use json formatter by default in docker image
fix(models): Revert buggy support for AutoModel
feat(server): Support generic AutoModelForCausalLM
feat(server): Support AutoModelForSeq2SeqLM
feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard
feat(server): Improved doc
fix(server): Fix Transformers fork version
feat(server): Clarify CausalLMBatch concatenate method
feat(rust): Update to 1.65
fix(router): Fix HTTP status codes
fix(readme): Typo
fix(router): Handle tokenizer errors
feat(server): Support Galactica (#4)
fix(batching): Avoid theoretical hang in batcher loop (#5)
- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute
Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>
feat(server): Add model tests (#6)
fix(server): Only pad to multiple of 8 on GPUs
feat: Support stop sequences (#7)
feat: Return logprobs (#8)
feat(launcher): Add integration tests (#9)
fix(server): Fix stop sequences (#11)
fix(server): Check for device type correctly when determining initial padding (#16)
AFAIK there is no torch device type called "gpu".
fix(router): Include special tokens when tokenizing (#14)
There's currently a discrepancy in the tokenization between the router
and the python server code: the latter includes special tokens but the
former does not.
This results in a token count mismatch for seq2seq models such as mt0,
where the tokenizer emits an EOS token at the end.
This in turn results in some unexpected/incorrect output, in particular
when batch concatenation is involved, because the python code uses the
input length passed from the router for each row.
As far as I can tell, it is better to include this token in the encoder
`input_ids`, so it's best to just adjust on the router side (illustrated below).
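An illustration of the mismatch (the model id is only an example): a seq2seq tokenizer appends an EOS token, so counting tokens without special tokens under-counts by one versus the server.
```python
# Counting with and without special tokens gives different lengths for
# seq2seq tokenizers that append </s>.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/mt0-base")
text = "translate to French: Hello"

without_special = tokenizer(text, add_special_tokens=False)["input_ids"]
with_special = tokenizer(text, add_special_tokens=True)["input_ids"]

print(len(without_special), len(with_special))  # the second includes the trailing EOS
```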
feat(router): Add const parameters to validation logic (#15)
I noticed some opportunity to collapse some of the logic, in case you
are interested.
fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13)
Fixes #12 in the easiest way I could think of.
feat(launcher): Log server stdout (#19)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
fix(server): Minor refactorization using new_zeros (#24)
- Fix some type hints, in particular base tokenizer class
- Make use of `tensor.new_zero/empty` methods
- Simplify env var string parsing in launcher
fix(router): Obey max batch size (#23)
feat(server): Support SantaCoder (#26)
fix(server): Fix position ids (#28)
feat(docker): Make the image compatible with api-inference (#29)
fix(docker): fix api-inference deployment (#30)
fix(router): fix api-inference deployment (#31)
fix(dockerfile): fix docker build (#32)
feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33)
feat(router): Remove second lock from batcher hot path (#27)
@njhill
feat: Support sampling seeding (#37)
Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>
feat: Add token streaming using ServerSideEvents support (#36)
Add token streaming using ServerSideEvents (SSE).
The signature of the SSE events is:
```rust
struct Details {
finish_reason: String,
generated_tokens: u32,
seed: Option<u64>,
}
struct StreamResponse {
token: Token,
generated_text: Option<String>,
details: Option<Details>,
}
struct ErrorResponse {
error: String,
}
```
Revert "feat: Add token streaming using ServerSideEvents support" (#40)
Reverts huggingface/text-generation-inference#36
fix(server): fix seeding on gpu (#42)
fix(server): fix seeding with multiple shards (#44)
feat: Add token streaming using ServerSideEvents support (#41)
fix(server): fix quantization for sharded models (#45)
feat(server): Support GPT-Neox (#39)
feat(ci): Docker build and push (#46)
feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48)
feat(server): support repetition penalty (#47)
feat(server): allow the server to use a local weight cache (#49)
fix(server): allow greedy repetition penalty (#51)
feat(router): use background task to manage request queue (#52)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
breaking(router): modify /generate API to only return generated text (#50)
@njhill, @yk FYI
generated_text was concatenated to the user prompt for legacy reasons. We
want to remove this behaviour as we don't think it is useful and it is
even detrimental to usability.
We also remove the unused Vec.
feat(router): refactor API and add openAPI schemas (#53)
feat(docs): Clarify installation steps (#54)
Adds some bits for first-time users (like me 😄 )
feat(ci): push to AML registry (#56)
fix(server): better handling of inference mode (#57)
V0.2.1 (#58)
feat(server): support t5 (#59)
fix(docker): increase shm size (#60)
fixed SSE naming (#61)
https://en.wikipedia.org/wiki/Server-sent_events
feat: add distributed tracing (#62)
feat: add safetensors conversion (#63)
feat(server): improve download logging (#66)
feat(launcher): add disable_custom_kernels arg (#67)
feat(router): add max_total_tokens and empty_input validation (#68)
closes #65
fix(launcher): copy current env vars to subprocesses (#70)
closes #69
feat(router): add prometheus metrics scrape endpoint (#71)
v0.3.0 (#72)
feat(router): add cors allow origin options (#73)
feat(server): enable hf-transfer (#76)
fix(server): remove position_ids from galactica forward (#82)
closes #80
feat(server): pre-allocate max attention mask (#75)
v0.3.1 (#84)
feat(server): add special token bool (#85)
fix(docs): fix openapi schema (#86)
fix(server): fix token_is_special (#87)
feat(router): add legacy route for api-inference support (#88)
feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89)
feat(router): add api-inference headers (#91)
feat(server): add logits watermark (#90)
feat(server): update to hf_transfer==0.1.2 (#93)
feat(ci): improve CI speed (#94)
fix(launcher): add router parameters to launcher (#95)
feat(server): fix transformers commit (#96)
v0.3.2 (#97)
fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100)
feat: allow local models (#101)
closes #99
feat: add supported models (#102)
feat(clients): Python client (#103)
fix(server): fix galactica batch (#106)
closes #105
feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107)
feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108)
fix(python-client): stream not set on the sync client (#109)
fix(server): fix index out of range for watermarking (#110)
feat: support typical sampling (#114)
closes #112
fix(server): do not warp prefill logits (#116)
feat(router): support left truncation (#115)
closes #111
feat(router): add best_of parameter (#117)
feat(python-client): add new parameters (#118)
v0.4.0 (#119)
feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122)
…ed models
fix(server): revert gpt-neox optims (#123)
fix(server): add position ids to neox (#126)
fix(server): use server tokenizer as gt (#128)
fix(python-client): relax dependencies (#129)
feat(python-client): add cookies to Client constructors and requests (#132)
I have a use case where we need to pass cookies (for auth reasons) to an
internally hosted server.
Note: I couldn't get the client tests to pass - do you need to have an
HF token?
```python
FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid
```
feat(ci): add ci paths (#134)
feat: Add note about NVIDIA drivers (#64)
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
feat(python-client): release v0.4.0 (#135)
feat(python-client): add CI (#136)
feat(server): flash neoX (#133)
fix(server): fix flash-neox scores warping (#137)
feat(server): cleanup flash neox loading (#139)
v0.4.1 (#140)
fix(server): Avoid using try/except to determine kind of AutoModel (#142)
feat(server): Add mypy-protobuf (#141)
Generates .pyi files for protobuf stubs which provide strong typing
information. Very helpful for IDE auto-completion, etc.
feat(server): clear cache on error (#143)
feat(server): reduce mlp and attn in one op for flash neox (#145)
feat: aws sagemaker compatible image (#147)
The only difference is that now it pushes to
registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:...
instead of
registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-...
---------
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
fix(ci): fix sagemaker action (#148)
feat(benchmark): tui based benchmarking tool (#149)
fix(server): fix flash neox rotary embeddings (#150)
v0.4.2 (#151)
v0.4.3 (#152)
feat(server): flash santacoder (#153)
docs(readme): provide link Logits Warper README (#154)
fix(server): fix escape characters in stop sequence (#155)
feat(docker): improve flash_attention caching (#160)
feat(launcher): allow disabling hf_transfer (#161)
fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162)
fix(router): use buckets for metrics histograms (#163)
feat(router): make router input validation optional (#164)
feat(server): add flash attention llama (#144)
feat(server): support OPT models (#55)
OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.
v0.5.0 (#168)
feat(server): optimize decode for sane tokenizers (#170)
feat(server): support sharded santacoder (#167)
fix(launcher): revert change on shard errors (#173)
fix(ci): fix CVE in github-slug-action (#174)
feat(ci): add image signing with cosign (#175)
feat(ci): add Trivy and scan docker image (#178)
feat(ci): use large runners (#179)
feat(ci): faster scanning (#180)
fix(ci): fix ci permissions (#181)
fea(dockerfile): better layer caching (#159)
fix(ci): fix cosign error (#183)
fix(docker): fix docker image (#184)
fix(docker): fix image (#185)
fix(docker): revert dockerfile changes (#186)
fix(docker): fix docker image dependencies (#187)
fix(router): fix truncation (#190)
closes #189
feat(python-client): get list of currently deployed tgi models using the inference API (#191)
feat(router): add info route (#196)
close #125
feat(server): support quantization for flash models (#200)
closes #197
feat(server): check cuda capability when importing flash models (#201)
close #198
fix(server): fix hf_transfer issue with private repos (#203)
fix(docker): remove unused dependencies (#205)
fix(router): add auth token to get model info (#207)
feat(router): add git sha to info route (#208)
feat(router): drop requests when client closes the channel (#202)
fix(ci): fix sha in docker image (#212)
feat(server): flash attention past key value optimizations (#213)
feat(router): add device and dtype info (#215)
fix(server): fix past key values logic (#216)
@njhill fyi
fix(server): cleanup new flash past_key_values logic (#217)
fix(server): fix flash causal (#218)
fix(server): fix flash causal (#219)
fix(server): fix flash batch filtering (#220)
misc: update to rust 1.69 (#221)
v0.6.0 (#222)
feat(server): reduce memory requirement (#214)
chore(server): update huggingface-hub (#227)
feat(router): use number of tokens in batch as input for dynamic batching (#226)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
feat(router): add endpoint info to /info route (#228)
chore(server): update safetensors version (#235)
fix(python-client): add auth headers to is supported requests (#234)
Starting some routing tests. (#233)
fix(benchmarking): fix benchmarking tool
chore(launcher): refactor logic (#242)
Hopefully it's cleaner
feat(router): add tests to validation (#237)
feat(router): new healthcheck that skips the queue (#244)
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
fix(server): fix reshaping of bloom past_key_values in concatenate() (#252)
Introduced in #214
Fixes #249
fix(server): Small tidy of code from recent changes (#251)
remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter()
chore(server): update transformers (#250)
feat(server): add watermarking tests (#248)
feat(docker): add nvidia env vars (#255)
doc(launcher): add more docs to the `launcher` itself and link in the README (#257)
feat(benchmark): add support for private tokenizers (#262)
Adding docs on how dynamic batching works. (#258)
This PR adds the minimal amount of explanation I could think
of. It tries to explain how dynamic batching occurs and how it interacts
with past key values, while ignoring the padding problem.
Maybe some drawings could help too, but I kept it to text for now.
chore(github): add templates (#264)
fix(server): fix typo in tokenizers decode (#269)
closes #268
feat(server): support hf endpoint weight layout (#266)
fix(launcher): pass weights cache override to the download process (#274)
closes #273
fix(launcher): handle hub branches (#278)
fix(server): Removes the parallelism in file conversion (during download) (#275)
feat(launcher): Improve error message when download process fails. (#276)
fix(server): fix convert (#284)
chore: add `flash-attention` to docker ignore (#287)
Avoids including it when building Docker locally
(where the local dirs might have the flash-attention folder).
fea(server): decrease convert RAM requirements (#286)
fix(dockerfile): fix nvidia env vars (#297)
Fixes #291
feat(router): Adding response schema for compat_generate (#292)
feat(docker): add benchmarking tool to docker image (#298)
fix(docker): fix docker build (#299)
feat(server): optim flash causal lm decode_token (#285)
fix(docker): fix nvidia env vars (#305)
fix(docker): remove nvidia require cuda env (#310)
feat(server): shard token decode (#303)
feat(server): use float16 (#304)
fix(docker): remove CUDA_VERSION
feat(server): use cuda graph in logits warping (#302)
fix(server): fix multinomial implem in Sampling
feat(server): GPTQ quantization (step1) (#277)
Changes only the type from `bool` to `Option<Enum>` pretty much
everywhere.
- Use `Optional[str]` in Python (easier to manage than importing the type
everywhere), except for the CLI, to get proper validation
- Updated all models to handle the new values gracefully (error out if the
value is unknown, or if it is gptq since that is not implemented yet).
chore(docker): use nvidia base image (#318)
fix(docker): remove quantize default
fix(docker): use ubuntu20.04
Hotfixes for santacoder/bigcode. (#294)
Hotfixes:
- Uses `model_type`=`gpt_bigcode` for more general usage.
- Hotfixes the linked lm_head vs wte_embedding (the safetensors file does
not contain the key, which is correct when the file is sharded, whereas
pytorch copies the tensor); a sketch of the fallback follows below.
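A hedged sketch of the tied-weights fallback (names follow the usual gpt_bigcode layout; not the exact TGI code). `weights.routing` refers to the name-to-file mapping from the Weights sketch earlier.
```python
# Reuse the word embedding when the sharded safetensors files do not
# contain lm_head.weight.
def load_lm_head(weights):
    if "lm_head.weight" in weights.routing:
        return weights.get_tensor("lm_head.weight")
    # Tied embeddings: safetensors correctly stores the tensor only once.
    return weights.get_tensor("transformer.wte.weight")
```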
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Lifting check_unitialized. (#325)
Removing dead variables. (#327)
feat(ci): custom gpu runners (#328)
Single place for TP layers + Dropout Layer Norm + FastLinear (#329)
feat: add snapshot testing (#282)
feat(integration-tests): improve comparison and health checks (#336)
fix(server): fix decode token (#334)
Fixes #333
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
fix: set MODEL_ID in sagemaker-entrypoint script (#343)
feat(server): Support BLOOMChat-176B (#348) (#351)
@njhill,
temporary workaround to be able to run our CI as secrets are not
available to runners run by external contributors. I will ask around to
see if there is a better way.
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
fix(server): fix init for flash causal lm (#352)
Fixes #347
fix(server): t5 cannot run in f16 (#356)
Fix #349
fix(ci): fix security group (#359)
Switch security group used for ci
(open outbound rules)
Signed-off-by: Raphael <oOraph@users.noreply.github.com>
Co-authored-by: Raphael <oOraph@users.noreply.github.com>
feat: add nightly load testing (#358)
chore(sever): update requirements (#357)
Fixes #338
feat(server): support fp16 for t5 (#360)
Fixes #349
feat(server): do not use device_map auto on single GPU (#362)
feat(server): support trust_remote_code (#363)
feat(router): log input/ouput at debug level (#364)
@njhill FYI
v0.7.0 (#353)
feat: decrease IPC proto size (#367)
Closes #307 #308
feat(benchmarker): add summary tables (#368)
feat(server): support vectorized warpers in flash causal lm (#317)
Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>
Fix issue when load AutoModelForSeq2SeqLM model (#370)
fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES
fix(server): fix quantization
feat(server): support RefinedWeb models (#379)
v0.8.0
increase health checks
feat(server): add retry on download (#384)
fix(server): fix bnb quantization for CausalLM models (#385)
v0.8.1
fix(server): fix has_position_ids (#395)
Fix #389
feat(server): remove trust_remote_code requirement for falcon models (#396)
feat(server): load santacoder/starcoder models with safetensors (#393)
Fix #366
v0.8.2
feat(sagemaker): add trust remote code to entrypoint (#394)
feat(launcher): parse oom signal (#404)
feat(server): only compute prefill logprobs when asked (#406)
Close #288
feat(server): batch tokenization for flash causal lm (#411)
chore: update openapi schema
feat(server): Rework model loading (#344)
Reworked the loading logic. The idea is to use cleaner loading code:
- Remove the need for `no_init_weights`
- Remove all the weird `bnb_linear`, `load_weights` and `post_load_weights` helpers.
New code layout:
- New class `Weights` in charge of loading the weights from multiple files into the appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code that knows what kind of sharding we need plus the eventual `all_reduce`. They do not inherit from Linear, but they contain some kind of Linear instead
- The contained linear can be either FastLinear, BnbLinear or, next, a GPTQ Linear.
- All modeling code is explicitly written for sharding; the process group is just a no-op for non-sharded code (removes a lot of test cases)

---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
feat(server): optimize dist ops (#434)
docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441)
Fixes a typo in the comment sections referencing the environment variable `CUDA_VISIBLE_DEVICES`. No misspelled references to this variable were found in code logic that could lead to undefined behaviour or bugs, so this PR does not modify any code logic.
fix(makefile): Fix typo and use POSIX comparison in the makefile (#443)
This PR fixes:
- The usage of a non-POSIX comparison, which may fail depending on the shell used (`=` always works, `==` only with bash)
- A typo in the env variable name displayed in the error message: `BUILD_EXTENSION` instead of `BUILD_EXTENSIONS`
Fixes #422
feat(server): pre-allocate past key values for flash causal LM (#412)
feat(router): add ngrok integration (#453)
feat(server): improve flash attention import errors (#465)
@lewtun, is this enough?
Closes #458
Closes #456
fix(server): fix warpers on CPU (#472)
Closes #471
fix(server): Fixing T5 in case the names are mixed up. (#475)
feat(server): Update convert logic. (#483)
Should be more robust to shared tensors (which are fine when using `from_pretrained`), but it forces us to add new checks in our loading code (since the key chosen to keep might be different from `transformers`).
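Roughly, being "robust to shared tensors" means grouping state-dict entries that alias the same storage and keeping a single key per group before writing safetensors. A minimal sketch (our own illustration, not the actual convert code):
```python
from collections import defaultdict

import torch


def discard_shared_tensors(state_dict: dict) -> dict:
    """Keep one key per underlying storage, since safetensors refuses aliases."""
    by_storage = defaultdict(list)
    for name, tensor in state_dict.items():
        by_storage[tensor.data_ptr()].append(name)

    kept = {}
    for names in by_storage.values():
        chosen = sorted(names)[0]  # deterministic choice; loaders must re-tie the rest
        kept[chosen] = state_dict[chosen]
    return kept
```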
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
feat(server): Adding new ignore_rule for conversion. (#485)
fix(router): add timeout on flume sends (#488)
feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438)
Let's start discussing implementation.
- Need to expose the quantization scripts (either included here or add
doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).
Currently it means that every place where we use `get_{tensor|sharded}` needs to check for quantization.
My idea is to reintegrate as much as possible into `utils/layer.py` by
expanding `load_multi` to be a bit more generic.
This might require some thinking, but ultimately the
`qweight,qzeros,scales,g_idx` tensors should live in a single place,
independent of bias presence.
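For illustration, keeping those four tensors together could look like the following hypothetical container (named by us, not taken from the codebase):
```python
from dataclasses import dataclass

import torch


@dataclass
class GPTQWeight:
    # The four tensors GPTQ needs for one linear layer, plus the metadata
    # required to dequantize them; bias handling stays outside this container.
    qweight: torch.Tensor
    qzeros: torch.Tensor
    scales: torch.Tensor
    g_idx: torch.Tensor
    bits: int
    groupsize: int
```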
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
fix(server): Do not init process group if already initialized (#388)
feat(router): add header option to disable buffering for the generate_stream response (#498)
Adds a header option to disable buffering for the generate_stream endpoint response stream.
Problem: If a model is run behind a proxy server such as nginx that has
buffering enabled then the response stream from generate_stream gets
aggregated into a single response which basically disables streaming.
Instead of getting a chunked response where each token is presented over
time the response presents everything all at once.
Solution: This change adds the `X-Accel-Buffering` http header which
disables buffering for the generate_stream response, allowing the
response to stream properly.
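The mechanism is generic. A sketch with a Python SSE endpoint (illustration only; the actual change lives in the Rust router):
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


@app.get("/generate_stream")
def generate_stream():
    def event_source():
        for token in ["Hello", " ", "world"]:
            yield f"data: {token}\n\n"

    # "X-Accel-Buffering: no" tells nginx not to buffer this response,
    # so each token is flushed to the client as soon as it is produced.
    return StreamingResponse(
        event_source(),
        media_type="text/event-stream",
        headers={"X-Accel-Buffering": "no"},
    )
```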
feat(server): add paged attention to flash models (#516)
Closes #478
feat(router): arg validation (#519)
feat: Add the option to force another dtype than `f16`. (#513)
fix(launcher): fix issue where launcher does not properly report shard failures (#522)
v0.9.0 (#525)
feat(server): Add Non flash MPT. (#514)
This adds a non-flash version of MPT.
Flash is harder because we need to create a bias-ready CUDA kernel for
flash attention.
Fixes
https://github.com/huggingface/text-generation-inference/issues/361
Fixes
https://github.com/huggingface/text-generation-inference/issues/491
Fixes
https://github.com/huggingface/text-generation-inference/issues/290
fix: Update server/Makefile to include Makefile-vllm (#520)
For consistency and ease of use (you can just run `make` to install vllm
without any extra steps).
docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462)
fix(server): Handle loading from local files for MPT (#534)
This PR allows the MPT model to be loaded from local files. Without this
change, an exception will be thrown by `hf_hub_download` function if
`model_id` is a local path.
fix(server): avoid errors for very small top_p values (#544)
See https://github.com/huggingface/transformers/pull/24111
I didn't add validation to the `__init__` method since it's not done for
other values/warpers.
feat(server): use latest flash attention commit (#543)
@njhill FYI
feat(router): add argument for hostname in router (#545) (#550)
In title. Adds argument `--hostname` in router to support something like
`--hostname ::`. Tested with
```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health' # failed before this commit
```
Trigger CI
---------
Co-authored-by: Phil Chen <philchen2000@gmail.com>
fix(server): decrease memory fragmentation (#557)
v0.9.1 (#558)
fix(server): harden the weights choice to save on disk. (#561)
- Look at `transformers` base class to check for
`_key_to_ignore_on_load_missing` or `_tied_weights` which are the
standard attributes to select the keys to NOT save on disk (since they
are ignored)
- Modified safetensors code (to be reflected in safetensors even if it's
an internal function).
- Will not work for trust_remote_code=True repos (like santacoder).
Should help with :
https://github.com/huggingface/text-generation-inference/issues/555
and : https://github.com/huggingface/text-generation-inference/pull/501
and https://github.com/huggingface/text-generation-inference/issues/556
and
https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593
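A sketch of the selection logic, under the assumption that the model class exposes such attributes as key names or regex patterns (the attribute names below follow common transformers conventions and may differ from the exact ones used here):
```python
import re


def keys_to_skip_patterns(model_class) -> list:
    # These class attributes typically hold key names / regex patterns for
    # weights that transformers ignores or re-ties on load.
    patterns = []
    for attr in ("_keys_to_ignore_on_load_missing", "_tied_weights_keys"):
        patterns.extend(getattr(model_class, attr, None) or [])
    return patterns


def filter_state_dict(state_dict: dict, model_class) -> dict:
    """Drop keys that would be ignored on load, so they are not saved on disk."""
    patterns = keys_to_skip_patterns(model_class)
    return {
        name: tensor
        for name, tensor in state_dict.items()
        if not any(re.search(p, name) for p in patterns)
    }
```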
feat: better errors for warmup and TP (#575)
Close #571
fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579)
Fixes #555
feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580)
Some models are already converted and do not have those values in the
file; this enables users to use them with less friction.
Went for a purely env-based approach because adding flags would end up (imo) very
tedious to maintain. There's a lot of sanitation to do: those flags
would be errors if not used in conjunction with `--quantize gptq`.
Then the flags would need to exist in the launcher and the server, and be passed
throughout all function calls.
This PR is intended as an easy escape hatch, not the de facto method to
use gptq in TGI.
Fixes #500
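A minimal sketch of this escape hatch (the env variable names come from this PR; the surrounding helper is ours):
```python
import os


def gptq_params_from_env():
    """Read GPTQ_BITS / GPTQ_GROUPSIZE, failing loudly if they are malformed."""
    bits = os.getenv("GPTQ_BITS")
    groupsize = os.getenv("GPTQ_GROUPSIZE")
    if bits is None or groupsize is None:
        raise RuntimeError(
            "Model config has no quantization metadata; "
            "set GPTQ_BITS and GPTQ_GROUPSIZE (e.g. 4 and 128)."
        )
    try:
        return int(bits), int(groupsize)
    except ValueError as err:
        raise RuntimeError(f"GPTQ_BITS/GPTQ_GROUPSIZE must be integers: {err}") from err
```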
chore: migrate ci region for more availability. (#581)
fix(server): T5 weights names. (#582)
Fixes #541
fix(server): Adding logger import to t5_modeling.py (#585)
The logger is referenced during the apex import but is never imported,
causing a NameError.
fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590)
This fixes a typo and extends the GPTQ_BITS environment variable passthrough
to the second method, which requires the same logic. Please let
me know if there's anything I've misunderstood in this change.
Thanks @Narsil for the original fix.
feat(server): Implements sharding for non divisible `vocab_size`. (#583)
- The code is relatively easy (just disable the checks on Embedding and
Head).
This cannot be done in the same easy fashion for hidden_dim/head_dim.
It's relatively easy on some models (classic MHA), but it would make the
other models (MQA) much more complex, and GPTQ quantization is another
quite hairy piece of code.
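Concretely, dropping the divisibility check just means each rank takes a ceil-sized slice of the vocabulary, with the last rank possibly shorter. A small sketch (our own helper, not the TGI code):
```python
import math


def vocab_shard_bounds(vocab_size: int, world_size: int, rank: int):
    """Return the [start, stop) rows of the embedding owned by `rank`."""
    block = math.ceil(vocab_size / world_size)
    start = min(rank * block, vocab_size)
    stop = min(start + block, vocab_size)
    return start, stop


# Example: a 32003-token vocab over 4 shards -> slices of 8001/8001/8001/8000 rows.
print([vocab_shard_bounds(32003, 4, r) for r in range(4)])
```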
feat(server): empty cache on errors
GPTQ Env vars: catch correct type of error (#596)
When passing in environment variables like gptq_bits, we still get
errors thrown from TGI because the try/catch block is catching the wrong
type of error. This PR aims to fix that.
@Narsil - let me know if this is how you want this formatted. My Python
is a little shaky, so I hope this syntax is correct.
feat(launcher): add arg validation and drop subprocess (#595)
feat(router): explicit warning if revision is not set (#608)
docs: README: Add logo + baseline (#611)

fix(server): blacklist local files (#609)
Close #589 #602
v0.9.2 (#616)
fix(server): empty_cache when stopped
fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621)
fea(launcher): debug logs (#623)
feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587)
but should work on more configurations (no need for 2 GPUs, less RAM
usage).
Still need to investigate the potential differences in quantization
results.
feat(server): flash attention v2 (#624)
feat(server): add support for llamav2 (#633)
v0.9.3 (#634)
fix(server): fix llamav2 config (#635)
feat(server): auto max_batch_total_tokens for flash att models (#630)
feat(router): ngrok edge (#642)
docs: Update README.md (#639)
docs: Update README.md (#643)
Add trust_remote_code to quantize script (#647)
Fixes a bug that appeared with MR #587, which fixed issue #552.
See the discussion in #552.
With MR #587 the trust_remote_code variable is no longer passed to
AutoModelForCausalLM, even though it is present in the function signature. This
prevents models like Falcon from being quantized, because trust_remote_code
is required. This MR fixes the issue.
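The fix boils down to actually forwarding the flag, roughly (illustrative call, not the exact quantize script):
```python
from transformers import AutoModelForCausalLM


def load_for_quantization(model_id: str, trust_remote_code: bool = False):
    # Forward the flag instead of silently dropping it; otherwise models such
    # as Falcon (which ship custom modeling code) cannot be loaded.
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",
        trust_remote_code=trust_remote_code,
    )
```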
@Narsil
fix(server): llama v2 GPTQ (#648)
As per title & reported
https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5
Test it:
```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```
&
```
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
-H 'Content-Type: application/json'
```
fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661)
fix(server): use mem_get_info to get kv cache size (#664)
Close
https://github.com/huggingface/text-generation-inference/issues/649
Close
https://github.com/huggingface/text-generation-inference/issues/651
Close
https://github.com/huggingface/text-generation-inference/issues/653
Close #636
feat(server): Add exllama GPTQ CUDA kernel support #553 (#666)
Just trying to get the integration tests to pass.
---------
Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>
Directly load GPTBigCode to specified device (#618)
This PR loads GPTBigCode directly onto the specified device, avoiding
moving the model between devices.
feat(server): add local prom and health routes if running w/ ngrok
feat: add cuda memory fraction (#659)
Close #673
fix(server): fix exllama buffers (#689)
Close #683
feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671)
- The current PR is not great because we're side-stepping
`Weights.__init__`, but `Weights` shouldn't require anything related
to the config or the model_id, as it aims to be a simple wrapper
over multi-file loading.
- The ideal solution would be to use something like a Rust enum
```
enum Quantize {
    Bitsandbytes(Bitsandbytes),
    Gptq { bits: usize, groupsize: usize },
}
```
and pass that around during load. Unfortunately we don't
have access to this, so for now, side-stepping seems easier.
- Re-enabling groupsize<0 with exllama (confirmed it works).
Helps #601
In next steps we should make sure our quantization script uses that
format and make it standard.
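Reading the parameters from the repo instead of env vars could look roughly like this (a sketch; the `quantize_config.json` field names follow the common GPTQ convention and may differ between repos):
```python
import json

from huggingface_hub import hf_hub_download


def load_gptq_params(model_id: str, revision: str = "main"):
    path = hf_hub_download(model_id, filename="quantize_config.json", revision=revision)
    with open(path) as f:
        config = json.load(f)
    # Common GPTQ convention: "bits" and "group_size" (-1 means no grouping).
    return config["bits"], config["group_size"]
```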
docs(README): update readme
fix(server): fix quantization python requirements (#708)
fix(server): fix missing datasets in quantize
feat(server): support new falcon config (#712)
v0.9.4 (#713)
Add section about TGI on other AI hardware accelerators in README (#715)
As per title.
docs: Add hardware section to TOC in README (#721)
feat(server): update vllm version (#723)
chore: update license to HFOIL (#725)
v1.0.0 (#727)
Local gptq support. (#738)
Redoes #719
Fix typing in `Model.generate_token` (#733)
This PR fixes a minor type annotation issue in the signature of
`Model.generate_token`.
All existing overrides of `Model.generate_token` return
`Tuple[List[Generation], Optional[B]]`:
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591
I suspect that back in 017a2a8c when `GeneratedText` and `Generation`
were separated, the function signature was not updated.
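For reference, the corrected annotation amounts to something like this simplified sketch:
```python
from typing import Generic, List, Optional, Tuple, TypeVar


class Generation:
    """Placeholder for the per-request generation result."""


B = TypeVar("B")  # the model-specific batch type


class Model(Generic[B]):
    def generate_token(self, batch: B) -> Tuple[List[Generation], Optional[B]]:
        # One decoding step: returns the generations produced for this step plus
        # the surviving batch, or None once every request in the batch finished.
        raise NotImplementedError
```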
CC @OlivierDehaene
Adding Rope scaling. (#741)
- Adds RoPE NTK scaling.
Done because
https://github.com/huggingface/text-generation-inference/pull/529 was
closed.
Took some code from
https://github.com/huggingface/transformers/pull/24653
- `--rope-scaling` and `--rope-factor` are added separately. I
considered having a single flag and parsing something like ("linear:4.0",
or "dynamic") but decided against
it because it would push more parsing+validation a bit everywhere (both
in the launcher and the server).
Fixes #512
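For context, the two scaling modes behave roughly as follows (our own sketch of the math, not the TGI implementation; parameter names are illustrative):
```python
from typing import Optional

import torch


def scaled_inv_freq(
    dim: int,
    base: float = 10000.0,
    scaling: Optional[str] = None,  # None, "linear" or "dynamic"
    factor: float = 1.0,
    seq_len: int = 0,
    max_position_embeddings: int = 2048,
) -> torch.Tensor:
    if scaling == "dynamic" and seq_len > max_position_embeddings:
        # NTK-aware: grow the rotary base once the sequence exceeds the trained length.
        base = base * (
            (factor * seq_len / max_position_embeddings) - (factor - 1)
        ) ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    if scaling == "linear":
        # Position interpolation: squeeze positions back into the trained range.
        inv_freq = inv_freq / factor
    return inv_freq
```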
chore: fix typo in mpt_modeling.py (#737)
Fixed typo.
implemetation -> implementation
Init
fix: cleanup
Add load testing
Refactored gRPC interface
Added validation logic
ValidationError was not correctly handled
Use axum
feat: Docker image
feat: Add AML deployment
Update aml deployment
feat: Improve error handling
feat: Add arguments to CLI
v0.1.0
fix(validation): Fix error messages
feat(router): Add max_waiting_tokens
Create LICENSE (#2)
feat(server): Use safetensors
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
feat(client): Simplify sharded logic
feat(server): Support bitsandbytes
feat(server): Support all AutoModelForCausalLM on a best effort basis
feat: Use json formatter by default in docker image
fix(models): Revert buggy support for AutoModel
feat(server): Support generic AutoModelForCausalLM
feat(server): Support AutoModelForSeq2SeqLM
feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard
feat(server): Improved doc
fix(server): Fix Transformers fork version
feat(server): Clarify CausalLMBatch concatenate method
feat(rust): Update to 1.65
fix(router): Fix HTTP status codes
fix(readme): Typo
fix(router): Handle tokenizer errors
feat(server): Support Galactica (#4)
fix(batching): Avoid theoretical hang in batcher loop (#5)
- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute
Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>
feat(server): Add model tests (#6)
fix(server): Only pad to multiple of 8 on GPUs
feat: Support stop sequences (#7)
feat: Return logprobs (#8)
feat(launcher): Add integration tests (#9)
fix(server): Fix stop sequences (#11)
fix(server): Check for device type correctly when determining initial padding (#16)
AFAIK there is no torch device type called "gpu".
fix(router): Include special tokens when tokenizing (#14)
There's currently a discrepancy in the tokenization between the router
and the python server code: the latter includes special tokens but the former
does not.
This results in a token count mismatch for seq2seq models such as mt0,
where the tokenizer emits an EOS token at the end.
This in turn results in some unexpected/incorrect output, in particular
when batch concatenation is involved, because the python code uses the
input length passed from the router for each row.
As far as I can tell, it is better to include this token in the encoder
`input_ids`, so I guess it's best to just adjust on the router side.
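The mismatch is easy to reproduce with a seq2seq tokenizer (the model id below is just an example):
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/mt0-small")

without = tok("Hello world", add_special_tokens=False)["input_ids"]
with_eos = tok("Hello world", add_special_tokens=True)["input_ids"]

# The second list is one token longer (trailing </s>), which is what the
# router must account for when it reports input lengths to the server.
print(len(without), len(with_eos))
```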
feat(router): Add const parameters to validation logic (#15)
I noticed some opportunity to collapse some of the logic, in case you
are interested.
fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13)
Fixes #12 in the easiest way I could think of.
feat(launcher): Log server stdout (#19)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
fix(server): Minor refactorization using new_zeros (#24)
- Fix some type hints, in particular base tokenizer class
- Make use of `tensor.new_zero/empty` methods
- Simplify env var string parsing in launcher
fix(router): Obey max batch size (#23)
feat(server): Support SantaCoder (#26)
fix(server): Fix position ids (#28)
feat(docker): Make the image compatible with api-inference (#29)
fix(docker): fix api-inference deployment (#30)
fix(router): fix api-inference deployment (#31)
fix(dockerfile): fix docker build (#32)
feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33)
feat(router): Remove second lock from batcher hot path (#27)
@njhill
feat: Support sampling seeding (#37)
Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>
feat: Add token streaming using ServerSideEvents support (#36)
Add token streaming using ServerSideEvents (SSE).
The signature of the SSE events is:
```rust
struct Details {
finish_reason: String,
generated_tokens: u32,
seed: Option<u64>,
}
struct StreamResponse {
token: Token,
generated_text: Option<String>,
details: Option<Details>,
}
struct ErrorResponse {
error: String,
}
```
Revert "feat: Add token streaming using ServerSideEvents support" (#40)
Reverts huggingface/text-generation-inference#36
fix(server): fix seeding on gpu (#42)
fix(server): fix seeding with multiple shards (#44)
feat: Add token streaming using ServerSideEvents support (#41)
fix(server): fix quantization for sharded models (#45)
feat(server): Support GPT-Neox (#39)
feat(ci): Docker build and push (#46)
feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48)
feat(server): support repetition penalty (#47)
feat(server): allow the server to use a local weight cache (#49)
fix(server): allow greedy repetition penalty (#51)
feat(router): use background task to manage request queue (#52)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
breaking(router): modify /generate API to only return generated text (#50)
@njhill, @yk FYI
generated_text was concatenated to the user prompt for legacy reasons. We
want to remove this behaviour as we don't think it is useful and it is even
detrimental to usability.
We also remove the unused Vec.
feat(router): refactor API and add openAPI schemas (#53)
feat(docs): Clarify installation steps (#54)
Adds some bits for first-time users (like me 😄 )
feat(ci): push to AML registry (#56)
fix(server): better handling of inference mode (#57)
V0.2.1 (#58)
feat(server): support t5 (#59)
fix(docker): increase shm size (#60)
fixed SSE naming (#61)
https://en.wikipedia.org/wiki/Server-sent_events
feat: add distributed tracing (#62)
feat: add safetensors conversion (#63)
feat(server): improve download logging (#66)
feat(launcher): add disable_custom_kernels arg (#67)
feat(router): add max_total_tokens and empty_input validation (#68)
closes #65
fix(launcher): copy current env vars to subprocesses (#70)
closes #69
feat(router): add prometheus metrics scrape endpoint (#71)
v0.3.0 (#72)
feat(router): add cors allow origin options (#73)
feat(server): enable hf-transfer (#76)
fix(server): remove position_ids from galactica forward (#82)
closes #80
feat(server): pre-allocate max attention mask (#75)
v0.3.1 (#84)
feat(server): add special token bool (#85)
fix(docs): fix openapi schema (#86)
fix(server): fix token_is_special (#87)
feat(router): add legacy route for api-inference support (#88)
feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89)
feat(router): add api-inference headers (#91)
feat(server): add logits watermark (#90)
feat(server): update to hf_transfer==0.1.2 (#93)
feat(ci): improve CI speed (#94)
fix(launcher): add router parameters to launcher (#95)
feat(server): fix transformers commit (#96)
v0.3.2 (#97)
fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100)
feat: allow local models (#101)
closes #99
feat: add supported models (#102)
feat(clients): Python client (#103)
fix(server): fix galactica batch (#106)
closes #105
feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107)
feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108)
fix(python-client): stream not set on the sync client (#109)
fix(server): fix index out of range for watermarking (#110)
feat: support typical sampling (#114)
closes #112
fix(server): do not warp prefill logits (#116)
feat(router): support left truncation (#115)
closes #111
feat(router): add best_of parameter (#117)
feat(python-client): add new parameters (#118)
v0.4.0 (#119)
feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122)
fix(server): revert gpt-neox optims (#123)
fix(server): add position ids to neox (#126)
fix(server): use server tokenizer as gt (#128)
fix(python-client): relax dependencies (#129)
feat(python-client): add cookies to Client constructors and requests (#132)
I have a use case where we need to pass cookies (for auth reasons) to an
internally hosted server.
Note: I couldn't get the client tests to pass - do you need to have an
HF token?
```python
FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid
```
feat(ci): add ci paths (#134)
feat: Add note about NVIDIA drivers (#64)
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
feat(python-client): release v0.4.0 (#135)
feat(python-client): add CI (#136)
feat(server): flash neoX (#133)
fix(server): fix flash-neox scores warping (#137)
feat(server): cleanup flash neox loading (#139)
v0.4.1 (#140)
fix(server): Avoid using try/except to determine kind of AutoModel (#142)
feat(server): Add mypy-protobuf (#141)
Generates .pyi files for protobuf stubs which provide strong typing
information. Very helpful for IDE auto-completion, etc.
feat(server): clear cache on error (#143)
feat(server): reduce mlp and attn in one op for flash neox (#145)
feat: aws sagemaker compatible image (#147)
The only difference is that now it pushes to
registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:...
instead of
registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-...
---------
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
fix(ci): fix sagemaker action (#148)
feat(benchmark): tui based benchmarking tool (#149)
fix(server): fix flash neox rotary embeddings (#150)
v0.4.2 (#151)
v0.4.3 (#152)
feat(server): flash santacoder (#153)
docs(readme): provide link Logits Warper README (#154)
fix(server): fix escape characters in stop sequence (#155)
feat(docker): improve flash_attention caching (#160)
feat(launcher): allow disabling hf_transfer (#161)
fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162)
fix(router): use buckets for metrics histograms (#163)
feat(router): make router input validation optional (#164)
feat(server): add flash attention llama (#144)
feat(server): support OPT models (#55)
OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.
v0.5.0 (#168)
feat(server): optimize decode for sane tokenizers (#170)
feat(server): support sharded santacoder (#167)
fix(launcher): revert change on shard errors (#173)
fix(ci): fix CVE in github-slug-action (#174)
feat(ci): add image signing with cosign (#175)
feat(ci): add Trivy and scan docker image (#178)
feat(ci): use large runners (#179)
feat(ci): faster scanning (#180)
fix(ci): fix ci permissions (#181)
fea(dockerfile): better layer caching (#159)
fix(ci): fix cosign error (#183)
fix(docker): fix docker image (#184)
fix(docker): fix image (#185)
fix(docker): revert dockerfile changes (#186)
fix(docker): fix docker image dependencies (#187)
fix(router): fix truncation (#190)
closes #189
feat(python-client): get list of currently deployed tgi models using the inference API (#191)
feat(router): add info route (#196)
close #125
feat(server): support quantization for flash models (#200)
closes #197
feat(server): check cuda capability when importing flash models (#201)
close #198
fix(server): fix hf_transfer issue with private repos (#203)
fix(docker): remove unused dependencies (#205)
fix(router): add auth token to get model info (#207)
feat(router): add git sha to info route (#208)
feat(router): drop requests when client closes the channel (#202)
fix(ci): fix sha in docker image (#212)
feat(server): flash attention past key value optimizations (#213)
feat(router): add device and dtype info (#215)
fix(server): fix past key values logic (#216)
@njhill fyi
fix(server): cleanup new flash past_key_values logic (#217)
fix(server): fix flash causal (#218)
fix(server): fix flash causal (#219)
fix(server): fix flash batch filtering (#220)
misc: update to rust 1.69 (#221)
v0.6.0 (#222)
feat(server): reduce memory requirement (#214)
chore(server): update huggingface-hub (#227)
feat(router): use number of tokens in batch as input for dynamic batching (#226)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
feat(router): add endpoint info to /info route (#228)
chore(server): update safetensors version (#235)
fix(python-client): add auth headers to is supported requests (#234)
Starting some routing tests. (#233)
fix(benchmarking): fix benchmarking tool
chore(launcher): refactor logic (#242)
Hopefully it's cleaner
feat(router): add tests to validation (#237)
feat(router): new healthcheck that skips the queue (#244)
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
fix(server): fix reshaping of bloom past_key_values in concatenate() (#252)
Introduced in #214
Fixes #249
fix(server): Small tidy of code from recent changes (#251)
remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter()
chore(server): update transformers (#250)
feat(server): add watermarking tests (#248)
feat(docker): add nvidia env vars (#255)
doc(launcher): add more docs to the `launcher` itself and link in the README (#257)
feat(benchmark): add support for private tokenizers (#262)
Adding docs on how dynamic batching works. (#258)
This PR adds the minimal amount of explanation I could think
of. It tries to explain how dynamic batching occurs and its interactions
with past key values, and it ignores the padding problem.
Maybe some drawings could help too, but I kept it to text for now.
chore(github): add templates (#264)
fix(server): fix typo in tokenizers decode (#269)
closes #268
feat(server): support hf endpoint weight layout (#266)
fix(launcher): pass weights cache override to the download process (#274)
closes #273
fix(launcher): handle hub branches (#278)
fix(server): Removes the parallelism in file convertion (during download) (#275)
feat(launcher): Improve error message when download process fails. (#276)
fix(server): fix convert (#284)
chore: add `flash-attention` to docker ignore (#287)
Otherwise it gets included when building docker locally
(where the local dirs might have the flash-attention folder).
feat(server): decrease convert RAM requirements (#286)
fix(dockerfile): fix nvidia env vars (#297)
Fixes #291
feat(router): Adding response schema for compat_generate (#292)
feat(docker): add benchmarking tool to docker image (#298)
fix(docker): fix docker build (#299)
feat(server): optim flash causal lm decode_token (#285)
fix(docker): fix nvidia env vars (#305)
fix(docker): remove nvidia require cuda env (#310)
feat(server): shard token decode (#303)
feat(server): use float16 (#304)
fix(docker): remove CUDA_VERSION
feat(server): use cuda graph in logits warping (#302)
fix(server): fix multinomial implem in Sampling
feat(server): GPTQ quantization (step1) (#277)
Changes only the type from `bool` to `Option<Enum>` pretty much
everywhere.
- Use `Optional[str]` in Python (easier to manage than importing the type everywhere), except for the CLI, to get proper validation
- Updated all models to gracefully handle the new values (error out if the value is unknown, or if it is gptq since that is not implemented yet).
chore(docker): use nvidia base image (#318)
fix(docker): remove quantize default
fix(docker): use ubuntu20.04
Hotfixes for santacoder/bigcode. (#294)
Hotfixes:
- Uses `model_type`=`gpt_bigcode` for more general usage.
- Hotfixes the linked lm_head vs wte_embedding (the safetensors file does not contain the key, correctly so when the file is sharded, whereas pytorch copies the tensor)
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Lifting check_unitialized. (#325)
Removing dead variables. (#327)
feat(ci): custom gpu runners (#328)
Single place for TP layers + Dropout Layer Norm + FastLinear (#329)
feat: add snapshot testing (#282)
feat(integration-tests): improve comparison and health checks (#336)
fix(server): fix decode token (#334)
Fixes #333
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
fix: set MODEL_ID in sagemaker-entrypoint script (#343)
feat(server): Support BLOOMChat-176B (#348) (#351)
@njhill,
temporary workaround to be able to run our CI as secrets are not
available to runners run by external contributors. I will ask around to
see if there is a better way.
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
fix(server): fix init for flash causal lm (#352)
Fixes #347
fix(server): t5 cannot run in f16 (#356)
Fix #349
fix(ci): fix security group (#359)
Switch security group used for ci
(open outbound rules)
Signed-off-by: Raphael <oOraph@users.noreply.github.com>
Co-authored-by: Raphael <oOraph@users.noreply.github.com>
feat: add nightly load testing (#358)
chore(server): update requirements (#357)
Fixes #338
feat(server): support fp16 for t5 (#360)
Fixes #349
feat(server): do not use device_map auto on single GPU (#362)
feat(server): support trust_remote_code (#363)
feat(router): log input/output at debug level (#364)
@njhill FYI
v0.7.0 (#353)
feat: decrease IPC proto size (#367)
Closes #307 #308
feat(benchmarker): add summary tables (#368)
feat(server): support vectorized warpers in flash causal lm (#317)
Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>
Fix issue when loading AutoModelForSeq2SeqLM model (#370)
fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES
fix(server): fix quantization
feat(server): support RefinedWeb models (#379)
v0.8.0
increase health checks
feat(server): add retry on download (#384)
fix(server): fix bnb quantization for CausalLM models (#385)
v0.8.1
fix(server): fix has_position_ids (#395)
Fix #389
feat(server): remove trust_remote_code requirement for falcon models (#396)
feat(server): load santacoder/starcoder models with safetensors (#393)
Fix #366
v0.8.2
feat(sagemaker): add trust remote code to entrypoint (#394)
feat(launcher): parse oom signal (#404)
feat(server): only compute prefill logprobs when asked (#406)
Close #288
feat(server): batch tokenization for flash causal lm (#411)
chore: update openapi schema
feat(server): Rework model loading (#344)
Reworked the loading logic. Idea is to use cleaner loading code:
- Remove need for `no_init_weights`
- Remove all weird `bnb_linear` and `load_weights` and
`post_load_weights`.
New code layout:
- New class `Weights` in charge of loading the weights from multiple files into appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code to know what kind of sharding we need + an eventual `all_reduce`. They do not inherit from Linear, but contain some kind of Linear instead
- the contained linear can be either FastLinear, BnbLinear or GPTQ Linear next.
- All modeling code is explicitly written for sharding; the process group is just a no-op for non-sharded code (removes a lot of test cases)

---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
feat(server): optimize dist ops (#434)
docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441)
This fixes a typo in the comment sections referencing the environment variable `CUDA_VISIBLE_DEVICES`. No misspelled references to this variable were found in the code logic, so there is no undefined behaviour or bug. This PR does not modify any code logic.
fix(makefile): Fix typo and use POSIX comparison in the makefile (#443)
This PR fixes:
- The usage of a non-POSIX comparison, which may fail depending on the shell used (`=` will always work, `==` only with bash)
- A typo in the env variable name displayed in the error message: `BUILD_EXTENSION` instead of `BUILD_EXTENSIONS`
<!-- Remove if not applicable -->
Fixes #422
feat(server): pre-allocate past key values for flash causal LM (#412)
feat(router): add ngrok integration (#453)
feat(server): improve flash attention import errors (#465)
@lewtun, is this enough?
Closes #458
Closes #456
fix(server): fix warpers on CPU (#472)
Closes #471
fix(server): Fixing T5 in case the names are mixed up. (#475)
feat(server): Update convert logic. (#483)
Should be more robust to shared tensors (ok when using
`from_pretrained). But forcing us to add new checks in our loading
code (since the chosen key to keep might be different from
`transformers`).
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
feat(server): Adding new ignore_rule for conversion. (#485)
fix(router): add timeout on flume sends (#488)
feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438)
Let's start discussing implementation.
- Need to expose the quantization scripts (either included here or add
doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).
Currently it means that every place we use `get_{tensor|sharded}` needs to check for quantization.
My idea is to reintegrate as much as possible into `utils/layer.py` by expanding `load_multi` to be a bit more generic.
This might require some thinking, but ultimately the `qweight,qzeros,scales,g_idx` should live in a single place, independent of bias presence.
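As a sketch of what that "single place" could look like (a hypothetical helper; only the tensor names `qweight`, `qzeros`, `scales`, `g_idx` come from the description above, and the sharding dimensions are illustrative):

```python
# Hypothetical helper: gather all GPTQ tensors for one linear layer in a single place,
# independent of whether a bias is present. `weights` is assumed to expose
# get_tensor/get_sharded as discussed above; sharding dims are illustrative.
def get_gptq_weights(weights, prefix: str):
    qweight = weights.get_sharded(f"{prefix}.qweight", dim=1)
    qzeros = weights.get_sharded(f"{prefix}.qzeros", dim=1)
    scales = weights.get_sharded(f"{prefix}.scales", dim=1)
    g_idx = weights.get_tensor(f"{prefix}.g_idx")
    try:
        bias = weights.get_tensor(f"{prefix}.bias")
    except KeyError:
        bias = None
    return qweight, qzeros, scales, g_idx, bias
```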
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
fix(server): Do not init process group if already initialized (#388)
feat(router): add header option to disable buffering for the generate_stream response (#498)
This adds a header option to disable buffering for the generate_stream endpoint response stream.
Problem: If a model is run behind a proxy server such as nginx that has
buffering enabled then the response stream from generate_stream gets
aggregated into a single response which basically disables streaming.
Instead of getting a chunked response where each token is presented over
time the response presents everything all at once.
Solution: This change adds the `X-Accel-Buffering` http header which
disables buffering for the generate_stream response, allowing the
response to stream properly.
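The TGI router is written in Rust, but the idea translates directly; here is a minimal Python/FastAPI sketch of a streaming endpoint that sets the header (illustrative only, not the TGI code):

```python
# Minimal sketch: disable proxy buffering on a token stream (FastAPI used for illustration).
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def fake_token_stream():
    # Stand-in for the real token generator.
    for token in ["Hello", " ", "world"]:
        yield f"data: {token}\n\n"


@app.get("/generate_stream")
async def generate_stream():
    return StreamingResponse(
        fake_token_stream(),
        media_type="text/event-stream",
        # Tells nginx (and compatible proxies) not to buffer this response.
        headers={"X-Accel-Buffering": "no"},
    )
```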
feat(server): add paged attention to flash models (#516)
Closes #478
feat(router): arg validation (#519)
feat: Add the option to force another dtype than `f16`. (#513)
fix(launcher): fix issue where launcher does not properly report shard failures (#522)
v0.9.0 (#525)
feat(server): Add Non flash MPT. (#514)
This adds a non flash version of MPT.
Flash is harder because we need to create a bias ready cuda kernel of
flash attention.
Fixes
https://github.com/huggingface/text-generation-inference/issues/361
Fixes
https://github.com/huggingface/text-generation-inference/issues/491
Fixes
https://github.com/huggingface/text-generation-inference/issues/290
fix: Update server/Makefile to include Makefile-vllm (#520)
For consistency and ease of use (you can just run `make` to install vllm
without any extra steps).
docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462)
fix(server): Handle loading from local files for MPT (#534)
This PR allows the MPT model to be loaded from local files. Without this
change, an exception will be thrown by the `hf_hub_download` function if `model_id` is a local path.
fix(server): avoid errors for very small top_p values (#544)
See https://github.com/huggingface/transformers/pull/24111
I didn't add validation to the `__init__` method since it's not done for
other values/warpers.
feat(server): use latest flash attention commit (#543)
@njhill FYI
feat(router): add argument for hostname in router (#545) (#550)
In title. Adds argument `--hostname` in router to support something like
`--hostname ::`. Tested with
```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health' # failed before this commit
```
Trigger CI
---------
Co-authored-by: Phil Chen <philchen2000@gmail.com>
fix(server): decrease memory fragmentation (#557)
v0.9.1 (#558)
fix(server): harden the weights choice to save on disk. (#561)
- Look at `transformers` base class to check for
`_key_to_ignore_on_load_missing` or `_tied_weights` which are the
standard attributes to select the keys to NOT save on disk (since they
are ignored)
- Modified safetensors code (to be reflected in safetensors even if it's
an internal function).
- Will not work for trust_remote_code=True repos (like santacoder).
Should help with :
https://github.com/huggingface/text-generation-inference/issues/555
and : https://github.com/huggingface/text-generation-inference/pull/501
and https://github.com/huggingface/text-generation-inference/issues/556
and
https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593
feat: better errors for warmup and TP (#575)
Close #571
fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579)
Fixes #555
feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580)
Some models are already converted and do not have those values in the file; this enables users to use them with less friction.
Went for a pure env-based approach because adding flags would end up (imo) very tedious to maintain. There's a lot of sanitation to do: those flags would be errors if not used in conjunction with `--quantize gptq`.
Then the flags would need to exist in the launcher and the server, and be passed throughout all function calls.
This PR is intended as an easy escape hatch, not the de facto method to use gptq in TGI.
Fixes #500
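A minimal sketch of that escape hatch (hypothetical helper; only the `GPTQ_BITS` and `GPTQ_GROUPSIZE` names come from this PR):

```python
import os


# Hypothetical helper: read GPTQ parameters from the environment as an escape hatch
# when the converted weights do not carry them.
def gptq_params_from_env():
    try:
        bits = int(os.environ["GPTQ_BITS"])
        groupsize = int(os.environ["GPTQ_GROUPSIZE"])
    except (KeyError, ValueError):
        # Variable missing or not an integer: signal "not set" and let the caller
        # fall back to whatever the weights/config provide.
        return None
    return bits, groupsize
```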
chore: migrate ci region for more availability. (#581)
fix(server): T5 weights names. (#582)
Fixes #541
fix(server): Adding logger import to t5_modeling.py (#585)
Logger is referenced during the apex importing but is not imported,
causing a NameError
fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590)
This fixes a typo and extends the GPTQ_BITS environment variable handling through to the second method, which requires the same logic. Please let me know if there's anything I've misunderstood in this change.
Thanks @Narsil for the original fix.
feat(server): Implements sharding for non divisible `vocab_size`. (#583)
- The code is relatively easy (just disable the checks on Embedding and Head)
This cannot be done in the same easy fashion for hidden_dim/head_dim. It's relatively easy on some models (classic MHA) but it would make the other models (MQA) much more complex, and GPTQ quantization another quite hairy piece of code.
feat(server): empty cache on errors
GPTQ Env vars: catch correct type of error (#596)
When passing in environment variables like gptq_bits, we still get
errors thrown from TGI because the try/catch block is catching the wrong
type of error. This PR aims to fix that.
@Narsil - let me know if this is how you want this formatted. My Python
is a little shaky, so I hope this syntax is correct.
feat(launcher): add arg validation and drop subprocess (#595)
feat(router): explicit warning if revision is not set (#608)
docs: README: Add logo + baseline (#611)

fix(server): blacklist local files (#609)
Close #589 #602
v0.9.2 (#616)
fix(server): empty_cache when stopped
fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621)
feat(launcher): debug logs (#623)
feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587)
but should work on more configurations (no need for 2 GPUs, less RAM usage).
Still need to investigate the potential differences in quantization
results.
feat(server): flash attention v2 (#624)
feat(server): add support for llamav2 (#633)
v0.9.3 (#634)
fix(server): fix llamav2 config (#635)
feat(server): auto max_batch_total_tokens for flash att models (#630)
feat(router): ngrok edge (#642)
docs: Update README.md (#639)
docs: Update README.md (#643)
Add trust_remote_code to quantize script (#647)
Fixes a bug that appeared with MR #587 (which fixed issue #552). See the discussion in #552.
With MR #587 the trust_remote_code variable is no longer passed to AutoModelForCausalLM, even though it is present in the function signature. This prevents models like Falcon, for which trust_remote_code is required, from being quantized. This MR fixes the issue.
fix(server): llama v2 GPTQ (#648)
As per title & reported
https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5
Test it:
```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```
&
```
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
-H 'Content-Type: application/json'
```
fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661)
fix(server): use mem_get_info to get kv cache size (#664)
Close
https://github.com/huggingface/text-generation-inference/issues/649
Close
https://github.com/huggingface/text-generation-inference/issues/651
Close
https://github.com/huggingface/text-generation-inference/issues/653
Close #636
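For reference, a small sketch of the idea: `torch.cuda.mem_get_info` reports the free and total device memory, and the free amount can be used to size the KV cache (the safety margin here is illustrative; the actual TGI computation differs):

```python
import torch


# Sketch: size the KV cache from the memory that is actually free on the device,
# instead of estimating it from total memory minus the model size.
def kv_cache_budget_bytes(device: torch.device, safety_margin: float = 0.95) -> int:
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return int(free_bytes * safety_margin)
```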
feat(server): Add exllama GPTQ CUDA kernel support #553 (#666)
Just trying to get the integration tests to pass.
---------
Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>
Directly load GPTBigCode to specified device (#618)
This PR directly loads GPTBigCode to the specified device, avoiding moving the model between devices.
feat(server): add local prom and health routes if running w/ ngrok
feat: add cuda memory fraction (#659)
Close #673
fix(server): fix exllama buffers (#689)
Close #683
feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671)
- The current PR is not great because we're side-stepping `Weights.__init__`, but Weights shouldn't require anything related to the config or the model_id, as it aims to be a simple wrapper over multi-file loading.
- Ideal solution would be to use something like Rust enum
```
enum Quantize {
    Bitsandbytes(Bitsandbytes),
    Gptq { bits: usize, groupsize: usize },
}
```
And passing that around during load. Unfortunately we don't
have access to this, so for now, side-stepping seems easier.
- Re-enabling groupsize<0 with exllama (confirmed it works.)
Helps #601
In next steps we should make sure our quantization script uses that
format and make it standard.
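A sketch of the intended lookup order, falling back to the env variables (the `bits`/`group_size` field names follow the common GPTQ `quantize_config.json` convention and are an assumption here):

```python
import json
import os

from huggingface_hub import hf_hub_download


# Sketch: prefer quantize_config.json shipped with the model, fall back to env vars.
def load_gptq_params(model_id: str, revision: str = None):
    try:
        path = hf_hub_download(model_id, filename="quantize_config.json", revision=revision)
        with open(path) as f:
            data = json.load(f)
        # Field names follow the usual GPTQ config convention (assumption).
        return data["bits"], data["group_size"]
    except Exception:
        return int(os.environ["GPTQ_BITS"]), int(os.environ["GPTQ_GROUPSIZE"])
```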
docs(README): update readme
fix(server): fix quantization python requirements (#708)
fix(server): fix missing datasets in quantize
feat(server): support new falcon config (#712)
v0.9.4 (#713)
Add section about TGI on other AI hardware accelerators in README (#715)
As per title.
docs: Add hardware section to TOC in README (#721)
feat(server): update vllm version (#723)
chore: update license to HFOIL (#725)
v1.0.0 (#727)
Local gptq support. (#738)
Redoes #719
Fix typing in `Model.generate_token` (#733)
This PR fixes a minor type annotation issue in the signature of
`Model.generate_token`.
All existing overrides of `Model.generate_token` return
`Tuple[List[Generation], Optional[B]]`:
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591
I suspect that back in 017a2a8c when `GeneratedText` and `Generation`
were separated, the function signature was not updated.
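Concretely, the corrected annotation reads as follows (paraphrased; `Generation` is a placeholder class here):

```python
from typing import Generic, List, Optional, Tuple, TypeVar


class Generation:
    ...  # placeholder for the real dataclass


B = TypeVar("B")  # the batch type


class Model(Generic[B]):
    def generate_token(self, batch: B) -> Tuple[List[Generation], Optional[B]]:
        # The second element is the batch to keep for the next step, or None when done.
        raise NotImplementedError
```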
CC @OlivierDehaene
Adding Rope scaling. (#741)
- Adds RoPE NTK scaling. Done because https://github.com/huggingface/text-generation-inference/pull/529 was closed. Took some code from https://github.com/huggingface/transformers/pull/24653.
- `--rope-scaling` and `--rope-factor` are added separately. I considered having a single flag and parsing something like "linear:4.0" or "dynamic", but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server).
Fixes #512
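For reference, a simplified Python sketch of the dynamic NTK idea borrowed from the linked transformers PR (the actual TGI code differs; the default values here are illustrative):

```python
# Simplified sketch of dynamic NTK RoPE scaling: keep the original rotary base up to the
# pre-trained context length, then grow it as the sequence length increases.
def rotary_inv_freq(dim: int, seq_len: int, max_position_embeddings: int = 2048,
                    base: float = 10000.0, factor: float = 2.0):
    if seq_len > max_position_embeddings:
        base = base * (
            (factor * seq_len / max_position_embeddings) - (factor - 1)
        ) ** (dim / (dim - 2))
    # Standard rotary inverse frequencies, computed from the (possibly adjusted) base.
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
```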
chore: fix typo in mpt_modeling.py (#737)
Fixed typo.
implemetation -> implementation


What does this PR do?
This is an experimental PR for discussion, so we can decide whether to add this pattern.
Context
In the past week, there have been several developments about scaling RoPE (Rotary Position Embeddings, i.e. Llama's position embeddings) so as to be able to extrapolate beyond 2048 tokens. Without any scaling and/or finetuning, the perplexity quickly explodes when we go beyond 2048 tokens. Here's the sequence of RoPE scaling improvements, announced mostly on Reddit:
- Linear scaling of the position ids (position interpolation), by /u/kaiokendev.
- NTK-aware scaling, by /u/bloc97. EDIT: following the comments in this thread, this technique will not be added!
- Dynamic NTK scaling, by /u/emozilla.
Changes in the PR
The goal of this PR is to debate whether we want to include RoPE scaling support, with working code as reference. The field is evolving quite fast, so I've added it in a way that lets us quickly add new scaling strategies and keep surfing the wave 🏄 Of course, the implementation itself is up for discussion! (An alternative implementation would be to have separate classes for the scalable RoPEs)
Pros:
Cons:
- `rope_scaling` is a dictionary input, which is somewhat undesirable;
Example
Consider the following prompt from a paper transcript, containing ~6k tokens:
prompt built from the transcript of https://arxiv.org/abs/2306.15595
If we place it in the following example
we get:
However, if we add `rope_scaling={"type": "dynamic", "factor": 2.0}` in `from_pretrained`, we now get:
Better generation parameterization can definitely be selected, but you get the idea -- with these changes, models with RoPE can handle much larger contexts right out of the box 🔥
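For illustration, a minimal sketch of the kind of call described above (the model id, prompt file and generation parameters are placeholders, not from the PR):

```python
# Minimal sketch: enable dynamic RoPE scaling at load time and generate past 2048 tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any model with rotary embeddings

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    rope_scaling={"type": "dynamic", "factor": 2.0},  # the new config option
    device_map="auto",
)

long_prompt = open("transcript_prompt.txt").read()  # placeholder: the ~6k-token prompt
inputs = tokenizer(long_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```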