[T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length #16990
Conversation
The documentation is not available anymore as the PR was closed or merged.
tests/t5/test_tokenization_t5.py (Outdated)
```diff
         )
         self.assertIsInstance(batch, BatchEncoding)
-        self.assertEqual(batch.input_ids.shape, (2, 512))
+        self.assertEqual(batch.input_ids.shape, (2, 8001))
```
@sgugger note that while this IMO fixes a bug (T5 has no fixed max length), it might break backwards compatibility in some edge cases. T5 is used a lot, but I still think it's better to correct it here.
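To make the edge case concrete, here is a hedged sketch (not taken from the PR; `"t5-base"` and the 512 default are assumptions based on the discussion) of the kind of call whose result would change if the hard-coded maximum were dropped:

```python
from transformers import T5TokenizerFast

# "t5-base" is used only for illustration.
tok = T5TokenizerFast.from_pretrained("t5-base")
ids = tok(1000 * "hello there ", truncation=True).input_ids

# With a hard-coded model_max_length of 512: the input is silently truncated to 512 tokens.
# Without it (as in the first iteration of this PR): the full tokenized length is returned,
# which is the edge case that could break existing pipelines or cause OOM downstream.
print(len(ids))
```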
Oh, that's a serious change if a user forgot to set a max_length. I understand it fixes a bug, but I would still like @LysandreJik's take on it as well.
Thanks for the PR in any case!
Agree! We should at least put some ❗ mark in this PR stating that this change could lead to unexpected behavior (e.g. OOM) if the user does not set a max_length.
That is definitely a breaking change we want to avoid, IMO. This is likely to break user pipelines with OOM errors or an inconsistent number of generated tokens. I'd advocate against this change, and would push to:

- Updating the warning just for T5

You can override this method, which is in transformers/src/transformers/tokenization_utils_base.py (lines 3379 to 3397 at e6f00a1).

I wouldn't recommend skipping the warning altogether, as it still gives important information regarding why the text was eventually truncated or padded. But updating the message makes sense:

```diff
 def _eventual_warn_about_too_long_sequence(self, ids: List[int], max_length: Optional[int], verbose: bool):
     """
     Depending on the input and internal state we might trigger a warning about a sequence that is too long for its
     corresponding model

     Args:
         ids (`List[str]`): The ids produced by the tokenization
         max_length (`int`, *optional*): The max_length desired (does not trigger a warning if it is set)
         verbose (`bool`): Whether or not to print more information and warnings.
     """
     if max_length is None and len(ids) > self.model_max_length and verbose:
         if not self.deprecation_warnings.get("sequence-length-is-longer-than-the-specified-maximum", False):
             logger.warning(
-                "Token indices sequence length is longer than the specified maximum sequence length "
-                f"for this model ({len(ids)} > {self.model_max_length}). Running this sequence through the model "
-                "will result in indexing errors"
+                "The T5 model has no maximum length, but a maximum length is still set for backwards compatibility "
+                "purposes. To take advantage of the full capabilities of the model, we recommend setting a "
+                "max_length manually."
             )
             self.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = True
```
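A minimal sketch of what "updating the warning just for T5" could look like if the method above were overridden in a T5-specific subclass; the class name and the message text below are illustrative assumptions, not the PR's actual implementation:

```python
from typing import List, Optional

from transformers import T5TokenizerFast
from transformers.utils import logging

logger = logging.get_logger(__name__)


class T5TokenizerFastWithT5Warning(T5TokenizerFast):
    """Illustrative subclass: same length check as the base method, T5-specific message."""

    def _eventual_warn_about_too_long_sequence(self, ids: List[int], max_length: Optional[int], verbose: bool):
        if max_length is None and len(ids) > self.model_max_length and verbose:
            if not self.deprecation_warnings.get("sequence-length-is-longer-than-the-specified-maximum", False):
                logger.warning(
                    "T5 has no fixed maximum input length; model_max_length "
                    f"({self.model_max_length}) is kept only for backwards compatibility. "
                    "Pass max_length explicitly if you want truncation."
                )
                self.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = True
```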
Okay, took some time to think about it - it's really not easy. I agree @LysandreJik that the previous change (while correct) is too strong, as it might break quite some pipelines. To begin with, note that in this PR two things are changed; the 2.) can be considered a small breaking change, but it's really a bug correction for me.

Previously no warning appeared. Note that this warning appears every time at init. However, it can be disabled as described above, and it's also good to warn the user about upcoming changes this way.
```python
#!/usr/bin/env python3
from transformers import T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-base", model_max_length=600)
out = tok(100 * "hello there is a", padding="longest", truncation=True).input_ids
print(len(out))
```

This does not throw a warning (since the user defines `model_max_length`).

To be crystal clear, 2.) changes the behavior - e.g. run the code snippet before/after the PR - but it's really a bug correction here IMO.
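For contrast, a hedged counterpart to the snippet above (not taken from the PR): loading the tokenizer without overriding `model_max_length` is the configuration for which the init-time warning described here is expected to appear once; the exact wording depends on the final version of the PR.

```python
#!/usr/bin/env python3
from transformers import T5TokenizerFast

# No model_max_length override: the init-time warning about the upcoming
# change is expected to be emitted once here (the exact message is not
# shown in this thread, so it is not reproduced).
tok = T5TokenizerFast.from_pretrained("t5-base")
out = tok(100 * "hello there is a", padding="longest", truncation=True).input_ids
print(len(out))
```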
LGTM, thanks a lot for iterating on this and making it more backward compatible. Your proposed solution looks great!
Your solution looks good to me. Thanks for working on it @patrickvonplaten, LGTM.
```python
if init_max_model_length is not None and init_max_model_length != max_model_length:
    return init_max_model_length
elif init_max_model_length is None:
    logger.warning(
```
This could be a `warnings.warn(..., FutureWarning)` so that it is correctly displayed as a deprecation warning for users.
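A sketch of that suggestion, reusing the variable names from the snippet above; the function wrapper and the message text are placeholders for illustration, not the PR's actual code:

```python
import warnings
from typing import Optional


def resolve_max_model_length(init_max_model_length: Optional[int], max_model_length: int) -> int:
    # Illustrative helper mirroring the branch structure above, but emitting the
    # notice via warnings.warn(..., FutureWarning) so it is surfaced as a deprecation.
    if init_max_model_length is not None and init_max_model_length != max_model_length:
        return init_max_model_length
    if init_max_model_length is None:
        warnings.warn(
            "Relying on the default max_model_length for T5 is deprecated and will "
            "change in a future version; pass model_max_length explicitly.",  # placeholder text
            FutureWarning,
        )
    return max_model_length
```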
Failure is unrelated.
[T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length (huggingface#16990)

* [T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length
* [T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length
* correct t5 tokenizer
* correct t5 tokenizer
* fix test
* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* finish

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
What does this PR do?
Fixes #16986
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.