[T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length by patrickvonplaten · Pull Request #16990 · huggingface/transformers · GitHub

Conversation

@patrickvonplaten
Contributor


What does this PR do?

Fixes #16986

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 28, 2022

The documentation is not available anymore as the PR was closed or merged.

)
self.assertIsInstance(batch, BatchEncoding)
self.assertEqual(batch.input_ids.shape, (2, 512))
self.assertEqual(batch.input_ids.shape, (2, 8001))
Contributor Author


@sgugger note that while this IMO fixes a bug (T5 has no fixed max length), it might break backwards compatibility in some edge cases. T5 is used a lot, but I still think it's better to correct it here.

Collaborator

@sgugger sgugger left a comment


Oh, that's a serious change if a user forgot to set a max_length. I understand it fixes a bug, but still would like @LysandreJik 's take on it as well.
Thanks for the PR in any case!

@patrickvonplaten
Contributor Author

Oh, that's a serious change if a user forgot to set a max_length. I understand it fixes a bug, but still would like @LysandreJik 's take on it as well. Thanks for the PR in any case!

Agree! We should at least put some ❗ mark in this PR stating that this change could lead to unexpected behavior (e.g. OOM) if max_length is not defined.

@LysandreJik
Member

That is definitely a breaking change we want to avoid, IMO. This is likely to break user pipelines with OOM errors or an inconsistent number of tokens generated. I'd advocate against this change, and would push to:

  • Document that while the limit is set to 512, T5 can handle longer lengths, and encourage users to define their own max lengths (see the sketch after this list)
  • Document that this limit will be removed in v5
  • Update the warning just for T5 (see below)
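
For instance, a user-defined max length could simply be passed at encode time, so the hardcoded 512 default is never relied upon (a minimal sketch only; the value 1024 is arbitrary):

from transformers import T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-base")

# An explicit max_length makes the user's choice the source of truth.
ids = tok("some long document " * 200, max_length=1024, truncation=True).input_ids
print(len(ids))  # at most 1024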
Updating the warning just for T5

You can override this method (which lives in tokenization_utils_base.py) in tokenization_t5.py and tokenization_t5_fast.py:

def _eventual_warn_about_too_long_sequence(self, ids: List[int], max_length: Optional[int], verbose: bool):
    """
    Depending on the input and internal state we might trigger a warning about a sequence that is too long for its
    corresponding model

    Args:
        ids (`List[str]`): The ids produced by the tokenization
        max_length (`int`, *optional*): The max_length desired (does not trigger a warning if it is set)
        verbose (`bool`): Whether or not to print more information and warnings.
    """
    if max_length is None and len(ids) > self.model_max_length and verbose:
        if not self.deprecation_warnings.get("sequence-length-is-longer-than-the-specified-maximum", False):
            logger.warning(
                "Token indices sequence length is longer than the specified maximum sequence length "
                f"for this model ({len(ids)} > {self.model_max_length}). Running this sequence through the model "
                "will result in indexing errors"
            )
        self.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = True

I wouldn't recommend skipping the warning altogether as it still gives important information regarding why the text was eventually truncated or padded. But updating the message makes sense:

    def _eventual_warn_about_too_long_sequence(self, ids: List[int], max_length: Optional[int], verbose: bool):
        """
        Depending on the input and internal state we might trigger a warning about a sequence that is too long for its
        corresponding model

        Args:
            ids (`List[str]`): The ids produced by the tokenization
            max_length (`int`, *optional*): The max_length desired (does not trigger a warning if it is set)
            verbose (`bool`): Whether or not to print more information and warnings.

        """
        if max_length is None and len(ids) > self.model_max_length and verbose:
            if not self.deprecation_warnings.get("sequence-length-is-longer-than-the-specified-maximum", False):
                logger.warning(
-                    "Token indices sequence length is longer than the specified maximum sequence length "
-                    f"for this model ({len(ids)} > {self.model_max_length}). Running this sequence through the model "
-                    "will result in indexing errors"
+                    "The T5 model has no maximum length, but a maximum length is still set for backwards compatibility "
+                    "purposes. To take advantage of the full capabilities of the model, we recommend setting a "
+                    "max_length manually."
                )
            self.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = True

@patrickvonplaten
Contributor Author

patrickvonplaten commented May 2, 2022

Okay, I took some time to think about it - it's really not easy. I agree with @LysandreJik that the previous change (while correct) is too strong, as it might break quite a few pipelines.

To begin with, note that model_max_length or max_length is only relevant if truncation=True is set, so for all other cases this bug is not relevant.
Now, the problem is that by default T5 should not have a set maximum length.
However, it is completely reasonable for people to set their own maximum length. To me this means the following: if a user instantiates the T5 tokenizer with model_max_length or passes max_length when encoding/padding, then these values should always be the true max length values, and in this case the (incorrectly) hard-coded max length values can be discarded.
Only if a user neither passes max_length when encoding/padding nor defines model_max_length at init should we fall back to the (incorrect) hard-coded max length values until v5.
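
Put differently, the intended precedence could be sketched like this (an illustration only; resolve_max_length and hardcoded_default are hypothetical names, not the actual implementation):

from typing import Optional

def resolve_max_length(
    call_max_length: Optional[int],        # max_length passed when encoding/padding
    init_model_max_length: Optional[int],  # model_max_length passed at tokenizer init
    hardcoded_default: int = 512,          # legacy hardcoded value, kept until v5
) -> int:
    # 1. An explicit max_length at call time always wins.
    if call_max_length is not None:
        return call_max_length
    # 2. Otherwise, a model_max_length set at init is the source of truth.
    if init_model_max_length is not None:
        return init_model_max_length
    # 3. Only if neither was set do we fall back to the legacy hardcoded value.
    return hardcoded_default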

In this PR two things are changed; 2.) can be considered a small breaking change, but to me it's really a bug correction.

  1. If the T5 tokenizer is instantiated without a custom model_max_length and one of the identifiers for which model_max_length is hardcoded is used, the following warning appears:
This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.

Previously no warning appeared. Note that this warning appears every time at init; however, it can be disabled as described above, and it's also good to warn the user about upcoming changes this way.

  2. If the T5 tokenizer is instantiated with a model_max_length, this model_max_length always counts, even if it's longer than the hardcoded one. This means the following snippet:
#!/usr/bin/env python3
from transformers import T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-base", model_max_length=600)

out = tok(100 * "hello there is a", padding="longest", truncation=True).input_ids
print(len(out))

does not throw a warning (since the user defines model_max_length) and prints a length of 600 (not 512). <- this behavior is different from how it was before.
My rationale for changing this is the following:

  • T5's hardcoded model max lengths are wrong; I'm fine with using them if no model_max_length is defined and no max_length is passed
  • But if a user already passes a model_max_length <- then this should be the only source of truth. E.g. in the example above, 600 should be the max length and not 512.

To be crystal clear, 2.) changes the behavior - e.g. run the code snippet before/after the PR - but it's really a bug correction here IMO.
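
For illustration, the warning from 1.) is tied to the legacy default only; passing an explicit model_max_length at init avoids it and makes that value the single source of truth (a sketch of the behavior described above, not new API):

from transformers import T5TokenizerFast

# No model_max_length given: the hardcoded 512 is kept for backwards
# compatibility and the init-time warning from 1.) is emitted once.
tok_legacy = T5TokenizerFast.from_pretrained("t5-base")

# Explicit model_max_length: no warning, and this value (600) is used
# instead of the hardcoded 512 when truncating/padding.
tok = T5TokenizerFast.from_pretrained("t5-base", model_max_length=600)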

Collaborator

@sgugger sgugger left a comment


LGTM, thanks a lot for iterating on this and making it more backward compatible. Your proposed solution looks great!

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Member

@LysandreJik LysandreJik left a comment


Your solution looks good to me. Thanks for working on it @patrickvonplaten, LGTM.

if init_max_model_length is not None and init_max_model_length != max_model_length:
    return init_max_model_length
elif init_max_model_length is None:
    logger.warning(
Member


This could be a warnings.warn(..., FutureWarning) so that it is correctly displayed as a deprecation warning for users.
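
A hedged sketch of what that could look like in place of the logger.warning call (illustration only; the message text is abbreviated from the warning quoted earlier, and max_model_length here stands in for the value used in the surrounding code):

import warnings

max_model_length = 512  # placeholder for the legacy hardcoded value

warnings.warn(
    f"This tokenizer was incorrectly instantiated with a model max length of {max_model_length} "
    "which will be corrected in Transformers v5.",
    FutureWarning,
)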

@patrickvonplaten
Contributor Author

Failure is unrelated

@patrickvonplaten patrickvonplaten merged commit 31616b8 into huggingface:main May 2, 2022
@patrickvonplaten patrickvonplaten deleted the fix_t5_tok_warning branch May 2, 2022 19:27
stevhliu pushed a commit to stevhliu/transformers that referenced this pull request May 3, 2022
[T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length (huggingface#16990)

* [T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length

* [T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length

* correct t5 tokenizer

* correct t5 tokenizer

* fix test

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* finish

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
[T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length (huggingface#16990)

* [T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length

* [T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length

* correct t5 tokenizer

* correct t5 tokenizer

* fix test

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* finish

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>


Development

Successfully merging this pull request may close these issues.

Warning tells you you will get indexing errors in T5 for going beyond max length

4 participants