Fix #16660 (tokenizers setters of ids of special tokens) #16661
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thank you very much for sharing your solution, it looks great! 🤗 Just to be able to merge this change on main, I think it would be good to add a test to make sure that future changes don't break the behaviour you just fixed (I'll let @LysandreJik confirm). This new test would surely have its place in tests/test_tokenization_common.py.
Sure! I've just added the checks into an existing test. Some things could be done differently depending on your preferences; please let me know.

I ended up creating a different method for this test because I've seen at least two models where the ids shouldn't be set, and I needed to disable only this test for those tokenizers.
Thanks a lot for adding the test! 🤗 I think it was a good choice to add this test independently. I've left a small comment below to discuss together.
tests/test_tokenization_common.py (outdated)

```python
    vocab = tokenizer.get_vocab()
except NotImplementedError:
    vocab = None
should_test_setters_not_none = (vocab is not None) and (len(vocab) > 0)
```
This is obviously up for discussion, but I wonder whether this check exists only for ByT5. If that's the case, I think it's worth removing this complexity from the common test and designing a test specifically for ByT5, what do you think? 😄
Yes, I've checked it with a script, and get_vocab is only missing in the ByT5 and Canine tokenizers, but this test is already "disabled" for Canine.
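A minimal sketch of such a check, assuming one only wants to find slow tokenizer classes that still inherit the base get_vocab (which raises NotImplementedError); this is an illustration, not the author's actual script:

```python
import inspect

import transformers
from transformers import PreTrainedTokenizer

for name in dir(transformers):
    try:
        obj = getattr(transformers, name)
    except Exception:
        continue  # optional deps (e.g. sentencepiece) may be missing
    if (
        inspect.isclass(obj)
        and issubclass(obj, PreTrainedTokenizer)
        and obj is not PreTrainedTokenizer
        # get_vocab not overridden anywhere below the base classes
        and obj.get_vocab is PreTrainedTokenizer.get_vocab
    ):
        print(f"{name} does not override get_vocab")
```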
My thought process was basically:
- I need to get one random token id to test the setters, so let's take the first item of the vocab.
- It looks like, at least, ByT5 needs this check. Because I'm not sure I'm not missing anything, I'll take the safe path and always make sure that we have a vocab.
I was thinking a pro of my approach is that if a new model doesn't define get_vocab, the author doesn't need to make any decision regarding this test. However, I understand one could argue that this isn't a pro actually. I mean, with your approach, if someone adds a new model that doesn't define get_vocab, the test will fail and the author will be aware of this situation and will need to explicitly override or disable (define pass empty body) the test. Is this better? Now I'm leaning to think it is, so OK, I will modify the PR as you suggest :)
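A hypothetical sketch of that "explicitly disable" option; the class and mixin names follow the usual transformers test layout but are assumptions here, not code from this PR:

```python
import unittest

# Assumed to live in tests/test_tokenization_common.py in the transformers
# repo; the exact import path is an assumption.
from test_tokenization_common import TokenizerTesterMixin


class ByT5TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    def test_tokenizers_common_ids_setters(self):
        # ByT5 has no conventional vocab, so the common setter test does
        # not apply; a ByT5-specific variant would be written instead.
        pass
```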
Indeed, I completely understand your way of thinking. Canine and ByT5 are two tokenizers that are a bit different from the others, so I think it's good to do it that way! Thanks a lot for your last commit!
This discussion makes me think that it would be nice to have a test for Canine as well, because this tokenizer also defines special tokens.
transformers/src/transformers/models/canine/tokenization_canine.py
Lines 79 to 90 in 70851a6
```python
def __init__(
    self,
    bos_token=chr(CLS),
    eos_token=chr(SEP),
    sep_token=chr(SEP),
    cls_token=chr(CLS),
    pad_token=chr(PAD),
    mask_token=chr(MASK),
    add_prefix_space=False,
    model_max_length=2048,
    **kwargs
):
```
What do you think about it? 😄
Ah yes, I know nothing about Canine, and after seeing SPECIAL_CODEPOINTS I thought of it as a fixed config for special tokens, but I guess that if the setters are there (inherited and not disabled/raising NotImplementedError), they should be tested.
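For illustration, a minimal sketch of what such a Canine test could look like, written as a method on the Canine tokenizer test class; this is an assumption about its shape, not the test that was actually committed. Canine's vocabulary is the Unicode codepoint range, so an ordinary codepoint serves as the test id:

```python
def test_tokenizers_common_ids_setters(self):
    tokenizer = self.get_tokenizer()
    # Any non-special codepoint works as a test value for Canine.
    token_id_to_test_setters = ord("a")

    tokenizer.bos_token_id = token_id_to_test_setters
    # The fixed setter stores convert_ids_to_tokens(id); for Canine that
    # is simply the character with that codepoint.
    self.assertEqual(tokenizer.bos_token, chr(token_id_to_test_setters))
    self.assertEqual(tokenizer.bos_token_id, token_id_to_test_setters)
```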
Thank you very much for the new test! It is perfect!
I should have mentioned it at the same time, but it seems to me that Speech2Text also defines special tokens:
transformers/src/transformers/models/speech_to_text/tokenization_speech_to_text.py
Lines 109 to 123 in f7196f2
```python
def __init__(
    self,
    vocab_file,
    spm_file,
    bos_token="<s>",
    eos_token="</s>",
    pad_token="<pad>",
    unk_token="<unk>",
    do_upper_case=False,
    do_lower_case=False,
    tgt_lang=None,
    lang_codes=None,
    sp_model_kwargs: Optional[Dict[str, Any]] = None,
    **kwargs,
) -> None:
```
Do you think we can also make a (custom) test for it?
😄 I could have realized that too. It was a remnant of the original test, which used token id 0 to test the setters, and a problem that I wrongly attributed to the existence of a fixed vocab 😅. Using get_vocab, the default test was already able to cover this tokenizer.
Thank you very much for fixing the setters, adding the tests, and for our last exchanges! 😄 I think the PR is ready! 🚀
I'll take the liberty of pinging the transformers main maintainers in case they want to do a review too: cc @LysandreJik, @sgugger, @patrickvonplaten or @patil-suraj
Very nice, and awesome for the new test! Thanks a lot!
LGTM, just left one comment. Thanks a lot for working on this :)
```diff
 @bos_token_id.setter
 def bos_token_id(self, value):
-    self._bos_token = self.convert_tokens_to_ids(value)
+    self._bos_token = self.convert_ids_to_tokens(value) if value is not None else None
```
Nice! But wouldn't it be better to let it fail when the value is None? Because if one sets it to None and it doesn't fail, one would assume the token is set, try to use it, and then it'll fail in some later part of the code. It's better to fail as early as possible IMO. wdyt @SaulLu?
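A hypothetical sketch of that fail-early alternative (illustrative only; it is not what this PR merged, which keeps None as a valid "unset" value, as discussed below):

```python
@bos_token_id.setter
def bos_token_id(self, value):
    # Fail-early variant: reject None instead of treating it as "unset".
    if value is None:
        raise ValueError("bos_token_id cannot be set to None")
    self._bos_token = self.convert_ids_to_tokens(value)
```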
From my side, I think that some models don't need to define some or all special tokens (bos, eos, etc.). It seems fine to me that the id can be set to None. 🙂
Besides, we can see in the bos_token_id property (the getter) that a None value is already handled well:
transformers/src/transformers/tokenization_utils_base.py
Lines 1066 to 1074 in 12bfa97
```python
@property
def bos_token_id(self) -> Optional[int]:
    """
    `Optional[int]`: Id of the beginning of sentence token in the vocabulary. Returns `None` if the token has not
    been set.
    """
    if self._bos_token is None:
        return None
    return self.convert_tokens_to_ids(self.bos_token)
```
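So, combined with the fixed setter, None simply round-trips as "unset"; illustratively, for any loaded tokenizer:

```python
tokenizer.bos_token_id = None
assert tokenizer.bos_token is None
assert tokenizer.bos_token_id is None
```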
```python
def test_tokenizers_common_ids_setters(self):
    tokenizers = self.get_tokenizers()
    for tokenizer in tokenizers:
        with self.subTest(f"{tokenizer.__class__.__name__}"):
            attributes_list = [
                "bos_token",
                "eos_token",
                "unk_token",
                "sep_token",
                "pad_token",
                "cls_token",
                "mask_token",
            ]

            vocab = tokenizer.get_vocab()
            token_id_to_test_setters = next(iter(vocab.values()))
            token_to_test_setters = tokenizer.convert_ids_to_tokens(
                token_id_to_test_setters, skip_special_tokens=False
            )

            for attr in attributes_list:
                setattr(tokenizer, attr + "_id", None)
                self.assertEqual(getattr(tokenizer, attr), None)
                self.assertEqual(getattr(tokenizer, attr + "_id"), None)

                setattr(tokenizer, attr + "_id", token_id_to_test_setters)
                self.assertEqual(getattr(tokenizer, attr), token_to_test_setters)
                self.assertEqual(getattr(tokenizer, attr + "_id"), token_id_to_test_setters)

            setattr(tokenizer, "additional_special_tokens_ids", [])
```
Really nice test!
…uggingface#16661)

* Fix setters of *_token_id properties of SpecialTokensMixin
* Test setters of common tokens ids
* Move to a separate test checks of setters of tokens ids
* Add independent test for ByT5
* Add Canine test
* Test speech to text
What does this PR do?
Fixes #16660
@LysandreJik
It seems no one was using those setters because the code is quite old, but here's a fix if you want to merge it.
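For context, an illustrative repro of the bug (the checkpoint name and the id are arbitrary examples; the exact pre-fix failure depends on the tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Before the fix, the *_token_id setters passed the integer id through
# convert_tokens_to_ids, treating the id as if it were a token string,
# so the stored value was wrong (or the call simply errored out).
tokenizer.bos_token_id = 100
print(tokenizer.bos_token)     # with the fix: the token whose id is 100
print(tokenizer.bos_token_id)  # with the fix: 100
```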