Fix #16660 (tokenizers setters of ids of special tokens) by davidleonfdez · Pull Request #16661 · huggingface/transformers

Conversation

@davidleonfdez (Contributor)

What does this PR do?

Fixes #16660

@LysandreJik

The code is quite old, so it seems no one was using those setters, but here's a fix if you want to merge it.
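For context, here is a minimal sketch of the fixed behaviour. This snippet is illustrative rather than part of the PR, and assumes the bert-base-uncased checkpoint:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Before this fix, the *_token_id setters stored convert_tokens_to_ids(value),
    # i.e. they treated the id they were given as if it were a token string.
    tok.bos_token_id = tok.cls_token_id    # assign an existing id to an unset special token
    assert tok.bos_token == tok.cls_token  # "[CLS]": the setter converts the id back to its token
    assert tok.bos_token_id == tok.cls_token_id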

@HuggingFaceDocBuilderDev commented Apr 8, 2022

The documentation is not available anymore as the PR was closed or merged.

@SaulLu (Contributor) commented Apr 8, 2022

Thank you very much for sharing your solution, it looks great! 🤗

Just to be able to merge this change on main, I think it would be good to add a test to make sure that future changes don't break the behaviour you just fixed; I'll let @LysandreJik confirm.

This new test would surely have its place in tests/test_tokenization_common.py. If you ever need more help or don't have time to do it, don't hesitate to let us know!

@davidleonfdez (Contributor, Author) commented Apr 8, 2022

Sure!

I've just added the checks to an existing test. Some things could be done differently depending on your preferences:
a) Simplify and assume that len(tokenizer) > 0 always.
b) Add a dummy token before testing the setters when len(tokenizer) == 0 (see the sketch below). I've chosen not to do this because you may want to keep tests as independent as possible, and this change would make the test fail whenever the method to add tokens fails.
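As a rough illustration, option (b) would amount to something like the following guard at the top of the test (a sketch, not code from the PR):

    # option (b): guarantee the vocab is non-empty before exercising the setters;
    # if add_tokens itself is broken, this test would fail for the wrong reason
    if len(tokenizer) == 0:
        tokenizer.add_tokens(["dummy_token"])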

Please let me know.

@davidleonfdez (Contributor, Author) commented Apr 9, 2022

I ended up creating a separate test method for this, because I found at least two models where the ids shouldn't be set, and I needed to disable only this test for those tokenizers.

@SaulLu (Contributor) left a comment

Thanks a lot for adding the test! 🤗 I think it was a good choice to add this test independently. I've left a small comment for us to discuss.

    try:
        vocab = tokenizer.get_vocab()
    except NotImplementedError:
        vocab = None
    should_test_setters_not_none = (vocab is not None) and (len(vocab) > 0)
@SaulLu (Contributor):

This is obviously up for discussion, but I wonder whether this case only applies to ByT5. If so, I think it's worth removing this complexity from the common test and designing a test specifically for ByT5, what do you think? 😄

@davidleonfdez (Contributor, Author) commented Apr 11, 2022

Yes, I've checked it with a script: get_vocab is only missing in the ByT5 and Canine tokenizers, and this test is already "disabled" for Canine.

My thought process was basically:

  1. I need one random token id to test the setters, so let's take the first item of the vocab.
  2. It looks like ByT5, at least, needs this check. Since I can't be sure I'm not missing anything, I'll take the safe path and always make sure that we have a vocab.

I was thinking that a pro of my approach is that if a new model doesn't define get_vocab, the author doesn't need to make any decision regarding this test. However, I understand one could argue that this isn't actually a pro. With your approach, if someone adds a new model that doesn't define get_vocab, the test will fail, so the author will be aware of the situation and will need to explicitly override or disable the test (define it with an empty pass body; see the sketch below). Is this better? I'm now leaning towards thinking it is, so OK, I will modify the PR as you suggest :)
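For illustration, the "override or disable" pattern described above might look like this in a model-specific test class (a sketch; the class and mixin names follow the transformers test-suite conventions, but the body is hypothetical, not the test the PR ended up adding):

    import unittest

    from .test_tokenization_common import TokenizerTesterMixin


    class ByT5TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
        def test_tokenizers_common_ids_setters(self):
            # ByT5 does not implement get_vocab, so the common version of this
            # test is replaced here (either with a ByT5-specific check or with
            # an empty body to disable it)
            pass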

@SaulLu (Contributor):

Indeed, I completely understand your way of thinking. Canine and ByT5 are two tokenizers that are a bit different from the others, so I think it's good to do it that way! Thanks a lot for your last commit!

This discussion makes me think that it would be nice to have a test for Canine as well, because this tokenizer also defines special tokens:

    def __init__(
        self,
        bos_token=chr(CLS),
        eos_token=chr(SEP),
        sep_token=chr(SEP),
        cls_token=chr(CLS),
        pad_token=chr(PAD),
        mask_token=chr(MASK),
        add_prefix_space=False,
        model_max_length=2048,
        **kwargs
    ):

What do you think about it? 😄

@davidleonfdez (Contributor, Author):

Ah yes, I knew nothing about Canine, and after seeing SPECIAL_CODEPOINTS I thought of it as a fixed config for special tokens, but I guess that if the setters are there (inherited, not disabled, and not raising NotImplementedError), they should be tested.
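As a rough sketch of what such a Canine-specific check could look like (illustrative only, assuming CanineTokenizer maps ordinary characters to their Unicode codepoints; this is not the test that ended up in the PR):

    from transformers import CanineTokenizer

    tokenizer = CanineTokenizer()
    # Canine works at the character level, so an ordinary character's id is its codepoint
    new_id = ord("a")
    tokenizer.bos_token_id = new_id
    assert tokenizer.bos_token == "a"
    assert tokenizer.bos_token_id == new_id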

@SaulLu (Contributor) commented Apr 13, 2022

Thank you very much for the new test! It is perfect!

I should have mentioned it at the same time, but it seems to me that Speech2Text also defines special tokens:

    def __init__(
        self,
        vocab_file,
        spm_file,
        bos_token="<s>",
        eos_token="</s>",
        pad_token="<pad>",
        unk_token="<unk>",
        do_upper_case=False,
        do_lower_case=False,
        tgt_lang=None,
        lang_codes=None,
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        **kwargs,
    ) -> None:

Do you think we can also make a (custom) test for it?

@davidleonfdez (Contributor, Author):

😄 I could have noticed that too. It was a leftover from the original test, which used token id 0 to test the setters, and a problem that I wrongly attributed to the existence of a fixed vocab 😅. Using get_vocab, the default test was already able to cover this tokenizer.

@SaulLu (Contributor) left a comment

Thank you very much for fixing the setters, adding the tests, and for our last exchanges! 😄 I think the PR is ready! 🚀

I'll take the liberty of pinging one of the transformers' main maintainers in case they want to do a review too: cc @LysandreJik, @sgugger, @patrickvonplaten or @patil-suraj

@sgugger (Collaborator) left a comment

Very nice, and awesome for the new test! Thanks a lot!

@sgugger sgugger merged commit 9f8bfe7 into huggingface:main Apr 13, 2022
@patil-suraj (Contributor) left a comment

LGTM, just left one comment. Thanks a lot for working on this :)

     @bos_token_id.setter
     def bos_token_id(self, value):
    -    self._bos_token = self.convert_tokens_to_ids(value)
    +    self._bos_token = self.convert_ids_to_tokens(value) if value is not None else None
@patil-suraj (Contributor):

Nice! But wouldn't it be better to let it fail when the value is None? If one sets it to a None value and it doesn't fail, one would assume the token is set, try to use it, and then it'll fail in some later part of the code. It's better to fail as early as possible IMO. wdyt @SaulLu?

@SaulLu (Contributor) commented Apr 13, 2022

From my side, I think that some models don't need to define some or all of the special tokens (bos, eos, etc.), so it seems fine to me that the id can be set to None. 🙂

Besides, we can see in the bos_token_id property (the getter) that a None value is handled properly:

    @property
    def bos_token_id(self) -> Optional[int]:
        """
        `Optional[int]`: Id of the beginning of sentence token in the vocabulary. Returns `None` if the token has
        not been set.
        """
        if self._bos_token is None:
            return None
        return self.convert_tokens_to_ids(self.bos_token)
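Concretely, the behaviour under discussion round-trips like this (an illustrative snippet mirroring the assertions in the new common test):

    tokenizer.bos_token_id = None       # allowed: clears the special token
    assert tokenizer.bos_token is None  # the getter returns None instead of raising
    assert tokenizer.bos_token_id is None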

Comment on lines +543 to +572
    def test_tokenizers_common_ids_setters(self):
        tokenizers = self.get_tokenizers()
        for tokenizer in tokenizers:
            with self.subTest(f"{tokenizer.__class__.__name__}"):
                attributes_list = [
                    "bos_token",
                    "eos_token",
                    "unk_token",
                    "sep_token",
                    "pad_token",
                    "cls_token",
                    "mask_token",
                ]

                vocab = tokenizer.get_vocab()
                token_id_to_test_setters = next(iter(vocab.values()))
                token_to_test_setters = tokenizer.convert_ids_to_tokens(
                    token_id_to_test_setters, skip_special_tokens=False
                )

                for attr in attributes_list:
                    setattr(tokenizer, attr + "_id", None)
                    self.assertEqual(getattr(tokenizer, attr), None)
                    self.assertEqual(getattr(tokenizer, attr + "_id"), None)

                    setattr(tokenizer, attr + "_id", token_id_to_test_setters)
                    self.assertEqual(getattr(tokenizer, attr), token_to_test_setters)
                    self.assertEqual(getattr(tokenizer, attr + "_id"), token_id_to_test_setters)

                setattr(tokenizer, "additional_special_tokens_ids", [])
Contributor:

Really nice test!

elusenji pushed a commit to elusenji/transformers that referenced this pull request on Jun 12, 2022:

…uggingface#16661)

* Fix setters of *_token_id properties of SpecialTokensMixin
* Test setters of common tokens ids
* Move to a separate test checks of setters of tokens ids
* Add independent test for ByT5
* Add Canine test
* Test speech to text

Linked issue: Tokenizers setter of ids of special tokens don't work (#16660)