Fix #16660 (tokenizers setters of ids of special tokens) #16661
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thank you very much for sharing your solution, it looks great! 🤗 Just to be able to merge this change on main, I think it would be good to add a test to make sure that future changes don't break the behaviour you just fixed (I'll let @LysandreJik confirm). This new test would surely have its place in tests/test_tokenization_common.py.
Sure! I've just added the checks into an existing test. Some things could be done differently depending on your preferences; please let me know.

I ended up creating a different method for this test because I've seen at least two models where the ids shouldn't be set, and I needed to disable only this test for those tokenizers.
Thanks a lot for adding the test! 🤗 I think it was a good choice to add this test independently. I've left a small comment below to discuss together.
tests/test_tokenization_common.py (outdated)

```python
    vocab = tokenizer.get_vocab()
except NotImplementedError:
    vocab = None
should_test_setters_not_none = (vocab is not None) and (len(vocab) > 0)
```
This is obviously up for discussion, but I wonder whether this check exists only for ByT5. If that's the case, I think it's worth removing this complexity from the common test and designing a test specifically for ByT5, what do you think? 😄
Yes, I've checked it with a script, and get_vocab is only missing in the ByT5 and Canine tokenizers, but this test is already "disabled" for Canine.
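A minimal sketch of such a check, assuming one only wants to find slow tokenizer classes that still inherit the base get_vocab (which raises NotImplementedError); this is an illustration, not the author's actual script:

```python
import inspect

import transformers
from transformers import PreTrainedTokenizer

for name in dir(transformers):
    try:
        obj = getattr(transformers, name)
    except Exception:
        continue  # optional deps (e.g. sentencepiece) may be missing
    if (
        inspect.isclass(obj)
        and issubclass(obj, PreTrainedTokenizer)
        and obj is not PreTrainedTokenizer
        # get_vocab not overridden anywhere below the base classes
        and obj.get_vocab is PreTrainedTokenizer.get_vocab
    ):
        print(f"{name} does not override get_vocab")
```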
My thought process was basically:
- I need to get one random token id to test the setters, so let's take the first item of the vocab.
- It looks like, at least, ByT5 needs this check. Because I'm not sure I'm not missing anything, I'll take the safe path and always make sure that we have a vocab.
I was thinking a pro of my approach is that if a new model doesn't define get_vocab, the author doesn't need to make any decision regarding this test. However, I understand one could argue that this isn't a pro actually. I mean, with your approach, if someone adds a new model that doesn't define get_vocab, the test will fail and the author will be aware of this situation and will need to explicitly override or disable (define pass empty body) the test. Is this better? Now I'm leaning to think it is, so OK, I will modify the PR as you suggest :)
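A hypothetical sketch of that "explicitly disable" option; the class and mixin names follow the usual transformers test layout but are assumptions here, not code from this PR:

```python
import unittest

# Assumed to live in tests/test_tokenization_common.py in the transformers
# repo; the exact import path is an assumption.
from test_tokenization_common import TokenizerTesterMixin


class ByT5TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    def test_tokenizers_common_ids_setters(self):
        # ByT5 has no conventional vocab, so the common setter test does
        # not apply; a ByT5-specific variant would be written instead.
        pass
```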
Indeed, I completely understand your way of thinking. Canine and ByT5 are two tokenizers that are a bit different from the others, so I think it's good to do it that way! Thanks a lot for your last commit!
This discussion makes me think that it would be nice to have a test for Canine as well, because this tokenizer also defines special tokens.
transformers/src/transformers/models/canine/tokenization_canine.py
Lines 79 to 90 in 70851a6
```python
def __init__(
    self,
    bos_token=chr(CLS),
    eos_token=chr(SEP),
    sep_token=chr(SEP),
    cls_token=chr(CLS),
    pad_token=chr(PAD),
    mask_token=chr(MASK),
    add_prefix_space=False,
    model_max_length=2048,
    **kwargs
):
```
What do you think about it? 😄
Ah yes, I know nothing about Canine, and after seeing SPECIAL_CODEPOINTS I thought of it as a fixed config for special tokens, but I guess that if the setters are there (inherited and not disabled/raising NotImplementedError), they should be tested.
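For illustration, a minimal sketch of what such a Canine test could look like, written as a method on the Canine tokenizer test class; this is an assumption about its shape, not the test that was actually committed. Canine's vocabulary is the Unicode codepoint range, so an ordinary codepoint serves as the test id:

```python
def test_tokenizers_common_ids_setters(self):
    tokenizer = self.get_tokenizer()
    # Any non-special codepoint works as a test value for Canine.
    token_id_to_test_setters = ord("a")

    tokenizer.bos_token_id = token_id_to_test_setters
    # The fixed setter stores convert_ids_to_tokens(id); for Canine that
    # is simply the character with that codepoint.
    self.assertEqual(tokenizer.bos_token, chr(token_id_to_test_setters))
    self.assertEqual(tokenizer.bos_token_id, token_id_to_test_setters)
```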
Thank you very much for the new test! It is perfect!
I should have mentioned it at the same time, but it seems to me that Speech2Text also defines special tokens:
transformers/src/transformers/models/speech_to_text/tokenization_speech_to_text.py
Lines 109 to 123 in f7196f2
```python
def __init__(
    self,
    vocab_file,
    spm_file,
    bos_token="<s>",
    eos_token="</s>",
    pad_token="<pad>",
    unk_token="<unk>",
    do_upper_case=False,
    do_lower_case=False,
    tgt_lang=None,
    lang_codes=None,
    sp_model_kwargs: Optional[Dict[str, Any]] = None,
    **kwargs,
) -> None:
```
Do you think we can also make a (custom) test for it?
😄 I could have realized that too. It was a remnant of the original test, which used token id 0 to test the setters, and a problem that I wrongly attributed to the existence of a fixed vocab 😅. Using get_vocab, the default test was already able to cover this tokenizer.
Thank you very much for fixing the setters, adding the tests, and for our last exchanges! 😄 I think the PR is ready! 🚀
I'll take the liberty of pinging the transformers main maintainers in case they want to do a review too: cc @LysandreJik, @sgugger, @patrickvonplaten or @patil-suraj
Very nice, and awesome for the new test! Thanks a lot!
LGTM, just left one comment. Thanks a lot for working on this :)
```diff
 @bos_token_id.setter
 def bos_token_id(self, value):
-    self._bos_token = self.convert_tokens_to_ids(value)
+    self._bos_token = self.convert_ids_to_tokens(value) if value is not None else None
```
Nice! But wouldn't it be better to let it fail when the value is None? Because if one sets it to None and it doesn't fail, one would assume the token is set, try to use it, and then it'll fail in some later part of the code. It's better to fail as early as possible IMO. wdyt @SaulLu?
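A hypothetical sketch of that fail-early alternative (illustrative only; it is not what this PR merged, which keeps None as a valid "unset" value, as discussed below):

```python
@bos_token_id.setter
def bos_token_id(self, value):
    # Fail-early variant: reject None instead of treating it as "unset".
    if value is None:
        raise ValueError("bos_token_id cannot be set to None")
    self._bos_token = self.convert_ids_to_tokens(value)
```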
From my side, I think that some models don't need to define some or all special tokens (bos, eos, etc.). It seems fine to me that the id can be set to None. 🙂
Besides, we can see in the bos_token_id property (the getter) that a None value is already handled well:
transformers/src/transformers/tokenization_utils_base.py
Lines 1066 to 1074 in 12bfa97
```python
@property
def bos_token_id(self) -> Optional[int]:
    """
    `Optional[int]`: Id of the beginning of sentence token in the vocabulary. Returns `None` if the token has not
    been set.
    """
    if self._bos_token is None:
        return None
    return self.convert_tokens_to_ids(self.bos_token)
```
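So, combined with the fixed setter, None simply round-trips as "unset"; illustratively, for any loaded tokenizer:

```python
tokenizer.bos_token_id = None
assert tokenizer.bos_token is None
assert tokenizer.bos_token_id is None
```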
```python
def test_tokenizers_common_ids_setters(self):
    tokenizers = self.get_tokenizers()
    for tokenizer in tokenizers:
        with self.subTest(f"{tokenizer.__class__.__name__}"):
            attributes_list = [
                "bos_token",
                "eos_token",
                "unk_token",
                "sep_token",
                "pad_token",
                "cls_token",
                "mask_token",
            ]

            vocab = tokenizer.get_vocab()
            token_id_to_test_setters = next(iter(vocab.values()))
            token_to_test_setters = tokenizer.convert_ids_to_tokens(
                token_id_to_test_setters, skip_special_tokens=False
            )

            for attr in attributes_list:
                setattr(tokenizer, attr + "_id", None)
                self.assertEqual(getattr(tokenizer, attr), None)
                self.assertEqual(getattr(tokenizer, attr + "_id"), None)

                setattr(tokenizer, attr + "_id", token_id_to_test_setters)
                self.assertEqual(getattr(tokenizer, attr), token_to_test_setters)
                self.assertEqual(getattr(tokenizer, attr + "_id"), token_id_to_test_setters)

            setattr(tokenizer, "additional_special_tokens_ids", [])
```
Really nice test!
…uggingface#16661)

* Fix setters of *_token_id properties of SpecialTokensMixin
* Test setters of common tokens ids
* Move to a separate test checks of setters of tokens ids
* Add independent test for ByT5
* Add Canine test
* Test speech to text
What does this PR do?
Fixes #16660
@LysandreJik
It seems no one was using those setters because the code is quite old, but here's a fix if you want to merge it.
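For context, an illustrative repro of the bug (the checkpoint name and the id are arbitrary examples; the exact pre-fix failure depends on the tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Before the fix, the *_token_id setters passed the integer id through
# convert_tokens_to_ids, treating the id as if it were a token string,
# so the stored value was wrong (or the call simply errored out).
tokenizer.bos_token_id = 100
print(tokenizer.bos_token)     # with the fix: the token whose id is 100
print(tokenizer.bos_token_id)  # with the fix: 100
```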