MobileBERT tokenizer tests #16896
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thank you very much for working on adding tests for the MobileBert tokenizer! 🚀
I've left 2 comments that suggest doing these tests a little differently. Don't hesitate to tell me what you think!
```python
# disable duplicate incorporation of tests from parent class in this module
BertTokenizationTest.__test__ = False


@require_tokenizers
class MobileBertTokenizationTest(BertTokenizationTest, unittest.TestCase):
```
To facilitate future maintenance of this test, I think it would be easier to just copy and paste Bert's test into this file, so that MobileBertTokenizationTest does not depend on BertTokenizationTest. In transformers - in general - we prefer to copy/paste code rather than create inheritance between models 🙂
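For illustration, a standalone (non-inheriting) version of the class quoted above could look something like the sketch below. The tiny vocabulary and the exact assertions are assumptions borrowed from the pattern of the existing BERT tokenizer tests, not the actual contents of this PR:

```python
import os
import shutil
import tempfile
import unittest

from transformers import MobileBertTokenizer
from transformers.testing_utils import require_tokenizers


@require_tokenizers
class MobileBertTokenizationTest(unittest.TestCase):
    def setUp(self):
        # Write a tiny WordPiece vocabulary to a temporary file so the test
        # does not need to download anything from the Hub.
        vocab_tokens = [
            "[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]",
            "want", "##want", "##ed", "wa", "un", "runn", "##ing", ",",
        ]
        self.tmpdir = tempfile.mkdtemp()
        self.vocab_file = os.path.join(self.tmpdir, "vocab.txt")
        with open(self.vocab_file, "w", encoding="utf-8") as f:
            f.write("\n".join(vocab_tokens) + "\n")

    def tearDown(self):
        # Clean up the temporary vocabulary file.
        shutil.rmtree(self.tmpdir)

    def test_full_tokenizer(self):
        tokenizer = MobileBertTokenizer(self.vocab_file)

        # Lowercasing and accent stripping are on by default, as in BERT.
        tokens = tokenizer.tokenize("UNwant\u00e9d,running")
        self.assertListEqual(tokens, ["un", "##want", "##ed", ",", "runn", "##ing"])
        self.assertListEqual(
            tokenizer.convert_tokens_to_ids(tokens), [9, 6, 7, 12, 10, 11]
        )
```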
Ah, OK. In this case the solution is trivial, of course. Might I ask what the reasoning is behind the preference for copying code rather than minimising duplication, just so I know in future?
I think this part of our documentation should interest you!
For the particular case of MobileBert it's less obvious: today we probably would not have created a MobileBertTokenizer class at all, but would simply have referenced BertTokenizer in the checkpoints' config, if we really didn't want any differences between the two tokenizers.
Now that the MobileBertTokenizer class exists, I think the spirit is rather that MobileBertTokenizer's behaviour should not be constrained by the BERT model.
I would like to ping @LysandreJik - one of the main maintainers of transformers - on this subject, in case he has a different opinion than me, especially because it's a design discussion 🙂
@patrickvonplaten has a recent blog post for this 😄
Perfect, thanks! I especially like points 3+4.
Re: tests - I wonder what we do if a bug is found in the BERT tokenizer, so that the resulting test also gets implemented in e.g. MobileBERT's tests. Where is that provenance annotated? Just in the ancestors? That said, I suppose new BERT tokenizer bugs are unlikely at this point :)
There are things like (only for models/tokenizers, etc. for now, not for tests)

```
# Copied from <predecessor_model>.<function>
```

For example,

```python
# Copied from transformers.models.deberta.modeling_deberta.DebertaLayer with Deberta->DebertaV2
```
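For illustration, here is a sketch of how such a marker sits in a file. The surrounding downstream file is hypothetical, though `whitespace_tokenize` itself mirrors the real helper in `transformers.models.bert.tokenization_bert`; the copy carries a comment pointing back at its source so the repo's copy-consistency tooling (e.g. `make fix-copies`, if I recall correctly) can keep the two definitions in sync:

```python
# Hypothetical downstream file; only the marker convention is the point here.

# Copied from transformers.models.bert.tokenization_bert.whitespace_tokenize
def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens
```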
For tests, I think we haven't really focused on this point (yet).
Thank you very much for the changes. I've left in a comment another change that I think is necessary to test MobileBert's classes.
Also, I think you would have to run the formatting (`make style && make quality`) and commit the changes for all the tests to pass 😄
Obviously - thanks!
Thanks for this, it makes sense. By the way, e.g. …
Could you check this comment and see if it works well? That's my first thought :-)
OK this was fixed in huggingface/doc-builder#207 :)
Yes! All done :)
LGTM! Thank you for your contribution 🤗
Thank you, @leondz!
* unhardcode pretrained model path, make it a class var
* add tests for mobilebert tokenizer
* allow tempfiles for vocab & merge similarity test to autodelete
* add explanatory comments
* remove unused imports, let make style do its.. thing
* remove inheritance and use BERT tok tests for MobileBERT
* Update tests/mobilebert/test_tokenization_mobilebert.py (Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>)
* amend class names, remove unused import, add fix for mobilebert's hub pathname
* unhardcode pretrained model path, make it a class var
* add tests for mobilebert tokenizer
* allow tempfiles for vocab & merge similarity test to autodelete
* add explanatory comments
* remove unused imports, let make style do its.. thing
* remove inheritance and use BERT tok tests for MobileBERT
* Update tests/mobilebert/test_tokenization_mobilebert.py (Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>)
* amend class names, remove unused import, add fix for mobilebert's hub pathname
* amend paths for model tests being in models/ subdir of /tests
* explicitly rm test from prev path

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
What does this PR do?
This PR implements tests for the MobileBERT tokenizer. As MobileBERT uses a copy of the BERT tokenizer, the test inherits from BertTokenizationTest and also checks that the merge & vocab files for these two models are identical.
Contributes fixes to issue #16627
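For reference, a minimal sketch of the kind of vocab-similarity check described above (the checkpoint names and the plain file comparison are assumptions on my part; the test added in this PR may implement it differently):

```python
import filecmp
import tempfile

from transformers import BertTokenizer, MobileBertTokenizer


def vocab_files_match(
    bert_ckpt="bert-base-uncased", mobilebert_ckpt="google/mobilebert-uncased"
):
    # Save each tokenizer's vocab.txt into its own temporary directory,
    # then compare the two files byte for byte.
    with tempfile.TemporaryDirectory() as bert_dir, tempfile.TemporaryDirectory() as mobile_dir:
        (bert_vocab,) = BertTokenizer.from_pretrained(bert_ckpt).save_vocabulary(bert_dir)
        (mobile_vocab,) = MobileBertTokenizer.from_pretrained(mobilebert_ckpt).save_vocabulary(mobile_dir)
        return filecmp.cmp(bert_vocab, mobile_vocab, shallow=False)


if __name__ == "__main__":
    print("identical vocab:", vocab_files_match())
```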
cc. @LysandreJik @SaulLu