KEMBAR78
MobileBERT tokenizer tests by leondz · Pull Request #16896 · huggingface/transformers · GitHub
Skip to content

Conversation

@leondz
Copy link
Contributor

@leondz leondz commented Apr 22, 2022

What does this PR do?

This PR implements tests for MobileBERT. As MobileBERT uses a copy of the BERT Tokenizer, the test inherits from BertTokenizationTest, and also checks the merge & vocab files for these two models are identical.

Contributes fixes to issue #16627

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

cc. @LysandreJik @SaulLu

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Apr 22, 2022

The documentation is not available anymore as the PR was closed or merged.

Copy link
Contributor

@SaulLu SaulLu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for working on adding tests for the MobileBert tokenizer! 🚀:

I've left 2 comments that suggest doing these tests a little differently. Don't hesitate to tell me what you think!

Comment on lines 29 to 34
# disable duplicate incorporation of tests from parent class in this module
BertTokenizationTest.__test__ = False


@require_tokenizers
class MobileBertTokenizationTest(BertTokenizationTest, unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To facilitate future maintenance of this test, I think it would be easier to just copy and paste the Bert's test into this file so that MobileBertTokenizationTest does not depend on BertTokenizationTest. In transformers - in general - we prefer to copy/paste code rather than create inheritances between models 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK. In this case the solution is trivial of course. Might I ask what the reasoning is behind the preference to copy code instead of minimise duplication, just so I know in future?

Copy link
Contributor

@SaulLu SaulLu Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this part of our documentation should interest you!

I think for the particular case of MobileBert it is not obvious because I think that today we would not have created a MobileBertTokenizer class but only referenced the BertTokenizer in the config of the checkpoints if we really don't want to have any changes between the 2 tokenizers.

Now that the MobileBertTokenizer class exists I think the spirit is rather that MobileBertTokenizer's behaviour should not be constrained by the BERT model.

I would like to ping @LysandreJik - one of the main maintainers of transformers - on this subject if he has a different opinion than me on this issue, especially because it's a design discussion 🙂 .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thanks! I especially like points 3+4.

Re: tests - I wonder what we do if a bug is found for the Bert tokenizer, in order to get the resulting test also implemented in eg. MobileBERT's tests. Where is that provenance annotated? Just in the ancestors? That said, I suppose new BERT tokenizer bugs are unlikely at this point :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are things like (only for models/tokenizers, etc. for now, not for tests)

# Copied from <predecessor_model>.<function>

For example,

# Copied from transformers.models.deberta.modeling_deberta.DebertaLayer with Deberta->DebertaV2

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For tests, I think we haven't really focus on this point (yet)

Copy link
Contributor

@SaulLu SaulLu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the changes. I've left in a comment another change that I think is necessary to test MobileBert's classes.

Also, I think you would have to run the formatting (make style && make quality) and commit the changes for all the tests to pass 😄

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
@leondz
Copy link
Contributor Author

leondz commented May 6, 2022

Obviously - thanks!

@ydshieh
Copy link
Collaborator

ydshieh commented May 6, 2022

Hi, @leondz

The main branch has recently merged a PR that changes test folders, like

tests/mobilebert -> tests/models/mobilebert

Could you follow the ideas shown in the instructions in this to incorporate the changes into your working branch. Thank you. (You might need to fix a few import places)

@leondz
Copy link
Contributor Author

leondz commented May 6, 2022

Hi, @leondz

The main branch has recently merged a PR that changes test folders, like

tests/mobilebert -> tests/models/mobilebert

Could you follow the ideas shown in the instructions in this to incorporate the changes into your working branch. Thank you. (You might need to fix a few import places)

Thanks for this, it makes sense. By the way, make fixup seems to adjust content in /examples and /docs in a way that looks mistaken - out of scope for this PR but is that something to be looked at?

e.g.


--- a/docs/source/en/model_doc/bert-generation.mdx
+++ b/docs/source/en/model_doc/bert-generation.mdx
@@ -49,7 +49,7 @@ Usage:
 
 >>> input_ids = tokenizer(
 ...     "This is a long article to summarize", add_special_tokens=False, return_tensors="pt"
-... ).input_ids
+>>> ).input_ids
 >>> labels = tokenizer("This is a short summary", return_tensors="pt").input_ids
 

@ydshieh
Copy link
Collaborator

ydshieh commented May 6, 2022

Could you check this comment

#17008 (comment)

and see if it works well? That's my first thought :-)

@leondz
Copy link
Contributor Author

leondz commented May 6, 2022

By the way, make fixup seems to adjust content in /examples and /docs in a way that looks mistaken - out of scope for this PR but is that something to be looked at?

e.g.


--- a/docs/source/en/model_doc/bert-generation.mdx
+++ b/docs/source/en/model_doc/bert-generation.mdx
@@ -49,7 +49,7 @@ Usage:
 
 >>> input_ids = tokenizer(
 ...     "This is a long article to summarize", add_special_tokens=False, return_tensors="pt"
-... ).input_ids
+>>> ).input_ids
 >>> labels = tokenizer("This is a short summary", return_tensors="pt").input_ids
 

OK this was fixed in huggingface/doc-builder#207 :)

Could you check this comment

#17008 (comment)

and see if it works well? That's my first thought :-)

Yes! All done :)

Copy link
Contributor

@SaulLu SaulLu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for your contribution 🤗

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @leondz!

@LysandreJik LysandreJik merged commit 4a419d4 into huggingface:main May 10, 2022
@leondz leondz deleted the mobilebert-tok-tests branch May 11, 2022 04:30
ArthurZucker pushed a commit to ArthurZucker/transformers that referenced this pull request May 12, 2022
* unhardcode pretrained model path, make it a class var

* add tests for mobilebert tokenizer

* allow tempfiles for vocab & merge similarity test to autodelete

* add explanatory comments

* remove unused imports, let make style do its.. thing

* remove inheritance and use BERT tok tests for MobileBERT

* Update tests/mobilebert/test_tokenization_mobilebert.py

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

* amend class names, remove unused import, add fix for mobilebert's hub pathname

* unhardcode pretrained model path, make it a class var

* add tests for mobilebert tokenizer

* allow tempfiles for vocab & merge similarity test to autodelete

* add explanatory comments

* remove unused imports, let make style do its.. thing

* remove inheritance and use BERT tok tests for MobileBERT

* Update tests/mobilebert/test_tokenization_mobilebert.py

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

* amend class names, remove unused import, add fix for mobilebert's hub pathname

* amend paths for model tests being in models/ subdir of /tests

* explicitly rm test from prev path

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
* unhardcode pretrained model path, make it a class var

* add tests for mobilebert tokenizer

* allow tempfiles for vocab & merge similarity test to autodelete

* add explanatory comments

* remove unused imports, let make style do its.. thing

* remove inheritance and use BERT tok tests for MobileBERT

* Update tests/mobilebert/test_tokenization_mobilebert.py

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

* amend class names, remove unused import, add fix for mobilebert's hub pathname

* unhardcode pretrained model path, make it a class var

* add tests for mobilebert tokenizer

* allow tempfiles for vocab & merge similarity test to autodelete

* add explanatory comments

* remove unused imports, let make style do its.. thing

* remove inheritance and use BERT tok tests for MobileBERT

* Update tests/mobilebert/test_tokenization_mobilebert.py

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

* amend class names, remove unused import, add fix for mobilebert's hub pathname

* amend paths for model tests being in models/ subdir of /tests

* explicitly rm test from prev path

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants