MobileBERT tokenizer tests #16896
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thank you very much for working on adding tests for the MobileBert tokenizer! 🚀
I've left 2 comments that suggest doing these tests a little differently. Don't hesitate to tell me what you think!
```python
# disable duplicate incorporation of tests from parent class in this module
BertTokenizationTest.__test__ = False


@require_tokenizers
class MobileBertTokenizationTest(BertTokenizationTest, unittest.TestCase):
```
To facilitate future maintenance of this test, I think it would be easier to just copy and paste Bert's test into this file, so that MobileBertTokenizationTest does not depend on BertTokenizationTest. In transformers - in general - we prefer to copy/paste code rather than create inheritance between models 🙂
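For illustration, a standalone (non-inheriting) version of the class quoted above could look something like the sketch below. The tiny vocabulary and the exact assertions are assumptions borrowed from the pattern of the existing BERT tokenizer tests, not the actual contents of this PR:

```python
import os
import shutil
import tempfile
import unittest

from transformers import MobileBertTokenizer
from transformers.testing_utils import require_tokenizers


@require_tokenizers
class MobileBertTokenizationTest(unittest.TestCase):
    def setUp(self):
        # Write a tiny WordPiece vocabulary to a temporary file so the test
        # does not need to download anything from the Hub.
        vocab_tokens = [
            "[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]",
            "want", "##want", "##ed", "wa", "un", "runn", "##ing", ",",
        ]
        self.tmpdir = tempfile.mkdtemp()
        self.vocab_file = os.path.join(self.tmpdir, "vocab.txt")
        with open(self.vocab_file, "w", encoding="utf-8") as f:
            f.write("\n".join(vocab_tokens) + "\n")

    def tearDown(self):
        # Clean up the temporary vocabulary file.
        shutil.rmtree(self.tmpdir)

    def test_full_tokenizer(self):
        tokenizer = MobileBertTokenizer(self.vocab_file)

        # Lowercasing and accent stripping are on by default, as in BERT.
        tokens = tokenizer.tokenize("UNwant\u00e9d,running")
        self.assertListEqual(tokens, ["un", "##want", "##ed", ",", "runn", "##ing"])
        self.assertListEqual(
            tokenizer.convert_tokens_to_ids(tokens), [9, 6, 7, 12, 10, 11]
        )
```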
Ah, OK. In this case the solution is trivial, of course. Might I ask what the reasoning is behind the preference for copying code rather than minimising duplication, just so I know in future?
I think this part of our documentation should interest you!
For the particular case of MobileBert it's less obvious: today we probably would not have created a MobileBertTokenizer class at all, but would simply have referenced BertTokenizer in the checkpoints' config, if we really didn't want any differences between the two tokenizers.
Now that the MobileBertTokenizer class exists, I think the spirit is rather that MobileBertTokenizer's behaviour should not be constrained by the BERT model.
I would like to ping @LysandreJik - one of the main maintainers of transformers - on this subject, in case he has a different opinion than me, especially because it's a design discussion 🙂
@patrickvonplaten has a recent blog post for this 😄
Perfect, thanks! I especially like points 3+4.
Re: tests - I wonder what we do if a bug is found in the BERT tokenizer, so that the resulting test also gets implemented in e.g. MobileBERT's tests. Where is that provenance annotated? Just in the ancestors? That said, I suppose new BERT tokenizer bugs are unlikely at this point :)
There are things like (only for models/tokenizers, etc. for now, not for tests)

```
# Copied from <predecessor_model>.<function>
```

For example,

```python
# Copied from transformers.models.deberta.modeling_deberta.DebertaLayer with Deberta->DebertaV2
```
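For illustration, here is a sketch of how such a marker sits in a file. The surrounding downstream file is hypothetical, though `whitespace_tokenize` itself mirrors the real helper in `transformers.models.bert.tokenization_bert`; the copy carries a comment pointing back at its source so the repo's copy-consistency tooling (e.g. `make fix-copies`, if I recall correctly) can keep the two definitions in sync:

```python
# Hypothetical downstream file; only the marker convention is the point here.

# Copied from transformers.models.bert.tokenization_bert.whitespace_tokenize
def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens
```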
For tests, I think we haven't really focused on this point (yet).
Thank you very much for the changes. I've left in a comment another change that I think is necessary to test MobileBert's classes.
Also, I think you would have to run the formatting (`make style && make quality`) and commit the changes for all the tests to pass 😄
Obviously - thanks!
Thanks for this, it makes sense. By the way, e.g. …
Could you check this comment and see if it works well? That's my first thought :-)
OK this was fixed in huggingface/doc-builder#207 :)
Yes! All done :)
LGTM! Thank you for your contribution 🤗
Thank you, @leondz!
* unhardcode pretrained model path, make it a class var
* add tests for mobilebert tokenizer
* allow tempfiles for vocab & merge similarity test to autodelete
* add explanatory comments
* remove unused imports, let make style do its.. thing
* remove inheritance and use BERT tok tests for MobileBERT
* Update tests/mobilebert/test_tokenization_mobilebert.py (Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>)
* amend class names, remove unused import, add fix for mobilebert's hub pathname
* unhardcode pretrained model path, make it a class var
* add tests for mobilebert tokenizer
* allow tempfiles for vocab & merge similarity test to autodelete
* add explanatory comments
* remove unused imports, let make style do its.. thing
* remove inheritance and use BERT tok tests for MobileBERT
* Update tests/mobilebert/test_tokenization_mobilebert.py (Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>)
* amend class names, remove unused import, add fix for mobilebert's hub pathname
* amend paths for model tests being in models/ subdir of /tests
* explicitly rm test from prev path

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
What does this PR do?
This PR implements tests for the MobileBERT tokenizer. As MobileBERT uses a copy of the BERT tokenizer, the test inherits from BertTokenizationTest and also checks that the merge & vocab files for these two models are identical.
Contributes fixes to issue #16627
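For reference, a minimal sketch of the kind of vocab-similarity check described above (the checkpoint names and the plain file comparison are assumptions on my part; the test added in this PR may implement it differently):

```python
import filecmp
import tempfile

from transformers import BertTokenizer, MobileBertTokenizer


def vocab_files_match(
    bert_ckpt="bert-base-uncased", mobilebert_ckpt="google/mobilebert-uncased"
):
    # Save each tokenizer's vocab.txt into its own temporary directory,
    # then compare the two files byte for byte.
    with tempfile.TemporaryDirectory() as bert_dir, tempfile.TemporaryDirectory() as mobile_dir:
        (bert_vocab,) = BertTokenizer.from_pretrained(bert_ckpt).save_vocabulary(bert_dir)
        (mobile_vocab,) = MobileBertTokenizer.from_pretrained(mobilebert_ckpt).save_vocabulary(mobile_dir)
        return filecmp.cmp(bert_vocab, mobile_vocab, shallow=False)


if __name__ == "__main__":
    print("identical vocab:", vocab_files_match())
```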
cc. @LysandreJik @SaulLu