tiny tweak to allow BatchEncoding.token_to_char when token doesn't correspond to chars #15901
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Can I follow up on this? Would love to get some feedback -- even if it's about the specific reasons why this PR is not up to standard. Thanks!
LGTM, thanks for the ping! Sorry I didn't pick this PR earlier. @LysandreJik do you want to have a look too?
This looks good to me too!
Sorry for taking so long to review. Could you just rebase? cc @SaulLu for knowledge
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Done! @sgugger @LysandreJik
Thanks again for your contribution!
…rrespond to chars (huggingface#15901)
* tweak to allow BatchEncoding.char_to_token(0)
* update docstring
* remote trailing whitespace
* make fixup
* make value checking for span_indices explicit
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Tagging @n1t0 @thomwolf @sgugger but this PR should be extremely quick to review for anyone.
Problem
BatchEncoding.token_to_char is supposed to return the char spans in the original string. Right now, however, for tokens such as "<s>", "</s>", or "<CLS>" that don't correspond to any chars in the original string, it raises an error: "TypeError: type object argument after * must be an iterable, not NoneType".
Run the following snippet to replicate:
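The original snippet is not preserved in this thread. As a rough stand-in, the sketch below reproduces the same kind of failure without loading a tokenizer. It assumes the method unpacks the backend's result straight into a CharSpan-like named tuple; FakeEncoding and token_to_char_before are hypothetical names made up for illustration, not the library's code.

```python
from collections import namedtuple
from typing import List, Optional, Tuple

# Mirrors the (start, end) shape of a character span.
CharSpan = namedtuple("CharSpan", ["start", "end"])


class FakeEncoding:
    """Hypothetical stand-in for a tokenizer backend encoding."""

    def __init__(self, offsets: List[Optional[Tuple[int, int]]]):
        self._offsets = offsets

    def token_to_chars(self, token_index: int) -> Optional[Tuple[int, int]]:
        # Special tokens have no span in the original string, so this is None.
        return self._offsets[token_index]


def token_to_char_before(encoding: FakeEncoding, token_index: int) -> CharSpan:
    # Assumed pre-fix behaviour: unpack unconditionally, which fails on None.
    return CharSpan(*encoding.token_to_chars(token_index))


# Three tokens, e.g. "<s>", "Hi", "</s>": only the middle one maps to chars.
encoding = FakeEncoding([None, (0, 2), None])

print(token_to_char_before(encoding, 1))

try:
    token_to_char_before(encoding, 0)
except TypeError as err:
    # The exact message wording depends on the Python version.
    print("TypeError:", err)
```

With a real fast tokenizer, the equivalent call would be the token-to-char lookup on the index of a special token in an encoded batch.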
Fix
The solution is to return None instead of raising an error for tokens that don't correspond to any chars in the original string.
P.S. I am lost as to why run_tests_torch failed for the step below. Some help would be appreciated.
if [ -f test_list.txt ]; then
  python -m pytest -n 3 --dist=loadfile -s --make-reports=tests_torch $(cat test_list.txt) | tee tests_output.txt
fi
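For reference, the fix described above (return None when a token has no chars) can be sketched as follows. This is a minimal illustration under the same assumptions as before: FakeEncoding and token_to_char_after are hypothetical names, and only the explicit check on span_indices is taken from the commit list in this thread.

```python
from collections import namedtuple
from typing import List, Optional, Tuple

CharSpan = namedtuple("CharSpan", ["start", "end"])


class FakeEncoding:
    """Hypothetical stand-in for a tokenizer backend encoding."""

    def __init__(self, offsets: List[Optional[Tuple[int, int]]]):
        self._offsets = offsets

    def token_to_chars(self, token_index: int) -> Optional[Tuple[int, int]]:
        return self._offsets[token_index]


def token_to_char_after(encoding: FakeEncoding, token_index: int) -> Optional[CharSpan]:
    # Check the backend result explicitly before unpacking it.
    span_indices = encoding.token_to_chars(token_index)
    return CharSpan(*span_indices) if span_indices is not None else None


encoding = FakeEncoding([None, (0, 2), None])

# Callers can now iterate over every token and skip the ones without chars.
spans = [token_to_char_after(encoding, i) for i in range(3)]
print(spans)
```

Returning None keeps the happy path unchanged, while callers that loop over all token indices no longer need a try/except around special tokens.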