KEMBAR78
Add efficient primitives for str.strip(). by advait-dixit · Pull Request #18742 · python/mypy · GitHub
Skip to content

Conversation

@advait-dixit
Copy link
Contributor

Fixes mypyc/mypyc#1090.

Copying cpython implementation for strip, lstrip and rstrip to str_ops.c.

Copy link
Collaborator

@JukkaL JukkaL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good overall, just a few things I noticed. Can you run some microbenchmarks to make sure the performance impact is as expected?

assert repr("foobar\x00\xab\ud912\U00012345") == r"'foobar\x00«\ud912𒍅'"

[case testStrip]
# This is a negative test. strip variants without args does not use efficient primitives.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment out of date / incorrect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was out of date. Removed it.

s = "xxb2yy"
assert s.lstrip("xy") == "b2yy"
assert s.strip("xy") == "b2"
assert s.rstrip("xy") == "xxb2"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test all string kinds and different character code ranges, such as these (and mixing these):

  • Character codes between 128 and 255 (0x80 to 0xff)
  • Character codes between 256 and 65535 (0x100 to 0xffff)
  • Character codes 65536+ (0x10000+)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#include <Python.h>
#include "CPy.h"

// Copied from cpython.git:Objects/unicodeobject.c.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention the Python version or commit date from where this is from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

BLOOM_MASK sepmask;
Py_ssize_t seplen;

kind = PyUnicode_KIND(self);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to call PyUnicode_READY on self and sepobj (and check the return values) on 3.9 at least, before you can call PyUnicode_KIND etc. You can check the relevant function in Python 3.9 to see how it's used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Copied from do_strip function in cpython.git/Objects/unicodeobject.c.
PyObject *_CPyStr_Strip(PyObject *self, int strip_type, PyObject *sep) {
if (sep == NULL || sep == Py_None) {
Py_ssize_t len, i, j;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, I think you'll need to call PyUnicode_READY.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* Fixing code comments.
* Adding tests with more unicode chars.
* Adding commit ID for code copied from cpython.git.
@advait-dixit
Copy link
Contributor Author

Thanks for the PR! Looks good overall, just a few things I noticed. Can you run some microbenchmarks to make sure the performance impact is as expected?

I have pushed a new commit that addresses review comments.
Below is a micro-benchmark on on my laptop. It compares CPython vs mypyc@this branch vs mypyc@master.

(py313_venv) advait@advaits-mbp mypy % cat prof_strip.py 
def solve1(s: str) -> str:
    for i in range(200):
        s.strip()
    return "s"

(py313_venv) advait@advaits-mbp mypy % cat driver.py 
import prof_strip
import timeit
import random
random_characters = "".join([chr(random.randint(0x0020, 0x10FFFF)) for _ in range(20)])
print(timeit.timeit(lambda: prof_strip.solve1(random_characters), number=100000))

(py313_venv) advait@advaits-mbp mypy % python driver.py                            
0.38529958299477585

(py313_venv) advait@advaits-mbp mypy % mypyc prof_strip.py > /dev/null && python driver.py
0.22940995899261907

(py313_venv) advait@advaits-mbp mypy % rm -rf build *.so && git checkout master && mypyc prof_strip.py > /dev/null && python driver.py
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
0.45345445899874903

@advait-dixit advait-dixit requested a review from JukkaL March 5, 2025 00:08
@cdce8p cdce8p added the topic-mypyc mypyc bugs label Mar 6, 2025
Copy link
Collaborator

@JukkaL JukkaL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updates! The performance improvement is great.

@JukkaL JukkaL merged commit a4313e4 into python:master Mar 10, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

topic-mypyc mypyc bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add primitives for strip(), rstrip() and lstrip()

3 participants