KEMBAR78
bpo-38530: Refactor AttributeError suggestions by sweeneyde · Pull Request #25776 · python/cpython · GitHub
Skip to content

Conversation

@sweeneyde
Copy link
Member

@sweeneyde sweeneyde commented May 1, 2021

  • Make case-swaps half the cost of any other edit
  • Refactor Levenshtein code to not use memory allocator, and to bail early on no match.
  • Add comments to Levenshtein distance code
  • Add test cases for Levenshtein distance behind a debug macro
  • Set threshold to (name_size + item_size + 3) * MOVE_COST / 6.
    • Reasoning: similar to difflib.SequenceMatcher.ratio() >= 2/3:
"Multiset Jaccard similarity" >= 2/3
matching letters / total letters >= 2/3
(name_size - distance + item_size - distance) / (name_size + item_size) >= 2/3
1 - (2*distance) / (name_size + item_size) >= 2/3
1/3 >= (2*distance) / (name_size + item_size)
(name_size + item_size) / 6 >= distance
With rounding:
(name_size + item_size + 3) // 6 >= distance

https://bugs.python.org/issue38530

assert(d2 == d4);
}

static void
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pleas remove these functions as we don't normally tests things this way (even with the macros).

I would recommend to either adapt this tests to sole NameError example's in test_exceptions or maybe expose them so they can be used in the _testcapi module.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

And if you don't make the requested changes, you will be put in the comfy chair!

calculate_suggestions(PyObject *dir,
PyObject *name) {
PyObject *name)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here: please remove the tests or move them to the test suite

if (name_str == NULL) {
return NULL;
}
size_t *work_buffer = PyMem_Calloc(name_size, sizeof(size_t));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that as we work with max length strings we can statically allocate a fixed array here, which simplifies the code a lot

@pablogsal pablogsal self-assigned this May 1, 2021
static inline int
substitution_cost(char a, char b)
{
if ((a & 31) != (b & 31)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, encapsule the 31 in a macro or a static constant for readability

@pablogsal
Copy link
Member

@sweeneyde No rush if you don't have the bandwidth to do it but if we get this ready before Monday we can get it into beta 1 :)

@sweeneyde
Copy link
Member Author

I tried modifying MAX_CANDIDATE_ITEMS and MAX_STRING_SIZE to be much larger, and I got these benchmark results:

----- Before this PR -----

STRING_LENGTH = 50
DIR_LENGTH = 8000
0.05950444 0.05835115 0.05945749 0.05820514 0.05914853

STRING_LENGTH = 25
DIR_LENGTH = 8000
0.01657816 0.01678682 0.01645611 0.01621476 0.01597858


----- After this PR -----

STRING_LENGTH = 50
DIR_LENGTH = 8000
0.01344847 0.02069782 0.02076828 0.01686316 0.02023981

STRING_LENGTH = 25
DIR_LENGTH = 8000
0.00470112 0.00391163 0.00810734 0.00343608 0.00528902

Benchmark script:

import random
from test.support import captured_stderr
import sys
from time import perf_counter

STRING_LENGTH = 50
DIR_LENGTH = 8000

def rand_string(n):
    return ''.join(random.choices("abc", k=n))

def bench(repeat=100):
    rand_strings = [rand_string(STRING_LENGTH)
                    for _ in range(DIR_LENGTH)]

    class Class:
        locals().update({h: h for h in rand_strings})

    attr = rand_strings[DIR_LENGTH//2]
    attr = attr[:9] + "x" + attr[10:]

    t0 = perf_counter()
    for _ in range(repeat):
        try:
            getattr(Class, attr)
        except AttributeError as exc:
            with captured_stderr() as err:
                sys.__excepthook__(*sys.exc_info())
    t1 = perf_counter()

    assert "Did you mean" in err.getvalue(), err.getvalue()
    assert attr in err.getvalue()
    return (t1 - t0) / repeat

if __name__ == "__main__":
    print(f"{STRING_LENGTH = }")
    print(f"{DIR_LENGTH = }")
    for _ in range(5):
        print(f"{bench():.8f}", end=" ")
    print()

@pablogsal
Copy link
Member

pablogsal commented May 2, 2021

Thanks, @sweeneyde for the benchmarking. This is great work! Although performance here is not required (because the interpreter is going to finish soon) this is still important in some edge cases like a thread that is constantly raising exceptions (yeah, I know 😉 ) so I am pleased to see that this PR will make it faster.

I guess that if we remove the call to the memory allocator and do stack locations will be slightly faster as well. Although for the average namespace this probably will not be super important. In any case, unless is super expensive I prefer to center our efforts on usability an ensure that the suggestions are the better we can have :)

@sweeneyde
Copy link
Member Author

If that is the priority, I can take a look in the next couple of hours at implementing the "transpositions" variant with a rotation of three matrix rows. Would that be good for this PR or should it be a later PR?

@pablogsal
Copy link
Member

I can take a look in the next couple of hours at implementing the "transpositions" variant with a rotation of three matrix rows. Would that be good for this PR or should it be a later PR?

You mean the gcc version here https://github.com/gcc-mirror/gcc/blob/16e2427f50c208dfe07d07f18009969502c25dc8/gcc/spellcheck.c#L46-L142?

I have implemented that and played a bit with it but I didn't see any cases where is obviously better than the ones we cover already. I was also a bit wary of performance and also the three calls to the allocators also make it a bit messy. I changed them to static arrays of max length and I saw a bit of a speedup in general but I am not sure if it was worth it. It would be great to have some "realistic" examples where that approach is better than the one in this PR or the one that we already have so we can justify the extra complexity.

@sweeneyde
Copy link
Member Author

I increased the range of possible dir()s to check, based on looking at some of the following, particularly numpy:

len(dir(collections)) ~= 36
len(dir(str)) ~= 80
len(dir(__builtins__)) ~= 157 (frequently used here)
len(dir(numpy.ndarray)) ~= 161
len(dir(pandas.DataFrame)) ~= 441
len(dir(scipy)) ~= 601
len(dir(numpy)) ~= 606
len(dir(sympy)) ~= 989

@pablogsal
Copy link
Member

As this is still marked as WIP, tell me when is ready to review and I will do another pass.

@sweeneyde sweeneyde changed the title bpo-38530: WIP: Refactor AttributeError suggestions bpo-38530: Refactor AttributeError suggestions May 2, 2021
@sweeneyde
Copy link
Member Author

I think this should be ready to review again.

self.assertNotIn("'w'", err.getvalue())
self.assertNotIn("'pytho'", err.getvalue())

def test_name_error_suggestions_do_not_trigger_for_too_many_locals(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must find a way to make this test more bearable, maybe using compile() as this is getting a bit wild. :(

Copy link
Member

@pablogsal pablogsal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks, @sweeneyde, this is fantastic work 🚀

I left a comment but we can address that later

@pablogsal
Copy link
Member

Oh, apparently we have some leaks:

❯ ./python -m test test_exceptions test_capi -R :
0:00:00 load avg: 1.41 Run tests sequentially
0:00:00 load avg: 1.41 [1/2] test_exceptions
beginning 9 repetitions
123456789
.........
0:00:10 load avg: 1.51 [2/2] test_capi
beginning 9 repetitions
123456789
.........
test_capi leaked [38, 38, 38, 38] references, sum=152
test_capi leaked [31, 31, 31, 31] memory blocks, sum=124
test_capi failed in 46.7 sec

== Tests result: FAILURE ==

1 test OK.

1 test failed:
    test_capi

Total duration: 57.2 sec
Tests result: FAILUR

Copy link
Member

@pablogsal pablogsal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to find the leaks before we merge

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@pablogsal
Copy link
Member

I fixed the refleaks in ce87e15

Copy link
Member

@pablogsal pablogsal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked for refleaks:

❯ ./python -m test  test_capi -R :
0:00:00 load avg: 1.13 Run tests sequentially
0:00:00 load avg: 1.13 [1/1] test_capi
beginning 9 repetitions
123456789
.........
test_capi passed in 46.1 sec

== Tests result: SUCCESS ==

1 test OK.

Total duration: 46.1 sec
Tests result: SUCCESS

@sweeneyde
Copy link
Member Author

Thanks for the last minute review and fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants