KEMBAR78
Utilize last character gap for two-way periodic needles · Issue #119879 · python/cpython · GitHub
Skip to content

Utilize last character gap for two-way periodic needles #119879

@dg-pb

Description

@dg-pb

Feature or enhancement

Proposal:

Two-way algorithm currently does not incorporate Horspool skip rule when needle is periodic. As part of my work this led to difficulties in run-time predictions.

This rule is a significant determinant of run-time and a common factor of 2 algorithms. If this rule was implemented for periodic needles it might be achievable to predict runtimes and implement dynamic adaptive_find.

The image below shows different permutations of string search inputs. X-axis is of the format: needle matching characters + needle non-matching characters. And all cases are define so that they all have a window for 100K iterations. Actual runtimes and predicted.

Period circled in red is where periodic needles are constructed, all of the other periods construct non-periodic periods.

The default algorithm always uses Horspool skip rule and its addition to two-way algorithm makes both algorithms consistent and comparable.

image

Below are run-times for 3 points - 1 from period circled in red and one from each nearby period at the same points within periods.

# Adjacent period
nd = '_' * 3 + '1' * 10 + '_'
hs = '_' * (100_000 + 5)
%timeit hs.find(nd)    # 36.3 µs
# Target Period
nd = '_' + '1' * 10 + '_'
hs = '_' * (100_000 + 3)
%timeit hs.find(nd)    # 388 µs
# Adjacent period on the other side
nd = '1' * 10 + '_'
hs = '_' * (100_000 + 2)
%timeit hs.find(nd)    # 36.7 µs

Here is an example of absence of the rule:

>>> '_111_' in '_____________'
###### Finding "_111_" in "_____________".
split: "_" + "111_"
===== Two-way: "_111_" in "_____________". =====
Needle is periodic.
> "_____________"
> "_111_"
Right half does not match.
> "_____________"
>  "_111_"
Right half does not match.
> "_____________"
>   "_111_"
Right half does not match.
> "_____________"
>    "_111_"
Right half does not match.
> "_____________"
>     "_111_"
Right half does not match.
> "_____________"
>      "_111_"
Right half does not match.
> "_____________"
>       "_111_"
Right half does not match.
> "_____________"
>        "_111_"
Right half does not match.
> "_____________"
>         "_111_"

Finally, same plot after inclusion of the rule for periodic needles:

Screenshot 2024-05-31 at 22 06 44

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

There was no discussion on this particular issue, but this is one of the building blocks to achieve:
https://discuss.python.org/t/optimize-str-find-aka-fastsearch/54483/12

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    performancePerformance or resource usagetype-featureA feature request or enhancement

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions