-
-
Notifications
You must be signed in to change notification settings - Fork 33.2k
Description
Feature or enhancement
Proposal:
Two-way algorithm currently does not incorporate Horspool skip rule when needle is periodic. As part of my work this led to difficulties in run-time predictions.
This rule is a significant determinant of run-time and a common factor of 2 algorithms. If this rule was implemented for periodic needles it might be achievable to predict runtimes and implement dynamic adaptive_find
.
The image below shows different permutations of string search inputs
. X-axis is of the format: needle matching characters + needle non-matching characters
. And all cases are define so that they all have a window for 100K iterations. Actual runtimes and predicted.
Period circled in red is where periodic needles are constructed, all of the other periods construct non-periodic periods.
The default algorithm always uses Horspool skip rule and its addition to two-way
algorithm makes both algorithms consistent and comparable.
Below are run-times for 3 points - 1 from period circled in red and one from each nearby period at the same points within periods.
# Adjacent period
nd = '_' * 3 + '1' * 10 + '_'
hs = '_' * (100_000 + 5)
%timeit hs.find(nd) # 36.3 µs
# Target Period
nd = '_' + '1' * 10 + '_'
hs = '_' * (100_000 + 3)
%timeit hs.find(nd) # 388 µs
# Adjacent period on the other side
nd = '1' * 10 + '_'
hs = '_' * (100_000 + 2)
%timeit hs.find(nd) # 36.7 µs
Here is an example of absence of the rule:
>>> '_111_' in '_____________'
###### Finding "_111_" in "_____________".
split: "_" + "111_"
===== Two-way: "_111_" in "_____________". =====
Needle is periodic.
> "_____________"
> "_111_"
Right half does not match.
> "_____________"
> "_111_"
Right half does not match.
> "_____________"
> "_111_"
Right half does not match.
> "_____________"
> "_111_"
Right half does not match.
> "_____________"
> "_111_"
Right half does not match.
> "_____________"
> "_111_"
Right half does not match.
> "_____________"
> "_111_"
Right half does not match.
> "_____________"
> "_111_"
Right half does not match.
> "_____________"
> "_111_"
Finally, same plot after inclusion of the rule for periodic needles:

Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
There was no discussion on this particular issue, but this is one of the building blocks to achieve:
https://discuss.python.org/t/optimize-str-find-aka-fastsearch/54483/12