I am using rustc version 1.14.0 and regex version 0.2.1. I found a pattern that makes a non-Unicode bytes::Regex panic on construction:
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/libcore/option.rs:323
This program demonstrates the panic:
extern crate regex;
fn main() {
    let _ = regex::bytes::Regex::new(r"(?-u).[.]\S\w\x00\x02\x03\x05\x06\x08\x0c\x0e\x10\x12\x14\x16\x17\x19\x1a\x1c\x1e\x21\x23\x25\x27\x28\x2a\x2c\x31\x33\x35\x36\x38\x3b\x3d\x3e\x40\x42\x44\x46\x48\x4a\x4c\x4d\x4f\x50\x52\x53\x55\x56\x58\x59\x5b\x5d\x61\x63\x65\x67\x69\x6a\x6c\x6e\x6f\x70\x72\x74\x76\x77\x79\x7c\x7e\x81\x82\x84\x86\x88\x89\x8b\x8c\x8e\x91\x93\x95\x97\x99\x9b\x9d\x9e\xa1\xa3\xa5\xa6\xa8\xaa\xac\xad\xaf\xb0\xb2\xb4\xb5\xb7\xb9\xbb\xbd\xbe\xc0\xc3\xc5\xc7\xc9\xcb\xcc\xce\xcf\xd1\xd3\xd4\xd6\xd7\xd9\xdb\xdc\xde\xe2\xe3\xe5\xe6\xe8\xe9\xeb\xee\xf1\xf3\xf5\xf7\xf9\xfb\xfc\xfe");
}
The pattern consists of 136 unique literal byte values, plus the sub-patterns ., [.], \S, and \w. Here is what I have been able to find out:
- The pattern is close to minimal. Removing any of the 136 literal bytes, or any of the four sub-patterns, avoids the panic. You can add to the pattern anywhere and it still panics.
- Order doesn't matter. I constructed the pattern by starting with a much larger bytes::RegexSetBuilder that panicked, removing as much as I could while still having it panic, and sorting.
- The panic isn't related to backslash escapes. I.e., you can replace \x44 with a literal D and it still panics.
- Some modifications avoid the panic and some do not. For example, changing \x00 to \x01 still panics, but changing \xde to \xdf does not. Changing \w to \b still panics, but changing \w to \d does not.
- Non-Unicode is necessary. If you omit the (?-u) (or unicode(false) when using a RegexBuilder or RegexSetBuilder), then it does not panic.
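The reduction described in the second bullet (remove as much as possible while the panic persists, then sort) can be sketched roughly as below. This is not the exact procedure used for this report. The `minimize` function and the `still_fails` predicate are hypothetical names; in a real run the predicate would presumably build the regex from the remaining elements and report whether construction panicked. A stand-in predicate is used here so the sketch runs without the regex crate.

```rust
/// Greedy one-pass reduction: try dropping each element in turn and keep
/// the removal only if the (hypothetical) `still_fails` predicate still holds.
fn minimize<F>(mut elements: Vec<String>, still_fails: F) -> Vec<String>
where
    F: Fn(&[String]) -> bool,
{
    let mut i = 0;
    while i < elements.len() {
        let removed = elements.remove(i);
        if still_fails(&elements) {
            // The failure survives without this element, so leave it out.
            // `i` now already points at the next candidate.
        } else {
            // This element is needed: put it back and move on.
            elements.insert(i, removed);
            i += 1;
        }
    }
    // Sort the survivors, since order was observed not to matter.
    elements.sort();
    elements
}

fn main() {
    // Stand-in predicate (an assumption for illustration only):
    // the "failure" needs both "a" and "c" to be present.
    let start: Vec<String> = ["d", "c", "b", "a"].iter().map(|s| s.to_string()).collect();
    let minimal = minimize(start, |els| {
        els.iter().any(|e| e.as_str() == "a") && els.iter().any(|e| e.as_str() == "c")
    });
    println!("{:?}", minimal);
}
```

A single pass like this only guarantees that no one element can be removed on its own, which matches the "close to minimal" wording above; it does not prove that no smaller failing set exists.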
The same thing happens with bytes::RegexBuilder, bytes::RegexSet, and bytes::RegexSetBuilder. The 140 necessary elements can be distributed across multiple patterns when using a builder. Here is an example of a bytes::RegexSetBuilder that panics:
extern crate regex;
fn main() {
    let mut rsb = regex::bytes::RegexSetBuilder::new([
        r"\xa3\xd7\x40\x95\x59\xd4\x2a\x86\x93\xaf",
        r"\x16\xa1\x14\x19\x00\x2c\x27\xcc\x10\xcb\xee\xf5\xeb\xfb\xb5\xd9\x46\x25\x23\x38\x36\x35\x56\x31\x4a\x44\x4c\x99\xc7\x9d\x3d\w\xc0\x9b\x3b\x12\xdb\x89",
        r"\x84",
        r"\xcf\x8c",
        r"\x7c\xbd\x97\xfc\x3e\x6c\x79\x7e\xc3\x9e\x5b\x42\xf3\x17\x06\x08\xc5\xac\x05\x53\xe9",
        r"\xdc\xc9\x8b\x1a\x02\x1c\x76\x6a\xd3\xb4\x91\x0c\x1e\x03\x70\x77\x55\x52\x0e",
        r"\xde\xb2\xad.\x8e\x88\xd6\x81\xf9\xb7\xfe\xce\xf7\xb0\xe6",
        r"\xd1\x4d\x72\x6f\x74\x63\x61\x6e\x67\x65\x50\x4f\x33\xe2\xe5\xf1\xe8[.]\xe3",
        r"\xa6\xbe\xb9\xaa\xbb\x28\x69\xa5\x48\xa8",
        r"\S\x5d\x21\x58\x82",
    ].iter());
    rsb.unicode(false);
    match rsb.build() {
        Ok(_) => println!("ok"),
        Err(e) => println!("error {}", e),
    };
}
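Since the elements can be spread over several patterns, one way to generate such a set for experiments is to chunk the element list and concatenate each chunk. The sketch below is only an illustration: `distribute` and `per_pattern` are hypothetical names, and the element list in `main` is a short sample rather than the full 140. The resulting strings would then be handed to a bytes::RegexSetBuilder with unicode(false), as in the program above; that step is left out so the sketch needs only the standard library.

```rust
/// Split `elements` into groups of at most `per_pattern` and join each
/// group into one pattern string.
fn distribute(elements: &[&str], per_pattern: usize) -> Vec<String> {
    // `chunks` panics on a chunk size of zero, so reject it up front.
    assert!(per_pattern > 0, "per_pattern must be at least 1");
    elements
        .chunks(per_pattern)
        .map(|chunk| chunk.concat())
        .collect()
}

fn main() {
    // Short sample: the four sub-patterns plus the first three literal bytes.
    let elements = [".", "[.]", r"\S", r"\w", r"\x00", r"\x02", r"\x03"];
    for pattern in distribute(&elements, 3) {
        println!("{}", pattern);
    }
}
```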
Here are all 140 elements of the pattern in order:
.
[.]
\S
\w
\x00
\x02
\x03
\x05
\x06
\x08
\x0c
\x0e
\x10
\x12
\x14
\x16
\x17
\x19
\x1a
\x1c
\x1e
\x21
\x23
\x25
\x27
\x28
\x2a
\x2c
\x31
\x33
\x35
\x36
\x38
\x3b
\x3d
\x3e
\x40
\x42
\x44
\x46
\x48
\x4a
\x4c
\x4d
\x4f
\x50
\x52
\x53
\x55
\x56
\x58
\x59
\x5b
\x5d
\x61
\x63
\x65
\x67
\x69
\x6a
\x6c
\x6e
\x6f
\x70
\x72
\x74
\x76
\x77
\x79
\x7c
\x7e
\x81
\x82
\x84
\x86
\x88
\x89
\x8b
\x8c
\x8e
\x91
\x93
\x95
\x97
\x99
\x9b
\x9d
\x9e
\xa1
\xa3
\xa5
\xa6
\xa8
\xaa
\xac
\xad
\xaf
\xb0
\xb2
\xb4
\xb5
\xb7
\xb9
\xbb
\xbd
\xbe
\xc0
\xc3
\xc5
\xc7
\xc9
\xcb
\xcc
\xce
\xcf
\xd1
\xd3
\xd4
\xd6
\xd7
\xd9
\xdb
\xdc
\xde
\xe2
\xe3
\xe5
\xe6
\xe8
\xe9
\xeb
\xee
\xf1
\xf3
\xf5
\xf7
\xf9
\xfb
\xfc
\xfe
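To double-check a claim such as "136 unique literal byte values" against a pattern string, a small scanner over the \xNN escapes is enough. The following is a rough sketch, not part of the regex crate: `literal_bytes` is a hypothetical helper, it only recognizes two-digit \xNN escapes, and it treats any other backslash pair as a unit to skip. The sample pattern in `main` is a shortened prefix of the real one.

```rust
use std::collections::BTreeSet;

/// Collect the distinct byte values written as `\xNN` escapes in `pattern`.
fn literal_bytes(pattern: &str) -> BTreeSet<u8> {
    let b = pattern.as_bytes();
    let mut found = BTreeSet::new();
    let mut i = 0;
    while i < b.len() {
        if b[i] == b'\\' && i + 1 < b.len() {
            if b[i + 1] == b'x' && i + 3 < b.len() {
                // Decode the two hex digits by hand; non-hex input yields None.
                let hi = (b[i + 2] as char).to_digit(16);
                let lo = (b[i + 3] as char).to_digit(16);
                if let (Some(h), Some(l)) = (hi, lo) {
                    found.insert((h * 16 + l) as u8);
                    i += 4;
                    continue;
                }
            }
            // Some other escape such as `\S` or `\\`: skip both characters.
            i += 2;
            continue;
        }
        i += 1;
    }
    found
}

fn main() {
    // Shortened sample of the pattern from this report.
    let sample = r"(?-u).[.]\S\w\x00\x02\x03";
    println!("{} unique literal bytes", literal_bytes(sample).len());
}
```

Running the same function over the full pattern from the first program should report the number of distinct escaped bytes it contains.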