-
Notifications
You must be signed in to change notification settings - Fork 2k
Refactor interpolation (and string and regex) handling in lexer #3770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Did not look at the code yet, but thanks for the tests, and great work! |
src/lexer.coffee
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like leaving TODOs around. This isn't a particularly important one either, so let's remove it.
LGTM. Very nice job, especially with all the tests. This was one of the most direly needed features. 👍. |
- Fix jashkenas#3394: Unclosed single-quoted strings (both regular ones and heredocs) used to pass through the lexer, causing a parsing error later, while double-quoted strings caused an error already in the lexing phase. Now both single and double-quoted unclosed strings error out in the lexer (which is the more logical option) with consistent error messages. This also fixes the last comment by @satyr in jashkenas#3301. - Similar to the above, unclosed heregexes also used to pass through the lexer and not error until in the parsing phase, which resulted in confusing error messages. This has been fixed, too. - Fix jashkenas#3348, by adding passing tests. - Fix jashkenas#3529: If a string starts with an interpolation, an empty string is no longer emitted before the interpolation (unless it is needed to coerce the interpolation into a string). - Block comments cannot contain `*/`. Now the error message also shows exactly where the offending `*/`. This improvement might seem unrelated, but I had to touch that code anyway to refactor string and regex related code, and the change was very trivial. Moreover, it's consistent with the next two points. - Regexes cannot start with `*`. Now the error message also shows exactly where the offending `*` is. (It might actually not be exatly at the start in heregexes.) It is a very minor improvement, but it was trivial to add. - Octal escapes in strings are forbidden in CoffeeScript (just like in JavaScript strict mode). However, this used to be the case only for regular strings. Now they are also forbidden in heredocs. Moreover, the errors now point at the offending octal escape. - Invalid regex flags are no longer allowed. This includes repeated modifiers and unknown ones. Moreover, invalid modifiers do not stop a heregex from being matched, which results in better error messages. - Fix jashkenas#3621: `///a#{1}///` compiles to `RegExp("a" + 1)`. So does `RegExp("a#{1}")`. Still, those two code snippets used to generate different tokens, which is a bit weird, but more importantly causes problems for coffeelint (see clutchski/coffeelint#340). This required lots of tests in test/location.coffee to be updated. Note that some updates to those tests are unrelated to this point; some have been updated to be more consistent (I discovered this because the refactored code happened to be seemingly more correct). - Regular regex literals used to erraneously allow newlines to be escaped, causing invalid JavaScript output. This has been fixed. - Heregexes may now be completely empty (`//////`), instead of erroring out with a confusing message. - Fix jashkenas#2388: Heredocs and heregexes used to be lexed simply, which meant that you couldn't nest a heredoc within a heredoc (double-quoted, that is) or a heregex inside a heregex. - Fix jashkenas#2321: If you used division inside interpolation and then a slash later in the string containing that interpolation, the division slash and the latter slash was erraneously matched as a regex. This has been fixed. - Indentation inside interpolations in heredocs no longer affect how much indentation is removed from each line of the heredoc (which is more intuitive). - Whitespace is now correctly trimmed from the start and end of strings in a few edge cases. - Last but not least, the lexing of interpolated strings now seems to be more efficient. For a regular double-quoted string, we used to use a custom function to find the end of it (taking interpolations and interpolations within interpolations etc. into account). Then we used to re-find the interpolations and recursively lex their contents. In effect, the same string was processed twice, or even more in the case of deeper nesting of interpolations. Now the same string is processed just once. - Code duplication between regular strings, heredocs, regular regexes and heregexes has been reduced. - The above two points should result in more easily read code, too.
Previously such errors pointed at the end of the input, which wasn't very helpful. This is also consistent with unclosed strings, where the errors point at the opening quote. Note that this includes unclosed #{ (interpolations).
I have fixed all mentioned issues now. Moreover, now it also fixes #3529. Ready for another round of review. |
LGTM. |
This looks very good. Thanks for putting such work on some of the hairy corners of the compiler, @lydell! |
Refactor interpolation (and string and regex) handling in lexer
@lydell 🏆 |
And let's tag 'em as we close 'em ;) |
I wasn't sure how to tag a PR that fixes a bug. |
It was a big ol' change/rewrite, so I'm going with "change fixed". If it had just been a single bug or two, then "bug fixed". |
I'm not sure we need the "fixed" tag. I mean, a "bug"-tagged PR without the "fixed" tag would mean the PR added a bug on purpose? |
We need it. It's nice to be able to search through Issues and PR's consistently, and that's harder if we only keep things organized half the time. On occasion, a minor, but unfixable-at-the-time bug might be closed without fixing, and we want to be able to find those as well. |
Instead of compiling to `"" + + (+"-");`, `"#{+}-"'` now gives an appropriate error message: [stdin]:1:5: error: unexpected end of interpolation "#{+}-" ^ I got tired of updating the tests in test/location.coffee (which I had enough of in jashkenas#3770), which relies on implementation details (the exact amount of tokens generated for a given string of code) to do their testing, so I refactored them to be less fragile.
Instead of compiling to `"" + + (+"-");`, `"#{+}-"'` now gives an appropriate error message: [stdin]:1:5: error: unexpected end of interpolation "#{+}-" ^ This is done by _always_ (instead of just sometimes) wrapping the interpolations in parentheses in the lexer. Unnecessary parentheses won't be output anyway. I got tired of updating the tests in test/location.coffee (which I had enough of in jashkenas#3770), which relies on implementation details (the exact amount of tokens generated for a given string of code) to do their testing, so I refactored them to be less fragile.
The new lexer was merged from PR made in jashkenas/coffeescript#3770 Also closes clutchski#372 no_unnecessary_double_quotes - Removed Block RegExp Fix that was originally patched in PR clutchski#340. Thanks to the new PR, this is no longer needed. - Shortened function name `containsSingleQuote` to `hasSingleQuote`. - Changed how we detect if we're inside an interpolated string to be more like the way we handle interpolation in space_operators. no_trailing_semicolons rule - Due to the new way tokens are generated, we now check for 'STRING_END' token rather than a 'CALL_END' token space_operators - Removed old code that assumed an interpolated string always begins with empty quotes. Due to new changes in CoffeeScript lexer, we can now use 'STRING_START' to determine the beginning of an interpolated string. - Renamed lintParens, lintCall to trackParens, trackCall respectively
The new lexer was merged from PR made in jashkenas/coffeescript#3770 Also closes clutchski#372 no_unnecessary_double_quotes - Removed Block RegExp Fix that was originally patched in PR clutchski#340. Thanks to the new PR, this is no longer needed - Shortened function name `containsSingleQuote` to `hasSingleQuote` - Changed how we detect if we're inside an interpolated string to be more like the way we handle interpolation in space_operators no_trailing_semicolons rule - Due to the new way tokens are generated, we now check for 'STRING_END' token rather than a 'CALL_END' token space_operators - Removed old code that assumed an interpolated string always begins with empty quotes. Due to new changes in CoffeeScript lexer, we can now use 'STRING_START' to determine the beginning of an interpolated string - Renamed lintParens, lintCall to trackParens, trackCall respectively
The new lexer was merged from PR made in jashkenas/coffeescript#3770 Also closes clutchski#372 no_unnecessary_double_quotes - Removed Block RegExp Fix that was originally patched in PR clutchski#340. Thanks to the new PR, this is no longer needed - Shortened function name `containsSingleQuote` to `hasSingleQuote` - Changed how we detect if we're inside an interpolated string to be more like the way we handle interpolation in space_operators no_trailing_semicolons rule - Due to the new way tokens are generated, we now check for 'STRING_END' token rather than a 'CALL_END' token space_operators - Removed old code that assumed an interpolated string always begins with empty quotes. Due to new changes in CoffeeScript lexer, we can now use 'STRING_START' to determine the beginning of an interpolated string - Renamed lintParens, lintCall to trackParens, trackCall respectively
See the commit messages for details.
I’m sorry about the size of the first commit, but I don’t think I could have split it up.
Fixes #3621, #3301, #3394, #3348, #2388, #2321.