Refactor interpolation (and string and regex) handling in lexer #3770

lydell · 2015-01-03T23:39:05Z

See the commit messages for details.

I’m sorry about the size of the first commit, but I don’t think I could have split it up.

Fixes #3621, #3301, #3394, #3348, #2388, #2321.

vendethiel · 2015-01-03T23:50:07Z

Did not look at the code yet, but thanks for the tests, and great work!

michaelficarra · 2015-01-03T23:59:26Z

src/lexer.coffee

I don't like leaving TODOs around. This isn't a particularly important one either, so let's remove it.

michaelficarra · 2015-01-04T00:30:52Z

LGTM. Very nice job, especially with all the tests. This was one of the most direly needed features. 👍.

@satyr

- Fix jashkenas#3394: Unclosed single-quoted strings (both regular ones and heredocs) used to pass through the lexer, causing a parsing error later, while double-quoted strings caused an error already in the lexing phase. Now both single and double-quoted unclosed strings error out in the lexer (which is the more logical option) with consistent error messages. This also fixes the last comment by @satyr in jashkenas#3301. - Similar to the above, unclosed heregexes also used to pass through the lexer and not error until in the parsing phase, which resulted in confusing error messages. This has been fixed, too. - Fix jashkenas#3348, by adding passing tests. - Fix jashkenas#3529: If a string starts with an interpolation, an empty string is no longer emitted before the interpolation (unless it is needed to coerce the interpolation into a string). - Block comments cannot contain `*/`. Now the error message also shows exactly where the offending `*/`. This improvement might seem unrelated, but I had to touch that code anyway to refactor string and regex related code, and the change was very trivial. Moreover, it's consistent with the next two points. - Regexes cannot start with `*`. Now the error message also shows exactly where the offending `*` is. (It might actually not be exatly at the start in heregexes.) It is a very minor improvement, but it was trivial to add. - Octal escapes in strings are forbidden in CoffeeScript (just like in JavaScript strict mode). However, this used to be the case only for regular strings. Now they are also forbidden in heredocs. Moreover, the errors now point at the offending octal escape. - Invalid regex flags are no longer allowed. This includes repeated modifiers and unknown ones. Moreover, invalid modifiers do not stop a heregex from being matched, which results in better error messages. - Fix jashkenas#3621: `///a#{1}///` compiles to `RegExp("a" + 1)`. So does `RegExp("a#{1}")`. Still, those two code snippets used to generate different tokens, which is a bit weird, but more importantly causes problems for coffeelint (see clutchski/coffeelint#340). This required lots of tests in test/location.coffee to be updated. Note that some updates to those tests are unrelated to this point; some have been updated to be more consistent (I discovered this because the refactored code happened to be seemingly more correct). - Regular regex literals used to erraneously allow newlines to be escaped, causing invalid JavaScript output. This has been fixed. - Heregexes may now be completely empty (`//////`), instead of erroring out with a confusing message. - Fix jashkenas#2388: Heredocs and heregexes used to be lexed simply, which meant that you couldn't nest a heredoc within a heredoc (double-quoted, that is) or a heregex inside a heregex. - Fix jashkenas#2321: If you used division inside interpolation and then a slash later in the string containing that interpolation, the division slash and the latter slash was erraneously matched as a regex. This has been fixed. - Indentation inside interpolations in heredocs no longer affect how much indentation is removed from each line of the heredoc (which is more intuitive). - Whitespace is now correctly trimmed from the start and end of strings in a few edge cases. - Last but not least, the lexing of interpolated strings now seems to be more efficient. For a regular double-quoted string, we used to use a custom function to find the end of it (taking interpolations and interpolations within interpolations etc. into account). Then we used to re-find the interpolations and recursively lex their contents. In effect, the same string was processed twice, or even more in the case of deeper nesting of interpolations. Now the same string is processed just once. - Code duplication between regular strings, heredocs, regular regexes and heregexes has been reduced. - The above two points should result in more easily read code, too.

Previously such errors pointed at the end of the input, which wasn't very helpful. This is also consistent with unclosed strings, where the errors point at the opening quote. Note that this includes unclosed #{ (interpolations).

lydell · 2015-01-04T06:56:53Z

I have fixed all mentioned issues now. Moreover, now it also fixes #3529. Ready for another round of review.

michaelficarra · 2015-01-04T10:14:27Z

LGTM.

epidemian · 2015-01-04T15:33:46Z

This looks very good. Thanks for putting such work on some of the hairy corners of the compiler, @lydell!

Refactor interpolation (and string and regex) handling in lexer

xixixao · 2015-01-05T09:36:34Z

@lydell 🏆

jashkenas · 2015-01-05T16:33:42Z

And let's tag 'em as we close 'em ;)

michaelficarra · 2015-01-05T16:36:33Z

I wasn't sure how to tag a PR that fixes a bug.

jashkenas · 2015-01-05T16:37:42Z

It was a big ol' change/rewrite, so I'm going with "change fixed". If it had just been a single bug or two, then "bug fixed".

vendethiel · 2015-01-05T16:42:21Z

I'm not sure we need the "fixed" tag. I mean, a "bug"-tagged PR without the "fixed" tag would mean the PR added a bug on purpose?

jashkenas · 2015-01-05T16:44:35Z

We need it. It's nice to be able to search through Issues and PR's consistently, and that's harder if we only keep things organized half the time. On occasion, a minor, but unfixable-at-the-time bug might be closed without fixing, and we want to be able to find those as well.

Instead of compiling to `"" + + (+"-");`, `"#{+}-"'` now gives an appropriate error message: [stdin]:1:5: error: unexpected end of interpolation "#{+}-" ^ I got tired of updating the tests in test/location.coffee (which I had enough of in jashkenas#3770), which relies on implementation details (the exact amount of tokens generated for a given string of code) to do their testing, so I refactored them to be less fragile.

Instead of compiling to `"" + + (+"-");`, `"#{+}-"'` now gives an appropriate error message: [stdin]:1:5: error: unexpected end of interpolation "#{+}-" ^ This is done by _always_ (instead of just sometimes) wrapping the interpolations in parentheses in the lexer. Unnecessary parentheses won't be output anyway. I got tired of updating the tests in test/location.coffee (which I had enough of in jashkenas#3770), which relies on implementation details (the exact amount of tokens generated for a given string of code) to do their testing, so I refactored them to be less fragile.

The new lexer was merged from PR made in jashkenas/coffeescript#3770 Also closes clutchski#372 no_unnecessary_double_quotes - Removed Block RegExp Fix that was originally patched in PR clutchski#340. Thanks to the new PR, this is no longer needed. - Shortened function name `containsSingleQuote` to `hasSingleQuote`. - Changed how we detect if we're inside an interpolated string to be more like the way we handle interpolation in space_operators. no_trailing_semicolons rule - Due to the new way tokens are generated, we now check for 'STRING_END' token rather than a 'CALL_END' token space_operators - Removed old code that assumed an interpolated string always begins with empty quotes. Due to new changes in CoffeeScript lexer, we can now use 'STRING_START' to determine the beginning of an interpolated string. - Renamed lintParens, lintCall to trackParens, trackCall respectively

The new lexer was merged from PR made in jashkenas/coffeescript#3770 Also closes clutchski#372 no_unnecessary_double_quotes - Removed Block RegExp Fix that was originally patched in PR clutchski#340. Thanks to the new PR, this is no longer needed - Shortened function name `containsSingleQuote` to `hasSingleQuote` - Changed how we detect if we're inside an interpolated string to be more like the way we handle interpolation in space_operators no_trailing_semicolons rule - Due to the new way tokens are generated, we now check for 'STRING_END' token rather than a 'CALL_END' token space_operators - Removed old code that assumed an interpolated string always begins with empty quotes. Due to new changes in CoffeeScript lexer, we can now use 'STRING_START' to determine the beginning of an interpolated string - Renamed lintParens, lintCall to trackParens, trackCall respectively

michaelficarra reviewed Jan 3, 2015
View reviewed changes

lydell added 2 commits January 4, 2015 07:47

Point "missing )/}/]" errors to the unclosed (/{/[

ae6df88

Previously such errors pointed at the end of the input, which wasn't very helpful. This is also consistent with unclosed strings, where the errors point at the opening quote. Note that this includes unclosed #{ (interpolations).

lydell force-pushed the interpolations branch from c0dcd47 to ae6df88 Compare January 4, 2015 06:53

michaelficarra added a commit that referenced this pull request Jan 4, 2015

Merge pull request #3770 from lydell/interpolations

b70f657

Refactor interpolation (and string and regex) handling in lexer

michaelficarra merged commit b70f657 into jashkenas:master Jan 4, 2015

michaelficarra mentioned this pull request Jan 4, 2015

Bad lexing of triple quoted regex and strings #3301

Closed

lydell deleted the interpolations branch January 4, 2015 20:35

jashkenas added change fixed labels Jan 5, 2015

This was referenced Jan 12, 2015

Fix #1316: Interpolate interpolations safely #3787

Merged

x = y /* z */ #1772

Closed

swang mentioned this pull request Feb 11, 2015

Update to CoffeeScript 1.9.1 clutchski/coffeelint#377

Merged

Refactor interpolation (and string and regex) handling in lexer #3770

Refactor interpolation (and string and regex) handling in lexer #3770

Uh oh!

Conversation

lydell commented Jan 3, 2015

Uh oh!

vendethiel commented Jan 3, 2015

Uh oh!

michaelficarra Jan 3, 2015

Choose a reason for hiding this comment

Uh oh!

michaelficarra commented Jan 4, 2015

Uh oh!

lydell commented Jan 4, 2015

Uh oh!

michaelficarra commented Jan 4, 2015

Uh oh!

epidemian commented Jan 4, 2015

Uh oh!

xixixao commented Jan 5, 2015

Uh oh!

jashkenas commented Jan 5, 2015

Uh oh!

michaelficarra commented Jan 5, 2015

Uh oh!

jashkenas commented Jan 5, 2015

Uh oh!

vendethiel commented Jan 5, 2015

Uh oh!

jashkenas commented Jan 5, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants