lark grammar: How does the escaped string regex work?

2024/9/23 1:29:28

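The lark parser predefines some common terminals, including a string. It is defined as follows:

    _STRING_INNER: /.*?/
    _STRING_ESC_INNER: _STRING_INNER /(?<!\\)(\\\\)*?/

    ESCAPED_STRING : "\"" _STRING_ESC_INNER "\""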

I do understand _STRING_INNER. I also understand how ESCAPED_STRING is composed. But what I don't really understand is _STRING_ESC_INNER.

If I read the regex correctly, all it says is that whenever I find two consecutive literal backslashes, they must not be preceded by another literal backslash?

How can I combine those two into a single regex?

And wouldn't it be required for the grammar to only allow escaped double quotes in the string data?



  • .*? is a non-greedy match, meaning the smallest possible number of repetitions of . (any symbol). This only makes sense when followed by something else. So .*?X on input AAXAAX would match only the first AAX, instead of expanding all the way to the last X.

  • (?<!...) is a "negative look-behind assertion": "Matches if the current position in the string is not preceded by a match for ....". So .*(?<!X)Y would match AY but not XY.
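Both building blocks can be checked directly with Python's re module, which uses the same syntax. This is only a small sketch: the variable names are made up for illustration, and the inputs are the AAXAAX, AY and XY examples from the two points above.

```python
import re

# Non-greedy: .*? stops at the first X, greedy .* runs on to the last X.
lazy = re.match(r".*?X", "AAXAAX").group(0)
greedy = re.match(r".*X", "AAXAAX").group(0)

# Negative look-behind: Y is only accepted if the symbol before it is not X.
accepted = re.fullmatch(r".*(?<!X)Y", "AY")
rejected = re.fullmatch(r".*(?<!X)Y", "XY")

print(lazy)                  # AAX
print(greedy)                # AAXAAX
print(accepted is not None)  # True
print(rejected is None)      # True
```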

Applying this to your example:

  • ESCAPED_STRING: The rule says: "Match ", then _STRING_ESC_INNER, and then " again".

  • _STRING_INNER: Matches the smallest possible number of repetitions of any symbol. As said before, this only makes sense when considering the regular expression that comes after it.

  • _STRING_ESC_INNER: We want this to match the shortest possible string that does not contain a closing quote. That is, for an input "abc"xyz", we want to match "abc", instead of also consuming the xyz" part. However, we have to make sure that the " is really a closing quote, in that it should not be itself escaped. So for input "abc\"xyz", we do not want to match only "abc\", because the \" is escaped. We observe that the closing " has to be directly preceded by an even number of \ (with zero being an even number). So " is ok, \\" is ok, \\\\" is ok etc. But as soon as " is preceded by an odd number of \, that means the " is not really a closing quote.

    (\\\\) matches \\. The (?<!\\) says "the position before should not have \". So combined (?<!\\)(\\\\) means "match \\, but only if it is not preceded by \".

    The following *? then does the smallest possible number of repetitions of this, which again only makes sense when considering the regular expression that comes after this, which is the " from the ESCAPED_STRING rule (possible point of confusion: the \" in the ESCAPED_STRING refers to a literal " in the actual input we want to match, in the same way that \\\\ refers to \\ in the input). So (?<!\\)(\\\\)*?\" means "match the smallest number of \\ that is followed by " and not preceded by \". In other words, (?<!\\)(\\\\)*?\" matches only a " that is preceded by an even number of \ (including zero).

    Now combining it with the preceding _STRING_INNER, the _STRING_ESC_INNER rule then says: Match up to the first " preceded by an even number of \, so in other words, the first " that is not itself escaped.
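To answer the "single regex" part of the question, the pieces can simply be concatenated. The sketch below does this with plain Python re rather than with lark itself, on the assumption that joining the terminal fragments in order behaves like the ESCAPED_STRING rule described above. The names STRING_INNER, STRING_ESC_INNER, ESCAPED_STRING and first_string are chosen for illustration only.

```python
import re

# The pieces from the explanation above, written as Python raw strings.
STRING_INNER = r'.*?'
STRING_ESC_INNER = STRING_INNER + r'(?<!\\)(\\\\)*?'
ESCAPED_STRING = re.compile(r'"' + STRING_ESC_INNER + r'"')


def first_string(text):
    """Return the string literal at the start of text, or None if there is none."""
    m = ESCAPED_STRING.match(text)
    return m.group(0) if m else None


# Zero backslashes before the second quote: it closes the string.
print(first_string(r'"abc"xyz"'))      # "abc"
# One backslash (odd): the quote is escaped, so matching continues.
print(first_string(r'"abc\"xyz"'))     # "abc\"xyz"
# Two backslashes (even): an escaped backslash, then a real closing quote.
print(first_string(r'"abc\\"xyz"'))    # "abc\\"
# Three backslashes (odd): the quote is escaped again.
print(first_string(r'"abc\\\"xyz"'))   # "abc\\\"xyz"
```

Note that nothing here restricts which characters may follow a backslash. The regex only decides where the string ends, so sequences such as \n or \x pass through unchanged and would have to be validated separately if only escaped double quotes should be allowed.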

