The lark parser predefines some common terminals, including a string. It is defined as follows:
_STRING_INNER: /.*?/
_STRING_ESC_INNER: _STRING_INNER /(?<!\\)(\\\\)*?/
ESCAPED_STRING : "\"" _STRING_ESC_INNER "\""
I do understand `_STRING_INNER`. I also understand how `ESCAPED_STRING` is composed. But what I don't really understand is `_STRING_ESC_INNER`.
If I read the regex correctly, all it says is that whenever I find two consecutive literal backslashes, they must not be preceded by another literal backslash?
How can I combine those two into a single regex?
And wouldn't it be required for the grammar to only allow escaped double quotes in the string data?
Preliminaries:
`.*?`: Non-greedy match, meaning the smallest possible number of repetitions of `.` (any symbol). This only makes sense when followed by something else. So `.*?X` on input `AAXAAX` would match only the `AAX` part, instead of expanding all the way to the last `X`.
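To see the difference for yourself, here is a minimal sketch using Python's `re` module (Lark is a Python library, so I am assuming the same regex semantics apply). The variable names are only for illustration:

```python
import re

text = "AAXAAX"

# Non-greedy: stop at the first X that lets the pattern succeed.
lazy = re.match(r".*?X", text).group(0)

# Greedy: expand as far as possible, up to the last X.
greedy = re.match(r".*X", text).group(0)

print(lazy)    # AAX
print(greedy)  # AAXAAX
```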
`(?<!...)` is a "negative look-behind assertion": "Matches if the current position in the string is not preceded by a match for `...`." So `.*(?<!X)Y` would match `AY` but not `XY`.
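The look-behind example can be checked the same way. This is just a sketch with Python's `re` module; `fullmatch` is my choice here so that the whole input has to match:

```python
import re

# Y must not be directly preceded by X.
pattern = re.compile(r".*(?<!X)Y")

matches_ay = pattern.fullmatch("AY") is not None
matches_xy = pattern.fullmatch("XY") is not None

print(matches_ay)  # True
print(matches_xy)  # False
```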
Applying this to your example:
`ESCAPED_STRING`: The rule says: "Match `"`, then `_STRING_ESC_INNER`, and then `"` again".
`_STRING_INNER`: Matches the smallest possible number of repetitions of any symbol. As said before, this only makes sense when considering the regular expression that comes after it.
`_STRING_ESC_INNER`: We want this to match the shortest possible string that does not contain a closing quote. That is, for an input `"abc"xyz"`, we want to match `"abc"`, instead of also consuming the `xyz"` part. However, we have to make sure that the `"` is really a closing quote, in that it should not itself be escaped. So for input `"abc\"xyz"`, we do not want to match only `"abc\"`, because the `\"` is escaped. We observe that the closing `"` has to be directly preceded by an even number of `\` (with zero being an even number). So `"` is ok, `\\"` is ok, `\\\\"` is ok, etc. But as soon as `"` is preceded by an odd number of `\`, that means the `"` is not really a closing quote.
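The even/odd rule can be written out without any regex at all. The helper below, `is_closing_quote`, is a hypothetical function I made up for illustration (it is not part of Lark); it simply counts the backslashes directly in front of a quote:

```python
def is_closing_quote(text: str, index: int) -> bool:
    """Return True if the quote at text[index] follows an even number of backslashes."""
    backslashes = 0
    i = index - 1
    # Walk left over the run of backslashes directly before the quote.
    while i >= 0 and text[i] == "\\":
        backslashes += 1
        i -= 1
    return backslashes % 2 == 0


escaped = r'"abc\"xyz"'   # the quote at index 5 is escaped
doubled = r'"abc\\"'      # the quote at index 6 follows an escaped backslash

print(is_closing_quote(escaped, 5))  # False
print(is_closing_quote(escaped, 9))  # True
print(is_closing_quote(doubled, 6))  # True
```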
`(\\\\)` matches `\\`. The `(?<!\\)` says "the position before should not have `\`". So combined, `(?<!\\)(\\\\)` means "match `\\`, but only if it is not preceded by `\`".
The following `*?` then does the smallest possible number of repetitions of this, which again only makes sense when considering the regular expression that comes after this, which is the `"` from the `ESCAPED_STRING` rule (possible point of confusion: the `\"` in the `ESCAPED_STRING` rule refers to a literal `"` in the actual input we want to match, in the same way that `\\\\` refers to `\\` in the input). So `(?<!\\)(\\\\)*?\"` means "match the smallest number of `\\` that is followed by `"` and not preceded by `\`". In other words, `(?<!\\)(\\\\)*?\"` matches only `"` that are preceded by an even number of `\` (including blocks of size 0).
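Here is a quick check of that claim, again as a sketch with Python's `re` module. The name `CLOSING_QUOTE` and the sample list are my own; inside a Python raw string the quote needs no backslash, so the pattern ends in a plain `"`:

```python
import re

# An even-length run of backslashes, not itself preceded by a backslash,
# followed by a double quote.
CLOSING_QUOTE = re.compile(r'(?<!\\)(\\\\)*?"')

samples = [
    ('"', True),        # zero backslashes: closing quote
    (r'\\"', True),     # two backslashes: closing quote
    (r'\"', False),     # one backslash: escaped quote
    (r'\\\"', False),   # three backslashes: escaped quote
]

results = [CLOSING_QUOTE.search(text) is not None for text, _ in samples]
print(results)  # [True, True, False, False]
```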
Now combining it with the preceding `_STRING_INNER`, the `_STRING_ESC_INNER` rule then says: Match the first `"` preceded by an even number of `\`, so in other words, the first `"` that is not itself escaped.
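As for combining everything into a single regex: my assumption is that concatenating the pieces in order (opening quote, `_STRING_INNER`, the look-behind part, closing quote) gives an equivalent pattern. I have not checked how Lark assembles it internally, so treat this as a sketch. `first_string` is a hypothetical helper, not a Lark function:

```python
import re
from typing import Optional

# Assumed single-regex equivalent: '"' + /.*?/ + /(?<!\\)(\\\\)*?/ + '"'
ESCAPED_STRING = re.compile(r'".*?(?<!\\)(\\\\)*?"')


def first_string(text: str) -> Optional[str]:
    """Return the first quoted string found in text, or None."""
    match = ESCAPED_STRING.search(text)
    return match.group(0) if match else None


print(first_string(r'"abc"xyz"'))         # "abc"
print(first_string(r'"abc\"xyz" tail'))   # "abc\"xyz"
print(first_string(r'"abc\\" tail'))      # "abc\\"
```

Note that the escaped quote in the second input does not end the string, while the quote after the doubled backslash in the third input does.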