lark grammar: How does the escaped string regex work?

2024/9/23 1:29:28

The lark parser predefines some common terminals, including a string. It is defined as follows:

_STRING_INNER: /.*?/
_STRING_ESC_INNER: _STRING_INNER /(?<!\\)(\\\\)*?/
ESCAPED_STRING : "\"" _STRING_ESC_INNER "\""

I do understand _STRING_INNER. I also understand how ESCAPED_STRING is composed. But what I don't really understand is _STRING_ESC_INNER.

If I read the regex correctly, all it says is that whenever I find two consecutive literal backslashes, they must not be preceded by another literal backslash?

How can I combine those two into a single regex?

And wouldn't it be required for the grammar to only allow escaped double quotes in the string data?

Answer

Preliminaries:

  • .*? is a non-greedy match, meaning the fewest possible repetitions of . (any character). This only makes sense when followed by something else. So .*?X on input AAXAAX would match only the AAX part, instead of expanding all the way to the last X.

  • (?<!...) is a "negative look-behind assertion": "Matches if the current position in the string is not preceded by a match for ....". So .*(?<!X)Y would match AY but not XY.
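Both preliminaries can be checked with Python's standard re module. This is a minimal sketch; the variable names are only illustrative, and it uses Python's regex engine, which is assumed here to behave like the one Lark relies on for these two constructs.

```python
import re

# Non-greedy: stop at the first X that lets the overall pattern match.
lazy = re.match(r".*?X", "AAXAAX").group()
# Greedy, for comparison: run on to the last X.
greedy = re.match(r".*X", "AAXAAX").group()

# Negative look-behind: Y is accepted only if the character before it is not X.
matches_ay = re.fullmatch(r".*(?<!X)Y", "AY") is not None
matches_xy = re.fullmatch(r".*(?<!X)Y", "XY") is not None

print(lazy, greedy, matches_ay, matches_xy)
```

The lazy pattern returns AAX while the greedy one returns AAXAAX, and the look-behind pattern accepts AY but rejects XY.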

Applying this to your example:

  • ESCAPED_STRING: The rule says: match ", then _STRING_ESC_INNER, and then " again.

  • _STRING_INNER: Matches the fewest possible repetitions of any character. As said before, this only makes sense when considering the regular expression that comes after it.

  • _STRING_ESC_INNER: We want this to match the shortest possible string that does not contain a closing quote. That is, for an input "abc"xyz", we want to match "abc", instead of also consuming the xyz" part. However, we have to make sure that the " is really a closing quote, in that it should not be itself escaped. So for input "abc\"xyz", we do not want to match only "abc\", because the \" is escaped. We observe that the closing " has to be directly preceded by an even number of \ (with zero being an even number). So " is ok, \\" is ok, \\\\" is ok etc. But as soon as " is preceded by an odd number of \, that means the " is not really a closing quote.

    (\\\\) matches \\. The (?<!\\) says "the position before should not have \". So combined (?<!\\)(\\\\) means "match \\, but only if it is not preceded by \".

    The following *? then takes the fewest possible repetitions of this, which again only makes sense when considering what comes after it: the " from the ESCAPED_STRING rule. (Possible point of confusion: the \" in ESCAPED_STRING refers to a literal " in the actual input we want to match, in the same way that \\\\ refers to \\ in the input.) So (?<!\\)(\\\\)*?\" means "match the shortest run of \\ pairs that is followed by " and not preceded by \". In other words, (?<!\\)(\\\\)*?\" matches only a " that is preceded by an even number of \ (including zero).

    Combined with the preceding _STRING_INNER and the closing " from ESCAPED_STRING, the result is: match up to the first " that is preceded by an even number of \, in other words the first " that is not itself escaped.
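As for combining the pieces into a single regex: a minimal sketch, assuming the three rules are simply concatenated in order (Lark may wrap or compile them differently internally). The names ESCAPED_STRING, first_string and trailing_backslashes below are hypothetical helpers for this illustration, not part of Lark.

```python
import re

# Hand-written concatenation of the grammar pieces:
#   "\""   _STRING_INNER   /(?<!\\)(\\\\)*?/   "\""
ESCAPED_STRING = re.compile(r'"' r'.*?' r'(?<!\\)(\\\\)*?' r'"')

def first_string(text):
    """Return the quoted string at the start of text, or None if there is none."""
    m = ESCAPED_STRING.match(text)
    return m.group() if m else None

def trailing_backslashes(token):
    """Count the backslashes directly before the closing quote of token."""
    body = token[:-1]  # drop the closing quote
    return len(body) - len(body.rstrip("\\"))

# Raw strings keep every backslash exactly as typed.
samples = [
    r'"abc"xyz"',      # no backslash: stops at the first quote
    r'"abc\"xyz"',     # one backslash: the quote is escaped, keep going
    r'"abc\\"xyz"',    # two backslashes: the quote closes the string
    r'"abc\\\"xyz"',   # three backslashes: the quote is escaped again
]
for s in samples:
    token = first_string(s)
    print(s, "->", token, "| backslashes before closing quote:",
          trailing_backslashes(token))
```

In every case the count of backslashes directly before the closing quote comes out even (0 or 2 here), which is the parity argument from the explanation above.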

https://en.xdnf.cn/q/71878.html
