I need to parse some special data structures. They are in a somewhat C-like format that looks roughly like this:
Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
        Entry("other","variables",1,2.718);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
         );
}
I can think of several ways to go about this. I could 'tokenize' the code using regular expressions. I could read the code one character at a time and use a state machine to construct my data structure. I could strip out the line breaks that follow commas and read the thing line by line. I could write a conversion script that turns this code into executable Python code.
Is there a nice pythonic way to parse files like this?
How would you go about parsing it?
This is more a general question about how to parse strings and not so much about this particular file format.
Using pyparsing (Mark Tolonen, I was just about to click "Submit Post" when your post came through), this is pretty straightforward. See the comments embedded in the code below:
data = """Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
        Entry("other","variables",1,2.718);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
         );
} """

from pyparsing import *

# define basic punctuation and data types
LBRACE,RBRACE,LPAREN,RPAREN,SEMI = map(Suppress,"{}();")
GROUP = Keyword("Group")
ENTRY = Keyword("Entry")

# use parse actions to do parse-time conversion of values
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))

# parses a string enclosed in quotes, but strips off the quotes at parse time
string = QuotedString('"')

# define structure expressions
value = string | real | integer
entry = Group(ENTRY + LPAREN + Group(Optional(delimitedList(value)))) + RPAREN + SEMI

# since Groups can contain Groups, need to use a Forward to define recursive expression
group = Forward()
group << Group(GROUP + LPAREN + string("name") + RPAREN +
               LBRACE + Group(ZeroOrMore(group | entry))("body") + RBRACE)

# ignore C style comments wherever they occur
group.ignore(cStyleComment)

# parse the sample text
result = group.parseString(data)

# print out the tokens as a nice indented list using pprint
from pprint import pprint
pprint(result.asList())
Prints:
[['Group',
  'GroupName',
  [['Group',
    'AnotherGroupName',
    [['Entry', ['some', 'variables', 0, 3.141]],
     ['Entry', ['other', 'variables', 1, 2.718]]]],
   ['Entry', ['linebreaks', 'allowed', 3, 1.4139999999999999]]]]]
(Unfortunately, there may be some confusion since pyparsing defines a "Group" class, for imparting structure to the parsed tokens - note how the value lists in an Entry get grouped because the list expression is enclosed within a pyparsing Group.)
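For comparison, here is a rough sketch of the "tokenize with regular expressions" idea from the question, using only the standard library and a small hand-written recursive-descent parser. The names TOKEN_RE, tokenize and Parser are made up for this illustration, and the token rules are assumptions inferred from the sample alone (strings are assumed to contain no escaped quotes, and only /* ... */ comments are handled). Treat it as a starting point, not a complete grammar for the format.

```python
import re
from pprint import pprint

# One alternative per token type; comments and whitespace are matched so they can be skipped.
TOKEN_RE = re.compile(r'''
    (?P<COMMENT>/\*.*?\*/)       # C-style comment (may span lines)
  | (?P<WS>\s+)                  # whitespace, including line breaks
  | (?P<STRING>"[^"]*")          # double-quoted string, no escapes (assumption)
  | (?P<REAL>[+-]?\d+\.\d*)      # real number, tried before INT
  | (?P<INT>[+-]?\d+)            # integer
  | (?P<NAME>[A-Za-z_]\w*)       # keyword such as Group or Entry
  | (?P<PUNCT>[{}(),;])          # single-character punctuation
''', re.VERBOSE | re.DOTALL)


def tokenize(text):
    """Yield (kind, value) pairs, converting strings and numbers on the way."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind, value = m.lastgroup, m.group()
        if kind in ("COMMENT", "WS"):
            continue
        if kind == "STRING":
            value = value[1:-1]          # strip the surrounding quotes
        elif kind == "REAL":
            value = float(value)
        elif kind == "INT":
            value = int(value)
        yield kind, value


class Parser:
    """Recursive-descent parser producing nested lists like the pyparsing answer."""

    def __init__(self, text):
        self.tokens = list(tokenize(text))
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def expect(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or kind}, got {tok[1]!r}")
        self.pos += 1
        return tok[1]

    def parse_group(self):
        self.expect("NAME", "Group")
        self.expect("PUNCT", "(")
        name = self.expect("STRING")
        self.expect("PUNCT", ")")
        self.expect("PUNCT", "{")
        body = []
        while self.peek() != ("PUNCT", "}"):
            if self.peek() == ("NAME", "Group"):
                body.append(self.parse_group())   # groups nest recursively
            else:
                body.append(self.parse_entry())
        self.expect("PUNCT", "}")
        return ["Group", name, body]

    def parse_entry(self):
        self.expect("NAME", "Entry")
        self.expect("PUNCT", "(")
        values = []
        while self.peek() != ("PUNCT", ")"):
            kind, value = self.peek()
            if kind not in ("STRING", "REAL", "INT"):
                raise SyntaxError(f"expected a value, got {value!r}")
            values.append(value)
            self.pos += 1
            if self.peek() == ("PUNCT", ","):
                self.pos += 1                     # skip the separating comma
        self.expect("PUNCT", ")")
        self.expect("PUNCT", ";")
        return ["Entry", values]


DATA = '''Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
        Entry("other","variables",1,2.718);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
         );
}'''

tree = Parser(DATA).parse_group()
pprint(tree)
```

Unlike result.asList() above, this returns the outer group itself rather than a one-element list wrapping it. It is considerably more code than the pyparsing version for the same result, which is a fair argument for using a parsing library once the format grows beyond a toy.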