I'm trying to parse the output of "ypcat -k netgroup". The output consists of many lines in this format:
group1 (host1,user1,domain1) (host2,user2,domain2) (host3,user3,domain3) ...
or sometimes
group2 groupa groupb groupc ...
I first tried using this lark grammar:
import subprocess
from lark import Lark

def getNetgroups():
    parser = Lark(ypcat_grammer)
    res = subprocess.check_output(['ypcat', '-k', 'netgroup']).decode('utf-8')
    print(parser.parse(res).pretty())
ypcat_grammer = r"""
?start: _line+
_line: groupname members NEWLINE
members: (member|groupname)*
member: "(" hostname? "," username? "," domainname? ")"
username: _name
domainname: _name
groupname: _name
hostname: _name
_name: /([a-zA-Z0-9_\.\-]+)/
%import common.WS_INLINE
%import common.NUMBER
%import common.NEWLINE
%ignore WS_INLINE
"""
That took 60 seconds to parse 4000 lines! That seemed far too long, so I wrote a hand-coded parser:
import re
import subprocess
import pandas as pd

# Matches one "(host,user,domain)" triple; any field may be empty.
member = re.compile(r'\(([^,]*),([^,]*),([^,]*)\)')

def parseNetGroups():
    res = subprocess.check_output(['ypcat', '-k', 'netgroup']).decode('utf-8')
    rows = []
    for line in res.splitlines():
        words = line.split()
        if not words:
            continue  # skip blank lines
        groupname = words.pop(0)
        members = []
        for word in words:
            if m := member.match(word):
                members.append((m.group(1), m.group(2), m.group(3)))
            else:
                members.append(word)
        rows.append({'GROUPNAME': groupname, 'MEMBERS': members})
    return pd.DataFrame(rows)
This took 0.8 seconds. What am I doing wrong?
Changing to parser='lalr' reduced the runtime to 3.8s. That's good enough for me.