For one of my projects I needed a parser that turns shell one-liners into an AST. I tried PLY, pyPEG and a few more, and settled on pyparsing. It’s actively maintained, works without magic and is easy to use.
Ideally I wanted to parse something like:
LANG=en_US.utf-8 git diff | wc -l >> diffs
To something like:
(= LANG en_US.utf-8) (>> (| (git diff) (wc -l)) (diffs))
So let’s start with a simple shell command, which is just a sequence of space-separated tokens:
import pyparsing as pp

token = pp.Word(pp.alphanums + '_-.')
command = pp.OneOrMore(token)

command.parseString('git branch --help')
>>> ['git', 'branch', '--help']
That was simple. Another simple part is parsing environment variables. One environment variable is token=token, and a list of them is separated by spaces:
env = pp.Group(token + '=' + token)

env.parseString('A=B')
>>> [['A', '=', 'B']]

env_list = pp.OneOrMore(env)

env_list.parseString('VAR=test X=1')
>>> [['VAR', '=', 'test'], ['X', '=', '1']]
Now we can easily combine the command and the environment variables. Keep in mind that the environment variables are optional:
command_with_env = pp.Optional(pp.Group(env_list)) + pp.Group(command)

command_with_env.parseString('LOCALE=en_US.utf-8 git diff')
>>> [[['LOCALE', '=', 'en_US.utf-8']], ['git', 'diff']]
Now we need to add support for pipes, redirects and logical operators. Here we don’t need to know what they do, so we’ll treat them simply as separators between commands:
separators = ['1>>', '2>>', '>>', '1>', '2>', '>', '<', '||', '|', '&&', '&', ';']
separator = pp.oneOf(separators)
command_with_separator = pp.OneOrMore(pp.Group(command) + pp.Optional(separator))

command_with_separator.parseString('git diff | wc -l >> out.txt')
>>> [['git', 'diff'], '|', ['wc', '-l'], '>>', ['out.txt']]
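A side note on the order of that list: as far as I understand, pp.oneOf reorders the alternatives so that a longer separator is tried before a shorter one that is its prefix, so '>>' is never cut down to '>'. A hand-written alternation of literals does not do that. Here is a small check; the names naive and safe are made up for this illustration only:

```python
import pyparsing as pp

# Hand-written alternation: alternatives are tried in the order given,
# so '>' matches first and the second '>' is left unparsed.
naive = pp.Literal('>') | pp.Literal('>>')
naive_result = naive.parseString('>>').asList()

# oneOf with the same short-first order still picks the longer separator.
safe = pp.oneOf(['>', '>>', '|', '||'])
safe_result = safe.parseString('>>').asList()

print(naive_result, safe_result)
```

So the separators list could be written in any order, although keeping the longer ones first makes the intent obvious.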
And now we can merge environment variables, commands and separators:
one_liner = pp.Optional(pp.Group(env_list)) + pp.Group(command_with_separator)

one_liner.parseString('LANG=C DEBUG=true git branch | wc -l >> out.txt')
>>> [[['LANG', '=', 'C'], ['DEBUG', '=', 'true']], [['git', 'branch'], '|', ['wc', '-l'], '>>', ['out.txt']]]
The result is hard to process, so we need to give it some structure by naming its parts:
one_liner = pp.Optional(env_list).setResultsName('env') + \
    pp.Group(command_with_separator).setResultsName('command')

result = one_liner.parseString('LANG=C DEBUG=true git branch | wc -l >> out.txt')
print('env:', result.env, '\ncommand:', result.command)
>>> env: [['LANG', '=', 'C'], ['DEBUG', '=', 'true']]
>>> command: [['git', 'branch'], '|', ['wc', '-l'], '>>', ['out.txt']]
We still don’t have an AST, though, just a bunch of grouped tokens. So now we need to transform them into a proper AST:
def prepare_command(command):
    """We don't need to work with pyparsing internal data structures,
    so we just convert them to list.
    """
    for part in command:
        if isinstance(part, str):
            yield part
        else:
            yield list(part)


def separator_position(command):
    """Find last separator position."""
    for n, part in enumerate(command[::-1]):
        if part in separators:
            return len(command) - n - 1


def command_to_ast(command):
    """Recursively transform command to AST."""
    n = separator_position(command)
    if n is None:
        return tuple(command[0])
    else:
        return (command[n],
                command_to_ast(command[:n]),
                command_to_ast(command[n + 1:]))


def to_ast(parsed):
    if parsed.env:
        for env in parsed.env:
            yield ('=', env[0], env[2])
    command = list(prepare_command(parsed.command))
    yield command_to_ast(command)


list(to_ast(result))
>>> [('=', 'LANG', 'C'),
>>>  ('=', 'DEBUG', 'true'),
>>>  ('>>', ('|', ('git', 'branch'),
>>>               ('wc', '-l')),
>>>         ('out.txt',))]
It works. The last part is a bit of glue that makes it easier to use:
def parse(command):
    result = one_liner.parseString(command)
    ast = to_ast(result)
    return list(ast)


parse('LANG=en_US.utf-8 git diff | wc -l >> diffs')
>>> [('=', 'LANG', 'en_US.utf-8'), ('>>', ('|', ('git', 'diff'), ('wc', '-l')), ('diffs',))]
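To compare this with the parenthesised notation from the beginning of the post, the nested tuples can be printed in that form. This is only a sketch: to_sexpr is a helper name I made up here, and the ast value is copied from the output above so that the snippet runs without pyparsing.

```python
def to_sexpr(node):
    """Render a nested tuple of strings as a parenthesised expression."""
    if isinstance(node, str):
        return node
    # Every tuple becomes "(first second ...)", rendered recursively.
    return '(' + ' '.join(to_sexpr(child) for child in node) + ')'


# Copied from the parse() output above.
ast = [
    ('=', 'LANG', 'en_US.utf-8'),
    ('>>', ('|', ('git', 'diff'), ('wc', '-l')), ('diffs',)),
]

rendered = ' '.join(to_sexpr(node) for node in ast)
print(rendered)
# (= LANG en_US.utf-8) (>> (| (git diff) (wc -l)) (diffs))
```

Because command_to_ast splits on the last separator first, the earlier separators end up deeper in the tree, which is why the pipe is nested inside the redirect.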
It can’t parse every one-liner, though. For example, it doesn’t support nested commands like:
echo $(git branch)
echo `git branch`
But it’s enough for my task, and support for the features that aren’t implemented yet can be added easily.
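As a rough idea of what such an extension might look like, here is a sketch of a grammar that accepts the $(...) form by making the command rule recursive with pp.Forward. It is not part of the parser above: the substitution name is hypothetical, it covers only the grammar and not to_ast, and it ignores separators and backticks.

```python
import pyparsing as pp

token = pp.Word(pp.alphanums + '_-.')

# A command may contain substitutions, and a substitution contains a command,
# so the rule is declared first and defined afterwards.
command = pp.Forward()
substitution = pp.Group(pp.Suppress('$(') + command + pp.Suppress(')'))
command <<= pp.OneOrMore(token | substitution)

simple = command.parseString('echo $(git branch)').asList()
nested = command.parseString('echo $(cat $(ls))').asList()
print(simple)
print(nested)
```

Each substitution comes back as a nested list, so a transformation step similar to prepare_command would have to recurse into those lists as well.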