We have talked about PLY several times [1]. In this post I want to explain how lexer states work and how to implement them in PLY. The PLY documentation [2] has a full explanation of lexer states; here I only intend to show a very simple example of lexer states and how to use them from the parser code. The example shows how to skip over the content between two tokens and return it as a single token.
How to build states
It is easy to create states. PLY looks for a variable named states. In this variable we specify the name of each state and its type:
- Inclusive: the state extends the default rules of the lexer, so the tokens defined in the default state can still be used.
- Exclusive: the state replaces the default rules. In this case we need to define an error rule and an ignore rule for the state, because nothing is inherited from the default state.
For instance:
states = (
    ('foo', 'exclusive'),
    ('bar', 'inclusive'),
)
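As a rough mental model of the difference between the two types, here is a toy sketch in plain Python. It is not PLY's implementation: rules_by_state and active_rules are hypothetical names, and the rule names are only illustrative.

```python
# Toy model of which rules are active in each state.
# Hypothetical data and helper, for illustration only (not PLY code).
states = (
    ('foo', 'exclusive'),
    ('bar', 'inclusive'),
)

rules_by_state = {
    'INITIAL': ['NUMBERS', 'foo'],
    'foo': ['END', 'CONTENT'],
    'bar': ['WORD'],
}

def active_rules(state):
    """Return the rule names that can match while the lexer is in `state`."""
    kinds = dict(states)
    own = rules_by_state[state]
    if kinds.get(state) == 'inclusive':
        # Inclusive: the state's own rules plus the default (INITIAL) rules.
        return own + rules_by_state['INITIAL']
    # Exclusive states (and INITIAL itself) use only their own rules.
    return own

print(active_rules('bar'))  # ['WORD', 'NUMBERS', 'foo']
print(active_rules('foo'))  # ['END', 'CONTENT']
```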
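Then we attach rules to a state by putting the state name in the rule name: for example, t_foo_CONTENT defines the token CONTENT inside the state foo.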
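The naming convention can be illustrated with a small helper. This is only a sketch of the convention, assuming the state name comes right after the t_ prefix; split_rule_name is a hypothetical function written for this post, not part of PLY's API.

```python
def split_rule_name(name, state_names):
    """Split a rule name like 't_foo_CONTENT' into (states, token name).

    Hypothetical helper that mimics the naming convention; not PLY code.
    """
    parts = name.split('_')[1:]          # drop the leading 't'
    states = []
    # Leading parts that are state names select the states for the rule,
    # but at least one part must be left over as the token name.
    while len(parts) > 1 and parts[0] in state_names:
        states.append(parts.pop(0))
    return tuple(states) or ('INITIAL',), '_'.join(parts)

print(split_rule_name('t_foo_CONTENT', ['foo']))  # (('foo',), 'CONTENT')
print(split_rule_name('t_NUMBERS', ['foo']))      # (('INITIAL',), 'NUMBERS')
print(split_rule_name('t_foo', ['foo']))          # (('INITIAL',), 'foo')
```

The last call shows why a rule named t_foo can still be a rule of the default state for a token called foo, even when foo is also the name of a state.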
Example
I want to show a very simple example that skips over a block of content and returns it as a single token. Sometimes we need to do this, for example to create a special "ignored content" node in the AST. Note the regular expression of the rule t_foo_CONTENT: it matches everything up to and including the string #end.
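The behaviour of that regular expression can be checked on its own with the standard re module. The sketch below assumes PLY's default of compiling rule patterns in verbose mode (re.VERBOSE), which is why the # has to be escaped; the sample text is the one used in the lexer test.

```python
import re

# Same pattern as t_foo_CONTENT, compiled in verbose mode like PLY does
# by default.
content = re.compile(r'[^$]+\#end', re.VERBOSE)

text = ' 123 123 esto bla bla bla\n#endfoo 123'
print(repr(content.match(text).group()))
# ' 123 123 esto bla bla bla\n#end'

# Without the backslash, verbose mode treats '#end' as a comment and the
# pattern degenerates to '[^$]+', which swallows the whole input.
unescaped = re.compile(r'[^$]+#end', re.VERBOSE)
print(unescaped.match(text).group() == text)  # True
```

Two details are worth keeping in mind: [^$] matches any character except a literal $, newlines included, and the match is greedy, so if the input contains several #end markers the token runs up to the last one.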
Lexer
import ply.lex as lex

states = (
    ('foo', 'exclusive'),
)

tokens = ['foo', 'NUMBERS', 'END', 'CONTENT']

def t_NUMBERS(t):
    r'\d+'
    print('General numbers')
    return t

# Opening 'foo' in the default state: switch to the exclusive state foo.
def t_foo(t):
    r'foo'
    t.lexer.code_start = t.lexer.lexpos
    print('I have detected foo token')
    t.lexer.begin('foo')
    return t

# Closing 'foo' inside the state foo: go back to the default state.
def t_foo_END(t):
    r'foo'
    print('End of state foo')
    t.lexer.begin('INITIAL')
    return t

# Everything up to and including '#end' is returned as one token.
def t_foo_CONTENT(t):
    r'[^$]+\#end'
    return t

t_ignore = ' \t'
t_foo_ignore = ''   # the exclusive state ignores nothing

def t_foo_error(t):
    print('Lexical error: "' + str(t.value[0]) + '" in line ' + str(t.lineno))
    t.lexer.skip(1)

def t_error(t):
    print('Lexical error: "' + str(t.value[0]) + '" in line ' + str(t.lineno))
    t.lexer.skip(1)

def test(data, lexer):
    lexer.input(data)
    while True:
        tok = lexer.token()
        if not tok:
            break
        print(tok)

lexer = lex.lex()

if __name__ == '__main__':
    data = ''' foo 123 123 esto bla bla bla
#endfoo 123'''
    test(data, lexer)
_______________________________________________________
Parser
import ply.yacc as yacc
from lexer import tokens

def p_start(p):
    ''' start : NUMBERS foo prod END '''
    p[0] = p[1]

def p_prod(p):
    ''' prod : CONTENT '''
    p[0] = p[1]

def p_error(p):
    # p is None when the error is found at the end of the input.
    if p is None:
        print('Syntax error at end of input')
    else:
        print('Syntax error at line ' + str(p.lexer.lineno))

parser = yacc.yacc()

if __name__ == '__main__':
    data = ''' 123 foo 321 #endfoo '''
    parser.parse(data)
    data = ''' 123 foo bla bla bla bla #endfoo '''
    parser.parse(data)
_______________________________________________________
When foo is detected, the lexer enters a new state (the foo state) and stays in that state until #endfoo is detected.
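To make the state switch visible without PLY, here is a minimal hand-written scanner in plain Python that mimics the lexer of this post. It is only a sketch under simplifying assumptions: RULES, IGNORE and tokenize are hypothetical names, and the first rule that matches wins.

```python
import re

# Rules per state: (token type, compiled regex, state to switch to or None).
RULES = {
    'INITIAL': [
        ('NUMBERS', re.compile(r'\d+'), None),
        ('foo', re.compile(r'foo'), 'foo'),
    ],
    'foo': [
        ('END', re.compile(r'foo'), 'INITIAL'),
        ('CONTENT', re.compile(r'[^$]+\#end', re.VERBOSE), None),
    ],
}

# Characters skipped between tokens; the exclusive state ignores nothing.
IGNORE = {'INITIAL': ' \t', 'foo': ''}

def tokenize(text):
    """Return a list of (type, value) pairs, switching state on foo."""
    state, pos, out = 'INITIAL', 0, []
    while pos < len(text):
        if text[pos] in IGNORE[state]:
            pos += 1
            continue
        for name, regex, next_state in RULES[state]:
            m = regex.match(text, pos)
            if m:
                out.append((name, m.group()))
                pos = m.end()
                if next_state:
                    state = next_state
                break
        else:
            pos += 1  # no rule matched: skip one character

    return out

for tok in tokenize('123 foo bla bla #endfoo 7'):
    print(tok)
# ('NUMBERS', '123')
# ('foo', 'foo')
# ('CONTENT', ' bla bla #end')
# ('END', 'foo')
# ('NUMBERS', '7')
```

Note that the text after the opening foo is never split into numbers or words: while the scanner is in the foo state only the END and CONTENT rules are tried.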
References
[1] Previous entries for PLY
[2] PLY webpage: http://www.dabeaz.com/ply/