How to work with states in PLY

We have talked about PLY several times [1]. In this post I want to explain how to implement and use lexer states in PLY. The PLY documentation [2] covers lexer states in full; here I only intend to show a very simple example of lexer states and how to use them from the parser code. The example captures all the content between two tokens and returns it as a single token instead of tokenizing it.


How to build states

Creating states is easy. PLY looks for a variable named states, in which we declare the name and the type of each state:
  • Inclusive: the state extends the default rules of the lexer, so tokens defined outside the state can still be matched inside it.
  • Exclusive: the state replaces the default rules. In this case the state should define its own error rule and ignore rule, because it does not inherit them from the default rules.
For instance:
states = (
    ('foo', 'exclusive'),
    ('bar', 'inclusive'), 
)
The state name is then used as a prefix in the names of the rules that belong to it, for example t_foo_CONTENT.
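As a rough illustration of how rule names such as t_foo_END are read, the sketch below splits a name into its states and its token name. The helper split_rule_name is hypothetical and written only for this post; it mimics the naming convention described in the PLY documentation and is not PLY's own code.

```python
def split_rule_name(name, states):
    """Split a rule name such as 't_foo_CONTENT' into (states, token name).

    Hypothetical helper for illustration only: it mimics the naming
    convention, it is not part of PLY.
    """
    declared = [state for state, _kind in states]
    parts = name.split('_')[1:]  # drop the leading 't'
    prefix = []
    # Leading parts that are declared state names select the states,
    # but the last part is always kept for the token name.
    while len(parts) > 1 and parts[0] in declared:
        prefix.append(parts.pop(0))
    return (tuple(prefix) or ('INITIAL',), '_'.join(parts))


states = (
    ('foo', 'exclusive'),
    ('bar', 'inclusive'),
)

print(split_rule_name('t_NUMBERS', states))      # (('INITIAL',), 'NUMBERS')
print(split_rule_name('t_foo', states))          # (('INITIAL',), 'foo')
print(split_rule_name('t_foo_END', states))      # (('foo',), 'END')
print(split_rule_name('t_foo_bar_END', states))  # (('foo', 'bar'), 'END')
```

Note the second case: a rule called t_foo is a token named foo in the default INITIAL state, not a rule inside the foo state. The example in this post relies on that.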


Example

I want to show a very simple example that skips over a block of content and returns it as a single token. This is sometimes needed, for example to create a special ignored-content node in the AST. Note the regular expression of the rule t_foo_CONTENT: it matches everything up to and including the string #end, as long as no $ character appears in between.
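The behaviour of that regular expression can be checked with the standard re module alone. This is a minimal sketch that assumes the same pattern as t_foo_CONTENT; the sample strings are made up for the demonstration.

```python
import re

# PLY compiles its rules with re.VERBOSE, where an unescaped '#'
# starts a comment; that is why the rule writes '\#'.
CONTENT = re.compile(r'[^$]+\#end', re.VERBOSE)

text = ' 123 123 esto bla bla bla \n    #endfoo 123'

match = CONTENT.match(text)
print(repr(match.group()))       # ' 123 123 esto bla bla bla \n    #end'
print(repr(text[match.end():]))  # 'foo 123'

# '+' is greedy: with two '#end' markers the match runs to the last one.
greedy = CONTENT.match('a #end b #end c')
print(repr(greedy.group()))      # 'a #end b #end'
```

The token value includes the #end marker itself, and the match also crosses newlines, because [^$] excludes only the $ character.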

Lexer
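#!/usr/bin/env python
# -*- coding: utf-8 -*-

import ply.lex as lex

# 'foo' is an exclusive state: only the t_foo_* rules are active inside it.
states = (
    ('foo', 'exclusive'),
)

tokens = ['foo', 'NUMBERS', 'END', 'CONTENT']

def t_NUMBERS(t):
    r'\d+'
    print('General numbers')
    return t

# Default (INITIAL) state: the 'foo' keyword switches the lexer to state 'foo'.
def t_foo(t):
    r'foo'
    t.lexer.code_start = t.lexer.lexpos
    print('I have detected foo token')
    t.lexer.begin('foo')
    return t

# State 'foo': the 'foo' that follows '#end' switches back to INITIAL.
def t_foo_END(t):
    r'foo'
    print('End of state foo')
    t.lexer.begin('INITIAL')
    return t

# State 'foo': everything up to and including '#end'.
# PLY compiles rules in verbose mode, so '#' must be escaped.
def t_foo_CONTENT(t):
    r'[^$]+\#end'
    return t

t_ignore = ' \t'
# The exclusive state gets its own ignore rule; nothing is skipped inside it.
t_foo_ignore = ''

def t_foo_error(t):
    print('Lexical error: "' + str(t.value[0]) + '" in line ' + str(t.lineno))
    t.lexer.skip(1)

def t_error(t):
    print('Lexical error: "' + str(t.value[0]) + '" in line ' + str(t.lineno))
    t.lexer.skip(1)

def test(data, lexer):
    # Feed the input to the lexer and print every token it produces.
    lexer.input(data)
    while True:
        tok = lexer.token()
        if not tok:
            break
        print(tok)

lexer = lex.lex()

if __name__ == '__main__':

    data = ''' foo  123 123 esto bla bla bla 
    #endfoo 123'''

    test(data, lexer)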
_______________________________________________________ 
Parser 

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import ply.yacc as yacc
# Importing the lexer module also builds the lexer that the parser will use.
from lexer import tokens

def p_start(p):
    ''' start : NUMBERS foo prod END '''
    p[0] = p[1]

def p_prod(p):
    ''' prod : CONTENT '''
    p[0] = p[1]

def p_error(p):
    # p is None when the error is found at the end of the input.
    if p is None:
        print('Syntax error at end of input')
    else:
        print('Syntax error at line ' + str(p.lineno))

parser = yacc.yacc()

if __name__ == '__main__':
    data = ''' 123 foo 321 #endfoo '''
    parser.parse(data)

    data = ''' 123 foo bla bla bla bla #endfoo '''
    parser.parse(data)
_______________________________________________________

When foo is detected, the DFA enters a new state (the foo state) and stays in it until #endfoo is detected.
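To see those state changes token by token without installing PLY, here is a small emulation that uses only the standard re module. It is a sketch under assumptions: the RULES table, the IGNORE table and the trace function are hypothetical names that copy the rules of the lexer above, and the matching loop is a simplification, not PLY's implementation.

```python
import re

# Rules per state, tried in order: (token type, pattern, next state).
# This is a stdlib-only emulation for illustration, not PLY itself.
RULES = {
    'INITIAL': [
        ('NUMBERS', re.compile(r'\d+'), None),
        ('foo', re.compile(r'foo'), 'foo'),
    ],
    'foo': [
        ('END', re.compile(r'foo'), 'INITIAL'),
        ('CONTENT', re.compile(r'[^$]+\#end', re.VERBOSE), None),
    ],
}
IGNORE = {'INITIAL': ' \t', 'foo': ''}


def trace(data):
    """Return (state, token type, value) for each token found in data."""
    state, pos, out = 'INITIAL', 0, []
    while pos < len(data):
        if data[pos] in IGNORE[state]:
            pos += 1
            continue
        for name, pattern, next_state in RULES[state]:
            m = pattern.match(data, pos)
            if m:
                out.append((state, name, m.group()))
                pos = m.end()
                state = next_state or state
                break
        else:
            pos += 1  # no rule matched: skip one character, like t_error
    return out


for step in trace('123 foo bla bla #endfoo 42'):
    print(step)
# ('INITIAL', 'NUMBERS', '123')
# ('INITIAL', 'foo', 'foo')
# ('foo', 'CONTENT', ' bla bla #end')
# ('foo', 'END', 'foo')
# ('INITIAL', 'NUMBERS', '42')
```

The first column is the state in which each token was matched: CONTENT and END are only produced inside the foo state, and numbers are only recognized again once the lexer is back in INITIAL.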


References

[1] Previous entries for PLY
[2] PLY webpage: http://www.dabeaz.com/ply/

 
