CPSC 411 - Lab Notes - 01-22

Lexing Via Python

Download lex.py (and other files) from http://systems.cs.uchicago.edu/ply for your code.

This lab walks you through creating and running a Python lexer file. No familiarity with Python is assumed. Familiarity with programming in general and with using computers IS assumed.

Overview of process

Create a directory for your source code for this assignment.
Download the Python lexer (from the link above), then unzip and untar it.
Copy the lex.py file from that directory into your assignment source directory.
Copy the lexing example of your choice (either one of the downloaded files, or calcLexer.py from this website) to the file minlex.py.
Modify minlex.py to recognize only the minisculus tokens.
Test it and make it work properly :)
E-mail it to your lab TA.

See: seven easy steps!

Details of a lexing file for Python

First, we have a two-line header that tells the system which program to execute when this file is run as a script. The path may be /usr/bin/python2.2 or something else on your chosen machine.

#!/usr/bin/python
#

We then use the Python import statement to bring in the lexing module, along with sys, which is used later to read standard input.

import lex, sys

Next, we have to create a tuple of all of the token names and assign it to the variable tokens. In Python, a tuple is written as a comma-separated list of items, normally surrounded by parentheses.

tokens=('PRINT', 'LPAR', 'RPAR', 'SEMICOLON', 
        'NUM', 'ADD', 'SUB', 'MULT', 'DIV')
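
If tuples are new to you, the following short sketch may help. It shows that the commas are what make a tuple, which matters most for a tuple with a single item. The names single and not_a_tuple are made up for this illustration.

```python
# The same token tuple as above.
tokens = ('PRINT', 'LPAR', 'RPAR', 'SEMICOLON',
          'NUM', 'ADD', 'SUB', 'MULT', 'DIV')

single = ('NUM',)       # one-item tuple: the trailing comma is required
not_a_tuple = ('NUM')   # without the comma this is just a string

print(len(tokens))
print(type(single).__name__)
print(type(not_a_tuple).__name__)
print('ADD' in tokens)
```

This only matters if your token list ever shrinks to one name, but it is a common source of confusion.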

Following this are the lexing rules. The name of each rule starts with the characters "t_" and is followed by the token's name, except when the matched characters are simply thrown away. Note that the first rule, t_ignore, does not return a token.

Each rule is either a simple variable to which the regular expression string is assigned, or a function whose regular expression is given as its docstring, on the first line after the def t_TKNNAME(t): line. Python provides docstrings as a convenient form of short documentation for a function; the docstring remains accessible within a running Python program.
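
As a small illustration of why a docstring can carry a regular expression, the sketch below defines a rule-style function and reads its docstring back at run time. It uses only the standard library and does not call lex.py; the way it strips the "t_" prefix is an assumption about the naming convention described above, not a copy of what lex.py does internally.

```python
import re

def t_NUM(t):
    r'\d+'
    return t

# The docstring is an ordinary string, available through __doc__.
pattern = t_NUM.__doc__

# Dropping the leading "t_" from the function name gives the token name.
token_name = t_NUM.__name__[2:]

# The recovered string can be used directly as a regular expression.
matched = re.match(pattern, "42abc").group()

print(pattern)
print(token_name)
print(matched)
```

Any tool that can see the function can therefore recover both the token name and its pattern without the function ever being called.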

When using the variable form (e.g. t_ADD = r'\+'), the token returned will have a type equal to the name of the variable minus the t_ prefix (e.g. 'ADD') and a value of whatever string was matched (e.g. '+'). The default when using a function is the same, but the function may change any of these values, as t_NUM does below when it converts its value to an integer. See the PLY documentation for more detail.

# Ignore whitespace.
t_ignore =    '\n\t '

t_ADD =   r'\+'
t_SUB =   r'-'
t_MULT =  r'\*'
t_DIV =   r'\/'

t_LPAR =   r'\('
t_RPAR =   r'\)'

t_SEMICOLON = r' ; '

t_PRINT =   r' print '

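def t_NUM(t):
    r' \d+ '
    t.value = int(t.value)   # Convert the matched digits to an integer
    return t

def t_error(t):
    print "Illegal character %s" % repr(t.value[0])
    t.skip(1)   # Skip the offending character (newer PLY releases use t.lexer.skip(1))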

The last function, t_error, is called when no other rule matches. Here it reports the offending character and skips past it.
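
To see how these rules behave together, here is a rough stand-in written with only the standard re module. It is a sketch, not lex.py's implementation: the names RULES, IGNORE, MASTER and scan are invented for this example, and rules are simply tried in list order. PLY's documentation describes its patterns as being compiled with re.VERBOSE, which is why the spaces inside r' ; ' and r' print ' are harmless; the sketch assumes the same flag.

```python
import re

# Token name and pattern pairs, written as in the rules above.
RULES = [
    ('PRINT',     r' print '),
    ('NUM',       r' \d+ '),
    ('ADD',       r'\+'),
    ('SUB',       r'-'),
    ('MULT',      r'\*'),
    ('DIV',       r'\/'),
    ('LPAR',      r'\('),
    ('RPAR',      r'\)'),
    ('SEMICOLON', r' ; '),
]
IGNORE = '\n\t '   # same characters as t_ignore

# One combined pattern with a named group per rule. In verbose mode the
# unescaped spaces inside the patterns are not significant.
MASTER = re.compile('|'.join('(?P<%s>%s)' % rule for rule in RULES),
                    re.VERBOSE)

def scan(data):
    """Return (tokens, errors) for the input string."""
    tokens, errors, pos = [], [], 0
    while pos < len(data):
        if data[pos] in IGNORE:
            pos += 1                  # thrown away, no token produced
            continue
        m = MASTER.match(data, pos)
        if m is None:
            errors.append(data[pos])  # the character t_error would report
            pos += 1                  # skip one character and carry on
            continue
        value = m.group()
        if m.lastgroup == 'NUM':
            value = int(value)        # same conversion as t_NUM
        tokens.append((m.lastgroup, value))
        pos = m.end()
    return tokens, errors

toks, errs = scan("print (3 + 4) @ ;")
print(toks)
print(errs)
```

The '@' in the sample input matches no rule, so it ends up in the error list while lexing continues with the characters after it.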

We then initialize the lexing system with a call to lex.lex().

lex.lex()

In this example we now use a standard Python variable __name__ to see if we are being run as the main program. The rest of the code that is indented below it will be run in that case.

if __name__ == "__main__":

We then read all of standard input and put its contents in the variable data.

    data = sys.stdin.read()

We then pass the file contents to the lexing routine.

    lex.input(data)

This is followed by a loop that continues until there are no more tokens. It prints each token's type and, for NUM tokens, also prints the value.

    while 1:
        tok = lex.token()
        if not tok: break      # No more input
        print tok.type,
        if tok.type == "NUM": print "(%s)" % tok.value,
        print
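
The printing loop above can be tried without lex.py by feeding it hand-made tokens. The sketch below is written in Python 3 syntax and uses only the standard library. Token, make_source and dump_tokens are hypothetical names introduced for this illustration; make_source stands in for lex.token() by handing out one token per call and then None.

```python
import io
import sys
from collections import namedtuple

# Hypothetical stand-in for the token objects returned by lex.token().
Token = namedtuple('Token', ['type', 'value'])

def make_source(tokens):
    """Return a callable that gives one token per call, then None."""
    it = iter(tokens)
    return lambda: next(it, None)

def dump_tokens(next_token, out):
    """Write one line per token; NUM tokens also show their value."""
    while True:
        tok = next_token()
        if not tok:
            break                      # No more input
        line = tok.type
        if tok.type == "NUM":
            line += " (%s)" % tok.value
        out.write(line + "\n")

sample = [Token('PRINT', 'print'), Token('NUM', 3), Token('SEMICOLON', ';')]
buffer = io.StringIO()
dump_tokens(make_source(sample), buffer)
output = buffer.getvalue()

# As in the lab code, only write to the terminal when run as a script.
if __name__ == "__main__":
    sys.stdout.write(output)
```

Each token appears on its own line, and only the NUM token is followed by its value in parentheses, which is the format the loop above produces.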


Last modified by Brett Giles
Last modified: Sat Feb 22 15:12:11 MST 2003