All students need to use lex.py from http://systems.cs.uchicago.edu/ply for their code.
In other implementations, such as lex for C, the input to lex is divided into three sections separated by %% lines:
...definitions...
%%
...rules...
%%
...subroutines...
However, in Python, the lexer is structured quite differently. More on that will be explained in lab 2.
In Python, each token is associated with a regular expression, either by direct assignment, as in the simple case of a plus token:
t_PLUS = r'\+'
or by placing it as the docstring of a function that returns the token, as for a number:
def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print("Integer value too large %s" % t.value)
        t.value = 0
    return t
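To make the two conventions concrete, here is a minimal sketch that uses only the standard re module. It is not how lex.py is implemented; the tokenize helper and the simplified t_NUMBER signature (taking the matched text instead of a token object) are assumptions made for illustration. It shows only that a rule's regular expression can come either from a plain string or from a function's docstring.

```python
import re

# Rule given by direct assignment.
t_PLUS = r'\+'

# Rule given as the docstring of a function (simplified: takes the matched text).
def t_NUMBER(text):
    r'\d+'
    return int(text)

def tokenize(source):
    """Toy scanner (hypothetical helper, not part of lex.py)."""
    rules = [
        ('NUMBER', re.compile(t_NUMBER.__doc__), t_NUMBER),  # regex read from the docstring
        ('PLUS', re.compile(t_PLUS), None),                   # regex read from the string
    ]
    pos, tokens = 0, []
    while pos < len(source):
        if source[pos].isspace():       # skip whitespace between tokens
            pos += 1
            continue
        for name, pattern, action in rules:
            m = pattern.match(source, pos)
            if m:
                value = action(m.group()) if action else m.group()
                tokens.append((name, value))
                pos = m.end()
                break
        else:
            raise ValueError("Illegal character %r at position %d" % (source[pos], pos))
    return tokens

print(tokenize("12 + 30"))  # [('NUMBER', 12), ('PLUS', '+'), ('NUMBER', 30)]
```

The function form is useful when a token needs processing, such as converting the matched digits to an integer; the string form is enough when the matched text is already the value.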
This lab will look at regular expressions.
Expression | Meaning | Example |
. | Any character except "\n" | |
a,b,... | Non-special characters match themselves | ab;c matches "ab;c" |
[] | Any character in the brackets. ^ negates it when it is the first character. - signifies a range if not the first character. | [abz] matches a single a or b or z; [^a-z] matches anything except lowercase letters. |
* | 0 or more of the preceding pattern | a* - nothing, a, aa, aaa,... |
+ | 1 or more of the preceding pattern | |
? | 0 or 1 of the preceding pattern. | [0-9]? An optional digit |
{n} | n of the preceding pattern. | |
{n,m} | n to m of the preceding pattern. | [a-z]{3,5} All groups of three, four or five letters |
{name} | Refers to a name defined in the definitions section of lex | |
\ | Escape character | \* matches an asterisk |
() | Groups patterns | ([ab]1?)? matches nothing, a, a1, b, b1. |
| | Either the pattern before or the pattern after. | (if)+|5 matches one or more repetitions of "if", or a single 5 |
"..." | Literally what is in the quotes. | "\*" matches a backslash then an asterisk |
^ | If the first character, matches the beginning of the line | |
<> | State in lex | |
A more complete reference is available on the Python web site; see the documentation for the re module. Note that {name}, "..." and <> are lex features and are not part of Python's re syntax.
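Several of the table's examples can be checked directly with Python's re module. The sketch below is only a quick experiment; the matches helper is a hypothetical name, and re.fullmatch (Python 3.4 and later) is used so that the whole string must match the pattern.

```python
import re

def matches(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, test string, expected result) taken from the table's examples
checks = [
    (r'ab;c',       'ab;c',   True),
    (r'[abz]',      'z',      True),
    (r'[^a-z]',     'q',      False),   # q is a lowercase letter
    (r'a*',         '',       True),    # zero repetitions are allowed
    (r'[0-9]?',     '7',      True),
    (r'[a-z]{3,5}', 'ab',     False),   # too short
    (r'[a-z]{3,5}', 'abcd',   True),
    (r'\*',         '*',      True),
    (r'([ab]1?)?',  'b1',     True),
    (r'(if)+|5',    'ififif', True),
    (r'(if)+|5',    '55',     False),   # only a single 5 is allowed
]

for pattern, text, expected in checks:
    result = matches(pattern, text)
    print("%-12s %-10r %s" % (pattern, text, result))
    assert result == expected
```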
We went through how to build up an example in class. We will arrive at a regular expression that recognizes strings such as 45, -34.928, 7e9, and +23.348E-6.
We want to recognize... | Regular Expression |
A digit | [0-9] |
Many digits | [0-9]+ |
An optional sign | [-+]? |
A whole number | [-+]?[0-9]+ |
A fractional number | [-+]?[0-9]*\.[0-9]+ |
Both whole and fractional numbers | [-+]?([0-9]*\.[0-9]+|[0-9]+) |
An exponent | [eE][-+]?[0-9]+ |
A number | [-+]?([0-9]*\.[0-9]+|[0-9]+)([eE][-+]?[0-9]+)? |
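The final pattern can be checked against the sample strings with a short script. This is a sketch under two assumptions: the alternation is wrapped in parentheses so that the optional sign and the optional exponent apply to both the whole and the fractional form, and the fractional form is listed first because Python's re tries alternatives from left to right. The names NUMBER and is_number are illustrative.

```python
import re

# sign? (fraction | whole) exponent?
NUMBER = re.compile(r'[-+]?([0-9]*\.[0-9]+|[0-9]+)([eE][-+]?[0-9]+)?')

def is_number(text):
    """Return True if the entire text is a number under the pattern above."""
    return NUMBER.fullmatch(text) is not None

for sample in ['45', '-34.928', '7e9', '+23.348E-6']:
    print(sample, is_number(sample))    # each of these prints True

for sample in ['e9', '12.', '--3', '']:
    print(repr(sample), is_number(sample))  # each of these prints False
```

Without the parentheses around the alternation, | would split the whole expression in two, so a string such as -34.928 or 7e9 would no longer match in full.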