 
        Winter 2012-2013
Compiler Principles
Lexical Analysis (Scanning)
Mayer Goldberg and Roman Manevich
Ben-Gurion University
General stuff
Topics taught by me
Lexical analysis (scanning)
Syntax analysis (parsing)
…
Dataflow analysis
Register allocation
Slides will be available from the web-site after the lecture
Request: please mute mobiles, tablets, super-cool squeaking devices
Today
Understand role of lexical analysis
Lexical analysis theory
Implementing modern scanner
Role of lexical analysis
First part of compiler front-end
High-level Language (scheme) → Lexical Analysis → Syntax Analysis (Parsing) → AST, Symbol Table, etc. → Inter. Rep. (IR) → Code Generation → Executable Code
Convert stream of characters into stream of tokens
Split text into most basic meaningful strings
Simplify input for syntax analysis
From scanning to parsing
program text: 5 + (7 * x)
Lexical Analyzer
token stream: num + ( num * id )
Grammar:
E → id
E → num
E → E + E
E → E * E
E → ( E )
Parser: valid, or syntax error
Abstract Syntax Tree: +(num, *(num, x))
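The step from program text to token stream can be sketched in a few lines of Python. This is only an illustration: the token kinds follow the grammar above, while the regular expressions and the name `tokenize` are assumptions made for this sketch. It takes the first rule that matches, which is enough for this small token set.

```python
import re

# Token kinds follow the grammar on the slide; the patterns are assumptions.
TOKEN_SPEC = [
    ("num", r"[0-9]+"),
    ("id", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("+", r"\+"),
    ("*", r"\*"),
    ("(", r"\("),
    (")", r"\)"),
]

def tokenize(text):
    """Return the list of token kinds found in text, skipping whitespace."""
    kinds = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for kind, pattern in TOKEN_SPEC:
            match = re.compile(pattern).match(text, pos)
            if match:
                kinds.append(kind)
                pos = match.end()
                break
        else:
            # No rule matched at this position.
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return kinds

print(tokenize("5 + (7 * x)"))
```

The parser then only has to deal with token kinds, not with individual characters or spacing.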
Javascript example
Identify basic units in this code
var currOption = 0;
// Choose content to display in lower pane.
function choose ( id ) {
var menu = ["about-me", "publications", "teaching",
"software", "activities"];
for (i = 0; i < menu.length; i++) {
currOption = menu[i];
var elt = document.getElementById(currOption);
if (currOption == id && elt.style.display == "none") {
elt.style.display = "block";
}
else {
elt.style.display = "none";
}
}
}
Javascript example
Basic units identified in this code: keyword, identifier, operator, numeric literal, string literal, punctuation, whitespace
Scanner output
var currOption = 0;
// Choose content to display in lower pane.
function choose ( id ) {
var menu = ["about-me", "publications",
"teaching", "software", "activities"];
for (i = 0; i < menu.length; i++) {
currOption = menu[i];
var elt = document.getElementById(currOption);
if (currOption == id && elt.style.display == "none") {
elt.style.display = "block";
}
else {
elt.style.display = "none";
}
}
}
Stream of Tokens
LINE: ID(value)
1: VAR
1: ID(currOption)
1: EQ
1: INT_LITERAL(0)
1: SEMI
3: FUNCTION
3: ID(choose)
3: LP
3: ID(id)
3: RP
3: LCB
...
What is a token?
Lexeme – substring of original text constituting an identifiable unit
Identifiers, values, reserved words, …
Record type storing:
Kind
Value (when applicable)
Start-position/end-position
Any information that is useful for the parser
Different for different languages
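A token record along these lines could look as follows in Python. The field names and the 0-based column convention are assumptions of this sketch, not something the slides prescribe.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Token:
    kind: str              # e.g. "ID", "INT_LITERAL", "VAR"
    value: Optional[str]   # lexeme text when applicable, else None
    line: int              # 1-based line number
    start: int             # 0-based column of the first character
    end: int               # column one past the last character

# The identifier in `var currOption = 0;` on line 1.
tok = Token(kind="ID", value="currOption", line=1, start=4, end=14)
print(f"{tok.line}: {tok.kind}({tok.value})")
```

Which extra fields are worth storing depends on what the parser and the error messages need.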
C++ example 1
Splitting text into tokens can be tricky
How should the code below be split?
vector<vector<int>> myVector
Is >> a single operator token, or two tokens >, >?
C++ example 2
vector<vector<int> > myVector
With the space: >, > are two tokens
Example tokens
Type         Examples
Identifier   x, y, z, foo, bar
NUM          42
FLOATNUM     -3.141592654
STRING       “so long, and thanks for all the fish”
LPAREN       (
RPAREN       )
IF           if
…
Separating tokens
Type           Examples
Comments       /* ignore code */
               // ignore until end of line
White spaces   \t \n
Lexemes are recognized but get consumed
rather than transmitted to parser
if
if
i/*comment*/f
Preprocessor directives in C
Type                 Examples
Include directives   #include<foo.h>
Macros               #define THE_ANSWER 42
Designing a scanner
Define each type of lexeme
Reserved words: var, if, for, while
Operators: < = ++
Identifiers: myFunction
Literals: 123 “hello”
Annotations: @SuppressWarnings
But how do we define lexemes of
unbounded length?
Regular expressions
Regular languages refresher
Formal languages
Alphabet = finite set of letters
Word = sequence of letters
Language = set of words
Regular languages defined equivalently by
Regular expressions
Finite-state automata
Regular expressions
Empty string: Є
Letter: a
Concatenation: R1 R2
Union: R1 | R2
Kleene-star: R*
Shorthand: R+ stands for R R*
scope: (R)
Example: (0* 1*) | (1* 0*)
What is this language?
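One way to explore the example is to try a few words against it. The sketch below uses Python's `re` module as a stand-in for the notation on the slide; the helper name `in_language` is an assumption.

```python
import re

# (0* 1*) | (1* 0*) written in Python's regex syntax.
EXAMPLE = re.compile(r"(0*1*)|(1*0*)")

def in_language(word):
    """Return True if the whole word matches the example expression."""
    return EXAMPLE.fullmatch(word) is not None

for word in ["", "0011", "1100", "010", "101"]:
    print(repr(word), in_language(word))
```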
Exercise 1 - Question
Language of Java identifiers
Identifiers start with either an underscore ‘_’
or a letter
Continue with either underscore, letter, or digit
Exercise 1 - Answer
(_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*
Using shorthand macros
First = _|a|b|…|z|A|…|Z
Next = First|0|…|9
R = First Next*
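The First/Next shorthand translates almost directly into Python's `re` syntax. This is a sketch of the simplified definition in this exercise only; the helper name `is_identifier` is an assumption.

```python
import re

FIRST = r"[_a-zA-Z]"        # underscore or letter
NEXT = r"[_a-zA-Z0-9]"      # First, or a digit
IDENTIFIER = re.compile(f"{FIRST}{NEXT}*")

def is_identifier(word):
    """Return True if the whole word matches First Next*."""
    return IDENTIFIER.fullmatch(word) is not None

for word in ["_tmp1", "x", "9lives", ""]:
    print(repr(word), is_identifier(word))
```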
Exercise 2 - Question
Language of rational numbers in decimal
representation (no leading, ending zeros)
0
123.757
.933333
Not 007
Not 0.30
Exercise 2 - Answer
Digit = 1|2|…|9
Digit0 = 0|Digit
Num = Digit Digit0*
Frac = Digit0* Digit
Pos = Num | .Frac | 0.Frac | Num.Frac
PosOrNeg = (Є|-)Pos
R = 0 | PosOrNeg
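The answer can be checked against the examples from the question by transcribing the macros into Python's `re` syntax. This is a sketch; the helper name `is_rational` is an assumption.

```python
import re

DIGIT = r"[1-9]"
DIGIT0 = r"[0-9]"
NUM = f"{DIGIT}{DIGIT0}*"                 # no leading zeros
FRAC = f"{DIGIT0}*{DIGIT}"                # no ending zeros
POS = rf"(?:{NUM}|\.{FRAC}|0\.{FRAC}|{NUM}\.{FRAC})"
R = re.compile(f"0|-?{POS}")              # (Є|-) becomes an optional minus

def is_rational(word):
    """Return True if the whole word matches R."""
    return R.fullmatch(word) is not None

for word in ["0", "123.757", ".933333", "007", "0.30"]:
    print(word, is_rational(word))
```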
Exercise 3 - Question
Equal number of opening and closing parentheses: [^n ]^n = [], [[]], [[[]]], …
Exercise 3 - Answer
Not regular
Context-free
Grammar:
S ::= [] | [S]
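The grammar can be turned into a small recursive recognizer. The sketch below follows the two productions literally; the function name `matches_s` is an assumption. The recursion depth grows with the nesting depth, which hints at why a finite number of states is not enough.

```python
def matches_s(word):
    """Return True if word is derivable from S ::= [] | [S]."""
    # Production S ::= []
    if word == "[]":
        return True
    # Production S ::= [S]: strip one outer pair and recurse on the inside.
    if len(word) > 2 and word[0] == "[" and word[-1] == "]":
        return matches_s(word[1:-1])
    return False

for word in ["[]", "[[[]]]", "[[]", "[][]"]:
    print(word, matches_s(word))
```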
Finite automata
An automaton is defined by states and
transitions
[Figure: automaton diagram showing the start state, transitions labeled a, b, c, and an accepting state]
Automaton running example
Words are read left-to-right
Input word: a b c
[Figure: the automaton reads a, then b, then c, advancing one transition per letter]
word accepted
Word outside of language
Input word: b b c
Missing transition means non-acceptance
[Figure: the same automaton run on b b c]
Exercise - Question
What is the language defined by the automaton below?
[Figure: automaton with transitions labeled a, b, c]
Exercise - Answer
a b* c
Generally: all paths leading to accepting states
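A deterministic automaton for a b* c can be run with a transition table and a single current state. The state names q0, q1, q2 below are assumptions of this sketch; only the language comes from the exercise. A missing table entry plays the role of a missing transition.

```python
# Transition table: (state, letter) -> next state. State names are hypothetical.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",
    ("q1", "c"): "q2",
}
START = "q0"
ACCEPTING = {"q2"}

def accepts(word):
    """Read the word left-to-right; reject on a missing transition."""
    state = START
    for letter in word:
        key = (state, letter)
        if key not in TRANSITIONS:
            return False
        state = TRANSITIONS[key]
    return state in ACCEPTING

for word in ["abc", "ac", "abbbc", "bbc", "ab"]:
    print(word, accepts(word))
```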
Non-deterministic automata
Allow multiple transitions from given state labeled by same letter
[Figure: NFA with transitions labeled a, b, c, including two transitions labeled a]
NFA run example
Input word: a b c
Maintain set of states
Accept word if any of the states in the set is accepting
[Figure: the NFA stepping through the input, with the current set of states highlighted]
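Maintaining a set of states can be sketched as follows. The automaton here is hypothetical (it accepts a b* c or a c* b) and is not a transcription of the diagram; it only illustrates two transitions labeled a leaving the same state.

```python
# (state, letter) -> set of possible next states. This NFA is hypothetical.
NFA = {
    ("s", "a"): {"p", "q"},   # two transitions on the same letter
    ("p", "b"): {"p"},
    ("p", "c"): {"f1"},
    ("q", "c"): {"q"},
    ("q", "b"): {"f2"},
}
START = "s"
ACCEPTING = {"f1", "f2"}

def nfa_accepts(word):
    """Track every state the NFA could be in after each letter."""
    current = {START}
    for letter in word:
        current = {t for s in current for t in NFA.get((s, letter), set())}
    # Accept if any state in the set is accepting.
    return bool(current & ACCEPTING)

for word in ["abc", "acb", "ab", "abb"]:
    print(word, nfa_accepts(word))
```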
NFA+Є automata
Є transitions can “fire” without reading the input
[Figure: automaton with transitions labeled a, b, c and one Є transition]
NFA+Є run example
Input word: a b c
Є transition can non-deterministically take place
Word accepted
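With Є transitions, the set of current states is extended by everything reachable through Є before and after each letter. The automaton below is hypothetical, chosen so that it accepts a b c; using the empty string as the Є label is also an assumption of this sketch.

```python
EPS = ""  # label standing in for Є (an assumption of this sketch)

# Hypothetical NFA+Є: s0 -a-> s1, s1 -b-> s1, s1 -Є-> s2, s2 -c-> s3
NFA = {
    ("s0", "a"): {"s1"},
    ("s1", "b"): {"s1"},
    ("s1", EPS): {"s2"},
    ("s2", "c"): {"s3"},
}
START = "s0"
ACCEPTING = {"s3"}

def closure(states):
    """Add every state reachable through Є transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        state = stack.pop()
        for target in NFA.get((state, EPS), set()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

def accepts(word):
    current = closure({START})
    for letter in word:
        moved = {t for s in current for t in NFA.get((s, letter), set())}
        current = closure(moved)
    return bool(current & ACCEPTING)

print(accepts("abc"), accepts("ac"), accepts("bc"))
```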
Reg-exp vs. automata
Regular expressions are declarative
Offer compact way to define a regular
language by humans
Don’t offer direct way to check whether a
given word is in the language
Automata are operative
Define an algorithm for deciding whether a
given word is in a regular language
Not a natural notation for humans
From reg. exp. to automata
Theorem: there is an algorithm to build an
NFA+Є automaton for any regular
expression
Proof: by induction on the structure of the
regular expression
For each sub-expression R we build an
automaton with exactly one start state and
one accepting state
Start state has no incoming transitions
Accepting state has no outgoing transitions
Base cases
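R = Є
[Figure: automaton for R = Є]
R = a
[Figure: automaton with a single transition labeled a from the start state to the accepting state]
Construction for R1 | R2
[Figure: automaton combining the automata for R1 and R2 as alternatives]
Construction for R1 R2
[Figure: automaton chaining the automaton for R1 into the automaton for R2]
Construction for R*
[Figure: automaton looping around the automaton for R]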
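The inductive construction can be sketched in Python. The wiring below follows the usual Thompson-style scheme and is an assumption about details the figures would show; the tuple encoding of regular expressions and the names `Builder` and `matches` are also hypothetical. Every sub-expression gets one fresh start state and one fresh accepting state, so the number of states is linear in the size of the expression.

```python
import itertools

class Builder:
    def __init__(self):
        self.edges = {}                 # state -> list of (label, target); None is Є
        self._ids = itertools.count()

    def new_state(self):
        state = next(self._ids)
        self.edges[state] = []
        return state

    def edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def build(self, r):
        """Return (start, accept) of an NFA+Є for the regex tuple r."""
        start, accept = self.new_state(), self.new_state()
        kind = r[0]
        if kind == "eps":                       # R = Є
            self.edge(start, None, accept)
        elif kind == "sym":                     # R = a
            self.edge(start, r[1], accept)
        elif kind == "alt":                     # R1 | R2
            for sub in r[1:]:
                s, a = self.build(sub)
                self.edge(start, None, s)
                self.edge(a, None, accept)
        elif kind == "cat":                     # R1 R2
            s1, a1 = self.build(r[1])
            s2, a2 = self.build(r[2])
            self.edge(start, None, s1)
            self.edge(a1, None, s2)
            self.edge(a2, None, accept)
        elif kind == "star":                    # R*
            s, a = self.build(r[1])
            self.edge(start, None, s)
            self.edge(start, None, accept)      # zero repetitions
            self.edge(a, None, s)               # another repetition
            self.edge(a, None, accept)
        else:
            raise ValueError(f"unknown node {kind!r}")
        return start, accept

def closure(edges, states):
    stack, seen = list(states), set(states)
    while stack:
        for label, target in edges[stack.pop()]:
            if label is None and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

def matches(r, word):
    """Build the automaton for r and run it on word."""
    builder = Builder()
    start, accept = builder.build(r)
    current = closure(builder.edges, {start})
    for letter in word:
        moved = {t for s in current for label, t in builder.edges[s] if label == letter}
        current = closure(builder.edges, moved)
    return accept in current

# a b* c as a nested tuple
A_BSTAR_C = ("cat", ("sym", "a"), ("cat", ("star", ("sym", "b")), ("sym", "c")))
print(matches(A_BSTAR_C, "abbc"), matches(A_BSTAR_C, "ab"))
```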
From NFA+Є to DFA
Construction requires O(n) states for a regexp of length n
Running an NFA+Є with n states on a string of length m takes O(m·n²) time
Solution: determinization via subset
construction
Number of states worst-case exponential in n
Running time O(m)
Subset construction
For an NFA+Є with states M={s1,…,sk}
Construct a DFA with one state per set of
states of the corresponding NFA
M’={ [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], …}
Simulate transitions between individual
states for every letter
NFA+Є: s1 --a--> s2, s4 --a--> s7
DFA: [s1,s4] --a--> [s2,s7]
Extend macro states by states reachable via Є transitions
NFA+Є: s1 --Є--> s4
DFA: [s1,s2] becomes [s1,s2,s4]
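A minimal sketch of the subset construction, assuming a small hypothetical NFA+Є (not the one in the figures). Macro states are represented as frozensets, and only macro states reachable from the start are built rather than all subsets.

```python
from collections import deque

EPS = ""  # label standing in for Є (an assumption of this sketch)
NFA = {
    ("s0", "a"): {"s1"},
    ("s1", "b"): {"s1"},
    ("s1", EPS): {"s2"},
    ("s2", "c"): {"s3"},
}
START, ACCEPTING, ALPHABET = "s0", {"s3"}, "abc"

def closure(states):
    """Extend a set of states by everything reachable via Є transitions."""
    stack, seen = list(states), set(states)
    while stack:
        for target in NFA.get((stack.pop(), EPS), set()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

def determinize():
    """Return (start, transitions, accepting) of the DFA over macro states."""
    start = frozenset(closure({START}))
    dfa, seen, queue = {}, {start}, deque([start])
    while queue:
        macro = queue.popleft()
        for letter in ALPHABET:
            # Simulate the transitions of every individual state on this letter.
            moved = {t for s in macro for t in NFA.get((s, letter), set())}
            target = frozenset(closure(moved))
            if not target:
                continue  # leave the transition missing instead of adding []
            dfa[(macro, letter)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    accepting = {m for m in seen if m & ACCEPTING}
    return start, dfa, accepting

def dfa_accepts(word):
    start, dfa, accepting = determinize()
    state = start
    for letter in word:
        state = dfa.get((state, letter))
        if state is None:
            return False
    return state in accepting

print(dfa_accepts("abbc"), dfa_accepts("abca"))
```

Running the resulting DFA takes one table lookup per input letter, which is where the O(m) running time comes from.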
Scanning challenges
Regular expressions allow us to define the
language of all sequences of tokens
Automata theory provides an algorithm for
checking membership of words
But we are interested in splitting the text not
just deciding on membership
How do we determine lexemes?
How do we handle ambiguities – lexemes
matching more than one token?
Separating lexemes
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
[Figure: automaton; from the start state, a-z leads to accepting state ID, which loops on a-z, and 1 leads to accepting state ONE]
Maximal munch
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
Solution: find longest matching lexeme
Keep reading text until automaton leaves accepting state
Return token corresponding to accepting state
Reset – go back to start state and continue reading input from there
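Maximal munch for the ID/ONE example can be sketched as below. The function names and the way the automaton is encoded are assumptions. The sketch remembers the last accepting position it saw, which is a slightly more general form of "keep reading until the automaton leaves the accepting state".

```python
ACCEPTING = {"ID": "ID", "ONE": "ONE"}   # accepting state -> token kind

def step(state, ch):
    """Transition function of the ID/ONE automaton; None means no transition."""
    if state in ("start", "ID") and "a" <= ch <= "z":
        return "ID"
    if state == "start" and ch == "1":
        return "ONE"
    return None

def scan(text):
    tokens = []
    pos = 0
    while pos < len(text):
        state, i = "start", pos
        last = None                       # (kind, end) of the longest match so far
        while i < len(text):
            state = step(state, text[i])
            if state is None:
                break
            i += 1
            if state in ACCEPTING:
                last = (ACCEPTING[state], i)
        if last is None:
            raise ValueError(f"no token matches at position {pos}")
        kind, end = last
        tokens.append((kind, text[pos:end]))
        pos = end                         # reset: restart from the start state
    return tokens

print(scan("abb1"))
```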
Handling ambiguities
ID = (a|b|…|z) (a|b|…|z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?
[Figure: NFA; from the start state, a-z leads to accepting state ID, which loops on a-z, and i followed by f leads to accepting state IF]
[Figure: DFA; from the start state, i and a-z\i lead to two different ID states; after i, f leads to a state accepting both IF and ID, and a-z\f leads to an ID state; all of them continue to ID on a-z]
Solution: break tie using order of definitions
With ID defined before IF, output: ID(if)
Handling ambiguities
IF = if
ID = (a|b|…|z) (a|b|…|z)*
Input: if
With IF defined before ID, output: IF
Conclusion: list keyword token definitions before identifier definition
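The effect of definition order can be reproduced with a small rule-based scanner. This sketch uses Python's `re` for the individual rules and is not how a generated scanner works internally; `make_scanner` is a hypothetical helper. It prefers the longest match and, on equal length, the rule listed first.

```python
import re

def make_scanner(rules):
    """rules: list of (kind, pattern), in order of definition."""
    compiled = [(kind, re.compile(pattern)) for kind, pattern in rules]

    def scan(text):
        tokens, pos = [], 0
        while pos < len(text):
            best_kind, best_end = None, pos
            for kind, regex in compiled:
                match = regex.match(text, pos)
                # Strict '>' keeps the earlier rule when lengths are equal.
                if match and match.end() > best_end:
                    best_kind, best_end = kind, match.end()
            if best_kind is None:
                raise ValueError(f"no token matches at position {pos}")
            tokens.append((best_kind, text[pos:best_end]))
            pos = best_end
        return tokens

    return scan

ID = ("ID", r"[a-z]+")
IF = ("IF", r"if")

print(make_scanner([ID, IF])("if"))    # ID listed first
print(make_scanner([IF, ID])("if"))    # IF listed first
print(make_scanner([IF, ID])("iffy"))  # longest match still wins
```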
Implementing scanners in practice
Implementing scanners
Manual construction of automata + determinization is
Very tedious
Error-prone
Non-incremental
Fortunately there are tools that automatically generate code from a specification for most languages
C: Lex, Flex
Java: JLex, JFlex
Using JFlex
Define tokens (and states)
Run JFlex to generate Java implementation
Usually MyScanner.nextToken() will be called in a loop by parser
[Figure: MyScanner.lex (regular expressions) → JFlex → MyScanner.java; MyScanner.java turns a stream of characters into tokens]
Common format for reg-exps
Basic Patterns          Matching
x                       The character x
.                       Any character, usually except a new line
[xyz]                   Any of the characters x,y,z
Repetition Operators
R?                      An R or nothing (=optionally an R)
R*                      Zero or more occurrences of R
R+                      One or more occurrences of R
Composition Operators
R1R2                    An R1 followed by R2
R1|R2                   Either an R1 or R2
Grouping
(R)                     R itself
Escape characters
What is the expression for one or more +
symbols?
(+)+ won’t work
(\+)+ will
A backslash \ before an operator turns it into a standard character
\*, \?, \+, …
Newline: \n or \r\n depending on OS
Tab: \t
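The same behavior can be observed in Python's `re` module, used here as a stand-in for the common reg-exp format; details of other tools may differ. The unescaped pattern is rejected, the escaped one matches runs of + symbols.

```python
import re

# Unescaped: the inner + has nothing to repeat, so compilation fails.
try:
    re.compile("(+)+")
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaped: \+ is the literal character +.
plus_run = re.compile(r"(\+)+")

print(unescaped_ok)
print(plus_run.fullmatch("+++") is not None)
print(plus_run.fullmatch("") is not None)
```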
Shorthands
Use names for expressions
letter = a | b | … | z | A | B | … | Z
letter_ = letter | _
digit = 0 | 1 | 2 | … | 9
id = letter_ (letter_ | digit)*
Use hyphen to denote a range
letter = a-z | A-Z
digit = 0-9
Catching errors
What if input doesn’t match any token definition?
Trick: Add a “catch-all” rule that matches any character and reports an error
Add after all other rules
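The catch-all trick can be sketched with Python's `re` module. The rule names and patterns are assumptions of this sketch. Note that Python tries alternatives in order rather than picking the longest match, which makes no difference for these particular rules.

```python
import re

RULES = [
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("NUM", r"[0-9]+"),
    ("WS", r"\s+"),
    ("ERROR", r"."),     # catch-all: any single character, listed last
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in RULES),
                    re.DOTALL)

def scan(text):
    """Return (tokens, errors); errors are (position, character) pairs."""
    tokens, errors = [], []
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "WS":
            continue                       # consumed, not passed to the parser
        if kind == "ERROR":
            errors.append((match.start(), match.group()))
        else:
            tokens.append((kind, match.group()))
    return tokens, errors

print(scan("x1 # 42"))
```

Because the catch-all consumes exactly one character, scanning can continue after the error and report further problems in the same run.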
Next lecture: parsing