A token is a group of characters that represents an atomic unit of information.
Lexical analysis is the activity of converting a program listing into a sequence of tokens.
Token categories:
Keywords
Symbols (Punctuation and Operators)
Identifiers and Literals
Whitespace and comments are usually called pseudotokens because they are ignored during translation, but must still be handled by the scanner
The string of characters that composes a token is its lexeme. There is a one-to-one correspondence between tokens and lexemes in the first two categories, but not the last two (many different lexemes are all identifiers, for instance).
A scanner is a program that performs lexical analysis. It reads characters of input sequentially until it recognizes a token; it outputs the token, and repeats the process, until EOF.
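A minimal sketch of this read-recognize-emit loop in C (the token names and the single-character notion of "token" here are placeholders for illustration; the rest of these notes develop the real recognition step):

    #include <ctype.h>
    #include <stdio.h>

    /* Sketch of the scanner's outer loop: read characters, recognize a
       token, report it, repeat until EOF. Here a "token" is just one
       non-whitespace character. */
    typedef enum { TOK_CHAR, TOK_EOF } Token;

    static int lexeme;   /* the character just recognized */

    static Token get_token(void) {
        int c;
        while ((c = getchar()) != EOF)
            if (!isspace(c)) { lexeme = c; return TOK_CHAR; }  /* skip pseudotokens */
        return TOK_EOF;
    }

    int main(void) {
        while (get_token() != TOK_EOF)
            printf("token: '%c'\n", lexeme);
        return 0;
    }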
Two approaches to writing a scanner:
Hand-code it (see Tiny example)
Use a scanner generator
A scanner generator writes the scanner program for you.
Takes as input a description of the lexical characteristics of the tokens (usually in the form of regular expressions)
Produces as output a scanner, written in C or Java, etc., that recognizes tokens according to the rules in the input spec
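For concreteness, the input spec for a flex-style generator might look like the following (the rules and token codes are assumptions for illustration, not from the notes); running flex over it produces a C scanner:

    %option noyywrap
    %%
    "if"                    { return 1; /* IF keyword  */ }
    [A-Za-z][A-Za-z0-9_]*   { return 2; /* identifier  */ }
    [0-9]+                  { return 3; /* int literal */ }
    "+"|"="                 { return yytext[0]; /* symbol */ }
    [ \t\n]+                { /* whitespace: pseudotoken, discarded */ }
    %%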
A regular expression is a pattern that defines the structure of a token. Here are some examples of regular expressions:
   Regular Expression    Strings matched
1. a                     {"a"}
2. if                    {"if"}
3. a|b                   {"a", "b"}
4. a*                    {"", "a", "aa", "aaa", ...}
5. (0|1|2|...|9)*        {"", "0", "50", "31593", ...}
6. *                     (illegal RE)
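These sets can be checked mechanically. As a quick sanity check of example 5, POSIX extended regular expressions (a standard C library facility) can test membership; the ^ and $ anchors are needed because regexec() otherwise looks for a match anywhere inside the string:

    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        regex_t re;
        const char *tests[] = { "", "0", "50", "31593", "5a" };
        /* Example 5 from the table, anchored so the whole string must match. */
        regcomp(&re, "^(0|1|2|3|4|5|6|7|8|9)*$", REG_EXTENDED);
        for (int i = 0; i < 5; i++)
            printf("\"%s\": %s\n", tests[i],
                   regexec(&re, tests[i], 0, NULL, 0) == 0 ? "matched" : "not matched");
        regfree(&re);
        return 0;
    }

The first four strings match; "5a" does not.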
Slightly more formally, a regular expression is a pattern that defines a set of strings (see 2.0.3 Regular Expressions, pp. 33-34). The pattern has one of the following forms:
ε, the set containing only the empty string, and ϕ, the set containing no strings at all
A single symbol from the character set, which specifies the set containing one string, namely that symbol (example 1 above)
re1 re2 ("concatenation"), the set of strings formed by concatenating each string from re1 with each string from re2, where re1 and re2 are regular expressions (example 2)
re1 + re2 ("union"), the set of strings formed by the union of the sets defined by re1 and re2 (example 3)
re* ("repetition" / "closure"), the set of strings containing zero or more concatenations of the strings defined by re (example 4)
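This inductive definition maps directly onto a tree data structure. Here is a minimal sketch in C (type and constructor names are invented for illustration); the same representation is the natural input to Thompson's construction, discussed later in these notes:

    #include <stdlib.h>

    /* One node kind per case of the inductive definition. */
    typedef enum {
        RE_EMPTY,   /* phi: no strings                   */
        RE_EPS,     /* epsilon: only the empty string    */
        RE_SYM,     /* a single symbol from the alphabet */
        RE_CAT,     /* re1 re2   (concatenation)         */
        RE_UNION,   /* re1 + re2 (union)                 */
        RE_STAR     /* re*       (repetition / closure)  */
    } ReKind;

    typedef struct Re {
        ReKind kind;
        char sym;                 /* used by RE_SYM only                     */
        struct Re *left, *right;  /* subexpressions; right unused by RE_STAR */
    } Re;

    static Re *re_new(ReKind k, char sym, Re *l, Re *r) {
        Re *n = malloc(sizeof *n);
        n->kind = k; n->sym = sym; n->left = l; n->right = r;
        return n;
    }

    int main(void) {
        /* example 3 from the table: a + b */
        Re *r = re_new(RE_UNION, 0, re_new(RE_SYM, 'a', NULL, NULL),
                                    re_new(RE_SYM, 'b', NULL, NULL));
        (void)r;
        return 0;
    }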
Note that * has highest precedence, followed by concatenation, followed by union. Use parentheses to override precedence. For example:
a+bc* means a+(b(c*))
a+bc means a+(bc), which differs from (a+b)c
Most tokens can be matched by fairly simple regular expressions
Regular expressions for keywords and symbols are the keywords and symbols themselves (except where a symbol, like *, has special meaning in regular expressions and must be escaped)
Most languages have fairly simple structures for identifiers and literals
Occasionally, tricky situations arise:
consider C comments ("a regular expression for C comments is so complicated that it is almost never written in practice")
"pumping lemma" : "regular expressions can't count."
Nested comments cannot be handled by regular expressions (insufficient power)
Scanner generators that work with regular expressions use special mechanisms to work around these restrictions.
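To make "can't count" concrete, here is a sketch of the usual workaround: a hand-written helper that keeps an explicit nesting depth, which no single regular expression can express. It assumes Pascal/ML-style (* ... *) comments:

    #include <stdio.h>

    /* Skip one possibly-nested comment, assuming the opening "(*" has
       already been consumed. Returns 0 on success, -1 on EOF. The depth
       counter is exactly the counting a regular expression cannot do. */
    static int skip_nested_comment(FILE *fp) {
        int depth = 1, c, prev = 0;
        while (depth > 0 && (c = getc(fp)) != EOF) {
            if (prev == '(' && c == '*')      { depth++; c = 0; }  /* nested open */
            else if (prev == '*' && c == ')') { depth--; c = 0; }  /* close       */
            prev = c;
        }
        return depth == 0 ? 0 : -1;
    }

    int main(void) {
        /* reads a comment body (sans leading "(*") from stdin */
        puts(skip_nested_comment(stdin) == 0 ? "comment closed" : "unterminated");
        return 0;
    }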
Example 1: consider the lexeme "IF". Which regular expression applies?
IF
(A|B|C|...|Z)*
"IF" is in the set of strings of both regular expressions; the scanner needs to know which token to report (IF keyword or identifier).
Example 2: consider the input <>. Which regular expression(s) apply?
<
>
<>
In practice, scanners match the first regular expression specified, so given the rule order in Example 1, the scanner would report the IF keyword. Scanners also usually match the longest sequence possible ("maximal munch"), so in Example 2 the scanner would report <>, even though < is listed first.
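Both rules are easy to see in code. This sketch (toy token set, invented names) applies first-rule-wins to IF vs. identifier and maximal munch to <> vs. <:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    static const char *next_token(const char **p) {
        const char *s = *p;
        if (isalpha((unsigned char)*s)) {
            size_t n = 0;
            while (isalpha((unsigned char)s[n])) n++;   /* maximal munch */
            *p = s + n;
            /* the keyword rule is listed before the identifier rule */
            return (n == 2 && strncmp(s, "IF", 2) == 0) ? "IF-keyword" : "identifier";
        }
        if (s[0] == '<' && s[1] == '>') { *p = s + 2; return "<>"; } /* longest first */
        if (s[0] == '<')                { *p = s + 1; return "<";  }
        if (s[0] == '>')                { *p = s + 1; return ">";  }
        *p = s + 1;
        return "?";
    }

    int main(void) {
        const char *p = "IF IFFY <> <";
        while (*p) {
            while (*p == ' ') p++;
            if (*p) printf("%s\n", next_token(&p));
        }
        return 0;
    }

Running this prints IF-keyword, identifier, <>, < : "IFFY" is not split into IF + FY, and <> beats <.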
How do we create a scanner from a set of regular expressions? Use a finite automaton!
A finite automaton is a machine that consists of
an alphabet
a set of states, one of which is a start state and one or more of which are accepting states
a set of transitions between states, each labeled by a symbol from the alphabet.
Example: See p. 31.
A deterministic finite automaton (DFA) has the restriction that no two transitions out of a given state are labeled by the same alphabet symbol. A nondeterministic finite automaton (NFA) does not have that restriction, and also allows ε-transitions, i.e., transitions labeled by the empty string.
Note that finite automata can be represented graphically or in tabular form (p. 31)
See p. 42 (2.2 Implementation with Finite State Machines) for the classic implementation, which uses a table to encode the transitions.
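The table-driven idea fits in a few lines; here is a minimal sketch (this particular DFA accepts int_literal = digit digit*, anticipating the example below):

    #include <ctype.h>
    #include <stdio.h>

    enum { START, IN_NUM, DONE, ERR };   /* states: table rows            */
    enum { C_DIGIT, C_OTHER, NCLASS };   /* character classes: columns    */

    /* The transition table encodes the whole automaton. */
    static const int trans[2][NCLASS] = {
        /* START  */ { IN_NUM, ERR  },
        /* IN_NUM */ { IN_NUM, DONE },
    };

    int main(void) {
        const char *p = "31593+";
        int state = START;
        while (state == START || state == IN_NUM)   /* the driver loop */
            state = trans[state][isdigit((unsigned char)*p++) ? C_DIGIT : C_OTHER];
        puts(state == DONE ? "accepted an int_literal" : "error");
        return 0;
    }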
Regular expressions and finite automata have equivalent expressive power.
A regular expression can be converted to an NFA through a procedure called Thompson's construction. An NFA can then be converted to a DFA through an algorithm called subset construction.
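Subset construction is short enough to sketch directly. Here the NFA is hand-built for (a|b)*ab (a standard example, not from the book), each set of NFA states is a bitmask, and every distinct set reached becomes one DFA state. A full Thompson-built NFA would also have ε-transitions and need an ε-closure step, which this toy NFA omits:

    #include <stdio.h>

    enum { A, B, NSYM };                         /* input alphabet  */
    #define NNFA 3                               /* NFA states 0..2 */
    #define NFA_ACCEPT (1u << 2)                 /* state 2 accepts */

    /* nfa[s][c] = bitmask of states reachable from s on symbol c:
       0 loops on a and b, 0 -a-> 1 (the nondeterminism), 1 -b-> 2 */
    static const unsigned nfa[NNFA][NSYM] = {
        { (1u<<0)|(1u<<1), 1u<<0 },
        { 0,               1u<<2 },
        { 0,               0     },
    };

    int main(void) {
        unsigned dstates[16] = { 1u << 0 };      /* DFA start = {0} */
        int dtran[16][NSYM], ndfa = 1;

        for (int i = 0; i < ndfa; i++)           /* worklist of unmarked DFA states */
            for (int c = 0; c < NSYM; c++) {
                unsigned next = 0;
                for (int s = 0; s < NNFA; s++)   /* union of NFA moves on c */
                    if (dstates[i] & (1u << s)) next |= nfa[s][c];
                int j = 0;
                while (j < ndfa && dstates[j] != next) j++;  /* seen this set? */
                if (j == ndfa) dstates[ndfa++] = next;       /* new DFA state  */
                dtran[i][c] = j;
            }

        for (int i = 0; i < ndfa; i++)
            printf("D%d%s: a->D%d b->D%d\n", i,
                   (dstates[i] & NFA_ACCEPT) ? " (accepting)" : "",
                   dtran[i][A], dtran[i][B]);
        return 0;
    }

For this NFA the algorithm discovers three DFA states: D0 = {0}, D1 = {0,1}, and the accepting D2 = {0,2}.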
Consider the language whose tokens consist of alphanumeric identifiers, integer literals, the equal sign, and the plus sign.
letter = ['a'..'z'] | ['A'..'Z']
digit = ['0'..'9']
identifier = letter (letter | digit | '_' ) *
int_literal = digit digit*
Let's create a finite automaton that will accept any of these tokens. In other words, we want:
token = identifier | int_literal | '=' | '+'
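A hand-coded version of that combined automaton, in the spirit of the Tiny scanner (the token names and the driver are assumptions for illustration):

    #include <ctype.h>
    #include <stdio.h>

    typedef enum { T_ID, T_INT, T_EQ, T_PLUS, T_EOF, T_ERR } Token;

    static const char *src;   /* current scanning position */

    static Token get_token(void) {
        while (*src == ' ' || *src == '\t' || *src == '\n')
            src++;                               /* skip whitespace pseudotokens */
        if (*src == '\0') return T_EOF;
        if (isalpha((unsigned char)*src)) {      /* identifier: letter (letter|digit|'_')* */
            while (isalnum((unsigned char)*src) || *src == '_') src++;
            return T_ID;
        }
        if (isdigit((unsigned char)*src)) {      /* int_literal: digit digit* */
            while (isdigit((unsigned char)*src)) src++;
            return T_INT;
        }
        if (*src == '=') { src++; return T_EQ;   }
        if (*src == '+') { src++; return T_PLUS; }
        src++;
        return T_ERR;
    }

    int main(void) {
        static const char *names[] =
            { "identifier", "int_literal", "=", "+", "EOF", "error" };
        src = "total = count + 42";
        for (Token t; (t = get_token()) != T_EOF; )
            printf("%s\n", names[t]);
        return 0;
    }

On the sample input this reports identifier, =, identifier, +, int_literal; each branch of get_token corresponds to one arm of the combined automaton, and the while loops inside the identifier and int_literal branches implement maximal munch.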