CpS 450 Lecture Notes

Lexical Analysis


A token is a group of characters that represents an atomic unit of information.

Lexical analysis is the activity of converting a program listing into a sequence of tokens.

Token categories:

  1. Keywords (reserved words)

  2. Special symbols (operators and punctuation)

  3. Identifiers

  4. Literals

The string of characters which composes a token is its lexeme. There is a one-to-one correspondence between tokens and lexemes in the first two categories (each keyword or special symbol has exactly one spelling), but not the last two (many different lexemes are reported as the same identifier or literal token).
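As an illustration (the token names and the sample statement are assumptions, not from the notes), a scanner might turn one statement such as "IF count < 10" into token/lexeme pairs like:

```python
# Each token pairs a category with its lexeme (the matched characters).
# For keywords and special symbols the lexeme is fixed; for identifiers
# and literals, many lexemes map to the same token category.
tokens = [
    ("IF",  "IF"),     # keyword: one token, one possible lexeme
    ("ID",  "count"),  # identifier: one token, many possible lexemes
    ("LT",  "<"),      # special symbol
    ("INT", "10"),     # integer literal
]
for category, lexeme in tokens:
    print(f"{category:4} {lexeme}")
```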

Scanner Construction

A scanner is a program that performs lexical analysis. It reads characters of input sequentially until it recognizes a token; it outputs the token and repeats the process until EOF.

Two approaches to writing a scanner:

  1. Hand-code it (see Tiny example)

  2. Use a scanner generator

A scanner generator writes the scanner program for you.

Regular Expressions

A regular expression is a pattern that defines the structure of a token. Here are some examples of regular expressions:

 

     Regular Expression      Strings matched

  1. a                       {"a"}

  2. if                      {"if"}

  3. a|b                     {"a", "b"}

  4. a*                      {"", "a", "aa", "aaa", ...}

  5. (0|1|2|...|9)*          {"", "0", "50", "31593", ...}

  6. *                       (illegal RE)

Slightly more formally, a regular expression is a pattern which defines a set of strings (see 2.0.3 Regular Expressions, pp. 33-34). The pattern has the form:

Note that * has the highest precedence, followed by concatenation, followed by alternation. Use parentheses to override precedence. For example, consider:
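As a sketch of how this precedence plays out (using Python's re module, which follows the same precedence rules; the patterns here are illustrative assumptions):

```python
import re

# a|bc* parses as a | (b(c*)): either "a", or "b" followed by any number of "c"s,
# because * binds tighter than concatenation, which binds tighter than |.
print(bool(re.fullmatch(r"a|bc*", "bcc")))   # True: matches the bc* alternative
print(bool(re.fullmatch(r"a|bc*", "ac")))    # False: "a" alone, not "a" then "c"s

# Parentheses override precedence: (a|b)c* chooses a or b, then applies *.
print(bool(re.fullmatch(r"(a|b)c*", "acc")))  # True
```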

Regular Expressions for Programming Languages

Most tokens can be matched by fairly simple regular expressions.

Occasionally, tricky situations arise:

Limitations of Regular Expressions

"Pumping lemma": regular expressions can't count. They cannot match constructs that require balanced or arbitrarily nested structure, such as matched parentheses or nested comments.

Scanner generators that work with regular expressions use special mechanisms to work around these restrictions.
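A small illustration of the limitation (an assumed example, not from the text): the closest regular expression to the non-regular language a^n b^n also accepts strings with unequal counts, because no regular expression can enforce equality:

```python
import re

# a*b* is the best a regular expression can do for "some a's then some b's";
# it cannot require the two counts to be equal.
pattern = re.compile(r"a*b*")
print(bool(pattern.fullmatch("aabb")))   # True: this string is in a^n b^n
print(bool(pattern.fullmatch("aabbb")))  # True as well: the regex can't count
```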

Handling Ambiguity

Example 1

Consider the following lexeme: "IF" - which regular expression applies?

"IF" is in the set of strings of both regular expressions; the scanner needs to know which token to report (the IF keyword or an identifier).

Example 2

Consider the following input: <> - which regular expression(s) apply?

In practice, scanners match the first regular expression specified, so given the order in Example 1, the scanner would report the IF keyword. Scanners also usually match the longest sequence possible ("maximal munch"), so <> would be reported as a single token, even though < comes first in the order in Example 2.
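A sketch of both rules using Python's re module (the token names and their ordering are assumptions for illustration; note that re's alternation is ordered first-match, not true maximal munch, so longer patterns like <> must be listed before their prefixes):

```python
import re

# Order matters: IF is listed before ID, so "IF" is reported as a keyword,
# and NEQ (<>) is listed before LT (<) so the longer lexeme wins.
token_spec = [
    ("IF",  r"IF"),
    ("ID",  r"[A-Za-z][A-Za-z0-9]*"),
    ("NEQ", r"<>"),
    ("LT",  r"<"),
]
master = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in token_spec))

def scan(text):
    """Return (token, lexeme) pairs for the input text."""
    return [(m.lastgroup, m.group()) for m in master.finditer(text)]

print(scan("IF"))  # [('IF', 'IF')]
print(scan("<>"))  # [('NEQ', '<>')]
```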

Finite Automata

How do we create a scanner from a set of regular expressions? Use a finite automaton!

A finite automaton is a machine that consists of

  1. A finite set of states, one of which is the start state

  2. One or more accepting (final) states

  3. A set of transitions between states, each labeled with a symbol from the vocabulary

Example: See p. 31.

A deterministic finite automaton (DFA) has the restriction that no two transitions out of a given state have the same vocabulary symbol. A nondeterministic finite automaton (NFA) does not have that restriction, and also allows transitions labeled by the empty string.

Note that finite automata can be represented graphically or in tabular form (p. 31).

Implementing Finite Automata

See p. 42 (2.2 Implementation with Finite State Machines) for the classic implementation, which uses a table to encode transitions.
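A minimal sketch of the table-driven approach (the states and transitions below are invented for illustration and recognize the regular expression a(a|b)*):

```python
# Transition table for a DFA accepting a(a|b)*: state -> {symbol -> next state}.
# State 0 is the start state; state 1 is the only accepting state.
TRANSITIONS = {
    0: {"a": 1},
    1: {"a": 1, "b": 1},
}
ACCEPTING = {1}

def accepts(s):
    state = 0
    for ch in s:
        if ch not in TRANSITIONS.get(state, {}):
            return False          # no transition on this symbol: reject
        state = TRANSITIONS[state][ch]
    return state in ACCEPTING     # accept only if we end in an accepting state

print(accepts("abba"))  # True
print(accepts("ba"))    # False: no transition on "b" from the start state
```

The driver loop never changes; supporting a different token language means swapping in a different table, which is exactly what a scanner generator emits.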

Regular Expressions to Finite Automata

Regular expressions and finite automata have equivalent expressive power.

A regular expression can be converted to an NFA through a procedure called Thompson's construction. An NFA can then be converted to a DFA through an algorithm called subset construction.
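A compact sketch of subset construction (the dict-based NFA encoding below is an assumption for illustration, with None labeling empty-string transitions; each DFA state is the set of NFA states it represents):

```python
from collections import deque

def epsilon_closure(nfa, states):
    """All states reachable from `states` via epsilon (None) transitions."""
    closure, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in nfa.get(s, {}).get(None, ()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return frozenset(closure)

def subset_construction(nfa, start, symbols):
    """Convert an NFA {state: {symbol: set of states}} into a DFA whose
    states are frozensets of NFA states."""
    dfa, start_set = {}, epsilon_closure(nfa, {start})
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for sym in symbols:
            target = set()
            for s in current:
                target |= nfa.get(s, {}).get(sym, set())
            if target:
                target = epsilon_closure(nfa, target)
                dfa[current][sym] = target
                queue.append(target)
    return dfa, start_set

# NFA for a(a|b)*: 0 -a-> 1, 1 -eps-> 2, 2 -a-> 2, 2 -b-> 2
nfa = {0: {"a": {1}}, 1: {None: {2}}, 2: {"a": {2}, "b": {2}}}
dfa, start = subset_construction(nfa, 0, ["a", "b"])
print(start)       # frozenset({0})
print(dfa[start])  # {'a': frozenset({1, 2})}
```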

Small Example

Consider the language whose tokens consist of alphanumeric identifiers, integer literals, the equal sign, and plus.

Let's create a finite automaton that will accept any of these tokens. In other words, we want:
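One possible hand-coded version of such a machine (a sketch; the token names are assumptions), using maximal munch for the two multi-character token categories:

```python
def next_token(text, pos):
    """Return (token, lexeme, new_pos) for the token starting at pos."""
    ch = text[pos]
    if ch.isalpha():                  # identifier: letter (letter|digit)*
        end = pos
        while end < len(text) and text[end].isalnum():
            end += 1                  # maximal munch: keep consuming
        return ("ID", text[pos:end], end)
    if ch.isdigit():                  # integer literal: digit digit*
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return ("INT", text[pos:end], end)
    if ch == "=":
        return ("EQ", "=", pos + 1)
    if ch == "+":
        return ("PLUS", "+", pos + 1)
    raise ValueError(f"unexpected character {ch!r}")

def scan(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():       # skip whitespace between tokens
            pos += 1
            continue
        tok, lex, pos = next_token(text, pos)
        tokens.append((tok, lex))
    return tokens

print(scan("x1 = y + 42"))
# [('ID', 'x1'), ('EQ', '='), ('ID', 'y'), ('PLUS', '+'), ('INT', '42')]
```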