A token is a group of characters that represents an atomic unit of information.
Lexical analysis is the activity of converting a program listing into a sequence of tokens.
Token categories:
Keywords
Symbols (Punctuation and Operators)
Identifiers and Literals
Whitespace and comments are usually called pseudotokens because they are ignored during translation, but must still be handled by the scanner
The string of characters that composes a token is its lexeme. There is a one-to-one correspondence between tokens and lexemes in the first two categories, but not the last two (many different lexemes are all identifiers, for instance).
A scanner is a program that performs lexical analysis. It reads characters of input sequentially until it recognizes a token; it outputs the token, and repeats the process, until EOF.
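A minimal sketch of this read-recognize-emit loop in C (the token names and the single-character notion of "token" here are placeholders for illustration; the rest of these notes develop the real recognition step):

    #include <ctype.h>
    #include <stdio.h>

    /* Sketch of the scanner's outer loop: read characters, recognize a
       token, report it, repeat until EOF. Here a "token" is just one
       non-whitespace character. */
    typedef enum { TOK_CHAR, TOK_EOF } Token;

    static int lexeme;   /* the character just recognized */

    static Token get_token(void) {
        int c;
        while ((c = getchar()) != EOF)
            if (!isspace(c)) { lexeme = c; return TOK_CHAR; }  /* skip pseudotokens */
        return TOK_EOF;
    }

    int main(void) {
        while (get_token() != TOK_EOF)
            printf("token: '%c'\n", lexeme);
        return 0;
    }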
Two approaches to writing a scanner:
Hand-code it (see Tiny example)
Use a scanner generator
A scanner generator writes the scanner program for you.
Takes as input a description of the lexical characteristics of the tokens (usually in the form of regular expressions)
Produces as output a scanner, written in C or Java, etc., that recognizes tokens according to the rules in the input spec
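For concreteness, the input spec for a flex-style generator might look like the following (the rules and token codes are assumptions for illustration, not from the notes); running flex over it produces a C scanner:

    %option noyywrap
    %%
    "if"                    { return 1; /* IF keyword  */ }
    [A-Za-z][A-Za-z0-9_]*   { return 2; /* identifier  */ }
    [0-9]+                  { return 3; /* int literal */ }
    "+"|"="                 { return yytext[0]; /* symbol */ }
    [ \t\n]+                { /* whitespace: pseudotoken, discarded */ }
    %%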
A regular expression is a pattern that defines the structure of a token. Here are some examples of regular expressions:
   Regular Expression    Strings matched
1. a                     {"a"}
2. if                    {"if"}
3. a|b                   {"a", "b"}
4. a*                    {"", "a", "aa", "aaa", ...}
5. (0|1|2|...|9)*        {"", "0", "50", "31593", ...}
6. *                     (illegal RE)
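These sets can be checked mechanically. As a quick sanity check of example 5, POSIX extended regular expressions (a standard C library facility) can test membership; the ^ and $ anchors are needed because regexec() otherwise looks for a match anywhere inside the string:

    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        regex_t re;
        const char *tests[] = { "", "0", "50", "31593", "5a" };
        /* Example 5 from the table, anchored so the whole string must match. */
        regcomp(&re, "^(0|1|2|3|4|5|6|7|8|9)*$", REG_EXTENDED);
        for (int i = 0; i < 5; i++)
            printf("\"%s\": %s\n", tests[i],
                   regexec(&re, tests[i], 0, NULL, 0) == 0 ? "matched" : "not matched");
        regfree(&re);
        return 0;
    }

The first four strings match; "5a" does not.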
Slightly more formally, a regular expression is a pattern that defines a set of strings (see 2.0.3 Regular Expressions, pp. 33-34). The pattern has one of the following forms:
ε, the set containing only the empty string, and ϕ, the set containing no strings at all
A single symbol from the character set, which specifies the set containing one string, namely that symbol (example 1 above)
re1 re2 ("concatenation"), the set of strings formed by concatenating each string from re1 with each string from re2, where re1 and re2 are regular expressions (example 2)
re1 + re2 ("union"), the set of strings formed by the union of the sets defined by re1 and re2 (example 3)
re* ("repetition" / "closure"), the set of strings containing zero or more concatenations of the strings defined by re (example 4)
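This inductive definition maps directly onto a tree data structure. Here is a minimal sketch in C (type and constructor names are invented for illustration); the same representation is the natural input to Thompson's construction, discussed later in these notes:

    #include <stdlib.h>

    /* One node kind per case of the inductive definition. */
    typedef enum {
        RE_EMPTY,   /* phi: no strings                   */
        RE_EPS,     /* epsilon: only the empty string    */
        RE_SYM,     /* a single symbol from the alphabet */
        RE_CAT,     /* re1 re2   (concatenation)         */
        RE_UNION,   /* re1 + re2 (union)                 */
        RE_STAR     /* re*       (repetition / closure)  */
    } ReKind;

    typedef struct Re {
        ReKind kind;
        char sym;                 /* used by RE_SYM only                     */
        struct Re *left, *right;  /* subexpressions; right unused by RE_STAR */
    } Re;

    static Re *re_new(ReKind k, char sym, Re *l, Re *r) {
        Re *n = malloc(sizeof *n);
        n->kind = k; n->sym = sym; n->left = l; n->right = r;
        return n;
    }

    int main(void) {
        /* example 3 from the table: a + b */
        Re *r = re_new(RE_UNION, 0, re_new(RE_SYM, 'a', NULL, NULL),
                                    re_new(RE_SYM, 'b', NULL, NULL));
        (void)r;
        return 0;
    }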
Note that * has highest precedence, followed by concatenation, followed by union. Use parentheses to override precedence. For example:
a+bc* means a+(b(c*))
a+bc means a+(bc), which differs from (a+b)c
Most tokens can be matched by fairly simple regular expressions
Regular expressions for keywords and symbols are the keywords and symbols themselves (except where a symbol, like *, has special meaning in regular expressions and must be escaped)
Most languages have fairly simple structures for identifiers and literals
Occasionally, tricky situations arise:
consider C comments ("a regular expression for C comments is so complicated that it is almost never written in practice")
"pumping lemma" : "regular expressions can't count."
Nested comments cannot be handled by regular expressions (insufficient power)
Scanner generators that work with regular expressions use special mechanisms to work around these restrictions.
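To make "can't count" concrete, here is a sketch of the usual workaround: a hand-written helper that keeps an explicit nesting depth, which no single regular expression can express. It assumes Pascal/ML-style (* ... *) comments:

    #include <stdio.h>

    /* Skip one possibly-nested comment, assuming the opening "(*" has
       already been consumed. Returns 0 on success, -1 on EOF. The depth
       counter is exactly the counting a regular expression cannot do. */
    static int skip_nested_comment(FILE *fp) {
        int depth = 1, c, prev = 0;
        while (depth > 0 && (c = getc(fp)) != EOF) {
            if (prev == '(' && c == '*')      { depth++; c = 0; }  /* nested open */
            else if (prev == '*' && c == ')') { depth--; c = 0; }  /* close       */
            prev = c;
        }
        return depth == 0 ? 0 : -1;
    }

    int main(void) {
        /* reads a comment body (sans leading "(*") from stdin */
        puts(skip_nested_comment(stdin) == 0 ? "comment closed" : "unterminated");
        return 0;
    }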
Example 1: consider the lexeme "IF". Which regular expression applies?
IF
(A|B|C|...|Z)*
"IF" is in the set of strings of both regular expressions; the scanner needs to know which token to report (IF keyword or identifier).
Example 2: consider the input <>. Which regular expression(s) apply?
<
>
<>
In practice, scanners match the first regular expression specified, so given the rule order in Example 1, the scanner would report the IF keyword. Scanners also usually match the longest sequence possible ("maximal munch"), so in Example 2 the scanner would report <>, even though < is listed first.
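Both rules are easy to see in code. This sketch (toy token set, invented names) applies first-rule-wins to IF vs. identifier and maximal munch to <> vs. <:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    static const char *next_token(const char **p) {
        const char *s = *p;
        if (isalpha((unsigned char)*s)) {
            size_t n = 0;
            while (isalpha((unsigned char)s[n])) n++;   /* maximal munch */
            *p = s + n;
            /* the keyword rule is listed before the identifier rule */
            return (n == 2 && strncmp(s, "IF", 2) == 0) ? "IF-keyword" : "identifier";
        }
        if (s[0] == '<' && s[1] == '>') { *p = s + 2; return "<>"; } /* longest first */
        if (s[0] == '<')                { *p = s + 1; return "<";  }
        if (s[0] == '>')                { *p = s + 1; return ">";  }
        *p = s + 1;
        return "?";
    }

    int main(void) {
        const char *p = "IF IFFY <> <";
        while (*p) {
            while (*p == ' ') p++;
            if (*p) printf("%s\n", next_token(&p));
        }
        return 0;
    }

Running this prints IF-keyword, identifier, <>, < : "IFFY" is not split into IF + FY, and <> beats <.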
How do we create a scanner from a set of regular expressions? Use a finite automaton!
A finite automaton is a machine that consists of
an alphabet
a set of states, one of which is a start state and one or more of which are accepting states
a set of transitions between states, each labeled by a symbol from the alphabet.
Example: See p. 31.
A deterministic finite automaton (DFA) has the restriction that no two transitions out of a given state are labeled by the same alphabet symbol. A nondeterministic finite automaton (NFA) does not have that restriction, and also allows ε-transitions, i.e., transitions labeled by the empty string.
Note that finite automata can be represented graphically or in tabular form (p. 31)
See p. 42 (2.2 Implementation with Finite State Machines) for the classic implementation, which uses a table to encode the transitions.
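The table-driven idea fits in a few lines; here is a minimal sketch (this particular DFA accepts int_literal = digit digit*, anticipating the example below):

    #include <ctype.h>
    #include <stdio.h>

    enum { START, IN_NUM, DONE, ERR };   /* states: table rows            */
    enum { C_DIGIT, C_OTHER, NCLASS };   /* character classes: columns    */

    /* The transition table encodes the whole automaton. */
    static const int trans[2][NCLASS] = {
        /* START  */ { IN_NUM, ERR  },
        /* IN_NUM */ { IN_NUM, DONE },
    };

    int main(void) {
        const char *p = "31593+";
        int state = START;
        while (state == START || state == IN_NUM)   /* the driver loop */
            state = trans[state][isdigit((unsigned char)*p++) ? C_DIGIT : C_OTHER];
        puts(state == DONE ? "accepted an int_literal" : "error");
        return 0;
    }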
Regular expressions and finite automata have equivalent expressive power.
A regular expression can be converted to an NFA through a procedure called Thompson's construction. An NFA can then be converted to a DFA through an algorithm called subset construction.
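Subset construction is short enough to sketch directly. Here the NFA is hand-built for (a|b)*ab (a standard example, not from the book), each set of NFA states is a bitmask, and every distinct set reached becomes one DFA state. A full Thompson-built NFA would also have ε-transitions and need an ε-closure step, which this toy NFA omits:

    #include <stdio.h>

    enum { A, B, NSYM };                         /* input alphabet  */
    #define NNFA 3                               /* NFA states 0..2 */
    #define NFA_ACCEPT (1u << 2)                 /* state 2 accepts */

    /* nfa[s][c] = bitmask of states reachable from s on symbol c:
       0 loops on a and b, 0 -a-> 1 (the nondeterminism), 1 -b-> 2 */
    static const unsigned nfa[NNFA][NSYM] = {
        { (1u<<0)|(1u<<1), 1u<<0 },
        { 0,               1u<<2 },
        { 0,               0     },
    };

    int main(void) {
        unsigned dstates[16] = { 1u << 0 };      /* DFA start = {0} */
        int dtran[16][NSYM], ndfa = 1;

        for (int i = 0; i < ndfa; i++)           /* worklist of unmarked DFA states */
            for (int c = 0; c < NSYM; c++) {
                unsigned next = 0;
                for (int s = 0; s < NNFA; s++)   /* union of NFA moves on c */
                    if (dstates[i] & (1u << s)) next |= nfa[s][c];
                int j = 0;
                while (j < ndfa && dstates[j] != next) j++;  /* seen this set? */
                if (j == ndfa) dstates[ndfa++] = next;       /* new DFA state  */
                dtran[i][c] = j;
            }

        for (int i = 0; i < ndfa; i++)
            printf("D%d%s: a->D%d b->D%d\n", i,
                   (dstates[i] & NFA_ACCEPT) ? " (accepting)" : "",
                   dtran[i][A], dtran[i][B]);
        return 0;
    }

For this NFA the algorithm discovers three DFA states: D0 = {0}, D1 = {0,1}, and the accepting D2 = {0,2}.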
Consider the language whose tokens consist of alphanumeric identifiers, integer literals, the equal sign, and the plus sign.
letter = ['a'..'z'] | ['A'..'Z']
digit = ['0'..'9']
identifier = letter (letter | digit | '_' ) *
int_literal = digit digit*
Let's create a finite automaton that will accept any of these tokens. In other words, we want:
token = identifier | int_literal | '=' | '+'
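A hand-coded version of that combined automaton, in the spirit of the Tiny scanner (the token names and the driver are assumptions for illustration):

    #include <ctype.h>
    #include <stdio.h>

    typedef enum { T_ID, T_INT, T_EQ, T_PLUS, T_EOF, T_ERR } Token;

    static const char *src;   /* current scanning position */

    static Token get_token(void) {
        while (*src == ' ' || *src == '\t' || *src == '\n')
            src++;                               /* skip whitespace pseudotokens */
        if (*src == '\0') return T_EOF;
        if (isalpha((unsigned char)*src)) {      /* identifier: letter (letter|digit|'_')* */
            while (isalnum((unsigned char)*src) || *src == '_') src++;
            return T_ID;
        }
        if (isdigit((unsigned char)*src)) {      /* int_literal: digit digit* */
            while (isdigit((unsigned char)*src)) src++;
            return T_INT;
        }
        if (*src == '=') { src++; return T_EQ;   }
        if (*src == '+') { src++; return T_PLUS; }
        src++;
        return T_ERR;
    }

    int main(void) {
        static const char *names[] =
            { "identifier", "int_literal", "=", "+", "EOF", "error" };
        src = "total = count + 42";
        for (Token t; (t = get_token()) != T_EOF; )
            printf("%s\n", names[t]);
        return 0;
    }

On the sample input this reports identifier, =, identifier, +, int_literal; each branch of get_token corresponds to one arm of the combined automaton, and the while loops inside the identifier and int_literal branches implement maximal munch.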