A parser determines whether a sequence of tokens is legal given a grammar
Four classes of grammars:
Regular: Equivalent to Regular Expressions; least powerful
Context-free: Used to define programming language grammars
Context-sensitive
Unrestricted
A context-free grammar consists of
A set T of terminals (tokens produced by the Scanner)
A set N of nonterminals (names defined by grammar rules)
A set P of productions
(grammar rules) of the form A ::= a, where A ∈ N and a
∈ (T ∪ N)*
Note: We will frequently use V = (T ∪
N) as an abbreviation (meaning "Vocabulary symbol")
A start symbol S ∈ N
Example:
|
start |
::= exp |
|
exp |
::= exp op exp |
|
op |
::= '+' |
A language is a set of legal token strings which can be derived in zero or more steps from the start symbol of a grammar. Note symbolic definition: L(G) = { s | start Þ * s }
A derivation involves a series of replacements of nonterminals by their grammar definitions. A derivation step is one such replacement. See p. 74 (bottom).
A parse of a token string involves performing a derivation beginning with the start symbol that results in the input token string (if possible).
When expanding a nonterminal that appears on the LHS of 2+ rules, we choose the rule that will ultimately result in a successful match. This is why we say parsing is syntax-driven or syntax-directed.
Note that the selection of which nonterminal to expand in each step does not have any effect on whether the input string will ultimately be derived, but the selection of which rule (when there is an option) does.
In a leftmost derivation, we always expand the leftmost nonterminal in the string; in a rightmost derivation, we always expand the rightmost nonterminal.
Answers the question: Is the input string in the language?
Ex: L({E ::= E + a | a}) => {a, a+a, a+a+a, ...}
General cases:
A ::= A a | b -> baaaa... [left recursion]
A ::= aA | b -> ...aaaaaab [right recursion]
This shows context-free grammars can express everything regular expressions can, and more (ex: nested parentheses)
Practice: Write a grammar that matches a comma-delimited list of argument declarations type1 name1, type2 name2, ..., such as might appear in a function definition.
It's possible for a production to have 0 symbols on the RHS. This allows for optional areas in the grammar without having to resort to common prefixes. Consider two if-then grammars:
|
stmt |
::= if-stmt |
|
if-stmt |
::= if '(' exp ')' stmt |
|
exp |
::= 0 |
Using Poof Productions:
|
stmt |
::= if-stmt |
|
if-stmt |
::= if '(' exp ')' stmt if-trailer |
|
if-trailer |
::= else stmt |
|
exp |
::= 0 |
In order for translation to occur, we need to know more than just the answer to "is the input string in the language?" Need to know the structure of the string.
A parse tree is constructed as follows:
The start symbol is the root
Each time we perform a derivation step, the vocabulary symbols V* which replace the nonterminal N become children of that nonterminal in the parse tree
When we're all done, the leaves of the tree should be only terminals
Note that a parse tree contains more information than is actually needed to perform translation. An abstract syntax tree (sometimes just called "syntax tree") is a parse tree with nonessential info removed.
Note: Your author uses the term "derivation tree" instead of "parse tree". I prefer "parse tree" to avoid confusing with "derivation".
Ambiguous grammars permit the construction of multiple parse trees from the same input.
This results when, given a derivation order (leftmost or rightmost), there is more than one rule that can be used to expand a given nonterminal. See p. 74.
Ambiguity in a grammar presents a problem for automated parsers: they need to know which rule to use.
Determining whether a grammar is ambiguous is, in general, impossible.
Removing ambiguity is, in general, hard.
Common causes of ambiguity have well-understood solutions. Ex: pp. 87-88.
An ambiguous grammar can often be rewritten to an unambiguous one; however, an ambiguous language is one for which no unambiguous grammar exists.