
    Compilers (III): Parsing

Posted by MarkNV on 2013-06-19 23:53

    1. Introduction

In the last article we covered the first step of compiling, lexical analysis, in detail. Let's review it briefly and take a peek at parsing before moving on to the second phase.

    • Lexer: Input (String of code) -> Output (String of tokens)
    • Parser: Input (String of tokens) -> Output (Parse tree)

As mentioned before, the output of lexical analysis is the input of parsing. However, not every string of tokens is a program, so the parser must verify the token stream. Another important point is that the parser not only checks whether the token stream is valid in the language, but also produces a parse tree (which may be implicit) as its output, i.e. the input to the next phase of compiling.
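To make these two interfaces concrete, here is a minimal sketch in Python (the `Token` shape, the character-per-token lexer, and all names are illustrative, not from any particular compiler):

```python
from typing import NamedTuple, List, Union

class Token(NamedTuple):
    kind: str   # token class assigned by the lexer, e.g. 'int', '+', '('
    text: str   # the underlying lexeme

# A parse tree: a terminal leaf (Token) or a non-terminal with children.
Tree = Union[Token, tuple]    # e.g. ('E', left_subtree, plus_token, right_subtree)

def lex(code: str) -> List[Token]:
    """Phase 1 (toy version): one single-character token per symbol."""
    return [Token('int' if c.isdigit() else c, c) for c in code if not c.isspace()]

# Phase 2 is parse(tokens) -> Tree. It must also *reject* token strings
# that are not programs: lex('5+') lexes fine but should fail to parse.
```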

Before we get started with parsing, we need to master a few prerequisites.

    1.1 Context-free language

The context-free grammar (CFG) is our first new friend. In this section we explain what it is and why we need it.

    1.1.1 Weakness of regular language

In lexical analysis we used regular expressions to define tokens, in order to determine which token class a lexeme belongs to. In parsing we need a language for describing valid strings of tokens and for distinguishing them from invalid ones.

Are regular languages able to do that? Unfortunately, the answer is no. Here is a counterexample.

First, a quick review: a regular language is built from only (1) single characters, (2) epsilon, (3) union, (4) concatenation, and (5) iteration. Now consider the following language:

$\lbrace (^i)^i\mid i\geq 0 \rbrace$

It denotes the set of strings of $i$ open parentheses followed by $i$ close parentheses, i.e. the numbers of open and close parentheses must be equal. Regular languages cannot express this kind of nesting, which real languages are full of: HTML, C/C++, Java. Another example that regular languages cannot handle is the language of palindromes.

    1.1.2 Upgrade to CFG

So here comes the context-free language! Its relation to the regular languages is shown below.

[Figure: the regular languages are a subset of the context-free languages]

We can see that the regular languages are in fact a subset of the context-free languages. You may ask why a CFG is called context-free: it is because a non-terminal can be rewritten by one of its productions regardless of the surrounding context. CFGs are much more powerful, so they are useful in many applications, such as describing the syntax of programming languages, parsing, and the structure of documents, e.g. XML.

    A CFG consists of:

    • A set of terminals $T$, e.g. if, else, return, break, or '(', ')'.
    • A set of non-terminals $N$, e.g. statements, expressions.
• A start symbol $S$, $S\in N$. Note: there is exactly one start symbol.
• A set of productions $X\rightarrow Y_1...Y_n$, where $X\in N$ and $Y_i\in N\cup T\cup\lbrace\epsilon\rbrace$.
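As a concrete (purely illustrative) representation, a CFG fits into plain Python data. The grammar below is for the balanced-parentheses language $\lbrace (^i)^i\mid i\geq 0\rbrace$ from section 1.1.1:

```python
# A CFG as plain data: terminals, non-terminals, start symbol, productions.
grammar = {
    "terminals": {"(", ")"},
    "nonterminals": {"S"},
    "start": "S",
    # Each production X -> Y1...Yn is a pair (X, [Y1, ..., Yn]);
    # the empty body [] encodes X -> epsilon.
    "productions": [
        ("S", ["(", "S", ")"]),   # S -> (S)
        ("S", []),                # S -> epsilon
    ],
}
```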

Terminals are so called because there are no rules for replacing them; once generated, they are permanent. For readability, upper-case letters denote non-terminals and lower-case letters denote terminals.

    Strings always start with the start symbol.

A production such as $A\rightarrow aBb$ means that whenever we see an $A$ (in other productions) we can substitute $aBb$ for it, i.e. remove the symbol on the left-hand side of the arrow and replace it with the string of symbols on the right-hand side. $A\overset{*}{\rightarrow}B$ means $A$ can derive $B$ in some number of steps, no matter how many.

Here is an example of a context-free grammar; let's see what it generates:

    $S\rightarrow (A)$

    $A\rightarrow\epsilon$

    $A\rightarrow aA$

$A\rightarrow ASA$

Well, it's too verbose to write $A$ multiple times, so we write $A\rightarrow\epsilon\mid aA\mid ASA$ for short.

So what does it generate? It generates all strings of $a$'s and balanced parentheses, wrapped in one outer pair of parentheses, e.g. $()$, $(a)$, $(a(aa)a)$.

    1.2 Derivations

    The second concept we need to learn is derivation. A derivation is a sequence of productions: $S\rightarrow...\rightarrow...\rightarrow...$

A derivation defines a parse tree (and can be drawn as one): the start symbol is the root of the tree, and for each production $X\rightarrow Y_1...Y_n$ used we add children $Y_1...Y_n$ to node $X$.

    1.2.1 Parse tree

    I'll show you an example:

    Given a context-free grammar $E\rightarrow E+E\mid E*E\mid(E)\mid id$ and a string $id*id+id$. Following is the generated parse tree.

[Figure: parse tree for id*id+id]

Below is the derivation and tree-building process:

1. Use $E\rightarrow E+E$ to substitute $E$.
2. Use $E\rightarrow E*E$ to substitute the $E$ on the left.
3. Use $E\rightarrow id$ to substitute the left-most $E$.
4. Use $E\rightarrow id$ to substitute the $E$ in the middle.
5. Use $E\rightarrow id$ to substitute the last $E$.

We can also see from the tree that all leaf nodes are terminals and all internal nodes are non-terminals. Moreover, reading the leaves from left to right yields exactly the original input. Additionally, the parse tree shows how the operations are grouped, which the input string does not.
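Both observations are easy to check mechanically. A small sketch (the nested-tuple tree encoding is my own):

```python
# Parse tree for id*id+id under E -> E+E | E*E | (E) | id,
# written as nested tuples: (non-terminal, child, child, ...).
tree = ("E",
        ("E", ("E", "id"), "*", ("E", "id")),
        "+",
        ("E", "id"))

def leaves(node):
    """Yield the terminal leaves of a parse tree from left to right."""
    if isinstance(node, str):        # a string is a terminal leaf
        yield node
    else:                            # a tuple is (non-terminal, *children)
        for child in node[1:]:
            yield from leaves(child)

assert "".join(leaves(tree)) == "id*id+id"   # the leaves spell the input
```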

    1.2.2 Canonical derivation

The example in the last section is a left-most derivation, which means that at each step we replace the left-most non-terminal.

You may wonder: is there a right-most derivation? Yes, there is! Moreover, a right-most derivation is called a canonical derivation. Below is the process:

[Figure: right-most derivation of id*id+id]

As you can see, each step substitutes the right-most non-terminal. But what else can you find?

Right-most and left-most derivations of this string have the same parse tree (ignore the lengths of the lines in the picture). So one parse tree may correspond to many derivations.

    1.3 Ambiguity

You have probably noticed that in the first step we used $E\rightarrow E+E$ to substitute $E$, not $E\rightarrow E*E$, as below:

[Figure: two possible parse trees for id*id+id]

The second production also leads to a parse tree with the same leaves, although it groups the operands as if $+$ bound tighter than $*$, which is completely wrong; the compiler, however, has no way to know that. So why did we use the first production?

Actually, in a real implementation, if no extra information is given, the parser will report an error because it does not know which production to use. This is called ambiguity: a grammar is ambiguous if it has more than one parse tree for some string.

We know that computers are very literal machines; they only obey our orders (programs), and the one thing they cannot tolerate is ambiguity. There are several ways to handle it.

    1.3.1 Old fashioned way

The most direct method is to rewrite the grammar unambiguously. Using the earlier example, we rewrite it as:

    $E\rightarrow E'+E\mid E'$

    $E'\rightarrow id*E'\mid id\mid (E)*E'\mid (E)$

    Through this method we can get the following parse tree:

[Figure: the unambiguous parse tree for id*id+id]

The idea of this method is to extract the ambiguous part into a new non-terminal, one layer per precedence level, so that the grammar itself forces $*$ to bind tighter than $+$.

Since this old-fashioned method is rarely used in real applications, we won't give more detail here. Check the dragon book if you are interested.

    1.3.2 Modern way

In the modern approach the solution is simple and effective: just enforce precedence of $*$ over $+$. A few declared rules make all the problems go away.

    So in modern tools, instead of rewriting the grammar:

    • Use the more natural (ambiguous) grammar.
    • Along with disambiguating declarations.
• Allow precedence and associativity declarations to disambiguate the grammar.
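For example, in Yacc/Bison one writes `%left '+'` before `%left '*'`; both operators become left-associative, `*` gets the higher precedence because it is declared later, and the tool then resolves the conflicts in the ambiguous grammar automatically. To show the underlying idea, here is a hand-rolled sketch in Python (precedence climbing over a token list; the encoding is mine, not what any generator produces):

```python
# Disambiguating E -> E+E | E*E | (E) | id with a precedence table.
PREC = {"+": 1, "*": 2}   # larger binds tighter; both left-associative

def parse_expr(tokens, pos=0, min_prec=1):
    """Parse tokens[pos:] and return (ast, next_pos)."""
    lhs, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] in PREC and PREC[tokens[pos]] >= min_prec:
        op = tokens[pos]
        # left associativity: the right operand must bind strictly tighter
        rhs, pos = parse_expr(tokens, pos + 1, PREC[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, pos

def parse_atom(tokens, pos):
    if tokens[pos] == "(":
        ast, pos = parse_expr(tokens, pos + 1)
        assert tokens[pos] == ")", "expected ')'"
        return ast, pos + 1
    return tokens[pos], pos + 1      # treat anything else as an id

# id*id+id groups as (id*id)+id: * over +, as declared.
ast, _ = parse_expr(["id", "*", "id", "+", "id"])
assert ast == ("+", ("*", "id", "id"), "id")
```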

    1.4 Abstract syntax tree

The abstract syntax tree, or AST, stands in contrast to the concrete syntax tree (CST, the parse tree we've just learned about). An AST omits much of the detail that would be present in a CST. For example:

Consider the grammar $E\rightarrow int\mid (E)\mid E+E$. Given the string $5+(2+3)$, after lexical analysis we get a list of tokens: $int_5+(int_2+int_3)$. Then we can generate the parse tree (CST) and the AST as below:

[Figure: CST vs AST for 5+(2+3)]

We can see that the CST carries much more information than the AST, such as the parentheses and non-terminals like $E$. The AST is more compact and easier to use, and it is an important data structure in a compiler.
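As an illustration, here is how the two trees for $5+(2+3)$ might be written as plain Python data (the encodings are my own):

```python
# Concrete syntax tree (parse tree): every grammar symbol appears,
# including the parentheses and every intermediate non-terminal E.
cst = ("E",
       ("E", ("int", 5)),
       "+",
       ("E", "(",
             ("E", ("E", ("int", 2)), "+", ("E", ("int", 3))),
        ")"))

# Abstract syntax tree: only operators and operands survive; the
# parentheses and the chain of E nodes are gone.
ast = ("+", 5, ("+", 2, 3))
```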

    2. Top-down parsing

The first category of parsing techniques is called top-down parsing because the parse tree is constructed from the top down and from left to right.

    2.1 Recursive descent parsing

Recursive descent parsing is the classic implementation of top-down parsing; it is simple but slow.

    Consider the context-free grammar:

    $E\rightarrow{T}\mid{T}+{E}$

    $T\rightarrow{int}\mid{int*T}\mid{(E)}$

Our token stream is $(int_7)$; set the pointer before the first token '('. Below is the whole parsing process:

[Figure: recursive descent parsing, step by step]

    Step 1: The root of parse tree should be the non-terminal start symbol $E$.

Step 2: The left-most production of $E$ is $E\rightarrow T$, so according to the rule of top-down parsing we try this one first.

Step 3: Since $T\rightarrow int$ is the left-most production of $T$, we replace $T$ with $int$. Now we meet our very first terminal, 'int', so we look at the pointer, which shows the next input is '(' not 'int'. Mismatch!

    Step 4: Well, in recursive descent parsing we need to backtrack. As you can see, we change the parse tree back.

Step 5: The next production of $T$ is $T\rightarrow int*T$, so we substitute $T$ with $int*T$. A terminal again, so it's matching time! Unfortunately, 'int' and '(' mismatch again.

    Step 6: Backtrack time.

Step 7: The next and last production of $T$ is $T\rightarrow (E)$, so we replace $T$ with $(E)$. This time the left-most leaf in the parse tree is '(', which matches the next input token '(', so we can advance the pointer to the next token.

    Step 8: The next node in parse tree is $E$, so we need to check the productions of $E$ one by one in order. Again $E\rightarrow{T}$ first.

Step 9: Do the same for the new node $T$, using the first production $T\rightarrow int$ here. We see a terminal, so it's matching time again. Fortunately, 'int' matches.

Step 10: The last terminal ')' also matches, and now there is no further input; thus this string of tokens is accepted.

Recursive descent parsing is easy to implement but relatively slow because it needs to backtrack whenever a match fails.
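The whole procedure above is mechanical, so it translates directly into one function per non-terminal. Below is a minimal runnable sketch in Python (a simplification of the trace above: it parses a $T$ and then greedily tries to extend it with $+E$, falling back to plain $T$; a fully general backtracking parser would also re-try alternative parses of $T$ itself):

```python
# Backtracking recursive descent for:
#   E -> T | T + E        T -> int | int * T | ( E )
# Each function returns (matched, next_position).

def parse(tokens):
    ok, pos = E(tokens, 0)
    return ok and pos == len(tokens)    # accept only if all input is used

def E(tokens, pos):
    ok, p = T(tokens, pos)
    if ok and p < len(tokens) and tokens[p] == "+":   # try E -> T + E
        ok2, p2 = E(tokens, p + 1)
        if ok2:
            return True, p2
    return ok, p                        # fall back to E -> T

def T(tokens, pos):
    # T -> int * T (try the longer alternative first)
    if pos + 1 < len(tokens) and tokens[pos] == "int" and tokens[pos + 1] == "*":
        ok, p = T(tokens, pos + 2)
        if ok:
            return True, p              # on failure, backtrack to the cases below
    # T -> int
    if pos < len(tokens) and tokens[pos] == "int":
        return True, pos + 1
    # T -> ( E )
    if pos < len(tokens) and tokens[pos] == "(":
        ok, p = E(tokens, pos + 1)
        if ok and p < len(tokens) and tokens[p] == ")":
            return True, p + 1
    return False, pos

assert parse(["(", "int", ")"])         # the walkthrough's input is accepted
assert not parse(["int", "+"])          # a dangling '+' is rejected
```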

    2.1.1 Left recursion

Just as ambiguity is a problem for parse trees, left recursion is a problem for recursive descent parsing: the algorithm simply does not terminate on a left-recursive grammar.

    Let's consider the production $S\rightarrow{S\alpha}\mid{\beta}$.

What we actually want is $\beta\alpha\alpha...\alpha$, but a recursive descent parser that tries $S\rightarrow S\alpha$ first keeps expanding $S$ without consuming any input, so it goes into an infinite loop and never reaches $S\rightarrow\beta$.

So we need a way to solve this problem. Before that, let's state left recursion in a more general form:

    $S\rightarrow{S\alpha_1}\mid...\mid{S\alpha_n}\mid{\beta_1}\mid...\mid{\beta_m}$

    To eliminate left recursion we must rewrite it as:

    $S\rightarrow{\beta_1S'}\mid...\mid{\beta_mS'}$

    $S'\rightarrow{\alpha_1S'}\mid...\mid{\alpha_nS'}\mid\epsilon$
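For instance (a standard textbook example, not from the grammar above): $E\rightarrow E+T\mid T$ is left-recursive with $\alpha_1=+T$ and $\beta_1=T$, so the rewrite produces

$E\rightarrow TE'$

$E'\rightarrow +TE'\mid\epsilon$

which generates the same strings $T+T+...+T$ without the left recursion.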

As we can see, left recursion must be eliminated before conducting recursive descent parsing, and in real tools this can be done automatically.

    2.2 Predictive parsing

We know that in recursive descent parsing many productions may be applicable at each step, and backtracking is needed to undo bad choices. That makes recursive descent parsing slow. So, is there a faster top-down parsing algorithm?

Here comes predictive parsing. It is called predictive because the parser can predict which production to use, without backtracking, by looking ahead at the next few tokens.

    2.2.1 LL(1) parsing

Predictive parsers accept LL(k) grammars. The first L means "left to right", the second L means "left-most derivation", and k is the number of lookahead tokens. In LL(1) parsing there is always exactly one choice of production.

    Let's take a look at our context-free grammar again:

    $E\rightarrow{T}\mid{T}+{E}$

    $T\rightarrow{int}\mid{int*T}\mid{(E)}$

The problem is that two productions of $E$ start with the same symbol $T$, and likewise two productions of $T$ start with the terminal $int$. Since we look ahead only one token, it's impossible to predict which production to use in this situation.

    Therefore we need to left-factor the grammar:

    $E\rightarrow{TX}$

    $X\rightarrow{+E}\mid\epsilon$

    $T\rightarrow{intY}\mid{(E)}$

    $Y\rightarrow{*T}\mid\epsilon$

Then we can use a parsing table to parse strings of tokens. In this section we just show how to use the parsing table; in section 2.2.4 we will learn how to generate an LL(1) parsing table ourselves. Below is the LL(1) parsing table:

[Figure: LL(1) parsing table for the left-factored grammar]

The first column holds the current left-most non-terminal, the first row holds the next input token, and each cell holds the production to use, with a blank cell meaning error. The working principle is similar to the state graph of a finite automaton: given the current symbol and the next input token, we can determine the production to use precisely. Additionally, '$' marks the end of the string.

    Now we can use the parsing table to parse the string $int*int$:

[Figure: LL(1) parsing of int*int, step by step]

Step 1: $E$ is the start symbol, so we push $E$ onto the stack. The first input token is 'int'; we check the parsing table to find which production to use, and it turns out to be $E\rightarrow TX$, so we replace $E$ with $TX$. Meanwhile we build the parse tree alongside the parsing process.

Step 2: The left-most symbol is the non-terminal $T$ and the input token is still 'int'; check the table, find $T\rightarrow intY$, and replace $T$ with $intY$.

Step 3: The left-most symbol is the terminal 'int', which of course matches the first input token 'int', so we pop 'int' and move the pointer to the next token '*'.

Step 4: The left-most symbol is the non-terminal $Y$ and the input token is now '*'; look up the table, get $Y\rightarrow *T$, and do the replacement.

Step 5: The left-most symbol is the terminal '*'; pop it and advance the pointer in the token stream.

Step 6: Non-terminal $T$ encounters 'int', so we use production $T\rightarrow intY$ to replace it.

Step 7: We meet the terminal 'int'; pop it and advance the pointer.

Step 8: Non-terminal $Y$ encounters '$', so we use the production $Y\rightarrow\epsilon$.

Step 9: Since $\epsilon$ is the empty string "", we can simply pop $Y$ and push nothing. Then $X$ meets '$'; do the same thing again.

Step 10: '$' meets '$', which means the string of tokens is accepted.

At the same time, our parse tree has been built.
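The driver loop above is short enough to sketch directly. A minimal Python version (my encoding: production bodies are lists of symbols, $\epsilon$ is the empty list, and only acceptance is computed, not the tree):

```python
# Table-driven LL(1) parsing for the left-factored grammar:
#   E -> T X     X -> + E | ε     T -> int Y | ( E )     Y -> * T | ε
TABLE = {
    ("E", "int"): ["T", "X"],   ("E", "("): ["T", "X"],
    ("X", "+"):   ["+", "E"],   ("X", ")"): [],  ("X", "$"): [],
    ("T", "int"): ["int", "Y"], ("T", "("): ["(", "E", ")"],
    ("Y", "*"):   ["*", "T"],   ("Y", "+"): [],  ("Y", ")"): [],  ("Y", "$"): [],
}
NONTERMINALS = {"E", "X", "T", "Y"}

def ll1_parse(tokens):
    tokens = tokens + ["$"]            # '$' marks the end of input
    stack = ["$", "E"]                 # start symbol on top of the stack
    pos = 0
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:        # expand using the table
            body = TABLE.get((top, tokens[pos]))
            if body is None:
                return False           # blank cell: syntax error
            stack.extend(reversed(body))
        elif top == tokens[pos]:       # terminal or '$': must match input
            pos += 1
        else:
            return False
    return True

assert ll1_parse(["int", "*", "int"])  # accepted, as in the walkthrough
assert not ll1_parse(["int", "+"])     # incomplete input is rejected
```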

Now that we know how to use a parsing table to parse a token stream, we are going to learn how to build the LL(1) parsing table.

    2.2.2 First set

We've learned that an LL(1) parser can look ahead only one token, and that it chooses a production using the current left-most non-terminal together with the next input token. That is to say, if there is a production $A\rightarrow\alpha$ and a terminal $t$ such that $\alpha\rightarrow^*t\beta$, i.e. $\alpha$ can derive a string beginning with $t$, then that production must appear in the table cell $[A,t]$. We say $t$ is in the First set of $A$.

The First set is defined in general as below:

$First(X)=\lbrace t\mid X\rightarrow^*t\alpha\rbrace\cup\lbrace\epsilon\mid X\rightarrow^*\epsilon\rbrace$

In plain English, this means:

• The First set of a terminal is itself: $First(t)=\lbrace t\rbrace$.
• $\epsilon$ can be in a First set: $\epsilon\in First(X)$ if $X\rightarrow^*\epsilon$.
• $First(\alpha)\subseteq First(X)$ if $X\rightarrow\alpha...$, or more generally if $X\rightarrow A_1...A_n\alpha...$ where $A_1...A_n\rightarrow^*\epsilon$.

    Let's consider this grammar again:

    $E\rightarrow{TX}$

    $X\rightarrow{+E}\mid\epsilon$

    $T\rightarrow{intY}\mid{(E)}$

    $Y\rightarrow{*T}\mid\epsilon$

First of all, the First sets of the terminals are the terminals themselves.

Secondly, we start with $First(E)$. Since $E\rightarrow TX$ and $T$ cannot derive $\epsilon$, every string derived from $E$ begins with a token in $First(T)$; together with the third rule above, this gives $First(E)=First(T)$, so we need to find $First(T)$ first.

    $First(T)=\lbrace{(,int}\rbrace$

    $First(E)=First(T)=\lbrace{(,int}\rbrace$.

    $First(X)=\lbrace{+,\epsilon}\rbrace$.

    $First(Y)=\lbrace{*,\epsilon}\rbrace$.

Apart from the terminals, then, we can compute the First sets by working through the non-terminals on the left-hand side of each arrow, applying the three rules until nothing changes.
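That "until nothing changes" is a fixpoint iteration, and it is short to code. A minimal sketch for this grammar ($\epsilon$ is encoded as the empty production body; all names are mine):

```python
# First sets for:  E -> T X   X -> + E | ε   T -> int Y | ( E )   Y -> * T | ε
PRODUCTIONS = [
    ("E", ["T", "X"]),
    ("X", ["+", "E"]), ("X", []),
    ("T", ["int", "Y"]), ("T", ["(", "E", ")"]),
    ("Y", ["*", "T"]), ("Y", []),
]
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS}

def first_of_string(symbols, first):
    """First set of a string Y1...Yn, following the three rules above."""
    out = set()
    for sym in symbols:
        if sym not in NONTERMINALS:   # rule 1: a terminal's First is itself
            out.add(sym)
            return out
        out |= first[sym] - {"ε"}     # rule 3: add First(Yi) without ε
        if "ε" not in first[sym]:     # Yi cannot vanish, so stop here
            return out
    out.add("ε")                      # rule 2: the whole string can derive ε
    return out

def first_sets():
    first = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:                    # iterate to a fixpoint
        changed = False
        for lhs, body in PRODUCTIONS:
            new = first_of_string(body, first)
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first

assert first_sets() == {"E": {"(", "int"}, "X": {"+", "ε"},
                        "T": {"(", "int"}, "Y": {"*", "ε"}}
```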

    2.2.3 Follow set

    The general definition of Follow set is:

    $Follow(X)=\lbrace{t\mid S\rightarrow^*\beta X t \theta}\rbrace$

In plain English, this means:

• $\$\in Follow(S)$, where $S$ is the start symbol.
• For each production $A\rightarrow\alpha X\beta$: $First(\beta)-\lbrace\epsilon\rbrace\subseteq Follow(X)$.
• For each production $A\rightarrow\alpha X\beta$ where $\epsilon\in First(\beta)$ (including the case that $\beta$ is empty): $Follow(A)\subseteq Follow(X)$.

We still use the grammar from the last section (see above, no need to repeat it). We can compute the Follow sets:

$Follow(E)=\lbrace\$,)\rbrace$

$Follow(X)=\lbrace\$,)\rbrace$

$Follow(T)=\lbrace\$,),+\rbrace$

$Follow(Y)=\lbrace\$,),+\rbrace$

You could also write down the Follow sets of the terminals; I won't do that here because they won't be used.
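Like First sets, Follow sets are computed as a fixpoint. The sketch below continues the First-set code above (it reuses `PRODUCTIONS`, `NONTERMINALS`, `first_sets` and `first_of_string`):

```python
def follow_sets(start="E"):
    first = first_sets()
    follow = {n: set() for n in NONTERMINALS}
    follow[start].add("$")                 # rule 1: $ ∈ Follow(start symbol)
    changed = True
    while changed:
        changed = False
        for lhs, body in PRODUCTIONS:      # look at each X in A -> α X β
            for i, sym in enumerate(body):
                if sym not in NONTERMINALS:
                    continue
                rest = first_of_string(body[i + 1:], first)
                new = rest - {"ε"}         # rule 2: First(β) - {ε} ⊆ Follow(X)
                if "ε" in rest:            # rule 3: β can vanish (or is empty),
                    new |= follow[lhs]     #         so Follow(A) ⊆ Follow(X)
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow

assert follow_sets() == {"E": {"$", ")"}, "X": {"$", ")"},
                         "T": {"$", ")", "+"}, "Y": {"$", ")", "+"}}
```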

    2.2.4 LL(1) Parsing table

Finally, we can use the First and Follow sets to generate the LL(1) parsing table. The algorithm is as below:

To construct a parsing table $T$ for a context-free grammar $G$:

    For each production $A\rightarrow\alpha$ in G do:

• For each terminal $t\in First(\alpha)$, set $T[A,t]=\alpha$.
• If $\epsilon\in First(\alpha)$, then for each terminal $t\in Follow(A)$, set $T[A,t]=\alpha$.
• If $\epsilon\in First(\alpha)$ and $\$\in Follow(A)$, set $T[A,\$]=\alpha$.

Follow the steps above and we get the parsing table:

[Figure: the generated LL(1) parsing table]
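Continuing the same Python sketch, this construction is only a few lines (the assert on conflicts is my addition: a doubly-filled cell would mean the grammar is not LL(1)):

```python
def build_table(start="E"):
    first, follow = first_sets(), follow_sets(start)
    table = {}
    for lhs, body in PRODUCTIONS:          # one pass per production A -> α
        f = first_of_string(body, first)
        targets = (f - {"ε"}) | (follow[lhs] if "ε" in f else set())
        for t in targets:                  # Follow(A) already contains '$'
            assert (lhs, t) not in table, "grammar is not LL(1)"
            table[(lhs, t)] = body
    return table

# build_table() reproduces exactly the TABLE used in the walkthrough above.
```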

In fact, most programming language CFGs are not LL(1), so plain LL(1) parsing is rarely usable directly in real applications. However, the concepts it introduces are important to learn.

    3. Bottom-up Parsing

    Coming soon...


