Context-Free Languages and Syntax Analyzer

2018-01-12

A context-free grammar is a more powerful way to describe languages than a regular expression. Such grammars can describe features that have a recursive structure, which is useful in a variety of applications. Context-free grammars were first used in the study of human languages, which are built from categories such as noun, verb and preposition. The relationships among these categories and their phrases are naturally recursive, and context-free grammars can capture these patterns. The collection of languages associated with context-free grammars is called the context-free languages. In programming languages, the grammar is used by the parser, the part of a compiler or interpreter that extracts the meaning of a program before the compiled code is generated or the interpreted execution is performed. Given a context-free grammar, we can construct a parser for the programming language. In this article I introduce the concepts of context-free languages and, as practice, program a syntax analyzer for SL. I will also introduce pushdown automata, a class of machines that recognize context-free languages.

Reference:

Introduction to the Theory of Computation, Michael Sipser
Teaching materials from the course System Programming, Zili Shao

Context-free Grammars

If you try to use a regular expression to describe the language L = {a^n b^n, n >= 0}, you will soon find it impossible, because regular expressions cannot count: they have no way to require that the number of a's equals the number of b's. However, this kind of matching pattern is common in the way humans communicate and understand languages, so we need a more powerful kind of grammar.
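To make the counting requirement concrete, here is a minimal sketch of a recognizer for this language. The function name is_anbn is hypothetical and not part of the original program. It relies on counters that can grow without bound, which is exactly the kind of memory a regular expression does not have.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if s has the form a^n b^n with n >= 0,
 * otherwise 0. */
int is_anbn(const char *s)
{
    size_t a_count = 0, b_count = 0;

    while (*s == 'a') {      /* count the leading a's */
        a_count++;
        s++;
    }
    while (*s == 'b') {      /* count the b's that follow */
        b_count++;
        s++;
    }
    /* Accept only if nothing is left over and the counts match. */
    return *s == '\0' && a_count == b_count;
}
```

With this sketch, is_anbn("aabb") is 1, while is_anbn("aab") and is_anbn("abab") are 0.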

S -> aSb | T
T -> ab

This is an example of a context-free grammar. It describes the language L = {a^n b^n, n >= 1}: the language above without the empty string, because T always produces at least ab.
A grammar consists of a collection of substitution rules, also called productions. Each rule appears on its own line and comprises a symbol and a string separated by an arrow. The symbol is called a variable. The string consists of variables and other symbols called terminals. Usually, variables are represented by capital letters, and terminals by lowercase letters, numbers and special symbols. One variable is designated the start variable; it is usually the one on the left-hand side of the top-most rule.
Taking the grammar above as an example, S is its start variable, the other variable is T, and its terminals are a and b.
When we use this grammar to generate a string, say ‘aaabbb’, the process is as follows:

S -> aSb -> aaSbb -> aaTbb -> aaabbb

This process is called a derivation. Starting from the start variable, we repeatedly replace a variable with the right-hand side of one of its rules, deriving each string from the previous one. When no variables remain in the string, the derivation is finished and the string has been generated.
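The derivation can also be carried out mechanically. The following is a minimal sketch under my own assumptions, not code from the original program: the hypothetical functions apply_rule and derive_anbn rewrite a character buffer, applying S -> aSb (n - 1) times, then S -> T, then T -> ab, and printing every intermediate string.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: replaces the first occurrence of variable `var`
 * in `form` with `rhs`. Returns 1 on success, 0 if `var` is absent or
 * the result would not fit in `cap` bytes. */
static int apply_rule(char *form, size_t cap, char var, const char *rhs)
{
    char *pos = strchr(form, var);
    size_t rhs_len = strlen(rhs);
    size_t tail_len;

    if (pos == NULL)
        return 0;
    tail_len = strlen(pos + 1);
    if (strlen(form) - 1 + rhs_len + 1 > cap)
        return 0;
    /* Shift the tail (including the terminating NUL), then copy rhs in. */
    memmove(pos + rhs_len, pos + 1, tail_len + 1);
    memcpy(pos, rhs, rhs_len);
    return 1;
}

/* Derives a^n b^n (n >= 1) with the rules S -> aSb | T and T -> ab,
 * printing every step. The final string is left in `form`. */
int derive_anbn(int n, char *form, size_t cap)
{
    int i;

    if (n < 1 || cap < 2)
        return 0;
    strcpy(form, "S");
    printf("%s", form);
    for (i = 1; i < n; i++) {              /* S -> aSb, n - 1 times */
        if (!apply_rule(form, cap, 'S', "aSb"))
            return 0;
        printf(" -> %s", form);
    }
    if (!apply_rule(form, cap, 'S', "T"))  /* S -> T */
        return 0;
    printf(" -> %s", form);
    if (!apply_rule(form, cap, 'T', "ab")) /* T -> ab */
        return 0;
    printf(" -> %s\n", form);
    return 1;
}
```

Calling derive_anbn(3, buf, sizeof buf) with a buffer of 32 bytes prints S -> aSb -> aaSbb -> aaTbb -> aaabbb and leaves aaabbb in buf.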

Formal Description for CFG

A context-free grammar is a 4-tuple (V, ∑, R, S), where

V is a finite set of variables.
∑ is a finite set of terminals, disjoint from V.
R is a finite set of rules, each consisting of a variable and a string of variables and terminals.
S ∈ V is the start variable.

Taking the previous grammar as an example, it is written formally as ({S, T}, {a, b}, R, S), where R is

S -> aSb | T
T -> ab
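One possible way to hold such a 4-tuple in C is sketched below. It assumes every variable and terminal is a single character; the names struct rule, struct cfg and cfg_is_well_formed are hypothetical. The check function tests the constraints from the definition: V and ∑ are disjoint, S belongs to V, and every rule has a variable on the left and only known symbols on the right.

```c
#include <string.h>

/* One production: a single variable on the left, a string of variables
 * and terminals on the right ("" stands for the empty string). */
struct rule {
    char lhs;
    const char *rhs;
};

/* A grammar (V, Sigma, R, S) whose symbols are single characters. */
struct cfg {
    const char *variables;    /* V */
    const char *terminals;    /* Sigma */
    const struct rule *rules; /* R */
    size_t rule_count;
    char start;               /* S */
};

/* Returns 1 if c is one of the characters in set, 0 otherwise. */
static int in_set(const char *set, char c)
{
    return c != '\0' && strchr(set, c) != NULL;
}

/* Checks the constraints from the formal definition. */
int cfg_is_well_formed(const struct cfg *g)
{
    size_t i;
    const char *s;

    for (s = g->variables; *s != '\0'; s++)    /* V and Sigma disjoint */
        if (in_set(g->terminals, *s))
            return 0;
    if (!in_set(g->variables, g->start))       /* S is a member of V */
        return 0;
    for (i = 0; i < g->rule_count; i++) {
        if (!in_set(g->variables, g->rules[i].lhs))
            return 0;
        for (s = g->rules[i].rhs; *s != '\0'; s++)
            if (!in_set(g->variables, *s) && !in_set(g->terminals, *s))
                return 0;
    }
    return 1;
}

/* The grammar ({S, T}, {a, b}, R, S) from the text. */
static const struct rule anbn_rules[] = {
    { 'S', "aSb" }, { 'S', "T" }, { 'T', "ab" }
};
const struct cfg anbn = { "ST", "ab", anbn_rules, 3, 'S' };
```

Here cfg_is_well_formed(&anbn) returns 1; changing the start symbol to a terminal such as 'a' makes it return 0.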

Parse Tree

Consider a context-free grammar G1 = ({<EXPR>, <TERM>, <FACTOR>}, {a, +, -, ×, ÷, (, )}, R, <EXPR>), where R is

<EXPR> -> <EXPR> + <TERM>| <TERM>
<TERM> -> <TERM> × <FACTOR>| <FACTOR>
<FACTOR> -> (<EXPR>)| a

With this grammar, we can generate strings such as a+a×a and (a+a)×a as follows:

a+a×a

<EXPR> -> <EXPR> + <TERM>
       -> <TERM> + <TERM>
       -> <FACTOR> + <TERM>
       -> a + <TERM>
       -> a + <TERM> × <FACTOR>
       -> a + <FACTOR> × <FACTOR>
       -> a + a × <FACTOR>
       -> a + a × a

(a+a)× a

<EXPR> ->  <TERM>
       -> <TERM> × <FACTOR>
       -> <FACTOR> × <FACTOR>
       -> (<EXPR>) × <FACTOR>
       -> (<EXPR> + <TERM>) × <FACTOR>
       -> (<TERM> + <TERM>) × <FACTOR>
       -> (<FACTOR> + <TERM>) × <FACTOR>
       -> (<FACTOR> + <FACTOR>) × <FACTOR>
       -> ( a + <FACTOR>) × <FACTOR>
       -> ( a + a) × <FACTOR>
       -> ( a + a) × a

Each of these derivations can be drawn as a parse tree.

A compiler is a program that translates code written in a programming language into another form that is more suitable for execution. To do so, the compiler extracts the meaning of the code in a process called parsing.
One way of representing the parsing process is to draw a parse tree for the code, according to the context-free grammar of the programming language.
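As an illustration, the sketch below writes the parse tree for G1 in bracket form, for example FACTOR(a). It is my own minimal version under stated assumptions: the input contains no spaces, the ASCII character * stands in for ×, and all function names are hypothetical. Because the rules of G1 are left-recursive, a function that followed <EXPR> -> <EXPR> + <TERM> literally would call itself without consuming input, so parse_left uses a loop and wraps what it has already written as the left child of a new node.

```c
#include <string.h>

/* Output buffer for the tree in bracket form, e.g. FACTOR(a). */
struct out {
    char text[512];
    size_t len;
};

/* Inserts text at position pos, shifting the rest to the right.
 * Returns 0 if the buffer is too small. */
static int insert(struct out *o, size_t pos, const char *text)
{
    size_t n = strlen(text);
    if (o->len + n >= sizeof o->text)
        return 0;
    memmove(o->text + pos + n, o->text + pos, o->len - pos + 1);
    memcpy(o->text + pos, text, n);
    o->len += n;
    return 1;
}

static int append(struct out *o, const char *text)
{
    return insert(o, o->len, text);
}

static int parse_expr(const char **s, struct out *o);

/* <FACTOR> -> ( <EXPR> ) | a */
static int parse_factor(const char **s, struct out *o)
{
    if (**s == 'a') {
        (*s)++;
        return append(o, "FACTOR(a)");
    }
    if (**s != '(')
        return 0;
    (*s)++;
    if (!append(o, "FACTOR((") || !parse_expr(s, o) || **s != ')')
        return 0;
    (*s)++;
    return append(o, "))");
}

/* Handles <X> -> <X> op <SUB> | <SUB> with a loop. Each time op is seen,
 * everything written since `start` is wrapped as the left child of a new
 * <X> node, which is the shape the left-recursive rule produces. */
static int parse_left(const char **s, struct out *o, const char *open,
                      const char *op,
                      int (*sub)(const char **, struct out *))
{
    size_t start = o->len;
    if (!sub(s, o) || !insert(o, start, open) || !append(o, ")"))
        return 0;
    while (**s == op[0]) {
        (*s)++;
        if (!insert(o, start, open) || !append(o, op)
            || !sub(s, o) || !append(o, ")"))
            return 0;
    }
    return 1;
}

static int parse_term(const char **s, struct out *o)
{
    return parse_left(s, o, "TERM(", "*", parse_factor);
}

static int parse_expr(const char **s, struct out *o)
{
    return parse_left(s, o, "EXPR(", "+", parse_term);
}

/* Writes the parse tree of input into o->text. Returns 1 if the whole
 * input is derivable from <EXPR>, 0 on a syntax error or overflow. */
int parse_tree(const char *input, struct out *o)
{
    o->len = 0;
    o->text[0] = '\0';
    return parse_expr(&input, o) && *input == '\0';
}
```

For the input a+a*a the sketch produces EXPR(EXPR(TERM(FACTOR(a)))+TERM(TERM(FACTOR(a))*FACTOR(a))), which has the same shape as the derivation of a+a×a shown earlier.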

Syntax Analyzer

For the practice part, I implemented a syntax analyzer for SL as a recursive descent parser based on its context-free grammar (for the lexical analyzer part, please see: Simple Language). The syntax analyzer parses a program segment and decides whether it is accepted by the grammar.
The context-free grammar is as follows:

PROG -> PROG_BODY.
PROG_BODY -> var ID_LIST; STATEMENT_LIST
ID_LIST -> id ID_LIST’
ID_LIST’ -> , id ID_LIST’ | ε
STATEMENT_LIST -> begin STATEMENT end
STATEMENT -> id=EXP; STATEMENT’
STATEMENT’-> id=EXP; STATEMENT’ | STATEMENT_LIST STATEMENT’ | ε 
EXP ->TERM E’
E’-> + TERM E’ | - TERM E’ | ε
TERM -> FAC T'
T’ -> *FAC T’| /FAC T’|ε
FAC -> (EXP) | num | id

Notice that this grammar has no left recursion, which avoids an endless loop during recursive descent parsing: a rule that begins with its own variable would make the parser call itself without consuming any input.
Now let’s add the token names to this grammar for reference. To keep the notation short, I simply replace each token name with a single letter as its symbol.

Token Name	Symbol
KEYWORD var	A
KEYWORD begin	a
KEYWORD end	b
PLUS	F
MINUS	G
COMMA	B
DIV	I
SEMICOLON	C
LBRACE	J
ASSIGN	D
RBRACE	K
PERIOD	E
ID	L
NUM	M
Unrecognized	N

Then the rule set R looks like this, where H (not listed in the table above) stands for the multiplication operator *:

PROG -> PROG_BODY E
PROG_BODY ->A ID_LIST C STATEMENT_LIST 
ID_LIST -> L ID_LIST’
ID_LIST’ -> B L ID_LIST’ | ε
STATEMENT_LIST -> a STATEMENT b
STATEMENT -> L D EXP C STATEMENT’
STATEMENT’-> L D EXP C STATEMENT’ | STATEMENT_LIST STATEMENT’ | ε 
EXP ->TERM E’
E’-> F TERM E’ | G TERM E’ | ε
TERM -> FAC T’
T’ -> H FAC T’| I FAC T’|ε
FAC -> J EXP K | M | L

So the grammar can be written formally as G = ({PROG, PROG_BODY, ID_LIST, STATEMENT_LIST, ID_LIST’, STATEMENT, STATEMENT’, EXP, TERM, E’, FAC, T’}, {A, a, b, B, C, D, E, F, G, H, I, J, K, L, M}, R, PROG).
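The translation from tokens to symbols can be sketched as a lookup table. Everything named here is an assumption for illustration: the enum token and its members are hypothetical (the real names live in the lexical analyzer), TOK_MUL is my own name for the token behind H, and the trailing '$' end marker is inferred from the check in main.

```c
#include <stddef.h>

/* Hypothetical token kinds; the names follow the table, except TOK_MUL,
 * which the table does not list (the grammar uses H for '*'). */
enum token {
    TOK_VAR, TOK_BEGIN, TOK_END, TOK_COMMA, TOK_SEMICOLON, TOK_ASSIGN,
    TOK_PERIOD, TOK_PLUS, TOK_MINUS, TOK_MUL, TOK_DIV, TOK_LBRACE,
    TOK_RBRACE, TOK_ID, TOK_NUM, TOK_UNRECOGNIZED
};

/* Symbol for each token kind, indexed by enum token. */
static const char symbol_of[] = {
    'A', 'a', 'b', 'B', 'C', 'D', 'E', 'F',
    'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N'
};

/* Writes the symbols for tokens[0..count-1] into out, followed by the
 * end marker '$' and a NUL. Returns 1 on success, 0 if out is too small. */
int tokens_to_symbols(const enum token *tokens, size_t count,
                      char *out, size_t cap)
{
    size_t i;

    if (cap < count + 2)
        return 0;
    for (i = 0; i < count; i++)
        out[i] = symbol_of[tokens[i]];
    out[count] = '$';
    out[count + 1] = '\0';
    return 1;
}
```

Under these assumptions, the token sequence for "var a; begin a = 5 end." becomes the symbol string ALCaLDMbE$.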

Implementation

First, I reused the scan() function from the lexical analyzer to extract the tokens from the program segment.
After the tokens are translated into a string of symbols, that string becomes the input of the parser, and the pointer p marks the current symbol. Parsing is then done with the code below:

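/* EXP1 -> F EXP | G EXP | H EXP | I EXP | ε
   EXP and EXP1 together accept the same strings as EXP, E', TERM, T'
   and FAC in the grammar; operator precedence is not needed just to
   accept or reject the input. */
int EXP();
int EXP1(){
	if(*p == 'F' || *p == 'G' || *p == 'H' || *p == 'I'){
		p++;
		/* the operand may be an id, a number or a parenthesized expression */
		return EXP();
	}
	return 1;	/* ε: no operator follows */
}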
/*EXP -> L EXP1| M EXP1| J EXP K EXP1
*/
int EXP(){
	if(*p == 'L' || *p == 'M'){
		p++;
		if(EXP1()==1){
			//p++;
			return 1;
		}
		return 0;
	}else if(*p == 'J'){
		p++;
		if(EXP()== 1){
			//p++;
			if(*p == 'K'){
				p++;
				if(EXP1()==1){
					return 1;
				}
				return 0;
			}
			return 0;
		}
		return 0;
	}
	return 0;
}
/*STATEMENT1 -> L D EXP C STATEMENT1 | STATEMENT_LIST STATEMENT1 |ε
*/
int STATEMENT1(){
	if(*p=='L'){
		p++;
		if(*p=='D'){
			p++;
			if(EXP()==1){
				//p++;
				if(*p=='C'){
					p++;
					if(STATEMENT1()==1){
						
						return 1;
					}
					return 0;
				}
				return 0;
			}
			return 0;
		}
		return 0;
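	}else if(*p == 'a'){
		/* nested begin ... end block: STATEMENT_LIST() already consumes
		   the closing 'b', so p must not be advanced again here */
		if(STATEMENT_LIST() == 1 && STATEMENT1() == 1){
			return 1;
		}
		return 0;
	}
	return 1;	/* ε */
}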
/*STATEMENT -> L D EXP C STATEMENT1
*/
int STATEMENT(){
	printf("\t\t\tAssignment begins\n");
	if(*p =='L'){
		p++;
		if(*p =='D'){
			p++;
			if(EXP()==1){
				//p++;
				if(*p == 'C'){
					printf("\t\t\tAssignment ends\n");
					p++;
					if(STATEMENT1()==1){
						//p++;
						return 1;
					}
					return 0;
				}
				return 0;
			}
			return 0;
		}
		return 0;
	}
	return 0;
}
/*STATEMENT_LIST -> a STATEMENT b
*/
int STATEMENT_LIST(){
	if(*p == 'a'){
		p++;
		printf("\t\tStatement begins\n");
		if(STATEMENT()==1){
			//p++;
			//printf("checked %s\n", p);
			if(*p=='b'){
				p++;
				printf("\t\tStatement ends\n");
				return 1;
			}
			return 0;
		}
		return 0;
	}
	return 0;
}
/* ID_LIST1 ->B L ID_LIST1| ε
*/
int ID_LIST1(){
	if(*p =='B'){
		p++;
		if(*p =='L'){
			p++;
			if(ID_LIST1()==1){
				//p++;
				return 1;
			}
			return 0;
		}
		return 0;
	}
	return 1;	/* ε */
}
/*ID_LIST -> L ID_LIST1
*/
int ID_LIST(){
	if(*p=='L'){
		p++;
		if(ID_LIST1()==1){
			//p++;
			return 1;
		}
		return 0;
	}
	return 0;
}
/*PROG_BODY -> A ID_LIST C STATEMENT_LIST
*/
int PROG_BODY(){
	if(*p=='A'){
		p++;
		printf("\tDeclaration begins\n");
		if(ID_LIST()==1){
			//p++;
			//printf("checked %s\n", p);
			if(*p =='C'){
				p++;
				printf("\tDeclaration ends\n");
				if(STATEMENT_LIST()==1){
					//p++;
					return 1;
				}
				return 0;
			}
			return 0;
		}
		return 0;
	}
	return 0;
}
/*PROG -> PROG_BODY E
*/
int PROG(){
	printf("Assignment 4:\n\n");
	printf("Program begins\n");
	if(PROG_BODY()==1){
		//p++;
		if(*p=='E'){
			p++;
			printf("Program ends\n");
			return 1;
		}
		return 0;
	}
	return 0;
}

With recursion the code is concise and easy to understand: each variable of the grammar becomes a function, which may call itself, and each terminal is checked as a single character of the symbol string.
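The nested if chains can also be flattened with a small helper. The following is an alternative sketch of my own, not the code used in this article: match is a hypothetical helper that consumes one expected symbol, the primed variables (ID_LIST’, STATEMENT’, E’, T’) are written as loops, and the input is assumed to be a symbol string that ends with '$'.

```c
static const char *p;   /* current position in the symbol string */

/* Consumes the current symbol if it equals c. */
static int match(char c)
{
    if (*p == c) {
        p++;
        return 1;
    }
    return 0;
}

static int parse_exp(void);
static int parse_statement_list(void);

/* FAC -> J EXP K | M | L */
static int parse_fac(void)
{
    if (match('J'))
        return parse_exp() && match('K');
    return match('M') || match('L');
}

/* TERM -> FAC T',  T' -> H FAC T' | I FAC T' | ε */
static int parse_term(void)
{
    if (!parse_fac())
        return 0;
    while (match('H') || match('I'))
        if (!parse_fac())
            return 0;
    return 1;
}

/* EXP -> TERM E',  E' -> F TERM E' | G TERM E' | ε */
static int parse_exp(void)
{
    if (!parse_term())
        return 0;
    while (match('F') || match('G'))
        if (!parse_term())
            return 0;
    return 1;
}

/* STATEMENT  -> L D EXP C STATEMENT'
 * STATEMENT' -> L D EXP C STATEMENT' | STATEMENT_LIST STATEMENT' | ε */
static int parse_statement(void)
{
    if (!(match('L') && match('D') && parse_exp() && match('C')))
        return 0;
    for (;;) {
        if (*p == 'L') {
            if (!(match('L') && match('D') && parse_exp() && match('C')))
                return 0;
        } else if (*p == 'a') {
            if (!parse_statement_list())
                return 0;
        } else {
            return 1;   /* ε */
        }
    }
}

/* STATEMENT_LIST -> a STATEMENT b */
static int parse_statement_list(void)
{
    return match('a') && parse_statement() && match('b');
}

/* ID_LIST -> L ID_LIST',  ID_LIST' -> B L ID_LIST' | ε */
static int parse_id_list(void)
{
    if (!match('L'))
        return 0;
    while (match('B'))
        if (!match('L'))
            return 0;
    return 1;
}

/* PROG -> PROG_BODY E,  PROG_BODY -> A ID_LIST C STATEMENT_LIST.
 * Returns 1 if the whole symbol string (ending in '$') is accepted. */
int parse_program(const char *symbols)
{
    p = symbols;
    return match('A') && parse_id_list() && match('C')
        && parse_statement_list() && match('E') && *p == '$';
}
```

Because && stops at the first failure, each rule reads almost like its grammar production. With this sketch, parse_program("ALCaLDMCbE$") returns 1, and parse_program("ALCaLDMbE$"), where the C for the semicolon is missing, returns 0.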
In main, the program opens the input file and executes the following lines to run both the lexical and the syntax analysis:

	/* Run the lexical analysis */
	if (scan() > 0){
		/* Run the syntax analysis */
		p = token_list;
		printf("p:%s\n", p);
		/* accept only if PROG succeeds and the whole input was consumed */
		if( (PROG() == 1) && (*p == '$') ){
			printf("Accept.\n");
		}else{
			printf("Reject.\n");
		}
		return 1;
	}else{
		return -1;
	}

Conclusion

Let’s test it with an example:
Example 1:

var A, B, C; 
begin 
	A = 5;
	B = 6;
	C = (a + b) / 2.0; 
end.

The result:

Let’s test a negative example with a syntax error:
Example 2:

var a;
begin
a = 5
end.

The ‘;’ after ‘5’ is missing.
The result:

All in all, this practice sample is just my personal implementation, so there are surely things that could be revised and improved. I wrote it down so that I can find it again and keep it in mind.

This is the end of my review article. Feel free to contact me if you have any questions about it.