COMP 524 Program 1 (Parser)
General Instructions
First, review the assignment submission policy in the syllabus. Note that there is no collaboration allowed.
This assignment is due at 11:59pm on Tuesday, February 19th. Submit assignments to me via email. All of your code should be included in a file called 'program1.py', and 'program1.py' should be the only thing that you turn in. When I am finished grading your assignment, I will email you your grade and any comments that I have. There are a total of 100 points and 10 bonus points.
Remember: start early! I guarantee my availability during office hours, but not at 11pm the night it is due.
When you turn in program1.py, there should be no statements at the top level that produce output or require input. Merely provide the required functions and anything the functions need--do not include any code related to your testing or debugging. When I import your functions into another script, I want the import to be silently successful. Thank you.
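While you are still developing, one way to keep an import silent is to put any ad hoc test code behind a `__main__` guard, and then delete that code before submitting. The sketch below only illustrates the idea: `_self_test` is a hypothetical helper, and the `scan` and `parse` bodies and the grammar text are placeholders.

```python
# program1.py (sketch): importing this module defines names but prints nothing.

grammar = """
program := inner_block
"""  # placeholder text; the full grammar would go here


def scan(filename):
    """Placeholder for the required scanner entry point."""
    raise NotImplementedError


def parse(tokens):
    """Placeholder for the required parser entry point."""
    raise NotImplementedError


def _self_test():
    # Hypothetical debugging helper; remove it before turning the file in.
    print("running ad hoc tests...")


if __name__ == "__main__":
    # Runs for `python program1.py`, but not for `import program1`.
    _self_test()
```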
Update 2008-02-02 8:02pm - I have extended the deadline 1 week.
Update 2008-02-02 7:35pm - Eric G. pointed out several problems with the assignment, which have been fixed. First, I had an invalid identifier (containing underscores and digits) in my 3rd example. Also, there was a bug with the definition of a statement, which should not allow a declaration. Declarations should be separate from statements, since they should come before the other types of statements in a program/inner block.
Update 2008-01-30 11:50pm - Please make sure you comment out your print statements before you turn this in.
My solutions are now available.
Overview
You are to write a parser for a C-like language, in Python. This assignment consists of three parts: write and simplify a grammar defining the language, write a scanner, and write a parser.
Language Description
Let's call our language Pseudo-C. Pseudo-C supports a subset of C's syntax, with a few simplifying reductions. The following constructs are supported by Pseudo-C:
- Identifiers are defined to be a sequence of one or more lower-case letters (i.e. [a-z]+), with the exception of predefined keywords. (Note that identifiers, number literals, string literals, etc. are tokens, or terminal symbols. Their definitions will not be in the grammar, but rather they are defined by the scanner.)
- Several operators are defined: +, -, *, /, == (equality), <, <=, >, >=, != (inequality), || (logical or), && (logical and), and ! (logical not). Parentheses can be used for grouping and precedence.
- A statement is either an imperative or a compound expression.
- A declaration is a type keyword followed by an identifier followed by a semicolon.
- Valid type keywords are: int, float, string.
- An imperative is either an assignment followed by a semicolon or a function call followed by a semicolon.
- An assignment is an identifier followed by = followed by a simple expression.
- A simple expression is a standard arithmetic combination of identifiers, number literals, string literals, function calls, and operators. Note that although certain combinations including string literals might not make sense (e.g. str / 3), they are syntactically valid. The precedence is as follows, with operators higher on the list binding tighter. The not operator ! must be followed by parentheses (e.g. !(a<10)), so its precedence level need not be defined. Operators with the same precedence exhibit left-associativity, e.g. 2+3+4 is (2+3)+4.
  - * /
  - + -
  - == != < <= > >=
  - && ||
- A number literal is defined as in problem 1 of Exercise 1. Update 2008-02-17: signs are no longer allowed on number literals. This avoids ambiguity: 2+3 must be two number literals separated by a PLUS, instead of two successive number literals (i.e. 2 and +3).
- A string literal is defined to be a sequence of zero or more non-tick characters, enclosed in ticks (i.e. '[^']*').
- A function call is an identifier followed by an open parenthesis ( followed by an optional argument list followed by a close parenthesis ).
- An optional argument list is either empty, a single simple expression, or a sequence of simple expressions separated by commas.
- A compound expression is one of the following:
  - A while statement is a while keyword followed by an open parenthesis ( followed by a simple expression followed by a close parenthesis ) followed by a block.
  - An if statement is an if keyword followed by an open parenthesis ( followed by a simple expression followed by a close parenthesis ) followed by a block. The block can be optionally followed by a sequence of elseif statements, which are an elseif keyword, a simple expression (updated:) enclosed in parentheses, and a block. At the end, the entire thing can be optionally followed by an else statement, which is an else keyword and a block.
  - A block is an open brace {, an inner block, and a close brace }. An inner block is an optional list of declarations followed by a required list of statements. (Inner blocks do not count as compound expressions.)
- A program, which is the starting production, is the same as an inner block.
- A comment is everything from // to the end of the line. The scanner should ignore comments; there should be no token for a comment.
In general, use common sense about whitespace. At a minimum, ensure that your parser works with the example programs below. Note that I've eliminated most non-essential whitespace to provide a more difficult test.
- A simple average of two numbers:
    int a;
    float b;
    a=read();
    b=read();
    sum=a+b;//Note that sum wasn't declared. Syntactically, this is OK.
    print(sum/2);
- An integer tester:
    int a;
    a=read();
    if(a<0){
    print('negative');
    }elseif(a==0){
    print('zero');
    }else{
    print('positive');
    }
    print('done',' with',' that');
- Factorial:
    int foo;
    int product;
    foo = read();
    product = 1;
    while (foo > 1) {
    product = product * foo;
    foo = foo - 1;
    }
    print(foo);
Step 1: Build a grammar
First, create a grammar that expresses this language. Then, eliminate left recursion and common prefixes to get your grammar in LL(1) form. In your program1.py file, include a global grammar variable that contains the entire grammar you used. (I recommend using triple-quoted strings for this.)
Note: beware the hidden common prefixes, such as imperative := assignment | function_call (both of which start with an identifier).
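The hidden common prefix mentioned in the note can be removed by left factoring. The sketch below is one possible shape, not the reference grammar: `imperative_tail` is a made-up nonterminal, the token names `ID`, `ASSIGN`, `LPAREN`, `RPAREN`, and `SEMI` are assumptions, and `first_symbols` is a hypothetical helper that only compares the literal first symbol of each alternative (it does not compute full FIRST sets).

```python
# Before: both alternatives begin with an identifier, so one token of
# lookahead cannot choose between them.
#   imperative    := assignment SEMI | function_call SEMI
#   assignment    := ID ASSIGN simple_expr
#   function_call := ID LPAREN opt_args RPAREN

# After left factoring: consume the shared ID first, then branch on the
# next token (ASSIGN or LPAREN).
grammar_fragment = """
imperative      := ID imperative_tail SEMI
imperative_tail := ASSIGN simple_expr
                 | LPAREN opt_args RPAREN
"""


def first_symbols(fragment):
    """Map each nonterminal to the first symbol of each of its alternatives.

    Assumes one alternative per line, with continuation lines starting with '|'.
    """
    firsts = {}
    current = None
    for line in fragment.strip().splitlines():
        line = line.strip()
        if ":=" in line:
            current, rhs = [part.strip() for part in line.split(":=")]
        else:
            rhs = line.lstrip("|").strip()
        firsts.setdefault(current, []).append(rhs.split()[0])
    return firsts
```

If every list returned by `first_symbols` has no repeated entries, no production in the fragment has an obvious common prefix. Prefixes hidden behind another nonterminal, as in the "before" version, still have to be found by expanding that nonterminal.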
Points: 30
Par: 20 minutes; 33 productions
Step 2: Code a scanner
Then, identify the set of terminal tokens used by your grammar and create a scanner to convert the original character stream into a token stream. Call your primary scanning function scan(filename), which opens and reads the code stored in filename and returns a list of tokens. For a token, use the following class. (You may add to this class, but support the following at a minimum.)
class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n
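One common way to build such a scanner is a single master regular expression with one named group per token type. The following is a minimal sketch under several assumptions, not the reference solution: only `NUM`, `PLUS`, and `TIMES` appear as token types in this handout, so every other type name (including `STRLIT` for string literals, chosen so it does not collide with the `string` keyword) is invented here; `scan_text` is a hypothetical helper; and the `NUM` pattern is a stand-in, since the real definition lives in Exercise 1.

```python
import re


class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n


KEYWORDS = {"int", "float", "string", "while", "if", "elseif", "else"}

# Order matters: COMMENT precedes DIVIDE, and two-character operators
# precede their one-character prefixes.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),
    ("SPACE", r"\s+"),
    ("NUM", r"[0-9]+(?:\.[0-9]+)?"),  # stand-in, NOT the Exercise 1 definition
    ("STRLIT", r"'[^']*'"),
    ("ID", r"[a-z]+"),
    ("EQ", r"=="), ("NE", r"!="), ("LE", r"<="), ("GE", r">="),
    ("OR", r"\|\|"), ("AND", r"&&"),
    ("ASSIGN", r"="), ("LT", r"<"), ("GT", r">"), ("NOT", r"!"),
    ("PLUS", r"\+"), ("MINUS", r"-"), ("TIMES", r"\*"), ("DIVIDE", r"/"),
    ("LPAREN", r"\("), ("RPAREN", r"\)"),
    ("LBRACE", r"\{"), ("RBRACE", r"\}"),
    ("COMMA", r","), ("SEMI", r";"),
]
MASTER_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def scan_text(source):
    """Convert a source string into a list of Token objects."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER_RE.match(source, pos)
        if match is None:
            raise ValueError("unexpected character %r at offset %d"
                             % (source[pos], pos))
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("COMMENT", "SPACE"):
            continue  # comments and whitespace produce no token
        if kind == "ID" and text in KEYWORDS:
            kind = text.upper()  # e.g. 'elseif' becomes type 'ELSEIF'
        elif kind == "STRLIT":
            text = text[1:-1]  # drop the enclosing ticks
        elif kind not in ("ID", "NUM"):
            text = ""  # operators and punctuation carry no name
        tokens.append(Token(kind, text))
    return tokens


def scan(filename):
    """Read the file and return its tokens."""
    with open(filename) as handle:
        return scan_text(handle.read())
```

Because `[a-z]+` is greedy, `elseif` is matched as one word and then reclassified as a keyword, so it is never split into `else` and `if`. Note that this sketch keeps a number's text as a string; converting it to a numeric value is left open.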
Points: 30
Par: 34 minutes; 120 lines of code
+18 minutes and +21 lines of code for bug fix
Step 3: Code a recursive-descent parser
As we learned in class, create a recursive-descent parser for the grammar you created. Call your primary parsing function parse(tokens), where tokens is a list of tokens as scan() could return. parse() should return either an error string in case of a syntax error or a parse tree, where, as we saw in class, a parse tree is simply a nested list.
For example, this parse tree:
[Figure: parse tree for 7*2+4]
is represented by this nested list:
['expr',
    ['expr',
        ['expr', Token('NUM',7)],
        ['op', Token('TIMES','')],
        ['expr', Token('NUM',2)]
    ],
    ['op', Token('PLUS','')],
    ['expr', Token('NUM',4)] ]
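To show how a recursive-descent parser can produce a nested list of this shape, here is a minimal sketch covering only `+ - * /`, parentheses, and number literals. It is an illustration under assumptions rather than the full Pseudo-C parser: the `Parser` class, `ParseError`, and the token types `MINUS`, `DIVIDE`, `LPAREN`, and `RPAREN` are names chosen here, and parentheses are dropped from the tree for brevity. Each precedence level gets its own method, and the loop in `binary` folds operators to the left, which gives left-associativity without left recursion.

```python
class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n


class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Type of the next token, or None at end of input."""
        if self.pos < len(self.tokens):
            return self.tokens[self.pos].type
        return None

    def take(self, expected):
        """Consume and return the next token, which must have the given type."""
        if self.peek() != expected:
            raise ParseError("expected %s but found %s" % (expected, self.peek()))
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def binary(self, operand, operators):
        """Parse operand (op operand)*, nesting the result to the left."""
        tree = operand()
        while self.peek() in operators:
            op = ['op', self.take(self.peek())]
            tree = ['expr', tree, op, operand()]
        return tree

    def expr(self):
        # Loosest-binding level in this sketch: + and -
        return self.binary(self.term, ('PLUS', 'MINUS'))

    def term(self):
        # Tighter-binding level: * and /
        return self.binary(self.factor, ('TIMES', 'DIVIDE'))

    def factor(self):
        if self.peek() == 'LPAREN':
            self.take('LPAREN')
            tree = self.expr()
            self.take('RPAREN')
            return tree  # parentheses only group; they add no node here
        return ['expr', self.take('NUM')]


def parse(tokens):
    """Return a nested-list parse tree, or an error string on a syntax error."""
    parser = Parser(tokens)
    try:
        tree = parser.expr()
        if parser.peek() is not None:
            raise ParseError("unexpected %s" % parser.peek())
        return tree
    except ParseError as error:
        return "syntax error: %s" % error
```

Calling `parse` on the five tokens for 7*2+4 yields a list whose first child is the subtree for 7*2, matching the nesting shown above.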
You might find it useful to "pretty-print" your parse tree, for debugging purposes, using the function below. You need not use this function in your program; I provide it merely to help you. It assumes a Token.tostr() method.
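def printParseTree(p, tabstop=0):
    # Print a nested-list parse tree, one node per line, indented by depth.
    # tabstop is the current nesting depth (the default of 0 is assumed).
    if isinstance(p, list):
        if len(p) == 0:
            # An empty list is printed as 'e'.
            print('. '*tabstop + 'e')
            return
        print('. '*tabstop + p[0])
        for el in p[1:]:
            printParseTree(el, tabstop+1)
    elif isinstance(p, Token):
        print('. '*tabstop + p.tostr())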
Points: 40
Par: 64 minutes, including grammar bug fixes; 189 lines of code
Bonus
See the first Pseudo-C example program above. Note how sum was used without being declared. In C, this is a compile-time error, even though this rule is not encoded into the grammar itself. If you implement the rule that identifiers cannot be used unless they are declared, you will get 10 bonus points. If the parse() function encounters such an error, print a warning but continue to build and return the parse tree (assuming the syntax is valid). Since there is no syntax for declaring functions, you may assume that functions do not need to be declared before they are called.
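One way to approach the bonus is to track declared names in a set and check each identifier as it is used. The sketch below works directly on a token list to stay short. It is an assumption-laden illustration: `undeclared_identifiers` is a hypothetical helper, the token types `ID`, `INT`, `FLOAT`, `STRING`, and `LPAREN` are assumed names, and all declarations are treated as sharing one scope, since the handout does not define scoping rules for blocks.

```python
class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n


TYPE_KEYWORDS = ('INT', 'FLOAT', 'STRING')  # assumed token types for int, float, string


def undeclared_identifiers(tokens):
    """Return the names of variables that are used before being declared."""
    declared = set()
    offenders = []
    for i, token in enumerate(tokens):
        if token.type != 'ID':
            continue
        prev_type = tokens[i - 1].type if i > 0 else None
        next_type = tokens[i + 1].type if i + 1 < len(tokens) else None
        if prev_type in TYPE_KEYWORDS:
            declared.add(token.name)  # declaration: type keyword, then ID
        elif next_type == 'LPAREN':
            continue  # function call: functions need no declaration
        elif token.name not in declared and token.name not in offenders:
            offenders.append(token.name)
    return offenders
```

Inside `parse()`, each name returned could be reported with a printed warning while the parse tree is still built and returned as usual.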
Points: 10 bonus
Par: 14 minutes, ~10 lines of code

