View Source of program1.php
In the interest of open source, and in the interest of keeping Mr. Webmaster honest, here is the PHP source of program1.php. (This encourages him to write clean, easy-to-understand code...well, relatively clean and easy-to-understand, that is.) If you become interested in PHP, you can read more about it at PHP: Hypertext Preprocessor. You will also see plenty of HTML code, which is what makes the World Wide Web go 'round. You can read more about HTML (and CSS, another technology used on this site) here, at the official page of the very official World Wide Web Consortium (W3C). Anyway, enough rambling, here's the code!
<?php include("header1.php"); ?>
<?php $THIS_FILE = "program1.php"; ?>
<title>COMP 524-S08 - Program 1 (Parser)</title>
<link rel="stylesheet" href="local.css" type="text/css" charset="iso-8859-1" title="Default" />
<?php include("header2.php"); ?>
<?php include("localheader.php"); ?>
<h1>COMP 524 Program 1 (Parser)</h1>
<h2>General Instructions</h2>
<p>First, review the assignment submission policy in the <a href="syllabus.php">syllabus</a>. Note that there is <strong>no collaboration</strong> allowed.</p>
<p>This assignment is <strong>due at 11:59pm on Tuesday, February 19<sup>th</sup></strong>. Submit assignments to me via email. All of your code should be included in a file called 'program1.py', and 'program1.py' should be the only thing that you turn in. When I am finished grading your assignment, I will email you your grade and any comments that I have. There are a total of 100 points and 10 bonus points.</p>
<p>Remember: <strong>start early</strong>! I guarantee my availability during office hours, but not at 11pm the night it is due.</p>
<p>When you turn in program1.py, there should be no statements at the top level that produce output or require input. Merely provide the required functions and anything the functions need--do not include any code related to your testing or debugging. When I import your functions into another script, I want the import to be silently successful. Thank you.</p>
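<p>As a rough illustration of what "silently successful" means, the sketch below keeps all ad-hoc checks behind a <code>__main__</code> guard. The function bodies are placeholders and <code>_self_test</code> is a hypothetical helper, not part of the assignment. Because the instructions above ask for no testing code at all, a guard like this is best treated as a development aid to delete before submitting.</p>

```python
# program1.py (sketch): only definitions run at import time.

def scan(filename):
    """Placeholder for the required scanner; returns a list of tokens."""
    return []

def parse(tokens):
    """Placeholder for the required parser; returns a parse tree or an error string."""
    return []

def _self_test():
    # Hypothetical helper: ad-hoc checks live here, not at the top level.
    print(parse([]))

if __name__ == "__main__":
    # Runs only when the file is executed directly, never on import.
    _self_test()
```

<p>With this layout, <code>import program1</code> produces no output and asks for no input.</p>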
<p><strong>Update 2008-02-02 8:02pm</strong> - I have extended the deadline 1 week.</p>
<p><strong>Update 2008-02-02 7:35pm</strong> - Eric G. pointed out several problems with the assignment, which have been fixed. First, I had an invalid identifier (containing underscores and digits) in my 3rd example. Also, there was a bug with the definition of a statement, which should <strong>not</strong> allow a declaration. Declarations should be separate from statements, since they should come before the other types of statements in a program/inner block.</p>
<p><strong>Update 2008-01-30 11:50pm</strong> - Please make sure you comment out your print statements before you turn this in.</p>
<p><a href="program1.py">My solutions</a> are now available.</p>
<h3>Overview</h3>
<p>You are to write a parser for a C-like language, in Python. This assignment consists of three parts: write and simplify a grammar defining the language, write a scanner, and write a parser.</p>
<h3>Language Description</h3>
<p>Let's call our language Pseudo-C. Pseudo-C supports a subset of C's syntax, with a few simplifying reductions. The following constructs are supported by Pseudo-C:</p>
<ul>
<li><strong>Identifiers</strong> are defined to be a sequence of one or more lower-case letters (i.e. [a-z]+), with the exception of predefined keywords. (Note that identifiers, number literals, string literals, etc. are <em>tokens</em>, or <em>terminal symbols</em>. Their definitions will not be in the grammar, but rather they are defined by the scanner.)</li>
<li>Several <strong>operators</strong> are defined: <code>+</code>, <code>-</code>, <code>*</code>, <code>/</code>, <code>==</code> (equality), <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;</code>, <code>&gt;=</code>, <code>!=</code> (inequality), <code>||</code> (logical or), <code>&amp;&amp;</code> (logical and), and <code>!</code> (logical not). Parentheses can be used for grouping and to override precedence.</li>
<li>A <strong>statement</strong> is either an imperative or a compound expression.</li>
<li>A <strong>declaration</strong> is a type keyword followed by an identifier followed by a semicolon.</li>
<li>Valid <strong>type keywords</strong> are: int, float, string.</li>
<li>An <strong>imperative</strong> is either an assignment followed by a semicolon or a function call followed by a semicolon.</li>
<li>An <strong>assignment</strong> is an identifier followed by <code>=</code> followed by a simple expression.</li>
<li>A <strong>simple expression</strong> is a standard arithmetic combination of identifiers, number literals, string literals, function calls, and operators. Note that although certain combinations involving string literals might not make sense (e.g. <code>'abc' / 3</code>), they are syntactically valid. The not operator <code>!</code> must be followed by a parenthesized expression (e.g. <code>!(a&lt;10)</code>), so its precedence level need not be defined. Operators with the same precedence are <em>left-associative</em>, e.g. <code>2+3+4</code> is <code>(2+3)+4</code>. The precedence of the remaining operators is as follows, with operators higher on the list binding tighter:
<ol>
<li><code>* /</code></li>
<li><code>+ -</code></li>
<li><code>== != &lt; &lt;= &gt; &gt;=</code></li>
<li><code>&amp;&amp; ||</code></li>
</ol>
</li>
<li>A <strong>number literal</strong> is defined as in problem 1 of <a href="exercise1.php">Exercise 1</a>. <strong>Update 2008-02-17:</strong> signs are no longer allowed on number literals. This avoids ambiguity: <code>2+3</code> must be two number literals separated by a PLUS, instead of two successive number literals (i.e. <code>2</code> and <code>+3</code>).</li>
<li>A <strong>string literal</strong> is defined to be a sequence of zero or more non-tick characters, enclosed in ticks (i.e. '[^']*').</li>
<li>A <strong>function call</strong> is an identifier followed by an open parenthesis <code>(</code> followed by an optional argument list followed by a close parenthesis <code>)</code>.</li>
<li>An <strong>optional argument list</strong> is either empty, a single simple expression, or a sequence of simple expressions separated by commas.</li>
<li>A <strong>compound expression</strong> is one of the following:
<ul>
<li>A <strong>while statement</strong> is a <code>while</code> keyword followed by an open parenthesis <code>(</code> followed by a simple expression followed by a close parenthesis <code>)</code> followed by a block.</li>
<li>An <strong>if statement</strong> is an <code>if</code> keyword followed by an open parenthesis <code>(</code> followed by a simple expression followed by a close parenthesis <code>)</code> followed by a block. The block can optionally be followed by a sequence of elseif statements, each of which is an <code>elseif</code> keyword, a simple expression (<strong>updated:</strong>) enclosed in parentheses, and a block. Finally, the whole construct can optionally end with an else statement, which is an <code>else</code> keyword and a block.</li>
<li>A <strong>block</strong> is an open brace <code>{</code>, an inner block, and a close brace <code>}</code>. An <strong>inner block</strong> is an optional list of declarations followed by a required list of statements. (Inner blocks do not count as compound expressions.)</li>
</ul>
</li>
<li>A <strong>program</strong>, which is the starting production, is the same as an inner block.</li>
<li>A <strong>comment</strong> is everything from <code>//</code> to the end of the line. The scanner should ignore comments; there should be no token for a comment.</li>
</ul>
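<p>The lexical definitions above translate almost directly into regular expressions. The following is a minimal sketch, not a full scanner: the identifier and string-literal patterns are the ones given in the list, the keyword set collects the keywords the description names, and the helper names (<code>is_identifier</code>, <code>strip_comments</code>) are made up for illustration.</p>

```python
import re

# Patterns taken from the lexical definitions above.
IDENTIFIER = re.compile(r"[a-z]+")
STRING_LITERAL = re.compile(r"'[^']*'")
COMMENT = re.compile(r"//[^\n]*")

# Keywords named in the description; an identifier may not be one of these.
KEYWORDS = {"int", "float", "string", "while", "if", "elseif", "else"}

def is_identifier(text):
    """True if text is all lower-case letters and not a keyword."""
    return IDENTIFIER.fullmatch(text) is not None and text not in KEYWORDS

def strip_comments(source):
    """Remove every // comment, leaving the newline in place.

    Simplification: a // inside a string literal would also be removed,
    so a real scanner should recognize string literals first.
    """
    return COMMENT.sub("", source)
```

<p>For instance, <code>is_identifier("sum")</code> is true, while <code>is_identifier("while")</code> and <code>is_identifier("foo_1")</code> are both false.</p>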
<p>In general, use common sense about <strong>whitespace</strong>. At a minimum, ensure that your parser works with the example programs below. Note that I've eliminated most non-essential whitespace to provide a more difficult test.</p>
<ul>
<li>A simple average of two numbers:
<div class="code">
int a;<br/>
float b;<br/>
a=read();<br/>
b=read();<br/>
sum=a+b;//Note that sum wasn't declared. Syntactically, this is OK.<br/>
print(sum/2);
</div>
</li>
<li>An integer tester:
<div class="code">
int a;<br/>
a=read();<br/>
if(a&lt;0){<br/>
print('negative');<br/>
}elseif(a==0){<br/>
print('zero');<br/>
}else{<br/>
print('positive');<br/>
}<br/>
print('done',' with',' that');
</div>
</li>
<li>Factorial:
<div class="code">
int foo;<br/>
int product;<br/>
foo = read();<br/>
product = 1;<br/>
while (foo > 1) {<br/>
product = product * foo;<br/>
foo = foo - 1;<br/>
}<br/>
print(product);
</div>
</li>
</ul>
<h3>Step 1: Build a grammar</h3>
<p>First, create a grammar that expresses this language. Then, eliminate left recursion and common prefixes to get your grammar in LL(1) form. In your program1.py file, include a global <code>grammar</code> variable that contains the entire grammar you used. (I recommend using <a href="http://www.python.org/doc/1.5.1p1/tut/strings.html">triple-quoted strings</a> for this.)</p>
<p>Note: beware the hidden common prefixes, such as <code>imperative := assignment | function_call</code> (both of which start with an identifier).</p>
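<p>To make the two transformations concrete, here is a toy illustration stored in triple-quoted strings. It is deliberately not the Pseudo-C grammar: the notation (<code>:=</code>, <code>|</code>, and <code>e</code> for the empty production) and all token and production names are assumptions, and <code>directly_left_recursive</code> is a hypothetical helper that only understands one production per line.</p>

```python
# A toy example only, not the Pseudo-C grammar.

# Left-recursive: a recursive-descent parser would recurse forever on "sum".
left_recursive = """
sum := sum PLUS NUM | NUM
"""

# Same language with the left recursion removed.
right_recursive = """
sum      := NUM sum_tail
sum_tail := PLUS NUM sum_tail | e
"""

# Hidden common prefix: both alternatives begin with ID.
common_prefix = """
imperative    := assignment SEMI | function_call SEMI
assignment    := ID ASSIGN simple_expr
function_call := ID LPAREN opt_args RPAREN
"""

# Left-factored: consume ID once, then decide on the next token.
left_factored = """
imperative      := ID imperative_rest SEMI
imperative_rest := ASSIGN simple_expr | LPAREN opt_args RPAREN
"""

def directly_left_recursive(grammar):
    """Return names of productions with an alternative that starts with itself."""
    found = []
    for line in grammar.strip().splitlines():
        head, body = line.split(":=")
        name = head.strip()
        for alternative in body.split("|"):
            symbols = alternative.split()
            if symbols and symbols[0] == name:
                found.append(name)
                break
    return found
```

<p>Running the helper on <code>left_recursive</code> reports <code>sum</code>; on the rewritten versions it reports nothing. Note that a function call can also appear inside a simple expression, so a full grammar would still need a production for it there.</p>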
<p>Points: 30</p>
<p>Par: 20 minutes; 33 productions</p>
<h3>Step 2: Code a scanner</h3>
<p>Then, identify the set of terminal tokens used by your grammar and create a scanner to convert the original character stream into a token stream. Call your primary scanning function <code>scan(filename)</code>, which opens and reads the code stored in <code>filename</code> and returns a list of tokens. For a token, use the following class. (You may add to this class, but support the following at a minimum.)</p>
<div class="code">
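class Token:<br/>
&nbsp;&nbsp;&nbsp;&nbsp;def __init__(self, t, n):<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.type = t<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.name = n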
</div>
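<p>One possible shape for the scanner is a single master regular expression built from named groups. The sketch below covers only a handful of tokens and works on a string rather than a file. Apart from <code>NUM</code>, <code>PLUS</code> and <code>TIMES</code>, the token type names are assumptions, as are the function name <code>scan_text</code>, the integer-only number pattern, and the choice to lump all keywords into one <code>KEYWORD</code> type.</p>

```python
import re

class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n

KEYWORDS = {"int", "float", "string", "while", "if", "elseif", "else"}

# Order matters: "==" must be tried before "=".
TOKEN_SPEC = [
    ("SKIP",   r"[ \t\r\n]+|//[^\n]*"),  # whitespace and comments
    ("STRING", r"'[^']*'"),
    ("NUM",    r"[0-9]+"),               # assumed simplification: integers only
    ("ID",     r"[a-z]+"),
    ("EQ",     r"=="),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMI",   r";"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def scan_text(source):
    """Convert a source string into a list of Token objects."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError("unexpected character %r at offset %d"
                             % (source[pos], pos))
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":
            continue                      # no token for whitespace or comments
        if kind == "NUM":
            tokens.append(Token("NUM", int(text)))
        elif kind == "STRING":
            tokens.append(Token("STRING", text[1:-1]))  # drop the ticks
        elif kind == "ID":
            kind = "KEYWORD" if text in KEYWORDS else "ID"
            tokens.append(Token(kind, text))
        else:
            tokens.append(Token(kind, ""))
    return tokens
```

<p>The required <code>scan(filename)</code> could then read the file and hand its contents to a function like this one. The remaining operators, braces and commas would need entries of their own in the table.</p>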
<p>Points: 30</p>
<p>Par: 34 minutes; 120 lines of code<br/>+18 minutes and +21 lines of code for bug fix</p>
<h3>Step 3: Code a recursive-descent parser</h3>
<p>Using the technique we learned <a href="notes/notes-0129.html">in class</a>, create a recursive-descent parser for the grammar you created. Call your primary parsing function <code>parse(tokens)</code>, where <code>tokens</code> is a list of tokens as returned by <code>scan()</code>. <code>parse()</code> should return a parse tree or, in case of a syntax error, an error string. As we saw in class, a parse tree is simply a nested list.</p>
<p>For example, this parse tree:<br/><img src="notes/figs/0124.fig03.png" alt="a parse tree" /><br/>is represented by this nested list:</p>
<div class="code">
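['expr',<br/>
&nbsp;&nbsp;['expr',<br/>
&nbsp;&nbsp;&nbsp;&nbsp;['expr', Token('NUM',7)],<br/>
&nbsp;&nbsp;&nbsp;&nbsp;['op', Token('TIMES','')],<br/>
&nbsp;&nbsp;&nbsp;&nbsp;['expr', Token('NUM',2)]<br/>
&nbsp;&nbsp;],<br/>
&nbsp;&nbsp;['op', Token('PLUS','')],<br/>
&nbsp;&nbsp;['expr', Token('NUM',4)]<br/>
]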
</div>
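<p>To show how such a nested list can come out of a recursive-descent parser, here is a small sketch that handles only <code>NUM</code>, <code>PLUS</code> and <code>TIMES</code> tokens. It is not the Pseudo-C parser: the function name <code>parse_expr</code>, the <code>ParseError</code> class and the use of loops to get left-associative nesting are choices made for this illustration.</p>

```python
class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n

class ParseError(Exception):
    """Hypothetical exception used internally to unwind on a syntax error."""

def parse_expr(tokens):
    """Parse NUM ((PLUS | TIMES) NUM)*, with TIMES binding tighter than PLUS.

    Returns a nested list on success or an error string on failure.
    """
    pos = [0]  # mutable cursor shared by the helper functions

    def peek():
        return tokens[pos[0]].type if pos[0] < len(tokens) else None

    def advance():
        tok = tokens[pos[0]]
        pos[0] += 1
        return tok

    def factor():
        if peek() != "NUM":
            raise ParseError("expected NUM at token %d" % pos[0])
        return ["expr", advance()]

    def term():
        node = factor()
        while peek() == "TIMES":          # loop gives left-associative nesting
            op = ["op", advance()]
            node = ["expr", node, op, factor()]
        return node

    def expr():
        node = term()
        while peek() == "PLUS":
            op = ["op", advance()]
            node = ["expr", node, op, term()]
        return node

    try:
        tree = expr()
        if peek() is not None:
            raise ParseError("unexpected token at position %d" % pos[0])
        return tree
    except ParseError as err:
        return "syntax error: %s" % err
```

<p>Given the tokens for <code>7*2+4</code>, this returns a list with the same shape as the one shown above. Given <code>1+</code> it returns an error string instead.</p>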
<p>You might find it useful to "pretty-print" your parse tree, for debugging purposes, using the function below. You need not use this function in your program; I provide it merely to help you. It assumes a <code>Token.tostr()</code> method.</p>
<div class="code">
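def printParseTree(p, tabstop=0):<br/>
&nbsp;&nbsp;&nbsp;&nbsp;# A list is an interior node: its label first, then its children.<br/>
&nbsp;&nbsp;&nbsp;&nbsp;if isinstance(p, list):<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if len(p) == 0:<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# An empty list is printed as 'e'.<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print('. '*tabstop + 'e')<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print('. '*tabstop + p[0])<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for el in p[1:]:<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printParseTree(el, tabstop+1)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;elif isinstance(p, Token):<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print('. '*tabstop + p.tostr())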
</div>
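<p>The <code>tostr()</code> method is not specified anywhere on this page, so the version below is only one plausible choice. It assumes a token should print as its type alone when it carries no value, and as <code>TYPE(value)</code> otherwise.</p>

```python
class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n

    def tostr(self):
        # Assumed format: TYPE alone, or TYPE(name) when the token has a value.
        if self.name == '':
            return self.type
        return '%s(%r)' % (self.type, self.name)
```

<p>With this definition, <code>Token('NUM',7).tostr()</code> gives <code>NUM(7)</code> and <code>Token('TIMES','').tostr()</code> gives <code>TIMES</code>.</p>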
<p>Points: 40</p>
<p>Par: 64 minutes, including grammar bug fixes; 189 lines of code</p>
<h3>Bonus</h3>
<p>See the first Pseudo-C example program above. Note how sum was used without being declared. In C, this is a compile-time error, even though this rule is not encoded into the grammar itself. If you implement the rule that identifiers cannot be used unless they are declared, you will get 10 bonus points. If the <code>parse()</code> function encounters such an error, print a warning but continue to build and return the parse tree (assuming the syntax is valid). Since there is no syntax for declaring functions, you may assume that functions do not need to be declared before they are called.</p>
<p>Points: 10 bonus</p>
<p>Par: 14 minutes, ~10 lines of code</p>
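<p>One way to approach the bonus is a separate walk over the finished parse tree. The sketch below makes several assumptions that may not match your grammar: nodes labelled <code>declaration</code> and <code>function_call</code>, an <code>ID</code> token type, the function name as the first child of a call, and a single scope for the whole program. The function name <code>check_declarations</code> is also made up. It collects warnings in a list, which <code>parse()</code> could then print.</p>

```python
class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n

def check_declarations(tree, declared=None, warnings=None):
    """Walk a nested-list parse tree and collect uses of undeclared identifiers."""
    if declared is None:
        declared = set()
    if warnings is None:
        warnings = []
    if isinstance(tree, Token):
        if tree.type == "ID" and tree.name not in declared:
            warnings.append("warning: '%s' used before declaration" % tree.name)
        return warnings
    if not tree:
        return warnings                   # empty production
    label, children = tree[0], tree[1:]
    if label == "declaration":
        # Record the declared name instead of treating it as a use.
        for child in children:
            if isinstance(child, Token) and child.type == "ID":
                declared.add(child.name)
        return warnings
    if label == "function_call":
        # Assumed shape: the first child is the function name, which
        # does not need to be declared.
        children = children[1:]
    for child in children:
        check_declarations(child, declared, warnings)
    return warnings
```

<p>Because declarations come before statements in an inner block, a single left-to-right pass sees each declaration before any use. Treating every block as sharing one scope is a simplification; per-block scoping would need a stack of sets.</p>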
<?php include("footer.php"); ?>

