
COMP 524: Programming Language Concepts

Spring, 2008
Jeff Terrell
jsterrel AT cs.unc.edu
(919) 962-1791 (office: Sitterson 138)

COMP 524 Program 1 (Parser)

General Instructions

First, review the assignment submission policy in the syllabus. Note that no collaboration is allowed.

This assignment is due at 11:59pm on Tuesday, February 19th. Submit assignments to me via email. All of your code should be included in a file called 'program1.py', and 'program1.py' should be the only thing that you turn in. When I am finished grading your assignment, I will email you your grade and any comments that I have. There are a total of 100 points and 10 bonus points.

Remember: start early! I guarantee my availability during office hours, but not at 11pm the night it is due.

When you turn in program1.py, there should be no statements at the top level that produce output or require input. Merely provide the required functions and anything the functions need--do not include any code related to your testing or debugging. When I import your functions into another script, I want the import to be silently successful. Thank you.

Update 2008-02-02 8:02pm - I have extended the deadline 1 week.

Update 2008-02-02 7:35pm - Eric G. pointed out several problems with the assignment, which have been fixed. First, I had an invalid identifier (containing underscores and digits) in my 3rd example. Also, there was a bug with the definition of a statement, which should not allow a declaration. Declarations should be separate from statements, since they should come before the other types of statements in a program/inner block.

Update 2008-01-30 11:50pm - Please make sure you comment out your print statements before you turn this in.

My solutions are now available.

Overview

You are to write a parser for a C-like language, in Python. This assignment consists of three parts: write and simplify a grammar defining the language, write a scanner, and write a parser.

Language Description

Let's call our language Pseudo-C. Pseudo-C supports a subset of C's syntax, with a few simplifying reductions. The following constructs are supported by Pseudo-C:

In general, use common sense about whitespace. At a minimum, ensure that your parser works with the example programs below. Note that I've eliminated most non-essential whitespace to provide a more difficult test.

Step 1: Build a grammar

First, create a grammar that expresses this language. Then, eliminate left recursion and common prefixes to get your grammar in LL(1) form. In your program1.py file, include a global grammar variable that contains the entire grammar you used. (I recommend using triple-quoted strings for this.)
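To illustrate the left-recursion step, here is a minimal sketch that rewrites a single production with immediate left recursion. The function name, the `_tail` naming convention, and the use of an empty list for epsilon are my own assumptions for illustration; it does not handle indirect left recursion.

```python
def eliminate_left_recursion(name, alternatives):
    """Rewrite  A := A x | y  as  A := y A_tail  and  A_tail := x A_tail | epsilon.

    `alternatives` is a list of right-hand sides, each a list of symbols.
    Returns a dict mapping each nonterminal to its new alternatives.
    Only immediate left recursion is handled.
    """
    tail = name + '_tail'   # hypothetical naming convention
    # Alternatives that start with the nonterminal itself, minus that first symbol.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    others = [alt for alt in alternatives if not alt or alt[0] != name]
    if not recursive:
        return {name: alternatives}
    return {
        name: [alt + [tail] for alt in others],
        tail: [alt + [tail] for alt in recursive] + [[]],   # [] stands for epsilon
    }

# expr := expr PLUS term | term
result = eliminate_left_recursion('expr', [['expr', 'PLUS', 'term'], ['term']])
# result['expr']      is [['term', 'expr_tail']]
# result['expr_tail'] is [['PLUS', 'term', 'expr_tail'], []]
```

You still have to write the final grammar out by hand in your grammar variable; a helper like this is only a way to check your work.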

Note: beware the hidden common prefixes, such as imperative := assignment | function_call (both of which start with an identifier).
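One way to catch a hidden common prefix is to compare the terminals that each alternative can start with. The sketch below does that for a toy grammar. The token names (ID, ASSIGN, and so on) and the right-hand sides are hypothetical, and the first() helper assumes there are no epsilon productions and no left recursion, so treat it as an illustration rather than a general tool.

```python
def first(symbol, grammar):
    """Return the set of terminals that can begin `symbol`.

    Assumes no epsilon productions and no left recursion.
    """
    if symbol not in grammar:          # anything without a production is a terminal
        return {symbol}
    result = set()
    for alt in grammar[symbol]:
        result |= first(alt[0], grammar)
    return result

def prefix_conflicts(grammar):
    """List the nonterminals with two alternatives that can start with the same terminal."""
    conflicts = []
    for name in sorted(grammar):
        seen = set()
        for alt in grammar[name]:
            starts = first(alt[0], grammar)
            if starts & seen:
                conflicts.append(name)
                break
            seen |= starts
    return conflicts

# Hypothetical productions, for illustration only.
before = {
    'imperative':    [['assignment'], ['function_call']],
    'assignment':    [['ID', 'ASSIGN', 'NUM', 'SEMI']],
    'function_call': [['ID', 'LPAREN', 'RPAREN', 'SEMI']],
}
# The same language after pulling the shared ID out in front.
after = {
    'imperative':      [['ID', 'imperative_tail']],
    'imperative_tail': [['ASSIGN', 'NUM', 'SEMI'], ['LPAREN', 'RPAREN', 'SEMI']],
}
# prefix_conflicts(before) is ['imperative']; prefix_conflicts(after) is []
```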

Points: 30

Par: 20 minutes; 33 productions

Step 2: Code a scanner

Then, identify the set of terminal tokens used by your grammar and create a scanner to convert the original character stream into a token stream. Call your primary scanning function scan(filename), which opens and reads the code stored in filename and returns a list of tokens. For a token, use the following class. (You may add to this class, but support the following at a minimum.)

class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n
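As a rough sketch of how a scanner might produce these tokens, the code below matches one regular expression per token type. The token types, the letters-only identifier pattern, and the scan_string() helper are assumptions for illustration; your own set of terminals depends on your grammar.

```python
import re

class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n

# Hypothetical token types; the real set depends on the terminals in your grammar.
TOKEN_SPEC = [
    ('NUM',    r'\d+'),
    ('ID',     r'[A-Za-z]+'),
    ('ASSIGN', r'='),
    ('PLUS',   r'\+'),
    ('TIMES',  r'\*'),
    ('SEMI',   r';'),
    ('SKIP',   r'\s+'),
]
MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def scan_string(source):
    """Convert a string of source code into a list of Token objects."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError('unexpected character %r at offset %d'
                             % (source[pos], pos))
        pos = m.end()
        kind = m.lastgroup             # name of the pattern that matched
        if kind == 'SKIP':             # whitespace produces no token
            continue
        text = m.group()
        if kind == 'NUM':
            tokens.append(Token(kind, int(text)))
        elif kind == 'ID':
            tokens.append(Token(kind, text))
        else:
            tokens.append(Token(kind, ''))
    return tokens

def scan(filename):
    """Read the code stored in filename and return its list of tokens."""
    with open(filename) as f:
        return scan_string(f.read())
```

Splitting the work this way keeps the file handling in scan(filename) and lets you exercise the tokenizing logic on short strings such as 'x=7+2;'.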

Points: 30

Par: 34 minutes; 120 lines of code
+18 minutes and +21 lines of code for bug fix

Step 3: Code a recursive-descent parser

As we learned in class, create a recursive-descent parser for the grammar you created. Call your primary parsing function parse(tokens), where tokens is a list of tokens such as scan() would return. parse() should return a parse tree or, in case of a syntax error, an error string. As we saw in class, a parse tree is simply a nested list.

For example, this parse tree:
[Figure: a parse tree]
is represented by this nested list:

['expr',
  ['expr',
    ['expr', Token('NUM',7)],
    ['op', Token('TIMES','')],
    ['expr', Token('NUM',2)]
  ],
  ['op', Token('PLUS','')],
  ['expr', Token('NUM',4)] ]
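To show the overall shape of a recursive-descent parser that returns either a nested list or an error string, here is a minimal sketch for a toy LL(1) grammar (expr := term expr_tail; expr_tail := PLUS term expr_tail | epsilon; term := NUM). The grammar, the ParseError class, and the use of [] for the epsilon alternative are assumptions for illustration. Note that an LL(1) grammar yields a differently shaped tree than the left-recursive one pictured above.

```python
class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n

class ParseError(Exception):
    """Raised internally on a syntax error; parse() turns it into a string."""
    pass

def parse(tokens):
    """Return a nested-list parse tree, or an error string on a syntax error."""
    pos = [0]   # one-element list so the nested functions can update it

    def peek():
        return tokens[pos[0]].type if pos[0] < len(tokens) else 'EOF'

    def match(expected):
        if peek() != expected:
            raise ParseError('expected %s but found %s at token %d'
                             % (expected, peek(), pos[0]))
        tok = tokens[pos[0]]
        pos[0] += 1
        return tok

    # One function per nonterminal.
    def expr():
        return ['expr', term(), expr_tail()]

    def expr_tail():
        if peek() == 'PLUS':
            return ['expr_tail', match('PLUS'), term(), expr_tail()]
        return ['expr_tail', []]          # [] marks the epsilon alternative

    def term():
        return ['term', match('NUM')]

    try:
        tree = expr()
        if peek() != 'EOF':
            raise ParseError('unexpected %s at token %d' % (peek(), pos[0]))
    except ParseError as e:
        return str(e)
    return tree

good = parse([Token('NUM', 7), Token('PLUS', ''), Token('NUM', 2)])
bad = parse([Token('NUM', 7), Token('PLUS', '')])
# good is a nested list starting with 'expr'; bad is an error string
```

Raising an exception internally and converting it to a string in one place keeps the per-nonterminal functions free of error-propagation code.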

You might find it useful to "pretty-print" your parse tree, for debugging purposes, using the function below. You need not use this function in your program; I provide it merely to help you. It assumes a Token.tostr() method.

def printParseTree(p, tabstop=0):
    # Print one node per line, indenting each level with '.   '.
    if isinstance(p, list):
        if len(p) == 0:
            # An empty list (no label, no children) is shown as 'e'.
            print('.   '*tabstop + 'e')
            return
        print('.   '*tabstop + p[0])
        for el in p[1:]:
            printParseTree(el, tabstop+1)
    elif isinstance(p, Token):
        print('.   '*tabstop + p.tostr())
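If you do not already have a Token.tostr() method, something like the following would do. The output format (type, followed by the name in parentheses when there is one) is just one possible choice, not a requirement.

```python
class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n

    def tostr(self):
        """Return a short label such as NUM(7) or PLUS."""
        if self.name == '':
            return self.type
        return '%s(%s)' % (self.type, self.name)
```

With this method, printParseTree(['expr', Token('NUM',7)]) prints expr on the first line and .   NUM(7) on the second.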

Points: 40

Par: 64 minutes, including grammar bug fixes; 189 lines of code

Bonus

See the first Pseudo-C example program above. Note how sum was used without being declared. In C, this is a compile-time error, even though this rule is not encoded into the grammar itself. If you implement the rule that identifiers cannot be used unless they are declared, you will get 10 bonus points. If the parse() function encounters such an error, print a warning but continue to build and return the parse tree (assuming the syntax is valid). Since there is no syntax for declaring functions, you may assume that functions do not need to be declared before they are called.
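One possible approach to the bonus is a separate walk over the finished parse tree. The sketch below assumes, hypothetically, that declarations sit directly under a node labeled 'declaration', that a 'function_call' node has the function's name as its first child, that identifiers are ID tokens, and that there is a single scope. Your node labels and tree shape will differ, and you may prefer to do the check while parsing instead.

```python
class Token:
    def __init__(self, t, n):
        self.type = t
        self.name = n

def undeclared_uses(tree):
    """Walk the tree left to right and return a warning for each undeclared use."""
    declared = set()
    warnings = []

    def walk(node, label):
        if isinstance(node, Token):
            if node.type != 'ID':
                return
            if label == 'declaration':
                declared.add(node.name)
            elif node.name not in declared:
                warnings.append('warning: %s used before declaration' % node.name)
            return
        if not node:                      # [] stands for epsilon
            return
        children = node[1:]
        if node[0] == 'function_call':
            children = children[1:]       # functions need not be declared
        for child in children:
            walk(child, node[0])

    walk(tree, None)
    return warnings

# Hypothetical tree for: int x; x=1; sum=x; f(y);
tree = ['program',
    ['declaration', Token('TYPE', 'int'), Token('ID', 'x'), Token('SEMI', '')],
    ['assignment', Token('ID', 'x'), Token('ASSIGN', ''),
        ['expr', Token('NUM', 1)], Token('SEMI', '')],
    ['assignment', Token('ID', 'sum'), Token('ASSIGN', ''),
        ['expr', Token('ID', 'x')], Token('SEMI', '')],
    ['function_call', Token('ID', 'f'), Token('LPAREN', ''),
        ['expr', Token('ID', 'y')], Token('RPAREN', ''), Token('SEMI', '')],
]
found = undeclared_uses(tree)
# found has one warning for sum and one for y; x and f produce none
```

parse() would print each warning and then return the tree as usual, since an undeclared identifier is not a syntax error.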

Points: 10 bonus

Par: 14 minutes, ~10 lines of code

Last modified: 03/05/08 15:46:45