The Proteus System

Publications

- Revised: Tue Jul 30 16:56:45 1996 by jamesri@cogs.susx.ac.uk

This section provides access to publications associated with the project. A list of paper titles organized by subject follows. To see the abstract (and citation) for a paper, click on its title.

Most abstracts provide access to the full paper by clicking on the icon .

At a minimum, access to the PostScript form of the paper is provided. For papers marked with [*], an HTML form that may be browsed directly is also available.

(Alternatively, publications and other materials can be obtained via anonymous FTP to our ftp site or a direct connection)

Proteus Overview Publications

Proteus Language

Program Transformation and Generation of Parallel Code

Performance Analysis and Prediction

Applications of the Proteus System


Prototyping Parallel and Distributed Programs in Proteus

P. Mills, L. Nyland, J. Prins, J. Reif and R. Wagner. Presented at the Symposium on Parallel and Distributed Processing Dec, 1991

Postscript (75Kbytes)

ABSTRACT:

This paper presents Proteus, an architecture-independent language suitable for prototyping parallel and distributed programs. Proteus is a high-level imperative notation based on sets and sequences with a single construct for the parallel composition of processes. Although a shared-memory model is the basis for communication between processes, this memory can be partitioned into shared and private variables. Parallel processes operate on individual copies of private variables, which are independently updated and may be merged into the shared state at specifiable barrier synchronization points. Several examples are given to illustrate how the various parallel programming models, such as synchronous data-parallelism and asynchronous control-parallelism, can be expressed in terms of this foundation. This common foundation allows prototypes to be tested, evolved and finally implemented through refinement techniques targeting specific architectures.

@inproceedings { proteus-spdp91,
        author = { Peter Mills and Lars Nyland and Jan Prins and John Reif
                and Robert Wagner},
        title = { Prototyping Parallel and Distributed Programs in {Proteus} },
        booktitle = { Proceedings of the Third {IEEE} Symposium on Parallel and
                Distributed Processing},
        address = { Dallas, Texas, Dec.1-5 },
        publisher = { IEEE },
        pages = { 10-19 },
        year = 1991 
}

Prototyping N-body Simulation in Proteus

P. Mills, L. Nyland, J. Prins and J. Reif. Presented at the Sixth International Parallel Processing Symposium, March, 1992

Postscript (51Kbytes)

ABSTRACT:

This paper explores the use of Proteus, an architecture-independent language suitable for prototyping parallel and distributed programs. Proteus is a high-level imperative notation based on sets and sequences with a single construct for the parallel composition of processes communicating through shared memory. Several different parallel algorithms for N-body simulation are presented in Proteus, illustrating how Proteus provides a common foundation for expressing the various parallel programming models. This common foundation allows prototype parallel programs to be tested and evolved without the use of machine-specific languages. To transform prototypes to implementations on specific architectures, program refinement techniques are utilized. Refinement strategies are illustrated that target broad-spectrum parallel intermediate languages, and their viability is demonstrated by refining an N-body algorithm to data-parallel CVL code.

@inproceedings { proteus-ipps92,
        author = { Peter Mills and Lars Nyland and Jan Prins and John Reif},
        title = { Prototyping {N}-body Simulation in {Proteus}},
        booktitle = { Proceedings of the Sixth International Parallel 
                Processing Symposium},
        address = { Beverly Hills, Ca., March 23-26 },
        publisher = { IEEE },
        pages = { 476-482 },
        year = 1992
}

Prototyping High-Performance Parallel Computing Applications in Proteus

by Peter Mills, Lars S. Nyland, Jan F. Prins and John Reif. Presented at The DARPA Software PI Meeting, March, 1992.

Postscript (63Kbytes)

ABSTRACT:

This paper explores the use of Proteus, an architecture-independent language suitable for prototyping time-sensitive parallel and distributed programs. Proteus is a high-level imperative notation based on sets and sequences with succinct yet powerful constructs for the parallel composition of processes communicating through shared memory.

Several different parallel algorithms for N-body simulation in molecular dynamics are presented using Proteus, illustrating how the language provides a common foundation for expressing various parallel programming models. This common foundation supports the construction of high-performance computing applications across a wide range of parallel machines through a development methodology in which prototype parallel programs can be tested and evolved without the use of machine-specific languages.

To transform prototypes to implementations on specific architectures, program refinement techniques are utilized. Refinement strategies are illustrated that target broad-spectrum parallel intermediate languages, and their viability is demonstrated by refining an N-body algorithm to data-parallel intermediate code. Time-sensitive variants of a parallel N-body are also described to illustrate how Proteus allows the expression of resource requirements through real-time constraints as well as progress constraints which abstractly specify the distribution of computational resources.

@inproceedings { proteus-darpa92,
        author = { Peter Mills and Lars Nyland and Jan Prins and John Reif },
        title = { Prototyping High-Performance Parallel Computing 
                Applications in Proteus },
        booktitle = {Proceedings of 1992 DARPA Software Technology Conference},
        address = { April 28-30, 1992 Los Angeles, California, },
        publisher = { Meridian, Arlington, VA },
        pages = { 433-442 },
        year = 1992 
}

Prototyping Parallel Algorithms

Lars S. Nyland and Jan F. Prins. Presented at The Dartmouth Institute for Advanced Graduate Studies on Parallel Computation, June 1992.

Postscript (54Kbytes)

ABSTRACT:

A prototyping approach to the implementation of complex parallel algorithms on actual parallel machines is proposed as a way to bridge the gap between theoretical descriptions of parallel algorithms and the implementation-level details required in current low-level parallel programming languages. Prototypes are constructed using the wide-spectrum parallel programming language Proteus and can be executed using a Proteus interpreter. The wide-spectrum nature of the Proteus language permits key aspects of the implementation to be carefully investigated while other, ancillary, aspects can remain at a convenient level of abstraction. Manual and automatic techniques are under investigation to transform prototypes to low-level programming languages for specific machines. In this paper several different N-body force calculation techniques are used to introduce Proteus, including a preliminary implementation study of the 3-dimensional fast multipole algorithm of Greengard, an algorithm of current interest in the scientific computing community.

@inproceedings{proteus-dags92,
        author =        {Lars S. Nyland and Jan F. Prins},
        title =         {Prototyping Parallel Algorithms},
        booktitle =     {Proc. of the 1992 DAGS/PC Symposium},
        month =         {June},
        year =          {1992},
        address =       {Dartmouth College, Hanover, NH},
        pages =         {31--39},
        editor =        {Donald Johnson, Fillia Makedon, Panagiotis Metaxas},
        organization =  {Dartmouth College}
}

Rate Control as a Language Construct for Parallel and Distributed Programming

Peter Mills, Jan Prins and John Reif. Presented at the IEEE Workshop on Parallel and Distributed Real-Time Systems, April, 1993.

Postscript (53Kbytes)

ABSTRACT:

This paper introduces a new parallel programming language construct, the rate construct, and examines its utility for a variety of problems. The rate construct specifies constraints on the relative rates of progress of tasks executing in parallel, where progress is the amount of computational work as measured by elapsed "ticks" on a local logical clock. By prescribing expected work, the rate construct constrains the allocation of processor-time to tasks needed to achieve that work; in a parallel setting this constrains the distribution of tasks to processors and multiprocessing ratios, effected for example by load balancing. We present definitions of rate and underlying real-time primitives as orthogonal extensions to the architecture-independent parallel programming language Proteus. The utility of the rate construct is evidenced for a variety of problems, including weighted parallel search for a goal, adaptive many-body simulation in which rates abstract the requirements for load-balancing, and variable time-stepped computations in which the use of rates can alter the frequency of asynchronous iterations.

@inproceedings { mills-rates,
        author = { Peter Mills and Jan Prins and John Reif },
        title = { Rate Control as a Language Construct for Parallel and
                Distributed Programming },
        booktitle = {Proceedings of IEEE Workshop on Parallel and Distributed
                Real-Time Systems},
        address = { IPPS'93, Newport Beach, California, April 13-16 },
        publisher = { ONR },
        pages = { 164-170 },
        year = 1993 
}

Transforming High-Level Data-Parallel Programs into Vector Operations

J. Prins and D. Palmer, Presented at the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May, 1993

Postscript (55Kbytes)

ABSTRACT:

Efficient parallel execution of a high-level data-parallel language based on nested sequences, higher-order functions and generalized iterators can be realized in the vector model using a suitable representation of nested sequences and a small set of transformation rules to distribute iterators through the constructs of the language.

@inproceedings { proteus-ppopp93,
        author = { Jan Prins and Daniel Palmer},
        title = { Transforming High-Level Data-Parallel Programs into Vector
                  Operations},
        booktitle = { Proceedings of the Fourth {ACM} SIGPLAN Symposium on
                Principles and Practice of Parallel Programming},
        address = { San Diego, CA., May 19-22 },
        publisher = { ACM },
        pages = { 119-128 },
        year = 1993 
}

Work-Efficient Nested Data-Parallelism

D. Palmer, J. Prins and S. Westfold, Presented at the Fifth Symposium on the Frontiers of Massively Parallel Processing (Frontiers 95), February 1995

Postscript (46Kbytes)

ABSTRACT:

An apply-to-all construct is the key mechanism for expressing data-parallelism, but data-parallel programming languages like HPF and C* significantly restrict which operations can appear in the construct. Allowing arbitrary operations substantially simplifies the expression of irregular and nested data-parallel computations. The technique of flattening nested parallelism introduced by Blelloch, compiles data-parallel programs with unrestricted apply-to-all constructs into vector operations, and has achieved notable success, particularly with irregular data-parallel programs. However, these programs must be carefully constructed so that flattening them does not lead to suboptimal work complexity due to unnecessary replication in index operations. We present new flattening transformations that generate programs with correct work complexity. Because these transformations may introduce concurrent reads in parallel indexing, we developed a randomized indexing that reduces concurrent reads while maintaining work-efficiency. Experimental results show that the new rules and implementations significantly reduce memory usage and improve performance.

@inproceedings { proteus-frontiers95,
        author = { Daniel Palmer, Jan Prins and Stephen Westfold},
        title = { Work-Efficient Nested Data-Parallelism},
        booktitle = { Proceedings of the Fifth Symposium on the
                Frontiers of Massively Parallel Processing (Frontiers 95)},
        publisher = { IEEE },
        year = {1995 }
}

Piecewise Execution of Nested Data-Parallel Programs

D. Palmer, J. Prins, S. Chatterjee, R. Faith, Presented at The Eighth International Workshop on Languages and Compilers for Parallel Computers (LCPC95), August 1995

Postscript (46Kbytes)

ABSTRACT:

The technique of flattening nested data parallelism combines all the independent operations in nested apply-to-all constructs and generates large amounts of potential parallelism for both regular and irregular expressions. However, the resulting data-parallel programs can have enormous memory requirements, limiting their utility. In this paper, we present piecewise execution, an automatic method of partially serializing data-parallel programs so that they achieve maximum parallelism within storage limitations. By computing large intermediate sequences in pieces, our approach requires asymptotically less memory to perform the same amount of work. By using characteristics of the underlying parallel architecture to drive the computation size, we retain effective use of a parallel machine at each step. This dramatically expands the class of nested data-parallel programs that can be executed using the flattening technique. With the addition of piecewise I/O operations, these techniques can be applied to generate out-of-core execution on large datasets.

@inproceedings { lcpc95,
        author = { Daniel Palmer, Jan Prins, Siddhartha Chatterjee, and Rik Faith},
        title = { Piecewise Execution of Nested Data-Parallel Programs},
        booktitle = { Languages and compilers for parallel computing : 8th 
	              International Workshop, Columbus, OH, USA, August 10-12
	1995 Proceedings},
        publisher = { Spring-Verlog },
        year = {1996 }
}

Provably Correct Vectorization of Nested Parallel Programs

J. Riely, J. Prins and P. Iyer Presented at the Second Workshop on Massively Parallel Programming Models, Berlin, October 1995.

Postscript (182Kbytes)

ABSTRACT:

The work/step framework provides a high-level cost model for nested data-parallel programming languages, allowing programmers to understand the efficiency of their codes without concern for the eventual mapping of tasks to processors. Vectorization, or flattening, is the key technique for compiling nested-parallel languages. This paper presents a formal study of vectorization, considering three low-level targets: the EREW, bounded-contention CREW, and CREW variants of the VRAM. For each, we describe a variant of the cost model and prove the correctness of vectorization for that model. The models impose different constraints on the set of programs and implementations that can be considered; we discuss these in detail.

@inproceedings { mppm95,
        author = { James Riely and Jan Prins and P. Iyer},
        title = { Provably Correct Vectorization of Nested Parallel Programs},
        booktitle = { Proceedings of the Second Workshop on Programming Models
                for Massive Parallelism (MPPM95)},
        publisher = { IEEE },
        year = {1995 }
}

A Data-Parallel Implementation of the Adaptive Fast Multipole Algorithm,

by Lars S. Nyland, Jan F. Prins, and John H. Reif. Presented at the Dartmouth Institute for Advanced Graduate Studies on Parallel Computation

Postscript (73Kbytes)

ABSTRACT:

Given an ensemble of n bodies in space whose interaction is governed by a potential function, the N-body problem is to calculate the force on each body in the ensemble that results from its interaction with all other bodies. An efficient algorithm for this problem is critical in the simulation of molecular dynamics, turbulent fluid flow, intergalactic matter and other problems. The fast multipole algorithm (FMA) developed by Greengard approximates the solution with bounded error in time O(n). For non-uniform distributions of bodies, an adaptive variation of the algorithm is required to maintain this time complexity.

The parallel execution of the FMA poses complex implementation issues in the decomposition of the problem over processors to reduce communication. As a result the 3D Adaptive FMA has, to our knowledge, never been implemented on a scalable parallel computer. This paper describes several variations on the parallel adaptive 3D FMA algorithm that are expressed using the data-parallel subset of the high-level parallel prototyping language Proteus. These formulations have implicit parallelism that is executed sequentially using the current Proteus execution system to yield some insight into the performance of the variations. Efforts underway will make it possible to directly generate vector code from the formulations, rendering them executable on a broad class of parallel computers.

@inproceedings{proteus-dags93,
        author =        {Lars S. Nyland and Jan F. Prins and John H. Reif},
        title =         {A Data-Parallel Implementation of the Adaptive Fast Mul
tipole Algorithm},
        booktitle =     {Proceedings of the 1993 DAGS/PC Symposium},
        month =         {June},
        year =          {1993},
        address =       {Dartmouth College, Hanover, NH}
}

Spatial Decompositions For Hierarchical N-Body Simulations

by Lars S. Nyland. Presented at BellCore, July 1993.

There are several algorithms for three-dimensional N-body simulation that rely on the recursive decomposition of the simulation space. Two well-known N-body algorithms are the Fast Multipole Algorithm (FMA) of Greengard and Rohklin and the Barnes-Hut algorithm. The algorithms differ, one is O(n) and the other is O(n log n), and both are viable due to hidden constants. A spatial decompostion method must be chosen for each of these algorithms, and the choice depends on many parameters.

In this talk, I will discuss three well-known decomposition strategies: uniform-depth oct-tree, varying-depth oct-tree, and median-cut (or orthogonal recursive bisection) decomposition. The three strategies will be described, examples will be shown of each over some sample N-body distributions, and then evaluated for use with the Fast Multipole Algorithm. The important parameters for choosing a spatial decomposition for the FMA depend not only on the requirements of the FMA (multipole mathematics, error control), but on the the ability to parallelize the resulting code as well.


The Complexity of N-Body Simulation

John H. Reif and Steven R. Tate. Presented at the International Colloquium on Automata, Languages, and Programming (ICALP'93), July, 1993

@inproceedings { nbody-reif,
        author = { John H. Reif and Steven R. Tate },
        title = { The Complexity of N-Body Simulation },
        booktitle = { Proc. 20th International Colloquium on Automata,
                Languages, and Programming {(ICALP'93)}}
        address = { Lund, Sweden, July 1993 },
        publisher = { Springer-Verlag },
        pages = { 162-176 },
        year = 1993 
}

An Introduction to Proteus, Version 0.9

by Gary Levin and Lars Nyland, August, 1993

DVI File (51Kbytes), or Postscript (130Kbytes)

ABSTRACT:

The current version of Proteus is a variation of Isetl that supports thread and data parallelism. Isetl is an interactive implementation of SETL (SETL was developed at the Courant Institute; See Schwartz, J.T., et al, Programming with sets: An introduction to SETL, Springer-Verlag, 1986. ) a programming language built around mathematical notation and objects, primarily sets and functions. It contains the usual collection of statements common to procedural languages, but a richer set of expressions.

The objects of Proteus include: integers, floating point numbers, funcs (sub-programs), strings, sets, and tuples (finite sequences). The composite objects, sets and tuples, may contain any mixture of Proteus objects, nested to arbitrary depth.

This document is intended for people who have had no previous experience with Proteus, but who are reasonably comfortable with learning a new programming language. Few examples are given here, but many examples are distributed with the software.


The Proteus NS Grammar (DRAFT of TR94-029)

by Rickard E. Faith, Lars Nyland, Dan Palmer, and Jan Prins, April, 1993

Multiple formats: Hypertext or Postscript (37Kbytes)


UnCvL: The University of North Carolina C Vector Library

by Rickard E. Faith, Doug L. Hoffman, and David G. Stahl May, 1993

Postscript (46Kbytes)

ABSTRACT:

This paper describes our efforts in implementing a version of CVL (C Vector Library) for the MasPar MP-1 SIMD computer. CVL is a library of rudimentary vector routines, callable from C, as described by G. Blelloch, et. al. UnCvL is an implementation of the CVL library routines written in MPL, MasPar's parallel version of C. The main motivation for implementing CVL on the MP-1 is to provide support for NESL, a high level, portable parallel programming interface. NESL compiles to an intermediate language called Vcode which, in turn, is implemented using CVL. Our CVL implementation on the MP-1 will also be used in the parallel execution of the Proteus language at the University of North Carolina.

Implemented functions include 195 elementwise operations, 180 scan and reduce operations, 108 permute operations, and 102 miscellaneous and support operations. Of the undocumented functions that we were able to elucidate, only 9 pack, 9 shuffle, and 18 split function have not yet been implemented.

The current version of UnCvL supports execution of NESL and Proteus programs on the MasPar MP-1.


DPL: Data Parallel Library Manual

by Daniel W. Palmer March, 1994

Postscript (41Kbytes)

ABSTRACT:

The Data Parallel Library is a collection of C-callable routines that provide the capability to treat nested sequences as a primitive type. The library is designed to specifically to be used with C as an executable target notation for transformations of Proteus programs. The library is portable so that Proteus can achieve architecture independence, it executes in parallel so that parallel algorithms written in Proteus can attain parallel execution times and it provides a reasonably high-level interface so the transformations can be kept as simple as possible.


The Proteus System for the Development of Parallel Applications

by Allen Goldberg, Jan Prins, John Reif, Rik Faith, Zhiyong Li, Peter Mills, Lars Nyland, Dan Palmer, James Riely, Stephen Westfold, April, 1994 (rev. Aug 1994), not yet published (send mail to Malcolm Harrison for status (mailto:harrison@cs.nyu.edu)).

Postscript (199Kbytes)

ABSTRACT:

In recent years technological advances have made the construction of large-scale parallel computers economically attractive. These machines have the potential to provide fast solutions to computationally demanding problems that arise in computational science, real-time control, computer simulation, large database manipulation, and other areas. However , applications that exploit this performance potential have been slow to appear; such applications have proved exceptionally difficult to develop and, once written, too often fail to deliver the expected speed-up.

This state of affairs may be explained by the proliferation of parallel architectures and the simultaneous lack of effective high-level architecture-independent programming languages. Parallel applications are currently developed using low-level parallel programming notations that reflect specific features of the target architecture (e.g., shared vs. distributed memory , SIMD vs. MIMD, exposed vs. hidden interconnection network). These notations have significant drawbacks:

The problem is a fundamental one: abstract models of parallel computation lead to unrealistic algorithms, whereas machine-specific models lead to intractable analysis of even the simplest programs. The ef fect of the tar get architec- ture is pervasive: at a high level, dif ferent architectures often require fundamentally dif ferent algorithms to achieve optimal performance; at a low level, overall performance exhibits great sensitivity to changes in communication to- pology and memory hierarchy. The result is that the developer of a parallel application must explore a complex and poorly-understood design space that contains significant architecture-dependent trade-of fs. Moreover , this algorithmic exploration must be con- ducted using programming languages that of fer exceptionally low levels of abstraction in their mechanisms for ex- pressing concurrency . While a reasonable solution to these problems is to trade reduced access to architecture-specific performance for improved abstract models of computation, this trade may not always be the right one: the whole point of parallelism, for most applications, is performance.

The goal of the DUNCK (Duke University, University of North Carolina at Chapel Hill and the Kestrel Institute) ProtoTech effort is to provide improved capabilities for


 

Specification and Development of Parallel Algorithms with the Proteus System,

A. Goldberg, P. Mills, L. Nyland, J. Prins, J. Reif and J. Riely. in Specification of Parallel algorithms, G. Blelloch, K. Chandy, and S. Jagannathan, eds., AMS, 1994.

Multiple formats: Hypertext or Full Postscript (115Kbytes)

ABSTRACT:

The Proteus language is a wide-spectrum parallel programming notation that supports the expression of both high-level architecture-independent specifications and lower-level architecture-specific implementations. A methodology based on successive refinement and interactive experimentation supports the development of parallel algorithms from specification to various efficient architecture-dependent implementations. The Proteus system combines the language and tools supporting this methodology. This paper presents a brief overview of the Proteus system and describes its use in the exploration and development of several non-trivial algorithms, including the fast multipole algorithm for N-body computations.


Software Issues in High-Performance Computing and a Framework for the Development of HPC Applications

P. Mills, L. Nyland, J. Prins, and J. Reif. in Computer Science Agendas for High Perfromance Computing, U. Vishkin, ew, ACM, 1994.

Multiple formats: Hypertext PaperHypertext or Full Postscript (72Kbytes)

ABSTRACT:

We identify the following key problems faced by HPC software: (1) the large gap between HPC design and implementation models in application development, (2) achieving high performance for a single application on different HPC platforms, and (3) accommodating constant changes in both problem specification and target architecture as computational methods and architectures evolve.

To attack these problems, we suggest an application development methodology in which high-level architecture-independent specifications are elaborated, through an iterative refinement process which introduces architectural detail, into a form which can be translated to efficient low-level architecture-specific programming notations. A tree-structured development process permits multiple architectures to be targeted with implementation strategies appropriate to each architecture, and also provides a systematic means to accommodate changes in specification and target architecture.

We describe the Proteus system, an application development system based on a wide-spectrum programming notation coupled with a notion of program refinement. This system supports the above development methodology via: (1) the construction of the specification and the successive designs in a uniform notation, which can be interpreted to provide early feedback on functionality and performance, (2) migration of the design towards specific architectures using formal methods of program refinement, (3) techniques for performance assessment in which the computational model varies with the level of refinement, and (4) the automatic translation of suitably refined programs to low-level parallel virtual machine codes for efficient execution.


Models and Resource Metrics for Parallel and Distributed Computation

Zhiyong Li, Peter H. Mills, John H. Reif

To appear in: Proc. 28th Annual Hawaii International Conference on System Sciences (HICSS-28 Parallel Algorithms Software Technology Track), (Wailea, Maui, Hawaii, January 3-6), IEEE Press, 1995.

Postscript (93Kbytes)

ABSTRACT:

This paper presents a framework of using resource metrics to characterize the various models of parallel computation. Our framework reflects the approach of recent models to abstract architectural details into several generic parameters, which we call resource metrics. We examine the different resource metrics chosen by different parallel models, categorizing the models into four classes: the basic synchronous models, and extensions of the basic models which more accurately reflect practical machines by incorporating notions of asynchrony, communication cost and memory hierarchy. We then present a new parallel computation model, the LogP-HMM model, as an illustration of design principles based on the framework of resource metrics. The LogP-HMM model extends an existing parameterized network model (LogP) with a sequential hierarchical memory model (HMM) characterizing each processor. The result accurately captures both network communication costs and the effects of multileveled memory such as local cache and I/O. We examine the potential utility of our model in the design of near optimal sorting and FFT algorithms.

@inproceedings{ LMR94,
	author = {Zhiyong Li and Peter H. Mills and John H. Reif},
	title = {Models and Resource Metrics for Parallel and Distributed  
    		Computation},
	booktitle = { Proc. 28th Annual 
		Hawaii International Conference on System Sciences (HICSS-28
		Parallel Algorithms Software Technology Track) },
	publisher={IEEE Press},
	conflocation={ Wailea, Maui, Hawaii, January 3-6 },
	year=1995
}

This page has been referenced times.


Click here to return to Proteus home page