A Prototyping System for Parallel and Distributed Applications


Table of Contents:

  1. Principal Investigator.
  2. Productivity Measures.
  3. Summary of Objectives and Approach.
  4. Detailed Summary of Technical Progress.
  5. Transitions and DOD Interactions.
  6. Software and Hardware Prototypes.
  7. List of Publications.
  8. Invited and Contributed Presentations.
  9. Honors, Prizes or Awards Received.
  10. Project Personnel Promotions.
  11. Project Staff.
  12. Multimedia URL.
  13. Keywords.
  14. Business Office.
  15. Expenditures.
  16. Students.
  17. Book Plans.
  18. Sabbatical Plans.
  19. Related Research.
  20. History.

- Revised: Thu Feb 6 11:18:51 1997 by nyland@cs.unc.edu

Principal Investigator.


Productivity Measures.


Summary of Objectives and Approach.

  This work addresses a fundamental problem: naive implementations of abstract models of parallel computation are impractical, while machine-specific models make the analysis of even the simplest programs intractable. The goal of our effort is to provide a software design methodology that supports the architecture-independent development of parallel and distributed systems through a process of prototyping and refinement. The Proteus system provides the language and tools supporting this methodology, and comprises: a wide-spectrum parallel programming notation whose interpretive execution allows the early prototyping of specifications; techniques for the (semi-automatic) refinement of architecture-independent specifications to lower-level programs optimized for specific architectures, followed by translation to executable low-level parallel languages; and an execution system supporting interoperability with foreign codes and run-time performance analysis. Within this general paradigm our work has emphasized the following areas:

  1. Transformation of nested data-parallel programs into an executable vector model [PP93,PPW95].
  2. A refinement-based approach to performance prediction using parallel cost models that vary with the level of refinement [LMR95].
  3. A model of design and development based on formal category-theoretic concepts [Srin95].

  This work has been driven by several challenge problems, including the Fast Multipole Algorithm for N-body simulation, a radar tracking and response doctrine system for the Naval Surface Warfare Center Aegis Hiper-D program, and most recently multitarget tracking.


Detailed Summary of Technical Progress.

  1. Participation in the Hiper-D GeoServer follow-on activity.

    From July through October of 1994, we participated in a community-wide prototyping experiment defined by the Hiper-D program at NSWC as part of their next-generation Aegis development effort. Using Proteus, we developed a series of prototypes to explore design approaches for Radar Validation using constructive solid geometry methods. Three prototypes were developed, along with a harness to connect them to the networked application. All prototypes were rapidly developed, were executable using the Proteus interpreter, and were judged to have provided NSWC with useful knowledge about design options.

    In addition, one prototype was transformed to parallel code which can be run on several parallel machines (including the Cray at NCSC and the MasPar at UNC). This computational kernel is controlled by the network harness, using the interoperability capabilities of Proteus.
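
    For illustration only (the prototypes themselves are described in [LMN+94]), the sketch below shows the kind of constructive-solid-geometry containment test at the heart of such a radar-validation component: a doctrine region is built by combining primitive volumes with union and difference operators, and a reported track position is checked for membership. All names and shapes here are hypothetical.

      # Illustrative sketch only: a minimal CSG membership test of the kind a
      # radar-validation component might perform.  Shapes and names are hypothetical.
      from dataclasses import dataclass

      @dataclass
      class Sphere:
          center: tuple
          radius: float
          def contains(self, p):
              return sum((a - b) ** 2 for a, b in zip(p, self.center)) <= self.radius ** 2

      @dataclass
      class Union:
          a: object
          b: object
          def contains(self, p):
              return self.a.contains(p) or self.b.contains(p)

      @dataclass
      class Difference:
          a: object
          b: object
          def contains(self, p):
              return self.a.contains(p) and not self.b.contains(p)

      # A hypothetical doctrine region: a coverage sphere with an exclusion zone removed.
      region = Difference(Sphere((0.0, 0.0, 0.0), 100.0), Sphere((40.0, 0.0, 0.0), 10.0))

      def validate_track(track_position):
          """Return True if the reported track lies inside the doctrine region."""
          return region.contains(track_position)

      print(validate_track((10.0, 20.0, 5.0)))   # True: inside coverage, outside exclusion
      print(validate_track((42.0, 1.0, 0.0)))    # False: inside the exclusion zone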

  2. Development of a concurrent object model for process parallelism. Significant progress has been made in language design through the development of an extensible foundation for explicit task parallelism. Communication is through a shared object model in which access to shared state is controlled by object methods, with class directives constraining the mutual exclusion of those methods [GPR+94]. Predefined classes, such as single-assignment objects that synchronize a producer with a consumer, together with provisions for private state and barrier synchronization, allow the expression of a wide range of parallel computing paradigms and are key to providing an expressive and uniform vehicle for refinement.
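
    As an illustration of the style (not Proteus syntax, which is given in [GPR+94]), the sketch below shows a single-assignment shared object that synchronizes a producer with a consumer: readers block until the single write has occurred. The class and method names are hypothetical.

      # Illustrative sketch: a single-assignment ("write once") shared object that
      # synchronizes a producer with a consumer.  Not Proteus syntax; names hypothetical.
      import threading

      class SingleAssignment:
          def __init__(self):
              self._cond = threading.Condition()
              self._set = False
              self._value = None

          def write(self, value):
              with self._cond:
                  if self._set:
                      raise RuntimeError("single-assignment object written twice")
                  self._value, self._set = value, True
                  self._cond.notify_all()          # release any waiting readers

          def read(self):
              with self._cond:
                  while not self._set:             # readers block until the write occurs
                      self._cond.wait()
                  return self._value

      cell = SingleAssignment()
      consumer = threading.Thread(target=lambda: print("consumed:", cell.read()))
      consumer.start()
      cell.write(42)                               # the producer supplies the value exactly once
      consumer.join()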

  3. Development and release of Specware 1.0

    Specware is Kestrel Institute's next-generation system for the composition and refinement of specifications, generalizing and extending earlier work on KIDS and DTRE. The latest release includes a fully general Lisp code generator and full support for graphically based refinement of specifications.

  4. Development of a framework for performance prediction in which the computing model varies with the level of refinement. We have developed a framework for performance prediction in which the computing model varies with the level of refinement [MNPR94], and concomitantly developed improved parallel computational models (LogP-HMM, LogP-UMH) [LMR95]. To support early and incrementally more accurate analyses, performance estimates are obtained using increasingly detailed parallel computational models as program refinement progresses. To support the assessment of multi-paradigm programs, different models are used to analyze code segments following different paradigms, such as the VRAM for data parallelism and the LogP for message passing, with suitable instrumentation to attach the model to the design. We have also developed a framework for assessing and designing new models of computation to support more detailed performance analyses, based on the notion of resource metrics, which are parameterized abstractions of architectural details. For example, to support more accurate modeling of costs such as cache and I/O behavior, we have developed a new hybrid model of parallel computation, the LogP-HMM model [LMR95], which fills a gap in the hierarchy of refined models by extending a network model (LogP) with a sequential hierarchical memory model (the HMM). More recently, extensions of the LogP-HMM model have been explored that replace the memory hierarchy with more advanced models such as the UMH (Uniform Memory Hierarchy) [LMR95a].
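
    As a toy illustration of this modeling style (not the LogP-HMM formulation of [LMR95]), the sketch below estimates the cost of a burst of point-to-point messages under the standard LogP parameters, plus a simple hierarchical-memory term for staging the message data from a slower level of the memory hierarchy. All parameter values are hypothetical.

      # Toy cost estimator in the spirit of LogP extended with a memory-hierarchy term.
      # An illustration of the modeling style only, not the LogP-HMM model of [LMR95];
      # all parameter values are hypothetical.

      def logp_send_time(k, L, o, g):
          """Time until the last of k back-to-back messages from one sender arrives:
          the sender injects a message every max(o, g) cycles, and the final message
          then costs o to send, L cycles in flight, and o to receive."""
          return (k - 1) * max(o, g) + o + L + o

      def memory_fetch_time(words, block, t_block):
          """Simple hierarchical-memory term: data moves in blocks of `block` words,
          each costing t_block cycles to bring to the fast level."""
          return -(-words // block) * t_block          # ceiling division

      # Hypothetical machine parameters (in cycles): latency, overhead, gap.
      L, o, g = 40, 4, 8
      k, msg_words = 16, 64
      total = memory_fetch_time(k * msg_words, block=32, t_block=20) + logp_send_time(k, L, o, g)
      print("estimated cycles:", total)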

  5. Algebraic framework for synthesizing high-performance programs from tensor products for out-of-core computations. Reif, Gupta, and Li have developed an algebraic framework for the automatic synthesis of efficient parallel programs from high-level tensor product specifications for a large class of block recursive algorithms. This framework is suitable for developing out-of-core (I/O-intensive) parallel programs for algorithms such as fast Fourier transforms, which are used in many key application areas. Tensor products succinctly capture the semantics of block-cyclic distributions as well as the access patterns over out-of-core data. We have targeted the program synthesis towards distributed hierarchical memory machines, including tightly coupled MIMD machines and network-connected workstations. We capture the performance properties of the machines with abstract computation models that include variations of the LogP-HMM model [LMR95] and the striped two-level memory model of Vitter and Aggarwal. Data layout annotations such as cyclic(B), found in High Performance Fortran, are used to de-cluster the data among the disks. The idea of de-clustering data is used by most current parallel file systems, such as PIOUS and Vesta, which therefore provide a suitable testbed for efficiently implementing our approach.
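
    For illustration (a simplification of the framework described in [GLR95, LRG95]), the sketch below shows how the two basic tensor-product factors map onto block data access: (I_p ⊗ A) applies A independently to contiguous blocks, while (A ⊗ I_q) applies A across corresponding block elements at stride q. It is this correspondence that lets tensor-product formulas describe block data layouts and out-of-core access patterns. The matrices and sizes below are arbitrary.

      # Illustrative sketch: how tensor-product factors correspond to block data access.
      # A simplification of the synthesis framework of [GLR95]; matrices are arbitrary.
      import numpy as np

      p, q = 4, 6
      A = np.random.rand(p, p)          # "inter-block" factor
      B = np.random.rand(q, q)          # "intra-block" factor
      x = np.random.rand(p * q)
      X = x.reshape(p, q)               # row i of X is the i-th contiguous block of x

      # (I_p tensor B) x : apply B independently to each contiguous block (unit-stride access).
      y_intra = (X @ B.T).reshape(-1)
      assert np.allclose(y_intra, np.kron(np.eye(p), B) @ x)

      # (A tensor I_q) x : apply A across corresponding block elements (stride-q access).
      y_inter = (A @ X).reshape(-1)
      assert np.allclose(y_inter, np.kron(A, np.eye(q)) @ x)

      print("tensor-product factorizations verified for a", p * q, "element vector")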

  6. Development and prototyping of an efficient data-parallel algorithm for multitarget tracking. Researchers at Duke have developed an efficient data-parallel algorithm for the core computation of the Joint Probabilistic Data Association (JPDA) multitarget tracking algorithm, in which the problem is to derive probabilities of target-measurement associations from a priori state estimates. The algorithm is an equational refinement of the recursive modified-matrix-permanent formulation first proposed by Zhou; it is efficient in that it avoids redundant computation of subexpressions, and data-parallel in that it is expressed in a functional style using sequence aggregates. The algorithm is presented as an equational specification whose structure is amenable to expression in the nested data-parallel language Proteus [Mil95]. We are currently transcribing the algorithm into Proteus and obtaining complexity analyses and comparative benchmarks for the implementation.
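
    To make the core computation concrete, the sketch below computes target-measurement association probabilities directly from matrix permanents for a simplified setting (a square likelihood matrix, no missed detections or false alarms). It uses a naive exponential expansion purely for illustration; the algorithm described above shares subexpressions to avoid exactly this redundancy. All names and values are hypothetical.

      # Naive permanent-based association probabilities for a simplified JPDA setting
      # (square likelihood matrix, no missed detections or clutter).  Illustrative only;
      # the data-parallel algorithm of [Mil95] avoids the redundant subcomputations
      # this direct expansion performs.
      import numpy as np

      def permanent(M):
          """Permanent by expansion along the first row (exponential time)."""
          if M.shape[0] == 0:
              return 1.0
          return sum(M[0, j] * permanent(np.delete(np.delete(M, 0, axis=0), j, axis=1))
                     for j in range(M.shape[1]))

      def association_probabilities(A):
          """beta[j, t]: probability that measurement j originated from target t,
          assuming every measurement is assigned to exactly one target."""
          n = A.shape[0]
          total = permanent(A)
          beta = np.empty_like(A)
          for j in range(n):
              for t in range(n):
                  minor = np.delete(np.delete(A, j, axis=0), t, axis=1)
                  beta[j, t] = A[j, t] * permanent(minor) / total
          return beta

      A = np.array([[0.9, 0.2, 0.1],       # hypothetical measurement/target likelihoods
                    [0.3, 0.8, 0.2],
                    [0.1, 0.3, 0.7]])
      beta = association_probabilities(A)
      print(beta)
      print(beta.sum(axis=1))              # each measurement's probabilities sum to 1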

  7. One-step transformation of Proteus programs.

    Using the NSWC Geoserver as a driving problem, we substantially advanced our transformation capabilities. Proteus programs written in a functional, single-assignment style using nested sequences are automatically transformed into C programs that rely on a vector library. The library is implemented for several parallel machines, giving us portability at a small performance penalty. Initial tests on a Cray Y-MP indicate that the transformed programs run within a factor of four of comparable Fortran programs. This result is the basis of further research to increase performance by improving register reuse, pipelining operations, and eliminating temporaries.
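
    For illustration (not the generated C code itself), the sketch below shows the flat representation such a vector library operates on: a nested sequence becomes a flat data vector plus a segment descriptor, and a nested reduction becomes a single segmented-sum primitive. The names are hypothetical, and segments are assumed to be nonempty.

      # Illustrative sketch of the flattened representation used by a nested
      # data-parallel vector library: flat data + segment descriptor.
      # Names are hypothetical; segments are assumed to be nonempty.
      import numpy as np

      nested = [[3, 1, 4], [1, 5], [9, 2, 6, 5]]          # a nested sequence

      # Flattened form: one flat vector of elements plus the length of each segment.
      data = np.array([x for seg in nested for x in seg])
      seg_lengths = np.array([len(seg) for seg in nested])
      seg_starts = np.concatenate(([0], np.cumsum(seg_lengths)[:-1]))

      # A nested reduction  { sum(seg) : seg in nested }  becomes one segmented sum
      # over the flat vector -- a single data-parallel library call.
      seg_sums = np.add.reduceat(data, seg_starts)
      print(seg_sums)                                     # [ 8  6 22]
      assert list(seg_sums) == [sum(seg) for seg in nested]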

  8. Work-efficient Execution of Proteus Programs

    The transformations developed by Prins and Palmer [PP93] and the supporting library routines are not necessarily work-efficient. Careful examination, ordering, and rewriting of the transformation rules and data-parallel library routines has preserved the asymptotic work complexity of the transformed Proteus programs [PP95].
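
    One common source of work inefficiency, illustrated by the hypothetical sketch below, is a lifted operation that replicates a shared sequence once per parallel instance before using it; a work-efficient library primitive performs a single gather instead, so the work performed matches the work specified by the source program.

      # Hypothetical illustration of one source of work inefficiency in flattened code:
      # replicating a shared sequence per instance versus a single gather.
      import numpy as np

      A = np.arange(1000)                     # sequence indexed inside a nested expression
      idx = np.array([3, 7, 9, 3, 0])         # one index per parallel instance

      # Naive lifted indexing: distribute a copy of A to every instance, then index.
      # Work is O(len(idx) * len(A)) even though only len(idx) elements are needed.
      replicated = np.tile(A, (len(idx), 1))
      naive = replicated[np.arange(len(idx)), idx]

      # Work-efficient lifted indexing: a single gather, O(len(idx)) work.
      efficient = A[idx]

      assert np.array_equal(naive, efficient)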

  9. Piece-wise execution of Proteus programs

    The transformation system generates programs that aggressively build the longest possible vectors for parallel execution. This implies that all elements of a vector must be in memory at once, leading to extremely large memory demands even for programs of moderate size. To alleviate this problem, work is underway to repeatedly execute a series of vector operations on segments of the vectors small enough to fit in the available memory. The operation sequences usually begin with operations that increase memory usage (e.g., distribution) and end with data reduction operations (e.g., sum, product). We call this style of execution "piece-wise execution"; the result is programs with substantially reduced memory requirements [PPCF96].
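
    The hypothetical sketch below illustrates the idea (though not the Proteus run-time system itself): a distribute/multiply/reduce pipeline whose naive vectorized evaluation materializes a large intermediate vector is instead run over strips of the index space, accumulating partial results, so peak memory use is bounded by the strip size.

      # Hypothetical illustration of piece-wise execution of a vector pipeline.
      import numpy as np

      a, b = np.random.rand(4000), np.random.rand(4000)

      # Naive vectorized evaluation: materialize the full 4000 x 4000 intermediate.
      full = np.outer(a, b)                             # ~128 MB of temporaries
      naive_total = full.sum()

      # Piece-wise execution: run the same distribute/multiply/reduce pipeline on
      # strips of the index space that fit in memory, accumulating partial sums.
      piecewise_total = 0.0
      chunk = 256
      for start in range(0, len(a), chunk):
          strip = np.outer(a[start:start + chunk], b)   # small intermediate only
          piecewise_total += strip.sum()

      assert np.isclose(naive_total, piecewise_total)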

  10. Provably correct complexity measures for transformed Proteus programs.

    The transformation of Proteus programs relies on a set of complex rules to generate flat data-parallel code. It is crucial that these transformations preserve the work and step complexities of the source program. We have developed a formal semantics for the language and used it to prove the transformations correct for three different target models: the CREW, bounded-contention CREW, and EREW variants of the VRAM (vector random-access machine) [RPI95]. The techniques used in the proof are novel and may be of independent interest. The step and work metrics are compositional, allowing the complexity of a program to be calculated from the complexities of its subprograms.
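
    For illustration, compositional cost rules take roughly the following form, charging sequential composition additively and charging an apply-to-each the sum of the work and the maximum of the steps of its instances (the precise metrics and their preservation proof appear in [RPI95]):

      \[
      \begin{aligned}
      W(e_1 ; e_2) &= W(e_1) + W(e_2), & S(e_1 ; e_2) &= S(e_1) + S(e_2),\\
      W(\{\, e(x) : x \in xs \,\}) &= O(|xs|) + \sum_{x \in xs} W(e(x)), &
      S(\{\, e(x) : x \in xs \,\}) &= O(1) + \max_{x \in xs} S(e(x)).
      \end{aligned}
      \]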

  11. FY-96 PLANS:


Transitions and DOD Interactions.

  1. The Naval Surface Warfare Center has conducted two experiments in prototyping, exploring the usefulness of prototyping with regard to their software system to be deployed in 2003. We participated in both experiments, implementing a small-yet-complete program in the first and, in the second, a doctrine-validation component of a distributed application (with other components developed by other ProtoTech groups). Proteus programs were installed and run for demonstrations at NSWC and IDA. A videotape was produced showing the results of this collaboration.
  2. Duke and UNC are collaborating with John Board in the E.E. Department of Duke University and J. Hermans of the Biochemistry Department at UNC in the investigation and implementation of parallel Fast Multipole Algorithms (FMA) for molecular dynamics simulations. UNC has just completed 2 years of a 5-year $2.5M NIH grant for the development and implementation of parallel algorithms for molecular dynamics. The use of Proteus is an integral component of this effort.
  3. Further collaboration in the parallelization of Molecular Dynamics has been established with the Theoretical Biophysics Group in the Beckman Institute at the University of Illinois (Urbana-Champaign). The group is directed by Dr. Klaus Schulten, the principal investigator of a 5-year Grand Challenge Award for Structural Biology, who is aided by faculty members, post-doctoral fellows and graduate students in the Biophysics, Chemistry, Computer Science and Physics departments.
  4. The Medical Imaging group at UNC is using Proteus to develop sophisticated parallel algorithms for segmenting medical images into physical objects.
  5. Our Fast Multipole Algorithm code is being examined by others at Duke University, CMU, and Brooklyn Polytechnic as a guide for their own implementations.
  6. Release of Specware 1.0 to Motorola, the Air Force Institute of Technology (AFIT), NSA, Syracuse University, and Rome Laboratory. Motorola is specifying and implementing the kernel of a secure operating system using Specware. A number of students are using Specware at AFIT. Rome Laboratory is evaluating Specware for use within its research programs in software engineering. Syracuse University is applying Specware to hardware verification and is examining the integration of CSP notations with Specware.


Software and Hardware Prototypes.

  1. The Radar Validation Server of the AEGIS Geoserver Experiment:
  2. Trans, a Proteus transformation system:
  3. Specware:


List of Publications.

  1. [FPPN94] R. E. Faith, D. W. Palmer, J. F. Prins, L. S. Nyland, "Practical Issues in the Flattening of Nested Parallelism with Program Transformations" (Extended Abstract)
  2. [FNPP94] R. E. Faith, L. Nyland, D. Palmer, J. Prins, "The Proteus NS Grammar", UNC-CS TR94-029.
  3. [FY95] R. E. Faith, W. Yakowenko, "The Khepera Transformation System User's Manual", UNC Technical Report.
  4. [GLR95] S. Gupta, Z. Li, and J. Reif, "Synthesizing Efficient Programs for Two-Level Memories from Tensor-products". Proc. of the 7th IASTED/ISMM International Conference on Parallel and Distributed Computing and Systems, Washington, DC, October 1995, pp. 510-513.
  5. [GPR+94] A. Goldberg, J. Prins, J. Reif, R. Faith, Z. Li, P. Mills, L. Nyland, D. Palmer, J. Riely, S. Westfold, "The Proteus System for the Development of Parallel Applications", in Prototyping Languages and Prototyping Technology, M. Harrison, ed., (to appear) Springer-Verlag, 1996. (40 pp)
  6. [GMN+94] A. Goldberg, P. Mills, L. Nyland, J. Prins, J. Reif, J. Riely, "Specification and Development of Parallel Algorithms with the Proteus System", in Specification of Parallel Algorithms, G. Blelloch, M. Chandy, S. Jagannathan, eds., AMS, 1995. (pp 383-399)
  7. [KGP94] S. Kumar, S. Goddard and J. Prins, "Connected-Components Algorithms for Mesh-Connected Parallel Computers", presented at DIMACS workshop on parallel algorithms, 10/94. DIMACS proceedings, IEEE Press (to appear).
  8. [LRG95] Z. Li, J.H. Reif, and S.K.S. Gupta, "Synthesizing Efficient Out-of-Core Programs for Block Recursive Algorithms using Block-Cyclic Data Distributions". Submitted for publication.
  9. [LMN+94] Z. Li, P. Mills, L. Nyland, J. Prins, J. Reif, W. Yakowenko, "A Radar-validation Server", http://particle.cs.unc.edu/nswc2/REPORT/REPORT.html.
  10. [LMR95a] Z. Li, P. H. Mills, and J. H. Reif, "Models and Resource Metrics for Parallel and Distributed Computation". To appear in Journal of Parallel Algorithms and Applications, Vol. 9, No. 4, Dec. 1995.
  11. [LMR95] Z. Li, P. H. Mills, and J. H. Reif, "Models and Resource Metrics for Parallel and Distributed Computation". Proc. 28th Annual Hawaii International Conference on System Sciences (HICSS-28 Parallel Algorithms Software Technology Track), Wailea, Maui, Hawaii, January 3-6, IEEE Press, 1995.
  12. [MNPR94] P. H. Mills, L. S. Nyland, J. F. Prins, and J. H. Reif, "Software Issues in High-Performance Computing and a Framework for the Development of HPC Applications," In: Developing a Computer Science Agenda for High Performance Computing (U. Vishkin, ed.) pp.110-117, ACM, 1994.
  13. [Mil95] P. Mills, "An Efficient Data-Parallel Algorithm for Association in Multi-Target Tracking", Draft Technical Report, Department of Computer Science, Duke University, Sept. 1995.
  14. [PPCF96] D. Palmer, J. Prins, S. Chatterjee, R. Faith, "Piecewise Execution of Nested Data-Parallel Programs", Proc. Workshop on Languages and Compilers for Parallel Computing, Columbus, OH, (to appear) Springer-Verlag, 1996. (18pp)
  15. [PP95] D. Palmer, J. Prins, "Work-efficient Nested Data-Parallelism", Proc. of the 5th Symposium on the Frontiers of Massively Parallel Processing, IEEE 1995. (pp 186-193)
  16. [RPI95] J. Riely, J. Prins, S. Iyer, "Provably Correct Vectorization of Nested Parallel Programs", in Programming Models for Massively Parallel Computers, (to appear) IEEE 1995. (10pp)
  17. [Srin95] Y. V. Srinivas and R. Jullig, "Specware: Formal Support for Composing Software", Proceedings of the Conference on Mathematics of Program Construction, Kloster Irsee, Germany, July 1995.


Invited and Contributed Presentations.

  1. J. Prins, "Development of Efficient Parallel Algorithms for Unstructured Problems", EUROSIM Congress on Massively Parallel Processing: Applications and Development, Delft, The Netherlands, 6/23/94.
  2. J. Prins, "A Refinement-Based Approach to Parallel Software Development and Maintenance", RCI Forum on Parallel Software Engineering for C3I and Large Scale Commercial Applications, Orlando, FL, 1/24/95.
  3. L. Nyland, "Multiple Proteus Prototypes for AEGIS Doctrine Validation", Prototech Quarterly Meeting, San Diego, CA, 9/29/94.
  4. J. Prins, "High-Performance Irregular Computation using High-Level Programming Languages and Novel Compilation Techniques", Swiss High Performance Computing Seminar Series, CSCS, Manno, Switzerland, 10/6/95.
  5. L. Nyland, "Doctrine Validation Component of the Prototech AEGIS Geoserver Experiment", DSSA Quarterly meeting, Institute for Defense Analysis, 11/30/94.
  6. J. Reif, "Parallel Molecular Computation: Models and Simulations", 7th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA'95), Santa Barbara, CA, July 1995.
  7. J. Reif, "Predictive Computing: An Emerging Paradigm for Efficient Computation", Dartmouth Institute for Advanced Graduate Studies (DAGS94), Hanover, NH, July 1994.
  8. J. Reif, "Work Efficient Parallel Solution of Toeplitz Systems and Polynomial GCD", 27th ACM Symposium on Theory of Computing (STOC 95), Las Vegas, NV, May 1995.
  9. J. Reif, "Efficient Parallel Solution of Sparse Eigenvalue and Eigenvector Problems", 35th Annual IEEE Symposium on Foundations of Computer Science (FOCS'95), Milwaukee, WI, October 1995.


Honors, Prizes or Awards Received.


Project Personnel Promotions.


Project Staff.

  1. Name: Dr. Allen Goldberg
  2. Name: Dr. Jan F. Prins
  3. Name: Dr. John H. Reif
  4. Name: Dr. Lars S. Nyland
  5. Name: Mr. Peter H. Mills


Multimedia URL.

  1. EOYL FY95
  2. QUAD FY95
  3. EOYL FY94
  4. The Proteus System Home Page
  5. The AEGIS Experiment with the Prototech Community (ARPA Videotape)
  6. "A Radar-validation Server", a report on our interaction in the Prototech prototyping experiment with NSWC.


Keywords.

  1. Prototype Program Development
  2. Parallel Programming
  3. Program Transformation
  4. Fast Multipole Algorithm (FMA)
  5. Radar Validation
  6. Geoserver


Business Office


Expenditures


Students

  1. Name: Mr. Shenfeng Chen
  2. Name: Mr. Rickard Faith
  3. Name: Mr. Ashish Gehani
  4. Name: Mr. Zhiyong Li
  5. Name: Mr. Daniel W. Palmer
  6. Name: Mr. James W. Riely
  7. Name: Mr. Eric Schultes
  8. Name: Ms. Hongyan Wang
  9. Name: Mr. William Yakowenko


Book Plans


Sabbatical Plans

  • Person: Jan F. Prins
  • Location: Zurich, CH
  • Institution: ETH Theoretical Computer Science
  • Sabbatical Year: 1996-1997 Academic Year
  • Foreign S&T Reporting Interest: Yes


Related Research

  1. Architecture Independent Development of Parallel Software (supported by Rome Laboratory)
  2. NIH Parallel Computing Research Resource for Structural Biology
  3. Multi-user Distributed Interactive Simulation, an ARPA-funded CAETI project
  4. Mailing lists:
    • proteus-users: a mailing list for Proteus users and designers.
    • parlunch: a mailing list announcing talks for the Parallel Lunch Seminar.
    To subscribe to either, send mail to majordomo@cs.unc.edu with the message "subscribe list-name".


History

  Our work related to molecular dynamics has improved the performance of two scientific codes.

  1. The parallel performance of the molecular dynamics simulator SIgMA was improved by using load-balancing techniques developed as part of our FMA work. The result is highly scalable performance for small (<10K atom) simulations.
  2. The performance of the Fast Multipole code developed at Duke University by J. Board et al. was improved by 50% through a single simple change. The modification to the algorithm became obvious only when it was expressed at a high level in Proteus.