SUPERB (ID:6261/)
Parallel Fortran
Parallel Fortran for the German SUPRENUM machine, and the precursor to Vienna Fortran
Hans Zima, GENESIS project, 1988
"SUPERB was the first implemented system that translated sequential Fortran 77 into explicitly parallel message-passing Fortran."
Related languages
References:
The parallelization system SUPERB was developed in the German supercomputer project SUPRENUM from 1985 to 1989. It is based on the Single-Program-Multiple-Data (SPMD) paradigm, allows the use of global addresses, and automatically inserts the necessary communication statements, given a user-supplied data distribution. SUPERB was the first implemented system that translated sequential Fortran 77 into explicitly parallel message-passing Fortran. As a result of the experiences with SUPERB and related research, the language Vienna Fortran was designed within the ESPRIT project GENESIS, in a joint effort of the University of Vienna and ICASE, NASA Langley Research Center. Vienna Fortran is a machine-independent language extension to Fortran which includes a broad range of features for the high-level support of advanced application development for distributed-memory multiprocessors. It has significantly influenced the development of High Performance Fortran, a first attempt at language standardization in this area.

Keywords: distributed-memory multiprocessor systems, numerical computation, data parallel algorithms, data distribution, program analysis, optimization.

External link: Online copy

Extract: Introduction
Since the advent of early distributed-memory multiprocessing systems (DMMPs) such as Caltech's Cosmic Cube and the German supercomputer SUPRENUM less than a decade ago, these architectures have rapidly gained user acceptance and are today offered by most major manufacturers. Current DMMPs include Intel's hypercubes, the Paragon, the nCUBE, Thinking Machines' CM-5, and the Meiko Computing Surface. DMMPs are relatively inexpensive to build, and are potentially scalable to large numbers of processors.
However, these machines are difficult to program: the non-uniformity of the memory, which makes local accesses much faster than the transfer of non-local data via message-passing operations, implies that the locality of algorithms must be exploited in order to achieve acceptable performance. The management of data, with the twin goals of spreading the computational workload and minimizing the delays caused when a processor has to wait for non-local data, becomes of paramount importance. When a code is parallelized by hand, the programmer must distribute the program's work and data to the processors which will execute it. A common approach makes use of the regularity of most numerical computations: the so-called Single Program Multiple Data (SPMD) or data parallel model of computation. With this method, the data arrays in the original program are each partitioned and mapped to the processors. This is known as distributing the arrays. A processor is then thought of as owning the data assigned to it; these data elements are stored in its local memory. The work is then distributed according to the data distribution: computations which define the data elements owned by a processor are performed by it -- this is known as the owner computes paradigm. The processors then execute essentially the same code in parallel, each on the data stored locally. Accesses to non-local data must be explicitly handled by the programmer, who has to insert communication constructs to send and receive data at the appropriate positions in the code. The details of message passing can become surprisingly complex: buffers must be set up, and the programmer must take care to send data as early as possible, and in economical sizes. Furthermore, the programmer must decide when it is advantageous to replicate computations across processors, rather than send data.
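The owner-computes rule described above can be sketched in a few lines; this is an illustrative Python sketch, not code from the paper, and the BLOCK layout and helper names are assumptions chosen for the example:

```python
# Sketch of a BLOCK distribution of a 1-D array over p processors, and the
# owner-computes rule: the iteration "a(i) = ..." is executed by the
# processor that owns element i. Layout and names are illustrative only.

def block_owner(i, n, p):
    """Processor owning element i of an n-element array distributed
    blockwise over p processors (block size = ceiling(n/p))."""
    block = -(-n // p)          # ceiling division
    return i // block

def local_iterations(me, n, p):
    """Owner computes: processor 'me' executes exactly the iterations
    whose left-hand-side element it owns."""
    return [i for i in range(n) if block_owner(i, n, p) == me]

# 10 elements on 4 processors -> blocks of size 3
print(local_iterations(0, 10, 4))   # [0, 1, 2]
print(local_iterations(3, 10, 4))   # [9]  (the last block is smaller)
```

Since every index has exactly one owner, the local iteration sets of all processors cover the original loop without duplication, which is what lets each processor run essentially the same code on its own data.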
A major characteristic of this style of programming is that the performance of the resulting code depends to a very large extent on the data distribution selected. The distribution determines not only where computation will take place; it is also the main factor in deciding what communication is necessary. The communication statements, as well as the data distribution, are hardcoded into the program, so trying out a different data distribution generally requires a great deal of reprogramming. This programming style can be likened to assembly programming on a sequential machine -- it is tedious, time-consuming and error-prone. Thus much research activity has been concentrated on providing programming tools for DMMPs. One of the first such tools is SUPERB [42], an interactive restructurer which was developed in the SUPRENUM project starting in 1985. It translates Fortran 77 programs into message-passing Fortran for the SUPRENUM machine [18], the Intel iPSC, and the GENESIS machine. SUPERB performs coarse-grain parallelization for a DMMP and is also able to vectorize the resulting code for the individual nodes of the machine. The user specifies the distribution of the program's data via an interactive language. Program flow and dependence analysis information, using both intraprocedural and interprocedural analysis techniques, is computed and made available to the user, who may select individual transformation strategies or request other services via menus. SUPERB puts a good deal of effort into optimizing the target program, extracting communication from loops whenever possible, and combining individual communication statements (by vectorization and fusion) to reduce the overall communication cost ([16]). Simple reductions are recognized and handled by the system. SUPERB handles full Fortran 77, dealing with common blocks and equivalencing.
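The effect of combining communication statements by vectorization can be sketched as follows; this is an illustrative Python sketch of the idea, not SUPERB's actual algorithm, and the function names are hypothetical:

```python
# Sketch of message vectorization: instead of issuing one message per
# non-local access inside a loop, the non-local indices are collected,
# deduplicated, and fetched in a single combined message hoisted out of
# the loop. Per-message startup overhead is paid once instead of once
# per access.

def naive_messages(nonlocal_indices):
    """One message per access: high per-message overhead."""
    return [[i] for i in nonlocal_indices]

def vectorized_messages(nonlocal_indices):
    """One combined message, duplicates removed."""
    return [sorted(set(nonlocal_indices))]

accesses = [5, 7, 5, 9, 7]           # non-local elements touched in a loop
print(len(naive_messages(accesses))) # 5 separate messages
print(vectorized_messages(accesses)) # [[5, 7, 9]] -- a single message
```

On real DMMPs the saving is substantial because message startup latency typically dominates the cost of small transfers, which is why hoisting and fusing communication out of loops matters so much.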
Its implementation was completed in 1989, and thus it was the first system which compiled code for DMMPs from Fortran 77 and a description of the distribution of data. SUPERB provides special support for handling work arrays, as are commonly used in Fortran codes, for example to store several grids in one array. The experience and success gained with SUPERB and other experimental parallelization systems for DMMPs led to a new focus of research: the provision of appropriate high-level language constructs for the specification of data distributions. Vienna Fortran [8, 43], developed within the ESPRIT project GENESIS in joint work by the University of Vienna and ICASE, NASA Langley Research Center, is a machine-independent language extension to Fortran which includes high-level features for specifying virtual processor structures, distributing data across sets of processors, dynamically modifying distributions, and formulating explicitly parallel loops. This paper will focus on SUPERB and Vienna Fortran, which are discussed in detail in Sections 3 and 4, after an introduction to the basic notation and terminology (Section 2). The rest of the paper deals with the relationship between Vienna Fortran and HPF (Section 5), an advanced compilation technique for dealing with irregular data accesses (Section 6), and an overview of related work (Section 7), followed by the conclusion.

Extract: Related Work
An early attempt to provide higher-level language constructs for the specification of numerical algorithms on DMMPs is DINO [34, 35]. DINO is explicitly parallel, providing a set of C language extensions. Non-local data may be read and written; thus DINO does not conform to the owner computes paradigm. Remote accesses are marked by the user. DINO has been fully specified and implemented. The description of SUPERB in [42] is the first journal publication in the area of compiling Fortran for DMMPs.
Callahan and Kennedy propose a similar compilation approach in [6]. The concept of defining processor arrays and distributing data to them was first introduced in the programming language BLAZE [25] in the context of shared memory systems with non-uniform access times. This research was continued in the Kali programming language [28] for distributed memory machines, which requires that the user specify data distributions in much the same way that Vienna Fortran does. It permits both standard and user-defined distributions. The design of Kali has greatly influenced the development of Vienna Fortran. In particular, the parallel FORALL loops of Vienna Fortran were first defined in Kali and implemented with the inspector-executor paradigm as described in Section 6. The Parti routines and the ARF compiler ([41, 38]), developed by Saltz and co-workers at ICASE, represent techniques developed to handle the kind of codes written for sparse and unstructured problems in scientific computing. They are designed to handle the general case of arbitrary data mappings, and efficient techniques were developed for a number of subproblems. A commercially available system is the MIMDizer ([30]), which may be used to parallelize sequential Fortran programs according to the SPMD model. The MIMDizer takes a similar approach to SUPERB; it deals with a number of specific Fortran issues, including a very flexible handling of common blocks. The programming language Fortran D [13] proposes a Fortran language extension in which the programmer specifies the distribution of data by aligning each array to a decomposition, which corresponds to an HPF template (see Section 5), and then specifying a distribution of the decomposition to a virtual machine. These are executable statements, and array distributions are dynamic only. A subset of Fortran D -- roughly corresponding to SUPERB -- has been implemented for the iPSC/860 [20].
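The inspector-executor paradigm mentioned above can be sketched briefly; this is an illustrative Python sketch of the scheme, not the Kali or Parti implementation, and the names and the dictionary-based "gather" stand-in are assumptions for the example:

```python
# Sketch of inspector-executor for an irregular loop "b(i) = a(x(i))".
# The inspector runs the index computation once to discover which
# accessed elements are non-local; the executor then performs the loop
# using local data plus gathered copies of the non-local elements.

def inspector(index_array, owned):
    """Return the sorted non-local indices that must be gathered."""
    return sorted({j for j in index_array if j not in owned})

def executor(a_local, fetched, index_array):
    """Run the loop body using local elements plus gathered copies."""
    store = {**a_local, **fetched}   # local memory + communication buffer
    return [store[j] for j in index_array]

owned = {0, 1, 2}                    # indices this processor owns
a_local = {0: 10.0, 1: 11.0, 2: 12.0}
x = [2, 5, 1, 5]                     # irregular (runtime-dependent) indices

need = inspector(x, owned)           # -> [5]; gathered in one message
fetched = {5: 15.0}                  # stand-in for the actual gather step
print(executor(a_local, fetched, x)) # [12.0, 15.0, 11.0, 15.0]
```

The point of the split is that the communication schedule computed by the inspector can be reused across many executions of the loop, so the analysis cost is amortized.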
The source language for the Crystal compiler built by Li and Chen at Yale University ([26]) is the functional language Crystal, which includes constructs for specifying data parallelism. Thus there is a certain amount of parallelism explicit in the original code. Experimental compilers have been constructed for the iPSC hypercube and the nCUBE; they place particular emphasis on an analysis of the communication requirements to generate efficient communication. Dataparallel C ([19]) is a SIMD extension of the C language which is a slightly modified version of the original C* for the Connection Machine. Like DINO, it is explicitly parallel and requires the user to specify a local view of computations. Dataparallel C compilers have been constructed for both shared and distributed memory machines. Cray Research Inc. has announced MPP Fortran [32], a set of language extensions to Cray Fortran which enable the user to specify the distribution of data and work. They provide intrinsics for data distribution and permit redistribution at subroutine boundaries. Further, they permit the user to structure the executing processors by giving them a shape and weighting the dimensions. Several methods for distributing iterations of loops are provided. In the Cray programming model, many of the features of shared memory parallel languages have been retained: these include critical sections, events and locks. New instructions for node I/O are provided. Other systems include AL, which has been implemented on the Warp systolic array processor [40], Pandore, a C-based system [2], Id Nouveau, a compiler for a functional language [33], Oxygen [36], ASPAR [22], Adapt, developed at the University of Southampton [29], and the Yale Extensions [10]. In a few systems, dynamic data distributions have been implemented within narrow constraints [3, 2]. 
The systems described above are not the only efforts to provide either suitable language constructs for mapping code onto DMMPs or to generate message-passing programs from higher-level code. Other important approaches include Linda [1], Strand [12], and Booster [31].

in Parallel Computing, Vol. 20, 1994

SUPERB was an interactive restructuring tool, developed at the University of Bonn, which translated Fortran 77 programs into message-passing Fortran for the Intel iPSC, the GENESIS machine, and the SUPRENUM machine. The user specified the distribution of the program's data via an interactive language. Program flow and dependence information, using both intraprocedural and interprocedural analysis techniques, was computed and made available to the user, who could select individual transformation strategies or request other services via menus. SUPERB performed coarse-grain parallelization for a distributed-memory machine and was also able to vectorize the resulting code for the individual nodes of the machine.

Extract: Conclusion
HPF is a well-designed language which can handle most data parallel scientific applications with reasonable facility. However, as architectures evolve and scientific programming becomes more sophisticated, the limitations of the language are becoming increasingly apparent. There are at least three points of view one could take:
1. HPF is too high-level a language -- MPI-style languages are more appropriate.
2. HPF is too low-level a language -- aggressive compiler technologies and improving architectures obviate the need for HPF-style compiler directives.
3. The level of HPF is about right, but extensions are required to handle some applications for some upcoming architectures.
All three of these alternatives are being actively pursued by language researchers. For example, HPC++ [?] is an effort to design an HPF-style language using C++ as a base. On the other hand, F-- [?]
is an attempt to provide a lower-level data-parallel language than HPF. Like HPF, F-- provides a single thread of flow control. But unlike HPF, F-- requires all communication to be explicit, using "get" and "put" primitives. While it is difficult to predict where languages will head, the coming generation of SMP-cluster architectures may induce new families of languages which will take advantage of the hardware support for shared-memory semantics within an SMP, while covering the limited global communication capability of the architectures. In this effort the experience gained in the development and implementation of HPF will surely serve us well.

in Parallel Computing, Vol. 20, 1994