Single Assignment Fortran (ID:3279/SAF)


Parallel Fortran with annotations and comprehensions


References:
  • Can Ozturan, Balaram Sinharoy and Boleslaw K. Szymanski, "Compiler Technology for Parallel Scientific Computation", Scientific Programming, 3(3), pp. 201-225, 1994. Extract: Introduction
    Introduction
    With a constant need to solve scientific and engineering problems of ever-growing complexity, there is an increasing need for software tools that provide solutions with minimal user involvement. Parallel computation is becoming indispensable in the solution of the large-scale problems that arise in science and engineering. While the use of parallel computation has been increasing, its widespread application has been hampered by the level of effort required to develop and implement the needed software. Parallel software often must be tuned to a particular parallel architecture to execute efficiently; thus, it often requires costly redesign when ported to new machines. Parallel program correctness requires the results to be independent of the number and speed of the processors. This requirement can be satisfied only if the parallel tasks are independent of each other or properly synchronized when a dependence exists. Designing proper synchronization is a major source of difficulty in ensuring parallel program correctness. Different categories of parallel architectures have led to a proliferation of dialects of standard computer languages. Varying parallel programming primitives for different parallel language dialects greatly limit parallel software portability. Poor portability of parallel programs has resulted in a duplication of effort and has limited the use of developed systems.
    Parallel computation can be viewed as an interwoven description of operations that are to be applied to data distributed over the processors, and of data mapping and synchronization that dictate the data movements and the computation order. The traditional programming languages, such as Fortran, C, or C++, cope well with the task of prescribing operations to be performed. However, the description of data mapping and synchronization in such languages is often introduced by ad hoc architecture-dependent extensions. Examples are various synchronization constructs, like busy-wait, locks, or barriers, used in programs for shared-memory machines, send and receive with different semantics employed by programs for message-passing architectures, and dimension projection and data broadcast popular in programs for SIMD computers. To avoid such architecture-dependent language definitions, we propose to separate the description of operations to be performed on the data values from the definition of data mapping and synchronization needed to supply these data values to the proper processor at the proper instance of the program execution.
    With this goal in mind, we developed tools [1, 2] that (i) decompose, at least partially, the parallel program into the two (nearly) orthogonal parts described above, (ii) translate the necessary data movements into optimal form customized for the target architecture, and (iii) synthesize an overall parallel computation. Using these tools the user can describe high-level features of a program and synthesize parallel computation from numerical algorithms, program fragments, and data structures that are implemented separately. The tools support (i) parallel task generation and their allocation to the processors, (ii) distribution of data to the processors, (iii) run-time optimization, and (iv) rapid prototyping of different parallel implementations.
    Through the application of transformation techniques, different versions of the same program can be generated from decomposed components. The synthesized computation uses load assignment, data distribution, and synchronization appropriate to the size and type of the target parallel architecture. The computation synthesis is guided by conditional dependence graphs that represent externally accessible information in each of the synthesized fragments. The use of conditional information in dataflow analysis and parallelization significantly increases the efficiency of the generated parallel code.
    A brief description of the EPL language, its annotations and configurations is given in Section 2. The relationship of EPL constructs and tools to different levels of parallelism is discussed in Section 3. The EPL compiler is discussed in Section 4. Section 4.4 includes an overview of our approach to scalable parallel code generation. A dynamic load management strategy for adaptive scientific computation on SIMD architectures is the topic of Section 5. Finally, conclusions and a comparison to other approaches are given in Section 6.
    Extract: Overview of the EPL Language
    Overview of the EPL Language
    EPL is a simple non-strict functional language with type inference designed for scientific computation. Although computationally vast, scientific computations are typically quite regular both in terms of control flow patterns and employed data structures. The data structures used are usually some variations of multidimensional arrays (sparse matrices, grids, jagged-edge arrays, and even some hierarchical structures can be viewed as such). Correspondingly, the EPL language is defined in terms of just a few constructs: generalized arrays and subscripts for data structures, recurrent equations for program specification, ports for process communication, and virtual processors to facilitate mapping of computation onto processors and load balancing.
    A computation is viewed in EPL as a collection of cooperating processes. A process is described by an EPL program that consists of only data declarations and annotated conditional equations. The canonical data structure is a tree with nodes that can repeat and with leaves of primitive types. In its simplest form such a tree can be viewed as a multi-dimensional array, each level of the tree corresponding to a new dimension of the corresponding array. Structured files are provided for communication with an external environment (in records) and with other processes (through ports). EPL enforces a single-assignment rule, i.e., each data element should be defined exactly once (the EPL compiler, however, is free to produce multiple-assignment object code). Thus equations, though syntactically reminiscent of assignment statements, are best viewed as assertions of equality.
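    The single-assignment rule can be illustrated with a short sketch in Python rather than EPL (the SingleAssignment class below is purely illustrative and not part of EPL or its compiler): a store that rejects redefinition makes each definition behave as an assertion of equality rather than as a mutable assignment.

        # Illustrative sketch (Python, not EPL) of the single-assignment rule:
        # each data element may be defined exactly once, so definitions act as
        # assertions of equality rather than as mutable assignments.
        class SingleAssignment(dict):
            def __setitem__(self, key, value):
                if key in self:
                    raise ValueError(f"element {key!r} is already defined")
                super().__setitem__(key, value)

        env = SingleAssignment()
        env["x"] = 3      # first definition: allowed
        # env["x"] = 4    # a second definition would raise ValueError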
    The EPL programmer also defines the process interconnection network (the graph obtained by representing processes as nodes and port interconnections as edges) in the configuration file. Processes, along with the configuration files, are provided by the user to assist the compiler in extracting the coarse-grain parallelism in the computation by generating processes and inter-process communication constructs. Configurations also allow the programmer to reuse the same process in different computations.
    2.1    Iterations
    An iteration is a staple of scientific computing. In EPL, iterations are programmed using subscripts. A subscript assumes a range of integers as its value. Subscripts give EPL a dual flavor. In the definitional view, they may be treated as universal quantifiers and equations can be viewed as logical predicates. In the operational view, they can be seen as loop control variables and each equation can be seen as a statement nested in loops implied by its subscripts.
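    The operational reading of a subscripted equation can be sketched in Python (the arrays a and y below are hypothetical, chosen only for illustration). Each subscript becomes a loop control variable and the equation body becomes the innermost statement; in the definitional reading the same equation is the predicate "for all i, j: y[i,j] = a[i,j] + 1".

        # Operational reading (Python sketch) of an EPL-style equation such as
        #   y[i,j] = a[i,j] + 1;
        a = [[1, 2], [3, 4]]                 # hypothetical input array
        range_i, range_j = len(a), len(a[0])
        y = [[None] * range_j for _ in range(range_i)]
        for i in range(range_i):             # loop implied by subscript i
            for j in range(range_j):         # loop implied by subscript j
                y[i][j] = a[i][j] + 1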
    There is a special class of indirect indexes, called sublinear subscripts, that are used in scientific applications so often that a special construct devoted to them has been introduced in EPL. Formally, an indirect index s defined over the subscript i is sublinear to this subscript if it satisfies the following property:
    (0 <= s[1] <= 1) and (s[i] <= s[i+1] <= s[i] + 1) for i = 1, 2, ...
    It immediately follows from this definition that the sublinear subscript s[i] starts with the value of either 1 or 0 and then, with each increase of i, it is either incremented by 1 or kept unchanged. Typically, there is a condition associated with each sublinear subscript. The condition dictates when the subscript increases. This is the way a sublinear subscript is defined in EPL. For example, a sparse matrix S that is a row-major representation of a matrix D can be defined in EPL using a sublinear subscript col[j] as follows:
    subscript: col is sublinear j : D[i,j] ≠ 0;
    S[i,col] = D[i,j];
    Sublinear subscripts have an implicit range determined by the number of times the defining condition yields true.
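    The effect of this definition can be sketched in Python (an illustrative rendering of the implied computation, not actual compiler output):

        # Python sketch of the sparse-matrix example above: the sublinear
        # subscript col advances only when D[i][j] != 0 holds, so S[i] packs
        # the nonzero entries of row i. No extra iteration is needed: the
        # computation nests inside the loops for i and j.
        D = [[0, 5, 0, 7],
             [3, 0, 0, 0]]
        S = []
        for i in range(len(D)):
            row = []                       # col restarts for each i
            for j in range(len(D[i])):
                if D[i][j] != 0:           # condition of the sublinear subscript
                    row.append(D[i][j])    # col is incremented; S[i,col] = D[i,j]
            S.append(row)
        # S == [[5, 7], [3]]; the range of col in row i is the number of
        # times the condition held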
    The sublinear subscripts are convenient in expressing such operations as creating a list of selected elements, operating on sparse matrices, or defining a subset of a given set. Even more important is the fact that in the implementation of a process no new iteration has to be created for the computation associated with the sublinear subscripts. Instead, all necessary computation can be nested in the iterations created for the subscripts in terms of which the considered sublinear subscript has been defined. Sublinear subscripts are also useful in defining the dynamic distribution of data to processors at run-time. An example of such a definition is given in Section 5.2.
    2.2     Reduction
    A computation that frequently occurs in scientific applications is to apply a binary operation over an entire vector and store the result in the last element of the vector. For example, in scientific computation there is often a need to apply an associative operator (such as +, *, max, min, etc.) selectively to the elements of an array. Scan and reduce are language constructs in EPL and other parallel languages that allow such operations to be succinctly written. Reduce applied to a vector of values produces a scalar result, whereas scan results in a vector of partial results. For example, consider a matrix A multiplied by a vector X with the result placed in a vector r. This operation can be written in EPL as:
    Temp[i,j] = if j==1 then A[i,j]*X[j] else Temp[i,j-1] + A[i,j]*X[j];
    r[i] = Temp[i, range.j];
    or, even shorter, as:
    r[i] = reduce(+, A[i,j]*X[j], j);
    Such operations result in references of the form V[..., range.i, ...], where range.i indicates the range of the reduced/scanned dimension of a multidimensional array V. (In general, the EPL range variable prefix denotes the size of its suffix.) The presence of such references in the program is exploited by the memory optimization and scheduling discussed later. A more detailed description of the language is given in [3].
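    The distinction between reduce and scan can be sketched in Python (illustrative only; the arrays A and X are hypothetical): scan keeps the vector of partial results, corresponding to the Temp array in the long form, while reduce collapses the scanned dimension to one value per row.

        # Python sketch of scan vs. reduce for the matrix-vector product above.
        from itertools import accumulate
        from functools import reduce
        import operator

        A = [[1, 2], [3, 4]]
        X = [5, 6]

        # scan: Temp[i][j] holds the partial sum A[i,1]*X[1] + ... + A[i,j]*X[j]
        Temp = [list(accumulate(a * x for a, x in zip(row, X))) for row in A]
        # reduce: r[i] is the full sum for row i, i.e. Temp[i][range.j]
        r = [reduce(operator.add, (a * x for a, x in zip(row, X))) for row in A]
        # Temp == [[5, 17], [15, 39]];  r == [17, 39]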
    2.3    Configurations
    In our approach a parallel computation is viewed as a collection of cooperating processes. Processes are defined as functional programs. Process cooperation is described by a simple macro dataflow specification, called a configuration. Configurations support programming-in-the-large. The user can experiment with various configurations to find the one that results in the most efficient code.
    The configurator uses the dependence graph created during configuration analysis to generate an architecture-independent parallel description which is fed to the code generator. Configurations define processes (and their aggregates) and ports. Statements of the configuration represent relations between ports in different processes. They are supplied by the user to direct integration of the processes into a parallel computation.
    Processes created dynamically can communicate with ports located at parent, child, and sibling processes; each of those processes is just a copy of the same program, except the parent process, which can be arbitrary.
    [...]
    2.4    Program Decomposition through Annotations
    Annotations provide an efficient way of introducing the user's directives that assist the compiler in program parallelization. Annotations have been proposed in many systems by various researchers [11, 12, 13, 14, 15] and are used mainly as compiler directives. In our approach annotations limit the feasible mappings of computation onto the processors. Hence, they are used only during the decomposition of a process into smaller fragments. This kind of annotation is similar to the ON clause as used in the Kali compiler [11], Fortran D [12], or Vienna Fortran [13].
    An annotation does not have any effect on the result computed by a program. Consequently, sequential programs that have demonstrated their correctness over many years of use are good candidates for parallelization through annotations. Being orthogonal to the program description, annotations support rapid prototyping of different parallel solutions to the same problem, which can be helpful in performance tuning.
    In EPL, each equation can be annotated with the name of an array of virtual processors on which it is to be mapped. Virtual processors can be indexed by the equation's subscripts to identify instances of equations assigned to individual virtual processors. Such instances constitute the smallest granule of parallel computation. For example, for the process MVM the following annotation:
    P[i]: r[k,i] = reduce(+, A[i,j]*x[k,j], j);
    will cause the compiler to consider only the tasks that define a sequence of r vector elements. Each task will locally store one row of the array A, but the vectors x[k,*] must be broadcast to all of those tasks. The above partitioning allocates a slice of the equation defined by a single subscript value. The resultant granularity may be too fine for a target architecture. However, when an annotation is indexed by a sublinear subscript, then the corresponding sublinear expression dictates how the annotated equations are clustered onto the virtual processors. For example, let p be a sublinear subscript of i; then range.p is the number of physical or virtual processors. (This number may be a system constant not even known explicitly to the user; it may depend on the architecture or the system load, or it may be defined by the user or a compiler directive.) Considering again the previous example of a matrix-vector multiplication, we can use the annotation:
    P[p]: r[k,i] = reduce(+, A[i,j]*x[k,j], j);
    It will distribute (or partition) the last dimension of r and A over range.p processors in a block fashion (each processor will hold ⌊range.i/range.p⌋ or ⌈range.i/range.p⌉ elements of r and rows of A). In Section 5.2 there is an example in which a different distribution is achieved using a sublinear subscript in an annotation. This distribution balances the load on the processors.
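    The block distribution implied by such an annotation can be sketched in Python (an illustrative stand-in for the compiler's partitioning; the helper block_partition is hypothetical):

        # Python sketch of block distribution: each of range_p processors
        # receives a contiguous block of floor(n/range_p) or ceil(n/range_p)
        # of the n rows of A (and elements of r).
        def block_partition(n, range_p):
            """Return, for each processor, the list of row indices it owns."""
            base, extra = divmod(n, range_p)  # first `extra` processors get one more row
            blocks, start = [], 0
            for p in range(range_p):
                size = base + (1 if p < extra else 0)
                blocks.append(list(range(start, start + size)))
                start += size
            return blocks

        # e.g. 10 rows over 3 processors -> block sizes 4, 3, 3
        print(block_partition(10, 3))   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]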
    There are similarities as well as differences between the EPL annotations and the Fortran language extensions that have been introduced in many systems, e.g., Vienna Fortran [13, 16, 17], Fortran D [12, 18, 19] and SUPERB [20]. Vienna Fortran provides directives for array-like processor structure definition. The distribution of arrays can be specified at compile-time through the use of a DIST directive with BLOCK or CYCLIC options. INDIRECT directives can be added to indicate run-time distribution. Such a distribution may have a range of valid distributions defined in its declaration. It uses an explicit mapping array to assign a distribution by an executable statement. The assigned distribution can be part of a condition in the source program. In addition to direct distribution definition, an array in Vienna Fortran can inherit a distribution from the definition of its alignment relative to some other array (and vice versa). The DIST directive can be used with options like =A, TRANSPOSE(A), or PERMUTE(A, PERM) to align an array with, respectively, another array A, the transposed array A, or the array A with indices permuted according to the given vector PERM.
    Fortran D directives are similar to Vienna Fortran's; however, distribution is separated from alignment. In Fortran D, first the DECOMPOSITION statement is used to declare a problem domain for each computation. The ALIGN statement is then used to describe the problem mapping that defines the alignment of arrays with respect to each other. Finally, the DISTRIBUTE statement is used to map the problem and its associated arrays to the physical machine.
    In EPL, by subscripting the annotated virtual processor names and defining the appropriate ranges for the subscripts, the user can distribute the arrays in blocks, columns, or rows. The arrays can also be transposed by permuting the subscripts of the annotated virtual processors. Unlike Vienna Fortran and Fortran D, EPL does not provide the user with directives for manual alignment of data. Instead, data alignment algorithms have been developed to perform this task automatically (see Section 4.4.1). Hence, embedding alignment directives in source programs is not necessary.