Kali(ID:1560/kal006)


Named for the many hands of Kali in Indian statuary

Data parallel language.


Related languages
BLAZE => Kali   Evolution of
Kali => Vienna Fortran   Incorporated some features of

References:
  • Koelbel, C.; et al "Supporting Shared Data Structures on Distributed Memory Architectures"
          in SIGPLAN Notices 25(3), The Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, March 1990
  • Ozturan, Can; Sinharoy, Balaram; Szymanski, Boleslaw K. "Compiler Technology for Parallel Scientific Computation" pp201-225
    Extract: Introduction
    Introduction
    With a constant need to solve scientific and engineering problems of ever-growing complexity, there is an increasing need for software tools that provide solutions with minimal user involvement. Parallel computation is becoming indispensable in the solution of the large-scale problems that arise in science and engineering. While the use of parallel computation has been increasing, its widespread application has been hampered by the level of effort required to develop and implement the needed software. Parallel software must often be tuned to a particular parallel architecture to execute efficiently; thus, it often requires costly redesign when ported to new machines. Parallel program correctness requires the results to be independent of the number and speed of the processors. This requirement can be satisfied only if the parallel tasks are independent of each other or properly synchronized when a dependence exists. Designing proper synchronization is a major source of difficulty in ensuring parallel program correctness. Different categories of parallel architectures have led to a proliferation of dialects of standard computer languages. Varying parallel programming primitives across these dialects greatly limit parallel software portability. Poor portability of parallel programs has resulted in duplication of effort and has limited the use of developed systems.
    Parallel computation can be viewed as an interwoven description of operations that are to be applied to data distributed over the processors, and of data mapping and synchronization that dictate the data movements and the computation order. The traditional programming languages, such as Fortran, C, or C++, cope well with the task of prescribing operations to be performed. However, the description of data mapping and synchronization in such languages is often introduced by ad hoc architecture-dependent extensions. Examples are various synchronization constructs, like busy-wait, locks, or barriers, used in programs for shared-memory machines; send and receive with different semantics employed by programs for message-passing architectures; and dimension projection and data broadcast popular in programs for SIMD computers. To avoid such architecture-dependent language definitions, we propose to separate the description of operations to be performed on the data values from the definition of data mapping and synchronization needed to supply these data values to the proper processor at the proper instant of the program execution.
    With this goal in mind, we developed tools [1, 2] that (i) decompose, at least partially, the parallel program into the two (nearly) orthogonal parts described above, (ii) translate the necessary data movements into optimal form customized for the target architecture, and (iii) synthesize an overall parallel computation. Using these tools, the user can describe high-level features of a program and synthesize parallel computation from numerical algorithms, program fragments, and data structures that are implemented separately. The tools support (i) parallel task generation and their allocation to the processors, (ii) distribution of data to the processors, (iii) run-time optimization, and (iv) rapid prototyping of different parallel implementations.
    Through the application of transformation techniques, different versions of the same program can be generated from decomposed components. The synthesized computation uses load assignment, data distribution, and synchronization appropriate to the size and type of the target parallel architecture. The computation synthesis is guided by conditional dependence graphs that represent externally accessible information in each of the synthesized fragments. Usage of conditional information in dataflow analysis and parallelization significantly increases the efficiency of the generated parallel code.
    A brief description of the EPL language, its annotations, and configurations is given in Section 2. The relationship of EPL constructs and tools to different levels of parallelism is discussed in Section 3. The EPL compiler is discussed in Section 4. Section 4.4 includes an overview of our approach to scalable parallel code generation. A dynamic load management strategy for adaptive scientific computation on SIMD architectures is the topic of Section 5. Finally, conclusions and a comparison to other approaches are given in Section 6.
    Extract: Overview of the EPL Language
    Overview of the EPL Language
    EPL is a simple non-strict functional language with type inference, designed for scientific computation. Although computationally vast, scientific computations are typically quite regular both in terms of control flow patterns and employed data structures. The data structures used are usually variations of multidimensional arrays (sparse matrices, grids, jagged-edge arrays, and even some hierarchical structures can be viewed as such). Correspondingly, the EPL language is defined in terms of just a few constructs: generalized arrays and subscripts for data structures, recurrent equations for program specification, ports for process communication, and virtual processors to facilitate mapping of computation onto processors and load balancing.
    A computation is viewed in EPL as a collection of cooperating processes. A process is described by an EPL program that consists of only data declarations and annotated conditional equations. The canonical data structure is a tree with nodes that can repeat and with leaves of primitive types. In its simplest form such a tree can be viewed as a multidimensional array, with each level of the tree corresponding to a dimension of the corresponding array. Structured files are provided for communication with an external environment (in records) and with other processes (through ports). EPL enforces a single-assignment rule, i.e., each data element should be defined exactly once (the EPL compiler, however, is free to produce multiple-assignment object code). Thus equations, though syntactically reminiscent of assignment statements, are best viewed as assertions of equality.
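    As an illustration of this data model (a Python analogy of ours, not EPL code), a tree whose nodes repeat at each level behaves like a multidimensional array, and the single-assignment rule makes each equation an assertion rather than an update:

        # A tree with repeating nodes at each level, viewed as a 2-D array:
        # level-1 nodes give the first dimension, the leaves the second.
        tree = [[1.0, 2.0],
                [3.0, 4.0]]
        # Single assignment: each element of b is defined exactly once, so
        # "b[i][j] = 2 * tree[i][j]" reads as an equality assertion.
        b = [[2 * x for x in row] for row in tree]
        assert b == [[2.0, 4.0], [6.0, 8.0]]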
    The EPL programmer also defines the process interconnection network (the graph obtained by representing processes as nodes and port interconnections as edges) in the configuration file. Processes, along with the configuration files, are provided by the user to help the compiler extract the coarse-grain parallelism in the computation by generating processes and inter-process communication constructs. Configurations also allow the programmer to reuse the same process in different computations.
    2.1    Iterations
    An iteration is a staple of scientific computing. In EPL, iterations are programmed using subscripts. A subscript assumes a range of integers as its value. Subscripts give EPL a dual flavor. In the definitional view, they may be treated as universal quantifiers and equations can be viewed as logical predicates. In the operational view, they can be seen as loop control variables and each equation can be seen as a statement nested in loops implied by its subscripts.
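    To make the dual flavor concrete, here is a small Python emulation (our notation, not EPL syntax): a single equation over subscripts i and j reads definitionally as "for all i, j" and operationally as a doubly nested loop.

        # EPL-style equation: c[i,j] = a[i,j] + b[i,j].
        a = [[1, 2], [3, 4]]
        b = [[10, 20], [30, 40]]
        range_i, range_j = len(a), len(a[0])   # subscript ranges
        # Operational view: the equation nested in loops over its subscripts.
        c = [[a[i][j] + b[i][j] for j in range(range_j)]
             for i in range(range_i)]
        # Definitional view: the equality holds for every i and j.
        assert all(c[i][j] == a[i][j] + b[i][j]
                   for i in range(range_i) for j in range(range_j))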
    There is a special class of indirect indexes, called sublinear subscripts, that are used in scientific applications so often that a special construct devoted to them has been introduced in EPL. Formally, an indirect index s defined over the subscript i is sublinear to this subscript if it satisfies the following property:
    (0 ≤ s[1] ≤ 1) and (s[i] ≤ s[i+1] ≤ s[i] + 1) for i = 1, 2, ...
    It immediately follows from this definition that the sublinear subscript s[i] starts with the value of either 1 or 0 and then, with each increase of i, it is either incremented by 1 or kept unchanged. Typically, there is a condition associated with each sublinear subscript. The condition dictates when the subscript increases. This is the way a sublinear subscript is defined in EPL. For example, a sparse matrix S that is a row-major representation of a matrix D can be defined in EPL using a sublinear subscript col[j] as follows:
    subscript: col is sublinear j : D[i,j] ≠ 0; S[i,col] = D[i,j];
    Sublinear subscripts have an implicit range determined by the number of times the defining condition yields true.
    The sublinear subscripts are convenient in expressing such operations as creating a list of selected elements, operating on sparse matrices, or defining a subset of a given set. Even more important is the fact that in the implementation of a process, no new iteration has to be created for the computation associated with sublinear subscripts. Instead, all necessary computation can be nested in the iterations created for the subscripts in terms of which the considered sublinear subscript has been defined. Sublinear subscripts are also useful in defining dynamic distribution of data to processors at run-time. An example of such a definition is given in Section 5.2.
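    A Python emulation of the sparse-matrix example above may help (the variable names are ours): col advances only when the condition D[i,j] ≠ 0 holds, and all of the work stays nested in the i and j iterations that already exist.

        # Emulate the sublinear subscript col[j] defined by D[i,j] != 0.
        D = [[0, 5, 0, 7],
             [3, 0, 0, 0]]
        S = []                        # row-major storage of the nonzeros
        for i in range(len(D)):
            row, col = [], 0          # col restarts for each row
            for j in range(len(D[i])):
                if D[i][j] != 0:      # the condition dictates when col grows
                    row.append(D[i][j])    # i.e., S[i,col] = D[i,j]
                    col += 1          # incremented by 1 or kept unchanged
            S.append(row)
        assert S == [[5, 7], [3]]     # col's implicit range = #true conditions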
    2.2     Reduction
    A computation that frequently occurs in scientific applications is to apply a binary operation over an entire vector and store the result in the last element of the vector. For example, in scientific computation there is often a need to apply an associative operator (such as +, *, max, min, etc.) selectively on the elements of an array. Scan and reduce are language constructs in EPL and other parallel languages that allow such operations to be written succinctly. Reduce applied to a vector of values produces a scalar result, whereas scan produces a vector of partial results. For example, consider a matrix A multiplied by a vector X with the result placed in a vector r. This operation can be written in EPL as:
    Temp[i,j] = if j==1 then A[i,j]*X[j] else Temp[i,j-1] + A[i,j]*X[j];
    r[i] = Temp[i, range.j];
    or, even shorter, as:

    r[i] = reduce(+, A[i,j]*X[j], j);
    Such operations result in references of the form V[..., range.i, ...], where range.i indicates the range of the reduced/scanned dimension of a multidimensional array V. (In general, the EPL prefix range. denotes the size of the subscript that follows it.) The presence of such references in the program is exploited by memory optimization and scheduling, which are discussed later. A more detailed description of the language is given in [3].
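    For comparison, here is the same matrix-vector product in Python, phrased once as a scan (a vector of partial sums, like Temp) and once as a reduce (one scalar per row, like r); this is an emulation of the semantics, not EPL syntax:

        from functools import reduce
        from itertools import accumulate

        A = [[1, 2], [3, 4]]
        X = [5, 6]
        # Scan: Temp[i][j] is the partial sum of A[i][k]*X[k] for k <= j.
        Temp = [list(accumulate(A[i][j] * X[j] for j in range(len(X))))
                for i in range(len(A))]
        # Reduce: r[i] collapses the row to a scalar, i.e. Temp[i, range.j].
        r = [reduce(lambda s, t: s + t,
                    (A[i][j] * X[j] for j in range(len(X))))
             for i in range(len(A))]
        assert Temp == [[5, 17], [15, 39]] and r == [17, 39]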
    2.3    Configurations
    In our approach a parallel computation is viewed as a collection of cooperating processes. Processes are defined as functional programs. Process cooperation is described by a simple macro dataflow specification, called a configuration. Configurations support programming-in-the-large. The user can experiment with various configurations to find the one that results in the most efficient code.
    The configurator uses the dependence graph created during configuration analysis to generate an architecture-independent parallel description which is fed to the code generator. Configurations define processes (and their aggregates) and ports. Statements of the configuration represent relations between ports in different processes. They are supplied by the user to direct integration of the processes into a parallel computation.
    Processes created dynamically can communicate with ports located at parent, child, and sibling processes; each of those processes is just a copy of the same program, except the parent process, which can be arbitrary.
    [...]
    2.4    Program Decomposition through Annotations
    Annotations provide an efficient way of introducing user directives that assist the compiler in program parallelization. Annotations have been proposed in many systems by various researchers [11, 12, 13, 14, 15] and are used mainly as compiler directives. In our approach, annotations limit the feasible mappings of computation onto the processors. Hence, they are used only during the decomposition of a process into smaller fragments. This kind of annotation is similar to the ON clause used in the Kali compiler [11], Fortran D [12], or Vienna Fortran [13].
    Annotations do not have any effect on the result computed by a program. Consequently, sequential programs that have demonstrated their correctness over many years of use are good candidates for parallelization through annotations. Being orthogonal to the program description, annotations support rapid prototyping of different parallel solutions to the same problem, which can be helpful in performance tuning.
    In EPL, each equation can be annotated with the name of an array of virtual processors on which it is to be mapped. Virtual processors can be indexed by the equation's subscripts to identify instances of equations assigned to individual virtual processors. Such instances constitute the smallest granule of parallel computation. For example, for the process MVM the following annotation:

    P[i]: r[k,i] = reduce(+, A[i,j]*x[k,j], j);
    will cause the compiler to consider only the tasks that define a sequence of r vector elements. Each task will locally store one row of array A, but the vectors x[k, *] must be broadcast to all of those tasks. The above partitioning allocates a slice of the equation defined by a single subscript value. The resultant granularity may be too fine for a target architecture. However, when an annotation is indexed by a sublinear subscript, the corresponding sublinear expression dictates how the annotated equations are clustered onto the virtual processors. For example, let p be a sublinear subscript of i; then range.p is the number of physical or virtual processors. (This number may be a system constant not even known explicitly to the user; it may depend on the architecture or system load, or it may be defined by the user or a compiler directive.) Considering again the previous example of a matrix-vector multiplication, we can use the annotation:
    P[p]: r[k,i] = reduce(+, A[i,j]*x[k,j], j);
    It will distribute (or partition) the last dimension of r and A over range.p processors in a block fashion (each processor will hold ⌊range.i/range.p⌋ or ⌈range.i/range.p⌉ elements of r and rows of A). In Section 5.2 there is an example in which a different distribution is achieved using a sublinear subscript in an annotation. This distribution balances the load on the processors.
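    A small Python sketch of this block partitioning (the helper name is ours): n indices over p processors, with the first n mod p processors holding ⌈n/p⌉ indices and the rest ⌊n/p⌋.

        # Balanced block distribution: each processor holds floor(n/p) or
        # ceil(n/p) consecutive indices (assumes n >= p).
        def block_owner(i, n, p):
            small, r = n // p, n % p       # r processors get one extra index
            big = small + (1 if r else 0)
            return i // big if i < r * big else r + (i - r * big) // small

        n, p = 10, 3
        blocks = [[i for i in range(n) if block_owner(i, n, p) == q]
                  for q in range(p)]
        assert blocks == [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]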
    There are similarities as well as differences between the EPL annotations and the Fortran language extensions that have been introduced in many systems, e.g., Vienna Fortran [13, 16, 17], Fortran D [12, 18, 19], and SUPERB [20]. Vienna Fortran provides directives for array-like processor structure definition. The distribution of arrays can be specified at compile time through the use of a DIST directive with BLOCK or CYCLIC options. INDIRECT directives can be added to indicate runtime distribution. Such a distribution may have a range of valid distributions defined in its declaration. It uses an explicit mapping array to assign a distribution by an executable statement. The assigned distribution can be part of a condition in the source program. In addition to direct distribution definition, an array in Vienna Fortran can inherit a distribution from the definition of its alignment relative to some other array (and vice versa). The DIST directive can be used with options such as =A, TRANSPOSE(A), or PERMUTE(A, PERM) to align an array with, respectively, another array A, the transposed array A, or the array A with indices permuted according to the given vector PERM.
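    For contrast with the block mapping sketched above, a CYCLIC distribution deals indices out round-robin; in Python terms (our notation, not Vienna Fortran syntax):

        # CYCLIC ownership: index i lives on processor i mod p, giving
        # interleaved pieces rather than the contiguous chunks of BLOCK.
        p = 3
        cyclic = [[i for i in range(10) if i % p == q] for q in range(p)]
        assert cyclic == [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]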
    Fortran D directives are similar to those of Vienna Fortran; however, distribution is separated from alignment. In Fortran D, the DECOMPOSITION statement is first used to declare a problem domain for each computation. The ALIGN statement is then used to describe the problem mapping that defines the alignment of arrays with respect to each other. Finally, the DISTRIBUTE statement is used to map the problem and its associated arrays to the physical machine.
    In EPL, by subscripting the annotated virtual processor names and defining the appropriate ranges for the subscripts, the user can distribute the arrays in blocks, columns, or rows. The arrays can also be transposed by permuting the subscripts of annotated virtual processors. Unlike Vienna Fortran and Fortran D, EPL does not provide the user with directives for manual alignment of data. Instead, data alignment algorithms have been developed to perform this task automatically (see Section 4.4.1); hence, embedding alignment directives in source programs is not necessary.
          in Scientific Programming 3(3), 1994
  • Mehrotra, Piyush; Van Rosendale, John; Zima, Hans "High Performance Fortran: History, Status and Future" Technical Report TR 97-8, Institute for Software Technology and Parallel Systems, University of Vienna, September 1997.
    Extract: Kali and Blaze
    In the context of MIMD machines, Kali (and its predecessor BLAZE) were the first languages to introduce distribution declarations. Kali allows the dimensions of an array to be orthogonally mapped onto an explicitly declared processor array using simple regular distributions, such as block, cyclic, and block-cyclic, and more complex distributions, such as irregular distributions in which the mapping of each array element is explicitly specified. Simple forms of user-defined distribution were also permitted. Kali also introduced the idea of dynamic distributions, which allow the user to change the distribution of an array at runtime. Parallel computation is specified by means of forall loops within a global name space. The language also introduced the concept of an on clause, which allows the user to control the distribution of loop iterations across the processors. The Kali compiler was the first compiler to integrate both a static and a runtime communication strategy. When enough information was available at compile time, the compiler generated the required communication statically. In other cases, the compiler used the inspector-executor strategy to generate the communication at runtime.
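    A minimal Python sketch of the inspector-executor idea mentioned above (all names are illustrative, not Kali's API): the inspector scans the indirect indices once to build a communication schedule, and the executor reuses that schedule to fetch remote data before running the loop.

        # Inspector: record which indices are owned by other processors.
        def inspector(indices, owner, my_rank):
            schedule = {}                  # remote rank -> indices to fetch
            for i in sorted(set(indices)):
                r = owner(i)
                if r != my_rank:
                    schedule.setdefault(r, []).append(i)
            return schedule

        # Executor: perform the (simulated) fetches, then evaluate the loop.
        def executor(schedule, fetch, local, indices):
            halo = {i: fetch(r, i) for r, ix in schedule.items() for i in ix}
            return [local.get(i, halo.get(i)) for i in indices]

        # Toy run: 8 elements block-distributed over 2 ranks; rank 0's view.
        owner = lambda i: i // 4
        local = {i: 10 * i for i in range(4)}      # rank 0 owns indices 0..3
        fetch = lambda r, i: 10 * i                # stands in for a message
        sched = inspector([1, 5, 2, 6], owner, my_rank=0)
        assert sched == {1: [5, 6]}
        assert executor(sched, fetch, local, [1, 5, 2, 6]) == [10, 50, 20, 60]

    Extract: Conclusion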
    Conclusion
    HPF is a well-designed language which can handle most data-parallel scientific applications with reasonable facility. However, as architectures evolve and scientific programming becomes more sophisticated, the limitations of the language are becoming increasingly apparent. There are at least three points of view one could take:
    1. HPF is too high-level a language --- MPI-style languages are more appropriate.
    2. HPF is too low-level a language --- aggressive compiler technologies and improving architectures obviate the need for HPF-style compiler directives.
    3. The level of HPF is about right, but extensions are required to handle some applications for some upcoming architectures.

    All three of these alternatives are being actively pursued by language researchers. For example, HPC++ [?] is an effort to design an HPF-style language using C++ as a base. On the other hand, F-- [?] is an attempt to provide a lower-level data-parallel language than HPF. Like HPF, F-- provides a single thread of flow control. But unlike HPF, F-- requires all communication to be explicit using "get" and "put" primitives.

    While it is difficult to predict where languages will head, the coming generation of SMP-cluster architectures may induce new families of languages which will take advantage of the hardware support for shared-memory semantics within an SMP, while covering the limited global communication capability of the architectures. In this effort the experience gained in the development and implementation of HPF will surely serve us well.