Systolic array parallel language
Systolic array parallel language for the CMU Warp Machine. Individual cells programmed.
in [ACM SIGACT-SIGPLAN] Proceedings of the Fifteenth Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, San Diego, California (January 1988) view details
This paper extends previous results of software pipelining in two ways: First, this paper shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement be obtained.
The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.
The machine is programmed using a language called W2. In W2, conventional Pascal-like control constructs are used to specie the cell programs, and asynchronous computation primitives are used to specify inter-cell communication. The Warp machine and the W2 compiler have been used extensively for about two years, in many applications such as low-level vision for robot vehicle navigation, image and signal processing, and scientific computing. Our previous papers presented an overview of the compiler and described an array level optimization that supports efficient fme-grain parallelism among cells. This paper describes the scheduling techniques used to generate code for the parallel and pipelined functional units in each cell.
This paper consists of three parts: Part I describes the software pipelining algorithm for loops containing straightline loop bodies, focusing on the extensions and improvements. Part II describes the hierarchical reduction approach, and shows how software pipelining can be applied to all loops. Part III contains an evaluation and a comparison with the trace scheduling technique.
in SIGPLAN Notices 23(07) July 1988 (Proceedings of the SIGPLAN'88 conference on Programming Language design and Implementation, June 20-24, 1988, Atlanta, Georgia, United States) view details
Programmable systolic arrays are attractive for high speed scientific computing because of their truly scalable architecture. However, systolic array programming has been an intellectually challenging problem since the architecture was made popular by H. T. Kung in the late 1970s. For many years, researchers have been studying compiler techniques for translating existing scientific code for systolic arrays. However, all these efforts were spent on compiling nested linear recurrence loops for paper systolic arrays. In 1985, a CMU group led by H. T. Kung built a programmable systolic array, the Warp machine. Since then, the Warp group has been engaged in real life problems of programming the Warp machine.
Lam wrote "Compiling processor-oblivious programs to efficient (parallel) code appears intractable at present. In order that efficiency and generality can be achieved, we choose to expose the configuration of the array to the user, while hiding the internal pipelining and parallelism in each cell in the array."
The result of her research is the W2 compiler for the Warp cell. Although the W2 compiler significantly advanced the art of systolic array programming, the user'still has to program each cell of the systolic array and manage the intercell communication explicitly. The complexity of manually parallelizing existing scientific programs into W2 programs discouraged researchers from using Warp for serious scientific computing applications.
In December 1987, we started tackling the problem of compiling processor-oblivious programs into efficient systolic array programs . For this research, we devised a programming language AL  and implemented an AL compiler for Warp. In April 1988, we compiled the LINPACK LU decomposition routine (SGEFA) and achieved a nearly 8-fold speedup for matrices of size 180x180. Within two weeks, we successfully compiled the long-wanted LINPACK SVD (SSVDC) routine, for which no one had ever successfully written a W2 program by hand. Since then, AL has been used to program large Warp applications in scientific computations.
In the paper, we describe the AL compiler for the Warp systolic array. We (1) give background information on the Warp and the W2 cell programming language, (2) introduce the AL programming language (3) give an overview of the AL compiler, (4) detail data relations, loop distribution, and data distribution, (5) evaluate our compiler by measuring and analyzing achieved performance for a set of scientific programs, (6) compare our compiler techniques with the compiler techniques for other parallel architectures, and (7) conclude this paper. Extract: The W2 language
The W2 cell programming language
W2 is a language in which the computation and communication of each cell are individually programmed. W2 has two communication primitives, SEND and RECV, for, controlling the communication channels.
A SEND statement:
instructs the cell to send a word to a communication channel (X or Y). Correspondingly, a RECV statement:
instructs a cell to receive a word from a communication channel (X or Y). Synchronization is implicitly done with communication: the SEND operation blocks if the buffer (512 words) on the channel is full; the RECV operation blocks if the buffer on the channel is empty.
The W2 compiler uses sophisticated data dependence analysis and scheduling techniques to generate code for the wide instruction word architecture. Normally, the W2 compiler can produce optimal code for most of the time-consuming innermost loops. Extract: The AL programming language
The AL programming language
AL is a language in which the entire systolic array is programmed as if it were a sequential computer. AL has three classes of data objects: scalar, array and distributed array (or DARRAY for short). Scalars and arrays are duplicated in all the cells while elements of a DARRAY are distributed among cells.
Elements of a DARRAY can either be scalars or arrays. For example, the statement
DABBAY int A
declares a DARRAY of integers and the statement
DAFLBAY int   B 
declares a DARRAY of 2-dimensional integer arrays.
We call an element of a DARRAY a slice of the DARRAY to distinguish it from an element of a normal array. The compiler distributes slices of a DARRAY among cells but keeps each slice within a cell of the systolic array.
AL has expression statements, compound statements, conditional statements (IF-THEN-ELSE), WHILE statements and DO-loop statements as in a general-purpose imperative programming language. AL also has a procedure call mechanism. However, to simplify the illustration, we will not discuss procedure calls in this paper. An AL program is just a normal sequential program. We can compile an AL program to execute on a sequential computer without forking multiple processes to simulate the systolic array.
Extract: Related work
Although systolic arrays and shared memory multiprocessors are two extremes in the spectrum of parallel computer architectures, loop distribution in systolic arrays achieves the same effect as loop parallelization in shared memory multiprocessors.
Of course, the means of achieving the same effect cannot be the same for both architectures. Loop distribution relies on data compatibility relations to generate data distribution and intercell communication, whereas loop parallelization relies on data dependence relations to generate inter-thread synchronization.
Loop distribution assigns loop iterations to cells at compile-time whereas loop parallelization either pre-schedules loop iterations among processors at compile-time or lets processors self-schedule loop iterations at run-time. Loop distribution is achieved by selecting loop iterations from a default duplicated computation within each cell, whereas loop parallelization is? achieved by forking out loop iterations from a process executing on a single processor. Although loop distribution in systolic arrays cannot be as general as loop parallelization in shared memory multiprocessors, shared memory multiprocessors cannot scale as well as systolic arrays.
Callahan and Kennedy extended FORTRAN77 with DISTRIBUTE and DECOMPOSE statements for data distribution and studied compiler techniques for distribu.ted memory parallel computers in general and Intel iPSC/2 hypercube in particular. Rogers and Pin&y  extended the functional programming language Id Nouveau with data distribution functions MAP, LOCAL, ALLOC and reported a compiler fo:r the Intel hypercube iPSC/B. Riihl and Annaratone: defined a set of compiler directives for data distribution in FORTRAN77 and reported a compiler for the K2 computer. Distributed data objects are also used in other parallel programming languages such as DIN0 for iPSC/S hypercube, BLAZE for the IBM RP3 and BBN Butterfly. Unlike AL, where the compiler uses data relations to manage data distribution, all of them require the user to explicitly define the data distribution functions. In summary, contributions of this research include:
1. a systolic array compiler which is able to generate efficient parallel code for many Warp applications in scientific computing.
2. the notion of data relations in compiling pre grams for systolic arrays.
in [SIGPLAN] SIGPLAN Notices 25(06) June 1990 (Proceedings of the ACM SIGPLAN 1990 Conference on Programming language design and implementation (PLDI)) view details