Split-C (2522/spl019)
- Country: us
- Began: 1995
- Published: 1995
Parallel extension of C for distributed memory multiprocessors. Aims to provide efficient low-level access to the underlying machine.
"Traditionally, different programming models, e.g., shared memory, message passing, or data parallel, were supported by distinct languages on vastly different architectures. Split-C supports these models by programming conventions, rather than enforcing them through language constraints"
Related languages
C => Split-C : Extension of
Split-C => UPC : Incorporated some features of
References:
David E. Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine Yelick. "Parallel Programming in Split-C". Tech Report, Computer Science Division, University of California, Berkeley, 1993.
Abstract: We introduce the Split-C language, a parallel extension of C intended for high performance programming on distributed memory multiprocessors, and demonstrate the use of the language in optimizing parallel programs. Split-C provides a global address space with a clear concept of locality and unusual assignment operators. These are used as tools to reduce the frequency and cost of remote access. The language allows a mixture of shared memory, message passing, and data parallel programming styles while providing efficient access to the underlying machine. We demonstrate the basic language concepts using regular and irregular parallel programs and give performance results for various stages of program optimization.
Extract: Overview
Split-C is a parallel extension of the C programming language that supports efficient access to a global address space on current distributed memory multiprocessors. It retains the "small language" character of C and supports careful engineering and optimization of programs by providing a simple, predictable cost model. This is in stark contrast to languages that rely on extensive program transformation at compile time to obtain performance on parallel machines. Split-C programs do what the programmer specifies; the compiler takes care of addressing and communication, as well as code generation. Thus, the ability to exploit parallelism or locality is not limited by the compiler's recognition capability, nor is there need to second-guess the compiler transformations while optimizing the program. The language provides a small set of global access primitives and simple parallel storage layout declarations. These seem to capture most of the useful elements of shared memory, message passing, and data parallel programming in a common, familiar context. Split-C is currently implemented on the Thinking Machines Corp. CM-5, building from GCC and Active Messages [17], and implementations are underway for architectures with more aggressive support for global access. It has been used extensively as a teaching tool in parallel computing courses and hosts a wide variety of applications. Split-C may also be viewed as a compilation target for higher level parallel languages.
This paper describes the central concepts in Split-C and illustrates how these are used in the process of optimizing parallel programs. We begin with a brief overview of the language as a whole and examine each concept individually in the following sections. The presentation interweaves the example use, the optimization techniques, and the language definition concept by concept.
Extract: Split-C in a nutshell
Control Model: Split-C follows an SPMD (single program, multiple data) model, where each of the PROCS processors begins execution at the same point in a common code image. The processors may each follow a distinct flow of control and join together at rendezvous points, such as barrier(). Processors are distinguished by the value of the special constant MYPROC.
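The control model can be seen in a minimal Split-C program (an illustrative sketch; splitc_main is the entry point used by the Berkeley implementation, and the snippet assumes that convention):

```c
#include <stdio.h>

/* SPMD: every one of the PROCS processors executes this same code image. */
int splitc_main(int argc, char **argv)
{
    printf("hello from processor %d of %d\n", MYPROC, PROCS);
    barrier();                      /* rendezvous point for all processors */
    if (MYPROC == 0)
        printf("all processors passed the barrier\n");
    return 0;
}
```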
Global Address Space: Any processor may access any location in a global address space, but each processor owns a specific region of the global address space. The local region contains the processor's stack for automatic variables, static or external variables, and a portion of the heap. There is also a spread heap allocated uniformly across processors.
Global pointers: Two kinds of pointers are provided, reflecting the cost difference between local and global accesses. Global pointers reference the entire address space, while standard pointers reference only the portion owned by the accessing processor.
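A sketch of the two pointer kinds; the constructor name toglobal is an assumption here (the fuller description later in this entry mentions support for constructing global pointers from local counterparts, without fixing a name):

```c
int x;                       /* lives in the local region of MYPROC        */
int *p = &x;                 /* standard pointer: local portion only       */
int *global gp;              /* global pointer: entire global address space */

gp = toglobal(1, &x);        /* refer to x as owned by processor 1 (assumed name) */
int v = *gp;                 /* remote read: considerably costlier than *p */
*gp = v + 1;                 /* remote write to the owning processor       */
```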
Split-phase Assignment: A split-phase assignment operator (:=) allows computation and communication to be overlapped. The request to get a value from a location (or to put a value into a location) is separated from the completion of the operation.
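A minimal get/sync sketch of the split-phase operator (the overlapped work is a hypothetical placeholder):

```c
int *global gp;              /* assume gp references a remote location */
int local;

local := *gp;                /* get: initiate the read, do not wait    */
compute_something_else();    /* hypothetical work overlapping the read */
sync();                      /* join: local is defined only from here  */
```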
Signaling Store: A more unusual assignment operator (:-) signals the processor that owns the updated location that the store has occurred. This provides an essential element of message driven and data parallel execution that shared memory models generally ignore.
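A sketch of the signaling store driving an all-to-all exchange; the spread-array inbox is hypothetical, and the completion routine all_store_sync is taken from the Berkeley documentation (treat exact names as assumptions):

```c
int inbox[PROCS]::[PROCS];   /* hypothetical receive slots, one block per processor */

for (int p = 0; p < PROCS; p++)
    inbox[p][MYPROC] :- MYPROC;   /* store: no per-store acknowledgement */
all_store_sync();                 /* globally: every store has landed    */
```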
Bulk Transfer: Any of the assignment operators can be used to transfer entire records, i.e., structs. Library routines are provided to transfer entire arrays. In many cases, overlapping computation and communication becomes more attractive with larger transfer units.
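A split-phase bulk fetch might be sketched as follows; the routine name bulk_get follows the Berkeley library but should be treated as an assumption, and the overlapped work is a placeholder:

```c
double buf[256];             /* local destination buffer                    */
double *global src;          /* assume src references a remote array        */

bulk_get(buf, src, 256 * sizeof(double));   /* initiate block transfer      */
compute_on_previous_block();                /* hypothetical overlapped work */
sync();                                     /* buf is valid after the sync  */
```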
Spread Arrays: Parallel computation on arrays is supported through a simple extension of array declarations. The approach is quite different from that of HPF and its precursors because there is no separate layout declaration. Furthermore, the duality in C between arrays and pointers is carried forward to spread arrays through a second form of global pointer, called a spread pointer.
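For instance (an illustrative sketch; the layout follows the wrapped mapping described above, with the processor dimension varying fastest):

```c
double A[100]::[4];          /* 100 blocks of 4 doubles, wrapped over processors:
                              * block i is owned by processor i % PROCS */

/* Owner-computes loop: each processor initializes only its own blocks,
 * so every access below is local. */
for (int i = MYPROC; i < 100; i += PROCS)
    for (int j = 0; j < 4; j++)
        A[i][j] = 0.0;
```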
Krishnamurthy, A.; Culler, D. E.; Dusseau, A.; Goldstein, S. C.; Lumetta, S.; von Eicken, T.; and Yelick, K. "Parallel programming in Split-C"
in [ACM/IEEE] Proceedings of the 1993 ACM/IEEE conference on Supercomputing, December 1993, Portland, Oregon, United States
David E. Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Steve Luna, Thorsten von Eicken, and Katherine Yelick. "Introduction to Split-C Version 1.0". Computer Science Division, University of California, Berkeley, April 25, 1995.
Abstract: Split-C is a parallel extension of the C programming language primarily intended for distributed memory multiprocessors. It is designed around two objectives. The first is to capture certain useful elements of shared memory, message passing, and data parallel programming in a familiar context, while eliminating the primary deficiencies of each paradigm. The second is to provide efficient access to the underlying machine, with no surprises. (This is similar to the original motivation for C: to provide a direct and obvious mapping from high-level programming constructs to low-level machine instructions.) Split-C does not try to obscure the inherent performance characteristics of the machine through sophisticated transformations. This combination of generality and transparency of the language gives the algorithm or library designer a concrete optimization target. This document describes the central concepts in Split-C and provides a general introduction to programming in the language. Both the language and the document are undergoing active development, so please view the document as working notes, rather than the final language definition.
Extract: Introduction
Split-C is a parallel extension to the C programming language designed for large, distributed memory multiprocessors. Following the C tradition, Split-C is a general-purpose language, but not a "very high level" language, nor a "big" one. It strives to provide the programmer enough machinery to construct powerful parallel data structures and operate on these in a machine independent fashion with reasonable clarity. At the same time, it does not attempt to hide the fundamental performance characteristics of the machine through elaborate language constructs or visionary compilation. Whereas C "deals with the sort of objects that most sequential computers do,"[1] the extensions in Split-C deal with the additional operations that most collections of computers support. In either case, we expect the compiler to be reasonably good at address calculations, instruction scheduling, and local storage management, with the usual optimizations that pertain to these issues.
Large-scale multiprocessors introduce two fundamental concerns: there is an active thread of control on each processor and there is a new level of the storage hierarchy which involves access to remote memory modules via an interconnection network. The Split-C extensions address these two concerns under the assumption that the programmer must think about these issues in designing effective data structures and algorithms and desires a reasonable means of expressing the results of the design effort. The presence of parallelism and remote access should not unduly obscure the resulting program. The underlying machine model is a collection of processors operating in a common global address space, which is expected to be implemented as a physically distributed collection of memories. The global address space is two dimensional from the viewpoint of address arithmetic on global data structures and from a performance viewpoint, in that each processor has efficient access to a portion of the address space. We may call this the local portion of the global space. Split-C provides access to global objects in a manner that reflects the access characteristics of the interprocessor level of the storage hierarchy.
Split-C attempts to combine the most valuable aspects of shared memory programming with the most valuable aspects of message passing and data parallel programming within a coherent framework. The ability to dereference global pointers provides access to data without prearranged co-ordination between the processors on which the data happens to reside. This allows sophisticated, linked data structures to be constructed and used. Split-phase access, e.g., prefetch, allows global pointers to be dereferenced without causing the processor to stall during access. The global address space and the syntactic support for distributed data structures provide a means of documenting the global data structures in the program. This global structure is usually lost with traditional message passing, because it is implicit in the communication patterns. Algorithms that are natural to state in terms of message passing are more efficient within a global address framework with bulk transfer; they are as easy to express, and the fundamental storage requirements of the algorithm are made explicit. Traditional shared memory loses the inherent event associated with transfer of information, so even simple global operations such as tree summation are hard to express efficiently. Split-C allows notification to be associated with access to the global addresses using an approach similar to split-phase access. Data parallel programming involves phases of local computation and phases of global communication. The global communication phases are often very general, say scattering data from each processor to every other, so the global address space is very useful, but there is no need to maintain consistency on a per-operation basis. Split-C is built upon an active message substrate [AM], so the functionality of the language can easily be extended by libraries that use the lowest level communication primitive directly, while providing meaningful abstractions within a global address framework.
Extract: Split-C Primitives Overview
The extensions introduced in Split-C attempt to expose the salient features of modern multiprocessor machines in a generic fashion. The most obvious facet is simply the presence of multiple processors, each following an independent thread of control. More interesting is the presence of a very large address space that is accessed by these threads. In all recent large-scale multiprocessors this is realized by storage resources that are local to the individual processors. This trend is expected to continue. Split-C provides a range of access methods to the global address space, but encourages a "mostly local" programming style. It is anticipated that different architectures will provide varying degrees of support for direct access to remote memory. Finally, it is expected that global objects will often be shared and this requires an added degree of control in how they are accessed.
Split-C provides the following extensions to C:
- Multiple persistent threads: A Split-C program is parallel ab initio. From program begin to program end there are PROCS threads of control within the same program image. Each thread has a unique number given by a special variable MYPROC that ranges from 0 to PROCS-1. Generally, we will use the term processor to mean the thread of control or process on that processor. A variety of convenient parallel control structures can be built on this substrate and several are provided as C preprocessor (cpp) macros, but the basic language definition does not prescribe dynamic thread manipulation or task scheduling. A small family of global synchronization operations is provided to co-ordinate the entire collection of threads, e.g., barrier. No specific programming paradigm, such as data parallel, data driven, or message passing, is imposed by the language. However, these programming paradigms can be supported as a matter of convention.
- 2D Global Address Space: Any processor can access any object in a large global address space. However, the inherent two dimensional structure of the underlying machine is not lost. Each processor "owns" a specific region of the address space and is permitted to access that region via standard, local pointers. Rather than introducing a complicated set of mapping functions, as in Fortran-D, or mysterious mappings in the run-time system, as in CM-Fortran or C*, simple mapping rules are associated with multidimensional structures and global pointer types. Sophisticated mappings are supported by exploiting the relationship between arrays and pointers, as is common in C.
- Global pointers: A global pointer refers to an arbitrary object of the associated type anywhere in the system. We will use the term global object to mean an object referenced by a global pointer. A global object is "owned" entirely by a processor, which may have efficient access to the object through standard pointers. A new keyword global is introduced to qualify a pointer as meaningful to all processors. Global pointers can be dereferenced in the same manner as standard pointers, although the time to dereference a global pointer is considerably greater than that for a local pointer, perhaps up to ten times a local memory operation (i.e., a cache miss). The language provides support for allocating global objects, constructing global pointers from local counterparts, and destructuring global pointers. (In general, global objects may contain local pointers, but such pointers must be interpreted relative to the processor owning the global object.) A pointer in C references a particular object, but also defines a sequence of objects that can be referenced by arithmetic operations on the pointer. In Split-C the sequence of objects referenced by a standard pointer is entirely local to the processor. Address arithmetic on a global pointer has the same meaning as arithmetic on a standard pointer by the processor that owns the object. Hence, all the objects referenced relative to a global pointer are associated with one processor.
- Spread pointers: A second form of global pointer is provided which defines a sequence of objects that are distributed or spread across the processors. The keyword spread is used as the qualifier to declare this form of global pointer. Consecutive objects referenced by a spread pointer are "wrapped" in a helical fashion through the global address space with the processor dimension varying fastest. Each object is entirely owned by a single processor, but the consecutive element, (i.e., that referenced by ++) is on the next processor.
- Spread arrays: The duality in C between pointers and arrays is naturally extended to spread pointers and arrays that are spread across processors, called spread arrays. Spread arrays are declared by inserting a "spreader", ::, which identifies the dimensions that are to be spread across processors. All dimensions to the left of the spreader are wrapped over the processors. Dimensions to the right of the spreader define the object that is allocated within a processor. The spreader position is part of the static type, so efficient code can be generated for multidimensional access. Indexing to the left of the spreader corresponds to arithmetic on spread pointers while indexing to the right of the spreader corresponds to arithmetic on global pointers. The & operator applied to an array expression yields a pointer of the appropriate type. Generic routines that operate independent of the input layout utilize the duality between arrays and pointers to eliminate the higher dimensions.
- Split-phase assignment: A new assignment operator, :=, is introduced to split the initiation of a global access from the completion of the access. This allows the time of a global access to be masked by other useful work and the communication resources of the system to be effectively utilized. In contrast, standard assignments stall the issuing processor until the assignment is complete, to guarantee that reads and writes occur in program order. However, there are restrictions on the use of split assignments. Whereas the standard assignment operator describes arbitrary reads and one write, the split assignment operator specifies either to get the contents of a global reference into a local one or to put the contents of a local reference into a global one. Thus, arbitrary expressions are not allowed on the right hand side of a split assignment. The := initiates the transfer, but does not wait for its completion. A sync operation joins the preceding split assignments with the thread of control. A local variable assigned by a get (similarly, a global variable assigned by a put) is guaranteed to have its new value only after the following sync statement. The value of the variable prior to the sync is not defined. Variables appearing in split assignments should not be modified (either directly or through aliases) between the assignment and the following sync, and variables on the left hand side should not be read during that time. The order in which puts take effect is only constrained by sync boundaries; between those boundaries the puts may be reordered. No limit is placed on the number of outstanding assignments.
- Signaling assignment: A weaker form of assignment, called store and denoted :-, is provided to allow efficient data driven execution and global operations. Store updates a global location, but does not provide any acknowledgement of its completion to the issuing processor. Completion of a collection of such stores is detected globally using all_store_sync, executed by all processors. For global data rearrangement, in which all processors are cooperating to move data, a set of stores by the processors is followed by an all_store_sync. In addition, the recipient of a store can determine whether a certain number of stores to it have completed using store_sync, which takes the expected number of stores and waits until they have completed. This is useful for data driven execution with predictable communication patterns.
- Bulk assignment: Transfers of complete objects are supported through the assignment operators and library routines. The library operations allow for bulk transfers, which reflect the view that, in managing a storage hierarchy, the unit of transfer should increase with the access time. Moreover, bulk transfers enhance the utility of split-phase operations. A single word get is essentially a binding prefetch. The ability to prefetch an entire object or block often allows the prefetch operation to be moved out of the inner loop and increases the distance between the time where the get is issued and the time where the result is needed. The assignment and split-assignment operators transfer arbitrary data types or structs, as with the standard C assignment. However, C does not provide operators for copying entire arrays. Bulk operations are provided to operate on arrays.
- Synchronizing assignment: Concurrent access to shared objects, as occurs in manipulating linked data structures, requires that the accesses be protected under a meaningful locking strategy. Split-C libraries provide a variety of atomic access primitives, such as fetch-and-add, and a general facility for constructing locking versions of structs and manipulating them under mutual exclusion, single writer multiple reader, or other strategies.
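For example, distinct work items could be claimed through a fetch-and-add primitive of the kind the libraries provide; the function names below are hypothetical:

```c
int *global counter;         /* shared counter, owned by one processor */
int my_item;

/* Atomically read the counter and advance it, so that every
 * processor claims a distinct work item without races. */
my_item = atomic_fetch_and_add(counter, 1);   /* hypothetical library name */
process_item(my_item);                        /* hypothetical per-item work */
```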
Bjoern Haake, Klaus E. Schauser, Chris Scheiman. "Profiling a parallel language based on fine-grained communication"
Abstract: Fine tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify performance bottlenecks. In this paper we present a profiler for the fine-grained parallel programming language Split-C, which provides a simple global address space memory model. As our experience shows, it is much more challenging to profile programs that make use of efficient, low-overhead communication. We incorporated techniques which minimize profiling effects on the running program. We quantify the profiling overhead and present several Split-C applications which show that the profiler is useful in determining performance bottlenecks.
in Proceedings of the 1996 ACM/IEEE conference on Supercomputing, November 1996
Resources