Split-C (2522/spl019)
- Country: us
- Began: 1995
- Published: 1995
Parallel extension of C for distributed memory multiprocessors. Aims to provide efficient low-level access to the underlying machine.
"Traditionally, different programming models, e.g., shared memory, message passing, or data parallel, were supported by distinct languages on vastly different architectures. Split-C supports these models by programming conventions, rather than enforcing them through language constraints"
Related languages
C => Split-C : Extension of
Split-C => UPC : Incorporated some features of
References:
David E. Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine Yelick. "Parallel Programming in Split-C". Tech Report, Computer Science Division, University of California, Berkeley, 1993.
Abstract: We introduce the Split-C language, a parallel extension of C intended for high performance programming on distributed memory multiprocessors, and demonstrate the use of the language in optimizing parallel programs. Split-C provides a global address space with a clear concept of locality and unusual assignment operators. These are used as tools to reduce the frequency and cost of remote access. The language allows a mixture of shared memory, message passing, and data parallel programming styles while providing efficient access to the underlying machine. We demonstrate the basic language concepts using regular and irregular parallel programs and give performance results for various stages of program optimization.
Extract: Overview
Split-C is a parallel extension of the C programming language that supports efficient access to a global address space on current distributed memory multiprocessors. It retains the "small language" character of C and supports careful engineering and optimization of programs by providing a simple, predictable cost model. This is in stark contrast to languages that rely on extensive program transformation at compile time to obtain performance on parallel machines. Split-C programs do what the programmer specifies; the compiler takes care of addressing and communication, as well as code generation. Thus, the ability to exploit parallelism or locality is not limited by the compiler's recognition capability, nor is there need to second-guess the compiler transformations while optimizing the program. The language provides a small set of global access primitives and simple parallel storage layout declarations. These seem to capture most of the useful elements of shared memory, message passing, and data parallel programming in a common, familiar context. Split-C is currently implemented on the Thinking Machines Corp. CM-5, building from GCC and Active Messages [17], and implementations are underway for architectures with more aggressive support for global access. It has been used extensively as a teaching tool in parallel computing courses and hosts a wide variety of applications. Split-C may also be viewed as a compilation target for higher level parallel languages.
This paper describes the central concepts in Split-C and illustrates how these are used in the process of optimizing parallel programs. We begin with a brief overview of the language as a whole and examine each concept individually in the following sections. The presentation interweaves the example use, the optimization techniques, and the language definition concept by concept.
Extract: Split-C in a nutshell
Control Model: Split-C follows an SPMD (single program, multiple data) model, where each of the PROCS processors begins execution at the same point in a common code image. The processors may each follow a distinct flow of control and join together at rendezvous points, such as barrier(). Processors are distinguished by the value of the special constant MYPROC.
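The control model can be seen in a minimal Split-C program (an illustrative sketch; splitc_main is the entry point used by the Berkeley implementation, and the snippet assumes that convention):

```c
#include <stdio.h>

/* SPMD: every one of the PROCS processors executes this same code image. */
int splitc_main(int argc, char **argv)
{
    printf("hello from processor %d of %d\n", MYPROC, PROCS);
    barrier();                      /* rendezvous point for all processors */
    if (MYPROC == 0)
        printf("all processors passed the barrier\n");
    return 0;
}
```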
Global Address Space: Any processor may access any location in a global address space, but each processor owns a specific region of the global address space. The local region contains the processor's stack for automatic variables, static or external variables, and a portion of the heap. There is also a spread heap allocated uniformly across processors.
Global pointers: Two kinds of pointers are provided, reflecting the cost difference between local and global accesses. Global pointers reference the entire address space, while standard pointers reference only the portion owned by the accessing processor.
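A sketch of the two pointer kinds; the constructor name toglobal is an assumption here (the fuller description later in this entry mentions support for constructing global pointers from local counterparts, without fixing a name):

```c
int x;                       /* lives in the local region of MYPROC        */
int *p = &x;                 /* standard pointer: local portion only       */
int *global gp;              /* global pointer: entire global address space */

gp = toglobal(1, &x);        /* refer to x as owned by processor 1 (assumed name) */
int v = *gp;                 /* remote read: considerably costlier than *p */
*gp = v + 1;                 /* remote write to the owning processor       */
```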
Split-phase Assignment: A split-phase assignment operator (:=) allows computation and communication to be overlapped. The request to get a value from a location (or to put a value into a location) is separated from the completion of the operation.
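A minimal get/sync sketch of the split-phase operator (the overlapped work is a hypothetical placeholder):

```c
int *global gp;              /* assume gp references a remote location */
int local;

local := *gp;                /* get: initiate the read, do not wait    */
compute_something_else();    /* hypothetical work overlapping the read */
sync();                      /* join: local is defined only from here  */
```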
Signaling Store: A more unusual assignment operator (:-) signals the processor that owns the updated location that the store has occurred. This provides an essential element of message driven and data parallel execution that shared memory models generally ignore.
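A sketch of the signaling store driving an all-to-all exchange; the spread-array inbox is hypothetical, and the completion routine all_store_sync is taken from the Berkeley documentation (treat exact names as assumptions):

```c
int inbox[PROCS]::[PROCS];   /* hypothetical receive slots, one block per processor */

for (int p = 0; p < PROCS; p++)
    inbox[p][MYPROC] :- MYPROC;   /* store: no per-store acknowledgement */
all_store_sync();                 /* globally: every store has landed    */
```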
Bulk Transfer: Any of the assignment operators can be used to transfer entire records, i.e., structs. Library routines are provided to transfer entire arrays. In many cases, overlapping computation and communication becomes more attractive with larger transfer units.
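A split-phase bulk fetch might be sketched as follows; the routine name bulk_get follows the Berkeley library but should be treated as an assumption, and the overlapped work is a placeholder:

```c
double buf[256];             /* local destination buffer                    */
double *global src;          /* assume src references a remote array        */

bulk_get(buf, src, 256 * sizeof(double));   /* initiate block transfer      */
compute_on_previous_block();                /* hypothetical overlapped work */
sync();                                     /* buf is valid after the sync  */
```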
Spread Arrays: Parallel computation on arrays is supported through a simple extension of array declarations. The approach is quite different from that of HPF and its precursors because there is no separate layout declaration. Furthermore, the duality in C between arrays and pointers is carried forward to spread arrays through a second form of global pointer, called a spread pointer.
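For instance (an illustrative sketch; the layout follows the wrapped mapping described above, with the processor dimension varying fastest):

```c
double A[100]::[4];          /* 100 blocks of 4 doubles, wrapped over processors:
                              * block i is owned by processor i % PROCS */

/* Owner-computes loop: each processor initializes only its own blocks,
 * so every access below is local. */
for (int i = MYPROC; i < 100; i += PROCS)
    for (int j = 0; j < 4; j++)
        A[i][j] = 0.0;
```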
Krishnamurthy, A.; Culler, D. E.; Dusseau, A.; Goldstein, S. C.; Lumetta, S.; von Eicken, T.; and Yelick, K. "Parallel programming in Split-C"
in [ACM/IEEE] Proceedings of the 1993 ACM/IEEE conference on Supercomputing, December 1993, Portland, Oregon, United States
David E. Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Steve Luna, Thorsten von Eicken, and Katherine Yelick. "Introduction to Split-C Version 1.0". Computer Science Division, University of California, Berkeley, April 25, 1995.
Abstract: Split-C is a parallel extension of the C programming language primarily intended for distributed memory multiprocessors. It is designed around two objectives. The first is to capture certain useful elements of shared memory, message passing, and data parallel programming in a familiar context, while eliminating the primary deficiencies of each paradigm. The second is to provide efficient access to the underlying machine, with no surprises. (This is similar to the original motivation for C: to provide a direct and obvious mapping from high-level programming constructs to low-level machine instructions.) Split-C does not try to obscure the inherent performance characteristics of the machine through sophisticated transformations. This combination of generality and transparency of the language gives the algorithm or library designer a concrete optimization target. This document describes the central concepts in Split-C and provides a general introduction to programming in the language. Both the language and the document are undergoing active development, so please view the document as working notes, rather than the final language definition.
Extract: Introduction
Split-C is a parallel extension to the C programming language designed for large, distributed memory multiprocessors. Following the C tradition, Split-C is a general-purpose language, but not a "very high level" language, nor a "big" one. It strives to provide the programmer enough machinery to construct powerful parallel data structures and operate on these in a machine independent fashion with reasonable clarity. At the same time, it does not attempt to hide the fundamental performance characteristics of the machine through elaborate language constructs or visionary compilation. Whereas C "deals with the sort of objects that most sequential computers do,"[1] the extensions in Split-C deal with the additional operations that most collections of computers support. In either case, we expect the compiler to be reasonably good at address calculations, instruction scheduling, and local storage management, with the usual optimizations that pertain to these issues.
Large-scale multiprocessors introduce two fundamental concerns: there is an active thread of control on each processor and there is a new level of the storage hierarchy which involves access to remote memory modules via an interconnection network. The Split-C extensions address these two concerns under the assumption that the programmer must think about these issues in designing effective data structures and algorithms and desires a reasonable means of expressing the results of the design effort. The presence of parallelism and remote access should not unduly obscure the resulting program. The underlying machine model is a collection of processors operating in a common global address space, which is expected to be implemented as a physically distributed collection of memories. The global address space is two dimensional from the viewpoint of address arithmetic on global data structures and from a performance viewpoint, in that each processor has efficient access to a portion of the address space. We may call this the local portion of the global space. Split-C provides access to global objects in a manner that reflects the access characteristics of the interprocessor level of the storage hierarchy.
Split-C attempts to combine the most valuable aspects of shared memory programming with the most valuable aspects of message passing and data parallel programming within a coherent framework. The ability to dereference global pointers provides access to data without prearranged co-ordination between the processors on which the data happens to reside. This allows sophisticated, linked data structures to be constructed and used. Split-phase access, e.g., prefetch, allows global pointers to be dereferenced without causing the processor to stall during access. The global address space and the syntactic support for distributed data structures provide a means of documenting the global data structures in the program. This global structure is usually lost with traditional message passing, because it is implicit in the communication patterns. Algorithms that are natural to state in terms of message passing are more efficient within a global address framework with bulk transfer; they are as easy to express, and the fundamental storage requirements of the algorithm are made explicit. Traditional shared memory loses the inherent event associated with transfer of information, so even simple global operations such as tree summation are hard to express efficiently. Split-C allows notification to be associated with access to the global addresses using an approach similar to split-phase access. Data parallel programming involves phases of local computation and phases of global communication. The global communication phases are often very general, say scattering data from each processor to every other, so the global address space is very useful, but there is no need to maintain consistency on a per-operation basis. Split-C is built upon an active message substrate [AM], so the functionality of the language can easily be extended by libraries that use the lowest level communication primitive directly, while providing meaningful abstractions within a global address framework.
Extract: Split-C Primitives Overview
The extensions introduced in Split-C attempt to expose the salient features of modern multiprocessor machines in a generic fashion. The most obvious facet is simply the presence of multiple processors, each following an independent thread of control. More interesting is the presence of a very large address space that is accessed by these threads. In all recent large-scale multiprocessors this is realized by storage resources that are local to the individual processors. This trend is expected to continue. Split-C provides a range of access methods to the global address space, but encourages a "mostly local" programming style. It is anticipated that different architectures will provide varying degrees of support for direct access to remote memory. Finally, it is expected that global objects will often be shared and this requires an added degree of control in how they are accessed.
Split-C provides the following extensions to C:
- Multiple persistent threads: A Split-C program is parallel ab initio. From program begin to program end there are PROCS threads of control within the same program image. Each thread has a unique number given by a special variable MYPROC that ranges from 0 to PROCS-1. Generally, we will use the term processor to mean the thread of control or process on that processor. A variety of convenient parallel control structures can be built on this substrate and several are provided as C preprocessor (cpp) macros, but the basic language definition does not prescribe dynamic thread manipulation or task scheduling. A small family of global synchronization operations is provided to co-ordinate the entire collection of threads, e.g., barrier. No specific programming paradigm, such as data parallel, data driven, or message passing, is imposed by the language. However, these programming paradigms can be supported as a matter of convention.
- 2D Global Address Space: Any processor can access any object in a large global address space. However, the inherent two dimensional structure of the underlying machine is not lost. Each processor "owns" a specific region of the address space and is permitted to access that region via standard, local pointers. Rather than introducing a complicated set of mapping functions, as in Fortran-D, or mysterious mappings in the run-time system, as in CM-Fortran or C*, simple mapping rules are associated with multidimensional structures and global pointer types. Sophisticated mappings are supported by exploiting the relationship between arrays and pointers, as is common in C.
- Global pointers: A global pointer refers to an arbitrary object of the associated type anywhere in the system. We will use the term global object to mean an object referenced by a global pointer. A global object is "owned" entirely by a processor, which may have efficient access to the object through standard pointers. A new keyword global is introduced to qualify a pointer as meaningful to all processors. Global pointers can be dereferenced in the same manner as standard pointers, although the time to dereference a global pointer is considerably greater than that for a local pointer, perhaps up to ten times a local memory operation (i.e., a cache miss). The language provides support for allocating global objects, constructing global pointers from local counterparts, and destructuring global pointers. (In general, global objects may contain local pointers, but such pointers must be interpreted relative to the processor owning the global object.) A pointer in C references a particular object, but also defines a sequence of objects that can be referenced by arithmetic operations on the pointer. In Split-C the sequence of objects referenced by a standard pointer is entirely local to the processor. Address arithmetic on a global pointer has the same meaning as arithmetic on a standard pointer by the processor that owns the object. Hence, all the objects referenced relative to a global pointer are associated with one processor.
- Spread pointers: A second form of global pointer is provided which defines a sequence of objects that are distributed or spread across the processors. The keyword spread is used as the qualifier to declare this form of global pointer. Consecutive objects referenced by a spread pointer are "wrapped" in a helical fashion through the global address space with the processor dimension varying fastest. Each object is entirely owned by a single processor, but the consecutive element, (i.e., that referenced by ++) is on the next processor.
- Spread arrays: The duality in C between pointers and arrays is naturally extended to spread pointers and arrays that are spread across processors, called spread arrays. Spread arrays are declared by inserting a "spreader", ::, which identifies the dimensions that are to be spread across processors. All dimensions to the left of the spreader are wrapped over the processors. Dimensions to the right of the spreader define the object that is allocated within a processor. The spreader position is part of the static type, so efficient code can be generated for multidimensional access. Indexing to the left of the spreader corresponds to arithmetic on spread pointers while indexing to the right of the spreader corresponds to arithmetic on global pointers. The & operator applied to an array expression yields a pointer of the appropriate type. Generic routines that operate independent of the input layout utilize the duality between arrays and pointers to eliminate the higher dimensions.
- Split-phase assignment: A new assignment operator, :=, is introduced to split the initiation of a global access from the completion of the access. This allows the time of a global access to be masked by other useful work and the communication resources of the system to be effectively utilized. In contrast, standard assignments stall the issuing processor until the assignment is complete, to guarantee that reads and writes occur in program order. However, there are restrictions on the use of split assignments. Whereas the standard assignment operator describes arbitrary reads and one write, the split assignment operator specifies either to get the contents of a global reference into a local one or to put the contents of a local reference into a global one. Thus, arbitrary expressions are not allowed on the right hand side of a split assignment. The := initiates the transfer, but does not wait for its completion. A sync operation joins the preceding split assignments with the thread of control. A local variable assigned by a get (similarly, a global variable assigned by a put) is guaranteed to have its new value only after the following sync statement. The value of the variable prior to the sync is not defined. Variables appearing in split assignments should not be modified (either directly or through aliases) between the assignment and the following sync, and variables on the left hand side should not be read during that time. The order in which puts take effect is only constrained by sync boundaries; between those boundaries the puts may be reordered. No limit is placed on the number of outstanding assignments.
- Signaling assignment: A weaker form of assignment, called store and denoted :-, is provided to allow efficient data driven execution and global operations. Store updates a global location, but does not provide any acknowledgement of its completion to the issuing processor. Completion of a collection of such stores is detected globally using all_store_sync, executed by all processors. For global data rearrangement, in which all processors are cooperating to move data, a set of stores by the processors is followed by an all_store_sync. In addition, the recipient of a store can determine whether a certain number of stores to it have completed using store_sync, which takes the expected number of stores and waits until they have completed. This is useful for data driven execution with predictable communication patterns.
- Bulk assignment: Transfers of complete objects are supported through the assignment operators and library routines. The library operations allow for bulk transfers, which reflect the view that, in managing a storage hierarchy, the unit of transfer should increase with the access time. Moreover, bulk transfers enhance the utility of split-phase operations. A single word get is essentially a binding prefetch. The ability to prefetch an entire object or block often allows the prefetch operation to be moved out of the inner loop and increases the distance between the time where the get is issued and the time where the result is needed. The assignment and split-assignment operators transfer arbitrary data types or structs, as with the standard C assignment. However, C does not provide operators for copying entire arrays. Bulk operations are provided to operate on arrays.
- Synchronizing assignment: Concurrent access to shared objects, as occurs in manipulating linked data structures, requires that the accesses be protected under a meaningful locking strategy. Split-C libraries provide a variety of atomic access primitives, such as fetch-and-add, and a general facility for constructing locking versions of structs and manipulating them under mutual exclusion, single writer multiple reader, or other strategies.
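For example, distinct work items could be claimed through a fetch-and-add primitive of the kind the libraries provide; the function names below are hypothetical:

```c
int *global counter;         /* shared counter, owned by one processor */
int my_item;

/* Atomically read the counter and advance it, so that every
 * processor claims a distinct work item without races. */
my_item = atomic_fetch_and_add(counter, 1);   /* hypothetical library name */
process_item(my_item);                        /* hypothetical per-item work */
```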
Bjoern Haake, Klaus E. Schauser, Chris Scheiman. "Profiling a parallel language based on fine-grained communication"
Abstract: Fine tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify performance bottlenecks. In this paper we present a profiler for the fine-grained parallel programming language Split-C, which provides a simple global address space memory model. As our experience shows, it is much more challenging to profile programs that make use of efficient, low-overhead communication. We incorporated techniques which minimize profiling effects on the running program. We quantify the profiling overhead and present several Split-C applications which show that the profiler is useful in determining performance bottlenecks.
in Proceedings of the 1996 ACM/IEEE conference on Supercomputing, November 1996
Resources