Split-C (2522/spl019)


Parallel extension of C for distributed memory multiprocessors. Aims to provide efficient low-level access to the underlying machine.

"Traditionally, different programming models, e.g., shared memory, message passing, or data parallel, were supported by distinct languages on vastly different architectures. Split-C supports these models by programming conventions, rather than enforcing them through language constraints"


Related languages
C => Split-C   (Split-C is an extension of C)
Split-C => UPC   (UPC incorporated features of Split-C)

References:
  • David E. Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine Yelick "Parallel Programming in Split-C" Tech Report, Computer Science Division, University of California, Berkeley, 1993
    Abstract: We introduce the Split-C language, a parallel extension of C intended for high performance programming on distributed memory multiprocessors, and demonstrate the use of the language in optimizing parallel programs. Split-C provides a global address space with a clear concept of locality and unusual assignment operators. These are used as tools to reduce the frequency and cost of remote access. The language allows a mixture of shared memory, message passing, and data parallel programming styles while providing efficient access to the underlying machine. We demonstrate the basic language concepts using regular and irregular parallel programs and give performance results for various stages of program optimization.
    External link: Online copy
    Extract: Overview

    Split-C is a parallel extension of the C programming language that supports efficient access to a global address space on current distributed memory multiprocessors. It retains the "small language'' character of  C and supports careful engineering and optimization  of programs by providing a simple, predictable cost  model. This is in stark contrast to languages that rely  on extensive program transformation at compile time  to obtain performance on parallel machines. Split-C programs do what the programmer specifies; the  compiler takes care of addressing and communication,  as well as code generation. Thus, the ability to exploit parallelism or locality is not limited by the compiler's recognition capability, nor is there need to second guess the compiler transformations while optimizing the program. The language provides a small set  of global access primitives and simple parallel storage  layout declarations. These seem to capture most of the  useful elements of shared memory, message passing,  and data parallel programming in a common, familiar context. Split-C is currently implemented on the  Thinking Machines Corp. CM-5, building from GCC  and Active Messages[17] and implementations are underway for architectures with more aggressive support  for global access. It has been used extensively as a  teaching tool in parallel computing courses and hosts  a wide variety of applications. Split-C may also be  viewed as a compilation target for higher level parallel languages.

    This paper describes the central concepts in Split-C and illustrates how these are used in the process of optimizing parallel programs. We begin with a brief  overview of the language as a whole and examine each  concept individually in the following sections. The  presentation interweaves the example use, the optimization techniques, and the language definition concept by concept. Extract: Control Model
    Split-C in a nutshell
    Control Model:
    Split-C follows an SPMD (single program, multiple data) model, where each of PROCS processors begins execution at the same point in a common code image. The processors may each follow a distinct flow of control and join together at rendezvous points, such as barrier(). Processors are distinguished by the value of the special constant MYPROC.

    Global Address Space:
    Any processor may access any location in a global address space, but each processor owns a specific region of the global address  space. The local region contains the processor's stack  for automatic variables, static or external variables,  and a portion of the heap. There is also a spread heap  allocated uniformly across processors.

    Global pointers:
    Two kinds of pointers are provided, reflecting the cost difference between local and global accesses. Global pointers reference the entire  address space, while standard pointers reference only  the portion owned by the accessing processor.

    Split-phase Assignment:
    A split-phase assignment operator (:=) allows computation and communication to be overlapped. The request to get a value from a location (or to put a value into a location) is  separated from the completion of the operation.

    Signaling Store:
    A more unusual assignment operator (:-) signals the processor that owns the updated location that the store has occurred. This provides an essential element of message driven and data parallel execution that shared memory models generally ignore.

    Bulk Transfer:
    Any of the assignment operators can be used to transfer entire records, i.e., structs. Library routines are provided to transfer entire arrays.  In many cases, overlapping computation and communication becomes more attractive with larger transfer  units.

    Spread Arrays:
    Parallel computation on arrays is supported through a simple extension of array declarations. The approach is quite different from that of  HPF and its precursors because there is no separate  layout declaration. Furthermore, the duality in C between arrays and pointers is carried forward to spread  arrays through a second form of global pointer, called  a spread pointer.
  • Krishnamurthy, A.; Culler, D. E.; Dusseau, A.; Goldstein, S. C.; Lumetta, S.; von Eicken, T.; and Yelick, K. "Parallel Programming in Split-C"
          in [ACM/IEEE] Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, December 1993, Portland, Oregon, United States
  • David E. Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Steve Luna, Thorsten von Eicken, Katherine Yelick "Introduction to Split-C Version 1.0" Computer Science Division, University of California, Berkeley, April 25, 1995
    Abstract: Split-C is a parallel extension of the C programming language primarily intended for distributed memory multiprocessors. It is designed around two objectives. The first is to capture certain useful elements of shared memory, message passing, and data parallel programming in a familiar context, while eliminating the primary deficiencies of each paradigm. The second is to provide efficient access to the underlying machine, with no surprises. (This is similar to the original motivation for C: to provide a direct and obvious mapping from high-level programming constructs to low-level machine instructions.) Split-C does not try to obscure the inherent performance characteristics of the machine through sophisticated transformations. This combination of generality and transparency of the language gives the algorithm or library designer a concrete optimization target.
    This document describes the central concepts in Split-C and provides a general introduction to programming in the language. Both the language and the document are undergoing active development, so please view the document as working notes, rather than the final language definition.
    External link: Online copy
    Extract: Introduction
    Introduction
    Split-C is a parallel extension to the C programming language designed for large, distributed memory multiprocessors. Following the C tradition, Split-C is a general-purpose language, but not a  "very high level" language, nor a "big" one. It strives to provide the programmer enough machinery  to construct powerful parallel data structures and operate on these in a machine independent  fashion with reasonable clarity. At the same time, it does not attempt to hide the fundamental  performance characteristics of the machine through elaborate language constructs or visionary  compilation. Whereas C "deals with the sort of objects that most sequential computers do,"[1]  the extensions in Split-C deal with the additional operations that most collections of computers  support. In either case, we expect the compiler to be reasonably good at address calculations,  instruction scheduling, and local storage management, with the usual optimizations that pertain  to these issues.

    Large-scale multiprocessors introduce two fundamental concerns: there is an active thread of control on each processor and there is a new level of the storage hierarchy which involves access to  remote memory modules via an interconnection network. The Split-C extensions address these two  concerns under the assumption that the programmer must think about these issues in designing  effective data structures and algorithms and desires a reasonable means of expressing the results  of the design effort. The presence of parallelism and remote access should not unduly obscure  the resulting program. The underlying machine model is a collection of processors operating in  a common global address space, which is expected to be implemented as a physically distributed  collection of memories. The global address space is two dimensional from the viewpoint of address  arithmetic on global data structures and from a performance viewpoint, in that each processor has  efficient access to a portion of the address space. We may call this the local portion of the global  space. Split-C provides access to global objects in a manner that reflects the access characteristics  of the interprocessor level of the storage hierarchy.

    Split-C attempts to combine the most valuable aspects of shared memory programming with the most valuable aspects of message passing and data parallel programming within a coherent framework. The ability to dereference global pointers provides access to data without prearranged co-ordination between the processors on which the data happens to reside. This allows sophisticated, linked data structures to be constructed and used. Split-phase access, e.g., prefetch, allows global pointers to be dereferenced without causing the processor to stall during the access. The global address space and the syntactic support for distributed data structures provide a means of documenting the global data structures in the program. This global structure is usually lost with traditional message passing, because it is implicit in the communication patterns. Algorithms that are natural to state in terms of message passing are more efficient within a global address framework with bulk transfer; they are as easy to express, and the fundamental storage requirements of the algorithm are made explicit. Traditional shared memory loses the inherent event associated with the transfer of information, so even simple global operations such as tree summation are hard to express efficiently. Split-C allows notification to be associated with access to global addresses using an approach similar to split-phase access. Data parallel programming involves phases of local computation and phases of global communication. The global communication phases are often very general, say scattering data from each processor to every other, so the global address space is very useful, but there is no need to maintain consistency on a per-operation basis.
Split-C is built upon an active message  substrate[AM], so the functionality of the language can easily be extended by libraries that use  the lowest level communication primitive directly, while providing meaningful abstractions within  a global address framework. Extract: Primitives
    Split-C Primitives Overview
    The extensions introduced in Split-C attempt to expose the salient features of modern multiprocessor machines in a generic fashion. The most obvious facet is simply the presence of multiple processors, each following an independent thread of control. More interesting is the presence of a  very large address space that is accessed by these threads. In all recent large-scale multiprocessors this is realized by storage resources that are local to the individual processors. This trend is  expected to continue. Split-C provides a range of access methods to the global address space, but  encourages a "mostly local" programming style. It is anticipated that different architectures will  provide varying degrees of support for direct access to remote memory. Finally, it is expected that  global objects will often be shared and this requires an added degree of control in how they are  accessed.

    Split-C provides the following extensions to C:

    • Multiple persistent threads: A Split-C program is parallel ab initio. From program begin to program end there are PROCS threads of control within the same program image. Each thread has a unique number given by a special variable MYPROC that ranges from 0 to PROCS - 1. Generally, we will use the term processor to mean the thread of control or process on that processor. A variety of convenient parallel control structures can be built on this substrate and several are provided as C preprocessor (cpp) macros, but the basic language definition does not prescribe dynamic thread manipulation or task scheduling. A small family of global synchronization operations is provided to co-ordinate the entire collection of threads, e.g., barrier. No specific programming paradigm, such as data parallel, data driven, or message passing, is imposed by the language. However, these programming paradigms can be supported as a matter of convention.
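    As an illustration of this SPMD model, a minimal sketch; the splitc_main entry-point name is an assumption, not taken from this document:

    ```c
    #include <stdio.h>

    /* Every one of the PROCS threads executes this same image from start
       to finish; PROCS, MYPROC, and barrier() are the Split-C primitives
       described above. */
    splitc_main()
    {
        printf("hello from processor %d of %d\n", MYPROC, PROCS);

        barrier();   /* rendezvous: no thread proceeds until all arrive */

        if (MYPROC == 0)
            printf("all %d processors checked in\n", PROCS);
    }
    ```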

    • 2D Global Address Space: Any processor can access any object in a large global address space. However, the inherent two dimensional structure of the underlying machine is not lost. Each  processor "owns" a specific region of the address space and is permitted to access that region  via standard, local pointers. Rather than introducing a complicated set of mapping functions,  as in Fortran-D, or mysterious mappings in the run-time system, as in CM-Fortran or C*,  simple mapping rules are associated with multidimensional structures and global pointer  types. Sophisticated mappings are supported by exploiting the relationship between arrays  and pointers, as is common in C.

    • Global pointers: A global pointer refers to an arbitrary object of the associated type anywhere in the system. We will use the term global object to mean an object referenced by a global pointer. A global object is "owned" entirely by a processor, which may have efficient access to the object through standard pointers. A new keyword global is introduced to qualify a pointer as meaningful to all processors. Global pointers can be dereferenced in the same manner as standard pointers, although the time to dereference a global pointer is considerably greater than that for a local pointer, perhaps up to ten times a local memory operation (i.e., a cache miss). The language provides support for allocating global objects, constructing global pointers from local counterparts, and destructuring global pointers. (In general, global objects may contain local pointers, but such pointers must be interpreted relative to the processor owning the global object.) A pointer in C references a particular object, but also defines a sequence of objects that can be referenced by arithmetic operations on the pointer. In Split-C the sequence of objects referenced by a standard pointer is entirely local to the processor. Address arithmetic on a global pointer has the same meaning as arithmetic on a standard pointer by the processor that owns the object. Hence, all the objects referenced relative to a global pointer are associated with one processor.
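    A sketch of global-pointer declaration and arithmetic as described above; the constructor name toglobal is an assumption about the library:

    ```c
    int a[8];            /* ordinary local array, owned by MYPROC       */
    int *global gp;      /* pointer meaningful to all processors        */

    gp = toglobal(MYPROC, &a[0]);  /* build a global pointer from a
                                      local one (assumed library call)  */

    *gp = 42;            /* dereferences like a standard pointer        */
    gp = gp + 3;         /* arithmetic stays on the owning processor:   */
    *gp = 7;             /* still refers to MYPROC's a[3]               */
    ```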

    • Spread pointers: A second form of global pointer is provided which defines a sequence of objects that are distributed or spread across the processors. The keyword spread is used as the qualifier to declare this form of global pointer. Consecutive objects referenced by a spread pointer are "wrapped" in a helical fashion through the global address space with the processor dimension varying fastest. Each object is entirely owned by a single processor, but the consecutive element (i.e., that referenced by ++) is on the next processor.
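    A sketch of spread-pointer arithmetic; the allocation routine all_spread_malloc is an assumed library name:

    ```c
    double *spread sp;    /* consecutive elements wrap across processors */

    sp = all_spread_malloc(PROCS * 4, sizeof(double)); /* assumed call */

    /* sp + 1 is owned by the next processor in the helical wrapping
       described above; sp + PROCS lands back on sp's owner. */
    double *spread next = sp + 1;
    ```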

    • Spread arrays: The duality in C between pointers and arrays is naturally extended to spread pointers and arrays that are spread across processors, called spread arrays. Spread arrays  are declared by inserting a "spreader", ::, which identifies the dimensions that are to be  spread across processors. All dimensions to the left of the spreader are wrapped over the  processors. Dimensions to the right of the spreader define the object that is allocated within  a processor. The spreader position is part of the static type, so efficient code can be generated  for multidimensional access. Indexing to the left of the spreader corresponds to arithmetic  on spread pointers while indexing to the right of the spreader corresponds to arithmetic  on global pointers. The & operator applied to an array expression yields a pointer of the  appropriate type. Generic routines that operate independent of the input layout utilize the  duality between arrays and pointers to eliminate the higher dimensions.
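    The spreader notation can be sketched as follows; the shapes are illustrative:

    ```c
    /* One dimension spread (wrapped) over the processors, one local
       dimension: each processor owns blocks of 1024 doubles. */
    double A[PROCS]::[1024];

    /* Purely cyclic layout: element i of X lives on processor i % PROCS. */
    double X[10000]::;

    A[MYPROC][0] = 1.0;   /* index left of :: selects the processor
                             (spread-pointer arithmetic); right of ::
                             indexes within that processor's block    */
    ```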

    • Split-phase assignment: A new assignment operator, :=, is introduced to split the initiation of a global access from the completion of the access. This allows the time of a global access to be masked by other useful work and the communication resources of the system to be effectively utilized. In contrast, standard assignments stall the issuing processor until the assignment is complete, to guarantee that reads and writes occur in program order. However, there are restrictions on the use of split assignments. Whereas the standard assignment operator describes arbitrary reads and one write, the split assignment operator specifies either to get the contents of a global reference into a local one or to put the contents of a local reference into a global one. Thus, arbitrary expressions are not allowed on the right hand side of a split assignment. The := initiates the transfer, but does not wait for its completion. A sync operation joins the preceding split assignments with the thread of control. A local variable assigned by a get (similarly, a global variable assigned by a put) is guaranteed to have its new value only after the following sync statement. The value of the variable prior to the sync is not defined. Variables appearing in split assignments should not be modified (either directly or through aliases) between the assignment and the following sync, and variables on the left hand side should not be read during that time. The order in which puts take effect is only constrained by sync boundaries; between those boundaries the puts may be reordered. No limit is placed on the number of outstanding assignments.
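    A sketch of a split-phase get overlapping communication with computation; compute_locally is a hypothetical stand-in for useful local work:

    ```c
    int buf;
    int *global src;    /* assume src references an int owned elsewhere  */

    buf := *src;        /* get: initiate the read, do not stall          */
    compute_locally();  /* overlap: buf must not be touched here         */
    sync();             /* join: only after this is buf guaranteed valid */
    ```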

    • Signaling assignment: A weaker form of assignment, called store and denoted :-, is provided to allow efficient data-driven execution and global operations. Store updates a global location, but does not provide any acknowledgement of its completion to the issuing processor. Completion of a collection of such stores is detected globally using all_store_sync, executed by all processors. For global data rearrangement, in which all processors are cooperating to move data, a set of stores by the processors is followed by an all_store_sync. In addition, the recipient of stores can determine whether a certain number of stores to it have completed using store_sync, which takes the expected number of stores and waits until they have completed. This is useful for data-driven execution with predictable communication patterns.
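    A sketch of a global rearrangement with signaling stores; the slots array is illustrative:

    ```c
    int slots[PROCS]::;     /* one slot per processor, cyclically spread */

    /* Each processor deposits its number into its right neighbor's slot;
       :- sends the update with no acknowledgement back to the sender.   */
    slots[(MYPROC + 1) % PROCS] :- MYPROC;

    all_store_sync();       /* executed by all processors: completes
                               once every outstanding store has landed  */
    ```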

    • Bulk assignment: Transfers of complete objects are supported through the assignment operators and library routines. The library operations allow for bulk transfers, which reflect the view that, in managing a storage hierarchy, the unit of transfer should increase with the access time. Moreover, bulk transfers enhance the utility of split-phase operations. A single-word get is essentially a binding prefetch. The ability to prefetch an entire object or block often allows the prefetch operation to be moved out of the inner loop and increases the distance between the time where the get is issued and the time where the result is needed. The assignment and split-assignment operators transfer arbitrary data types or structs, as with the standard C assignment. However, C does not provide operators for copying entire arrays. Bulk operations are provided to operate on arrays.
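    A sketch of a bulk split-phase transfer; the routine name bulk_get and its argument order are assumptions about the Split-C library:

    ```c
    double local_copy[1024];
    double *global remote;   /* assume it references 1024 doubles elsewhere */

    /* One bulk transfer instead of 1024 word-sized gets; the larger unit
       makes the communication/computation overlap pay off. */
    bulk_get(local_copy, remote, 1024 * sizeof(double));
    sync();                  /* local_copy is valid only after the join */
    ```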

    • Synchronizing assignment: Concurrent access to shared objects, as occurs in manipulating linked data structures, requires that the accesses be protected under a meaningful locking strategy. Split-C libraries provide a variety of atomic access primitives, such as fetch-and-add, and a general facility for constructing locking versions of structs and manipulating them under mutual exclusion, single writer multiple reader, or other strategies.
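    For example, a fetch-and-add primitive can hand out unique work indices without explicit locks; the name atomic_fetch_and_add and its signature are assumptions, not taken from this document:

    ```c
    int *global counter;   /* assume it references a shared counter */
    int mine;

    /* Atomically add 1 to *counter and return its previous value;
       each processor thereby claims a distinct index. */
    mine = atomic_fetch_and_add(counter, 1);
    ```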

  • Bjoern Haake, Klaus E. Schauser, Chris Scheiman "Profiling a parallel language based on fine-grained communication"
    Abstract: Fine tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify performance bottlenecks. In this paper we present a profiler for the fine-grained parallel programming language Split-C, which provides a simple global address space memory model. As our experience shows, it is much more challenging to profile programs that make use of efficient, low-overhead communication. We incorporated techniques which minimize profiling effects on the running program. We quantify the profiling overhead and present several Split-C applications which show that the profiler is useful in determining performance bottlenecks.
          in Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, November 1996