Statistical analysis language at Bell Labs

A very high-level language and an environment for data analysis and graphics. Written by Richard A. Becker, John M. Chambers, and Allan R. Wilks of the AT&T Bell Laboratories Statistics Research Department, after the manner of Tukey's EDA (exploratory data analysis). More recently, other Bell Labs researchers have made major contributions to a new modeling capability in S. The S language is the form in which S users express their computations. The environment provides facilities for data management, support for many graphics devices, etc. S is useful for computation in a wide range of applications. It is a very general tool, so applications are not restricted to any particular subject area.
References:

Extract: Background

S is a language and system for the interactive analysis of data, developed at AT&T Bell Laboratories, and currently in use on the UNIX operating system. An extensive user's guide, S: An Interactive Environment for Data Analysis and Graphics, is available. As of April 1983, about 250 sites had obtained S and over 4,500 copies of the previous user's manual had been distributed. S is being used at universities, research laboratories, and other organizations.

While sharing many characteristics with other statistical systems, S differs significantly in its design goals, its implementation, and the way it is used. The design goal for S is, most broadly stated, to enable and encourage good data analysis, that is, to provide users with specific facilities and a general environment that helps them quickly and conveniently look at many displays, summaries, and models for their data, and to follow the kind of iterative, exploratory path that most often leads to a thorough analysis. The system is designed for interactive use with simple but general expressions for the user to type, and immediate, informative feedback from the system, including graphic output on any of a variety of graphical devices. In addition, the system is open to change: even though the current system has many capabilities, a variety of ways are available to extend the system as new applications and techniques appear.

The implementation of S draws on a number of modern computing principles and techniques. Table I summarizes some of these. Many, of course, are popular concepts, although few statistical systems apply them together consistently. Some, such as hierarchical data structures, seem to be unique to S among statistical systems. Vector structures and our approach to an interface language are also novel.

Work on S began at Bell Laboratories in 1976; an initial implementation on a large Honeywell mainframe system was in use late that year.
Starting in 1978, a version of S was developed for the UNIX operating system. Since 1981, this version has been distributed outside Bell Laboratories. S represents both an evolution from earlier statistical computing work at Bell Laboratories, particularly program libraries and graphics software (see [10]), and also our opinions about what was good and bad in the software used for data analysis at the time. (For a more complete description of how S is used in actual data analysis, see [7].)

Extract: S and other systems

When the design of S began, a group of us at Bell Laboratories considered the then-existing statistical software in terms of our goal of good data analysis, particularly in an interactive, exploratory environment. There were three main approaches to doing statistics on the computer: programming in a conventional language, usually FORTRAN (this had been our own previous approach); mainframe statistical packages such as BMD, SAS, and SPSS; and a few interactive languages, notably APL. We recognized the need for better use of human resources than having to write FORTRAN programs, but found problems with the existing alternatives.

Statistical packages arose in the 1960s and were closely modeled on the idea of sequentially processing a series of records on punched cards or magnetic tape. Relatively recent user guides to BMDP [8] and SAS [16] still picture the user input as a card deck. This model has several bad influences. Good data analysis is highly iterative, responding to important facts observed in the analysis itself. Picturing analysis as processing a sequence of records through a limited set of statistical commands discourages this freewheeling interaction with the data. In particular, interactive use of the statistical packages was either not available or consisted largely of the ability to set up the card deck and run it from a terminal.
S, on the other hand, was designed with the model of a language operating on complete data sets, interactively, in a nonsequential manner. A number of modern statistical techniques, e.g., robust estimation, cannot easily be expressed in the sequential form, and are therefore hard to incorporate in some of the packages. Another result of the batch approach was the tendency to "shotgun" output, printing all the summaries likely ever to be relevant from a particular model or process. Instead, S tries to provide a wide variety of displays, particularly graphical, that can be used interactively to see the summaries that are relevant to the particular user.

Graphics, like interaction, was not part of the original design of the mainframe packages. Since 1976, many of them have added graphical facilities; however, the graphics tend to be viewed as "reports," rather than being integrated into the analysis. For example, most of the graphics add-ons do not include graphic input, which in our opinion is essential for identifying important features observed in the plots.

Extract: S and APL

The APL language, while not designed for statistical computing, offered a very different and, in many ways, more attractive approach. It was intended for interactive use, with users typing expressions that operate on whole data sets and produce immediate output at the terminal. Users can extend the language by defining interpreted "functions" that can then be used in the same way as primitive APL operators. These are all features that contribute to APL's usefulness for data analysis, and which we have incorporated into S. The consistency and functionality of APL's operators are also present in S; however, in S, such operations are normally carried out by functions rather than operators. The main problems with APL are its syntax, its data structures, and its isolation from other languages.
APL has only operators, i.e., functions with one or two arguments, and its precedence rules are different from those of ordinary algebra. For statistical applications, the latter is inconvenient for many users, and the former is a serious drawback. Statistical functions usually have a few main arguments (the data to work on) and any number of additional optional parameters or auxiliary data. They are generally awkward to express as unary or binary operators, as noted in Section 4. In S, we responded by allowing general function calls and by using common algebraic notation for expressions.

The APL data structure is the multiway array, while the result of most statistical functions tends to be less regular. A regression, for example, needs to be described by coefficients, residuals, and summaries of the numerical and statistical methods applied. Fitting this into a single multiway array is unnatural. Allowing completely general, hierarchical data structures in S let the results be expressed naturally, while allowing any data structure to be the value returned by a function hid the structure from users who had no need to extract the pieces explicitly.

The interface to user-written primitive functions discussed in Section 7 allows new functions to be defined when a purely interpretive form would be difficult to write or very inefficient. Both APL and the mainframe statistical packages made the process of interfacing to, say, a new FORTRAN-based algorithm either severely constrained (e.g., only one user-defined extension) or complicated (involving the implementation details of function interfaces). The substantial number of high-quality algorithms published by journals, such as Transactions on Mathematical Software and Applied Statistics, makes them an important source of extensions to statistical systems.

Changes in packages and languages since the development of S have often reflected concerns similar to those we felt.
Many packages have added graphics and interactive modes. A recent new version of APL moves toward more general data structures. A system built on APL, STATGRAPHICS [24], adds graphics and hides the syntax behind menu-driven interfaces. These are beneficial changes for the users of such systems; however, designing interaction, graphics, and generality in from the beginning makes for a cleaner result.

Extract: S and non-influences

In retrospect, it is clear that the evolution of S, in many respects, parallels a number of other contemporary computing activities. Our emphasis on user-extensible data structures and operations, and on removing details of data management and implementation from the user, is similar to Smalltalk [18]. The approach in S to data structures, dynamic determination of their properties, and a blending of data and "program" (in macros) has some of the flavor of many LISP-based systems. Speakeasy [17] has some of the S flavor of building an interactive user interface to make mathematical and statistical computations user-friendly, although it is more restrictive in terms of data structures and extensibility. S represents a growing approach to computing that emphasizes the effectiveness of the human as the most important design criterion, as shown by the emphasis on friendly interactive access to computing, on information hiding, and on greater flexibility through delayed binding. Our philosophy is that the effectiveness of the human is the most important criterion for design of a computer system.

in [ACM] CACM 27(05) (May 1984)

in Proc. Am. Stat. Assn. Sesquicentennial Invited Paper Sessions 1989

What were the basic ideas involved in S? Our primary goal was to bring interactive computing to bear on statistics and data analysis problems. S was designed as an interactive language based entirely on functions.
A few of these functions would be implemented as prefix or infix operators (arithmetic, subscripting, creating sequences), but most would be written with a function name followed by a parenthesized argument list. Functions would be allowed an arbitrary number of positional and keyword arguments and would return a data structure that would allow a collection of named results. Arguments that were left out would have default values. The basic data structures would be vectors and a hierarchical structure, a combination of vectors and/or other such structures (Chambers, 1978). Entire collections of data would be referred to by a single name.

The power of S and its functions would never be used, however, unless the system were extremely flexible and provided the operations that our colleagues wanted. In order to access the computations that were presently available as subroutines in the SCS library (and those that would later be written), we planned to import subroutines written in a lower-level programming language (Fortran), and we planned to allow ordinary users to do the same thing. Of course, to import an algorithm would require some sort of interface, to translate between the internal S data structures and those of the language, as well as to regularize the calling sequence. Our model for this was a diagram that represented an algorithm as a circle, with a square interface routine wrapped around it, allowing it to fit into the square slot provided for S functions.

Another important notion that came early was that the language should be defined generally, based on a formal grammar, and have as few restrictions as possible. (We were all tired of Fortran's seemingly capricious restrictions on the form of subscripts, etc.) English-like syntax had been tried in many other contexts and had generally failed to resemble a natural language in the end, so we were happy to have a simple, regular language. Other applications languages at the time seemed too complicated:
although a formal grammar might describe the basic expression syntax, there were generally many other parts. We wanted the expression syntax to cover everything. General functions and data structures were the key to this simplicity. Functions that could take arbitrary numbers of arguments as input and produce an arbitrarily complex data structure as output should be all the language would need. Infix operators for arithmetic and subscripting provided very natural expressions, but were simply syntactic sugar for underlying arithmetic and subscripting functions.

From the beginning, one of the most powerful operations in S was the subscripting operator. It comes in several flavors. First, a vector of numeric subscripts selects corresponding elements of an object. Even here there is a twist, because the vector of subscripts can contain repeats, making the result longer than the original. Negative subscripts may be an idea that originated with S: they describe which elements should not be selected. Logical subscripts select elements corresponding to TRUE values, and empty subscripts select everything. All of these subscripting operations generalize to multi-way arrays. In addition, any data structure can be subscripted as if it were a vector.

S began with several notions about data. The most common side effect in S is accomplished by the assignment function; it gives a name to an S object and causes it to be stored. This storage is persistent: it lasts across S sessions. The basic data structure in S is a vector of like elements: numbers, character strings, or logical values. Although the notion of an attribute for an S object wasn't clearly implemented until the 1988 release, from the beginning S recognized that the primary vector of data was often accompanied by other values that described special properties of the data.
For example, a matrix is just a vector of data along with an auxiliary vector named Dim that tells the dimensionality (number of rows and columns). Similarly, a time series has a Tsp attribute to tell the start time, end time, and number of observations per cycle. These vectors with attributes are known as vector structures, and this distinguishes S from most other systems. For example, in APL, everything is a multi-way array, while in Troll, everything is a time series. The LISP notion of a property list is probably the closest analogy to S attributes. The general treatment of vectors with other attributes in S makes data generalizations much easier and has naturally extended to an object-oriented implementation in recent releases (but that's for later).

Vector structures in S were treated specially in many computations. Most notably, subscripting was handled specially for arrays, with multiple subscripts allowed, one for each dimension. Similarly, time series were time aligned (influenced by Troll) for arithmetic operations.

Another data notion in the first implementation of S was the idea of a hierarchical structure, which contained a collection of other S objects. Early S structures were made up of named components; they were treated specially, with the "$" operator selecting components from structures and functions for creating and modifying structures. (There was even a nascent understanding of the need to iterate over all elements of a structure, accomplished by allowing a number instead of a name when using the "$" operator.) By the 1988 release we recognized these structures as yet another form of vector, of mode list, with an attribute vector of names. This was a very powerful notion, and it integrated the treatment of hierarchical structures with that of ordinary vectors.
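As a rough illustration of the ideas in this extract, the sketch below mimics in Python the subscripting flavors and the vector structures described above. It is not S code and not how S is implemented: the function `subscript`, the class `VectorStructure`, and its methods `element` and `component` are hypothetical names invented for this example, and column-major storage for matrices is an assumption. Only the attribute names Dim and names come from the text.

```python
# Hypothetical sketch in Python; not S syntax and not the S implementation.

def subscript(values, index=None):
    """Apply one S-style subscript (1-based) to a flat Python list."""
    if index is None:
        # Empty subscript: select everything.
        return list(values)
    index = list(index)
    if index and all(isinstance(i, bool) for i in index):
        # Logical subscript: keep elements paired with True.
        return [v for v, keep in zip(values, index) if keep]
    if index and all(i < 0 for i in index):
        # Negative subscripts: name the elements that should NOT be selected.
        dropped = {-i for i in index}
        return [v for pos, v in enumerate(values, start=1) if pos not in dropped]
    if any(i <= 0 for i in index):
        raise IndexError("mixed or zero subscripts are not handled in this sketch")
    # Positive subscripts: repeats are allowed, so the result may be longer.
    return [values[i - 1] for i in index]


class VectorStructure:
    """A flat vector of data plus a dictionary of attributes."""

    def __init__(self, data, **attrs):
        self.data = list(data)
        self.attrs = dict(attrs)

    def element(self, *subscripts):
        """Select one element; one subscript per dimension, or a single one."""
        if len(subscripts) == 1:
            # Any structure can be subscripted as if it were a vector.
            return self.data[subscripts[0] - 1]
        dim = self.attrs["Dim"]
        if len(subscripts) != len(dim):
            raise IndexError("need one subscript per dimension")
        offset, stride = 0, 1
        for sub, extent in zip(subscripts, dim):
            if not 1 <= sub <= extent:
                raise IndexError("subscript out of range")
            offset += (sub - 1) * stride   # column-major layout (assumed)
            stride *= extent
        return self.data[offset]

    def component(self, key):
        """Mimic "$": select a component by name, or by number."""
        if isinstance(key, int):
            return self.data[key - 1]
        return self.data[self.attrs["names"].index(key)]


x = [10, 20, 30, 40]
print(subscript(x, [1, 1, 4, 2, 1]))              # repeats lengthen the result
print(subscript(x, [-1, -3]))                     # drop elements 1 and 3
print(subscript(x, [True, False, True, False]))   # logical selection

# A 2 x 3 matrix: six numbers plus a Dim attribute.
m = VectorStructure([1, 2, 3, 4, 5, 6], Dim=[2, 3])
print(m.element(2, 3))   # row 2, column 3
print(m.element(4))      # same object treated as a plain vector

# A hierarchical structure: a vector of mode list with a names attribute.
fit = VectorStructure([[1.5, -0.2], [0.1, -0.1, 0.0]], names=["coef", "resid"])
print(fit.component("coef"))   # selection by name
print(fit.component(2))        # selection by position
```

Keeping the data flat and putting shape and names in attributes is what lets one selection routine serve matrices, lists, and plain vectors alike, which is the integration the extract describes.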