LISP-STAT (ID:3276/)

Stats system in Lisp

Statistical system designed as an extensible dialect of Common LISP. Implemented as a dialect of XLISP.
References:

XLISP-STAT is a statistical environment built on top of the XLISP programming language. This document is intended to be a tutorial introduction to the basics of XLISP-STAT. It is written primarily for the Apple Macintosh version, but most of the material applies to other versions as well; some points where other versions differ are outlined in an appendix. The first three sections contain the information you will need to do elementary statistical calculations and plotting. The fourth section introduces some additional methods for generating and modifying data. The fifth section describes some additional features of the Macintosh user interface that may be helpful. The remaining sections deal with more advanced topics, such as interactive plots, regression models, and writing your own functions. All sections are organized around examples, and most contain some suggested exercises for the reader.

This document is not intended to be a complete manual. However, documentation for many of the commands that are available is given in the appendix. Brief help messages for these and other commands are also available through the interactive help facility described in Section 5.1 below.

XLISP itself is a high-level programming language developed by David Betz and made available for unrestricted, non-commercial use. It is a dialect of Lisp, most closely related to the Common Lisp dialect. XLISP also contains some extensions to Lisp to support object-oriented programming. These facilities have been modified in XLISP-STAT to implement the screen menus, plots and regression models.

Several excellent books on Common Lisp are available. One example is Winston and Horn [22]. A book on XLISP itself has recently been published. Unfortunately it is based on XLISP 1.7, which differs significantly from XLISP 2.0, the basis of XLISP-STAT 2.0.

XLISP-STAT was originally developed for the Apple Macintosh.
It is now also available for UNIX systems using the X11 window system, for Sun workstations under the SunView window system, and, with only rudimentary graphics, for generic 4.[23]BSD UNIX systems.

The Macintosh version of XLISP-STAT was developed and compiled using the Lightspeed C compiler from Think Technologies, Inc. The Macintosh user interface is based on Paul DuBois' TransSkel and TransEdit libraries. Some of the linear algebra and probability functions are based on code given in Press, Flannery, Teukolsky and Vetterling [14]. Regression computations are carried out using the sweep algorithm as described in Weisberg [21]. This tutorial has borrowed several ideas from Gary Oehlert's MacAnova User's Guide [13]. Many of the on-line help entries have been adopted directly or with minor modifications from the Kyoto Common Lisp System. Most of the examples used in this tutorial have been taken from Devore and Peck [11]. Many of the functions added to XLISP-STAT were motivated by similar functions in the S statistical environment [2,3].

The present version of XLISP-STAT, Version 2.0, seems to run fairly comfortably on a Mac II or Mac Plus with 2MB of memory, but is a bit cramped with only 1MB. It will not run in less than 1MB of memory. The program will occasionally bomb with an ID=28 if it gets into a recursion that is too deep for the Macintosh stack to handle. On a 1MB Mac it may also bomb with an ID=15 if too much memory has been used for the segment loader to be able to bring in a required code segment.

Development of XLISP-STAT was supported in part by grants of an Apple Macintosh Plus computer and hard disk and a Macintosh II computer from the MinneMac Project at the University of Minnesota, by a single quarter leave granted to the author by the University of Minnesota, by grant DMS-8705646 from the National Science Foundation, and by a research contract with Bell Communications Research.
Extract: Why XLISP-STAT Exists

There are three primary reasons behind my decision to produce the XLISP-STAT environment. The first is to provide a vehicle for experimenting with dynamic graphics and for using dynamic graphics in instruction. Second, I wanted to be able to experiment with an environment supporting functional data, such as mean functions in nonlinear regression models and prior density and likelihood functions in Bayesian analyses. Finally, I was interested in exploring the use of object-oriented programming ideas for building and analyzing statistical models. I will discuss each of these points in a little more detail in the following paragraphs.

The development of high resolution graphical computer displays has made it possible to consider the use of dynamic graphics for understanding higher-dimensional structure. One of the earliest examples is the real time rotation of a three dimensional point cloud on a screen -- an effort to use motion to recover a third dimension from a two dimensional display. Other techniques that have been developed include brushing a scatterplot -- highlighting points in one plot and seeing where the corresponding points fall in other plots. A considerable amount of research has been done in this area; see, for example, the discussion in Becker and Cleveland [4] and the papers reproduced in Cleveland and McGill [8]. However, most of the software developed to date has been developed on specialized hardware, such as the TTY 5620 terminal or Lisp machines. As a result, very few statisticians have had an opportunity to experiment with dynamic graphics first hand, and still fewer have had access to an environment that would allow them to implement dynamic graphics ideas of their own. Several commercial packages for microcomputers now contain some form of dynamic graphics, but most do not allow users to customize their plots or develop functions for producing specialized plots, such as dynamic residual plots.
XLISP-STAT provides at least a partial solution to these problems. It allows the user to modify a scatter plot with Lisp functions and provides means for modifying the way in which a plot responds to mouse actions. It is also possible to add functions written in C to the program. On the Macintosh this has to be done by adding to the source code. On some UNIX systems it is also possible to compile and dynamically load code written in C or FORTRAN.

An integrated environment for statistical calculations and graphics is essential for developing an understanding of the uses of dynamic graphics in statistics and for developing new graphical techniques. Such an environment must essentially be a programming language. Its basic data types must include types that allow groups of numbers -- data sets -- to be manipulated as entire objects. But in model-based analyses numerical data are only part of the information being used. The remainder is the model itself. Sometimes a model is easily characterized by specifying a set of numbers. A normal linear regression model with errors might be described by the number of covariates, the coefficients and the error variance. On the other hand, in many cases it is easier to specify a model by specifying a function. To specify a normal nonlinear regression model, for example, one might specify the mean function. If our language is to allow us to specify this function within the language itself, then the language must support a functional data type with full rights: it has to be possible to define functions that manipulate functions, return functions, apply functions to arguments, etc.

The choice I faced was to define a language from scratch or use an existing language. Because of the complexity of issues involved in functional programming I decided to use a dialect of a well understood functional language, Lisp.
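The requirement that functions have "full rights" can be made concrete with a small sketch in standard Common Lisp. The names make-mean-function and residuals-for are hypothetical and chosen only for illustration; they are not XLISP-STAT functions, and the particular mean function is an arbitrary example. The first function returns a mean function as a value, and the second accepts any such function as an argument.

```lisp
;; Hypothetical illustration; these names are not part of XLISP-STAT.

;; Return a mean function f(x) = a * (1 - exp(-b * x)) as a closure
;; over the parameter values A and B.
(defun make-mean-function (a b)
  (lambda (x) (* a (- 1 (exp (- (* b x)))))))

;; Take a mean function as an argument, apply it to each covariate
;; value, and return the list of residuals y - f(x).
(defun residuals-for (mean-function xs ys)
  (mapcar (lambda (x y) (- y (funcall mean-function x)))
          xs ys))

(defparameter *f* (make-mean-function 2.0 0.5))
(defparameter *r* (residuals-for *f* '(0.0 1.0) '(0.0 1.0)))
```

Because the mean function is an ordinary value, a fitting routine could receive it, wrap it, or return a modified version of it without any special syntax.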
The syntax of Lisp is somewhat unfamiliar to most users of statistical packages, but it is easy to learn and several good tutorials are available in local book stores. I considered the possibility of using Lisp to write a top level interface with a more "natural" syntax, but I did not see any way of doing this without complicating access to some of the more powerful features of Lisp or running into some of the pitfalls of functional programming. I therefore decided to retain the basic Lisp top level syntax. To make the manipulation of numerical data sets easier I have redefined the arithmetic operators and basic numerical functions to work on lists and arrays of data.

Having decided to use Lisp as the basis for my environment, XLISP was a natural choice for several reasons. It has been made available for unrestricted, non-commercial use by its author, David Betz. It is small (for a Lisp system), its source code is available in C, and it is easily extensible. Finally, it includes support for object-oriented programming. Object-oriented programming has received considerable attention in recent years and is particularly natural for use in describing and manipulating graphical objects. It may also be useful for the analysis of statistical data and models. A collection of data and assumptions may be represented as an object. The model object can then be examined and modified by sending it messages. Many different kinds of models will answer similar questions, thus fitting naturally into an inheritance structure.
XLISP-STAT's implementation of linear and nonlinear regression models as objects, with nonlinear regression inheriting many of its methods from linear regression, is a first, primitive attempt to exploit this programming technique in statistical analysis.

Introduction

Lisp-Stat is an extensible statistical computing environment for data analysis, statistical instruction and research, with an emphasis on providing a framework for exploring the use of dynamic graphical methods. Extensibility is achieved by basing Lisp-Stat on the Lisp language, in particular on a subset of Common Lisp. Lisp-Stat extends standard Lisp arithmetic operations to perform element-wise operations on lists and vectors, and adds a variety of basic statistical and linear algebra functions. A portable window system interface forms the basis of a dynamic graphics system that is designed to work identically in a number of different graphical user interface environments, such as the Macintosh operating system, the X window system, and Microsoft Windows. A prototype-based object-oriented programming system is used to implement the graphics system and to allow it to be customized and adapted. The object-oriented programming system is also used as the basis for statistical model representations, such as linear and nonlinear regression models and generalized linear models.

Lisp-Stat was first released in 1989. It has been used for data analysis, as a research tool, and for implementing several larger projects (e.g. Cook and Weisberg, 1994; Young, 1993). Based on experience gained from this use, the system is currently being redesigned. The redesign is evolutionary, with backward compatibility a major objective.

The redesign project can be divided into six major segments: the basic Lisp system, data representation and operating system issues, the object system, the graphical system, the statistical component, and the user interface. They will be attacked in this order.
The redesign of the basic Lisp system is nearly complete, and some of the changes are outlined in the next section. The third section describes some of the issues involved in the later stages of the revision; this section is more speculative in nature. The final section briefly discusses the importance of extensibility in a statistical software environment.

Extract: New features

Lisp-Stat was originally designed as a specification to be implemented on various Lisp systems. The requirement on the Lisp system base is that it support an appropriate subset of the Common Lisp standard (Steele, 1990). The reason for using the Common Lisp specification as a base was that Common Lisp is a rich, high-level language with many features that are already provided and do not need to be designed and documented from scratch.

Even though the XLISP language lacked some important Common Lisp features, it was useful as an initial implementation base for Lisp-Stat since it was small and freely available in source form. It was hoped that a transition to a full Common Lisp implementation could be made in the future. Unfortunately this hope has not been fulfilled. There are a number of reasons, including the continued high cost of commercial Common Lisp implementations, the uncertain future of free and of commercial implementations, and the lack of standardization in window system and foreign function interfaces.

Instead, XLISP has been brought closer to a full Common Lisp implementation by adding many Common Lisp functions and some key missing features. The most important added features are multiple values, packages, typed vectors, and a byte code compiler. Other changes include a new garbage collector and new random number generators. Many of these changes have been folded into the standard XLISP distribution. Other features contributed to the standard distribution by Tom Almy and others have also been or will shortly be incorporated into the XLISP-STAT base.
In particular, Tom Almy's unlimited precision integer arithmetic functions will be added in the near future.

2.1 New Common Lisp Features in XLISP

2.1.1 Multiple Values

Multiple values are useful when a function needs to return one primary value and several secondary ones that may be but often are not of interest. Using multiple values avoids the need to make and take apart a list. The hash table lookup function gethash, for example, returns the item found as its first value or NIL if no item is found. A second value is T if an item was found, NIL otherwise. This makes it possible to distinguish an item with value NIL from an item not found.

Several other high level languages support multiple values. One example is MATLAB. For example, the eig function in MATLAB returns only the eigenvalues when a single answer is requested; if two values are asked for, it returns the eigenvectors and eigenvalues. Functions that are only called for their side effects can return no values.

2.1.2 Packages

If a language is to allow the development of substantial subsystems, then it is critical to provide some form of name space management to allow a system to export only its interface and to hide and protect implementation details. Common Lisp manages name spaces by organizing its symbols into collections called packages. Each package is divided into internal and external (or exported) symbols. A package can use other packages, thus making their symbols accessible within the package. Within a package, only symbols in the package and external symbols of packages used by the package can be referenced directly using their names. As a simple example, if a file contains the code

    (defpackage "MY-PACKAGE"
      (:use "COMMON-LISP")
      (:export "MY-FUNCTION"))
    (in-package "MY-PACKAGE")
    (defun utility () ...)
    (defun my-function () ... (utility) ...)
then all symbols in the "COMMON-LISP" package and all symbols like utility that are in "MY-PACKAGE" are accessible in "MY-PACKAGE", but only the exported symbol my-function will be available to other packages that use "MY-PACKAGE".

Packages are not modules in the sense of ADA, MODULA-2 or MODULA-3, and they have many shortcomings: they do not allow separate exporting of variables and functions, only symbols; symbols cannot be imported under alternate names; there is no support for organizing separate compilation of system components. But they are a useful first step and can be used as the basis for more sophisticated module systems. Support for Common Lisp packages is now available in XLISP; a proper module system (e.g. Curtis and Rauen, 1990; Davis et al., 1994) may be added in the future.

2.1.3 Pathnames

The Common Lisp pathname functions allow the portable specification of hierarchical directory structures. For example, the expression

    (make-pathname :directory '(:relative "a") :name "b")

produces "a/b" in UNIX, "a\b" in MS-DOS, and ":a:b" on the Macintosh. Using these functions it is possible to describe directory structures of a system in a portable way.

2.1.4 Typed Vectors

Vectors and arrays can be restricted to contain only elements of certain specified types. This allows more efficient storage of floating point data and also facilitates the interface to C code by allowing the address of the vector data to be passed directly to a C function.

The linear algebra subsystem of Lisp-Stat is being re-implemented to take advantage of this ability. In particular, an interface to Level 1 BLAS and some Level 2 and Level 3 BLAS routines (Anderson et al., 1992) will be provided to allow destructive modification of floating point arrays. The details of the interface are still under development. Once they have been completed, they will allow users to implement efficient linear algebra routines at the Lisp level.
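As a rough sketch of what typed vectors and destructive floating point operations look like at the Lisp level, the following uses only standard Common Lisp. The names make-double-vector and axpy-in-place are hypothetical; they are not the BLAS interface mentioned above, whose details the text says were still under development, and whether a given XLISP-STAT release accepts exactly this element type is an assumption.

```lisp
;; A vector restricted to double-float elements (standard Common Lisp).
(defun make-double-vector (n &optional (initial 0d0))
  (make-array n :element-type 'double-float :initial-element initial))

;; Hypothetical Level 1 BLAS-style operation: y <- a*x + y.
;; The vector Y is modified in place and returned.
(defun axpy-in-place (a x y)
  (dotimes (i (length x) y)
    (setf (aref y i) (+ (* a (aref x i)) (aref y i)))))

(defparameter *x* (make-double-vector 3 1d0))
(defparameter *y* (make-double-vector 3 2d0))
(axpy-in-place 0.5d0 *x* *y*)   ; every element of *y* is now 2.5d0
```

Updating the vector in place avoids allocating a new result for each operation, which is the point of allowing destructive modification of floating point arrays.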
2.2 The Byte Code Compiler

The byte code compiler translates a Lisp function definition into a string of bytes that form an instruction sequence for a virtual machine (VM), a fast interpreter for the byte code language. Interpreting byte code is not as fast as executing native machine code, but with a good design the interpreter overhead can be minimized. Byte codes themselves are usually machine-independent, thus making it possible to transfer byte compiled files from one machine to another. In addition, the VM can be implemented in C, thus eliminating hardware dependencies of a native code compiler.

To illustrate what the compiler does, consider the function for adding up a list of numbers shown in Figure 1a. The dolist macro is expanded in Figure 1b to show the options for local transfer of control in the loop body (the inner tagbody) and for nonlocal exit (the enclosing block) that an interpreter has to consider. The compiler recognizes that neither of these is needed and is able to simplify the code down to the set of instructions shown in Figure 2.

Unlike many other byte code VMs, the XLISP VM is not based on a stack model. Instead, the basic instructions are of a three-address-code nature (Aho et al., 1986, Chapter 8). Thus the instruction (add2 x y z) adds the values stored at offsets x and y and stores the result at offset z from the current frame base. This design seems to produce faster code than a stack-based design for benchmarks that should be representative of statistical applications.

When the example function given here is applied to a list of 1000 integers, the byte compiled code is approximately ten times faster than the interpreted version. Functions in which most iteration is already done in the vectorized code will experience a much smaller improvement.

The use of byte codes has a long history, including, for example, the pcode of the UCSD Pascal system.
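The figures referred to in the summation example are not reproduced in this extract. As a stand-in, the sketch below shows in standard Common Lisp the kind of function being described and one possible hand-written expansion of its dolist loop. The function names are hypothetical, and the expansion is only one plausible form, not the code from the paper and not the output of any particular implementation.

```lisp
;; A function that adds up a list of numbers using DOLIST.
(defun sum-list (numbers)
  (let ((total 0))
    (dolist (x numbers total)
      (setf total (+ total x)))))

;; A hand-written expansion of the same loop. The enclosing BLOCK
;; allows a nonlocal exit with RETURN, and the inner TAGBODY would
;; allow GO tags in the loop body. An interpreter has to set up both,
;; even though the body here does not need a GO tag.
(defun sum-list-expanded (numbers)
  (let ((total 0))
    (block nil
      (let ((tail numbers))
        (tagbody
         next
           (when (endp tail) (return total))
           (let ((x (car tail)))
             (tagbody (setf total (+ total x))))
           (setf tail (cdr tail))
           (go next))))))
```

Both definitions compute the same result; the second only makes visible the control structure that a compiler is free to simplify away.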
Recent versions of the Microsoft C compiler have re-introduced the use of byte code as an option to take advantage of the fact that byte code is often more compact than native machine code. Another recent use of byte code is in the Java language (Gosling, 1995), where the machine-independence of byte code is used to allow transferring compiled small applications, or applets, for local use by the HotJava World Wide Web browser.

The XLISP byte code compiler is based on the design of the ORBIT Scheme compiler (Krantz et al., 1986), which uses conversion to continuation passing style to support a variety of code transformation optimizations (Friedman et al., 1992). The code produced is properly tail recursive; thus iterative computations expressed using recursion will be compiled to iterative code.

Even though the XLISP byte code compiler can already speed up computations considerably, there is still room for improvement. Additional code analysis and support for type declarations will in some cases allow direct use of native machine data types for integers and floating point numbers instead of boxed representations. Optimization strategies designed to improve imperative code, such as static single assignment analysis, which can be related to continuation passing representation (Kelsey, 1995), may also help. It is also possible to replace byte code on a particular machine by threaded code, or to generate C code from the intermediate assembly code and use a local C compiler to produce native code.

The compiler developed up to now is a standard Lisp compiler with only very minimal adaptations to statistical applications. Future work will explore the possibility of incorporating support for vectorized arithmetic and graphical operations at the compiler level in order to optimize performance in statistical applications.

2.3 New Garbage Collection System

The original XLISP memory management system used a mark-and-sweep garbage collector.
This collector has the advantage of requiring only two bits of storage per node to implement, but the disadvantage of scanning the entire heap on each collection. With a large heap this can result in pauses long enough to degrade interactive performance.

To address this problem, the mark-and-sweep collector was replaced by a simple two-generation generational collector in the spirit of Appel (1989). Generational collectors are based on the assumption that most allocated objects are very short-lived. By distinguishing recently allocated objects from older ones, the collector can usually reclaim adequate space from minor collections in which only the newer nodes are examined. Only rarely is a full collection involving all nodes required. Since the number of active new nodes in the system at any given time is usually very small, the minor collections are very fast and hardly noticeable. Major collections take about as long as mark-and-sweep collections, but occur much less frequently.

Generational collectors are usually implemented as copying collectors, but the resulting data motion would make designing functions that call back to XLISP from C or FORTRAN quite difficult. A treadmill-type in-place design (Baker, 1992; Wilson, 1992) was therefore used. The nominal space overhead for this approach is considerably larger than for mark-and-sweep: six bytes per node on 32-bit hardware. However, on many workstations alignment requirements force enough free space into each node to accommodate this overhead, thus eliminating the space cost on these systems.

A compromise that may be worth exploring is to have a first generation that is copied into a fixed second generation. This may provide the advantages of fast allocation achieved by copying collectors without some of the drawbacks that moving data has for call-backs (Doligez and Leroy, 1993).

More work is needed to optimize tuning of the new memory management system to typical statistical activities.
The use of adaptive tuning strategies may be explored as well. Support for weak pointers and finalization will also be added.

2.4 New Random Number Generators

The Marsaglia lagged Fibonacci generator used in older versions of XLISP-STAT has been replaced as the default generator by L'Ecuyer's version of the Wichmann-Hill generator (L'Ecuyer, 1986; Bratley et al., 1987, Algorithm UNIFL). The original generator is still available, mainly to allow results produced with this generator to be reproduced. Two additional generators are available as well: Marsaglia's Super-Duper generator as used in S, and a combined Tausworthe generator of Tezuka and L'Ecuyer (1991). Random states now contain both generator and seed information.

Having several very different generators available is useful for examining the possible sensitivity of simulation results to the generation mechanism. At present the set of available generators is fixed. In the future, a mechanism for adding new generators will be provided.

Extract: Future Directions

3 Future Directions

3.1 Additional Data Representations

Until recently data sets of floating point numbers could only be represented in Lisp-Stat as lists or as generic vectors. This requires storing each number in a separate node, and can be quite wasteful. With the addition of typed arrays, it is now possible to use more compact storage. Once typed arrays have been fully integrated, this should increase the size of data sets that can be handled conveniently on standard memory configurations to the level of hundreds of thousands of observations.

For larger data sets in the range of several millions of observations, more effective representations will be needed. One possibility is to allow the contents of disk files to be treated as an array. Memory mapped file support may be useful on operating systems where it is available. Since large data sets might only be accessible over a network, remotely stored arrays should be supported as well.
To reflect the fact that files may be read-only, it will be necessary to allow arrays to be made read-only as well. It will also be useful to be able to reference smaller subsets of larger arrays indirectly, to support shared sub-arrays.

Once adequate support for basic handling of larger data sets is available, algorithms for sparse array manipulation will need to be added, and other algorithms will need to be re-examined to ensure that they have adequate numerical properties even for large input arrays. To support adding new algorithms, the current minimal C and FORTRAN interface will need to be improved. Recent developments that have resulted in the inclusion of shared libraries in most operating systems will greatly facilitate this effort.

3.2 Communication and Parallel Processing

The ability to communicate with other applications running locally or remotely is becoming increasingly important. Several new languages have been proposed recently with a structure designed to allow them to take advantage of features of the World Wide Web. Two examples are Java (Gosling and McGilton, 1995) and Obliq (Cardelli, 1995). Lisp-Stat has already been used as a teaching tool in conjunction with the World Wide Web (Rossini and Rosenberger, 1994). Its use with the Web can be enhanced by adding some of the ideas found in Java as well as some lower level communication mechanisms. Security issues that have played a major role in the design of Java will also need to be examined to ensure that Lisp-Stat can be used safely with the Web.

Adding basic interprocess communication mechanisms such as sockets and X properties for UNIX, Apple events for the Macintosh, and DDE and OLE for MS Windows, will allow Lisp-Stat to take advantage of other applications available in those environments. In addition, in a networked environment these mechanisms can form the basis of a parallel processing environment. The PVM system under UNIX (Geist et al., 1994) is designed around this approach.
Either a similar system can be implemented, or an interface to PVM can be provided to allow Lisp-Stat to take advantage of the multiple workstation environments that are now quite common.

Another form of parallelism worth exploring is the use of threads or light-weight processes with shared global memory. Allowing long-running computations to coexist with a graphical user interface is accomplished much more naturally with a threads mechanism than with the form of manual implementation that is currently required. In addition, threads allow a system to take advantage of shared memory multiprocessors, which are also becoming more common. The SR language (Andrews and Olsson, 1993) provides a useful framework for integrating both separate processes and threads.

One component of Lisp-Stat that is inherently parallel, though the current implementation is serial, is the vectorized arithmetic system. Recent advances in the understanding of nested parallel vector languages (NESL, Blelloch, 1994; Proteus, Goldberg et al., 1994) may be useful in redesigning this system to be more expressive by making it easier to define vectorized functions at the user level, and more efficient by allowing parallel architecture to be exploited when it is available. One possibility is to re-implement the Lisp-Stat vectorized arithmetic system using the CVL library (Blelloch et al., 1994), which provides implementations for workstations, the Connection Machines CM2 and CM5, the Cray Y-MP and the MasPar MP2.

3.3 The Object System

The Lisp-Stat object system is both unusual and conventional. It is unusual in being based on prototypes rather than classes, and it is conventional in using only single dispatching for handling methods. The use of prototypes instead of classes seems to have been successful, and a number of recent object-oriented languages with a similar emphasis on interactive use have taken this route as well.
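To make the combination of prototypes and single dispatch concrete, here is a toy sketch written in standard Common Lisp. It is not the Lisp-Stat implementation, and every name in it (make-proto-object, send-message, slot-of, find-method-in) is hypothetical; Lisp-Stat itself provides its own forms for this purpose, such as defproto, defmeth and send. Objects are created from other objects rather than from classes, and a method is selected by looking at the receiving object alone. The slot lookup also uses the second value returned by gethash to tell a missing slot from one whose value is NIL.

```lisp
;; Toy prototype-based object system with single dispatch.
;; Hypothetical names throughout; not the Lisp-Stat implementation.
(defstruct (proto (:constructor %make-proto))
  parent
  (slots (make-hash-table))      ; slot name -> value
  (methods (make-hash-table)))   ; selector  -> function

(defun make-proto-object (&optional parent)
  (%make-proto :parent parent))

;; Search the receiver, then its chain of parents, for a method.
(defun find-method-in (object selector)
  (cond ((null object) nil)
        ((gethash selector (proto-methods object)))
        (t (find-method-in (proto-parent object) selector))))

;; Single dispatch: only the receiver determines the method.
(defun send-message (object selector &rest args)
  (let ((method (find-method-in object selector)))
    (if method
        (apply method object args)
        (error "No method for ~S" selector))))

;; Slot lookup with inheritance from the parent object.
(defun slot-of (object name)
  (multiple-value-bind (value found) (gethash name (proto-slots object))
    (cond (found value)
          ((proto-parent object) (slot-of (proto-parent object) name))
          (t (error "No slot ~S" name)))))

(defparameter *point* (make-proto-object))
(setf (gethash 'x (proto-slots *point*)) 1
      (gethash 'y (proto-slots *point*)) 2)
(setf (gethash :describe (proto-methods *point*))
      (lambda (self) (list (slot-of self 'x) (slot-of self 'y))))

;; A new object made from the prototype overrides only one slot.
(defparameter *shifted* (make-proto-object *point*))
(setf (gethash 'x (proto-slots *shifted*)) 10)
```

Sending :describe to *shifted* finds the method on its parent but reads the overridden x slot from *shifted* itself, which is the behavior that makes prototypes convenient for interactive modification.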
Many Lisp-based object systems, such as CLOS (Steele, 1990), the EuLisp object system (Padget et al., 1994), and Dylan (Apple Computer, 1994) use multiple dispatching. Other languages that use multiple dispatching are Cecil (Chambers, 1993) and S. Multiple dispatching has more expressive power than single dispatching, but also represents a more complex programming paradigm. Most work on object-oriented design (e.g. Rumbaugh et al., 1991) is based on the single dispatch model. Only recently have researchers begun to formulate a framework for understanding multiple dispatch (Chambers, 1992). If these efforts are successful, then it may be worth reconsidering the use of multiple dispatching. For now, single dispatching appears adequate and better understood.

There are situations where it would be useful to develop specialized object-oriented subsystems to support a particular project. This might be to provide increased efficiency or increased expressive power. Such a system can be built from scratch, but would be easier to construct if it could build on the existing system. The need for customized object systems has led to the development of meta-object protocols (Kiczales et al., 1991; Padget et al., 1994). It may prove useful to design a meta-object protocol for Lisp-Stat as well and to implement the current protocol as a special case.

An area of considerable current research and commercial interest is the development of standards for linking objects in separate applications and on remote systems. Some of the projects with this objective are OpenDoc, OLE, SOM, CORBA, and ILU. Most approaches seem to be working towards compliance with CORBA. ILU (Janssen et al., 1995), which provides a CORBA interface, or Fresco (Linton and Price, 1993), which is based on CORBA, may provide an effective means for integrating object linking into Lisp-Stat.
Providing a standard linking mechanism will allow Lisp-Stat to more easily communicate with other programs, either using them as compute engines or serving them as a compute engine. It will also allow Lisp-Stat sessions on separate workstations to communicate with one another in a transparent fashion.

The current Lisp-Stat object system does not provide a standardized broadcasting mechanism for efficiently distributing change notifications to interested objects. At present such broadcasts have to be implemented by hand (Tierney, 1993). An efficient, standard mechanism is needed to adequately support the Model-View-Controller paradigm that has become central to graphical user interface design. A mechanism similar to the one used in Smalltalk will need to be incorporated.

3.4 The Graphics System

The current Lisp-Stat graphics system was designed as a compromise between flexibility, simplicity, and efficiency. The goal of redesigning the graphics system is to increase the flexibility of the system while maintaining or improving on simplicity and efficiency.

For example, the original design identifies plots with their containing windows. This simplifies the user model for dealing with plots, but prevents placing multiple plots in the same window. Similarly, dialog items were considered part of special dialog windows, thus preventing the integration of standard dialog items with plot windows.

The new design will support a hierarchical window structure in which each top level window contains a nested hierarchy of widgets. Each widget can be an elementary item such as a button or a slider, or another collection of widgets. Geometry managers will be provided to facilitate display-independent layout management. The design of the Tk toolkit (Ousterhout, 1994) may provide a useful model to explore.

With increases in workstation speed experienced in recent years, it may also be possible to represent plots as collections of widgets.
This will again increase flexibility, but may need to be deferred if it is still too costly in performance on current hardware.

In addition to supporting standard widgets and widgets defined in Lisp-Stat, the graphics system should also support externally defined widgets, such as OLE controls. The ability to embed widgets related to other processes running locally or remotely also needs to be explored. The Fresco toolkit (Linton and Price, 1993), which is based on the CORBA standard for distributed objects, may provide a useful model or a possible basis for this development.

Within the statistical graphs themselves, it would be helpful to provide more programmability to layout features, such as the axes on a plot. It would also be useful to provide primitives for managing symbols or other glyphs that represent groups of points rather than just individual points. This would provide a useful superstructure for histograms as well as for binned scatterplots (Carr, 1991).

Finally, it would be useful to allow closer adaptation to native GUI standards, but without sacrificing code portability. This is difficult to achieve, but is facilitated somewhat by the convergence of features in different GUIs, such as the Macintosh, MS Windows, and Motif, that has occurred over the last few years.

3.5 Models and Data

The current statistical model system has proven quite effective for code re-use, but it does have some design features that are now generally considered to be unfortunate from the point of view of object-oriented design (Rumbaugh et al., 1991). In particular, it would be better to design a nonlinear regression model to have a linear regression model component for code re-use (a has-a relationship) and delegate appropriate messages to this component, instead of having nonlinear regression models inherit from the linear regression model (an is-a relationship).
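The has-a arrangement can be illustrated with a small sketch. It is written in Python rather than Lisp-Stat, purely for illustration; the class names `LinearModel` and `PowerLawModel`, their methods, and the explicit `_delegated` list are all hypothetical and not part of Lisp-Stat. As a toy stand-in for a nonlinear regression model it uses a power-law curve fitted by log-log linearization, not iterative nonlinear least squares.

```python
import math


class LinearModel:
    """Toy simple linear regression y = intercept + slope * x (hypothetical)."""

    def __init__(self, x, y):
        self.x, self.y = list(x), list(y)
        self._fit(with_intercept=True)

    def _fit(self, with_intercept):
        n = len(self.x)
        # Without an intercept the data are not centered before fitting.
        mx = sum(self.x) / n if with_intercept else 0.0
        my = sum(self.y) / n if with_intercept else 0.0
        sxx = sum((xi - mx) ** 2 for xi in self.x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(self.x, self.y))
        self.slope = sxy / sxx
        self.intercept = my - self.slope * mx

    def coefficients(self):
        return [self.intercept, self.slope]

    def num_cases(self):
        return len(self.x)

    def drop_intercept(self):
        # Sensible for a linear model; meaningless for the power-law model below.
        self._fit(with_intercept=False)


class PowerLawModel:
    """Toy nonlinear model y = a * x**b that HAS a linear component."""

    # Only these messages are forwarded to the linear component.
    _delegated = ("num_cases",)

    def __init__(self, x, y):
        # Fit log(y) = log(a) + b * log(x) with the contained linear model.
        self._linear = LinearModel([math.log(v) for v in x],
                                   [math.log(v) for v in y])

    def coefficients(self):
        # Translate the component's answer back to the nonlinear parameters.
        log_a, b = self._linear.coefficients()
        return [math.exp(log_a), b]

    def __getattr__(self, name):
        # Called only when normal lookup fails: forward listed messages.
        if name in self._delegated:
            return getattr(self._linear, name)
        raise AttributeError(name)


model = PowerLawModel([1, 2, 4, 8], [3, 12, 48, 192])
a, b = model.coefficients()
print(round(a, 6), round(b, 6), model.num_cases())
print(hasattr(model, "drop_intercept"))  # not delegated, so not exposed
```

In this sketch `num_cases` reaches the linear component because it is listed for delegation, while `drop_intercept` is simply not visible on the containing model. Under inheritance it would have been available automatically.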
Using inheritance means that even inappropriate methods are inherited; using containment and delegation provides more reasonable control. To assist with this change, the object system should be modified to provide direct support for delegating messages received by one object to another object, usually a slot value of the original receiver.

It will also be necessary to design a useful data set prototype that is capable of storing attribute information, such as whether the values of data are to be interpreted as numerical values or factor levels. The lack of such a system has forced several users to develop variants of their own.

3.6 Syntax and User Interface Issues

Lisp syntax is often perceived as a bit of an impediment to the use of the language. There is considerable debate about the degree to which this impediment is real or perceived. The success of Lisp-Stat to date suggests it may be less of an issue than is sometimes claimed. Nevertheless, alternate syntaxes, at least for parts of the system, are worth exploring. For example, a simple infix parser may be useful for specifying mathematical formulas. It may also be useful to develop a simple vector subscripting language similar to the ones used in S or MATLAB.

Developing a textual syntax that is more natural, in some sense, than Lisp's parenthesized prefix syntax while remaining as powerful is a difficult task. Even though discussions of syntax pros and cons often focus on a comparison of infix and prefix notation, probably a more significant aspect of Lisp syntax that can make it hard to follow at times is that there are no syntactic cues to help distinguish special forms, or syntactic keywords, from standard functions; the programmer has to know which symbols refer to special forms. This is a weakness of Lisp syntax, but at the same time it is also a great strength: there are no syntactic impediments to the introduction of new special forms.
This makes Lisp a programmable language that can be used to define new, problem-specific languages (Graham, 1994). Achieving this level of flexibility with an infix syntax is extremely difficult; this is reflected by the long delay introduced into the Dylan project by their decision to adopt an infix syntax (Apple Computer, 1994).

A promising alternative to a textual syntax is a visual one. Research on visual languages has met with some successes (Cox et al., 1989; Khoros, Rasure et al., 1990; Burnett et al., 1994) and has also seen some application in statistical computing (Oldford and Peters, 1988). Another interesting and related area of research is programming by example, or programming by demonstration (Cypher, 1993). It is still too early to tell whether there are visual paradigms that are sufficiently universal to be intuitive and easy to use, while at the same time retaining the expressive power of their textual counterparts.

But even if these approaches cannot entirely replace a textual syntax, it may be possible to develop very useful and effective visual interfaces to significant portions of a statistical system. This could greatly enhance the ease of use for those portions, but it comes at a price: unless all features are accessible using a visual interface, a barrier is established between those portions that are and those that are not. Learning to use simpler aspects of the system does not provide any assistance in reaching beyond this barrier. The result could be to discourage, rather than encourage, experimentation and development; this would be unfortunate.

4 Discussion

The major objective of Lisp-Stat is to provide a flexible system that can easily be extended both in its numerical and its graphical capabilities. The current system represents a first step; the revisions currently in progress are designed to bring it closer to this goal.
Extensibility is critical for a system to be able to adapt to new statistical problems and ideas. Having an extensible system allows research to progress more rapidly, since new ideas are easier to test and refine. But it also gives a data analyst more flexibility to adapt methods to a problem instead of having to adapt problems to available methods. In short, having an extensible computing environment helps to reduce the gap between statistical research and practice, which is to the benefit of both.

The Heidelberg Workshop where this paper was presented provided a nice opportunity to illustrate the advantages of extensibility. In an evening session Andreas Buja presented a new idea for interactively controlling a tour of four-dimensional space (Buja and ???, 1996). The idea was clearly excellent, but it was hard to appreciate fully without being able to try it out. It was promised that the idea would be incorporated in a future release of XGobi, but it was not clear when that might be available. Fortunately, by taking advantage of the extensible nature of Lisp-Stat, I was able to put together a simple implementation in an hour or two that evening, and could then begin to experiment with it the next day. The implementation was pedestrian to be sure, but adequate as a prototype. Having the prototype to experiment with helped to underscore the quality of the basic idea.