Information Algebra(ID:164/inf007)formalism for data processingLanguage Structure Group of CODASYL Theoretical formalism for data processing Rival to L-hat from share Related languages
References: The Language Structure Group functions within the CODASYL activity as a cooperative effort on the part of individuals interested in the development of problem-oriented, machine-independent programming languages. It has existed since August 1959, meeting bimonthly as a committee. Its goal is to explore the functional requirements of data processing tasks and, if possible, to develop a mode of expression for translating them into an input language for a class of data processing machines. The first results of the Group's efforts are expressed in draft form in an unpublished document entitled "An Information Algebra". The Information Algebra provides both a notation for identifying and characterizing the data which undergo processing and a technique for expressing the relationships which exist among the data. Thus, the algebra provides succinct and rigorous terms for the problem analysis which presently finds expression only through extensive verbalization. The future work of the Language Structure Group will follow several parallel paths. The Algebra requires certain extensions in order to cope with some of the less orthodox data processing problems, and work will continue in that area. Problems of implementation will have to be considered also. It is necessary to prepare a narrative description of the algebra because the present notation is terse and may be difficult to grasp. Simultaneously, a search for equivalent mathematical structures will be made so that the specialized notation of the algebra may be replaced by a better-known notation and a more highly-developed structure. Attention will be given to the definition of macroinstructions, to provide a set of verbs for the ultimate language. The subject of describing data to the system will have to be gone into as well. Finally, the system must be tested on a host of problems to discover the difficulties and shortcomings which may arise. in [ACM] CACM 4(03) (March 1961) view details in [ACM] CACM 5(04) April 1962 view details COBOL was developed by the COBOL Committee. which is a subcommittee of a layer committee called the "Conference on Data System Languages" which has an Executive Committee and, in addition to the COBOL Committee, a Development Committee. Within the Development Committee are two working groups called the Language Structure Group (whose members are Roy Goldfinger of IBM, Robert Bosak of SDC, Carey Dobbs of Spcrry Rand, Renee Jasper of Navy Management Office, William Keating of National Cash Register, George Kendrick of General Electric, and Jack Porter of Mitre Corporation). and the Systems Group. Both of these groups are interested in going beyond COBOL to further simplify the work of problem preparation. The Systems Group has concentrated on an approach which represents a generalization of TABSOL in GE's GECOM. The driving force behind this effort is Burt Grad. The Language Structure Group has taken a different approach. It notes that FACT, COBOL, and other current data-processing compilers are "procedure-oriented," that is, they require a solution to the dataprocessing problem to be worked out in detail, and they are geared to the magnetic-tape computer and assume that information is available in files on magnetic tape. sorted according to some keys. The Language Structure Group felt that an information algebra could be defined which would permit the definition of a data-processing problem without postulating a solution. It felt that if this point were reached. then certain classes of problems could be handled by compilers that would, in effect, invent the necessary procedures for the problem solution. One sees a trend in this direction in FLOW-MATIC, COBOL, Commercial Translator, and FACT, if one notes that a sort can be specified in terms of the ipnput file and the output file, with no discussion of the technique required to create strings of ordered items and merge these strings. Similarly, a report writer specifies where the information is to come from and how it is to be arranged on the printed page, and the compiler generates a program which accomplishes the purpose without the programmer having to supply the details. The FACT update verb specifies a couple of input files and criteria for matching, and FACT generates a program which does the housekeeping of reading records from both files, matching them and going to appropriate routines depending on where there is a match, an extra item, a missing item, etc. A careful study of existing compilers will reveal many areas where the compiler supplies a substantial program as a result of having been given some information about data and interrelationships between the data. The Language Structure Group has, by no means, provided a technique of solving problems once they are stated in abstract form. Indeed, it is clear that this problem is not soluble in this much generality. All the Language Structure Group claims is to provide an information algebra which may serve as a stimulus to the development of compilers with somewhat larger abilities to invent solutions in restricted cases. The work of the Language Structure Group will be described in detail in a paper which will appcar shortly in the Coinnnmications offhe ACM. Extract: Basic concepts Basic concepts The algebra is built on three undefined concepts: ENTITY, PROPERTY and VALUE. These concepts are related by the following two rules which are used as the basis of two specific postulates: (a) Each PROPERTY has a set of VALUES associated with it. (b) There is one and only one VALUE associated with each PROPERTY of each ENTITY. The data of data processing are collections of VALUES of certain selected PROPERTIES of certain selected ENTITIES. For example, in a payroll application. one class of ENTITIES are the employees. Some of the PROPERTIES which may be selected for inclusion are EMPLOYEE NUMBER, NAME, SEX, PAY RATE, and a payroll file which contain a set of VALUES of these PROPERTIES for each ENTITY. Extract: Property spaces Property spaces Following the practice of modern algebra, the Information Algebra deals with sets of points in a space. The set of VALUES is assigned to a PROPERTY that is called the PROPERTY VALUE SET. Each PROPERTY VALUE SET contains at least two VALUES U (undefined, not releUant) and II (missing, relevant, but not known). For example, the draft status of a female employee would be undefined and not relevant, whereas if its value were missing for a male employee it would be relevant but not known. Several definitions are introduced : (1) A co-ordinate set (Q) is a finite set of distinct properties. (2) The null co-ordinate set contains no properties. (3) Two co-ordinate sets are equivalent if they contain exactly the same properties. (4) The property space (P) of a co-ordinate set (Q) is the cartesian product P = V, X V, x Vj X . . . X V,, where V, is the property value set assigned to the ith property of Q. Each point (p) of the property space will be represented by an n-tuple p = (a,, a,, a,, . . ., a,), where a, is some value from C',. The ordering of properties in the set is for convenience only. If n -= 1, then P = V,. If n = 0, then P is the Null Space. (5) The datum point (d) of an entity (e) in a property space (P) is a point of P such that if (1 -= (a,, al. a,. . . ., a,,), then a, is the value assigned to e from the ith property value set of P for i -I 1. 2, 3, . . ., n. ((l) is the representation of (e) in (P). Thus, only a small subset of the points in a property space are datum points. Every entity has exactly one datum point in a given property space. Extract: Lines and functions of lines Lines and functions of lines A line (L) is an ordered set of points chosen from P. The span (n) of the line is the number of points comprising the line. A line L of span n is written as: L = (p,, p2. . . ., p,,) The term line is introduced to provide a generic term for a set of points which are related. In a payroll application the datum points might be individual records for each person for five working days of the week. These live records plotted in the property space would represent a line of span five. A function of lines (FOL) is a mapping that assigns one and only one value to each line in P. The set of distinct values assigned by an FOL is the value set of the FOL. It is convenient to write an FOL in the functional form , f ' ( X ) , where f is the FOL and X is the line. In the example of the five points representing work records for five days work, a FOL for such a line might be the weekly gross pay of an individual, which would be the hours worked times payrate summed over the five days of the week. An Ordinal FOL (OFOL) is an FOL whose value set is contained within some defining set for which there exists a relational operator R which is irreflexive, asymetric, and transitive. Such an operator is "less than" on the set of real numbers. Extract: Areas and functions of areas Areas and functions of areas An area is any subset of the property space P; thus the representation of a file in the property space is an area. A function of an area (FOA) is a mapping that assigns one and only one value to each area. The set of distinct values assigned by an FOA is defined to be the value set of the FOA. It is convenient to write an FOA in the functional form f'(X), where f is the FOA and X is the area. Extract: Bundles and functions of bundles Bundles and functions of bundles In data-processing applications, related data from several sources must be grouped. Various types of functions can be applied to these data to define new information. For example, files are merged to create a new file. An area set /A of order n is an ordered n-tuple of areas (A,, A,. . . ., A, = /A). Area set is a generic term for a collection of areas in which the order is significant. It consists of one or more areas which are considered simultaneously; for example. a transaction file and master file from an area set of order 2. Definition: The Bundle B -: B(b, /A) of an area set /A for a selection OFOL h is the set of all lines L such that if (a) /A = (A,, .Az. . . ., A,,) and (0) L = ( p , . p*, . . .,p,,). where p, is a point of A, for i = 1,2 , . . . , n, then b( L) = True. A bundle thus consists of a set of lines each of order n where 11 is the order of the area set. Each line in the bundle contains one. and only one. point from each area. The concept of bundles gives us a method of conceptually linking points from different areas so that they may be considered jointly. As an example, consider two areas whose names are "Master File" and "Transaction File," each containing points with the property Part Number. The bundling function is the Part Number. The lines of the bundle are tlie pairs of points representing one Master File record and one Transaction File record with the same partner. A function of a bundle (FOB) is a mapping that assigns an area to a bundle such that (a) there is a many-to-one correspondence between the lines in the bundle and the points in the area; (b) the value of each properly of each point in the area is defined by an FOL of the corresponding line of the bundle; (c) the value set of such an FOL must be a subset of the corresponding property value set. The function of a bundle assigns a point to each line of the bundle; thus a new Master File record must be assigned to an old master file record and a matching transaction file record. Extract: Glumps and functions of glumps Glumps and functions of glumps If A is an area and g is an FOL defined over lines of span 1 in A, then a Glump G - G(g, A) of an area A for an FOL g is a partition of A by g such that an ele- ment of this partition consists of all points of A that have identical values f' or g. The function g will be called the Glumping Function for G. The concept of a glump does not imply an ordering either of points within an element, or of elements within a glump. As an example, let a back order consist of points with the three non-null properties: Part Number, Quantity Ordered, and Date. Define a glumping function to he Part Number. Each element of this glump consists of all orders for the same part. Different glump elements may contain different numbers of points. A function of a glump (FOG) is a mapping that assigns an area to a glump such that (a) there is a many-to-one correspondence between the elements of the glump and the points of the area ; (b) the value of each property of a point in the assigned area is defined by an FOA of the corresponding element in the glump: (c) the value set of the FOA must be a subset of the corresponding property value set. in The Computer Journal 5(3) October 1962 view details In the intervening period several attacks on the problem of rigorous definition of data processing and specification of an algebra have been undertaken. One of the more advanced and complete attempts has been made by Lionello Lombardi in several papers published during the past two years. [ See particularly Lombardi, Lionello. "Mathematical Structure of Nonarithmetic Data Processing Procedures." J. AC.1! 9, 1 (Jan. 1962), 136-159. ] Lombardi's constructions are unsatisfactory, partially because he appeared to lack a fundamental appreciation for the data processing problem in large and complex systems, and also because his basic concept of a universe of information was not sufficiently well worked out. The Lombardi papers are procedure-oriented when the need was for description and definition of the geometry of data processing, as it were. It is this need that "An Information Algebra" fulfills. Beginning with the concept of data as points in a property space, the algebra proceeds to define the various dimensions that data and strings of data may assume and to define by use of functions the operations that may be performed upon these data. There is not space here to review completely the development of the Algebra. There is space, however, to urge that every serious student of data processing study the article carefully, with a view to actual use of this tool both to test it and to help further its development. It is demanding reading, particularly for the nonmathematician, but the difficulties may not be laid to the authors. They deserve hearty congratulations, in fact, for a determined and largely successful effort to combine rigor and clarity. Any problems involved in understanding must be charged directly to the subject matter and to the number of new concepts advanced. in ACM Computing Reviews 3(05) September-October 1962 view details A file is represented as a set of points in property space and is called an area. The creation of output files from input is equivalent to transformation of one or more given areas (i.e. the input files) into one or more new areas (the output files) ... this transformation does not in any way depend on the sequence of operations and it is sometimes helpful, in fact, to regard the transformation as occurring simultaneously. in [AFIPS JCC 24] Proceedings of the 1963 Fall Joint Computer Conference FJCC 1963, Las Vegas, Nev., Nov. 1963, view details in ACM Computing Reviews 4(03) May-June 1963 view details A section on conclusions points out the main result of this experiment; viz., that a complex information manipulation problem, not in business data processing, could be described in IA -- a form both nonprocedural, and machine-independent. This section also contains brief but instructive comments on the extent to which a description can practically be nonprocedural, and on the process of translation from IA to machine code. The only additional point that might be made is that the author may somewhat overstate the conclusion that can properly be drawn from his work when he says that is suggests that IA "might be suitable, with some minor extensions, as a software programming language& quot;. If by "suitable" he means "usable", he is probably correct; if he means "desirable", more evidence on other types of software and more work on other languages are needed. However the paper represents an important contribution to the all too scanty literature on abstract problem description methods. in ACM Computing Reviews 5(05) September-October 1964 view details in ACM Computing Reviews 5(05) September-October 1964 view details Problem Statement Language The Problem Statement Language is a language people use to write their requirements that the target Information Processing System is to satisfy. The Problem Statement Language is the crucial part of ISDOS since it forms the interface between the organization and ISDOS. It must, on the one hand be suitable for use by the Problem Definer, and on the other hand provide information in sufficient detail for the succeeding parts of ISDOS. General purpose programming languages such as COBOL, FORTRAN and PL/I, are not satisfactory for use as requirement statement languages though they are similar in that they must both have sufficient structure to be analyzable by computer programs. The requirement statement language being developed in the ISDOS project is known as PSL (Problem Statement Language). PSL can be considered a generalization of languages such as those of Young and Kent; Information Algebra (CODASYL Committee); SYSTEMATICS (Grindley); TAG (IBM); and ADS (National Cash Register). A problem statement in PSL consists of a number of statements grouped into units called problem statement units (PSU). Each PSU describes an input or an output and consists of four major parts: ? An identification section which contains statements relating the input and output to the world external to the IPS. ? A Time-and-Volume section which describes the timing of the input or output and parameters which determine the volume. ? A Data Structure definition section which describes the structure of the data. ? A computational definition section which states the formulas defining the variables which must be calculated. Another part of PSL is used to describe requirements for the system as a whole: systemwide parameters, constraints, capabilities of hardware and software available and the performance criterion. Extract: Problem Statement Analyzer Problem Statement Analyzer The statements made in PSL must be analyzed to make sure that the requirements are consistent and correctly stated. The Problem Statement Analyzer (PSA) performs this function in ISDOS. It is a software package that accepts problem statements in PSL, and maintains files necessary for its use, carries out analyses and produces output. The analysis is carried as far as is possible without bringing in hardware considerations. The process of statement analysis is monitored by a System Definer and a Data Administrator who in turn communicate with the Problem Definers and their managers. When the total problem statement has been analyzed, the output of PSA is made available to the Systems Optimization Design Algorithm (SODA) and the Code Generator (CG). Some requirements for information of the type produced by PSA have been discussed by King and in the TAG manual, IBM. The Problem Statement Analyzer is the tool which amplifies the capability of an individual to understand and comprehend the requirements that have been stated. in ACM Computing Reviews 5(05) September-October 1964 view details in [ACM] Proceedings of the 1972 Annual Conference of the ACM view details The exact number of all the programming languages still in use, and those which are no longer used, is unknown. Zemanek calls the abundance of programming languages and their many dialects a "language Babel". When a new programming language is developed, only its name is known at first and it takes a while before publications about it appear. For some languages, the only relevant literature stays inside the individual companies; some are reported on in papers and magazines; and only a few, such as ALGOL, BASIC, COBOL, FORTRAN, and PL/1, become known to a wider public through various text- and handbooks. The situation surrounding the application of these languages in many computer centers is a similar one. There are differing opinions on the concept "programming languages". What is called a programming language by some may be termed a program, a processor, or a generator by others. Since there are no sharp borderlines in the field of programming languages, works were considered here which deal with machine languages, assemblers, autocoders, syntax and compilers, processors and generators, as well as with general higher programming languages. The bibliography contains some 2,700 titles of books, magazines and essays for around 300 programming languages. However, as shown by the "Overview of Existing Programming Languages", there are more than 300 such languages. The "Overview" lists a total of 676 programming languages, but this is certainly incomplete. One author ' has already announced the "next 700 programming languages"; it is to be hoped the many users may be spared such a great variety for reasons of compatibility. The graphic representations (illustrations 1 & 2) show the development and proportion of the most widely-used programming languages, as measured by the number of publications listed here and by the number of computer manufacturers and software firms who have implemented the language in question. The illustrations show FORTRAN to be in the lead at the present time. PL/1 is advancing rapidly, although PL/1 compilers are not yet seen very often outside of IBM. Some experts believe PL/1 will replace even the widely-used languages such as FORTRAN, COBOL, and ALGOL.4) If this does occur, it will surely take some time - as shown by the chronological diagram (illustration 2) . It would be desirable from the user's point of view to reduce this language confusion down to the most advantageous languages. Those languages still maintained should incorporate the special facets and advantages of the otherwise superfluous languages. Obviously such demands are not in the interests of computer production firms, especially when one considers that a FORTRAN program can be executed on nearly all third-generation computers. The titles in this bibliography are organized alphabetically according to programming language, and within a language chronologically and again alphabetically within a given year. Preceding the first programming language in the alphabet, literature is listed on several languages, as are general papers on programming languages and on the theory of formal languages (AAA). As far as possible, the most of titles are based on autopsy. However, the bibliographical description of sone titles will not satisfy bibliography-documentation demands, since they are based on inaccurate information in various sources. Translation titles whose original titles could not be found through bibliographical research were not included. ' In view of the fact that nany libraries do not have the quoted papers, all magazine essays should have been listed with the volume, the year, issue number and the complete number of pages (e.g. pp. 721-783), so that interlibrary loans could take place with fast reader service. Unfortunately, these data were not always found. It is hoped that this bibliography will help the electronic data processing expert, and those who wish to select the appropriate programming language from the many available, to find a way through the language Babel. We wish to offer special thanks to Mr. Klaus G. Saur and the staff of Verlag Dokumentation for their publishing work. Graz / Austria, May, 1973 in [ACM] Proceedings of the 1972 Annual Conference of the ACM view details in Proceedings of the ACM SIGPLAN symposium on Very high level languages, March 28-29, 1974, Santa Monica, California, United States view details |