Constrained natural language for querying attribute-value pairs

BASEBALL Green et al 1961

Natural language system, used for querying databanks

Written in IPL, used IPL lists as storage structures, ran on IBM 7090

Related languages
IPL-V => BASEBALL   Influence
IPL-V => BASEBALL   Written using
BASEBALL => CODIL   Influence

  • Green, B.F., Jr., and Selfridge, O. G. Current Res. Dev. Scient. Doc. (NSF), No. 8 (May, 1961), pp153-154.
  • Green, Bert F. Jr., Wolf, Alice K., Chomsky, Carol, Laughery, Kenneth "Baseball: an automatic question answerer" pp219-224
          in [JCC 19] Proceedings of the Western Joint Computer Conference, May 1961
  • Grems, Mandalay "A survey of languages and systems for information retrieval" pp43-46 Extract:
    Baseball is a computer program that answers questions phrased in ordinary English about stored data. The program's present context is baseball games; it answers such questions as "Did the Tigers play the Red Sox in July?" and "Where did each team play in each month?". After reading the question from punched cards, the program looks up the words and idioms in a stored dictionary of approximately 300 entries. The phrase structure and other syntactic facts are determined for a content analysis which sets up a list of attribute-value pairs that specify the information given and the information requested. The retrieval program extracts the information requested from the data matching the specifications. Finally, the necessary processing, e.g. counting, is done on the extracted data, and the answer is printed. No attempt is yet being made to provide a program for the computer to respond in a grammatical English sentence.
          in [ACM] CACM 5(01) January 1962 "Design, Implementation and Application of IR-Oriented Languages," ACM Computer Language Committee on Information Retrieval on 20-21 October 1961 in Princeton, N. J.
  • Green, Bert F. Jr., Wolf, Alice K., Chomsky, Carol, Laughery, Kenneth "Baseball: an automatic question answerer"
          in Feigenbaum, E. and Feldman, J. (eds.) "Computers and Thought" MIT Press, Cambridge, MA, 1963
  • Bobrow, D.G. "Natural Language Input for a Computer Problem Solving System", Report MAC-TR-1, Project MAC, M.I.T., Cambridge, Mass., June 1964 Abstract: The STUDENT problem solving system, programmed in LISP, accepts as input a comfortable but restricted subset of English which can express a wide variety of algebra story problems. STUDENT finds the solution to a large class of these problems. STUDENT can utilize a store of global information not specific to any one problem, and may make assumptions about the interpretation of ambiguities in the wording of the problem being solved. If it uses such information, or makes any assumptions, STUDENT communicates this fact to the user. The thesis includes a summary of other English language question-answering systems. All these systems, and STUDENT are evaluated according to four standard criteria. The linguistic analysis in STUDENT is a first approximation to the analytic portion of a semantic theory of discourse outlined in the thesis. STUDENT finds the set of kernel sentences which are the base of the input discourse, and transforms this sequence of kernel sentences into a set of simultaneous equations which form the semantic base of the Student system. STUDENT then tries to solve this set of equations for the values of requested unknowns. If it is successful it gives the answers in English. If not, STUDENT asks the user for more information, and indicates the nature of the desired information. The STUDENT system is a first step toward natural language communication with computers. Further work on the semantic theory proposed should result in much more sophisticated systems. Extract: Introduction
    The aim of the research reported here was to discover how one could build a computer program which could communicate with people in a natural language within some restricted problem domain. In the course of this investigation, I wrote a set of computer programs, the STUDENT system, which accepts as input a comfortable but restricted subset of English which can be used to express a wide variety of algebra story problems. The problems shown in Figure 1 illustrate some of the communication and problem solving capabilities of this system.
    In the following discussion, I shall use phrases such as "the computer understands English". In all such cases, the "English" is just the restricted subset of English which is allowable as input for the computer program under discussion. In addition, for purposes of this report I have adopted the following operational definition of understanding. A computer "understands" a subset of English if it accepts input sentences which are members of this subset, and answers questions based on information contained in the input. The STUDENT system understands English in this sense. Extract: BASEBALL
    Baseball is a question-answering system designed and programmed at Lincoln Laboratories by Green, Wolf, Chomsky and Laughery (19). It is a data base system in which the data is placed in memory in a prestructured tree format. The data consists of the dates, location, opposing teams and scores of some American League baseball games. Only questions to the system can be given in English, not the data.
    Questions must be simple sentences, with no relative clauses, logical or coordinate connectives. With these restrictions, the program will accept any question couched in words contained in a vocabulary list quite adequate for asking questions about baseball statistics. In addition, the parsing routine, based on techniques developed by Harris (21), must find a parsing for the question.
    The questions must pertain to statistics about baseball games found in the information store. One cannot ask questions about extrema, such as "highest" score or "fewest" number of games won. The parsed question is transformed into a standard specification (or spec) list and the question-answering routine utilizes this canonical form for the meaning of the question. For example, the question "Who beat the Yankees on July 4th?" would be transformed into the "spec list":
    Team (losing) = New York
    Team (winning) = ?
    Date = July 4
    Because Baseball does not utilize English for data input, we cannot talk about deductions made from information implicit in several sentences. However, Baseball can perform operations such as counting (the number of games played by Boston, for example) and thus, in the sense that it is utilizing several separate data units in its store, it is performing deductions.
    Baseball's abilities can only be extended by extensive re-programming, though the techniques utilized have some general applicability. Because the parsing program has a very complete grammar, and the vocabulary list is quite comprehensive for the problem domain, the user needs no knowledge of the internal structure of the Baseball program. No provision for interaction with the user was made.
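    Editorial illustration (not part of the extract): the spec-list idea described above can be sketched in Python. This is only a rough sketch under stated assumptions. The record fields (`date`, `winner`, `loser`), the `answer` function, and the three game records are all invented for the example; the original program used IPL list structures, not dictionaries.

```python
# Illustrative sketch only: field names and game data are invented,
# they are not BASEBALL's actual IPL structures or its real data.
QUERY = "?"  # marks the attribute whose value is being requested

# Hypothetical flat store of game records.
GAMES = [
    {"date": "July 4", "place": "New York", "winner": "Boston", "loser": "New York"},
    {"date": "July 4", "place": "Detroit", "winner": "Detroit", "loser": "Cleveland"},
    {"date": "July 5", "place": "Boston", "winner": "New York", "loser": "Boston"},
]

def answer(spec, games):
    """Return the requested values from every game that matches the given pairs."""
    wanted = [attr for attr, value in spec.items() if value == QUERY]
    given = {attr: value for attr, value in spec.items() if value != QUERY}
    found = []
    for game in games:
        if all(game.get(attr) == value for attr, value in given.items()):
            found.append(tuple(game[attr] for attr in wanted))
    return found

# Spec list for "Who beat the Yankees on July 4th?"
spec = {"loser": "New York", "winner": QUERY, "date": "July 4"}
found = answer(spec, GAMES)
print(found)  # [('Boston',)]

# Counting, e.g. the number of games played by Boston, combines several records.
boston_games = [g for g in GAMES if "Boston" in (g["winner"], g["loser"])]
print(len(boston_games))  # 2
```

    Keeping the question in the same attribute-value shape as the stored records is what lets retrieval reduce to a comparison of pairs.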
    Extract: Advice-taker
    McCarthy's Advice-taker, though not designed to accept English input, would make an excellent base for a question-answering system. Fischer Black has programmed a system which can do all of McCarthy's Advice-Taker problems, and can be adapted to accept a very limited subset of English. The deductive system in Black's program is equivalent to the propositional calculus.
  • Simmons, R. F. "Answering English questions by computer: a survey" p53-70 Abstract: Fifteen experimental English language question-answering systems which are programmed and operating are described and reviewed. The systems range from conversation machines to programs which make sentences about pictures and systems which translate from English into logical calculi. Systems are classified as list-structured data-based, graphic data-based, text-based and inferential. Principles and methods of operations are detailed and discussed.

    It is concluded that the data-base question-answerer has passed from initial research into the early developmental phase. The most difficult and important research questions for the advancement of general-purpose language processors are seen to be concerned with measuring meaning, dealing with ambiguities, translating into formal languages and searching large tree structures. Extract: BASEBALL
    This is a program originally conceived by Frick, Selfridge and Dineen and constructed by Green, Wolf, Chomsky and Laughery (1963). It answers English questions about the scores, teams, locations and dates of baseball games. The input questions are restricted to single clauses without logical connectives such as "and," "or" or "but" and excluding such relation words as "most" or "highest." Within the limitations of its data and its syntactic capability, Baseball is the most sophisticated and successful of the first generation of experiments with question-answering machines. It is of particular interest for the depth and detail of its analysis of questions.
    Baseball is programmed in IPL and uses list structures to organize data. The data are set up with a major heading of months. For each month there is a list of places in which games were played. For each place there is a list of days, for each day a list of games, and for each game a list of teams and score values, exemplified by the following data format:

    Month = July
    Place1 = Boston
    Day1 = 7
    Game Serial# = 96
    Team = Red Sox, Score = 5
    Team = Yankees, Score = 3
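
    Editorial illustration (not part of the extract): the month, place, day, game hierarchy shown above maps naturally onto nested dictionaries. The sketch below is an assumption-laden stand-in for the IPL lists. The names `STORE` and `walk_games` are hypothetical, and the only data used is the single game from the example.

```python
# Hypothetical nested store mirroring month -> place -> day -> game -> (team, score).
# Only the one game from the example above is included.
STORE = {
    "July": {
        "Boston": {
            7: {
                96: [("Red Sox", 5), ("Yankees", 3)],
            },
        },
    },
}

def walk_games(store):
    """Yield one flat record per game by descending the nested levels in order."""
    for month, places in store.items():
        for place, days in places.items():
            for day, games in days.items():
                for serial, results in games.items():
                    yield {
                        "month": month,
                        "place": place,
                        "day": day,
                        "serial": serial,
                        "results": results,
                    }

records = list(walk_games(STORE))

# Where did the Yankees play in July?
places = sorted(
    {r["place"] for r in records
     if r["month"] == "July" and any(team == "Yankees" for team, _ in r["results"])}
)
print(places)  # ['Boston']
```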

    The program also contains a dictionary which includes the part of speech of a word, its meaning, an indication of whether it belongs to an idiom, and a code to show if it is a question word. The first part of the program's task is to use the dictionary, parsing routines and content routines to translate from the English language question into a specification (or spec) list which is similar in format to the data structure.
    The first step is to substitute dictionary codes for the English words. A parsing using a modification of Zellig Harris's approach (1962) results first in bracketing the phrases of the question, then in determination of subject, object and verb. For example, the question "how many games did the Yankees play in July?" gives the following bracketing:
    (How many games) did (the Yankees) play (in (July))?
    The brackets distinguish noun phrases and preposition phrases and locate the data which are needed for the spec list. The parsing phase resolves some ambiguities of the noun-verb type while others such as "Boston = place" or "Boston = team" are resolved later. Some, of course, are not resolvable.
    A semantic analysis phase actually builds the spec list from the parsed question. In this phase the dictionary meanings of the words are used. The meaning may be an attribute which is part of the data structure, as in "team" means "team = (blank)," or "who" means "team = ?"; or the meaning may be a call to a subroutine, as for example "winning" means "routine A1" which attaches the additional condition "winning" to "team" on the spec list.
    The output of these routines is a spec list which is used to search the list structures of the data store for an acceptable answer.
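    Editorial illustration (not part of the extract): the two kinds of dictionary meaning, an attribute or a call to a routine, can be sketched as follows. Everything here is hypothetical: the `DICTIONARY` contents, the entry layout, and the `mark_winning` routine stand in for the real dictionary and its routines, and words missing from the toy dictionary are simply skipped, which the real program would not do.

```python
def mark_winning(spec):
    """Stand-in for the routine behind 'winning': qualify the team entry."""
    for entry in spec:
        if entry["attr"] == "team":
            entry["modifier"] = "winning"

# Toy dictionary: each meaning is either an attribute-value pair or a routine.
DICTIONARY = {
    "who": ("attribute", ("team", "?")),
    "team": ("attribute", ("team", None)),   # None plays the role of (blank)
    "july": ("attribute", ("month", "July")),
    "winning": ("routine", mark_winning),
}

def build_spec(words):
    """Collect attribute meanings first, then run routines over the spec list."""
    spec, routines = [], []
    for word in words:
        kind, meaning = DICTIONARY.get(word.lower(), (None, None))
        if kind == "attribute":
            attr, value = meaning
            spec.append({"attr": attr, "value": value, "modifier": None})
        elif kind == "routine":
            routines.append(meaning)
    for routine in routines:
        routine(spec)
    return spec

spec = build_spec("the winning team in July".split())
print(spec)
```

    Routines are deferred until all attributes are collected because a modifier such as "winning" precedes the noun it qualifies.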
    After the spec list is completed, the processing phase takes over. In some cases, this requires the simple matching of a blank item on the spec list such as the place in which a given team played on a given day. In other cases, as with the words "every," "either," and "how many," processing is a very complicated searching and counting procedure. The output of the program is in the form of a found list which shows all of the acceptable answers to the question.
    In the Baseball system three aspects of the question-answering problem stand out clearly. A first phase of syntactic analysis merges into the second phase, semantic analysis. However, for the first time a third logical processing phase becomes explicit. In this phase, even though the relations between words and the meaning of words are already known, a wide range of operations are performed as a function of these meanings. Having considered the manner in which Lindsay's SAD SAM reads text, to append data to a list structure similar to that used by the Baseball system, it is apparent that Baseball could become a completely self-contained (though limited) automatic language processor. To achieve this goal, factual statements would be read and analyzed into their spec lists and a new processor would be required to add the data to the storage lists.
          in [ACM] CACM 8(01) Jan 1965
  • Simmons, R. F. "Storage and retrieval of aspects of meaning in directed graph structures" Extract: Introduction
    Behind the development of every new computer language there lies a set of problems and a set of programming structures with whose aid the problems can be managed. With Fortran the problem was to solve algebraic equations without the need for a great deal of I/O bookkeeping and without concern for detailed computer-word-packing statements. Behind Jovial lay the command-control problem, which customarily dealt with complex data structures and the need to use every bit of computer memory efficiently. IPL grew in response to a need to use associative list structures of nonnumeric symbolic data in a computer. Lisp answered the need for a high-level functional language to handle recursive algebraic and symbolic structures. Comit was the machine translator's approach to handling natural language strings.
    In developing a special concept dictionary for storing and retrieving the meanings of words and phrases of English, the authors have found it desirable to use a complex network whose nodes are interrelated in several ways. Basically, the idea is that of a dictionary of English words in which each word has associated with it word-class information and lists of attribute-value pairs that define various aspects of its meaning and usage in terms of pointers to other words in the dictionary. In a data structure language such as Jovial, in addition to ordinary table structures several additional levels of associative coding are required to give easy access to the data without excessive costs in either space or processing time.
    Because of the many levels of associative linking required the authors decided to use Lisp, at least for early experimental work with the system. Advantages of Lisp extended beyond the ease of producing complex data structures; they also included the simplicity of producing complex, often recursive functions for maintaining and querying the dictionary. An additional advantage is gained in the fact that although Lisp is primarily an interpretive system it does allow for the compiling of a fast-running completed program. The most serious disadvantage of Lisp for our system is that in present versions it is limited to core memory for most uses. This limitation means that a dictionary of the type we are studying could not exceed two or three hundred words.
    Since we are aiming for an eventual vocabulary of from five to fifty thousand words, the limitation to core memory is intolerable. Either an expansion of Lisp will be required or the writing of a special language using auxiliary memory for handling cyclical structures in large complex networks will grow out of our experiments with the conceptual dictionary.
    Extract: The Problem
    The Problem
    The major shortcoming of all existing retrieval systems is their inability to handle anything vaguely resembling the meaning of words taken singly, let alone the meaning of the language strings that they comprise. Synonym dictionaries and thesauri have often been added but have proved but feeble makeshifts offering little improvement over the use of root forms of words alone. To the extent that automatic syntactic analysis has been available it has only emphasized the need for word and phrase meanings.
    Five years of Synthex research toward the development of question-answering systems based on natural language text have confirmed this inability to deal with meanings. In these five years many approaches have been attempted toward representing some aspects related to the meaning of words. Most have been unsuccessful. It was learned early that the use of a synonym dictionary did not greatly improve our understanding of text in response to questions. More recently it was realized that even a well-coded thesaurus was not an answer. At various times attempts were made to save syntactic contexts associated with words as possible representations of meanings; these too, although promising, did not appear to be a reasonable answer. With more recent research, particularly that of Bobrow [1964], Raphael [1964], Quillian [1965], and Thompson [1964], it has become apparent that, in addition to dictionary-type meanings, there is a need for something that can best be characterized as a knowledge of the world (e.g., Cows eat grass. Walls are vertical. Grass doesn't eat., etc.). Without something representing knowledge of the world it can hardly be hoped that a word or sentence can be understood.
    The consequence of this line of thought is the realization that the problem requires the development of a conceptual dictionary that would contain definitional material, associative material, and some representation of knowledge of the world. These three aspects of a word's meaning seem to be the minimum that will allow for enough understanding of English to make question answering even a reasonable probability.
    Extract: The Conceptual Dictionary
    The Conceptual Dictionary
    In a conceptual dictionary each word would be characterized by (a) a set of class memberships, (b) a set of attributes, (c) a set of associations, and (d) a set of active and passive actions.
    The set of class memberships includes statements of the form the/an X(noun) is a/an Y(noun). Thus, "an aardvark is an animal" or "an aardvark is a mammal" are both examples of statements giving rise to class membership characteristics. For many nouns the class membership set is one of the basic definitional aspects of the word.
    Attributes characterizing a word are such that if y is an attribute of x, then "x is y" is a true statement and the string "the y x" is grammatical. Thus if "scaly" is an attribute of "aardvark," then "an aardvark is scaly" is true and "the scaly aardvark" is grammatical. Associates of a word are in a loose part-whole relationship. If x has a y, then y is an associate of x; thus "John has a wallet" and "John has a nose" provide the two associates, "nose" and "wallet," for John.
    The set of actions characterizing a word are derived from the verbs and their complements or objects that appear in context with the word. The two sentences "Natives eat aardvarks" and "Aardvarks eat ants" provide "eaten by natives" and "eat ants" as passive and active actions related to aardvarks.
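    Editorial illustration (not part of the extract): the four kinds of information held for each word can be sketched as a small Python structure. This is a loose sketch under stated assumptions, not the authors' Lisp program. The `Concept` class, the helper functions, and the `SINGULAR` table are all hypothetical; the table stands in for the separately entered singular/plural relation.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """One dictionary entry: classes, attributes, associates, and actions."""
    classes: set = field(default_factory=set)
    attributes: set = field(default_factory=set)
    associates: set = field(default_factory=set)
    active: set = field(default_factory=set)
    passive: set = field(default_factory=set)

# Hypothetical table relating plural forms to their dictionary headword.
SINGULAR = {"aardvarks": "aardvark", "natives": "native", "ants": "ant"}

dictionary = {}

def entry(word):
    """Fetch (or create) the entry for a word, keyed by its singular form."""
    return dictionary.setdefault(SINGULAR.get(word, word), Concept())

def is_a(x, y):               # "an x is a y" -> class membership
    entry(x).classes.add(y)

def is_attr(x, y):            # "x is y" -> attribute
    entry(x).attributes.add(y)

def has(x, y):                # "x has a y" -> associate
    entry(x).associates.add(y)

def acts(subject, verb, participle, obj):
    """'subject verb obj' gives an active action for subject, a passive one for obj."""
    entry(subject).active.add(f"{verb} {obj}")
    entry(obj).passive.add(f"{participle} by {subject}")

is_a("aardvark", "animal")
is_a("aardvark", "mammal")
is_attr("aardvark", "scaly")
has("John", "wallet")
has("John", "nose")
acts("natives", "eat", "eaten", "aardvarks")
acts("aardvarks", "eat", "eaten", "ants")

aardvark = dictionary["aardvark"]
print(sorted(aardvark.classes), aardvark.active, aardvark.passive)
```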
    The idea underlying this schema is that the meaning of a word can be conceptualized in bands of closeness of relation. The class membership of an object is so closely related as to be an integral part of it or its perception. Attributes and things closely associated with a word are seen as important but less essential, while the actions associating one word and another may range from important to irrelevant. Extract: Toward an Operational System
    Toward an Operational System
    The conceptual dictionary briefly described above exists now as an experimental Lisp program in 47k of core memory in the ARPA-SDC Q-32 Time-Sharing system. It is currently limited to a dictionary of 200-300 words and a relatively small set of program functions. What is required of even an early operational system is the ability to handle from 20 to 50 thousand words and a rather large set of functions for dealing with questions of increasing difficulty. Our expectations are that Lisp will be expanded into a system that uses up to four million words of disk to augment core and that we will be able to pay an increased cost of response time in favor of continuing to use Lisp for a large operational system. If that cost is prohibitive it will be necessary to produce a system tailored to the needs of the conceptual dictionary, and that will be able to use auxiliary memory efficiently to deal with a very large network of complexly linked words.
    Although it is our belief that new problems create the need for new languages, it is apparent that existing languages are largely sufficient for our language processing problems, but in many cases, especially among the list-oriented languages, they simply have not geared themselves to the large amounts of data and data processing required in this special field.
    Extract: Acknowledgments
    Acknowledgments. I wish to acknowledge my debt to the twenty or so people who have studied question-answering systems over the past decade. Their work is reviewed elsewhere [Simmons 1965]. Readers knowing the Quillian system will recognize that the result of the author's three years of acquaintance with Quillian was the appropriation wholeheartedly of his ideas insofar as the author was able to understand them. A special debt is also expressed to Fred Thompson for leading the author to an understanding of parsing directly into a data structure and for acquainting him with TEMPO'S forthcoming language system of associative cycles. Programming and detail design of the conceptual dictionary described in this paper were accomplished by John Burger.
    Extract: Discussion
    Salton opened the discussion with the comment that a system such as this is inherently not extendable. The system will operate nicely with various kinds of "fish," but will run into trouble when "whales" appear, since the system will not know how to deal with aquatic mammals. He compared the system to the "Baseball" system, in which one cannot go beyond a limited range of questions, e.g., one has trouble if a new team appears. Young objected to this view, saying he believed the two systems were quite different, and that the present system was in principle infinitely extendable and very general. He said he could conceive of a system one level above this one which could deal with generalizations and which would make handling propositions easier than with the present special programming.
    Burger said that work with higher level relationships was planned, but that the immediate extensions would be more trivial, in the way of inserting a great deal of knowledge about the world, of the kind any child has (e.g., "Walls are vertical.").
    Gorn observed that the present system is based upon two forms of the verb "is" and the verb "has," and that more relationships were needed. Burger said the system was not in principle so limited. Gorn then asked how they would deal with the statement " 'Word' is a word." Burger had no immediate answer, though he thought they would eventually be able to deal with such cases.
    Responding to a question from Cheydleur, Burger said they had about 50 different functions.
    Mooers asked if there were any inherent features of Lisp which limited their work. Burger said that a fundamental limitation was the limitation to use of core storage alone in the SDC version of Lisp. They could not use disks. The second limitation was the inability to break out individual characters from the "atoms" of Lisp. This prevents an easy and direct way of treating the similarity between "dog" and "dogs." At present this relationship has to be put in as a separate piece of information.
    Mitchell then mentioned that for a very much larger data base a limitation will be the time required to pass the dictionaries through the machine. He said one of the big improvements in speed due to syntax-directed compilers was that they had less need to refer to a very large dictionary. Burger admitted that this was a very important problem, and one which is being considered. What can be done outside of Lisp is being studied, since a problem in using an auxiliary memory with a list-structure system like Lisp is that interrelationships between lists are broken up if just one section is brought from the auxiliary memory. So far, a good solution to the problem has not been found.
    Abstract: An experimental system that uses LISP to make a conceptual dictionary is described. The dictionary associates with each English word the syntactic information, definitional material, and references to the contexts in which it has been used to define other words. Such relations as class inclusion, possession, and active or passive actions are used as definitional material. The resulting structure serves as a powerful vehicle for research on the logic of question answering. Examples of methods of inputting information and answering simple English questions are given. An important conclusion is that, although LISP and other list processing languages are ideally suited for producing complex associative structures, they are inadequate vehicles for language processing on any large scale—at least until they can use auxiliary memory as a continuous extension of core memory.

          in [ACM] CACM 9(03) March 1966 includes proceedings of the ACM Programming Languages and Pragmatics Conference, San Dimas, California, August 1965
  • Barter, C.J. "Data Structure and Question Answering" Abstract: A brief discussion of the "question-answering" problem will be given. Certain aspects of the problem of system design will be discussed; in particular, we will examine what is meant by "the structure of a data base". A survey of existing systems will not be attempted, but a few well known systems will be used as examples to illustrate various points.
          in Kaneff, S. (ed) Picture Language Machines: Proceedings of a Conference held at the Australian National University, Canberra on 24-28 February, 1969
  • Sammet, Jean E. "Computer Languages - Principles and History" Englewood Cliffs, N.J. Prentice-Hall 1969. Extract: BASEBALL
    One of the first [of the question-answering systems] was the BASEBALL system in which the user was able to write such things as



    The input sentences are restricted to single clauses and do not permit logical connectives such as "and" and "or". Relation words such as "most" or "highest" are also not permitted.

    BASEBALL is programmed in IPL-V [...] and organizes the data into list structures. The first part of the program uses a dictionary, parsing routines and semantic analysis routines to translate the input question into a specification list similar in format to that of the data. This permits retrieval of the answers.

  • Stock, Marylene and Stock, Karl F. "Bibliography of Programming Languages: Books, User Manuals and Articles from PLANKALKUL to PL/I" Verlag Dokumentation, Pullach/Munchen 1973 81 Abstract: PREFACE AND INTRODUCTION
    The exact number of all the programming languages still in use, and those which are no longer used, is unknown. Zemanek calls the abundance of programming languages and their many dialects a "language Babel". When a new programming language is developed, only its name is known at first and it takes a while before publications about it appear. For some languages, the only relevant literature stays inside the individual companies; some are reported on in papers and magazines; and only a few, such as ALGOL, BASIC, COBOL, FORTRAN, and PL/1, become known to a wider public through various text- and handbooks. The situation surrounding the application of these languages in many computer centers is a similar one.

    There are differing opinions on the concept "programming languages". What is called a programming language by some may be termed a program, a processor, or a generator by others. Since there are no sharp borderlines in the field of programming languages, works were considered here which deal with machine languages, assemblers, autocoders, syntax and compilers, processors and generators, as well as with general higher programming languages.

    The bibliography contains some 2,700 titles of books, magazines and essays for around 300 programming languages. However, as shown by the "Overview of Existing Programming Languages", there are more than 300 such languages. The "Overview" lists a total of 676 programming languages, but this is certainly incomplete. One author has already announced the "next 700 programming languages"; it is to be hoped the many users may be spared such a great variety for reasons of compatibility. The graphic representations (illustrations 1 & 2) show the development and proportion of the most widely-used programming languages, as measured by the number of publications listed here and by the number of computer manufacturers and software firms who have implemented the language in question. The illustrations show FORTRAN to be in the lead at the present time. PL/1 is advancing rapidly, although PL/1 compilers are not yet seen very often outside of IBM.

    Some experts believe PL/1 will replace even the widely-used languages such as FORTRAN, COBOL, and ALGOL. If this does occur, it will surely take some time, as shown by the chronological diagram (illustration 2).

    It would be desirable from the user's point of view to reduce this language confusion down to the most advantageous languages. Those languages still maintained should incorporate the special facets and advantages of the otherwise superfluous languages. Obviously such demands are not in the interests of computer production firms, especially when one considers that a FORTRAN program can be executed on nearly all third-generation computers.

    The titles in this bibliography are organized alphabetically according to programming language, and within a language chronologically and again alphabetically within a given year. Preceding the first programming language in the alphabet, literature is listed on several languages, as are general papers on programming languages and on the theory of formal languages (AAA).
    As far as possible, most of the titles are based on autopsy. However, the bibliographical description of some titles will not satisfy bibliography-documentation demands, since they are based on inaccurate information in various sources. Translation titles whose original titles could not be found through bibliographical research were not included. In view of the fact that many libraries do not have the quoted papers, all magazine essays should have been listed with the volume, the year, issue number and the complete number of pages (e.g. pp. 721-783), so that interlibrary loans could take place with fast reader service. Unfortunately, these data were not always found.

    It is hoped that this bibliography will help the electronic data processing expert, and those who wish to select the appropriate programming language from the many available, to find a way through the language Babel.

    We wish to offer special thanks to Mr. Klaus G. Saur and the staff of Verlag Dokumentation for their publishing work.

    Graz / Austria, May, 1973