UL/1(ID:5347/ul:001)

Early non-procedural query language 


for User Language/1

Non-procedural language for retrieving information from data

T William Olle RCA 1967




Related languages
INFOL => UL/1   Evolution of

References:
  • Olle, William "UL/1: a non-procedural language for retrieving information from data bases" Abstract: The differences between procedural and non-procedural languages are discussed and a case made for a continued trend towards the non-procedural. Three levels of user interface are defined for the information retrieval language and the role of each user discussed. The main part of the paper describes a user-oriented language developed for data base problems. It has four divisions: establishment, interrogation, update and revision. An important component of all divisions, called the criterion language, is used for retrieval, validation and update. The criterion language permits a user to express conditions on several properties of a data item, including existence, value, picture, length and content. Extract: INTRODUCTION
    1. INTRODUCTION
    The diversity of terminology in current use causes earlier work in the area discussed in this paper to be classified under a number of headings. Generalized file processing, data management, data base management and information retrieval are phrases all of which are used as descriptive terms for software products which have contributed to the concepts to be discussed.
    Data management is perhaps the most ambiguous term since it is used by IBM in the 360 operating system concepts [1] to refer to that part of the operating system which handles the transfer of data files and data records without specific regard for their information content. On the other hand, the Systems Development Corporation's time-shared data management system [2-4] is a system which is designed to permit a non-programmer to extract any information contained in a data base without giving him any control over the way in which data are stored or transferred in the data base. To avoid confusion, the term data base management is preferred to describe the SDC system, reserving data management for the IBM usage.
    Extensive work has been done in the area of data base management and the most useful evaluations are contained in the proceedings [5,6] of the two Systems Development Corporation symposia on data base management. This author, in an earlier paper [7], classified systems which could be called generalized first in terms of the logical level of enquiry possible and secondly, for systems which are on the most complex level of generality, in terms of the five main design features: language, data structure, implementation mode, data base storage medium and internal data organization.  The latter feature is referred to as storage structure throughout the paper.
    Extract: DATA DESCRIPTION
    2. DATA DESCRIPTION
    Although the present paper is primarily concerned with the language features, data structure and storage structure require an explanation in order to clarify the language discussion. Historically, using procedural languages, there has been an evolvement along functional lines. A program to carry out a specific function has one or more files associated with it. It is necessary that the program contains a description of the data as it is organized in each file. Hence, the description of the data resides completely with the program.  Each time the file is accessed, the data description must be fed in; whether or not it is compiled each time is immaterial.
    The procedural language programmer is completely aware of the storage structure, since he is required to specify it in his program.  Therefore, it has never been necessary to identify or separate the meanings of internal storage structure and external data structure.  The pioneering work of Bachman and Williams with IDS [8] over the past five years has set the trend in the direction of a separation. In IDS/COBOL [9], as implemented for the General Electric 600 series, part of the data description is stored with the file and part with the COBOL program. Although some IDS users have developed schemes for storing the whole data description with the file, the COBOL programmer who wishes to optimize his use of the file needs to understand how the IDS scheme is implemented. IDS, which was conceived to permit more effective use of disc storage, establishes the trend away from the purely functional. There may be several programs, and hence several programmers, using the same data base. The COBOL language is enhanced with certain non-procedural commands to permit the programmer to access records through pre-stored chains.
    Other systems such as TDMS [2-4], GIS [10, 11], INFOL [12, 13], MANAGE [14], and the Bolt, Beranek and Newman system [15] go further away from functionalism by having the complete data description stored with the file, so that it is no longer necessary for each person accessing the file to specify its description. Data description data are best thought of as all data in the file which are not part of the actual file data but which help to organize the actual file data. Data description data would include pointers, links, separators, primary and secondary indexes and all the representations of which a user need not be aware in order to access a file using a non-procedural language.
    Data structure and storage structure are separate concepts. The first is something of which the user must be aware. The fact that a given data item is part of a repeating group, which may occur several times for each occurrence of a higher level group, is a data structure concept. The sophisticated user may find it useful to be aware of certain facets of storage structure, for instance the fact that a secondary index has been formed and is maintained with respect to a given data item, facilitates a rapid answer to questions based on that data item.
    Different systems give the user different levels of control over the storage structure.  IDS/COBOL, being an essentially procedural language, gives the user almost complete control. INFOL, TDMS, and the Bolt, Beranek and Newman system, which are completely non-procedural, give the user no control. In many applications, this is not a serious restriction. Responsibility for an efficient storage structure rests with the implementors of the above systems and the non-procedural programmer/user does not necessarily have to concern himself with it.
    Extract: LANGUAGE COMPONENTS
    5.  LANGUAGE COMPONENTS
    A language, called UL/1, embodying the principles discussed above has been designed and is the main topic of this paper. UL/1 stands for User Language/1 where the name is intended to emphasize that the language is not designed for programmers but for non-programmers, more conveniently referred to as users.
    Although the principal components of the language have already been mentioned in the preceding discussion, they are described in more detail in this section. It is convenient to follow COBOL in regarding the complete language as consisting of a number of divisions, with each division having several sections.
    The divisions are establishment, interrogation, update and revision, where establishment and revision can be regarded as privileged and for the possible exclusive use of the data base administrator.  These division names follow those of INFOL [12] where they are called major phases.
    Unlike in COBOL, certain sections can be used in two or more divisions. The concept of a program comprising all of the four divisions, as in COBOL, does not hold either. A run may consist of the use of one of the four divisions although it is reasonable to implement in such a way that interrogation and update can be carried out in one access to a file.  This was, in fact, implemented in INFOL for the Control Data 3600 and 3800.
    5.1.  Establishment
    As described in section 4, establishment is a formal file-oriented process which results in a file being added to a data base in a form standard to the system. Standard in this sense means that the data description data are stored with the file - or at least in the data base - so that it can be used in an interrogation or update to  the file. The establishment division consists of sections to specify an identification for each data item, a type such as alphanumeric or numeric for each data item, a validation criterion to edit data entering the file, lists of the expanded forms for certain data items which may be abbreviated in the file in a coded form and a specification of the data structure relationships between data items. Establishment also requires the provision of the set of data records which comprise the first edition of the file.
    5.2.  Interrogation
    The process of interrogation is that most frequently described in papers dealing with generalized data base systems, and is hence the easiest to discuss. In a truly generalized system, the user should be able to ask any questions of the data base.  Simply expressed, this involves placing criteria on any data item in the record and extracting any set of data items for the records which satisfy the overall criterion specified. The extraction process may result in a report containing values extracted from the data base, a frequency count of the number of item values in certain classes, or, on the simplest level, a count of the number of records which satisfy the criterion placed on the record.
    When values are extracted, as opposed to counts of values, they may be included in a printed report. As indicated in section 4, the format of the report may be standard for the system or it may be user-specified.  Furthermore, the extracted values may be included in a mechanized sub-file which is to be used as input to a program written in a procedural language such as COBOL or FORTRAN.
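The three outcomes of an interrogation described above (extracted values, a frequency count, and a simple record count) can be illustrated with a minimal Python sketch. It is not UL/1 syntax: the record layout (one dictionary per record), the function names `interrogate`, `count_matching` and `frequency_count`, and the sample file are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample file: each record is a dict of data items.
EMPLOYEES = [
    {"NAME": "ADAMS", "AGE": 34, "JOB": "ENGINEER"},
    {"NAME": "BAKER", "AGE": 51, "JOB": "CLERK"},
    {"NAME": "COOK", "AGE": 45, "JOB": "ENGINEER"},
]


def interrogate(records, criterion, items):
    """Extract the named items from every record satisfying the criterion."""
    return [{name: rec.get(name) for name in items}
            for rec in records if criterion(rec)]


def count_matching(records, criterion):
    """Simplest level: count the records satisfying the criterion."""
    return sum(1 for rec in records if criterion(rec))


def frequency_count(records, criterion, item):
    """Count how often each value of one item occurs among matching records."""
    return Counter(rec[item] for rec in records
                   if criterion(rec) and item in rec)


def over_40(rec):
    """Example retrieval criterion: AGE greater than 40 (absent AGE fails)."""
    return rec.get("AGE", 0) > 40
```

Here `interrogate(EMPLOYEES, over_40, ["NAME"])` plays the role of a report request, while `count_matching` and `frequency_count` correspond to the two kinds of count mentioned in the text.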
    5.3.  Update
    The updating of a data file is a process which depends considerably on implementation mode, although the update language need not. Updating may take place on the record level, data item level, or on the level of a character string contained in a data item. The facility to modify on the character string level is one of the links between data base systems and text editing systems. If the data base system can handle long character strings, such as whole documents, and the character level updating facilities are powerful enough, then two systems which are usually separately conceived, such as IBM's Datatext and GIS, could well be merged.
    In updating, which involves deleting or modifying existing records, a record is identified by specifying the value or values of one or more special data items. These data items, usually only one, are such that the value set in each record is unique in the file. Examples of such data items are social security number, employee number and part number. It is a less frequent requirement to be able to update a file selectively by specifying a criterion which a record must satisfy in order for an update to take place. The facilities for specifying an update criterion are identical to those for specifying a retrieval criterion.
    5.4.  Revision
    The concept of revision is the most novel of the four divisions, but it is nevertheless significant and must not be confused with update. It should be regarded as privileged and for possible exclusive use by the data base administrator. It is closely allied to establishment in the sense that many of the sections are the same. In revision, the user may add or subtract data items for each record.  This is not the same as changing the values of already defined data items, which is an update function. Another revision function may be to alter the structure as was previously defined in the establishment division or in a previous revision. It is also possible to redefine the validation criterion which was previously specified. Revision does not require data to be specified to the file as is done in establishment and update.  The revision division may result in a change to the data description data stored with the file. It may also result in the removal of actual data from the file if certain data items are removed. A change to the validation criterion which makes the criterion more stringent can have the result that data previously put in the file no longer satisfies the validation criterion.  This is a problem for the data base administrator. Any data which is considered invalid can be changed or removed in an update division. Conceptually, however, it is desirable that a revision division, which changes the validation criterion, should result in a report listing all data in the file which have become invalid. Extract: CRITERION LANGUAGE
    6.  CRITERION LANGUAGE
    An important section which may be used in all divisions is the criterion language. In the above description of the separate divisions, reference was made in the discussion on establishment and revision to a validation criterion, in the interrogation division to a retrieval criterion and in the update division to an update criterion.  The fact that the facilities for specification of update criteria can be the same as those for retrieval criteria, was recognized in the design of INFOL, TDMS, and apparently in GIS. However, in each case there are separate facilities for validation, or editing as it is often called, for data entering the file. This is a considerable waste both in terms of language design, implementation and loss of potential power to the user. Most systems validate data items in terms of their form using an editing mask or picture, while retrieval and updating take place depending on the satisfaction of conditions placed on value. Logical complexity of concatenated criteria is usually possible for retrieval, and therefore for update, but not in validation. By incorporating into the criterion language the facility for placing criteria on form as well as value, a powerful language is available for all three purposes.
    To outline in more detail the concepts embodied in this idea of a criterion language, the more important properties of a data item, on which conditions may be placed, are described.
    6.1. Existence
    Before a condition on other properties of a data item can be evaluated for a record, it must be certain that it is present in that record. In validation, an existence criterion would mean that the data item is required to exist for the record to enter the file. Assuming the very desirable language and system property of handling incomplete data, then there is a requirement for specifying explicit conditions on the existence and non-existence of data items. In validation, an existence criterion, logically connected with criteria on values and on other properties, is a powerful tool for ensuring the correctness of the data in the file.
    6.2.  Value
    Procedural languages have been developed largely for the handling of values of data items. The value is indeed the most important property. If a data item has no value, then conceptually it also lacks most other properties. Criteria on the value of a data item are relational, which means that the relationship of the data item to a user-specified reference quantity is tested.  To specify such a criterion, six standard relational operators are permitted:
    equals                     EQ
    does not equal             NE
    greater than               GT
    less than                  LT
    greater than or equal to   GE
    less than or equal to      LE
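As an illustration only, the existence and value criteria can be sketched in Python by mapping the six operator names onto the standard comparisons. The dictionary record layout and the names `RELATIONAL`, `exists` and `value_criterion` are hypothetical, and the rule that a record lacking the item fails every value criterion is an assumption drawn from the remark that presence must be established before other properties are tested.

```python
import operator

# The six relational operators listed above, mapped to Python comparisons.
RELATIONAL = {
    "EQ": operator.eq, "NE": operator.ne,
    "GT": operator.gt, "LT": operator.lt,
    "GE": operator.ge, "LE": operator.le,
}


def exists(record, item):
    """Existence criterion: the item is present in the record."""
    return item in record and record[item] is not None


def value_criterion(record, item, op, reference):
    """Value criterion: compare the item with a user-specified reference.

    Assumption: a record lacking the item satisfies no value criterion.
    """
    if not exists(record, item):
        return False
    return RELATIONAL[op](record[item], reference)
```

For example, `value_criterion({"AGE": 34}, "AGE", "GE", 30)` holds, while the same test against a record with no AGE item does not.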
    6.3.   Picture
    Criteria on a data item's picture are most useful to the data base administrator in developing a validation criterion which each record must satisfy for admission to the file. However, the facility to give a choice of two or more pictures, one of which a data item must satisfy, is achieved by using the logical connector OR which is required in the criterion language in any case. This effect can be achieved in the Bolt, Beranek and Newman system [15]. A picture criterion must use only the relational operator EQ.
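The extract does not define the picture notation, so the following Python sketch assumes a COBOL-like alphabet (9 for a digit, A for a letter, X for any character, every other character standing for itself). The names `matches_picture` and `matches_any_picture` are hypothetical. It shows a single picture test corresponding to EQ, and a choice of pictures formed with OR.

```python
import re

# Assumed picture alphabet (COBOL-like, not taken from the paper):
# 9 = digit, A = letter, X = any character; anything else is literal.
_PICTURE_CLASSES = {"9": r"[0-9]", "A": r"[A-Za-z]", "X": r"."}


def matches_picture(value, picture):
    """Return True if the string value has exactly the form given by picture."""
    pattern = "".join(_PICTURE_CLASSES.get(ch, re.escape(ch)) for ch in picture)
    return re.fullmatch(pattern, value, flags=re.DOTALL) is not None


def matches_any_picture(value, pictures):
    """Choice of pictures: the logical OR of several EQ picture criteria."""
    return any(matches_picture(value, p) for p in pictures)
```

Under these assumptions, `matches_picture("123-45-6789", "999-99-9999")` holds, and `matches_any_picture("AB12", ["9999", "AA99"])` holds because the second picture is satisfied.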
    6.4.  Length
    Criteria on the length of a data item are useful in the validation criterion and also to the data base administrator who can obtain information about the distribution of lengths of data items in the file in order to exercise available facilities for organizing the file. Being able to place criteria on the length of a data item implies that the system handles variable length data. The length of a data item may vary within a maximum prespecified by the data base administrator in the validation criterion, otherwise within the overall limits imposed by the language implementation. A length criterion may use any of the six standard relational operators.
    6.5.  Repeats
    Some data items are single-valued, such as a date of birth or an employee number; others are multiple-valued, such as a skills profile or a descriptor set.  The number of values or number of repeats in a multiple-valued item is identified as the repeats.  In INFOL, this was called TOTAL which caused confusion since it implied some form of summation. In COBOL, PL/1 and GIS the maximum permitted number of repeats of a multiple-valued item or of a repeating group has to be specified. Conceptually there is no reason why specification of a maximum should be required, although it is easier to implement.  The data base administrator may limit the number of repeats for a multiple-valued data item in the validation criterion. A specifier user may place a retrieval criterion on, for example, the number of languages spoken by an employee where this number is not an explicit data item. Again any of the six relational operators may be used.
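A criterion on repeats can be sketched in Python as a count of the values held by an item, compared with a reference using any of the six operators. The representation of a multiple-valued item as a list, and the names `repeats` and `repeats_criterion`, are assumptions for illustration.

```python
import operator

RELATIONAL = {
    "EQ": operator.eq, "NE": operator.ne,
    "GT": operator.gt, "LT": operator.lt,
    "GE": operator.ge, "LE": operator.le,
}


def repeats(record, item):
    """Number of values held by an item: 0 if absent, 1 if single-valued."""
    value = record.get(item)
    if value is None:
        return 0
    if isinstance(value, (list, tuple, set)):
        return len(value)
    return 1


def repeats_criterion(record, item, op, reference):
    """Compare the number of repeats of an item with a reference count."""
    return RELATIONAL[op](repeats(record, item), reference)
```

With a record such as `{"NAME": "ADAMS", "LANGUAGES": ["FRENCH", "GERMAN", "RUSSIAN"]}`, the test `repeats_criterion(record, "LANGUAGES", "GE", 2)` selects employees speaking at least two languages, although no such count is stored as a data item.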
    6.6.   Content
    Criteria on the content of the value of a data item are not particularly useful in validation criteria. In interrogation, and to a much lesser extent in update, the facility to ask whether a value, such as that of document title, contains some substring is extremely useful.  This facility is available in GIS, but not in INFOL or TDMS. As with existence criteria described earlier, it is important to be able to state the condition both positively and negatively. Relational operators are not used with content criteria but with the special forms CONTAINS and DOES NOT CONTAIN.
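The two content forms can be sketched in Python as substring tests. The function names `contains` and `does_not_contain` are hypothetical, and the choice that a record lacking the item satisfies neither form is an assumption, since the extract does not say how content criteria treat absent items.

```python
def contains(record, item, substring):
    """CONTAINS: the item exists and its string value includes substring."""
    value = record.get(item)
    return isinstance(value, str) and substring in value


def does_not_contain(record, item, substring):
    """DOES NOT CONTAIN: the item exists and its value lacks substring.

    Assumption: an absent item satisfies neither form.
    """
    value = record.get(item)
    return isinstance(value, str) and substring not in value
```

Keeping the negative form as a separate function, rather than negating `contains`, makes the treatment of absent items explicit.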
    6.7.  Other properties
    There are other properties on which it is useful to place criteria.  Most of these are properties of multiple-valued items only. One facility is the ANY facility available in INFOL, which reduces the effort required to specify certain otherwise lengthy criteria.  For example, to find a document having any two of a set of four classification descriptors would require the listing of the six different pairings of descriptors possible. It is also desirable to place criteria on quantities derived from the set of values of multiple-valued numeric items. Such quantities may be derived using a system-supplied procedure such as SUM or MEAN exactly as in INFOL.  In addition, a user-supplied procedure may be invoked as the subject of the criterion.
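The saving offered by an ANY facility, and the use of a derived quantity as the subject of a criterion, can be sketched in Python. The names `any_k`, `expanded_pairings` and `derived_criterion` are hypothetical and do not reproduce INFOL or UL/1 syntax. "Any two" is read here as "at least two", which is what the OR of the six pairings expresses.

```python
import operator
from itertools import combinations
from statistics import mean


def any_k(values, candidates, k):
    """ANY-style criterion: at least k of the candidates occur among values."""
    return len(set(values) & set(candidates)) >= k


def expanded_pairings(candidates, k):
    """The explicit alternatives a user would otherwise have to list."""
    return list(combinations(candidates, k))


def derived_criterion(values, procedure, compare, reference):
    """Criterion on a quantity derived from a multiple-valued numeric item.

    procedure may be a system-supplied function (sum, mean) or any
    user-supplied callable taking the list of values.
    """
    return compare(procedure(values), reference)
```

For four descriptors, `expanded_pairings(["A", "B", "C", "D"], 2)` yields the six pairings mentioned in the text, whereas `any_k` states the same condition in one call. A test such as `derived_criterion([10, 20, 30], mean, operator.eq, 20)` corresponds to a criterion on MEAN.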
          in Morrell, A. J. H. (Ed.): Information Processing 68, Proceedings of IFIP Congress 1968, Edinburgh, UK, 5-10 August 1968
  • Codd, E.F. "A database sublanguage founded on the relational calculus" pp35-68
          in [ACM] Proceedings on the ACM SIGFIDET Workshop on Data Description, Access, and Control, San Diego, California (November 1971)
  • Coulouris, GF; Evans, JM and Mitchell, RW "Towards content-addressing in data bases" Abstract: The characteristics and performance of existing data base management systems are discussed and evaluated, and some benefits to be expected from hardware-aided content-addressing systems are identified. An approach to the design of a hardware-aided content-addressed file system is proposed. Extract: Inverted files
    Inverted files
    Some generalised data management systems (Bleier and Vorhaus, 1968 ; Olle, 1968) have implemented a content-addressing facility on existing direct-access storage using the 'inverted file' technique. A file can be 'inverted' with respect to one or more of the item types appearing in its records. Inversion with respect to an item is performed by the construction of an index table of all values taken by the item. Against each value in the index table is stored a reference to each record having that value. Given a value from the item in question, the records containing that value can be accessed by performing a look-up operation for the value in the index table, then picking up the relevant record references.
    For any given set of applications, a file may have to be inverted with respect to several items. Thus the employee file would have to be 'partially inverted' with respect to the items AGE, SALARY and JOB in order to provide a satisfactory response to queries such as the one given in the example of Section 2. For full flexibility of record access however, a file must be inverted with respect to all of its item types. Such a file is said to be 'totally inverted'.
    The index tables associated with an inverted file must be stored and accessed each time the file is used. These tables are frequently of greater size than the original file. Update operations on inverted files are costly because they involve modifications to several index tables.
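The inverted file technique described in this extract can be sketched in Python as follows. The record layout, the sample employee file and the names `invert`, `partially_invert` and `lookup` are assumptions for illustration, and a record reference is taken to be simply the position of the record in the file.

```python
from collections import defaultdict


def invert(records, item):
    """Build an index table: each value of item -> list of record references."""
    index = defaultdict(list)
    for ref, record in enumerate(records):
        if item in record:
            index[record[item]].append(ref)
    return dict(index)


def partially_invert(records, items):
    """Invert the file with respect to several items, one table per item."""
    return {item: invert(records, item) for item in items}


def lookup(records, index, value):
    """Look up a value in an index table and fetch the referenced records."""
    return [records[ref] for ref in index.get(value, [])]


# Hypothetical employee file, partially inverted on AGE and JOB.
EMPLOYEES = [
    {"NAME": "ADAMS", "AGE": 34, "JOB": "ENGINEER"},
    {"NAME": "BAKER", "AGE": 51, "JOB": "CLERK"},
    {"NAME": "COOK", "AGE": 34, "JOB": "ENGINEER"},
]
TABLES = partially_invert(EMPLOYEES, ["AGE", "JOB"])
```

Each inverted item adds a table that must be kept in step with the file, so a change to one record's AGE or JOB means editing the corresponding table as well as the record, which is the update cost noted in the text.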
          in The Computer Journal 15(2) 1972