Mercury(ID:5559/mer002)

Began: 1999

Query language for Jupiter source code repositories

References:

Anthony Cox and Charles L. A. Clarke, "Representing and Accessing Extracted Information", ICSM pp 12-21 2001 view details Abstract: Source code repositories best support program comprehension tasks when they can successfully answer the many questions a maintainer conceives. To provide the flexibility needed to answer these questions, the Jupiter repository  system has been developed. Jupiter, using Maia, a Model  based on Annotations, Indices and Attributes, can store any  of the syntactic, type and flow information extractable from  a program. Jupiter's query language, Mercury, formed by  merging an existing query language with Scheme, is used to  access the repository and manipulate query solutions. Together, these components provide a foundation on which to  build systems for solving the queries that occur during program comprehension.

Extract: Introduction
Introduction
When program maintainers are unfamiliar with the code undergoing maintenance, they must first explore the code to  gain the knowledge necessary to perform their task. There  are several manners in which this program comprehension  can be performed. Maintainers can use a suite of specialized exploration tools, each with its own unique capabilities and interface. Alternatively, code can be entered into a  database, or source code repository, and explored by querying the database. When the repository's query facilities are  sufficient to support maintainer's needs, the latter approach  is superior as only a single tool needs to be obtained, learned and applied.

For a source code repository to be useful for program comprehension it must provide maintainers with easy to  use query facilities that permit the posing of any conceivable question. Furthermore, the repository must store sufficient information to answer these queries without imposing  restrictions on the amount of code or information stored.  More concisely, the repository must demonstrate the properties of accessibility, representability and scalability.  Accessibility refers to the capabilities of the repository's  query language. The language must be capable of accessing  the data and describing any manipulations needed to satisfy maintainers' needs. If the query solution is not in the  repository then there must be a mechanism to construct it,  when possible, using the information that is in the repository. A repository has representability when it can store any  of the diverse range of information extractable from source  code. Examples of extractable information are abstract syntax, type assignments and control-flow. Finally, a repository  is scalable when there are no restricting limitations on the  amount of extracted information or code that is stored.

To satisfy these three requirements, we have developed the Jupiter source code repository system. Using Maia, a  Model based on Annotations, Indices and Attributes, Jupiter  can store a wide variety of extractable information. Maia  supplements source code with annotations -- a form of  markup that uses indices to reference other code fragments  or attribute values. Annotations are not required to satisfy  any data type definition or other layout requirements. Thus,  Maia avoids representational limitations by not imposing  any syntactic or semantic requirements on annotations.  To access information in Jupiter a functional query language called Mercury is provided. Mercury, created from  the merger of an existing query language, GCL, and  a programming language, Scheme, can easily access  stored data and then manipulate it algorithmically. Information that the repository constructor did not choose to extract  is therefore obtainable using Mercury.

Jupiter is based on an extended version of the MultiText structured text database system that has been used to  support large digital library and WWW search applications.  Using a specialized index and adaptive algorithms, simple  queries can be executed in under a second over 100 GB of  data. As a result of using a large capacity database,  Jupiter avoids any database related scalability issues.  While Jupiter is usable on its own as a program comprehension tool it can also function as a backend for other  tools. For example, MARS, the predecessor to Jupiter,  was used to extend the Software Bookshelf with search  capabilities.

The use of XML to support software maintenance is currently being investigated. Formats, such as GXL, have been proposed for the representation of source code and information extracted by program analysis. As XML style  markup can be viewed as structured text, Jupiter is capable  of providing a low-level database implementation for these  representations. Furthermore, Jupiter's representational capability exceeds that of XML since data with partial or irregular structure is also representable.  The remainder of this paper describes the Jupiter system  and its use for program comprehension. In particular, we  will focus on the use of Mercury for posing queries that occur as part of the comprehension process. A brief description of Maia is first provided so that Mercury programs can  be understood. Extract: Related Work
Related Work

The use of markup to support software engineering has been previously examined. Cross and Hendrix used  GRASP-ML, an SGML based markup language, to identify  the syntactic structure of Ada code and enable the extraction  of control structure diagrams. Cowan et al. also used  SGML to describe the syntactic structure of code but for the  purpose of supporting comprehension of C code by enhancing the code's readability. Neither approach investigated the  use of a queryable database.

CHIME is a tool for adding hyperlinks to source code so that the code may be browsed using existing WWW  browsers. In CHIME, only relationships are identified as no  facilities are provided to markup structural entities within  the code. All hyperlinks are inserted on demand and not  stored with the source code.

HyperSoft is a program maintenance tool for large legacy systems based on a hypertext model for structured  text. In HyperSoft markup is dynamically added to source  code according to the user's selection of an `access structure'. One such access structure is a function calling structure that links function names in calls to their names in definitions. For each file in the system, its abstract syntax tree  and symbol table are stored in a text file and used to dynamically generate hyperlinks for the selected access structure.  JavaML represents a Java program using XML.

While `marking up' the textual representation was considered, JavaML avoided this approach in the belief that XML tools are better at accessing markup than the text content.  Instead, a JavaML representation identically matches the  abstract syntax tree of the program, with attributes used to  store the text that formulates each leaf of the tree. JavaML  is a representation system designed to facilitate the use of  other XML tools on Java code and does not provide any  tools of its own, apart from conversion utilities.  HSML, the Hot Spot Markup Language, uses a similar philosophy in locating the results of analyses with respect to their originating locations in computer source code. This approach is advocated as software maintenance invariably requires modification of the source code and not of  some high level representation such as graphs or diagrams.  Malton et al. stress their belief in viewing source code  as text. Their use of markup is to decompose a source file  into a set of factors such as comments, code, and compiler  directives. As is the case with Jupiter's annotations, factors may be improperly nested making them unsuitable for  representation using XML. Jupiter should be viewed as an  implementation model for these works and, with respect to  HSML, as providing a finer level of representational granularity -- character as opposed to line position.

Graphs have been widely explored as a technique for the representation of information extracted from programs.  GRAS, a graph-oriented database system explicitly  designed for software engineering, uses a formal specification language called PROGRES to describe attributed graphs. GRAS has been used in Rigi and can  be represented in ASCII using Rigi Standard Format (RSF).  GraX relies on directed graphs with attributed and  typed edges (TGraphs). GUPRO is a repository system  based on GraX with a customized query language called  GReQL. GReQL suffers the same limitation as other specialized query languages in that it cannot express data manipulations to be performed after query solution. GUPRO's  prime difference to Jupiter is in the data model used in  the database. Using an "object-based repository'', GUPRO  requires the definition of a "conceptual model'' (database  schema) fully describing the data. If one wants to enter  source code from 3 languages, then one must unify all extracted information such that it can be described using a single schema. Jupiter is more flexible with its lack of any  required annotation schema.

The Portable Bookshelf uses an ASCII representation called Tuple-Attribute language (TA) to record the typed graphs used internally. Recently, a graph-based format in XML known as GXL has been proposed. GXL  can be viewed as a merger of GraX, PROGRES, TA and  RSF. The graph database system GraphLog has also  been used to store and query source code.  In general, a graph based representation for source code  becomes inadequate when multiple sets of information are  extracted from the program. When multiple graphs (or  trees) exist that do not have common nodes it is harder to  properly represent the two graphs. As well, maintenance  programmers modify source code, not graphs, adding the  need to maintain a map between the graph representation  and the code.

The Software Refinery provides much of the same functionality as Jupiter, with the added advantage of being  an established commercial product. Refine differs in that it  does not focus on the relationship between extracted information and locations in the source. Using an object-oriented  database, Refine stores graphical data such as abstract syntax and control flow. Jupiter differs in using a textual model  and mapping all extracted data, such as abstract syntax, type  assignments and control flow, back into the originating text.  Jupiter also permits the storage and querying of unparsed  material such as comments or programmer attached annotations.

ACACIA uses ASCII files to store data. The cql query language is used for access and serves as an interface between the database and the Ciao graphical  repository navigator. ACACIA requires that extracted data  match a predefined entity-relationship based model, specifically designed for C++. Jupiter is more general in applicability being language independent.  Instead of focusing on the data representation, it is possible to design a repository around its query facilities. SCA  and PQL are examples of custom query languages  explicitly designed for querying source code repositories.  The main drawback of this approach is the requirement for  the language designer to anticipate the needs of future users.  By providing both a general purpose text retrieval language  and a functional programming language, Mercury avoids  this issue.

Similar to Mercury, Haskell/DB is formed by merging a query language, in this case an SQL compiler, with a functional programming language. Using monad comprehensions, the two are merged in a manner that permits  type-checking of the SQL components. Haskell/DB has not  been used for source code.  GENOA is a system that permits flexible analysis  scripts to be written. Script writers only have access to the  code's abstract syntax since flow and relationship information is not available. The lack of a database component  forces GENOA to parse source code every time an analysis script is executed. Jupiter also differs from GENOA by  having no limitations imposed by file boundaries; the entire  database may be consulted during execution of a Mercury  program.