Mercury(ID:5559/mer002)Query language for Jupiter source code repositories References: Extract: Introduction Introduction When program maintainers are unfamiliar with the code undergoing maintenance, they must first explore the code to gain the knowledge necessary to perform their task. There are several manners in which this program comprehension can be performed. Maintainers can use a suite of specialized exploration tools, each with its own unique capabilities and interface. Alternatively, code can be entered into a database, or source code repository, and explored by querying the database. When the repository's query facilities are sufficient to support maintainer's needs, the latter approach is superior as only a single tool needs to be obtained, learned and applied. For a source code repository to be useful for program comprehension it must provide maintainers with easy to use query facilities that permit the posing of any conceivable question. Furthermore, the repository must store sufficient information to answer these queries without imposing restrictions on the amount of code or information stored. More concisely, the repository must demonstrate the properties of accessibility, representability and scalability. Accessibility refers to the capabilities of the repository's query language. The language must be capable of accessing the data and describing any manipulations needed to satisfy maintainers' needs. If the query solution is not in the repository then there must be a mechanism to construct it, when possible, using the information that is in the repository. A repository has representability when it can store any of the diverse range of information extractable from source code. Examples of extractable information are abstract syntax, type assignments and control-flow. Finally, a repository is scalable when there are no restricting limitations on the amount of extracted information or code that is stored. To satisfy these three requirements, we have developed the Jupiter source code repository system. Using Maia, a Model based on Annotations, Indices and Attributes, Jupiter can store a wide variety of extractable information. Maia supplements source code with annotations -- a form of markup that uses indices to reference other code fragments or attribute values. Annotations are not required to satisfy any data type definition or other layout requirements. Thus, Maia avoids representational limitations by not imposing any syntactic or semantic requirements on annotations. To access information in Jupiter a functional query language called Mercury is provided. Mercury, created from the merger of an existing query language, GCL, and a programming language, Scheme, can easily access stored data and then manipulate it algorithmically. Information that the repository constructor did not choose to extract is therefore obtainable using Mercury. Jupiter is based on an extended version of the MultiText structured text database system that has been used to support large digital library and WWW search applications. Using a specialized index and adaptive algorithms, simple queries can be executed in under a second over 100 GB of data. As a result of using a large capacity database, Jupiter avoids any database related scalability issues. While Jupiter is usable on its own as a program comprehension tool it can also function as a backend for other tools. For example, MARS, the predecessor to Jupiter, was used to extend the Software Bookshelf with search capabilities. The use of XML to support software maintenance is currently being investigated. Formats, such as GXL, have been proposed for the representation of source code and information extracted by program analysis. As XML style markup can be viewed as structured text, Jupiter is capable of providing a low-level database implementation for these representations. Furthermore, Jupiter's representational capability exceeds that of XML since data with partial or irregular structure is also representable. The remainder of this paper describes the Jupiter system and its use for program comprehension. In particular, we will focus on the use of Mercury for posing queries that occur as part of the comprehension process. A brief description of Maia is first provided so that Mercury programs can be understood. Extract: Related Work Related Work The use of markup to support software engineering has been previously examined. Cross and Hendrix used GRASP-ML, an SGML based markup language, to identify the syntactic structure of Ada code and enable the extraction of control structure diagrams. Cowan et al. also used SGML to describe the syntactic structure of code but for the purpose of supporting comprehension of C code by enhancing the code's readability. Neither approach investigated the use of a queryable database. CHIME is a tool for adding hyperlinks to source code so that the code may be browsed using existing WWW browsers. In CHIME, only relationships are identified as no facilities are provided to markup structural entities within the code. All hyperlinks are inserted on demand and not stored with the source code. HyperSoft is a program maintenance tool for large legacy systems based on a hypertext model for structured text. In HyperSoft markup is dynamically added to source code according to the user's selection of an `access structure'. One such access structure is a function calling structure that links function names in calls to their names in definitions. For each file in the system, its abstract syntax tree and symbol table are stored in a text file and used to dynamically generate hyperlinks for the selected access structure. JavaML represents a Java program using XML. While `marking up' the textual representation was considered, JavaML avoided this approach in the belief that XML tools are better at accessing markup than the text content. Instead, a JavaML representation identically matches the abstract syntax tree of the program, with attributes used to store the text that formulates each leaf of the tree. JavaML is a representation system designed to facilitate the use of other XML tools on Java code and does not provide any tools of its own, apart from conversion utilities. HSML, the Hot Spot Markup Language, uses a similar philosophy in locating the results of analyses with respect to their originating locations in computer source code. This approach is advocated as software maintenance invariably requires modification of the source code and not of some high level representation such as graphs or diagrams. Malton et al. stress their belief in viewing source code as text. Their use of markup is to decompose a source file into a set of factors such as comments, code, and compiler directives. As is the case with Jupiter's annotations, factors may be improperly nested making them unsuitable for representation using XML. Jupiter should be viewed as an implementation model for these works and, with respect to HSML, as providing a finer level of representational granularity -- character as opposed to line position. Graphs have been widely explored as a technique for the representation of information extracted from programs. GRAS, a graph-oriented database system explicitly designed for software engineering, uses a formal specification language called PROGRES to describe attributed graphs. GRAS has been used in Rigi and can be represented in ASCII using Rigi Standard Format (RSF). GraX relies on directed graphs with attributed and typed edges (TGraphs). GUPRO is a repository system based on GraX with a customized query language called GReQL. GReQL suffers the same limitation as other specialized query languages in that it cannot express data manipulations to be performed after query solution. GUPRO's prime difference to Jupiter is in the data model used in the database. Using an "object-based repository'', GUPRO requires the definition of a "conceptual model'' (database schema) fully describing the data. If one wants to enter source code from 3 languages, then one must unify all extracted information such that it can be described using a single schema. Jupiter is more flexible with its lack of any required annotation schema. The Portable Bookshelf uses an ASCII representation called Tuple-Attribute language (TA) to record the typed graphs used internally. Recently, a graph-based format in XML known as GXL has been proposed. GXL can be viewed as a merger of GraX, PROGRES, TA and RSF. The graph database system GraphLog has also been used to store and query source code. In general, a graph based representation for source code becomes inadequate when multiple sets of information are extracted from the program. When multiple graphs (or trees) exist that do not have common nodes it is harder to properly represent the two graphs. As well, maintenance programmers modify source code, not graphs, adding the need to maintain a map between the graph representation and the code. The Software Refinery provides much of the same functionality as Jupiter, with the added advantage of being an established commercial product. Refine differs in that it does not focus on the relationship between extracted information and locations in the source. Using an object-oriented database, Refine stores graphical data such as abstract syntax and control flow. Jupiter differs in using a textual model and mapping all extracted data, such as abstract syntax, type assignments and control flow, back into the originating text. Jupiter also permits the storage and querying of unparsed material such as comments or programmer attached annotations. ACACIA uses ASCII files to store data. The cql query language is used for access and serves as an interface between the database and the Ciao graphical repository navigator. ACACIA requires that extracted data match a predefined entity-relationship based model, specifically designed for C++. Jupiter is more general in applicability being language independent. Instead of focusing on the data representation, it is possible to design a repository around its query facilities. SCA and PQL are examples of custom query languages explicitly designed for querying source code repositories. The main drawback of this approach is the requirement for the language designer to anticipate the needs of future users. By providing both a general purpose text retrieval language and a functional programming language, Mercury avoids this issue. Similar to Mercury, Haskell/DB is formed by merging a query language, in this case an SQL compiler, with a functional programming language. Using monad comprehensions, the two are merged in a manner that permits type-checking of the SQL components. Haskell/DB has not been used for source code. GENOA is a system that permits flexible analysis scripts to be written. Script writers only have access to the code's abstract syntax since flow and relationship information is not available. The lack of a database component forces GENOA to parse source code every time an analysis script is executed. Jupiter also differs from GENOA by having no limitations imposed by file boundaries; the entire database may be consulted during execution of a Mercury program. |