CPL(ID:6870/cpl009)

Querying language for Kleisli 


Functional query language for the Kleisli system

Kleisli commercialized as KRIS from Kris Technology Inc Menlo Park, CA 94025, USA



Related languages
Machiavelli => CPL   Influence

References:
  • Davidson, Susan; Overton, Chris; Tannen, Val; Wong, Limsoon "BioKleisli: A Digital Library for Biomedical Researchers" Journal of Digital Libraries 1997 view details
  • Buneman, Peter; Crabtree, Jonathan; Davidson, Susan; Tannen, Val; Wong, Limsoon "BioKleisli" Bioinformatics, edited by S. Letovsky (Kluwer Academic Publishers) 1998 view details
  • Jing Chen, Limsoon Wong, Louxin Zhang "A protein patent query system powered by Kleisli" pp593-595 view details DOI
          in ACM SIGMOD International Conference on Management of Data and Symposium on Principles of Database Systems Seattle, Washington, United States (PODS 98) view details
  • Limsoon Wong "Kleisli, a functional query system" Journal of Functional Programming 10(1) January 2000 pp19-56 view details Abstract: Kleisli is a modern data integration system that has made a significant impact on bioinformatics data integration. This paper contains a brief introduction to the Kleisli system and an example to illustrate its uses in the bioinformatics arena. The primary query language provided by Kleisli is called CPL, which is a functional query language whose surface syntax is based on the comprehension syntax. Kleisli is itself implemented using the functional language SML. So this paper also describes the influence of functional programming research that benefits the Kleisli system, especially the less obvious ones at the implementation level
          in ACM SIGMOD International Conference on Management of Data and Symposium on Principles of Database Systems Seattle, Washington, United States (PODS 98) view details
  • Limsoon Wong "The functional guts of the Kleisli query system" pp1-10 view details Abstract: Kleisli is a modern data integration system that has made a significant impact on bioinformatics data integration. The primary query language provided by Kleisli is called CPL, which is a functional query language whose surface syntax is based on the comprehension syntax. Kleisli is itself implemented using the functional language SML. This paper describes the influence of functional programming research that benefits the Kleisli system, especially the less obvious ones at the implementation level. DOI Extract: Introduction
    Introduction
    The Kleisli system is an advanced broad-scale integration technology that has proved useful in the bioinformatics arena. Many bioinformatics problems require access to data sources that are high in volume, highly heterogeneous and complex, constantly evolving, and geographically dispersed.
    Solutions to these problems usually involve multiple carefully sequenced steps and require information to be passed smoothly between the steps. Kleisli is designed to handle these requirements directly by providing a high-level query language, CPL, that can be used to express complicated transformation across multiple data sources in a clear and simple way.
    Many key ideas in the Kleisli system are influenced by functional programming research, as well as database query language research. Its high-level query language CPL is a functional programming language that has a built-in notion of "bulk" data types suitable for database programming and has many built-in operations required for modern bioinformatics. Kleisli is itself implemented on top of the functional programming language Standard ML of New Jersey (SML/NJ). Even the data format that Kleisli uses to exchange information with the external world is derived from an idea in type inference.
    This paper provides an overview of the Kleisli system; a summary of its impact on query language theory; and a description of the in uence of functional programming research that benefits the Kleisli system, especially the less obvious ones at the implementation level. The organization of the paper is as follows. Section 2 has three subsections that give an overview of the architecture, data model, and query language (CPL) of Kleisli. Section 3 also has three subsections that single out three areas in the implementation of Kleisli and discuss how they are influenced by functional programming research. In particular, how type inference gives rise to Kleisli's self-describing data exchange format, how monad gives rise to Kleisli's internal abstract representation of queries and simple optimization rules, and how higher-order functions give rise to a simple implementation of Kleisli's powerful optimizer. Section 4 discusses the impact of Kleisli on bioinformatics data integration. In particular, the first Kleisli query written for this purpose is reproduced here to illustrate the smoothness of Kleisli's interface to relational and non-relational bioinformatics sources and its optimizations.
    This paper is largely extracted from a much longer and detailed paper in J. Funct. Prog. Additional topics discussed in the longer paper include laziness, concurrency, "dependency-like" typing, and optimizations for relational databases. Extract: Collection Programming Language
    Collection Programming Language
    The syntax of CPL is similar to that of the ODMG standard for object-oriented database languages. An interesting feature of the syntax of CPL is the heavy use of the comprehension syntax, which showed up long ago in functional programming languages and later formalized by Wadler. A typical comprehension in CPL syntax is {x * x | \x <- S, odd(x)} which returns a set consisting of the squares of all odd numbers in the set S. This is similar to the notation found in functional languages, the main difference being that the binding occurrence of x is indicated by preceding it with a backslash, and that the expression returns a set rather than a list. As in functional languages, \x <- S is called a "generator", and odd(x) is called a "filter."
    Rather than giving the complete syntax, we illustrate CPL by a few examples on a set of feature tables DB. Example 2.3. This query extracts the titles and features of those elements of DB whose titles contain tyrosine as a substring.
    { (#title: x.#title, #feature: x.#feature)
    | \x <- DB, x.#title string-islike "%tyrosine%" };
    This query is a simple project-select query. A project-select query is a query that operates on one (flat) relation or set. Thus the transformation that such a query can perform is limited to selecting some elements of the relation and extracting or projecting some fields from these elements. Except for the fact that the source data and the result may not be in first normal form, these queries can be expressed in a relational query language. However, CPL can perform more complex restructurings such as nesting and unnesting not found in common relational database languages like SQL, as shown in the following examples.

    Example 2.4. This query flattens DB completely. The \a <--- f.#anno has similar meaning to \x <- DB, but it works on list instead of set. Thus it binds a to each item in the list f.#anno.
    {(#title:x.#title, #feature:f.#name,
    #start:f.#start, #end:f.#end,
    #anno-name:a.#anno_name, #anno-descr:a.#descr)
    | \x <- DB, \f <- x.#feature, \a <--- f.#anno};

    Example 2.5. This query demonstrates how to do nesting in CPL. The subquery DB' is the restructuring of DB
    by pairing each entry with its source organism. The sub-query ORG then extracts all organism names. The main query
    groups entries in DB' by organism names. It also sorts the output list by alphabetical order of organism names, because [u | \u <- ORG] converts the set ORG into a duplicate-free sorted list.
    let \DB' == {(#entry:x, #organism:a.#descr)
    | \x <- DB, \f <- x.#feature, \a <--- f.#anno,
    a.#anno_name = "organism"} in
    let \ORG == {y.#organism | \y <- DB'}
    in [(#organism:z, #entries: {v.#entry
    | \v <- DB', v.#organism = z})
    | \z <--- [u | \u <- ORG]];

    The inspiration for CPL came primarily from [6] that presented structural recursion as a query language. However, structural recursion has two difficulties. The first is that not every syntactically acceptable structural recursion program is logically well defined [7]. The second is that structural recursion has too much expressive power because it can express queries that require exponential time and space. While programming languages always take Turing completeness for granted, the attitude in database programming is radically di erent. In the context of querying databases, due to their immense size, queries are restricted to those which are practical in the sense that they should be within a low complexity class such as LOGSPACE, PTIME, or TC0.
    In fact, one may even want to prevent any query that has worse than O(n . log n) complexity, unless one is confident that the query optimizer has high probability of optimizing the query to no more than O(n . log n) complexity. Thus database query languages such as SQL are designed in such a way that joins are easily recognized, as joins are the only operations in a typical database query language that require O(n2) complexity if evaluated naively.
    Thus Tannen and Buneman suggested a natural restriction on structural recursion to reduce its expressive power and to guarantee its well-definedness. Their restriction cuts structural recursion down to homomorphisms on the commutative idempotentmonoid of sets, revealing a telling correspondence to monads [33]. A nested relational calculus, which is denoted here by NRC, was then designed around this restriction [8]. NRC is essentially the simply-typed lambda calculus extended by a construct for building records, a construct for decomposing records by field selection, a construct for building sets, a construct for decomposing sets by means of the restriction on structural recursion. Specifically, the construct for decomposing sets is Sfe1 j x 2 e2g, which forms a set by taking the big union of e1[o=x] over each o in the set e2. NRC (suitably extended) is implemented by the NRC Module of Kleisli and is the abstract counterpart of CPL, a la Wadler's equations relating monads and comprehensions[33].
    In order to show that NRC is a good basis for a query language, its relationship to existing query languages must be investigated. Furthermore, it has to enable solution to existing open problems in query language theory, it has to enable generalization of existing results in query language theory, it has to facilitate practical implementation, it has to allow for good query optimization, and it has to enable new applications.

          in Proceedings of the fifth ACM SIGPLAN International Conference on Functional programming 2000 view details
    Resources