Palingol(ID:6698/pal008)


Nucleic acids description language

Institut Curie, Paris, France.


References:
  • Billoud B, Kontic M, Viari A "Palingol: a declarative programming language to describe nucleic acids' secondary structures and to scan sequence database" Nucleic Acids Res. 1996 Apr 15;24(8):1395-403 view details Abstract: At the DNA/RNA level, biological signals are defined by a combination of spatial structures and sequence motifs. Until now, few attempts had been made in writing general purpose search programs that take into account both sequence and structure criteria. Indeed, the most successful structure scanning programs are usually dedicated to particular structures and are written using general purpose programming languages through a complex and time consuming process where the biological problem of defining the structure and the computer engineering problem of looking for it are intimately intertwined. In this paper, we describe a general representation of structures, suitable for database scanning, together with a programming language, Palingol, designed to manipulate it. Palingol has specific data types, corresponding to structural elements-basically helices-that can be arranged in any way to form a complex structure. As a consequence of the declarative approach used in Palingol, the user'should only focus on 'what to search for' while the language engine takes care of 'how to look for it'. Therefore, it becomes simpler to write a scanning program and the structural constraints that define the required structure are more clearly identified. External link: online copy Extract: THE PALINGOL LANGUAGE
    THE PALINGOL LANGUAGE
    General principles
    From the user's point of view, a constraint (or logic) programming language (like Prolog) mostly differs from a procedural language (like Fortran or C) in the fact that with constraint programming, the user'should only specify what he or she wants to do, whereas he or she should also specify how to do it in a procedural language. In order to work properly, a logic programming language must have its own general purpose search procedure built into the language, called the language "engine". Palingol is a constraint programming language whose data types and search engine have been particularly adapted for secondary structures. The engine is based on a classical Branch and Bound mechanism whose main outlines will be described next.

    The Branch-and-Bound
    Starting with the list of all helices provided by the HelixSearch program, Palingol's engine first tries to build a list of candidates by evaluating local constraints on each individual helix. Since this list is ordered, this is performed in the following way: (let us suppose, as an example, that we are looking for a structure with three helices) Palingol first looks through the list for a candidate as the first helix, then for the second helix candidate, starting in the list after the first, then for the third, starting after the second etc. Once the third helix candidate has been found, the sub-list of three candidates is then checked against the global constraints. Then the process continues with another third candidate starting just after the previous one. This goes on until the list is exhausted for possible third candidates and the process recurses to the next second candidate and, when the list is exhausted for possible second candidates, to the next first candidates. At this step, we have done an exhaustive exploration of the search space. But this exploration can be efficiently bounded by a simple consideration. Keeping in mind that we are looking for local structures it appears efficient to stop the search for the n th candidate as soon as it is too far away from the ( n -1) th . In a tRNA for instance, if one assumes a maximum intron size of 120 nt, then looking at a potential T[Psi]C arm candidate located 150 bases downstream from the current anticodon arm is totally useless. These additional bounding constraints are called "span constraints" in Palingol. Although, strictly speaking, they are not required in a Palingol program, they are very important, since they can speed up the search process by several orders of magnitude. Thus we shall describe them more precisely later.

    Language syntax
    We will not present all technical details of the language syntax but rather outline its general aspects. The syntax of Palingol is strongly inspired by functional programming languages (like LISP) and is based on parenthesized expressions of the following form:

    (operator argument argument … )

    where "operator" stands for one of the built-in operators of the language and "argument" is either: (i) a constant, (ii) a named variable or (iii) another parenthesized expression (without limitation on the nesting level).

    The number of arguments in an expression depends upon the operator. The `value of an expression' is the result of the operator's action on the arguments and what we call the `type of an expression', is actually the type of its result. In addition to the traditional types (boolean, numerical, string), Palingol makes use of a new and specific type called `physical'. It is intended to represent all the basic information pertaining to an elementary helix that has been computed by the `Helix Search' program. The user can access these data by using three physical constants: head , tail and loop . In addition, the user can gain access to the complete sequence being processed through the physical constant fullsequence .

    Because of the practical importance of these physical elements, we should emphasize now some of the operators that compute various values by acting on them. Two numerical operators start and end take one physical argument and return the position of the physical element in the sequence. For instance: (start head) is a numerical expression, whose value is the position of the first symbol belonging to the head. The string operator sequence takes one physical argument and returns the string of characters composing the physical element. For instance, (sequence tail) is a string expression whose value is the sub-sequence of the tail. Palingol provides a lot of other operators covering a wide range of traditional and more specific operations. Classical boolean and numerical operations ( and , or , add , sub , div etc.) are of course present along with most classic string operations (substring extraction, string searching, etc.). Some biologically specific string operators have been included (complementation, inversion etc.). Finally Palingol also provides still more specific operators like pattern matching with IUPAC codes, or consensus matrix computations. It should be noted that a constraint is always expressed in Palingol as a boolean expression: the constraint is said to be satisfied if the value of the corresponding expression is true. A set of constraints is therefore expressed as logically connected boolean expressions. Since and is the most usual connector, it is considered to be implicit when the boolean expressions are simply juxtaposed. Thus writing

    (expression1)

    (expression2)

    (expression3)

    is equivalent to writing: (and (and (and (expression1) (expression2)) (expression3))) . In a similar way, the parser of Palingol is smart enough to add some missing operators when the context is unambiguous. As mentioned above, the search engine in Palingol tries to satisfy the constraints which have been specified by the user. These constraints are expressed as boolean expressions. When evaluating an expression composed of several sub-expressions, the engine stops as soon as it is sure that the final result will be true or false regardless of the values of the sub-expressions which have not yet been evaluated. This process of stopping evaluation is called pruning. By default the Palingol engine does pruning on and and or expressions. Because of this, it is important to specify the order of evaluation of the expressions: this is from left to right at a given parenthesis level and from deeper to upper parenthesis levels. For instance in (and (and (expression1) (expression2)) (expression3)) , expression1 is evaluated first, then expression2 (if needed), and finally expression3 (if needed). By using the implicit and which has been described above, this just means that the expressions are evaluated from top to bottom. Note that this has an important consequence to the optimization of Palingol programs: if several anded expressions can be evaluated in several different orders, then it is more efficient to place the `strongest' condition first (the strongest condition is the one which is most often false), since the evaluation will stop sooner.

    Palingol allows the user to store intermediate results into variables (whose names start with a $ sign). Two operators are provided to set and get variables: (set $variable value) and (get $variable) . Variables do not need to be declared, their type is automatically determined when set and automatically cast when get. (set $variable value) and (get$variable) are treated as boolean expressions which always have the value true (they are `side effect' expressions). This allows these expressions to be inserted anywhere within an implicit and structure without stopping the evaluation.






    Figure 3 . A Palingol program layout. The main part is composed of three subparts respectively called helix , span and cross part. The helix part(s) specify local constraints acting on each elementary helix, the span part is optional and is used to prune the Palingol engine's search; the cross part specifies global constraints acting across different helices.

    Although this may appear quite contradictory with constraint programming, Palingol also provides control structures, namely the if..then..else and the while structure. It should be pointed out that they are not used to control the engine process itself, but mostly to simplify the writing of constraints.

    Structure of a program
    The overall structure of a Palingol program can be decomposed into three principal parts: (i) the prologue sections (optional), (ii) the main section and (iii) the epilogue sections (optional). The main section is the only required one. It contains the description of the structure and is composed of three parts corresponding to various constraint levels: the helix part(s), the span part and the cross part. This overall layout is given in Figure 3 a. The helix part(s) describe local constraints for each helix in the structure. The helix sections are ordered as indicated above. There must be as many distinct helix sections as there are distinct helices in the required structure. Each helix section is composed of one boolean expression (usually containing several boolean expressions implicitly anded). Once a sub-list of n candidates (where n is the number of helix sections) has been found by the engine, this list is submitted to the global constraints described in the cross section. Again, this section is composed of one boolean expression (usually containing several boolean expressions implicitly anded). Note that there is no specific `print result section', this is done within the cross section, as a side effect of the print operator. Finally, the optional span section has been added to speed up the overall searching process. As previously mentioned, it describes the maximum distance allowed between two consecutive helices. More precisely, the span section takes the special form given in Figure 3 b, where the i th line specifies the maximal distance allowed between the i th and ( i +1) th helix. This distance is computed as the algebraic difference between `start head' of the ( i +1) th helix and %keyword of the i th helix, where `keyword' can be any of start_head , end_head , start_tail or end_tail .






    Figure 4 . The traditional representation and description of an iron responsive element (IRE).


    The optional prologue and epilogue sections are paired, each prologue section having its epilogue counterpart, namely the start / end sections and the before / after section. The start (prologue) and end (epilogue) sections each contain one boolean expression (the value is ignored) which is evaluated once at program startup and ending respectively. This is useful for setting up variables (like counters) or for printing some general information. The before (prologue) and after (epilogue) sections each contain one boolean expression which is evaluated once for each new sequence in the sequence database (i.e. for each new list of helices provided by HelixSearch). Only the value of the before boolean expression is meaningful; if it is false then the main section is not evaluated. This may be used to skip over specific entries in the data bank, by checking that a specific pattern is present in the sequence, for instance.

    Like most the compilers and interpreters, the lexical and grammatical analyzer of Palingol, requires pre-processing. This is done by a small program called Galopin whose main responsibility is to rewrite the user'supplied Palingol program in a suitable form for the analyzer, mostly filtering purely syntactic errors. This preprocessor also provides additional features like file inclusion directives and macros. This allows the user to build libraries of program pieces that can be reused.
    Resources