FT (ID:6704/ft:002) for Features Table
Related languages
References:

INTRODUCTION

While the widespread availability of sequence databases has been of great value to molecular biologists, most database usage is limited to a few simple tasks: searching for entries by keyword, retrieval of entries, and sequence similarity searches. More sophisticated projects often require the creation of large database subsets representing particular taxa, organs, tissues, or other groupings which merit comparison. One of the earliest studies of this type analyzed 124 mRNA sequences from E. coli to infer a set of rules for identification of ribosome binding sites [1]. More recently, 369 Alu dispersed repetitive elements were categorized into subfamilies to enable reconstruction of their evolutionary history [2]. Such projects require not only the ability to organize sequences into discrete groups, but also the ability to extract specific subsequences from each database entry for analysis of comparable features.

A sequence query language, that is, a language in which expressions, upon evaluation, yield a sequence, would offer many advantages in dataset construction. The sequences themselves need not be stored; only the instructions necessary to recreate the dataset must be kept. Interestingly, the most ambitious attempts at writing sequence query languages predate GenBank [3] itself. Schroeder and Blattner [4] described DNA*, which permitted concatenation and complementation of DNA sequences using a terse syntax. Another approach was that of DELILA (DEoxyribonucleic acid LIbrary LAnguage) [5]. DELILA encompassed both a hierarchical syntax for description of genomes and a query language in which named features served as reference points within a coordinate system. Because both languages predated the current databases, they do not contain syntax for reference to database entries.
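The idea of storing instructions rather than sequences can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-tuple expression form, the operator names `span`, `comp` and `join`, and the toy entry are all hypothetical, and none of this is DNA* or DELILA syntax.

```python
# Hypothetical illustration: expressions that evaluate to sequences.
# The operator names and the tuple notation are assumptions made for this
# sketch; they are not taken from DNA*, DELILA or FT.

COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def evaluate(expr, entries):
    """Evaluate a nested-tuple expression against a dict of entry sequences.

    ("span", name, start, end)  bases start..end of an entry (1-based, inclusive)
    ("comp", expr)              reverse complement of a sub-expression
    ("join", expr, expr, ...)   concatenation of sub-expressions
    """
    op = expr[0]
    if op == "span":
        _, name, start, end = expr
        return entries[name][start - 1:end]
    if op == "comp":
        return reverse_complement(evaluate(expr[1], entries))
    if op == "join":
        return "".join(evaluate(sub, entries) for sub in expr[1:])
    raise ValueError(f"unknown operator: {op!r}")


# A made-up entry; only the recipe below would need to be stored.
entries = {"toy": "aaccggtt"}
recipe = ("join", ("span", "toy", 1, 3), ("comp", ("span", "toy", 5, 8)))
result = evaluate(recipe, entries)
print(result)
```

The recipe is a few dozen bytes regardless of how long the underlying entries are, and re-evaluating it against an updated database regenerates the dataset.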
While more recent tools have been able to parse GenBank entries for direct use of data fields by other programs [6], automated access to the features annotated in the Features Table has been difficult to realize. The development of the Feature Table language (FT) [7] as an integral part of database annotation was a fundamental step in making sequence data more usable, because each feature in a GenBank entry is now annotated in a standard, machine-parsable syntax. The universality of this language now makes it possible to specify any DNA sequence using an expression, as given by the relation

    expression → sequence

This task has been implemented in the FEATURES program, which is part of the XYLEM package described in this paper (Table I). While fully accessible through a menu-driven interface, the simplest form of the FEATURES command is

    features expression > sequence

meaning that FEATURES can take an FT expression as input and write a sequence to the output. For example, given the following feature annotated in the GenBank entry with primary accession number M74750:

    terminator      609..650
                    /label=T7-terminator

typing the command

    features M74750:T7-terminator

would return the sequence

    ataaccccttggggcctctaaacgggtcttgaggggttttt

representing that part of the sequence spanning bases 609 to 650, as identified by the field '/label=T7-terminator'.

Table I. List of XYLEM programs and functions

High-level tools
    FINDKEY     Search for one or more keywords in database
    FETCH       Retrieve one or more entries from database
    FEATURES    Extract features by feature key or expression

Low-level tools
    SPLITDB     Split a database into annotation, sequence and index
    IDENTIFY    Used by FINDKEY to identify entries containing keywords
    GETLOC      Used by FETCH to retrieve entries from a split database
    GETOB       Used by FEATURES to parse Feature Table expressions
    UDS         Update an existing dataset with new versions of entries
    DBSTAT      Calculate amino acid frequencies in a protein database
    RIBOSOME    Translate file of nucleic acid sequences into protein
    SHUFFLE     Given a random seed, shuffles each sequence in a file
    REFORM      Multiple alignment printing tool
    GBUPDATE    Download GenBank database by FTP; calls SPLITDB
    PIRUPDATE   Download PIR database by FTP; calls SPLITDB

The XYLEM tools (Table I) automate the management of online databases, as well as the construction of sequence database subsets. Even non-expert users should be able to create datasets for use in multiple alignments, phylogenetic studies, structure comparisons and other types of analyses.
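The lookup performed for an expression such as M74750:T7-terminator (find the feature carrying the label, read its location, return that span of the entry) can be approximated by the Python sketch below. It is not the XYLEM implementation: the function names, the simplified two-column feature layout, and the toy table and sequence are assumptions made for illustration, and only simple `start..end` locations are handled.

```python
import re

# Hypothetical sketch of a label lookup; not how FEATURES or GETOB work
# internally. Only plain "start..end" locations are recognised.

SIMPLE_LOCATION = re.compile(r"(\d+)\.\.(\d+)")


def find_labeled_span(feature_table: str, label: str):
    """Return (start, end) of the feature annotated with /label=<label>."""
    location = None  # location of the feature currently being read
    for line in feature_table.splitlines():
        text = line.strip()
        if not text:
            continue
        if text.startswith("/"):
            # Qualifier line: belongs to the most recent feature key line.
            if text.replace(" ", "") == f"/label={label}" and location:
                return location
        else:
            # Feature key line: "<key> <location>".
            parts = text.split()
            match = SIMPLE_LOCATION.fullmatch(parts[-1]) if len(parts) == 2 else None
            location = (int(match.group(1)), int(match.group(2))) if match else None
    raise KeyError(label)


def extract(sequence: str, feature_table: str, label: str) -> str:
    """Return the bases spanned by the labeled feature (1-based, inclusive)."""
    start, end = find_labeled_span(feature_table, label)
    return sequence[start - 1:end]


# Made-up data, used only to exercise the sketch.
TOY_TABLE = """
     promoter        1..4
                     /label=toy-promoter
     terminator      7..10
                     /label=toy-terminator
"""
TOY_SEQUENCE = "acgtacgtacgt"

print(extract(TOY_SEQUENCE, TOY_TABLE, "toy-terminator"))
```

Locations written with operators such as join or complement would need a real expression parser, which is the role Table I assigns to GETOB.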