ARTICLES

* Department of Mathematics and Computer Science and Department of Biology, Wesleyan University, Middletown, CT 06459
Submitted September 16, 2003; Revised June 22, 2004; Accepted June 24, 2004
| ABSTRACT |
|---|
Key Words: relational database, Drosophila, splice sites, information content, undergraduate bioinformatics
| INTRODUCTION |
|---|
The current development of the new genomic-scale data sets leads to two natural questions: What are the best ways to represent large-scale data sets? What are the best ways to analyze large-scale data sets? The challenge in the field of bioinformatics is to learn to think in terms of large data sets (e.g., not one gene at a time, but thousands of genes) and to develop analytical approaches to extract biological information from large data sets. Genomic-scale data sets are being developed for many kinds of biological information, including gene expression, protein structure and function, and DNA and protein sequence data. We can think of genomic sequence data sets as defining special sets of sequences (sequence spaces) that can be contrasted with sets of random sequences (random sequence space). By identifying the constraints that define the sequence spaces, we hope to discover potential properties or patterns in the biological processes underlying the data sets that can then be tested experimentally. The central goal is to extract biological meaning through informatic analysis of the data, but how can we represent these sequence spaces, and how can we identify the constraints? We discuss below how the intrinsic structure and standard analytical power of relational databases provide a powerful setting for informatic analyses of genomic and related data sets.
Ideally, we would like to have a transparent framework that gives biology students the excitement of exploring large data sets, looking for underlying patterns, and analyzing the data without the need for sophisticated programming expertise. This framework should facilitate the development of new analytical approaches and ways of understanding complex genomic information. As we describe here, relational databases provide just such a setting: their intrinsic structure supports informatic analyses of genomic and related data sets in a form that students can readily use.
In this article, we discuss our use of a Drosophila splice-site database to encourage students to think informatically. In the case study, we present examples of informatic analysis with the database to illustrate its analytical power. A full description of this analysis is published elsewhere (Weir and Rice, 2004). In the sections on Student Experiences and Assessment, we discuss student experiences when being introduced to informatic ways of thinking through use of the database. Although the database has been used only once, as a component of a small-enrollment class, the initial feedback indicated that hands-on experience manipulating large data sets gives students from biology and computer science backgrounds a valuable start in thinking informatically. In Appendix A, we discuss general properties of relational databases that make them an ideal framework for representing and analyzing large biological data sets. We also discuss the use of stored procedures that permit more complex analyses of data sets. In Appendix B, we describe the design of the Drosophila splice-site database, focusing on how its design facilitated adding new analytical methods as our analysis of splice sites proceeded. We note that the essence of the database design is captured in Figure 1, and the details presented in these appendices are not essential for our discussion. However, the reader will find it useful to review the discussion of databases in the appendices (how data is stored, retrieved, and analyzed) in assessing the value of databases in bioinformatics curricula.
We emphasize that our primary goal in introducing students to relational databases of biological data sets was to nurture informatic thinking skills. Learning programming languages, principles of database design, and writing stored procedures were not primary goals; these skills can be developed in other courses. By using versatile stored procedures available in our splice-site database, biology students could explore and informatically analyze a large biological data set without being stifled by the mechanics of programming. Figure 2 shows examples of the kinds of genomic-scale questions that students were able to address (see section on Student Term Projects). In contrast to working with many biological databases that are typically queried one element at a time, students could address questions about genomic-scale relationships within populations of sequences.
| DEVELOPMENT OF A GENOMIC DATABASE: A CASE STUDY |
|---|
The database was developed using the Microsoft SQL Server 2000 database management system and stores data about exons, introns, and splice-site regions. These types of data were computed using a custom algorithm that matches transcript (cDNA) sequences (Stapleton et al., 2002) with genomic DNA sequences (downloaded from the Berkeley Drosophila Genome Project). The splice-site data were stored in database tables as described in Appendices A and B. The data organization is summarized in the database schema (Figure 1, see Appendix B). A collection of stored procedures was developed to perform a variety of specific data analyses, including the computation of information content at nucleotide positions near splice sites, frequency distributions of nucleotides, and intron-length distributions (see following discussion). In addition, the database contains special metadata tables that support a generic Web interface written in the Perl language (Appendix B). By updating these tables, new stored procedures can be made available to users without making any changes in the Web interface.
The relational database greatly facilitated the analysis of sequence conservation at splice sites, an example of a constrained sequence space (see Introduction). In particular, several of the stored procedures permitted efficient testing of different specialized contexts that enhance sequence conservation. Once features that appeared to influence splice function were identified (such as intron length), they were added as parameters in procedures to test other contexts. The data set needs to be sufficiently large (in our case, 10,057 introns) so that when it is partitioned on the basis of a parameter such as intron length, the resulting subsets are of reasonable size. Within these subsets of the data, we can look for other parameters that also influence splice function. Clearly, very large data sets are crucial for this approach, which will become increasingly useful as more large genome data sets become available.
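The following example illustrates how one of the stored procedures can be used to calculate information at nucleotide positions near donor splice sites. Information, as defined by Schneider (Stephens and Schneider, 1992), provides a sensitive measure for quantifying sequence conservation at a given position in an alignment. It is defined by the quantity

$$I = 2 - \left[ -\sum_{b \in \{A, C, G, T\}} f_b \log_2 f_b \right] - e(n),$$

where $f_b$ is the frequency of nucleotide $b$ at the position and $e(n)$ is a small-sample correction factor for an alignment of $n$ sequences.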
Ignoring the correction factor, information is defined as the difference between two uncertainties: The above quantity in brackets represents the uncertainty based on the actual nucleotide frequencies, while the value 2 represents the uncertainty if each nucleotide is equally likely to occur. Because there are four possible bases at each nucleotide position, there are between 0 and 2 bits of information at each position: 2 bits if only one nucleotide is present (the quantity in brackets tends to zero as one nucleotide predominates), and 0 bits if each nucleotide is equally likely to occur (the quantity in brackets tends to 2 as each nucleotide becomes equally likely). For example, Figures 3C and 3D illustrate the information values and nucleotide profiles at 20 nucleotide positions flanking the donor sites of our full data set of 10,057 splice sites (based on the Web interface output in Figure 3B).
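The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the stored procedure used in the database: it ignores the correction factor, and the function names and the four toy donor sequences are invented for the example.

```python
import math
from collections import Counter


def information_content(column, bases="ACGT"):
    """Return 2 minus the uncertainty (in bits) of one alignment column.

    The small-sample correction factor is deliberately ignored.
    """
    counts = Counter(column)
    n = sum(counts[b] for b in bases)
    if n == 0:
        raise ValueError("column contains no A, C, G, or T")
    uncertainty = 0.0
    for b in bases:
        f = counts[b] / n
        if f > 0:  # a base that never occurs contributes nothing
            uncertainty -= f * math.log2(f)
    return 2.0 - uncertainty


def information_profile(sequences):
    """Information content at each position of equal-length aligned sequences."""
    return [information_content(column) for column in zip(*sequences)]


# Toy alignment (invented data): four short donor-site fragments.
toy_sites = ["GGTAAGT", "AGTGAGT", "GGTAAGA", "CGTAAGT"]
bits = information_profile(toy_sites)
```

In this toy alignment the second and third columns contain only G and only T, so they carry the full 2 bits, whereas a column with all four bases in equal proportion would carry 0 bits.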
The analytical approach presented above provides an example of how the relational database encourages one to think informatically about properties of the splicing process by partitioning sequence space. One can imagine extending the partitioning approach by selecting subsets of splice sites with different degrees of mismatch to the consensus sequences and examining compensating changes elsewhere in the splice sites. Indeed, as discussed in Weir and Rice (2004), this approach reveals that poorer matches to consensus are associated with enhanced A content and, to a lesser degree, U content in the vicinity of the splice site. This may facilitate splicing by reducing RNA secondary structure or by increasing association of spliceosome components. One can also imagine analyzing sequence space using different measures such as dinucleotide or trinucleotide content, or by assessing dependencies with measures such as mutual information (a measure of covariation, in this case between pairs of nucleotide positions). We have begun designing stored procedures to allow the implementation of these and other analytical approaches.
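As an illustration of the mutual information measure mentioned above, the following Python sketch computes the covariation between two positions of an alignment. It is an assumption-laden sketch, not one of the stored procedures under development: the function name and the toy sequences are hypothetical, and no small-sample correction is applied.

```python
import math
from collections import Counter


def mutual_information(sequences, i, j):
    """Mutual information (in bits) between alignment positions i and j."""
    n = len(sequences)
    pair_counts = Counter((s[i], s[j]) for s in sequences)
    i_counts = Counter(s[i] for s in sequences)
    j_counts = Counter(s[j] for s in sequences)
    total = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = i_counts[a] / n
        p_b = j_counts[b] / n
        total += p_ab * math.log2(p_ab / (p_a * p_b))
    return total


# Invented examples: perfectly covarying versus independent positions.
covarying = ["AT", "CG", "AT", "CG"]
independent = ["AA", "AC", "CA", "CC"]
```

When the nucleotide at one position fully determines the nucleotide at the other (as in `covarying`), the measure equals the uncertainty of a single position; when the positions vary independently, it is zero.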
| STUDENT EXPERIENCES |
|---|
Bioinformatic Analysis without Programming
By providing stored procedures for execution in a Web interface, students
can often carry out data analyses without actually needing to write programs.
In particular, by designing a set of versatile stored procedures for analyzing
splice-site data, we have provided our students with a rich environment for
analyzing a large genomic data set. The key to our approach is the design of
stored procedures with multiple parameters, each of which is relevant for
analyzing splice-site data (e.g., minimum and maximum intron lengths, types of
introns, or selected tables of genes). These parameters provide students with
a built-in versatility by allowing either a simple analysis based on varying
one parameter or exploiting the combinatorial power by simultaneously varying
several parameters in a data analysis (see Appendix B for a listing of stored
procedures).
Designing Queries
The broad range of attributes represented by the fields in the database
tables provided many opportunities for quite open-ended analysis by students.
In addition to the set of stored procedures, we also provided students with a
Web-based window in which they can directly enter SQL queries. The SQL
language is quite transparent and intuitive, and even with rather simple
queries, it is possible to extract interesting data relationships from the
database. By studying the primary database schema
(Figure 1, Appendix B) and a
few examples of queries, it was possible for students to make quick
progress.
For example, the following query returns the identifiers, ranks, and lengths of all exons between 2,900 and 3,000 bases in length, listed in ascending order by length (Figure 4).
select id, rank, finishposition - startposition + 1 as length
from tblNewKnownExons
where finishposition - startposition + 1 between 2900 and 3000
order by length
Testing queries using the SQL query Web interface provides an exciting opportunity for students to explore the splice-site sequence space. They can ask a broad range of questions (e.g., whether exon lengths correlate with information content at splice sites), or they can ask whether there are relationships between general nucleotide content and either intron or exon length. The possible questions are almost limitless.
The SQL query window was used by students in term projects. In our next offering of the course, we will use the query window in a new computer lab module so that all students in the class are introduced to the versatility of the database for studying relationships within biological data. Relationships between attributes of the data set can be assessed through the "joining" of tables within the database, a process that illuminates all possible associations between components of two tables that satisfy given constraints. Unlike a single flat-file spreadsheet, which can be difficult to navigate for more than one component (gene) at a time, the versatility and flexibility of being able to combine multiple nonredundant tables in many different ways allows one to ask many different kinds of questions about relationships within the data. For example, through appropriate joining (cross-referencing) of tables, one can assess across the whole data set whether poor matches to splice-site consensus sequences are associated with general nucleotide content trends (see Development of a Genomic Database). It is this transparent analytical power of relational databases that allows complex biological relationships to be uncovered.
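To make the idea of joining concrete, the sketch below builds two tiny tables in an in-memory SQLite database from Python and joins introns to their splice sites. The table names and the key fields (id, rank, intronrank, release) follow the schema described in Appendix B; the columns startposition and finishposition on the intron table, the donorseq column, the 100-nucleotide cutoff, and all rows are assumptions made purely for illustration. SQLite stands in for SQL Server here, so the dialect differs slightly from the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table tblNewKnownIntrons (
    id integer, "rank" integer, "release" integer,
    startposition integer, finishposition integer,   -- assumed columns
    primary key (id, "rank", "release"));
create table tblNewKnownSpliceSites (
    id integer, intronrank integer, "release" integer,
    donorseq text,                                    -- hypothetical column
    primary key (id, intronrank, "release"));
""")
# Invented rows: two cDNAs, each with one short and one long intron.
conn.executemany("insert into tblNewKnownIntrons values (?,?,?,?,?)", [
    (1, 1, 3, 101, 160), (1, 2, 3, 301, 1500),
    (2, 1, 3, 51, 115), (2, 2, 3, 401, 2400),
])
conn.executemany("insert into tblNewKnownSpliceSites values (?,?,?,?)", [
    (1, 1, 3, "GTAAGT"), (1, 2, 3, "GTGAGT"),
    (2, 1, 3, "GTAAGT"), (2, 2, 3, "GTAAGA"),
])

# Join each intron to its splice site, then compare consensus matches
# between short and long introns.
rows = conn.execute("""
select case when i.finishposition - i.startposition + 1 < 100
            then 'short' else 'long' end as lengthclass,
       count(*) as n,
       sum(case when s.donorseq like 'GTAAGT%' then 1 else 0 end) as consensusmatches
from tblNewKnownIntrons i
join tblNewKnownSpliceSites s
  on s.id = i.id and s.intronrank = i."rank" and s."release" = i."release"
group by lengthclass
order by lengthclass
""").fetchall()
```

The join condition mirrors the correspondence between the two tables' key fields, so a single query relates a property stored in one table (intron length) to a property stored in another (donor sequence) across the whole data set.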
Assessing Data Quality and Algorithm Design
The Web teaching module in our bioinformatics course
(http://mweir.web.wesleyan.edu/igs350/Drosophila_splice_sites.htm)
introduced students to the database and focused on the quality of data
resulting from our computation of splice sites. Data quality is a crucial
issue in large data sets, as it constrains the potential quality of analysis.
The transparency of a well-designed genomic database facilitates the
assessment of data quality. Indeed, the SQL language makes it easy to apply
constraints to extract a higher-quality data set that is a subset of the full
data set. For example, the quality of the data in our splice-site database
depended on our algorithm for computing splice sites. As a result of
nucleotide polymorphisms (between cDNA and genomic sequences), our algorithm
incorrectly predicted some very short introns and exons. Indeed, in the
teaching module, students discover that limiting their analysis to cDNAs with
introns and exons over 20 nucleotides in length significantly improves the
quality of their data set, as measured by higher conformity to the two-base
consensus sequences found at each end of the intron. The teaching module
introduces students to assessment of the quality of the
splice-site-computation algorithm and to considerations of ways to improve the
design of the algorithm by taking into account biological context (e.g., the
handling of polymorphisms).
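The kind of quality filter that students discover in the teaching module can be sketched as follows, again using Python with an in-memory SQLite database as a stand-in for the course database. The columns donor2 and acceptor2 (the first two and last two bases of each intron), the omission of the release field, and every row of data are assumptions introduced only for this example; the threshold of 20 nucleotides is the one mentioned in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table tblNewKnownExons (
    id integer, "rank" integer, startposition integer, finishposition integer);
create table tblNewKnownIntrons (
    id integer, "rank" integer, startposition integer, finishposition integer,
    donor2 text, acceptor2 text);   -- hypothetical columns
""")
# Invented rows: cDNA 1 looks clean; cDNA 2 has a 5-nt intron and an 8-nt exon.
conn.executemany("insert into tblNewKnownExons values (?,?,?,?)", [
    (1, 1, 1, 100), (1, 2, 161, 360), (1, 3, 431, 580),
    (2, 1, 1, 120), (2, 2, 126, 133), (2, 3, 214, 513),
])
conn.executemany("insert into tblNewKnownIntrons values (?,?,?,?,?,?)", [
    (1, 1, 101, 160, "GT", "AG"), (1, 2, 361, 430, "GT", "AG"),
    (2, 1, 121, 125, "CA", "TT"), (2, 2, 134, 213, "GT", "AG"),
])

CONFORMITY_SQL = """
select avg(case when donor2 = 'GT' and acceptor2 = 'AG' then 1.0 else 0.0 end)
from tblNewKnownIntrons
where id not in (
    select id from tblNewKnownIntrons
    where finishposition - startposition + 1 <= :minlen
    union
    select id from tblNewKnownExons
    where finishposition - startposition + 1 <= :minlen)
"""


def gt_ag_conformity(conn, minlen):
    """Fraction of GT...AG introns among cDNAs whose introns and exons all exceed minlen."""
    return conn.execute(CONFORMITY_SQL, {"minlen": minlen}).fetchone()[0]


unfiltered = gt_ag_conformity(conn, 0)   # every cDNA is kept
filtered = gt_ag_conformity(conn, 20)    # cDNAs with very short features are dropped
```

In this toy data set the suspect cDNA is excluded by the length constraint, and conformity to the two-base consensus rises accordingly, which is the pattern the module asks students to look for in the real data.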
Our experience emphasizes the importance of introducing biology students to thinking algorithmically (using precise notation to describe a step-by-step process), thereby encouraging students to think systematically about biological analyses. This can apply to algorithms used in analysis of biological data (e.g., the splice-site computation algorithm described above) or to algorithms developed to model biological machines. Encouraging students to design algorithms that describe the actions of biological mechanisms (e.g., translation by a ribosome) helps the students to build a bridge between bioinformatics and experimental biology and to think about the informatic as well as structural foundations for the biological functions they are studying. Designing algorithms and observing their behavior can reveal steps in a process that were not apparent from seemingly systematic thinking. In a different part of our course, we asked students to describe algorithms for biological processes based on finite state machines.
Student Term Projects
In our bioinformatics course, small groups of students, typically including
both life-science majors and computer-science majors, worked together on term
projects. Their projects included using the Drosophila splice-site database to
address questions related to splicing (see
Figure 2).
Some of these questions could be addressed using the available stored procedures, and others required some SQL programming. The questions encouraged students to think about different ways to partition the data set and to test for patterns in the data (sequence space) that might reflect constraints on the spliceosome machine. When students use the stored procedures or when they design their own queries in the Web interface, we encourage them to refer to the database schema (Figure 1; Appendix B). By understanding the data structure implied by the schema, students have a framework for considering how relationships within the data can be extracted. This encourages them to consider how different schema frameworks and table designs facilitate asking different kinds of questions. For example, are the database tables well designed, so that it is straightforward to extract the relevant data for analysis (e.g., to analyze information content of aligned sequences or dinucleotide content)? This exposure to database structure encourages students to think about formulating different representations of biological data to answer different questions. In future offerings of the course, we plan to use the database schema as an explicit tool in the lecture component of the course for introducing students to the importance of data representation decisions.
| ASSESSMENT |
|---|
After the end of the course, students were invited to answer an anonymous Web-based questionnaire concerning the splice-site computer lab session. Questions were designed to assess our goals in using databases in teaching, including facilitating students' thinking informatically about large-scale data sets and data set quality, understanding the concept of information, and understanding the algorithm used to compute splice sites. Responses were obtained from nine students (not including Gladstone) and are summarized in Figure 5 and its legend.
| CONCLUSIONS |
|---|
| ACCESSING MATERIALS |
|---|
| APPENDIX A RELATIONAL DATABASES |
|---|
The standard representational framework of a relational database (O'Neil and O'Neil, 1999) is ideally suited for organizing and representing genomic data. Different types of properties of genes (such as functional annotation and sequence data) can be assigned to different tables in the database, and in each table, different attributes of the property (such as gene name, associated protein, and transcription start position) can be stored. We can think about the genome as a whole being represented by the database, where each gene property is represented by specific values of standardized attributes, rather than trying to think about thousands of genes with unorganized, separate characteristics.
The power of the relational database structure is that organizing complex genomic-scale data in this framework exposes the data set to uncovering relationships between properties of the data. The structured query language (SQL) of relational databases provides the ability to answer straightforward questions about the data without needing to write extensive code defining search algorithms, because the organization naturally leads to common analytical approaches which can be expressed in the SQL language. A relational database allows you to perform a task such as, "list all Drosophila genes that have a transcription start site less than 100 KB from the end of the left arm on chromosome 3" by executing an SQL query of the form
select genename
from tableGenes
where transcriptionstart < 100000 and arm = '3L'.
Moreover, sophisticated queries can often be written in a simple procedural language supported by the database that extends standard SQL. A standard approach is to use this language to write stored procedures that permit a more in-depth data analysis by allowing many different combinations of parameter values to be tested.
A stored procedure that summarizes the preceding query might have the name ListGenes and parameters corresponding to the arm and distance from the start of the arm, but it might also have a parameter that restricts the types of genes that will be listed. For example, the following code specifies the execution of ListGenes and lists all ribosomal genes that have a transcription start site less than 50 KB from the end of arm 2R:
exec ListGenes 50000, '2R', 'ribosomal'.
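The idea of wrapping a query in a procedure with parameters can be sketched in Python as a function that builds a parameterized query. This is only an analogy to a stored procedure, not the database's own procedural language: the genetype column, the optional third parameter's behavior, and the gene names and coordinates below are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
create table tableGenes (
    genename text, transcriptionstart integer, arm text,
    genetype text)   -- hypothetical column
""")
# Invented rows with placeholder gene names.
conn.executemany("insert into tableGenes values (?,?,?,?)", [
    ("geneA", 30000, "2R", "kinase"),
    ("geneB", 45000, "2R", "ribosomal"),
    ("geneC", 80000, "2R", "ribosomal"),
    ("geneD", 20000, "3L", "ribosomal"),
    ("geneE", 95000, "3L", "kinase"),
])


def list_genes(conn, max_start, arm, gene_type=None):
    """List gene names on an arm with a transcription start below max_start.

    If gene_type is given, only genes of that type are listed.
    """
    sql = "select genename from tableGenes where transcriptionstart < ? and arm = ?"
    params = [max_start, arm]
    if gene_type is not None:
        sql += " and genetype = ?"
        params.append(gene_type)
    sql += " order by genename"
    return [row[0] for row in conn.execute(sql, params)]


all_3l = list_genes(conn, 100000, "3L")
ribosomal_2r = list_genes(conn, 50000, "2R", "ribosomal")
```

Each parameter corresponds to one constraint in the underlying query, so varying the arguments tests different combinations of constraints without rewriting the SQL.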
In summary, relational databases allow users to formulate and answer basic questions about the data and allow database developers to create more sophisticated procedures for analyzing the data. In fact, the design of queries and stored procedures is an ongoing process: the results of a given analysis lead to the need to develop new and progressively more sophisticated stored procedures. In our case, we have developed an approach for recording new stored procedures so they are automatically made available to students through a Web interface to the database (see Appendix B).
| APPENDIX B DESIGNING THE SPLICE-SITE DATABASE |
|---|
The four primary tables that are used to store the computed splice-site data are shown in Figure 1. By combining the data in these tables in a variety of ways, complex queries and procedures can be written to analyze virtually any aspect of the splice-site data. For example, the splice-site database currently provides eight specialized procedures with a variety of parameters for analyzing data (listed in Figure 6A). Below, we provide an overview of the design of the procedure user_countpattern that counts the matches to a pattern in splice-site regions. Students in our bioinformatics courses also have the capability of running their own custom queries in the database by using a special structured query language query window in the Web interface (see Designing Queries).
In each table diagram, the set of gold keys represents the primary key. This is a collection of fields that uniquely determine the values of the other fields in the table. For example, in the intron table tblNewKnownIntrons, the primary key consists of the fields id, rank, and release. In other words, each intron entry is determined by an integer identifier, the rank of the intron (1, first; 2, second; etc.), and the release number of the data set from the Berkeley Drosophila Genome Project (http://www.fruitfly.org) that was used to compute the splice-site data. Similarly, the fields id and release represent the primary key of the table tblNewKnowncDNA. This pair uniquely determines the cDNA-gene correspondence represented by the cDNA transcript ctstring and the gene cgstring.
The lines between table diagrams indicate relationships that are called referential integrity constraints. These constraints enforce data dependencies between certain pairs of tables. For example, the line between the tables tblNewKnowncDNA and tblNewKnownExons indicates that every value of id in the second table must also be present in the first table. Similarly, every triple of values id, intronrank, release found in tblNewKnownSpliceSites must be present as values of id, rank, and release in tblNewKnownIntrons.
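The primary keys and referential integrity constraints described above can be sketched with an in-memory SQLite database driven from Python. The table names and key fields are those named in the text; the remaining columns are omitted or assumed, the presence of a release field in the exon table is an assumption needed to reference the cDNA table's two-field key, and SQLite syntax is used in place of SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
create table tblNewKnowncDNA (
    id integer, "release" integer, ctstring text, cgstring text,
    primary key (id, "release"));
create table tblNewKnownExons (
    id integer, "rank" integer, "release" integer,
    startposition integer, finishposition integer,
    primary key (id, "rank", "release"),
    foreign key (id, "release") references tblNewKnowncDNA (id, "release"));
create table tblNewKnownIntrons (
    id integer, "rank" integer, "release" integer,
    primary key (id, "rank", "release"));
create table tblNewKnownSpliceSites (
    id integer, intronrank integer, "release" integer,
    primary key (id, intronrank, "release"),
    foreign key (id, intronrank, "release")
        references tblNewKnownIntrons (id, "rank", "release"));
""")

conn.execute("insert into tblNewKnownIntrons values (1, 1, 3)")
conn.execute("insert into tblNewKnownSpliceSites values (1, 1, 3)")  # matching intron exists

try:
    # No intron (1, 2, 3) exists, so the constraint should reject this row.
    conn.execute("insert into tblNewKnownSpliceSites values (1, 2, 3)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

try:
    # The triple (1, 1, 3) is already present, so the primary key should reject it.
    conn.execute("insert into tblNewKnownIntrons values (1, 1, 3)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Both failed inserts illustrate why these constraints matter for data quality: the database itself refuses splice sites that point to nonexistent introns and refuses two introns with the same identifier, rank, and release.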
The stored procedure computes the number of exact matches to the consensus sequence in the restricted data set as well as the starting positions of the matches in the range from -10 to 10. The results of executing the procedure are shown in Figure 7. As expected, the majority of the matches (1,855) are found starting at the donor position -1. In these cases, a G is found at position -1, a G is found at position 1, a T is found at position 2, either an A or a G is found at position 3, and so on.
The manner in which the actual computation is performed by the procedure is outlined here: First, the constraint on the minimum splice length (20) is used to create a new temporary table containing the entries in the cDNA table that satisfy the constraint. Second, the entries in the temporary table are joined with the entries in the intron table that have matching ids and release numbers and satisfy the minimum and maximum length constraints 0 and 1,000, respectively, to create a second temporary table. Third, the entries in the second temporary table are joined with the entries in the splice-site table that have matching ids, ranks, and release numbers to create a temporary splice-site table. Finally, the entries in the temporary splice-site table are used to generate the nucleotide sequences found at the positions -10 to 10 for both donor and acceptor sites. The sequences that match the pattern "GGT[AG]AGT", starting at some position in the range from -10 to 10, are computed using a built-in pattern matching routine. The starting positions and the numbers of these sequences are displayed in Figure 7.
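The final pattern-counting step can be sketched in Python, with a regular expression standing in for the database's built-in pattern matching routine. This is not the code of user_countpattern: the function names and toy records are invented, the earlier joins are reduced to a simple length filter, and the sketch assumes that each donor region is a 20-nucleotide string covering positions -10 to -1 followed by 1 to 10, with no position 0.

```python
import re
from collections import Counter

FLANK = 10
PATTERN = re.compile(r"(?=GGT[AG]AGT)")  # lookahead so overlapping matches are found


def index_to_position(idx, flank=FLANK):
    """Map a 0-based string index to a splice-site position (no position 0)."""
    return idx - flank if idx < flank else idx - flank + 1


def count_pattern_starts(records, min_len=0, max_len=1000):
    """Count pattern start positions over introns within the length limits.

    records is an iterable of (intron_length, donor_region) pairs.
    """
    starts = Counter()
    for intron_length, region in records:
        if min_len <= intron_length <= max_len:
            for match in PATTERN.finditer(region):
                starts[index_to_position(match.start())] += 1
    return starts


# Invented records: (intron length, 20-nt donor region).
toy_records = [
    (60, "ACACACACAGGTAAGTACAC"),    # match starting at position -1
    (75, "TTTTTTTTCGGTGAGTTTTT"),    # match starting at position -1
    (90, "ACACACACAGCTAAGTACAC"),    # no match
    (120, "CCCCCCCCCCCGGTAAGTCC"),   # match starting at position 2
    (5000, "ACACACACAGGTAAGTACAC"),  # excluded by the maximum length
]
start_counts = count_pattern_starts(toy_records)
```

The resulting counter plays the role of the table in Figure 7: each key is a starting position and each value is the number of sequences whose match begins there.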
The database also contains a collection of metadata tables. In general, the term metadata refers to the fact that the tables contain information about genomic data and procedures as opposed to actually containing the genomic data. In the splice-site database, the metadata tables store a variety of information about the stored procedures. For example, there are tables that contain the names of the stored procedures displayed in the Web interface, the names of the parameters for these procedures, and lists of possible choices for the actual values of certain parameters. There is also a metadata table that contains the names of other tables that store lists of specialized genes (e.g., the genes on a particular chromosome arm or the genes used in a particular experiment).
The metadata tables support several specialized views (dynamically generated tables) and procedures that compute summary information. A scripting program uses the summary information from the metadata to generate the layout of the Web pages that provide an environment for executing the stored procedures. This approach allows the script to be written in a generic manner independent of specific stored procedures and permits additional stored procedures to be added to the database and exported for use without making any changes to the code. Details of this approach are described at http://igs.wesleyan.edu/metatables.
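The metadata-driven design can be illustrated with a small Python sketch in which a generic function reads procedure and parameter names from metadata tables. The table names metaProcedures and metaParameters, their columns, the parameter names, and the second procedure are hypothetical; only the procedure name user_countpattern comes from the text, and the actual interface is a Perl script rather than Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table metaProcedures (procname text primary key, description text);
create table metaParameters (
    procname text, position integer, paramname text, defaultvalue text,
    primary key (procname, position));
""")
conn.execute("insert into metaProcedures values (?, ?)",
             ("user_countpattern", "Count matches to a pattern in splice-site regions"))
conn.executemany("insert into metaParameters values (?,?,?,?)", [
    ("user_countpattern", 1, "pattern", "GGT[AG]AGT"),
    ("user_countpattern", 2, "minintronlength", "0"),
    ("user_countpattern", 3, "maxintronlength", "1000"),
])


def describe_procedures(conn):
    """Return {procedure name: [(parameter name, default value), ...]} from the metadata."""
    forms = {}
    names = conn.execute("select procname from metaProcedures order by procname").fetchall()
    for (procname,) in names:
        forms[procname] = conn.execute(
            "select paramname, defaultvalue from metaParameters "
            "where procname = ? order by position", (procname,)).fetchall()
    return forms


before = describe_procedures(conn)

# Registering a new procedure only requires new metadata rows.
conn.execute("insert into metaProcedures values (?, ?)",
             ("user_example", "Hypothetical newly added procedure"))
conn.execute("insert into metaParameters values (?,?,?,?)",
             ("user_example", 1, "minintronlength", "0"))
after = describe_procedures(conn)
```

Because `describe_procedures` never mentions a specific procedure, the newly registered entry appears in its output without any change to the function, which is the property the generic Web interface relies on.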
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Corresponding author. E-mail address: mweir{at}wesleyan.edu.
| REFERENCES |
|---|
Fields, C. (1990). Information content of Caenorhabditis elegans splice site sequences varies with intron length. Nucleic Acids Res. 18, 1509-1512.
Kanehisa, M. (2000). Post-Genomic Informatics. Oxford, UK: Oxford University Press.
Lim, L.P., and Burge, C.B. (2001). A computational analysis of sequence features involved in recognition of short introns. Proc. Natl. Acad. Sci. USA 98, 11193-11198.
Mount, S.M., Burks, C., Hertz, G., Stormo, G.D., White, O., and Fields, C. (1992). Splicing signals in Drosophila: intron size, information content, and consensus sequences. Nucleic Acids Res. 20, 4255-4262.
Murray, A. (2000). Whither genomics? Genome Biology 1, comment003.1-003.6.
O'Neil, E., and O'Neil, P. (1999). Database principles, programming, performance. In: Morgan Kaufmann Series in Data Management Systems, series ed. J. Gray. San Francisco, CA: Morgan Kaufmann Publishers, Inc.
Stapleton, M., Liao, G., Brokstein, P., Hong, L., Carninci, P., Shiraki, T., Hayashizaki, Y., Champe, M., Pacleb, J., Wan, K., Yu, C., Carlson, J., George, R., Celniker, S., and Rubin, G.M. (2002). The Drosophila gene collection: identification of putative full-length cDNAs for 70% of D. melanogaster genes. Genome Res. 12, 1294-1300.
Stephens, R.M., and Schneider, T.D. (1992). Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J. Mol. Biol. 228, 1124-1136.
Weir, M., and Rice, M. (2004). Ordered partitioning reveals extended splice site consensus information. Genome Res. 14, 67-78.