Abstract
Predicting with high reliability whether a plant genomic sequence contains a polyadenylation (polyA) site is a challenging task. In this paper, we solve the task by means of a systematic machine-learning procedure applied to a dataset of 1000 Arabidopsis thaliana sequences flanking polyA sites. Our procedure consists of three steps. In the first step, we extract features from the sequences using the highly informative k-mer windows approach. Experiments with five classifiers show that the best performance is approximately 83%. In the second step, we improve performance to 95% by reducing the number of features with linear discriminant analysis and then applying the linear discriminant classifier. In the third step, we apply the transductive confidence machines approach and the receiver operating characteristic isometrics approach. The two resulting classifiers make it possible to preset any desired performance by leaving unclassified those sequences for which it is unclear whether they contain a polyA site. For example, in our case study, we obtain 99% performance by leaving 26% of the sequences unclassified, and 100% performance by leaving 40% of the sequences unclassified. This is clearly useful for experimental verification of putative polyA sites in the laboratory. The novel methods in our machine-learning procedure should find applications in several areas of bioinformatics.
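Two ideas in the abstract can be illustrated with a minimal Python sketch: counting k-mer frequencies in windows along a sequence, and trading coverage for performance by leaving ambiguous sequences unclassified. This is not the authors' implementation. The function names, the non-overlapping window layout, the values of `k` and `window`, and the simple two-threshold reject rule are all assumptions made for illustration. In particular, the reject rule is a plain stand-in and is neither transductive confidence machines nor ROC isometrics. The input scores in the usage note are toy values, not results from the paper.

```python
from collections import Counter
from itertools import product


def kmer_window_frequencies(seq, k=2, window=10):
    """Relative k-mer frequencies per non-overlapping window.

    Hypothetical helper: the window layout and defaults are assumptions,
    not the paper's exact feature definition.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    features = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        n_positions = len(chunk) - k + 1
        counts = Counter(chunk[i:i + k] for i in range(n_positions))
        # One feature per k-mer per window, in a fixed k-mer order.
        features.extend(counts[m] / n_positions for m in kmers)
    return features


def classify_with_reject(score, low, high):
    """Return 1 or 0 when the score is decisive, None when it is ambiguous."""
    if score >= high:
        return 1
    if score <= low:
        return 0
    return None  # left unclassified


def accuracy_and_coverage(scores, labels, low, high):
    """Accuracy on the classified items and the fraction that was classified."""
    preds = [classify_with_reject(s, low, high) for s in scores]
    decided = [(p, y) for p, y in zip(preds, labels) if p is not None]
    coverage = len(decided) / len(scores)
    if not decided:
        return float("nan"), coverage
    accuracy = sum(p == y for p, y in decided) / len(decided)
    return accuracy, coverage
```

As a usage example on toy values, `accuracy_and_coverage([0.9, 0.6, 0.4, 0.1], [1, 0, 1, 0], 0.2, 0.8)` classifies only the two decisive scores, whereas setting both thresholds to 0.5 classifies all four at lower accuracy. Widening the gap between `low` and `high` is the knob that exchanges unclassified sequences for higher performance on the rest.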
Original language | English |
---|---|
Pages (from-to) | 1229-1245 |
Number of pages | 17 |
Journal | Journal of Computational Biology |
Volume | 14 |
Issue number | 9 |
Publication status | Published - Nov 2007 |
Keywords
- dimensionality reduction
- guaranteed classification performance
- k-mer frequencies
- machine learning
- plant polyA sites
- sequence analysis
- RNA
- PREDICTION
- SIGNALS
- DNA
- CLASSIFICATION
- ALGORITHMS
- SELECTION
- MACHINE
- MOTIFS