On the reliable identification of plant sequences containing a polyadenylation site

Ilkka Havukkala*, Stijn Vanderlooy

*Corresponding author for this work

    Research output: Contribution to journal › Article › Academic › peer-review

    2 Citations (Web of Science)


    It is a challenging task to predict with high reliability whether plant genomic sequences contain a polyadenylation (polyA) site or not. In this paper, we solve the task by means of a systematic machine-learning procedure applied to a dataset of 1000 Arabidopsis thaliana sequences flanking polyA sites. Our procedure consists of three steps. In the first step, we extract informative features from the sequences using the highly informative k-mer windows approach. Experiments with five classifiers show that the best performance is approximately 83%. In the second step, we improve performance to 95% by reducing the number of features using linear discriminant analysis and then applying the linear discriminant classifier. In the third step, we apply the transductive confidence machines approach and the receiver operating characteristic isometrics approach. The resulting two classifiers make it possible to preset any desired performance by leaving unclassified those sequences for which it is unclear whether they contain polyA sites or not. For example, in our case study, we obtain 99% performance by leaving 26% of the sequences unclassified, and 100% performance by leaving 40% of the sequences unclassified. This is clearly useful for experimental verification of putative polyA sites in the laboratory. The novel methods in our machine-learning procedure should find applications in several areas of bioinformatics.
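
The two ideas that bracket the procedure, per-window k-mer frequencies as features and leaving uncertain sequences unclassified, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of k = 2, the window length of 10, and the simple two-threshold reject band are all assumptions made for illustration. The reject band stands in for the transductive confidence machines and ROC isometrics approaches, which are not reproduced here.

```python
from collections import Counter
from itertools import product


def kmer_window_features(seq, k=2, window=10):
    """Relative k-mer frequencies, computed separately in consecutive
    non-overlapping windows (k and window are illustrative assumptions)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    features = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        counts = Counter(chunk[i:i + k] for i in range(len(chunk) - k + 1))
        total = max(len(chunk) - k + 1, 1)
        # One frequency per possible k-mer, in a fixed order, per window.
        features.extend(counts[m] / total for m in kmers)
    return features


def classify_with_reject(score, low, high):
    """Return 1 or 0 for confident scores, None inside the reject band.
    (A hypothetical stand-in for the paper's reliable classifiers.)"""
    if score >= high:
        return 1
    if score <= low:
        return 0
    return None


def accuracy_and_rejection(scores, labels, low, high):
    """Accuracy on the classified sequences and the fraction left unclassified."""
    decisions = [classify_with_reject(s, low, high) for s in scores]
    kept = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    rejected = 1 - len(kept) / len(scores)
    accuracy = sum(d == y for d, y in kept) / len(kept) if kept else float("nan")
    return accuracy, rejected
```

Widening the band between `low` and `high` leaves more sequences unclassified and typically raises accuracy on the rest, which is the trade-off the abstract reports (for example, 99% performance with 26% of sequences unclassified).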

    Original language: English
    Pages (from-to): 1229-1245
    Number of pages: 17
    Journal: Journal of Computational Biology
    Issue number: 9
    Publication status: Published - Nov 2007


    • dimensionality reduction
    • guaranteed classification performance
    • k-mer frequencies
    • machine learning
    • plant polyA sites
    • sequence analysis
    • RNA
    • DNA
    • MOTIFS
