Splicing refers to the elimination of non-coding regions in transcribed pre-messenger ribonucleic acid (RNA). Discovering splice sites is an important machine learning task that helps us not only to identify the basic units of genetic heredity but also to understand how different proteins are produced. Existing methods for splicing prediction have produced promising results, but often show limited robustness and accuracy. In this paper, we propose a deep belief network-based methodology for computational splice junction prediction. Our proposal includes a novel method for training restricted Boltzmann machines for class-imbalanced prediction. The proposed method addresses the limitations of conventional contrastive divergence and provides regularization for datasets that have categorical features. We tested our approach using public human genome datasets and obtained significantly improved accuracy and reduced runtime compared to state-of-the-art alternatives. The proposed approach was less sensitive to the length of input sequences and more robust for handling false splicing signals. Furthermore, we could discover non-canonical splicing patterns that were otherwise difficult to recognize using conventional methods. Given the efficiency and robustness of our methodology, we anticipate that it can be extended to the discovery of primary structural patterns of other subtle genomic elements.

Boosted Categorical Restricted Boltzmann Machine


Taehoon Lee and Sungroh Yoon, in Proceedings of the International Conference on Machine Learning (ICML), Lille, France, July 2015. [paper] [link]



We tested our approach with the datasets listed in Tables 1 and 2.
  • GWH genome-wide data [zip]
  • The splice signals in these sequences are all canonical: every sequence has the dimer GT or AG in the middle. Specifically, each GWH-donor sequence has the dimer GT at nucleotide positions 200 and 201, and each GWH-acceptor sequence has the dimer AG at positions 198 and 199.
  • UCSC genome browser database [zip]
  • For each exon, we generated three examples by taking the sequences centered at its left, middle, and right boundaries. They correspond to acceptor, non-site, and donor examples, respectively. Whereas the GWH data consist only of canonical junctions, the UCSC dataset also includes non-canonical ones.
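
The two dataset descriptions above can be illustrated with a minimal Python sketch. It is not the authors' preprocessing code. The function names `has_dimer` and `exon_windows`, the `flank` parameter, and the coordinate conventions are all assumptions made for illustration: positions such as 200 and 201 are treated as 1-based by default, exon coordinates are taken as 0-based and half-open, and the exon is assumed to lie on the forward strand, so its left boundary is the acceptor side and its right boundary is the donor side.

```python
from typing import Dict


def has_dimer(seq: str, dimer: str, first_pos: int, one_based: bool = True) -> bool:
    """Check whether `dimer` occupies positions first_pos and first_pos + 1.

    Assumption: positions are 1-based unless one_based=False is passed.
    """
    start = first_pos - 1 if one_based else first_pos
    return seq[start:start + 2].upper() == dimer.upper()


def exon_windows(chrom: str, exon_start: int, exon_end: int, flank: int) -> Dict[str, str]:
    """Cut three windows of length 2 * flank around one exon.

    Assumptions (hypothetical, for illustration only):
      - exon_start / exon_end are 0-based, half-open coordinates;
      - the exon is on the forward strand;
      - `flank` is an arbitrary window half-width, not a value from the paper.
    """
    centers = {
        "acceptor": exon_start,                    # left boundary of the exon
        "non-site": (exon_start + exon_end) // 2,  # middle of the exon
        "donor": exon_end,                         # right boundary of the exon
    }
    windows = {}
    for label, center in centers.items():
        lo, hi = center - flank, center + flank
        if lo < 0 or hi > len(chrom):
            raise ValueError(f"window for {label} falls outside the sequence")
        windows[label] = chrom[lo:hi]
    return windows
```

Under these conventions, a canonical acceptor window has AG at 0-based indices `flank - 2` and `flank - 1`, and a canonical donor window has GT at indices `flank` and `flank + 1`. A check such as `has_dimer(window, "GT", flank, one_based=False)` would therefore separate canonical from non-canonical donor examples in data prepared this way.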


Contact Information

If you have any questions or suggestions, please do not hesitate to contact us.