What is NASCUP?

NASCUP (Nucleic Acid Sequence Classification by Universal Probability) is a new classification method that captures the probabilistic structure of a sequence family as a compact context-tree model and uses it efficiently to test proximity and membership of a query sequence. NASCUP relies on the notion of universal probability from information theory in both model building and classification, delivering BLAST-like accuracy with runtimes that are orders of magnitude shorter on large-scale databases.

Main Algorithm
NASCUP chooses the context tree (among all possible context trees that may arise from the context graph) that has the maximum universal probability. This model-building method closely resembles the context-tree maximizing data compression algorithm, which is known to achieve the optimal compression performance, as well as tree-based decision algorithms. NASCUP, however, does not account for the description complexity of the tree model itself, which is crucial in compression or tree-based decision making but irrelevant in classification. Given a new sequence whose family membership is unknown, NASCUP compares the (conditional) probabilities of the sequence given the context trees of the candidate sequence families. These probabilities are once again computed according to universal probability assignments, and the family with the highest probability is selected.
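
The two steps described above can be illustrated with a small sketch. It is not the NASCUP implementation: it assumes the Krichevsky-Trofimov (KT) estimator as the universal probability assignment, a fixed maximum context depth, the alphabet ACGT only, and that the first DEPTH symbols of each sequence serve purely as initial context. All function names, the toy families, and the DEPTH value are hypothetical. Tree selection keeps a node as a leaf unless splitting it strictly raises the KT probability, with no penalty for the tree description.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"          # assumption: sequences contain only these symbols
INDEX = {a: i for i, a in enumerate(ALPHABET)}
HALF = len(ALPHABET) / 2   # KT denominator offset for a 4-symbol alphabet
DEPTH = 2                  # assumption: maximum context length for this toy example


def count_contexts(seqs, depth):
    """Count next-symbol occurrences for every context of length 0..depth."""
    counts = defaultdict(lambda: [0] * len(ALPHABET))
    for seq in seqs:
        for i in range(depth, len(seq)):
            for k in range(depth + 1):
                counts[seq[i - k:i]][INDEX[seq[i]]] += 1
    return dict(counts)


def kt_log_prob(c):
    """Log KT probability of a count vector (independent of symbol order)."""
    logp, n = 0.0, 0
    for ca in c:
        for i in range(ca):
            logp += math.log((i + 0.5) / (n + HALF))
            n += 1
    return logp


def select_tree(counts, depth, ctx=""):
    """Return (log-prob, leaves) of the maximum-probability subtree at ctx."""
    c = counts.get(ctx)
    if c is None:                      # context never observed: keep as a leaf
        return 0.0, {ctx}
    leaf_lp = kt_log_prob(c)
    if len(ctx) == depth:
        return leaf_lp, {ctx}
    split_lp, split_leaves = 0.0, set()
    for a in ALPHABET:                 # children extend the context into the past
        lp, leaves = select_tree(counts, depth, a + ctx)
        split_lp += lp
        split_leaves |= leaves
    if split_lp > leaf_lp:             # no description-cost penalty for splitting
        return split_lp, split_leaves
    return leaf_lp, {ctx}


def build_model(seqs, depth):
    counts = count_contexts(seqs, depth)
    _, leaves = select_tree(counts, depth)
    return leaves, counts


def log_prob_given_family(query, model, depth):
    """Sequential KT log-probability of the query, conditioned on family counts."""
    leaves, counts = model
    seen = defaultdict(lambda: [0] * len(ALPHABET))   # counts from the query so far
    logp = 0.0
    for i in range(depth, len(query)):
        # exactly one suffix of the current context is a leaf of the tree
        ctx = next(query[i - k:i] for k in range(depth + 1) if query[i - k:i] in leaves)
        base = counts.get(ctx, [0] * len(ALPHABET))
        j = INDEX[query[i]]
        n = sum(base) + sum(seen[ctx])
        logp += math.log((base[j] + seen[ctx][j] + 0.5) / (n + HALF))
        seen[ctx][j] += 1
    return logp


def classify(query, models, depth):
    """Pick the family whose model assigns the query the highest probability."""
    return max(models, key=lambda name: log_prob_given_family(query, models[name], depth))


# Toy families, invented for illustration only.
families = {
    "at_repeat": ["ATATATATATATATAT", "TATATATATATATA"],
    "gc_repeat": ["GCGCGCGCGCGCGCGC", "CGCGCGCGCGCGCG"],
}
models = {name: build_model(seqs, DEPTH) for name, seqs in families.items()}
print(classify("ATATATAT", models, DEPTH))
```

Because every family scores the same number of query symbols, the log-probabilities are directly comparable. In this sketch the selected tree for each toy family stops at depth one, since a single preceding symbol already predicts the next one and deeper contexts do not raise the probability further.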


Release History

version 0.8.1 (October, 2015)
  • First release of NASCUP
  • Download : NASCUP

Datasets

Functional non-coding RNA
    RF   1,320 families, 170,881 seqs
Microbial Taxonomy
  - rRNA database
    RD   134 families, 3,838 seqs
    GG   464 families, 23,142 seqs
    SS   313 families, 17,625 seqs
    SL   107 families, 4,593 seqs
  - pyrosequencing data
    AR   60 families, 44,407 seqs
    DV   23 families, 55,466 seqs
Coding/non-coding RNA
    CN   2 families, 103,136 seqs
    HS   2 families, 112,180 seqs
Full Greengenes
    BGG  60,717 - 560,969 seqs