Experimental setup

To run the experiments:
  • Environment
    A Linux machine
    Ubuntu 12.04, 2.2 GHz Intel Xeon E5-4620, 512 GB memory
  • Methods and Commands
    • BLAST (v4.0 in C++) [with default]
    • USEARCH (v8.1.1861)
      : usearch -usearch_local -strand plus -id 0.8 -thread 1 -top_hit_only
    • UBLAST (v8.1.1861)
      : usearch -makedb_ublast
      : usearch -ublast -evalue 1e-9 -strand plus -top_hit_only
    • BLAT (v.35 in C) [with default]
    • caBLAST (v1.2.1 in C)
      : cablast-compress / cablast-search
      * cablast-search.c modified for BLAST tabular output
    • RDP (v2.11 in java) [train, classify with default]
    • HMMER (3.1.b2 in C) [hmmbuild, hmmpress, nhmmscan with default]
    • ICM (PhymmBL v4.0 in C++)
      : build-icm -d 6 -w 12
      : simple-score -N
    • gzip (v1.4 in C)
      : tar -czf
    • NASCUP (v0.8.1 in C++)
      : nascup_build -d 6 -m kt / nascup_scan -m kt
    * Options not listed above are left at their defaults.
    * The thread option for parallel processing is fixed to 1 (single thread).
    * BLAST and its variants use the tabular output option.


To run NASCUP, use the following commands.
Model building : nascup_build -i train.fasta -o model.ctm [Options]
Classification : nascup_scan -c model.ctm -i test.fasta -o result.out [Options]

[Options for model building]
-d <int> Depth used to represent the context
-m <str> Method for model building; one of KT, ZR, ML, MLC or KMER
-c       Build a CTM (Context Tree Model), which traverses all possible paths.
         By default, a VMM (Variable-depth Markov Model) is built.

[Options for classification]
-m <str> Method for classification; one of KT, ZR, MLC or EUC

For model building using CTM-ML, d=7
$ ./nascup_build -i train.fasta -o model.ctm -c -m ML -d 7
For classification using ZR
$ ./nascup_scan -c model.ctm -i test.fasta -o result.out -m ZR

* Without any options, the model is built with VMM-KT at depth 6 and classification uses KT.
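
The option lists above can be checked before a long run is started. The following is a minimal dry-run sketch, not part of NASCUP: it only prints command lines and executes nothing. The helper names (upper, is_one_of, build_cmd, scan_cmd) are hypothetical, and comparing method names case-insensitively is an assumption based on the commands above using both "kt" and "KT".

```shell
#!/bin/sh
# Dry-run sketch: validate options, then print NASCUP command lines.
# All helper names here are hypothetical and not part of NASCUP.

# Upper-case a string (assumption: method names are case-insensitive,
# since the commands above use both "kt" and "KT").
upper() {
    printf '%s' "$1" | tr '[:lower:]' '[:upper:]'
}

# Succeed if $1 equals one of the remaining arguments.
is_one_of() {
    needle=$1
    shift
    for candidate in "$@"; do
        if [ "$needle" = "$candidate" ]; then
            return 0
        fi
    done
    return 1
}

# Print a nascup_build command line.
# Args: train FASTA, model file, method, depth, model type (VMM or CTM).
build_cmd() {
    if ! is_one_of "$(upper "$3")" KT ZR ML MLC KMER; then
        echo "unknown model-building method: $3" >&2
        return 1
    fi
    case $4 in
        ''|*[!0-9]*)
            echo "depth must be a non-negative integer: $4" >&2
            return 1
            ;;
    esac
    ctm_flag=''
    if [ "$5" = CTM ]; then
        ctm_flag=' -c'   # -c selects CTM; VMM is the default
    fi
    echo "./nascup_build -i ${1} -o ${2}${ctm_flag} -m ${3} -d ${4}"
}

# Print a nascup_scan command line.
# Args: model file, test FASTA, result file, method.
scan_cmd() {
    if ! is_one_of "$(upper "$4")" KT ZR MLC EUC; then
        echo "unknown classification method: $4" >&2
        return 1
    fi
    echo "./nascup_scan -c ${1} -i ${2} -o ${3} -m ${4}"
}

# Reproduce the two example commands given above.
build_cmd train.fasta model.ctm ML 7 CTM
scan_cmd model.ctm test.fasta result.out ZR
```

Printing the commands instead of running them makes it easy to review a batch of settings first; piping the output to sh would execute them.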



Datasets

Functional non-coding RNA
    RF   1,320 families   170,881 seqs
Microbial Taxonomy
  - rRNA database
    RD     134 families     3,838 seqs
    GG     464 families    23,142 seqs
    SS     313 families    17,625 seqs
    SL     107 families     4,593 seqs
  - pyrosequencing data
    AR      60 families    44,407 seqs
    DV      23 families    55,466 seqs
Coding/non-coding RNA
    CN       2 families   103,136 seqs
    HS       2 families   112,180 seqs
Full Greengenes
    BGG  60,717 - 560,969 seqs
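
The sequence counts listed above can be checked against a local copy of a dataset. This is a minimal sketch, assuming the data are stored as FASTA files (as the train.fasta and test.fasta names suggest), where each record starts with a header line beginning with '>'. The function name count_seqs is hypothetical.

```shell
#!/bin/sh
# Sketch: count sequences in a FASTA file by counting header lines.
# Assumes one '>' header line per sequence record.
count_seqs() {
    grep -c '^>' "$1"
}

# Report counts for whichever of the example files exist locally.
for f in train.fasta test.fasta; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_seqs "$f")"
    fi
done
```

The training and test counts for a dataset should together add up to its listed total only if the split covers the whole dataset, which is not stated here.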