Discover and annotate the virome.
Works on your laptop or HPC (compatible with macOS and Linux)
Cenote-Taker 3 is a virus bioinformatics tool that scales from individual genome sequences to massive metagenome assemblies to:
- Identify sequences containing genes specific to viruses (virus hallmark genes)
- Annotate virus sequences, including:
  a) adaptive ORF calling
  b) a large catalog of HMMs from virus gene families for functional annotation
  c) hierarchical taxonomy assignment based on hallmark genes
  d) mmseqs2-based CDD database search
  e) tabular (.tsv) and interactive genome map (.gbf) outputs
Cenote-Taker 3 is also very fast: many times faster than Cenote-Taker 2 on large datasets and, in my hands, faster than comparable annotation with pharokka while providing more functional annotation of virus genes.
Image of example genome map:
- Discovering virus contigs in metagenomic data
- Annotating virus sequences without a highly similar, well-annotated reference
- Finding prophages (or proviruses) in microbial genomes
- Not for read-level classification of known viruses (see Marker-MAGu or EsViritu for this task)
- Not ideal for annotating virus genomes that are highly similar to known references (e.g. phage lambda with a few mutations).
Most recent versions
Cenote-Taker 3 scripts: v3.4.3
Cenote-Taker 3 Databases: v3.1.1
This should work on MacOS and Linux
Versions used in test installations
mamba 1.5.8
conda 24.7.1
mamba is better/faster than conda for almost all solving/installation tasks
- Use mamba to install the bioconda package
macOS (specify osx-64 platform regardless of which chip you have)
I'm also noticing a macOS-specific issue with newer mmseqs versions, so use mmseqs2=15.6f452
mamba create --platform osx-64 -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.4.3 mmseqs2=15.6f452
linux
mamba create -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.4.3
Using conda instead
macOS (specify osx-64 platform regardless of which chip you have)
conda create --platform osx-64 -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.4.3 mmseqs2=15.6f452
linux
conda create -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.4.3
- Activate the conda environment.
conda activate ct3_env
You should now be able to type cenotetaker3 and get_ct3_dbs in the terminal to bring up their help menus.
- Change to a directory where you'd like to install the databases and run the database script, specifying the DB directory with -o.
Total DB file size is 3.0 GB after decompression.
cd ..
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T --mmseqs_cdd T --domain_list T
With optional hhsuite databases
Warning: due to inconsistent server speed, these downloads may take over 2 hours.
You may download one or more of the hhsuite DBs.
The data footprint is:
| Database | Size |
|---|---|
| CDD | 6.1 GB |
| pfam | 4.6 GB |
| pdb70 | 56 GB |
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T --mmseqs_cdd T --domain_list T --hhCDD T --hhPFAM T --hhPDB T
- Set the database directory as a conda environment variable, then re-activate the environment so the variable takes effect.
conda env config vars set CENOTE_DBS=/path/to/ct3_DBs
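Before starting the database downloads described above, it can help to confirm that the target disk has enough free space. The following is a minimal sketch, not part of Cenote-Taker 3: the function names `required_gb` and `free_gb` and the labels `cdd`, `pfam` and `pdb70` are hypothetical, and the sizes are simply the footprints listed above (3.0 GB for the standard databases plus the optional hhsuite databases).

```shell
#!/usr/bin/env bash
# Sketch only: estimate the database footprint and compare it with free space.

# required_gb: print the total footprint in GB for the standard databases
# plus any optional hhsuite databases named as arguments (cdd, pfam, pdb70).
required_gb() {
    local sizes="3.0" db
    for db in "$@"; do
        case "$db" in
            cdd)   sizes="$sizes 6.1" ;;
            pfam)  sizes="$sizes 4.6" ;;
            pdb70) sizes="$sizes 56" ;;
            *) echo "unknown database: $db" >&2; return 1 ;;
        esac
    done
    # Sum with awk because bash has no floating-point arithmetic.
    echo "$sizes" | awk '{ t = 0; for (i = 1; i <= NF; i++) t += $i; printf "%.1f\n", t }'
}

# free_gb: print the free space in GB on the filesystem holding directory $1.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%.1f\n", $4 / 1024 / 1024 }'
}

need=$(required_gb cdd pfam pdb70)
have=$(free_gb .)
if awk -v n="$need" -v h="$have" 'BEGIN { exit !(h >= n) }'; then
    echo "OK: need ${need} GB, ${have} GB free"
else
    echo "Not enough space: need ${need} GB, ${have} GB free" >&2
fi
```

Drop the arguments to `required_gb` to check only the standard 3.0 GB download. Compressed archives need extra room during download, so treat the result as a lower bound.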
- Clone this GitHub repo
- Using mamba (package manager within conda) and the provided yaml file, make the environment:
mamba env create -f Cenote-Taker3/environment/ct3_env.yaml
- Activate the conda environment.
conda activate ct3_env
- Change to the repo directory and pip install the command line tool.
cd Cenote-Taker3
pip install .
You should now be able to type cenotetaker3 and get_ct3_dbs in the terminal to bring up their help menus.
- Change to a directory where you'd like to install the databases and run the database script, specifying the DB directory with -o.
Total DB file size is 3.0 GB after decompression.
cd ..
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T --mmseqs_cdd T --domain_list T
With optional hhsuite databases
Warning: due to inconsistent server speed, these downloads may take over 2 hours.
You may download one or more of the hhsuite DBs.
The data footprint is:
| Database | Size |
|---|---|
| CDD | 6.1 GB |
| pfam | 4.6 GB |
| pdb70 | 56 GB |
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T --mmseqs_cdd T --domain_list T --hhCDD T --hhPFAM T --hhPDB T
- Set the database directory as a conda environment variable, then re-activate the environment so the variable takes effect.
conda env config vars set CENOTE_DBS=/path/to/ct3_DBs
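Variables set with `conda env config vars set` only become visible after the environment is re-activated (`conda activate ct3_env` again). A quick way to confirm the setting took effect is sketched below. The function name `check_ct3_dbs` is hypothetical and not part of Cenote-Taker 3. It only checks that `CENOTE_DBS` is set and points to an existing directory, not that the databases inside are complete.

```shell
#!/usr/bin/env bash
# Sketch only: sanity-check the CENOTE_DBS environment variable.

# check_ct3_dbs: succeed only if CENOTE_DBS is set and is an existing directory.
check_ct3_dbs() {
    if [ -z "${CENOTE_DBS:-}" ]; then
        echo "CENOTE_DBS is not set" >&2
        return 1
    fi
    if [ ! -d "$CENOTE_DBS" ]; then
        echo "CENOTE_DBS does not point to a directory: $CENOTE_DBS" >&2
        return 1
    fi
    echo "CENOTE_DBS=$CENOTE_DBS"
}

check_ct3_dbs || echo "Re-activate ct3_env or re-run 'conda env config vars set'" >&2
```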
Make sure the conda environment is activated.
cenotetaker3 -h
cenotetaker3 -c Cenote-Taker3/test_data/testcontigs_DNA_ct2.fasta -r test_ct3 -p T
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3 -p T
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3 -p T --lin_minimum_hallmark_genes 2
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3pr -p T --caller prodigal
cenotetaker3 -c my_virus_contigs.fna -r my_virs_ct3 -p F -am T
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3 -p T -db virion rdrp dnarep
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3 -p T --reads my_reads/*fastq
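The examples above each process a single contig file. For several assemblies, one option is to generate one command per file and review the commands before running them. This is a rough sketch under stated assumptions: the helper `build_ct3_cmd`, the `contigs/` directory and the `_ct3` run-title suffix are hypothetical, and the run title is derived from the file name with non-alphanumeric characters replaced by underscores as a conservative choice. Only the `-c`, `-r` and `-p` flags shown above are used.

```shell
#!/usr/bin/env bash
# Sketch only: print one cenotetaker3 command per contig file (dry run).

# build_ct3_cmd: print the command for one contig file.
# Paths containing spaces are not handled by this sketch.
build_ct3_cmd() {
    local contigs="$1" base title
    base=$(basename "$contigs")
    base="${base%.*}"                                   # drop the file extension
    title=$(printf '%s' "$base" | tr -c 'A-Za-z0-9_' '_')
    printf 'cenotetaker3 -c %s -r %s_ct3 -p T\n' "$contigs" "$title"
}

# Nothing is executed here; the loop only prints the commands.
for f in contigs/*.fna; do
    [ -e "$f" ] || continue        # skip when the glob matches nothing
    build_ct3_cmd "$f"
done
```

Once the printed commands look right, save the output to a file and run it with `bash`, or submit each line as a separate job on an HPC scheduler.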
{run_title}/
│   {run_title}_virus_summary.tsv                    <- main summary file for each virus
│   {run_title}_virus_sequences.fna                  <- all virus genome seqs
│   {run_title}_virus_AA.faa                         <- all virus AA seqs
│   {run_title}_prune_summary.tsv                    <- summary of pruning of each sequence
│   final_genes_to_contigs_annotation_summary.tsv    <- annotation info, all genes
│   run_arguments.txt                                <- arguments used in this run
│   {run_title}_cenotetaker.log                      <- main log file
│
├───sequin_and_genome_maps/
│   │   {run_title}*gbf    <- genome maps
│   │   {run_title}*fsa    <- genome sequence
│   │   {run_title}*gtf    <- feature table gtf format
│   │   {run_title}*tbl    <- feature table sequin format
│   │   {run_title}*sqn    <- non-human-readable sequin file for GenBank submission
│   │   {run_title}*cmt    <- sequin comment file
│
└───ct_processing/
    │   --- many intermediate files ---
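A quick check of a finished run can be done with standard shell tools, using the file names from the layout above. The sketch below is not part of Cenote-Taker 3. The function name `summarize_ct3_run` is hypothetical, it expects the run directory to sit in the current directory, and it assumes the first line of the summary .tsv is a header row.

```shell
#!/usr/bin/env bash
# Sketch only: count virus sequences and summary rows for one run.

# summarize_ct3_run: takes a run title and reads ./{run_title}/ outputs.
summarize_ct3_run() {
    local run_title="$1"
    local fna="${run_title}/${run_title}_virus_sequences.fna"
    local tsv="${run_title}/${run_title}_virus_summary.tsv"
    local n_seqs n_rows
    if [ ! -f "$fna" ] || [ ! -f "$tsv" ]; then
        echo "missing output files in ${run_title}/" >&2
        return 1
    fi
    # Each FASTA record starts with a ">" header line.
    n_seqs=$(awk '/^>/ { n++ } END { print n + 0 }' "$fna")
    # Skip the assumed header row and any blank lines.
    n_rows=$(awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$tsv")
    echo "sequences=${n_seqs} summary_rows=${n_rows}"
}
```

For example, `summarize_ct3_run test_ct3` after the test run. If the two counts differ, the log file in the run directory is the first place to look.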
CheckV for virus genome completeness estimation.
BACPHLIP for phage lifestyle prediction (only use complete/near-complete phage genomes).
VContact3 for genome clustering and taxonomy.
iPHoP for prokaryotic virus host prediction.
Cenote-Taker 3 is under active development, so please open an issue if anything seems unusual or any errors occur. I have likely not tested every parameter combination, and most bugs should be a simple fix.
Cenote-Taker 3 for Fast and Accurate Virus Discovery and Annotation of the Virome.
Michael J. Tisza, Joseph F. Petrosino, Sara J. Javornik Cregeen


