The Genome Taxonomy Network¶
The Genome Taxonomy Network, or GTNet, is a taxonomic classifier that uses a deep neural network to label DNA sequences with the Genome Taxonomy Database taxonomy.
Installation¶
GTNet is available on the Python Package Index.
pip install gtnet
GPU acceleration¶
GTNet uses PyTorch, so it is capable of GPU acceleration with CUDA. As long as CUDA is available on your system, GTNet will detect if CUDA is available and make GPU acceleration available.
If your system is equipped with NVIDIA GPUs, but are unsure if CUDA is installed, we recommend installing PyTorch and the CUDA Toolkit using Conda.
For example, if you would like to run PyTorch with CUDA Toolkit 11.8, you can run the following commands:
conda create -n gtnet-env
conda activate gtnet-env
conda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia
pip install gtnet
Running GTNet¶
GTNet comes with multiple commands. The simplest way of running GTNet is to use the classify
command.
gtnet classify genome.fna > genome.tax.csv
This command generates one classification for the entire file, and should be used to get classification for metagenome bin.
Use the -s/--seqs
flag to get classifications for the individual sequences in genome.fna
Attention
The first time you run classify
and predict
(see below), the model file will be downloaded and stored in the same
directory that the gtnet package is installed in. Therefore, for the this to be successful, you must have write privileges
on the directory that gtnet is installed in.
gtnet classify --seqs genome.fna > genome.seqs.tax.csv
The classify
command can take multiple fasta files, and will produce line per file in the output. For example, the following
command will contain two lines:
gtnet classify bin1.fna bin2.fna > bins.tax.csv
GTNet steps¶
GTNet consists of two main steps: 1) get scored predictions of taxonoimc assignments and 2) filter scored predictions. The previous command combines these two commands into a single command with a default false-positive rate. The two steps have been separated into two commands for those who want to experiment with different false-positive rates.
Getting predictions¶
To get predictinos for all sequences in a Fasta file, use the predict
subcommand. This command also accepts multiple fasta files
and the -s/--seqs
argument for getting predictions for individual sequences.
gtnet predict genome.fna > genome.tax.raw.csv
Filtering predictions¶
After getting predicted and scored taxonomic classifications, you can filter the raw classifications to a desired false-positive rate.
gtnet filter --fpr 0.05 genome.tax.raw.csv > genome.tax.csv
The filter
command supports predictions for whole files and individual sequences.
GPU acceleration¶
If CUDA is available on your system, the classify
and predict
commands will have the option -g/--gpu
to enable
using the available GPU to accelerate neural network calculations.
API Documentation¶
gtnet.classify module¶
- gtnet.classify.classify(argv=None)¶
Get taxonomic classification for each sequence in a Fasta file.
- Parameters:
argv (Namespace, default=sys.argv) – The command-line arguments to use for running this command
gtnet.predict module¶
- gtnet.predict.predict(argv=None)¶
Get network predictions for each sequence in Fasta file
- Parameters:
argv (Namespace, default=sys.argv) – The command-line arguments to use for running this command
- gtnet.predict.run_torchscript_inference(fastas, model, conf_models, window, step, vocab, seqs=False, n_chunks=10000, device=device(type='cpu'), logger=None)¶
Run Torchscript inference
- Parameters:
fastas (str) – The path to the Fasta file with sequences to do inference on
model (RecursiveScriptModule) – The Torchscript model to run inference with
conf_models (dict) – A dictionary with the confidence model for each taxonomic level. Each model should be a RecursiveScriptModule. The expected keys in this dict are ‘domain’, ‘phylum’, ‘class’, ‘order’, ‘family’, ‘genus’ and ‘species’.
window (int) – The length of the sliding window to use for doing inference
step (int) – The length of the step of the sliding window to use for doing inference
vocab (str) – The vocabulary used for training model
n_chunks (int, default=10000) – The length of the step of the sliding window to use for doing inference
device (device, default=torch.device('cpu')) – The Pytorch device to run inference on
logger (Logger) – The Python logger to use when running inference
gtnet.filter module¶
- gtnet.filter.get_cutoffs(rocs, fpr)¶
Get score cutoffs to achieve desired false-positive rate
- gtnet.filter.filter(argv=None)¶
Filter raw taxonomic classifications
- gtnet.filter.filter_predictions(pred_df, cutoffs)¶
Filter taxonomic classification predictions
- Parameters:
pred_df (DataFrame) – The DataFrame containing predictions and confidence scores for each taxonomic level
cutoffs (dict) – A dictionary containing the confidence score cutoff for each taxonomic level
gtnet.utils module¶
- gtnet.utils.parse_logger(string)¶
- gtnet.utils.get_logger()¶
- class gtnet.utils.DeployPkg¶
Bases:
object
A class to handle loading and manipulating the deployment package
- classmethod check_pkg()¶
- path(path)¶
Map paths to be relative to current working directory
- property manifest¶
- __getitem__(key)¶
- gtnet.utils.load_deploy_pkg(for_predict=False, for_filter=False, contigs=False)¶
- class gtnet.utils.GPUModel(model, device)¶
Bases:
Module
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(x)¶
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- gtnet.utils.check_cuda(parser)¶
- gtnet.utils.check_device(args)¶
- gtnet.utils.write_csv(output, args)¶
gtnet package¶
Submodules¶
gtnet.main module¶
- gtnet.main.print_help()¶
- gtnet.main.run()¶
gtnet.sequence module¶
- class gtnet.sequence.FastaSequenceEncoder(window, step, vocab=None, padval=None, min_seq_len=100, device=device(type='cpu'))¶
Bases:
object
- encode(seq)¶
- classmethod get_dna_map(vocab=None)¶
Create data structures for mapping DNA sequence to
- Returns
vocab: the DNA vocabulary used for building the data structures basemap: a 128 element array for mapping ASCII character values to encoded values rcmap: an array for mapping between complementary characters of encoded values
- classmethod get_revcomp_map(vocab)¶
- class gtnet.sequence.FastaReader(encoder, *fastas, parallel=False)¶
Bases:
Process
Module contents¶
Updating GTNet¶
As the GTDB taxonomy is updated, GTNet will also need to be updated. This amounts to retraining the network with the new taxonomy and updating the gtnet software to use the new model and taxonomy.
Training a new model¶
Software for training GTNet is available in the deep-taxon repository.
Uploading to OSF¶
Once a model is trained, calibrated, and packaged, the deployment package needs to be made publicly available. GTNet is currently carried hosted on OSF.
Updating the gtnet software¶
After training a new model and packaging the model, the DeployPkg
class will need to be
updated with the new URL and checksum of the new deployment package. This can be done starting around
here
in the code.
GTNet Performance¶
Attention
This page is currently under construction. The results presented here may not accurately reflect what is said in text.
Taxonomic classifiers fall into two main categories: fast-and-incomplete or slow-and-complete. GTNet strives to be both fast and complete. In this page, we demonstrate GTNet capabilities by comparing to state-of-the-art methods from each of these categories. We compare to Sourmash, a fast-and-incomplete method, and CAT, a slow-and-complete method.
Our choice of tools for comparison should not be perceived as a criticism or an endorsement for either tool. These tools were chosen based on their ease of use for labelling contigs with the GTDB taxonomy and the algorithmic approaches underlying these tools.
Here are accuracy comparisons for a subset of non-representative GTDB taxa.

Here are speed comparisons for a subset of 40 non-representative genomes.

Copyright¶
The Genome Taxonomy Network (GTNet) Copyright (c) 2022, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.
If you have questions about your rights to use or distribute this software, please contact Berkeley Lab’s Intellectual Property Office at IPO@lbl.gov.
NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.
License¶
The Genome Taxonomy Network (GTNet) Copyright (c) 2022, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
(1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
(2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
(3) Neither the name of the University of California, Lawrence Berkeley National Laboratory, U.S. Dept. of Energy nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
You are under no obligation whatsoever to provide any bug fixes, patches, or upgrades to the features, functionality or performance of the source code (“Enhancements”) to anyone; however, if you choose to make your Enhancements available either publicly, or directly to Lawrence Berkeley National Laboratory, without imposing a separate written license agreement for such Enhancements, then you hereby grant the following license: a non-exclusive, royalty-free perpetual license to install, use, modify, prepare derivative works, incorporate into other computer software, distribute, and sublicense such enhancements or derivative works thereof, in binary and source code form.