About EUKulele

EUKulele is a Python package that enables rapid taxonomic annotation of metagenomic and metatranscriptomic data. Samples are aligned to a reference database, and taxonomy is assigned using a Last Common Ancestor (LCA) approach. Assignments are chosen using either (a) the consensus annotation above a user-specified threshold (by default, 75% of the predicted taxonomic annotations must be identical), or (b) the last conserved taxonomic level among the identified matches. EUKulele has been designed to be flexible and reproducible. It is flexible in that any sequences of interest can be added as a reference, and users are free to choose their own levels of taxonomic specificity and of confidence in alignment matches. It is reproducible in that the parameters used to run EUKulele can be saved alongside the output, including which database and which version were used, so that users can share metadata and easily defend their taxonomic annotation results. In addition to estimating a taxonomic annotation for each putative transcript (“contig”), EUKulele can predict the taxonomic identity of eukaryotic Metagenome Assembled Genomes (MAGs) using the consensus of the contigs in each MAG. EUKulele also offers a secondary taxonomic annotation approach that is limited to sequences annotated as core eukaryotic genes.
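The two assignment rules described above can be illustrated with a minimal sketch. This is not EUKulele's own code: the function names, the representation of each match as a tuple of labels ordered from broadest to most specific, and the example lineages are all assumptions made for illustration.

```python
from collections import Counter


def consensus_lineage(lineages, threshold=0.75):
    """Rule (a): descend level by level while one label holds >= threshold of all hits."""
    if not lineages:
        return ()
    result = []
    candidates = list(lineages)
    for depth in range(min(len(lineage) for lineage in lineages)):
        label, votes = Counter(l[depth] for l in candidates).most_common(1)[0]
        if votes / len(lineages) < threshold:
            break
        result.append(label)
        # Only hits that agree so far may vote at the next, more specific level.
        candidates = [l for l in candidates if l[depth] == label]
    return tuple(result)


def last_conserved_lineage(lineages):
    """Rule (b): keep only the levels on which every hit agrees."""
    shared = []
    for labels in zip(*lineages):
        if len(set(labels)) != 1:
            break
        shared.append(labels[0])
    return tuple(shared)


# Hypothetical matches for one contig: three agree to species, one differs.
hits = [
    ("Haptophyta", "Prymnesiophyceae", "Phaeocystis", "Phaeocystis antarctica"),
    ("Haptophyta", "Prymnesiophyceae", "Phaeocystis", "Phaeocystis antarctica"),
    ("Haptophyta", "Prymnesiophyceae", "Phaeocystis", "Phaeocystis antarctica"),
    ("Haptophyta", "Prymnesiophyceae", "Phaeocystis", "Phaeocystis globosa"),
]
print(consensus_lineage(hits))       # resolves to species (3 of 4 hits agree)
print(last_conserved_lineage(hits))  # stops at genus
```

In this example the consensus rule accepts the species label because exactly 75% of the hits agree, whereas the last-conserved rule is stricter and stops at the genus, the deepest level shared by every hit.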

A variety of curated protein databases are available for use with EUKulele; these are downloaded and formatted automatically. Users can also create a custom local database from their own sequences and accompanying taxonomy.

Functionality

EUKulele :cite:`eukulele` is an open-source Python package designed to simplify the taxonomic identification of marine eukaryotes in meta-omic samples. User-provided metatranscriptomic or metagenomic samples are aligned against a database of the user’s choosing, using an aligner of the user’s choice (BLAST :cite:`kent2002blat` or DIAMOND :cite:`buchfink2015fast`). By default, the “blastx” utility is used when metatranscriptomic samples are provided only in nucleotide format, while the “blastp” utility is used for metagenomic samples and for metatranscriptomic samples available as translated protein sequences. Optionally, the user may choose to translate nucleotide input sequences with the TransDecoder software :cite:`haastransdecoder` and pass the output to “blastp”. Any consistently formatted database may be used, but three published microbial eukaryotic databases are provided by default: MMETSP :cite:`keeling2014marine,caron2017probing`, PhyloDB :cite:`phylodb`, and EukProt :cite:`richter2020eukprot`. The package returns comma-separated files containing all of the contig matches from the metatranscriptome or metagenome, as well as the total number of transcripts that matched at each taxonomic level, from supergroup to species. If a quantification tool has been used to estimate the number of counts associated with each transcript ID, those counts may also be returned. Additionally, the software returns barplots displaying the relative composition of each sample at each taxonomic level, based on the number of transcripts or, if provided, the estimated counts from Salmon (an external transcript quantification tool :cite:`patro2017salmon`).
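The relative composition shown in the barplots can be thought of as a simple tally per taxonomic label, weighted by counts when they are available. The sketch below is an illustration only, not EUKulele's implementation; the function name, the dictionary inputs, and the example transcript IDs and counts are hypothetical.

```python
from collections import defaultdict


def relative_composition(assignments, counts=None):
    """Return the fraction of transcripts (or of counts) assigned to each label.

    assignments: dict mapping transcript ID -> label at one taxonomic level.
    counts: optional dict mapping transcript ID -> estimated count; when given,
            each transcript is weighted by its count instead of by 1.
    """
    totals = defaultdict(float)
    for transcript, label in assignments.items():
        weight = 1.0 if counts is None else float(counts.get(transcript, 0.0))
        totals[label] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: value / grand_total for label, value in totals.items()}


# Hypothetical genus-level assignments and quantification results.
assignments = {"t1": "Phaeocystis", "t2": "Phaeocystis",
               "t3": "Thalassiosira", "t4": "Phaeocystis"}
counts = {"t1": 10, "t2": 10, "t3": 60, "t4": 20}

by_transcripts = relative_composition(assignments)
by_counts = relative_composition(assignments, counts)
print(by_transcripts)
print(by_counts)
```

With these made-up numbers, Phaeocystis dominates by number of transcripts but Thalassiosira dominates by estimated counts, which is why the two views can tell different stories about the same sample.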

EUKulele can assess the relative ‘completeness’ of a given taxonomic group: it takes a user-provided list of names at a given taxonomic level and determines BUSCO completeness and redundancy :cite:`simao2015busco` for each. For example, a user interested in whether a relatively complete set of contigs is available for the genus Phaeocystis within their metagenomic sample could pass Phaeocystis, along with its taxonomic level, “genus”, to EUKulele. By default, EUKulele assesses the BUSCO completeness of the most commonly encountered classifications at each taxonomic level.

Usage and Dependencies

The package is written in Python and may be installed as a Python module via PyPI, as a standalone tool via conda, or by downloading the EUKulele tarball from GitHub.

Once the desired database has either been specified by the user from a previous install or downloaded by the program, the user-selected alignment tool creates an alignment database from the sequences in the database peptide file. The sample metatranscriptomic or metagenomic sequences are then aligned against that database. Each resulting match is reported with a transcript ID, a percentage identity, an e-value, and a bitscore, all of which are common metrics in bioinformatics for assessing the quality of an alignment between sequences. The user can specify which of these metrics should be used to filter out low-quality matches.
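Filtering matches on one of these metrics might look like the sketch below. It assumes the alignment results are in the standard 12-column tab-separated format (BLAST ``-outfmt 6``, which DIAMOND also writes as its tabular output); the function name, the cutoff values, and the example rows are hypothetical and do not reflect EUKulele's defaults.

```python
import csv
import io

# Column order of the standard 12-column tabular alignment output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, metric="bitscore", cutoff=50.0):
    """Keep hits that pass the cutoff on the chosen metric.

    Higher is better for pident and bitscore; lower is better for evalue.
    """
    if metric not in ("pident", "evalue", "bitscore"):
        raise ValueError(f"unsupported metric: {metric}")
    lower_is_better = metric == "evalue"
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(COLUMNS, row))
        value = float(hit[metric])
        passes = value <= cutoff if lower_is_better else value >= cutoff
        if passes:
            kept.append(hit)
    return kept


# Two made-up hits: one strong, one weak.
sample = (
    "contig_1\tref_A\t92.5\t200\t15\t0\t1\t200\t1\t200\t1e-80\t310\n"
    "contig_2\tref_B\t41.0\t60\t35\t1\t1\t60\t5\t64\t0.002\t38.5\n"
)

strong = filter_hits(io.StringIO(sample), metric="bitscore", cutoff=50.0)
print([hit["qseqid"] for hit in strong])
```

Note that the direction of the comparison depends on the metric, so a single cutoff value is only meaningful together with the metric it applies to.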

The alignment output is compared to an accompanying phylogenetic reference specific to the database (which can be generated via a script included in the package). Taxonomy is estimated at six levels of taxonomic resolution, labeled as they are defined in the MMETSP :cite:`keeling2014marine`, from “species” to “supergroup”. Preliminary visualizations are provided as part of the package output so that users can get a quick sense of the diversity in their metagenomic or metatranscriptomic sample. For metagenomic samples, a consensus taxonomic annotation is assigned to each metagenome-assembled genome (MAG) based on the majority assignment of its contigs. For metatranscriptomic samples, only the taxonomic breakdown of the mixed community detected in the assembly is returned. If counts from Salmon :cite:`patro2017salmon` are provided, EUKulele also reports and visualizes the counts associated with each taxonomic classification.
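The majority-based consensus for MAGs can be sketched as a vote among the contigs that belong to each MAG. This is an illustrative sketch under stated assumptions, not EUKulele's code: the function name and inputs are hypothetical, contigs without an annotation are simply ignored, and ties go to whichever label was encountered first.

```python
from collections import Counter, defaultdict


def mag_consensus(contig_to_mag, contig_to_label):
    """Assign each MAG the most common label among its annotated contigs.

    Returns a dict mapping MAG name -> (label, fraction of annotated contigs
    in that MAG that support the label).
    """
    labels_by_mag = defaultdict(list)
    for contig, mag in contig_to_mag.items():
        label = contig_to_label.get(contig)
        if label is not None:  # assumption: unannotated contigs do not vote
            labels_by_mag[mag].append(label)

    result = {}
    for mag, labels in labels_by_mag.items():
        label, votes = Counter(labels).most_common(1)[0]
        result[mag] = (label, votes / len(labels))
    return result


# Hypothetical bins and genus-level contig annotations.
contig_to_mag = {"c1": "MAG_1", "c2": "MAG_1", "c3": "MAG_1",
                 "c4": "MAG_2", "c5": "MAG_2"}
contig_to_label = {"c1": "Phaeocystis", "c2": "Phaeocystis",
                   "c3": "Thalassiosira", "c4": "Thalassiosira"}

consensus = mag_consensus(contig_to_mag, contig_to_label)
print(consensus)
```

Reporting the supporting fraction alongside the label makes it easy to flag MAGs whose contigs disagree, which may indicate a mixed or contaminated bin.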

Subsequently, BUSCO :cite:`simao2015busco` is used to identify the core eukaryotic genes present in each sample. Using the list of genes identified as “core”, a secondary taxonomic estimation step (and, for MAGs, a consensus assignment step) is performed. This makes it possible to compare the taxonomic assignment predicted using all of the genes with the assignment made using only the genes that would be expected to be found in most reference transcriptomes. This approach is particularly useful for MAGs, and offers a way to avoid conflicting or spurious matches caused by strain-level inconsistencies. For metatranscriptome samples, BUSCO completeness can be used to estimate the completeness of taxonomic groups and so better inform their downstream interpretation.