Parameters

A full list of parameters can be found in the table at the bottom of this page. In practice, most users of EUKulele will only need a few of them. The required ones are:

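  • mets_or_mags: Whether the user intends to run the analysis for metatranscriptomic samples (“mets”) or metagenomic samples (“mags”)
  • sample_dir: Where the samples (metagenomic or metatranscriptomic, depending on mets_or_mags) are located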
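
As a rough illustration of how the required settings fit together, the sketch below assembles a command line in Python. It uses the `--mets_or_mags`, `--sample_dir`, and `--out_dir` flags listed in the table on this page. The executable name `EUKulele`, the helper `build_command`, and the example directory are assumptions made for illustration only.

```python
import shlex

VALID_MODES = ("mets", "mags")


def build_command(mets_or_mags, sample_dir, out_dir=None):
    """Return an argument list for a hypothetical EUKulele invocation."""
    if mets_or_mags not in VALID_MODES:
        raise ValueError(
            f"mets_or_mags must be one of {VALID_MODES}, got {mets_or_mags!r}"
        )
    # Assumption: the executable available on the PATH is named "EUKulele".
    cmd = ["EUKulele", "--mets_or_mags", mets_or_mags, "--sample_dir", sample_dir]
    if out_dir is not None:
        # Optional: the parameter table gives a default folder called "output".
        cmd += ["--out_dir", out_dir]
    return cmd


if __name__ == "__main__":
    # "my_samples" is a placeholder directory name.
    print(shlex.join(build_command("mets", "my_samples")))
```

The returned list can be passed to `subprocess.run` unchanged, which avoids shell quoting problems when directory names contain spaces.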
Full list of EUKulele parameters
| Flag | Configuration File Entry | Meaning |
|---|---|---|
| --config | N/A | The path to the configuration file which should be used to retrieve the equivalent of command-line arguments. |
| -m/--mets_or_mags | mets_or_mags | A required flag to indicate whether metatranscriptomic (“mets”) or metagenomic (“mags”) samples are being used as input. |
| -s/--sample_dir | samples | A required flag to indicate where the samples (metagenomic or metatranscriptomic, depending on the “mets_or_mags” flag) are located. |
| -o/--out_dir | output | The path to the directory where output will be stored. Defaults to a folder called output in the present working directory. |
| --reference_dir | reference | A flag to indicate where the reference FASTA is stored, or a keyword argument for the dataset to be downloaded and used. Only used if not downloading automatically. |
| --ref_fasta | ref_fasta | The name of the reference FASTA file in reference_dir; defaults to reference.pep.fa if not specified, or is set according to the downloaded file if using a keyword argument. |
| --database | database | An optional additional argument for specifying the database name. If the database specified is one of the supported databases (currently “mmetsp”, “eukprot”, or “phylodb”), it will be downloaded automatically. Otherwise, MMETSP is used as a default. |
| --run_transdecoder | run_transdecoder (set to 0 or 1) | An argument for the user to specify whether or not TransDecoder should be used to translate input nucleotide sequences prior to blastp being used (i.e., the equivalent protein-protein alignment with the tool of choice). If included on the command line or set to 1 in the configuration file, TransDecoder is run. Otherwise, blastp is run if protein files are found (that is, files in the sample directory ending in --p_ext, below), or blastx is run if only nucleotide-format files are found. |
| --nucleotide_extension/--n_ext | nucleotide_extension | The file extension for samples in nucleotide format (metatranscriptomes). Defaults to .fasta. |
| --protein_extension/--p_ext | protein_extension | The file extension for samples in protein format (metatranscriptomes). Defaults to .faa. |
| -f/--force_rerun | force_rerun | If included on the command line or set to 1 in a configuration file, this argument forces all steps to be re-run, regardless of whether output is already present. |
| --use_salmon_counts | use_salmon_counts | If included on the command line or set to 1 in a configuration file, this argument causes classifications to be made based both on the number of classified transcripts and on counts. |
| --salmon_dir | salmon_dir | The directory containing the salmon output/quantification files. Must be specified if --use_salmon_counts is true. |
| --names_to_reads | names_to_reads | A file that creates a correspondence between each transcript name and the number of salmon-quantified reads. Can be generated manually via the names_to_reads.py script, or will be generated automatically if it does not exist. |
| --transdecoder_orfsize | transdecoder_orfsize | The minimum cutoff size for an open reading frame (ORF) detected by TransDecoder. Only relevant if --run_transdecoder is specified. |
| --alignment_choice | alignment_choice | A choice of aligner to use, currently BLAST or DIAMOND. |
| --cutoff_file | cutoff_file | A YAML file, provided in src/EUKulele/static/, that contains the percent identity cutoffs for various taxonomic classifications. Any path may be provided here to a user-specified file. |
| --filter_metric | filter_metric | Either evalue, pid, or bitscore (default evalue): the metric used to filter hits based on their quality prior to taxonomic estimation. |
| --consensus_cutoff | consensus_cutoff | The value used to decide whether enough of the taxonomic matches are identical to overlook a discrepancy in classification based on hits associated with a contig. Defaults to 0.75 (75%). |
| --busco_file | busco_file | Overrides specific organism and taxonomy parameters (next two entries below) in favor of a tab-separated file containing each organism/group of interest and the taxonomic level of the query. |
| --organisms | organisms | A list of organisms/groups to test the BUSCO completeness of matching contigs for. |
| --taxonomy_organisms | taxonomy_organisms | The taxonomic level of the groupings indicated in the list of --organisms; also a list. |
| --individual_or_summary/-i | individual_or_summary | Defaults to summary. Whether BUSCO assessment should just be performed for the top organism matches (summary), or whether the list of organisms and their taxonomies or the BUSCO file (above parameters) should be used (individual). When -i is specified, individual mode is chosen. |
| --busco_threshold | busco_threshold | The threshold for BUSCO completeness for a set of contigs to be considered reasonably BUSCO-complete. |
| --tax_table | tax_table | The name of the formatted taxonomy table; defaults to “tax-table.txt”. If this file is not found, it can be generated from the reference FASTA and original taxonomy file using the provided script create_protein_file.py, or the database specified will be automatically downloaded, if it is one of the supported databases. |
| --protein_map | protein_map | The name of the JSON file containing protein correspondences; defaults to “protein-map.json”. If this file is not found, it can be generated from the reference FASTA and original taxonomy file using the provided script create_protein_file.py, or the database specified will be automatically downloaded, if it is one of the supported databases. |
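
The defaults and dependencies described in the table can be summarized in a small sketch. The Python below is not part of EUKulele; the function `resolve_config` and its checks are hypothetical, written only to show how the documented defaults, the two required entries, and the link between use_salmon_counts and salmon_dir might be applied to a dictionary of configuration file entries.

```python
# Defaults copied from the "Meaning" column of the parameter table.
DEFAULTS = {
    "output": "output",
    "ref_fasta": "reference.pep.fa",
    "nucleotide_extension": ".fasta",
    "protein_extension": ".faa",
    "filter_metric": "evalue",
    "consensus_cutoff": 0.75,
    "individual_or_summary": "summary",
    "tax_table": "tax-table.txt",
    "protein_map": "protein-map.json",
}

# Entries the table describes as required.
REQUIRED = ("mets_or_mags", "samples")


def resolve_config(user_config):
    """Merge user entries over the documented defaults and sanity-check them.

    Hypothetical helper: EUKulele performs its own argument handling.
    """
    missing = [key for key in REQUIRED if key not in user_config]
    if missing:
        raise ValueError(f"missing required entries: {missing}")

    config = {**DEFAULTS, **user_config}

    if config["mets_or_mags"] not in ("mets", "mags"):
        raise ValueError("mets_or_mags must be 'mets' or 'mags'")
    if config["filter_metric"] not in ("evalue", "pid", "bitscore"):
        raise ValueError("filter_metric must be evalue, pid, or bitscore")
    # salmon_dir must be given whenever salmon counts are requested.
    if config.get("use_salmon_counts") == 1 and "salmon_dir" not in config:
        raise ValueError("salmon_dir is required when use_salmon_counts is 1")
    return config


if __name__ == "__main__":
    # "my_samples" is a placeholder directory name.
    print(resolve_config({"mets_or_mags": "mets", "samples": "my_samples"}))
```

Keeping the defaults in one dictionary makes it easy to see which values a run will use when an entry is left out of the configuration file.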