{
"cells": [
{

"cell_type": "markdown", "metadata": {}, "source": [

"# Using RNA read counts\n", "### Using EUKulele to assign protistan taxonomy and generate relative community composition based on RNA read counts"

]

}, {

"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [

"import pandas as pd"

]

}, {

"cell_type": "markdown", "metadata": {}, "source": [

"When working with an environmental transcriptomic data set, we often want to assign taxonomy using reference databases populated with cultured isolates. It can be useful to have a single assembly to which the transcripts from every sample can be mapped, so that gene expression can be compared across samples. In this example we ran EUKulele on a single metatranscriptomic assembly that was created by combining 40 individual samples using the metatranscriptomic assembly pipeline \"eukrhythmic\" (https://github.com/AlexanderLabWHOI/eukrhythmic). We use the EUKulele output, combined with a separately compiled read counts table produced with salmon (https://salmon.readthedocs.io/en/latest/salmon.html), to visualize the breakdown of taxonomic groups within each sample."

]

}, {

"cell_type": "markdown", "metadata": {}, "source": [

"EUKulele was run using the MMETSP database with diamond as the alignment tool (the default), as below:\n", "\n", "`EUKulele -s /output/transdecoder_mega_merge --protein_extension .pep -m mets`"

]

}, {

"cell_type": "markdown", "metadata": {}, "source": [

"First, we load the annotated contig table located in the EUKulele output directory:\n", "\n", "`output/taxonomy_estimation/sample-estimated-taxonomy.out`"

]

}, {

"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [

{
“data”: {
“text/html”: [
“<div>n”, “<style scoped>n”, ” .dataframe tbody tr th:only-of-type {n”, ” vertical-align: middle;n”, ” }n”, “n”, ” .dataframe tbody tr th {n”, ” vertical-align: top;n”, ” }n”, “n”, ” .dataframe thead th {n”, ” text-align: right;n”, ” }n”, “</style>n”, “<table border=”1” class=”dataframe”>n”, ” <thead>n”, ” <tr style=”text-align: right;”>n”, ” <th></th>n”, ” <th>Unnamed: 0</th>n”, ” <th>transcript_name</th>n”, ” <th>classification_level</th>n”, ” <th>full_classification</th>n”, ” <th>classification</th>n”, ” <th>max_pid</th>n”, ” <th>ambiguous</th>n”, ” </tr>n”, ” </thead>n”, ” <tbody>n”, ” <tr>n”, ” <th>0</th>n”, ” <td>0</td>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_0</td>n”, ” <td>species</td>n”, ” <td>Stramenopiles; Ochrophyta; Bacillariophyta; Ba…</td>n”, ” <td>Skeletonema grethea</td>n”, ” <td>100.0</td>n”, ” <td>0</td>n”, ” </tr>n”, ” <tr>n”, ” <th>1</th>n”, ” <td>0</td>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_10</td>n”, ” <td>species</td>n”, ” <td>Hacrobia; Haptophyta; Prymnesiophyceae; Prymne…</td>n”, ” <td>Chrysochromulina ericina</td>n”, ” <td>95.5</td>n”, ” <td>0</td>n”, ” </tr>n”, ” <tr>n”, ” <th>2</th>n”, ” <td>0</td>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_100</td>n”, ” <td>species</td>n”, ” <td>Hacrobia; Haptophyta; Prymnesiophyceae; Prymne…</td>n”, ” <td>Chrysochromulina ericina</td>n”, ” <td>100.0</td>n”, ” <td>0</td>n”, ” </tr>n”, ” <tr>n”, ” <th>3</th>n”, ” <td>0</td>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_1005</td>n”, ” <td>species</td>n”, ” <td>Alveolata; Dinoflagellata; Dinophyceae; Proroc…</td>n”, ” <td>Prorocentrum minimum</td>n”, ” <td>100.0</td>n”, ” <td>0</td>n”, ” </tr>n”, ” <tr>n”, ” <th>4</th>n”, ” <td>0</td>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_1008</td>n”, ” <td>family</td>n”, ” <td>Stramenopiles; Ochrophyta; Bacillariophyta; Ba…</td>n”, ” <td>Araphid-pennate</td>n”, ” <td>77.1</td>n”, ” <td>0</td>n”, ” </tr>n”, ” </tbody>n”, “</table>n”, “</div>”

], “text/plain”: [

” Unnamed: 0 transcript_name \n”, “0 0 megahit_NarBay_A_megahit_NarBay_A_k111_0 n”, “1 0 megahit_NarBay_A_megahit_NarBay_A_k111_10 n”, “2 0 megahit_NarBay_A_megahit_NarBay_A_k111_100 n”, “3 0 megahit_NarBay_A_megahit_NarBay_A_k111_1005 n”, “4 0 megahit_NarBay_A_megahit_NarBay_A_k111_1008 n”, “n”, ” classification_level full_classification \n”, “0 species Stramenopiles; Ochrophyta; Bacillariophyta; Ba… n”, “1 species Hacrobia; Haptophyta; Prymnesiophyceae; Prymne… n”, “2 species Hacrobia; Haptophyta; Prymnesiophyceae; Prymne… n”, “3 species Alveolata; Dinoflagellata; Dinophyceae; Proroc… n”, “4 family Stramenopiles; Ochrophyta; Bacillariophyta; Ba… n”, “n”, ” classification max_pid ambiguous n”, “0 Skeletonema grethea 100.0 0 n”, “1 Chrysochromulina ericina 95.5 0 n”, “2 Chrysochromulina ericina 100.0 0 n”, “3 Prorocentrum minimum 100.0 0 n”, “4 Araphid-pennate 77.1 0 “

]

}, “execution_count”: 2, “metadata”: {}, “output_type”: “execute_result”

}

], "source": [

"taxa = pd.read_table('merged_merged-estimated-taxonomy.out')\n", "taxa.head()"

]

}, {

"cell_type": "markdown", "metadata": {}, "source": [

"This file summarizes the alignment results. For each contig that matched an annotation in the database, it lists the level of classification achieved and the assigned name (`classification_level` and `classification`), the full classification as presented in the database (`full_classification`), the maximum percent identity calculated by the aligner (`max_pid`), and whether there were discrepancies when assigning the taxonomic cutoff (`ambiguous`). Descriptions of the EUKulele output are provided here: https://eukulele.readthedocs.io/en/latest/"

]
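
}, {

"cell_type": "markdown", "metadata": {}, "source": [

Before splitting the classification, it can help to check how many contigs were resolved at each taxonomic level and how many were flagged as ambiguous. The sketch below uses a small made-up table (`taxa_example`, with hypothetical contig names and values) that mimics the columns described above; it is an illustration under those assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for the EUKulele table (hypothetical contigs and values);
# with real data this would be the `taxa` data frame loaded above.
taxa_example = pd.DataFrame({
    "transcript_name": ["contig_0", "contig_1", "contig_2", "contig_3"],
    "classification_level": ["species", "species", "family", "genus"],
    "classification": ["Skeletonema grethea", "Chrysochromulina ericina",
                       "Araphid-pennate", "Prorocentrum"],
    "max_pid": [100.0, 95.5, 77.1, 88.0],
    "ambiguous": [0, 0, 0, 1],
})

# Number of contigs resolved at each taxonomic level
level_counts = taxa_example["classification_level"].value_counts()

# Keep only contigs whose assignment was not flagged as ambiguous
unambiguous = taxa_example[taxa_example["ambiguous"] == 0]

print(level_counts)
print(len(unambiguous), "unambiguous contigs")
```

]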

}, {

"cell_type": "markdown", "metadata": {}, "source": [

"Next, we split the `full_classification` column into its individual classification levels so that we can collapse counts at the taxonomic level of interest. Be aware that the classification levels are specific to the taxonomic database used in the alignment, so the original references should be consulted to determine the appropriate levels:"

]

}, {

"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [

{
“data”: {
“text/html”: [
“<div>n”, “<style scoped>n”, ” .dataframe tbody tr th:only-of-type {n”, ” vertical-align: middle;n”, ” }n”, “n”, ” .dataframe tbody tr th {n”, ” vertical-align: top;n”, ” }n”, “n”, ” .dataframe thead th {n”, ” text-align: right;n”, ” }n”, “</style>n”, “<table border=”1” class=”dataframe”>n”, ” <thead>n”, ” <tr style=”text-align: right;”>n”, ” <th></th>n”, ” <th>Name</th>n”, ” <th>Supergroup</th>n”, ” <th>Division</th>n”, ” <th>Class</th>n”, ” <th>Order</th>n”, ” <th>Family</th>n”, ” <th>Genus</th>n”, ” <th>Species</th>n”, ” </tr>n”, ” </thead>n”, ” <tbody>n”, ” <tr>n”, ” <th>0</th>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_0</td>n”, ” <td>Stramenopiles</td>n”, ” <td>Ochrophyta</td>n”, ” <td>Bacillariophyta</td>n”, ” <td>Bacillariophyta_X</td>n”, ” <td>Polar-centric-Mediophyceae</td>n”, ” <td>Skeletonema</td>n”, ” <td>Skeletonema grethea</td>n”, ” </tr>n”, ” <tr>n”, ” <th>1</th>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_10</td>n”, ” <td>Hacrobia</td>n”, ” <td>Haptophyta</td>n”, ” <td>Prymnesiophyceae</td>n”, ” <td>Prymnesiales</td>n”, ” <td>Chrysochromulinaceae</td>n”, ” <td>Chrysochromulina</td>n”, ” <td>Chrysochromulina ericina</td>n”, ” </tr>n”, ” <tr>n”, ” <th>2</th>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_100</td>n”, ” <td>Hacrobia</td>n”, ” <td>Haptophyta</td>n”, ” <td>Prymnesiophyceae</td>n”, ” <td>Prymnesiales</td>n”, ” <td>Chrysochromulinaceae</td>n”, ” <td>Chrysochromulina</td>n”, ” <td>Chrysochromulina ericina</td>n”, ” </tr>n”, ” <tr>n”, ” <th>3</th>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_1005</td>n”, ” <td>Alveolata</td>n”, ” <td>Dinoflagellata</td>n”, ” <td>Dinophyceae</td>n”, ” <td>Prorocentrales</td>n”, ” <td>Prorocentraceae</td>n”, ” <td>Prorocentrum</td>n”, ” <td>Prorocentrum minimum</td>n”, ” </tr>n”, ” <tr>n”, ” <th>4</th>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_1008</td>n”, ” <td>Stramenopiles</td>n”, ” <td>Ochrophyta</td>n”, ” <td>Bacillariophyta</td>n”, ” <td>Bacillariophyta_X</td>n”, ” 
<td>Araphid-pennate</td>n”, ” <td>None</td>n”, ” <td>None</td>n”, ” </tr>n”, ” </tbody>n”, “</table>n”, “</div>”

], “text/plain”: [

” Name Supergroup Division \n”, “0 megahit_NarBay_A_megahit_NarBay_A_k111_0 Stramenopiles Ochrophyta n”, “1 megahit_NarBay_A_megahit_NarBay_A_k111_10 Hacrobia Haptophyta n”, “2 megahit_NarBay_A_megahit_NarBay_A_k111_100 Hacrobia Haptophyta n”, “3 megahit_NarBay_A_megahit_NarBay_A_k111_1005 Alveolata Dinoflagellata n”, “4 megahit_NarBay_A_megahit_NarBay_A_k111_1008 Stramenopiles Ochrophyta n”, “n”, ” Class Order Family \n”, “0 Bacillariophyta Bacillariophyta_X Polar-centric-Mediophyceae n”, “1 Prymnesiophyceae Prymnesiales Chrysochromulinaceae n”, “2 Prymnesiophyceae Prymnesiales Chrysochromulinaceae n”, “3 Dinophyceae Prorocentrales Prorocentraceae n”, “4 Bacillariophyta Bacillariophyta_X Araphid-pennate n”, “n”, ” Genus Species n”, “0 Skeletonema Skeletonema grethea n”, “1 Chrysochromulina Chrysochromulina ericina n”, “2 Chrysochromulina Chrysochromulina ericina n”, “3 Prorocentrum Prorocentrum minimum n”, “4 None None “

]

}, “execution_count”: 5, “metadata”: {}, “output_type”: “execute_result”

}

], "source": [

"# Split the '; '-delimited classification string into one column per level\n", "df = pd.concat([taxa['transcript_name'], taxa['full_classification'].str.split('; ', expand=True)], axis=1)\n", "# Label columns in data frame\n", "df.columns = ['Name', 'Supergroup', 'Division', 'Class', 'Order', 'Family', 'Genus', 'Species']\n", "df.head()"

]
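
}, {

"cell_type": "markdown", "metadata": {}, "source": [

The column labels above assume that the split yields exactly seven levels. As a hedged sketch, the helper below (`split_classification`, a hypothetical name) pads or trims the split result to a fixed list of levels, so that shorter classifications simply leave the lower ranks empty. The level names are the ones used in this notebook; other databases may need a different list.

```python
import pandas as pd

# Level names as used in this notebook; adjust for other databases.
LEVELS = ["Supergroup", "Division", "Class", "Order", "Family", "Genus", "Species"]

def split_classification(table, levels=LEVELS):
    """Split 'full_classification' into one column per taxonomic level."""
    parts = table["full_classification"].str.split("; ", expand=True)
    # Pad (or trim) to the expected number of levels so the labels always fit
    parts = parts.reindex(columns=range(len(levels)))
    parts.columns = levels
    return pd.concat([table["transcript_name"].rename("Name"), parts], axis=1)

# Toy input with hypothetical contig names; the second row stops at Family
example = pd.DataFrame({
    "transcript_name": ["contig_0", "contig_1"],
    "full_classification": [
        "Stramenopiles; Ochrophyta; Bacillariophyta; Bacillariophyta_X; "
        "Polar-centric-Mediophyceae; Skeletonema; Skeletonema grethea",
        "Stramenopiles; Ochrophyta; Bacillariophyta; Bacillariophyta_X; "
        "Araphid-pennate",
    ],
})
split_example = split_classification(example)
print(split_example)
```

With the real table this would be called as `split_classification(taxa)`.

]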

}, {

"cell_type": "markdown", "metadata": {}, "source": [

"Next, we build and read in a counts table from standard salmon output, although a similar table could be generated using any read aligner. Counts from individual samples aligned to the fasta assembly are joined into one data frame. There are many ways to achieve this; the approach below was adapted from solutions posted on StackOverflow (https://stackoverflow.com/questions/44428429/replace-column-name-with-file-name-shell-script) and StackExchange (https://unix.stackexchange.com/questions/467523/awk-for-merging-multiple-files-with-common-column)."

]

}, {

"cell_type": "code", "execution_count": 6, "metadata": {

"scrolled": true

}, "outputs": [

{

“name”: “stdout”, “output_type”: “stream”, “text”: [

“/vortexfs1/omics/alexander/ncohen/BATS2019-clio-metaT/EUKulele_NB/output/taxonomy_estimation/salmon_indiv_to_megan”, “/vortexfs1/omics/alexander/ncohen/BATS2019-clio-metaT/EUKulele_NB/output/taxonomy_estimationn”

]

}, {

“data”: {
“text/html”: [
“<div>n”, “<style scoped>n”, ” .dataframe tbody tr th:only-of-type {n”, ” vertical-align: middle;n”, ” }n”, “n”, ” .dataframe tbody tr th {n”, ” vertical-align: top;n”, ” }n”, “n”, ” .dataframe thead th {n”, ” text-align: right;n”, ” }n”, “</style>n”, “<table border=”1” class=”dataframe”>n”, ” <thead>n”, ” <tr style=”text-align: right;”>n”, ” <th></th>n”, ” <th>Name</th>n”, ” <th>Supergroup</th>n”, ” <th>Division</th>n”, ” <th>Class</th>n”, ” <th>Order</th>n”, ” <th>Family</th>n”, ” <th>Genus</th>n”, ” <th>Species</th>n”, ” <th>Unnamed: 1</th>n”, ” <th>SRR1810204</th>n”, ” <th>…</th>n”, ” <th>SRR1810207</th>n”, ” <th>SRR1810208</th>n”, ” <th>SRR1810209</th>n”, ” <th>SRR1810210</th>n”, ” <th>SRR1810211</th>n”, ” <th>SRR1810801</th>n”, ” <th>SRR181799</th>n”, ” <th>SRR1945044</th>n”, ” <th>SRR1945045</th>n”, ” <th>SRR1945046</th>n”, ” </tr>n”, ” </thead>n”, ” <tbody>n”, ” <tr>n”, ” <th>0</th>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_0</td>n”, ” <td>Stramenopiles</td>n”, ” <td>Ochrophyta</td>n”, ” <td>Bacillariophyta</td>n”, ” <td>Bacillariophyta_X</td>n”, ” <td>Polar-centric-Mediophyceae</td>n”, ” <td>Skeletonema</td>n”, ” <td>Skeletonema grethea</td>n”, ” <td>NaN</td>n”, ” <td>0.000</td>n”, ” <td>…</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” </tr>n”, ” <tr>n”, ” <th>1</th>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_10</td>n”, ” <td>Hacrobia</td>n”, ” <td>Haptophyta</td>n”, ” <td>Prymnesiophyceae</td>n”, ” <td>Prymnesiales</td>n”, ” <td>Chrysochromulinaceae</td>n”, ” <td>Chrysochromulina</td>n”, ” <td>Chrysochromulina ericina</td>n”, ” <td>NaN</td>n”, ” <td>0.000</td>n”, ” <td>…</td>n”, ” <td>3.973</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” </tr>n”, ” <tr>n”, ” 
<th>2</th>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_100</td>n”, ” <td>Hacrobia</td>n”, ” <td>Haptophyta</td>n”, ” <td>Prymnesiophyceae</td>n”, ” <td>Prymnesiales</td>n”, ” <td>Chrysochromulinaceae</td>n”, ” <td>Chrysochromulina</td>n”, ” <td>Chrysochromulina ericina</td>n”, ” <td>NaN</td>n”, ” <td>9.369</td>n”, ” <td>…</td>n”, ” <td>7.093</td>n”, ” <td>4.902</td>n”, ” <td>15.165</td>n”, ” <td>0.0</td>n”, ” <td>5.869</td>n”, ” <td>8.824</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” </tr>n”, ” <tr>n”, ” <th>3</th>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_1005</td>n”, ” <td>Alveolata</td>n”, ” <td>Dinoflagellata</td>n”, ” <td>Dinophyceae</td>n”, ” <td>Prorocentrales</td>n”, ” <td>Prorocentraceae</td>n”, ” <td>Prorocentrum</td>n”, ” <td>Prorocentrum minimum</td>n”, ” <td>NaN</td>n”, ” <td>0.000</td>n”, ” <td>…</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” </tr>n”, ” <tr>n”, ” <th>4</th>n”, ” <td>megahit_NarBay_A_megahit_NarBay_A_k111_1008</td>n”, ” <td>Stramenopiles</td>n”, ” <td>Ochrophyta</td>n”, ” <td>Bacillariophyta</td>n”, ” <td>Bacillariophyta_X</td>n”, ” <td>Araphid-pennate</td>n”, ” <td>None</td>n”, ” <td>None</td>n”, ” <td>NaN</td>n”, ” <td>0.000</td>n”, ” <td>…</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” </tr>n”, ” </tbody>n”, “</table>n”, “<p>5 rows × 22 columns</p>n”, “</div>”

], “text/plain”: [

” Name Supergroup Division \n”, “0 megahit_NarBay_A_megahit_NarBay_A_k111_0 Stramenopiles Ochrophyta n”, “1 megahit_NarBay_A_megahit_NarBay_A_k111_10 Hacrobia Haptophyta n”, “2 megahit_NarBay_A_megahit_NarBay_A_k111_100 Hacrobia Haptophyta n”, “3 megahit_NarBay_A_megahit_NarBay_A_k111_1005 Alveolata Dinoflagellata n”, “4 megahit_NarBay_A_megahit_NarBay_A_k111_1008 Stramenopiles Ochrophyta n”, “n”, ” Class Order Family \n”, “0 Bacillariophyta Bacillariophyta_X Polar-centric-Mediophyceae n”, “1 Prymnesiophyceae Prymnesiales Chrysochromulinaceae n”, “2 Prymnesiophyceae Prymnesiales Chrysochromulinaceae n”, “3 Dinophyceae Prorocentrales Prorocentraceae n”, “4 Bacillariophyta Bacillariophyta_X Araphid-pennate n”, “n”, ” Genus Species Unnamed: 1 SRR1810204 … \n”, “0 Skeletonema Skeletonema grethea NaN 0.000 … n”, “1 Chrysochromulina Chrysochromulina ericina NaN 0.000 … n”, “2 Chrysochromulina Chrysochromulina ericina NaN 9.369 … n”, “3 Prorocentrum Prorocentrum minimum NaN 0.000 … n”, “4 None None NaN 0.000 … n”, “n”, ” SRR1810207 SRR1810208 SRR1810209 SRR1810210 SRR1810211 SRR1810801 \n”, “0 0.000 0.000 0.000 0.0 0.000 0.000 n”, “1 3.973 0.000 0.000 0.0 0.000 0.000 n”, “2 7.093 4.902 15.165 0.0 5.869 8.824 n”, “3 0.000 0.000 0.000 0.0 0.000 0.000 n”, “4 0.000 0.000 0.000 0.0 0.000 0.000 n”, “n”, ” SRR181799 SRR1945044 SRR1945045 SRR1945046 n”, “0 0.0 0.0 0.0 0.0 n”, “1 0.0 0.0 0.0 0.0 n”, “2 0.0 0.0 0.0 0.0 n”, “3 0.0 0.0 0.0 0.0 n”, “4 0.0 0.0 0.0 0.0 n”, “n”, “[5 rows x 22 columns]”

]

}, “execution_count”: 6, “metadata”: {}, “output_type”: “execute_result”

}

], "source": [

"# Move over to the directory containing salmon output\n", "%cd /vortexfs1/omics/alexander/ncohen/BATS2019-clio-metaT/EUKulele_NB/output/taxonomy_estimation/salmon_indiv_to_mega\n", "# Relabel the NumReads header of each quant.sf with its sample ID\n", "! for i in *_quant/quant.sf; do awk -F, -v OFS=, 'NR==1{split(FILENAME,a,\"_quant\");$2= a[1] \"\"}1' ${i} | awk '{gsub(/NumReads,/,\"\",$5)}1'> ${i}_cleaned; done\n", "# Collect the last column (read counts) of every sample, keyed by contig name\n", "! awk '{samples[$1] = samples[$1] OFS $NF}; END {print \"Name\", samples[\"Name\"]; delete samples[\"Name\"]; for (name in samples) print name, samples[name]}' */quant.sf_cleaned > table.tab\n", "counts = pd.read_table('table.tab', sep=' ')\n", "%cd /vortexfs1/omics/alexander/ncohen/BATS2019-clio-metaT/EUKulele_NB/output/taxonomy_estimation\n", "combined = df.join(counts.set_index('Name'), on='Name')\n", "combined.head()"

]
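
}, {

"cell_type": "markdown", "metadata": {}, "source": [

The same table can also be assembled in pandas without the shell step. The sketch below assumes that each sample has its own `<sample>_quant/quant.sf` file in salmon's standard tab-separated layout (with `Name` and `NumReads` columns); the function name `merge_salmon_counts` is hypothetical and this is only one possible alternative to the awk commands above.

```python
from pathlib import Path

import pandas as pd

def merge_salmon_counts(quant_files):
    """Combine the NumReads column of several salmon quant.sf files."""
    columns = []
    for path in quant_files:
        path = Path(path)
        # Sample ID comes from the parent directory, e.g. SRR1810204_quant
        sample = path.parent.name.replace("_quant", "")
        reads = pd.read_csv(path, sep="\t", index_col="Name")["NumReads"]
        columns.append(reads.rename(sample))
    # Align on contig name; one column per sample, contig names in 'Name'
    return pd.concat(columns, axis=1).rename_axis("Name").reset_index()
```

Under these assumptions it could be called as `merge_salmon_counts(sorted(Path('.').glob('*_quant/quant.sf')))`, and the result joined to `df` on `Name` as above.

]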

}, {

"cell_type": "markdown", "metadata": {}, "source": [

"As a side note, in environmental metatranscriptomic analyses it is also helpful to combine functional annotations with taxonomic identifications and read counts in one data frame for downstream visualization, sharing, and exploration of the data. To do this, we can read in KEGG annotations obtained by aligning our assembly against the KEGG database. [This is also performed within the eukrhythmic metatranscriptomic assembly pipeline using arKEGGio (https://github.com/AlexanderLabWHOI/eukrhythmic)]:"

]

}, {

"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [

{

“name”: “stdout”, “output_type”: “stream”, “text”: [

“/vortexfs1/omics/alexander/data/NB_subsampled_11Sept/keggn”

]

}, {

“name”: “stderr”, “output_type”: “stream”, “text”: [

“/vortexfs1/home/ncohen/.conda/envs/EUKulele/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3072: DtypeWarning: Columns (3,4,5,6,7,8,9,10,11,12) have mixed types.Specify dtype option on import or set low_memory=False.n”, ” interactivity=interactivity, compiler=compiler, result=result)n”

]

}, {

“name”: “stdout”, “output_type”: “stream”, “text”: [

“/vortexfs1/omics/alexander/ncohen/BATS2019-clio-metaT/EUKulele_NB/output/taxonomy_estimationn”

]

}, {

“data”: {
“text/html”: [
“<div>n”, “<style scoped>n”, ” .dataframe tbody tr th:only-of-type {n”, ” vertical-align: middle;n”, ” }n”, “n”, ” .dataframe tbody tr th {n”, ” vertical-align: top;n”, ” }n”, “n”, ” .dataframe thead th {n”, ” text-align: right;n”, ” }n”, “</style>n”, “<table border=”1” class=”dataframe”>n”, ” <thead>n”, ” <tr style=”text-align: right;”>n”, ” <th></th>n”, ” <th>KO</th>n”, ” <th>query_id</th>n”, ” <th>subject_id</th>n”, ” <th>perc_ident</th>n”, ” <th>length</th>n”, ” <th>mismatch</th>n”, ” <th>gapopen</th>n”, ” <th>qstart</th>n”, ” <th>qend</th>n”, ” <th>sstart</th>n”, ” <th>…</th>n”, ” <th>SRR1810207</th>n”, ” <th>SRR1810208</th>n”, ” <th>SRR1810209</th>n”, ” <th>SRR1810210</th>n”, ” <th>SRR1810211</th>n”, ” <th>SRR1810801</th>n”, ” <th>SRR181799</th>n”, ” <th>SRR1945044</th>n”, ” <th>SRR1945045</th>n”, ” <th>SRR1945046</th>n”, ” </tr>n”, ” </thead>n”, ” <tbody>n”, ” <tr>n”, ” <th>0</th>n”, ” <td>K03283</td>n”, ” <td>megahit_NarBay_B_megahit_NarBay_B_k101_36771</td>n”, ” <td>smin:v1.2.008389.t1</td>n”, ” <td>92.1</td>n”, ” <td>151</td>n”, ” <td>12</td>n”, ” <td>0</td>n”, ” <td>49</td>n”, ” <td>501</td>n”, ” <td>6</td>n”, ” <td>…</td>n”, ” <td>2.479</td>n”, ” <td>1.431</td>n”, ” <td>10.28</td>n”, ” <td>0.0</td>n”, ” <td>3.311</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” </tr>n”, ” <tr>n”, ” <th>1</th>n”, ” <td>K03283</td>n”, ” <td>megahit_NarBay_B_megahit_NarBay_B_k101_36771</td>n”, ” <td>smin:v1.2.025479.t1</td>n”, ” <td>92.1</td>n”, ” <td>139</td>n”, ” <td>11</td>n”, ” <td>0</td>n”, ” <td>85</td>n”, ” <td>501</td>n”, ” <td>2</td>n”, ” <td>…</td>n”, ” <td>2.479</td>n”, ” <td>1.431</td>n”, ” <td>10.28</td>n”, ” <td>0.0</td>n”, ” <td>3.311</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” </tr>n”, ” <tr>n”, ” <th>2</th>n”, ” <td>K03283</td>n”, ” <td>megahit_NarBay_B_megahit_NarBay_B_k101_36771</td>n”, ” <td>tgo:TGME49_273760</td>n”, ” <td>83.4</td>n”, ” 
<td>151</td>n”, ” <td>25</td>n”, ” <td>0</td>n”, ” <td>49</td>n”, ” <td>501</td>n”, ” <td>6</td>n”, ” <td>…</td>n”, ” <td>2.479</td>n”, ” <td>1.431</td>n”, ” <td>10.28</td>n”, ” <td>0.0</td>n”, ” <td>3.311</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” </tr>n”, ” <tr>n”, ” <th>3</th>n”, ” <td>K03283</td>n”, ” <td>megahit_NarBay_B_megahit_NarBay_B_k101_36771</td>n”, ” <td>ddi:DDB_G0273249</td>n”, ” <td>80.1</td>n”, ” <td>151</td>n”, ” <td>30</td>n”, ” <td>0</td>n”, ” <td>49</td>n”, ” <td>501</td>n”, ” <td>4</td>n”, ” <td>…</td>n”, ” <td>2.479</td>n”, ” <td>1.431</td>n”, ” <td>10.28</td>n”, ” <td>0.0</td>n”, ” <td>3.311</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” </tr>n”, ” <tr>n”, ” <th>4</th>n”, ” <td>K03283</td>n”, ” <td>megahit_NarBay_B_megahit_NarBay_B_k101_36771</td>n”, ” <td>ddi:DDB_G0273623</td>n”, ” <td>80.1</td>n”, ” <td>151</td>n”, ” <td>30</td>n”, ” <td>0</td>n”, ” <td>49</td>n”, ” <td>501</td>n”, ” <td>4</td>n”, ” <td>…</td>n”, ” <td>2.479</td>n”, ” <td>1.431</td>n”, ” <td>10.28</td>n”, ” <td>0.0</td>n”, ” <td>3.311</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” </tr>n”, ” </tbody>n”, “</table>n”, “<p>5 rows × 42 columns</p>n”, “</div>”

], “text/plain”: [

” KO query_id subject_id \n”, “0 K03283 megahit_NarBay_B_megahit_NarBay_B_k101_36771 smin:v1.2.008389.t1 n”, “1 K03283 megahit_NarBay_B_megahit_NarBay_B_k101_36771 smin:v1.2.025479.t1 n”, “2 K03283 megahit_NarBay_B_megahit_NarBay_B_k101_36771 tgo:TGME49_273760 n”, “3 K03283 megahit_NarBay_B_megahit_NarBay_B_k101_36771 ddi:DDB_G0273249 n”, “4 K03283 megahit_NarBay_B_megahit_NarBay_B_k101_36771 ddi:DDB_G0273623 n”, “n”, ” perc_ident length mismatch gapopen qstart qend sstart … SRR1810207 \n”, “0 92.1 151 12 0 49 501 6 … 2.479 n”, “1 92.1 139 11 0 85 501 2 … 2.479 n”, “2 83.4 151 25 0 49 501 6 … 2.479 n”, “3 80.1 151 30 0 49 501 4 … 2.479 n”, “4 80.1 151 30 0 49 501 4 … 2.479 n”, “n”, ” SRR1810208 SRR1810209 SRR1810210 SRR1810211 SRR1810801 SRR181799 SRR1945044 \n”, “0 1.431 10.28 0.0 3.311 0.0 0.0 0.0 n”, “1 1.431 10.28 0.0 3.311 0.0 0.0 0.0 n”, “2 1.431 10.28 0.0 3.311 0.0 0.0 0.0 n”, “3 1.431 10.28 0.0 3.311 0.0 0.0 0.0 n”, “4 1.431 10.28 0.0 3.311 0.0 0.0 0.0 n”, “n”, ” SRR1945045 SRR1945046 n”, “0 0.0 0.0 n”, “1 0.0 0.0 n”, “2 0.0 0.0 n”, “3 0.0 0.0 n”, “4 0.0 0.0 n”, “n”, “[5 rows x 42 columns]”

]

}, “execution_count”: 7, “metadata”: {}, “output_type”: “execute_result”

}

], "source": [

"%cd /vortexfs1/omics/alexander/data/NB_subsampled_11Sept/kegg\n", "kegg = pd.read_csv('cat.kegg.csv', sep='\\t')\n", "%cd /vortexfs1/omics/alexander/ncohen/BATS2019-clio-metaT/EUKulele_NB/output/taxonomy_estimation\n", "# Match columns containing contig IDs: in the counts data frame they are in \"Name\"; in the KEGG output they are in \"query_id\".\n", "merged = kegg.join(combined.set_index('Name'), on='query_id')  # Check that this returns the correct counts/annotations\n", "merged.head()\n", "#merged.to_csv('counts_taxa_kegg.csv')  # Optional export of table to .csv file"

]
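
}, {

"cell_type": "markdown", "metadata": {}, "source": [

One way to act on the reminder to check the join is to let pandas validate it. The sketch below uses `DataFrame.merge` with its `validate` and `indicator` options on two toy tables (hypothetical contig IDs, KO labels, and counts); `annotate_with_counts` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

def annotate_with_counts(annotations, counts_table):
    """Left-join per-contig columns onto annotations and count unmatched rows."""
    merged = annotations.merge(
        counts_table,
        how="left",
        left_on="query_id",
        right_on="Name",
        validate="many_to_one",  # raises if a contig is duplicated in counts_table
        indicator=True,          # adds '_merge': 'both' or 'left_only'
    )
    unmatched = int((merged["_merge"] == "left_only").sum())
    return merged.drop(columns="_merge"), unmatched

# Toy tables: contig c0 has two annotation hits, c9 has no counts
kegg_example = pd.DataFrame({
    "KO": ["K03283", "K03283", "K00001"],
    "query_id": ["c0", "c0", "c9"],
})
counts_example = pd.DataFrame({"Name": ["c0", "c1"], "S1": [2.5, 0.0]})

merged_example, n_unmatched = annotate_with_counts(kegg_example, counts_example)
print(merged_example)
print(n_unmatched, "annotation rows without counts")
```

Note that a contig with several annotation hits repeats its counts on every hit row, as in the output above, so counts should not be summed across rows of the merged table without first removing duplicates.

]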

}, {

"cell_type": "markdown", "metadata": {}, "source": [

"Next, we subset the counts/taxonomy data frame to retain only the sample count columns and the classification level of interest (in this case, Supergroup):"

]

}, {

"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [

{
“data”: {
“text/html”: [
“<div>n”, “<style scoped>n”, ” .dataframe tbody tr th:only-of-type {n”, ” vertical-align: middle;n”, ” }n”, “n”, ” .dataframe tbody tr th {n”, ” vertical-align: top;n”, ” }n”, “n”, ” .dataframe thead th {n”, ” text-align: right;n”, ” }n”, “</style>n”, “<table border=”1” class=”dataframe”>n”, ” <thead>n”, ” <tr style=”text-align: right;”>n”, ” <th></th>n”, ” <th>SRR1810204</th>n”, ” <th>SRR1810205</th>n”, ” <th>SRR1810206</th>n”, ” <th>SRR1810207</th>n”, ” <th>SRR1810208</th>n”, ” <th>SRR1810209</th>n”, ” <th>SRR1810210</th>n”, ” <th>SRR1810211</th>n”, ” <th>SRR1810801</th>n”, ” <th>SRR181799</th>n”, ” <th>SRR1945044</th>n”, ” <th>SRR1945045</th>n”, ” <th>SRR1945046</th>n”, ” <th>Supergroup</th>n”, ” </tr>n”, ” </thead>n”, ” <tbody>n”, ” <tr>n”, ” <th>0</th>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>Stramenopiles</td>n”, ” </tr>n”, ” <tr>n”, ” <th>1</th>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>3.973</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>Hacrobia</td>n”, ” </tr>n”, ” <tr>n”, ” <th>2</th>n”, ” <td>9.369</td>n”, ” <td>1.001</td>n”, ” <td>0.0</td>n”, ” <td>7.093</td>n”, ” <td>4.902</td>n”, ” <td>15.165</td>n”, ” <td>0.0</td>n”, ” <td>5.869</td>n”, ” <td>8.824</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>Hacrobia</td>n”, ” </tr>n”, ” <tr>n”, ” <th>3</th>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>Alveolata</td>n”, ” </tr>n”, ” <tr>n”, ” 
<th>4</th>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.000</td>n”, ” <td>0.000</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>0.0</td>n”, ” <td>Stramenopiles</td>n”, ” </tr>n”, ” </tbody>n”, “</table>n”, “</div>”

], “text/plain”: [

” SRR1810204 SRR1810205 SRR1810206 SRR1810207 SRR1810208 SRR1810209 \n”, “0 0.000 0.000 0.0 0.000 0.000 0.000 n”, “1 0.000 0.000 0.0 3.973 0.000 0.000 n”, “2 9.369 1.001 0.0 7.093 4.902 15.165 n”, “3 0.000 0.000 0.0 0.000 0.000 0.000 n”, “4 0.000 0.000 0.0 0.000 0.000 0.000 n”, “n”, ” SRR1810210 SRR1810211 SRR1810801 SRR181799 SRR1945044 SRR1945045 \n”, “0 0.0 0.000 0.000 0.0 0.0 0.0 n”, “1 0.0 0.000 0.000 0.0 0.0 0.0 n”, “2 0.0 5.869 8.824 0.0 0.0 0.0 n”, “3 0.0 0.000 0.000 0.0 0.0 0.0 n”, “4 0.0 0.000 0.000 0.0 0.0 0.0 n”, “n”, ” SRR1945046 Supergroup n”, “0 0.0 Stramenopiles n”, “1 0.0 Hacrobia n”, “2 0.0 Hacrobia n”, “3 0.0 Alveolata n”, “4 0.0 Stramenopiles “

]

}, “execution_count”: 8, “metadata”: {}, “output_type”: “execute_result”

}

], "source": [

"# Subset out sample ID columns and the taxonomic level of interest\n", "subset = combined.loc[:, 'SRR1810204':'SRR1945046'].copy()  # copy so the new column is not added to a view\n", "subset['Supergroup'] = combined['Supergroup']\n", "subset.head()"

]

}, {

"cell_type": "markdown", "metadata": {}, "source": [

"We import visualization libraries and set a color palette:"

]

}, {

"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [

"import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "# set_palette() returns None, so keep the palette itself in `color`\n", "color = sns.color_palette('husl', 10)\n", "sns.set_palette(color)"

]

}, {

"cell_type": "markdown", "metadata": {}, "source": [

"Next, we collapse (sum) rows with the same taxonomic annotation. This table is useful for downstream applications such as reporting relative community abundance percentages:"

]

}, {

"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [

"# Group counts by Supergroup in each sample\n", "x = subset.groupby(['Supergroup']).sum()\n", "# Save grouped data frame to .csv file (optional)\n", "#x.to_csv('counts_supergroup.csv')"

]
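
}, {

"cell_type": "markdown", "metadata": {}, "source": [

To report relative abundance, the summed counts can be converted to per-sample percentages. A minimal sketch, using a made-up two-sample table in place of the grouped data frame (`to_percent` and the example values are hypothetical):

```python
import pandas as pd

def to_percent(grouped_counts):
    """Convert a taxa-by-sample count table to per-sample percentages."""
    totals = grouped_counts.sum(axis=0)
    # Divide each sample column by its own total; a sample with zero
    # total reads yields NaN rather than a percentage
    return grouped_counts.div(totals, axis=1) * 100

# Toy counts: rows are supergroups, columns are samples
example_counts = pd.DataFrame(
    {"S1": [30.0, 10.0], "S2": [5.0, 15.0]},
    index=pd.Index(["Stramenopiles", "Alveolata"], name="Supergroup"),
)
percent = to_percent(example_counts)
print(percent)
```

With the real data this would be `to_percent(x)`, and the transposed result can be passed to the same stacked bar plot so that every bar sums to 100.

]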

}, {

"cell_type": "markdown", "metadata": {}, "source": [

"Lastly, we plot the results as a stacked bar plot, which lets us compare community composition across samples based on read counts. We conclude that Stramenopiles have a high relative abundance in the transcript pool of these samples."

]

}, {

"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [

{
“data”: {

“image/png”: “n”, “text/plain”: [

“<Figure size 432x288 with 1 Axes>”

]

}, “metadata”: {

“needs_background”: “light”

}, “output_type”: “display_data”

}

], "source": [

"plot = x.T  # Transpose so that samples are rows and supergroups are columns\n", "plot.plot.bar(stacked=True, legend=True).legend(loc=(1, 0))"

]

}

], "metadata": {

"kernelspec": {
"display_name": "Python 3", "language": "python", "name": "python3"

}, "language_info": {

"codemirror_mode": {
"name": "ipython", "version": 3

}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9"

}

}, "nbformat": 4, "nbformat_minor": 2

}