In this workshop we will perform an analysis of the black band disease metagenome assembled in workshop 3. At the end of that workshop you should have assembled reads from nanopore sequencing of black band disease, and then classified the assembled contigs into bins using metabat2.
In theory each of the bins from our metagenomic assembly corresponds to a species or strain forming part of the black band disease community. These genomes are valuable resources that can be used to enhance analysis of other data such as RNA sequencing data. In addition (and relevant here), they can be used to make deductions about the metabolic processes occurring within a black band disease lesion. The full process of making these deductions involves quite a bit of biochemical knowledge, literature searching and cross referencing of outputs from many statistical tools. Here we will focus on some preliminary analyses that reveal basic processes within the mat and that provide information on the genes and pathways present.
The overall goal of functional analysis of metagenomes is to answer the following questions
Our starting point for this workshop is a bbd assembly on which we have already performed binning with MetaBat and a quality check with checkm. Files that you will need from these analysis and their expected locations are;
| Analysis Step | File or folder |
|---|---|
| Flye | ~/workshop3_metasm/flye/bbd |
| MetaBAT | ~/workshop3_metasm/metabat/bins_bbd |
| CheckM | ~/workshop3_metasm/checkm/checkm_results_bbd.txt |
If any of these steps did not work for you, don’t worry. You can retrieve pre-analysed results from a shared folder (shown later).
Start by creating a new project for this workshop. You should be familiar with this from your previous workshops. Make sure you name your project workshop4_anno and place it in your home directory (the default).
If your analyses from workshop3 worked you can start by linking these in to your new project as follows
ln -s ../workshop3_metasm/flye .
ln -s ../workshop3_metasm/metabat .
ln -s ../workshop3_metasm/checkm .
If your analyses from workshop3 didn’t work you should link in prepared data.
mkdir -p flye metabat checkm
cd ~/workshop4_anno/flye/
ln -s /pvol/data/mg/bbd/flye/bbd .
cd ~/workshop4_anno/metabat
ln -s /pvol/data/mg/bbd/metabat/bins_bbd .
cd ~/workshop4_anno/checkm
ln -s /pvol/data/mg/bbd/checkm/checkm_results_bbd.txt .
Note: Creating symbolic links will not work if you already have some partially completed results. If you do then you will need to delete those first. Take great care with this. It’s probably best to ask an instructor for help.
Change directory to the checkm folder.
cd ~/workshop4_anno/checkm
Now take a look at the results from our checkm analysis. These are summarised in the file checkm_results_bbd.txt
You can view the results in the terminal using cat, head, more or less commands. These all represent different ways to view a text file.
For example
head checkm_results_bbd.txt
If your screen is wide enough this should display the top 10 lines in the file in a nicely formatted way. If your screen is too narrow the lines will wrap and it will be hard to read. Take a look at the Marker lineage column for the top few bins. This column isn’t a full taxonomic classification but it does seem to give family, class or phylum level classification for the top most complete bins. As you go further down the list you will eventually see the marker lineage shift to a very broad level such as k_Bacteria. Try viewing a few more results by telling the head command to show more lines
head -n 20 checkm_results_bbd.txt
In the analyses that follow we would like to focus on the most complete bins because these are likely to contain the majority of useful information on function within the bbd community. These bins will also usually correspond to the most abundant (and therefore best assembled) members of the community. Here we are just using the checkm completeness to assess these factors but a more complete analysis might also look at the size of contigs within the bins and their read coverage.
Let’s focus on bins with a completeness greater than 15%. To extract the bin IDs for these we can use awk. This command should print only the lines where the 13th column has a value greater than 15.
cat checkm_results_bbd.txt | awk '$13>15'
This seems to be the bins we want but we just want the bin ID not the entire line. To get this we can use the print command
cat checkm_results_bbd.txt | awk '$13>15{print $1}'
Now let’s send this output to a file so that we can use it in the next step
cat checkm_results_bbd.txt | awk '$13>15{print $1}' > bbd_topbins.txt
First create a folder for gtdbtk in the top level of your workshop4_anno project. Then change directory to this folder.
cd ~/workshop4_anno
mkdir gtdbtk
cd gtdbtk
Now create a subfolder within the gtdbtk folder called genomes. This is where we will put a copy of each of the top bins we identified in step 1.
mkdir genomes
This next bit of code is a bit more advanced. Don’t worry if you don’t understand it all. The idea here is that we will iterate over our list of topbins. For each of the topbins we want to copy a file corresponding to that bin into our genomes folder.
We’ll build up the command in steps. First just get the while loop working
while read bin;do echo $bin;done < ../checkm/bbd_topbins.txt
We’re not doing any copying of files yet. Instead, we have used echo $bin to simply print the name of the bin. We need to turn this bin name into a path specifying the fasta file for the bin. So for every bin we need to prepend the path to its folder ../metabat/bins_bbd/ and then append the file extension .fa
while read bin;do echo "../metabat/bins_bbd/${bin}.fa";done < ../checkm/bbd_topbins.txt
Now we are ready to perform the copy
while read bin;do cp "../metabat/bins_bbd/${bin}.fa" genomes;done < ../checkm/bbd_topbins.txt
Check that it worked by listing files within the genomes directory. You should see one file for each of the bins
ls genomes
These bins form input for gtdbtk. We can now run the classification workflow. This is another very resource intensive step. We will need to run it by creating a slurm job script. Hopefully the workflow for this is familiar by now.
First create a blank shell script file and save it inside your gtdbtk directory. Name your file run_gtdbtk.sh
Copy the following code into your run_gtdbtk.sh file and save it.
#!/bin/bash
#SBATCH --time=60
#SBATCH --ntasks=2 --mem=24gb
echo "Starting gtdbtk in $(pwd) at $(date)"
shopt -s expand_aliases
alias gtdbtk='apptainer run -B /pvol:/pvol,/pvol/data/mg/gtdbtkdata/release207_v2/:/refdata /pvol/data/sif/gtdbtk.sif'
gtdbtk classify_wf -x fa --cpus 2 --genome_dir genomes --out_dir out --skip_ani_screen
echo "Finished gtdbtk in $(pwd) at $(date)"
Then submit your job using
sbatch run_gtdbtk.sh
This will take quite a while. While you are waiting you can continue with step 3.
Our ultimate goal is to use the BBD metagenome to improve our understanding of metabolic processes within the mat. A crucial step toward this is identifying the genes present within the genomes of bbd taxa. We will be using a program called prokka for this purpose. Prokka will not only identify the gene sequences, it also uses a range of methods to identify homologous genes with known functions. The idea behind this is that if two genes share a common ancestor (ie are homologous) their functions are likely to be similar. This in turn allows Prokka to assign a putative name and function to some of the genes within our newly assembled metagenome bins.
As usual we will create a directory for our prokka work and change to it.
cd ~/workshop4_anno/
mkdir prokka
cd prokka
First we try running prokka for a single bin. This produces quite a bit of output. Watch the output as it goes by. This output gives a good idea of the various steps prokka goes through.
Next find out the first bin ID in your list
head -n 1 ../checkm/bbd_topbins.txt
In my case this was metabat.5 but your result might be different. Be sure to substitute your own bin id where appropriate in the commands below.
bin=metabat.5
prokka --outdir $bin --prefix $bin ../metabat/bins_bbd/${bin}.fa
This should run fairly quickly (around 1-2 minutes), and if it works you should end up with a folder called metabat.5. List the contents of this folder. It should look like this
ls metabat.5
metabat.5.err metabat.5.faa metabat.5.ffn metabat.5.fna metabat.5.fsa metabat.5.gbk metabat.5.gff metabat.5.log metabat.5.sqn metabat.5.tbl metabat.5.tsv metabat.5.txt
There are many output files here. For our purposes the most important one is metabat.5.tsv which is a tab separated file with the name of each gene and its inferred name and functional info (if available)
Now we are ready to run prokka on all the metagenomes. This is another computationally intensive step, so it’s time for another slurm script.
Create a new blank text file and save it in your prokka directory. Call it run_prokka.sh and copy the following code into it.
#!/bin/bash
#SBATCH --time=60
#SBATCH --ntasks=2 --mem=2gb
echo "Starting prokka in $(pwd) at $(date)"
shopt -s expand_aliases
alias prokka='apptainer run -B /pvol/:/pvol /pvol/data/sif/prokka.sif prokka'
while read bin;do
prokka --outdir $bin --prefix $bin "../metabat/bins_bbd/${bin}.fa";
done < ../checkm/bbd_topbins.txt
echo "Finished prokka in $(pwd) at $(date)"
Submit your job with
sbatch run_prokka.sh
While this is running you can keep working on step 4.
Infering metabolic pathways can be quite complex and is difficult to do in a completely objective or automated fashion. In our case we are lucky because we can refer to the general model of BBD developed by Sato et al.

According to this figure we expect to find that the cyanobacteria should be the main organism capable of photosynthesis. We also expect to find some organisms capable of sulfate reduction. Nominally this should be species of Desulfovibrio, however, there may be other organisms with this capability and since our assembled bins for Desulfovibrio are quite incomplete we might not be able to find sulfate-reducing enzymes in this group.
We will use a tool called minpath to take functional info for individual genes and use this to infer the presence of broader biological pathways (involving multiple genes/steps). We will run this on each of the genomes (bins) in our analysis. Minpath will look at all of the annotated genes in our data and infer the minimal set of biological pathways that can explain the presence of all these genes. It is important to note that this is just one approach (parsimony) to inferring pathways and other approaches also exist.
First create a directory for minpath results and change into it.
cd ~/workshop4_anno
mkdir minpath
cd minpath
Inputs for minpath will be prepared for each bin based on the prokka .tsv file. Start by looking at the .tsv file for a single bin
head ../prokka/metabat.5/metabat.5.tsv
You should see 9 genes displayed. Many will be missing annotations (called hypothetical protein) but some will have a gene name (eg manA), an EC (Enzyme Commission) number (eg 5.3.1.8), a COG (Clusters of Orthologous Genes) id (eg COG1482) and a protein name (eg Mannose-6-phosphate isomerase ManA).
For input to minpath we only want to keep a unique ID for each gene (column 1) and its EC number (column 5). We can transform the tsv into this format using awk like this
head ../prokka/metabat.5/metabat.5.tsv | awk -F '\t' '{print $1,$5}'
Note that here we are using -F '\t' to tell awk that our input is tab separated. We also want to filter the output to only include lines with an EC number
head ../prokka/metabat.5/metabat.5.tsv | awk -F '\t' '$5{print $1,$5}'
Next we need to ensure that the output is tab separated and we also need to remove the header line
head ../prokka/metabat.5/metabat.5.tsv | awk -F '\t' '$5{OFS="\t";print $1,$5}' | grep -v '^locus'
This of course only processes the first 10 lines … we need to change head to cat to include all genes
cat ../prokka/metabat.5/metabat.5.tsv | awk -F '\t' '$5{OFS="\t";print $1,$5}' | grep -v '^locus'
Finally we send the output to a file so that it can be used as input to minpath
cat ../prokka/metabat.5/metabat.5.tsv | awk -F '\t' '$5{OFS="\t";print $1,$5}' | grep -v '^locus' > metabat.5.ec
Now run minpath on this file as follows;
minpath -ec metabat.5.ec -report metabat.5.ec.report -details metabat.5.ec.details
Take a look at the top few lines of output
head metabat.5.ec.report
As an example consider the first line
path PWY-7321 any n/a naive 1 minpath 1 fam0 4 fam-found 2 name ecdysteroid metabolism (arthropods)
Important parts of this line are
Now go back to check on your prokka job. You can do this by using the command squeue. Look for a job named run_prok and your username. When your job is done it will not appear in this list. You can also look at the output log file
tail ~/workshop4_anno/prokka/slurm-XXX.out
When your job is done it should show Finished prokka on the last line of output.
When prokka is done you can run the following code to prepare minpath inputs for all bins.
while read bin;do
cat ../prokka/$bin/${bin}.tsv | awk -F '\t' '$5{printf("%s\t%s\n", $1,$5)}' | grep -v '^locus' > ${bin}.ec
done < ../checkm/bbd_topbins.txt
Now run minpath on every bin. This should run pretty fast.
while read bin;do
minpath -ec ${bin}.ec -report ${bin}.ec.report -details ${bin}.ec.details
done < ../checkm/bbd_topbins.txt
There is a huge amount of information in the outputs from minpath. To keep things tractable we are just going to look at two pathways.
The first pathway is named photosynthesis light reactions and has the metacyc code PWY-101. This pathway includes the essential components of photosynthesis that occur in light, Photosystem I and Photosystem II.
Use the following command to search for this pathway across all the bins
grep 'PWY-101' *.ec.report
Cross reference your results with the output from gtdbtk to see the taxonomy of bins containing this pathway. Note that you will need to modify the commands below to use the correct bin numbers (these will be different for everyone).
grep 'metabat.26' ../gtdbtk/out/gtdbtk.bac120.summary.tsv
The second pathway is a little less obviously named. It is called superpathway of tetrathionate reduction and has metacyc code PWY-5360. This pathway is important because it includes enzymes required to reduce sulfite to hydrogen sulfide. Note that sulfite acts as the terminal electron acceptor in this pathway, however, in a broader sense the term “sulfate reducing” refers to micro-organisms that can produce sulfide from a range of oxidised sulfur compounds.