mb5370

In this workshop we will perform an analysis of the black band disease metagenome assembled in workshop 3. At the end of that workshop you should have assembled reads from nanopore sequencing of black band disease, and then classified the assembled contigs into bins using metabat2.

In theory each of the bins from our metagenomic assembly corresponds to a species or strain forming part of the black band disease community. These genomes are valuable resources that can be used to enhance analysis of other data such as RNA sequencing data. In addition (and relevant here), they can be used to make deductions about the metabolic processes occurring within a black band disease lesion. The full process of making these deductions involves quite a bit of biochemical knowledge, literature searching and cross referencing of outputs from many statistical tools. Here we will focus on some preliminary analyses that reveal basic processes within the mat and that provide information on the genes and pathways present.

Workshop outline

The overall goal of functional analysis of metagenomes is to answer the following questions

Who is there, or what taxonomic groups of microbes are present?
What can they do? (what genes do they have and what biochemical pathways do these support)

Our starting point for this workshop is a bbd assembly on which we have already performed binning with MetaBat and a quality check with checkm. Files that you will need from these analysis and their expected locations are;

Analysis Step	File or folder
Flye	`~/workshop3_metasm/flye/bbd`
MetaBAT	`~/workshop3_metasm/metabat/bins_bbd`
CheckM	`~/workshop3_metasm/checkm/checkm_results_bbd.txt`

If any of these steps did not work for you, don’t worry. You can retrieve pre-analysed results from a shared folder (shown later).

Step 0: Setup your project and starting data

Start by creating a new project for this workshop. You should be familiar with this from your previous workshops. Make sure you name your project workshop4_anno and place it in your home directory (the default).

If your analyses from workshop3 worked you can start by linking these in to your new project as follows

ln -s ../workshop3_metasm/flye .
ln -s ../workshop3_metasm/metabat .
ln -s ../workshop3_metasm/checkm .

If your analyses from workshop3 didn’t work you should link in prepared data.

mkdir -p flye metabat checkm

cd ~/workshop4_anno/flye/
ln -s /pvol/data/mg/bbd/flye/bbd .

cd ~/workshop4_anno/metabat
ln -s /pvol/data/mg/bbd/metabat/bins_bbd .

cd ~/workshop4_anno/checkm
ln -s /pvol/data/mg/bbd/checkm/checkm_results_bbd.txt .

Note: Creating symbolic links will not work if you already have some partially completed results. If you do then you will need to delete those first. Take great care with this. It’s probably best to ask an instructor for help.

Step 1: Select bins based on checkm results

Change directory to the checkm folder.

cd ~/workshop4_anno/checkm

Now take a look at the results from our checkm analysis. These are summarised in the file checkm_results_bbd.txt

You can view the results in the terminal using cat, head, more or less commands. These all represent different ways to view a text file.

For example

head checkm_results_bbd.txt

If your screen is wide enough this should display the top 10 lines in the file in a nicely formatted way. If your screen is too narrow the lines will wrap and it will be hard to read. Take a look at the Marker lineage column for the top few bins. This column isn’t a full taxonomic classification but it does seem to give family, class or phylum level classification for the top most complete bins. As you go further down the list you will eventually see the marker lineage shift to a very broad level such as k_Bacteria. Try viewing a few more results by telling the head command to show more lines

head -n 20 checkm_results_bbd.txt

In the analyses that follow we would like to focus on the most complete bins because these are likely to contain the majority of useful information on function within the bbd community. These bins will also usually correspond to the most abundant (and therefore best assembled) members of the community. Here we are just using the checkm completeness to assess these factors but a more complete analysis might also look at the size of contigs within the bins and their read coverage.

Let’s focus on bins with a completeness greater than 15%. To extract the bin IDs for these we can use awk. This command should print only the lines where the 13th column has a value greater than 15.

cat checkm_results_bbd.txt | awk '$13>15'

This seems to be the bins we want but we just want the bin ID not the entire line. To get this we can use the print command

cat checkm_results_bbd.txt | awk '$13>15{print $1}'

Now let’s send this output to a file so that we can use it in the next step

cat checkm_results_bbd.txt | awk '$13>15{print $1}' > bbd_topbins.txt

Step 2: Perform taxonomic classification with gtdbtk

First create a folder for gtdbtk in the top level of your workshop4_anno project. Then change directory to this folder.

cd ~/workshop4_anno
mkdir gtdbtk
cd gtdbtk

Now create a subfolder within the gtdbtk folder called genomes. This is where we will put a copy of each of the top bins we identified in step 1.

mkdir genomes

This next bit of code is a bit more advanced. Don’t worry if you don’t understand it all. The idea here is that we will iterate over our list of topbins. For each of the topbins we want to copy a file corresponding to that bin into our genomes folder.

We’ll build up the command in steps. First just get the while loop working

while read bin;do echo $bin;done < ../checkm/bbd_topbins.txt

We’re not doing any copying of files yet. Instead, we have used echo $bin to simply print the name of the bin. We need to turn this bin name into a path specifying the fasta file for the bin. So for every bin we need to prepend the path to its folder ../metabat/bins_bbd/ and then append the file extension .fa

while read bin;do echo "../metabat/bins_bbd/${bin}.fa";done < ../checkm/bbd_topbins.txt

Now we are ready to perform the copy

while read bin;do cp "../metabat/bins_bbd/${bin}.fa" genomes;done < ../checkm/bbd_topbins.txt

Check that it worked by listing files within the genomes directory. You should see one file for each of the bins

ls genomes

These bins form input for gtdbtk. We can now run the classification workflow. This is another very resource intensive step. We will need to run it by creating a slurm job script. Hopefully the workflow for this is familiar by now.

First create a blank shell script file and save it inside your gtdbtk directory. Name your file run_gtdbtk.sh

Copy the following code into your run_gtdbtk.sh file and save it.

#!/bin/bash
#SBATCH --time=60
#SBATCH --ntasks=2 --mem=24gb

echo "Starting gtdbtk in $(pwd) at $(date)"

shopt -s expand_aliases
alias gtdbtk='apptainer run -B /pvol:/pvol,/pvol/data/mg/gtdbtkdata/release207_v2/:/refdata /pvol/data/sif/gtdbtk.sif'

gtdbtk classify_wf -x fa --cpus 2 --genome_dir genomes --out_dir out --skip_ani_screen

echo "Finished gtdbtk in $(pwd) at $(date)"

Then submit your job using

sbatch run_gtdbtk.sh

This will take quite a while. While you are waiting you can continue with step 3.

Step 3. Find and annotate genes with prokka

Our ultimate goal is to use the BBD metagenome to improve our understanding of metabolic processes within the mat. A crucial step toward this is identifying the genes present within the genomes of bbd taxa. We will be using a program called prokka for this purpose. Prokka will not only identify the gene sequences, it also uses a range of methods to identify homologous genes with known functions. The idea behind this is that if two genes share a common ancestor (ie are homologous) their functions are likely to be similar. This in turn allows Prokka to assign a putative name and function to some of the genes within our newly assembled metagenome bins.

As usual we will create a directory for our prokka work and change to it.

cd ~/workshop4_anno/
mkdir prokka
cd prokka

First we try running prokka for a single bin. This produces quite a bit of output. Watch the output as it goes by. This output gives a good idea of the various steps prokka goes through.

Next find out the first bin ID in your list

head -n 1 ../checkm/bbd_topbins.txt

In my case this was metabat.5 but your result might be different. Be sure to substitute your own bin id where appropriate in the commands below.

bin=metabat.5
prokka --outdir $bin --prefix $bin  ../metabat/bins_bbd/${bin}.fa

This should run fairly quickly (around 1-2 minutes), and if it works you should end up with a folder called metabat.5. List the contents of this folder. It should look like this

ls metabat.5

metabat.5.err  metabat.5.faa  metabat.5.ffn  metabat.5.fna  metabat.5.fsa  metabat.5.gbk  metabat.5.gff  metabat.5.log  metabat.5.sqn  metabat.5.tbl  metabat.5.tsv  metabat.5.txt

There are many output files here. For our purposes the most important one is metabat.5.tsv which is a tab separated file with the name of each gene and its inferred name and functional info (if available)

Now we are ready to run prokka on all the metagenomes. This is another computationally intensive step, so it’s time for another slurm script.

Create a new blank text file and save it in your prokka directory. Call it run_prokka.sh and copy the following code into it.

#!/bin/bash
#SBATCH --time=60
#SBATCH --ntasks=2 --mem=2gb

echo "Starting prokka in $(pwd) at $(date)"

shopt -s expand_aliases
alias prokka='apptainer run -B /pvol/:/pvol /pvol/data/sif/prokka.sif prokka'

while read bin;do 
  prokka --outdir $bin --prefix $bin "../metabat/bins_bbd/${bin}.fa";
done < ../checkm/bbd_topbins.txt

echo "Finished prokka in $(pwd) at $(date)"

Submit your job with

sbatch run_prokka.sh

While this is running you can keep working on step 4.

Step 4: Infer metabolic pathways

Infering metabolic pathways can be quite complex and is difficult to do in a completely objective or automated fashion. In our case we are lucky because we can refer to the general model of BBD developed by Sato et al.

According to this figure we expect to find that the cyanobacteria should be the main organism capable of photosynthesis. We also expect to find some organisms capable of sulfate reduction. Nominally this should be species of Desulfovibrio, however, there may be other organisms with this capability and since our assembled bins for Desulfovibrio are quite incomplete we might not be able to find sulfate-reducing enzymes in this group.

We will use a tool called minpath to take functional info for individual genes and use this to infer the presence of broader biological pathways (involving multiple genes/steps). We will run this on each of the genomes (bins) in our analysis. Minpath will look at all of the annotated genes in our data and infer the minimal set of biological pathways that can explain the presence of all these genes. It is important to note that this is just one approach (parsimony) to inferring pathways and other approaches also exist.

First create a directory for minpath results and change into it.

cd ~/workshop4_anno

mkdir minpath
cd minpath

Inputs for minpath will be prepared for each bin based on the prokka .tsv file. Start by looking at the .tsv file for a single bin

head ../prokka/metabat.5/metabat.5.tsv

You should see 9 genes displayed. Many will be missing annotations (called hypothetical protein) but some will have a gene name (eg manA), an EC (Enzyme Commission) number (eg 5.3.1.8), a COG (Clusters of Orthologous Genes) id (eg COG1482) and a protein name (eg Mannose-6-phosphate isomerase ManA).

For input to minpath we only want to keep a unique ID for each gene (column 1) and its EC number (column 5). We can transform the tsv into this format using awk like this

head ../prokka/metabat.5/metabat.5.tsv | awk -F '\t' '{print $1,$5}'

Note that here we are using -F '\t' to tell awk that our input is tab separated. We also want to filter the output to only include lines with an EC number

head ../prokka/metabat.5/metabat.5.tsv | awk -F '\t' '$5{print $1,$5}'

Next we need to ensure that the output is tab separated and we also need to remove the header line

head ../prokka/metabat.5/metabat.5.tsv | awk -F '\t' '$5{OFS="\t";print $1,$5}' | grep -v '^locus'

This of course only processes the first 10 lines … we need to change head to cat to include all genes

cat ../prokka/metabat.5/metabat.5.tsv | awk -F '\t' '$5{OFS="\t";print $1,$5}' | grep -v '^locus'

Finally we send the output to a file so that it can be used as input to minpath

cat ../prokka/metabat.5/metabat.5.tsv | awk -F '\t' '$5{OFS="\t";print $1,$5}' | grep -v '^locus' > metabat.5.ec

Now run minpath on this file as follows;

minpath -ec metabat.5.ec -report metabat.5.ec.report -details metabat.5.ec.details

Take a look at the top few lines of output

head metabat.5.ec.report

As an example consider the first line

path PWY-7321 any n/a  naive 1  minpath 1  fam0  4  fam-found  2  name  ecdysteroid metabolism (arthropods)

Important parts of this line are

PWY-7321 which is a pathway identifier in the metacyc database. You can use this to lookup the full pathway. Eg PWY-7321
fam0 4 tells us that four gene families are involved in this pathway
fam-found 2 tells us that we found 2 gene families from the pathway in this genome
name ecdysteroid metabolism (arthropods) is a name for the pathway

Now go back to check on your prokka job. You can do this by using the command squeue. Look for a job named run_prok and your username. When your job is done it will not appear in this list. You can also look at the output log file

tail ~/workshop4_anno/prokka/slurm-XXX.out

When your job is done it should show Finished prokka on the last line of output.

When prokka is done you can run the following code to prepare minpath inputs for all bins.

while read bin;do 
  cat ../prokka/$bin/${bin}.tsv | awk -F '\t' '$5{printf("%s\t%s\n", $1,$5)}' | grep -v '^locus' > ${bin}.ec 
done < ../checkm/bbd_topbins.txt

Now run minpath on every bin. This should run pretty fast.

while read bin;do 
  minpath -ec ${bin}.ec -report ${bin}.ec.report -details ${bin}.ec.details
done < ../checkm/bbd_topbins.txt

There is a huge amount of information in the outputs from minpath. To keep things tractable we are just going to look at two pathways.

The first pathway is named photosynthesis light reactions and has the metacyc code PWY-101. This pathway includes the essential components of photosynthesis that occur in light, Photosystem I and Photosystem II.

Use the following command to search for this pathway across all the bins

grep 'PWY-101' *.ec.report

Question 4.1

How many bins contain at least one gene from this pathway?
Which bin has all of the genes (ie a complete pathway)?

Cross reference your results with the output from gtdbtk to see the taxonomy of bins containing this pathway. Note that you will need to modify the commands below to use the correct bin numbers (these will be different for everyone).

grep 'metabat.26' ../gtdbtk/out/gtdbtk.bac120.summary.tsv

The second pathway is a little less obviously named. It is called superpathway of tetrathionate reduction and has metacyc code PWY-5360. This pathway is important because it includes enzymes required to reduce sulfite to hydrogen sulfide. Note that sulfite acts as the terminal electron acceptor in this pathway, however, in a broader sense the term “sulfate reducing” refers to micro-organisms that can produce sulfide from a range of oxidised sulfur compounds.

Question 4.2

How many bins contain the pathway PWY-5360?
What family (in a taxonomic sense) do these bins belong to?