Metagenomics Workshop by Javad Sadeghi

Metagenomics Workshop

What is Windows Subsystem for Linux (WSL)?

The Windows Subsystem for Linux (WSL) is a feature of the Windows operating system that enables you to run a Linux file system, along with Linux command-line tools and GUI apps, directly on Windows, alongside your traditional Windows desktop and apps.

Prerequisites

You must be running Windows 10 or Windows 11 to use the commands below.

Command

Open PowerShell or Windows Command Prompt in administrator mode by right-clicking and selecting "Run as administrator", enter the wsl --install command, then restart your machine.

For more information, see: https://learn.microsoft.com/en-us/windows/wsl/install

After installing WSL, close your Command Prompt, open it again, type "wsl", and hit "Enter".

WSL stores your Windows drives in the /mnt folder, with the name of the drive as a subfolder. For example, your C:\ drive will be present at /mnt/c/ for you to use. Keeping this in mind, you can change to your specific folder like so: cd /mnt/e/username/folder1/folder2
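The drive mapping described above can be sketched as a small shell helper. The function name win_to_wsl is hypothetical, and the sketch assumes a plain drive-letter path such as E:\folder (no network shares).

```shell
# Hypothetical helper: convert a Windows drive path to its WSL /mnt location.
win_to_wsl() {
    # First character is the drive letter; WSL uses it in lower case.
    drive=$(printf '%s' "$1" | cut -c1 | tr '[:upper:]' '[:lower:]')
    # Everything after "X:" is the folder path; swap backslashes for slashes.
    rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
    printf '/mnt/%s%s\n' "$drive" "$rest"
}

win_to_wsl 'E:\username\folder1\folder2'   # prints /mnt/e/username/folder1/folder2
```

The printed path can be passed straight to cd.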

Linux, with its robust capabilities, is a preferred choice for many in the scientific community. Installing FastQC on a Linux system is straightforward. Follow these steps to ensure a smooth installation:

Update Your System: Start by updating your system’s package list to have the latest software references. Open your terminal and enter:

Install FastQC: With your package list updated, proceed to install FastQC. In the terminal, type:

This command initiates the download and installation of FastQC.

Verification: To confirm that FastQC has been installed correctly, run the following in your terminal:

This should display the help information and version number, indicating a successful installation. By completing these steps, FastQC will be installed and ready for use on your Linux system, setting the stage for high-quality sequence data analysis.

wsl --install

sudo apt update

sudo apt -y install fastqc

fastqc --help
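The verification step can be generalised to any tool used in this workshop. The following is a minimal sketch; the function name check_tools is hypothetical, and it only tests whether a command is on the PATH, not whether it runs correctly.

```shell
# Hypothetical helper: report whether each named tool is on the PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# fastqc is only found once the installation above has succeeded.
check_tools fastqc || echo "install the missing tools before continuing"
```

Several names can be passed at once, for example check_tools fastqc bowtie2 samtools.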

This genome-centric metagenomics workshop will teach you how to obtain provisional whole genomes of individual populations from a mixed microbial community using metagenomics. The workshop has three parts:

Part I: Participants will learn to install required software, check raw sequence read quality, perform read quality control, and trim their sequence data. Participants will also learn how to upload and download sequence files from online archive databases.

Part II: Participants will learn how to assemble and annotate contigs and bin contigs into provisional whole genome sequences.

Part III: Participants will learn how to extract taxonomic information, functional annotations, and pathway information for each binned genome.

Workshop Topic Highlights

Metagenome reads quality check and quality control
Assemble quality controlled reads into contigs
Annotate assembled contigs
Map quality controlled reads onto contigs
Bin contigs into provisional genome bins
Extract taxonomic, functional, and pathway annotations for each binned genome

Software installation

In this section you will learn how to install several metagenomics software packages, including those used for metagenome assembly, binning, gene prediction and annotation, and strain-level analysis.

First install brew. You can go to https://brew.sh/ and follow the instructions on how to install brew on your system.

Or copy and paste the following command into your terminal:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Once brew is installed, run the following command to install fastqc (this can take up to 5 minutes):

brew install fastqc

After installing fastqc, test your installation by running: fastqc --help

You can either bring your own sequences of interest or download sequences from online resources. One of the most useful online resources for downloading archived sequence files is the NCBI SRA. To do so:

Go to the NCBI SRA website (https://www.ncbi.nlm.nih.gov/sra) and search for a species or project.
After finding the BioProject/sequence that you want, copy the SRA ID under the Run tab.
Go to your terminal (all users: Mac, Ubuntu) and install the SRA downloader using the command below.
After installing sra-toolkit, use the code below to download the sequence with the SRA ID that you found earlier. For example, I was interested in the human microbiome and found this sequence (https://www.ncbi.nlm.nih.gov/sra/SRX25254909[accn]); its SRA ID was SRR29754043.
Now we have to convert the SRA file to a FASTQ file. Go to the downloaded folder using the cd command, then use the code below to convert SRA to FASTQ.
Now let's check the FASTQ file quality using the fastqc software.

sudo apt install sra-toolkit

prefetch SRR29754043

fastq-dump --split-files SRR29754043 -O your/output/folder

cd SRR29754043

cd your/output/folder

mkdir -p SRA_fastq

fastqc *fastq -o SRA_fastq
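Before running FastQC it can be useful to confirm that the FASTQ conversion produced the number of reads you expect. The sketch below assumes the usual four-lines-per-read FASTQ layout (no wrapped sequence lines); the function name count_reads and the two demo reads are made up for illustration.

```shell
count_reads() {
    # A FASTQ record spans four lines: header, sequence, '+', qualities.
    awk 'END { print NR / 4 }' "$1"
}

# Tiny made-up FASTQ file with two reads, for demonstration only.
demo=$(mktemp)
cat > "$demo" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGCAAGT
+
IIIIIIII
EOF

count_reads "$demo"   # prints 2
rm -f "$demo"
```

For paired-end data, the _1 and _2 files should report the same number.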

There is a variety of software for trimming (e.g., removing adapters). Adapter sequences should be removed from reads because they interfere with downstream analyses, such as alignment of reads to a reference. SRA makes a few changes to the sequence files that are not compatible with some of the analyses. That is why we are going to use the workshop's own sequences throughout the workshop. However, you can use your SRA sequences to practice trimming.

Find your adapter (using FastQC or the terminal). In the terminal, use the grep command with the sequence that you think might be the adapter to see if you can find it in your reads. For example, using fastqc I found that my sequence from SRA has the NexteraPE-PE adapter, and I used grep to check whether this is true.
Go to http://www.usadellab.org/cms/?page=trimmomatic.
Download Trimmomatic (select Version 0.39: binary).
Open the downloaded folder, go to the adapters folder, and open the NexteraPE-PE.fa file (or whichever adapter file applies to you based on the first step). Copy the sequences.
Then run the code below in your terminal.

Trimmomatic works with FASTQ files (using phred + 33 or phred + 64 quality scores, depending on the Illumina pipeline used), either uncompressed or gzipp'ed. Use of gzip format is determined based on the .gz extension.

For single-ended data, one input and one output file are specified, plus the processing steps. For paired-end data, two input files are specified, and 4 output files, 2 for the 'paired' output where both reads survived the processing, and 2 for corresponding 'unpaired' output where a read survived, but the partner read did not.

grep AGATGTGTATAAGAGACAG SRR29754043_2.fastq
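The plain grep above prints every matching line, which can be a lot of output. A count is often enough to judge how widespread the adapter is. The sketch below uses grep -c on a tiny made-up FASTQ file; the function name count_adapter_reads is hypothetical, and it only finds exact forward matches of the adapter.

```shell
count_adapter_reads() {
    # grep -c counts matching lines; each read has its bases on one line.
    # "|| true" keeps the exit status zero when the count is 0.
    grep -c "$1" "$2" || true
}

# Made-up reads: only the first one contains the Nextera adapter sequence.
demo=$(mktemp)
cat > "$demo" <<'EOF'
@read1
ACGTAGATGTGTATAAGAGACAGTTGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIII
@read2
GGCATTACGGATCCATTAGCAAGTCCA
+
IIIIIIIIIIIIIIIIIIIIIIIIIII
EOF

count_adapter_reads AGATGTGTATAAGAGACAG "$demo"   # prints 1
rm -f "$demo"
```

Dividing this count by the total number of reads gives a rough adapter rate to compare before and after trimming.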

java -jar trimmomatic-0.39.jar PE S28_R1.gz S28_R2.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36
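When there are several samples, the same Trimmomatic call has to be repeated with different file names. The sketch below is a dry run that only prints the command for each sample prefix. The function name trim_cmd, the second sample S29, and the per-sample output names are assumptions made for illustration; the options are the ones used above.

```shell
# Hypothetical helper: print the paired-end Trimmomatic command for one sample.
trim_cmd() {
    s=$1
    echo "java -jar trimmomatic-0.39.jar PE ${s}_R1.gz ${s}_R2.gz" \
        "${s}_forward_paired.fq.gz ${s}_forward_unpaired.fq.gz" \
        "${s}_reverse_paired.fq.gz ${s}_reverse_unpaired.fq.gz" \
        "ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36"
}

# Dry run: print one command line per sample without executing anything.
for sample in S28 S29; do
    trim_cmd "$sample"
done
```

Printing first makes it easy to check the two inputs and four outputs before anything is run; the printed lines can then be pasted into the terminal.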

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. More about Bowtie2 here: https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml.

There are multiple ways to download and install Bowtie2. The two most common ways:

a) Install bowtie2 using conda

b) Install bowtie2 using a package manager (recommended!)

c) Check installation

conda install -c bioconda bowtie2

sudo apt update

sudo apt install bowtie2

bowtie2 --help

Sometimes we have host DNA and need to remove it before doing any downstream analysis. For example, if I am looking at the mouse gut microbiome, I do not need mouse DNA; however, during DNA extraction some of the DNA from the mouse intestinal epithelium will be sequenced as well. To remove it we are going to use Bowtie2. Bowtie2 is a reference-based aligner with wide application in RNA and whole-genome (WG) sequencing. To remove host DNA, run the code below. We will follow the instructions provided here:

https://www.metagenomics.wiki/tools/short-read/remove-host-sequences

To remove the host genome, we first need to download it; search for the host genome in NCBI.
After downloading the host genome (here we use tomato because the samples are from the tomato microbiome), we need to index it using the bowtie2-build command below.
The bowtie2-build command will generate several index files. Then we use bowtie2 to remove the host reads.

Useful links:

Building a reference (e.g., multiple E. coli): https://www.metagenomics.wiki/tools/bowtie2/index

https://open.bioqueue.org/home/knowledge/showKnowledge/sig/bowtie2

Archives indexes: https://bowtie-bio.sourceforge.net/bowtie2/news.shtml

bowtie2 -x 0_Host_genome/Solanaceae/Solanaceae -1 S28_R1.gz -2 S28_R2.gz --un-conc-gz S28_removed_Solanaceae -S Mapped_and_unmapped_Solanaceae.sam -p 20 --very-sensitive-local

bowtie2-build Solanaceae.fna Solanaceae --threads 20
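bowtie2-build has to finish before the bowtie2 alignment can use the index. For a standard (small) index it writes six files ending in .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2 and .rev.2.bt2; very large genomes get .bt2l files instead, which this sketch does not check. The function name check_bt2_index is hypothetical.

```shell
# Hypothetical helper: confirm that all six small-index files exist for a prefix.
check_bt2_index() {
    for suffix in 1 2 3 4 rev.1 rev.2; do
        if [ ! -f "$1.$suffix.bt2" ]; then
            echo "missing: $1.$suffix.bt2"
            return 1
        fi
    done
    echo "index complete: $1"
}

# The prefix is the value passed to bowtie2 -x in the command above.
check_bt2_index 0_Host_genome/Solanaceae/Solanaceae || echo "run bowtie2-build first"
```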

MEGAHIT: there are different ways to install MEGAHIT. The easiest way is using conda. On my WSL system I did not have conda (and probably most of you do not either), so I had to install conda first.

A) Install conda (again, there are different ways to install conda): first go to https://repo.anaconda.com/archive/ and, based on your system, copy the link for the appropriate installer. For example, for my WSL "Anaconda3-2024.06-1-Linux-x86_64.sh" is the appropriate package.

1. Open your terminal:

WSL: wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh

2. Run the installation script: bash Anaconda[YOUR VERSION].sh

3. Read the license agreement and follow the prompts to accept.

4. Close WSL and open it again

5. Type conda --version to see if the installation was successful.

6. Now use conda to install megahit.

There are many software packages for genome assembly. MEGAHIT is an ultra-fast and memory-efficient NGS assembler. It is optimized for metagenomes, but also works well on generic single genome assembly (small or mammalian size) and single-cell assembly.

To run MEGAHIT:

Input: metagenomics sample as paired-end fastq files _R1 and _R2

Important: Make sure your files are fastq files (You can use gunzip to unzip your files in terminal)!

The easiest way to bin contigs is to use MetaBAT2 on Galaxy! To do so, take your final contigs file from megahit and upload it at: https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/metabat2/metabat2/2.15+galaxy1

More info: https://bitbucket.org/berkeleylab/metabat/src/master/

Or install MetaBAT2 using conda on Linux:

Put these three files in one folder:

final.contigs: from megahit

S28_sorted.bai (from day 1 output from bowtie)

S28_sorted.bam (from day 1 output from bowtie)

The above code will generate a file named depth.txt. After generating this file, run the second code:

WSL cannot run this code due to memory issues, so I had to use Linux for this!

Here is the code in Linux!

mv S28_removed_Solanaceae.1 S28_host_removed_R1.fastq.gz

mv S28_removed_Solanaceae.2 S28_host_removed_R2.fastq.gz

samtools index S28_sorted.bam S28_sorted.bai

samtools sort S28_mapped_and_unmapped.bam -o S28_sorted.bam

megahit -1 SAMPLE_R1.fastq -2 SAMPLE_R2.fastq -t 20 -o megahit_result

conda install bioconda::megahit

jgi_summarize_bam_contig_depths --outputDepth depth.txt *.bam

metabat2 -i final.contigs.fa -a depth.txt -o bins_dir/bin
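Before binning, it can help to check how many contigs MEGAHIT produced and how many bases they cover. The sketch below counts FASTA headers and sums sequence lengths with awk. The function name contig_stats and the demo contigs are made up; on real data the file would be final.contigs.fa.

```shell
contig_stats() {
    awk '
        /^>/ { n++; next }          # header line starts a new contig
        { total += length($0) }     # sequence lines add to the total length
        END { printf "contigs=%d total_bp=%d\n", n, total }
    ' "$1"
}

# Tiny made-up assembly: two contigs, the first wrapped over two lines.
demo=$(mktemp)
cat > "$demo" <<'EOF'
>contig_1
ACGTACGTAC
GGTTAA
>contig_2
TTGACA
EOF

contig_stats "$demo"   # prints contigs=2 total_bp=22
rm -f "$demo"
```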

A. Check to see if you have brew by typing either brew or brew --help (and hit Enter).

If you do not have brew (command not found), then you have to install brew first. Please see the section on installing fastqc for Mac for how to install brew. If you have brew, go to the next step.

B) Use brew to install Bowtie 2

C) Test your installation by running:

brew --help

brew install bowtie2

bowtie2

conda --version

conda install bioconda::megahit

1. Install conda (make sure to install the pkg installer, not the bash script! The link for the pkg is provided below)
https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.pkg

2. Check to see if your installation was successful by typing conda --version in your terminal.

(If asked whether to proceed, select y.)

3. Now install megahit using:

conda --version

conda install bioconda/label/cf201901::metabat2

This software only works on Linux and did not work on WSL and Mac (for me)!

You can install the software in multiple ways!

1- Conda (both Mac and WSL): https://anaconda.org/bioconda/metabat2

A. Change the file names for downstream analyses.

B. Change SAM file to BAM file.

C. Change BAM to sorted BAM.

D. Create a BAM index file.

samtools view -bS Mapped_and_unmapped_Solanaceae.sam > S28_mapped_and_unmapped.bam
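The SAM file holds both mapped and unmapped reads, and the second column (FLAG) records which is which: bit 4 is set when a read is unmapped. The sketch below counts such reads in a small made-up SAM file using plain awk. The function name count_unmapped is hypothetical. Note that this count is related to, but not the same as, the read pairs written by --un-conc-gz, which are pairs that failed to align concordantly.

```shell
count_unmapped() {
    awk -F '\t' '
        /^@/ { next }                           # skip SAM header lines
        int($2 / 4) % 2 == 1 { unmapped++ }     # bit 0x4 set: read is unmapped
        END { print unmapped + 0 }
    ' "$1"
}

# Made-up SAM records: pair r1 is mapped (flags 99/147), pair r2 is not (77/141).
demo=$(mktemp)
printf '@HD\tVN:1.0\tSO:unsorted\n'                           >  "$demo"
printf 'r1\t99\tchr1\t100\t42\t4M\t=\t200\t104\tACGT\tIIII\n'  >> "$demo"
printf 'r1\t147\tchr1\t200\t42\t4M\t=\t100\t-104\tACGT\tIIII\n' >> "$demo"
printf 'r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'             >> "$demo"
printf 'r2\t141\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII\n'            >> "$demo"

count_unmapped "$demo"   # prints 2
rm -f "$demo"
```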

conda install bioconda::samtools

conda install bioconda/label/cf201901::samtools

0_Data:

https://www.dropbox.com/scl/fo/xt23fj53v9eg8m9okvndy/AMkpQFhb6eOJ9B7_puzSfwA?rlkey=rxr6o5tn3um4b2d65stnc20et&st=53uzp1kn&dl=0

1_fastqc_results:

https://www.dropbox.com/scl/fo/deksjix5tt1ssoz5r446v/AM0Ta4zs-1iSq5GJ-grwJPo?rlkey=wo66gzt9jhvlwnwdr3xcmjsz1&st=i9tizve0&dl=0

2_Trimmomatic_removed_adaptor:

https://www.dropbox.com/scl/fo/mdxvi95c41knyry5na04l/ANGW2yk_xpwk57a3hD8qUgA?rlkey=kioqgrcqm2lsxmr3ugmyz9x7i&st=wf64wbfd&dl=0

3_Bowtie:

https://www.dropbox.com/scl/fo/a2b8noinc0sr6oa6icpdh/ALYTu_iOqg-tDekTwivWBqI?rlkey=5ekzstecfa2esawsrq5salejv&st=lb9vr5gd&dl=0

4_Sam_tools:

https://www.dropbox.com/scl/fo/clme5na8x92ao2jjlaw2p/AKgQFiO2jsJVqhS4H8oCkHg?rlkey=pely4stzvx564pe65urkkv7zx&st=atzc8brm&dl=0

5_Assembly_megahit:

https://www.dropbox.com/scl/fo/dtr14z6ijzcbndwcwum0t/AN3WiJh1lSAfGqXBROQSU6s?rlkey=jjdvfl8q5hd2xr6tsu50vofor&st=kmjuoju3&dl=0

6_Binning:

https://www.dropbox.com/scl/fo/s1fpvta4te3r85l9uffho/AJvHs-EV8diNP78dKqJpe3c?rlkey=o2b0ew12nuymln2wons5ymrt0&st=ubcgg97h&dl=0

7-CheckM

https://www.dropbox.com/scl/fo/c18a3277lmak0p0fdklet/AH7N4k4P6rrkQaADwFXMgJ4?rlkey=tbato504aw7byh56udg8lmo2h&st=fe3nyyzq&dl=0

8_Gene_Prediction

https://www.dropbox.com/scl/fo/j14zw3pynhehfsd0yvxy8/ABeOY9F5Fh8-g0GxYBR2RtI?rlkey=958zv9jeeyn7cqsnq4gfrxbf0&st=cimhhnd5&dl=0

9_Funtional_anotation_prokka

https://www.dropbox.com/scl/fo/szjbdlxu89y60o6hvu8xi/ANwCrYDtJ5KCjP_4sVj5ENM?rlkey=x9x2g5umixb57ysf0bzx9eywc&st=89xwi6nb&dl=0

11_Quantify_genes

https://www.dropbox.com/scl/fo/5mytb3ra1vbino9etnwzt/AHecX_-rQvNjrL3_xcDg7so?rlkey=4hk7szw33cm6vzead1opwflkv&st=cg0u405g&dl=0

brew install brewsci/bio/prokka

conda install -c conda-forge -c bioconda -c defaults prokka

brew install hmmer

brew install prodigal

brew install pplacer

pip3 install numpy

pip3 install matplotlib

pip3 install pysam

pip3 install checkm-genome

Also download the reference file and put it into your path

https://zenodo.org/records/7401545#.Y44ymHbMJD8

conda create -n checkm python=3.9

conda activate checkm

conda install -c bioconda numpy matplotlib pysam

conda install -c bioconda hmmer prodigal pplacer

pip3 install checkm-genome

Also download the reference file and put it into your path (export CHECKM_DATA_PATH=/path/to/my_checkm_data)

https://zenodo.org/records/7401545#.Y44ymHbMJD8

pip install HTSeq

Some of the binned contigs are microbial genomes or might be viral genomes, and some are just fragments of one or multiple genomes. The idea is then to evaluate the completeness and contamination of those bins to assess their quality, and to only consider genomes that are >50% complete and <10% contaminated.

checkm lineage_wf -x fa bins_dir/ METAG_checkm/ --threads 16 -f METAG-checkm.tsv --tab_table
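The >50% completeness and <10% contamination rule can be applied to the tab-separated table written by the command above. The sketch below assumes the table has header columns named "Completeness" and "Contamination" and that the bin name is in the first column. The function name filter_bins is hypothetical, and the demo table is a made-up, shortened version of a CheckM table.

```shell
filter_bins() {
    awk -F '\t' '
        NR == 1 {
            # Locate the columns by header name rather than by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "Completeness")  comp = i
                if ($i == "Contamination") cont = i
            }
            next
        }
        $comp > 50 && $cont < 10 { print $1 }
    ' "$1"
}

# Made-up values: only bin.1 passes both thresholds.
demo=$(mktemp)
printf 'Bin Id\tCompleteness\tContamination\n' >  "$demo"
printf 'bin.1\t95.20\t1.50\n'                 >> "$demo"
printf 'bin.2\t42.10\t0.00\n'                 >> "$demo"
printf 'bin.3\t88.00\t14.30\n'                >> "$demo"

filter_bins "$demo"   # prints bin.1
rm -f "$demo"
```

On real output the call would be filter_bins METAG-checkm.tsv; check the header of your own table first, since the sketch relies on those two column names.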

prodigal -i my.metagenome.fna -o my.genes -a my.proteins.faa -p meta

prokka --outdir mydir --prefix mygenome final.contigs13.fa

Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.

htseq-count -r pos -t CDS -f bam S13.map.sorted.bam S13.gtf > S13.count

htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with gene