What is Windows Subsystem for Linux (WSL)?

The Windows Subsystem for Linux (WSL) is a feature of the Windows operating system that enables you to run a Linux file system, along with Linux command-line tools and GUI apps, directly on Windows, alongside your traditional Windows desktop and apps.
Prerequisites
You must be running Windows 10 or Windows 11 to use the commands below.
Command
Open PowerShell or Windows Command Prompt in administrator mode by right-clicking and selecting "Run as administrator", enter the wsl --install command, then restart your machine.
For more informatio see: https://learn.microsoft.com/en-us/windows/wsl/install

Linux, with its robust capabilities, is a preferred choice for many in the scientific community. Installing FastQC on a Linux system is straightforward. Follow these steps to ensure a smooth installation: Update Your System: Start by updating your system’s package list to have the latest software references. Open your terminal and enter:
Install FastQC: With your package list updated, proceed to install FastQC. In the terminal, type:
This command initiates the download and installation of FastQC.
Verification: To confirm that FastQC has been installed correctly, run the following in your terminal:
This should display the help information and version number, indicating a successful installation.By completing these steps, FastQC will be installed and ready for use on your Linux system, setting the stage for high-quality sequence data analysis.

wsl --install

sudo apt update

sudo apt -y install fastqc

fastqc--help

Home

This genome-centric metagenomics workshop will teach you how to obtain provisional whole genomes of individual populations from a mixed microbial community using metagenomics. The workshop has three parts:



Part I: Participants will learn to install required software, check raw sequence read quality, perform read quality control, and trim their sequence data. Participants will also learn how to upload and dowload sequence files from online archive databases.




Part II: In this part you learn how to assemble and annotate contigs, bin contigs into provisional whole genome sequences.



Part III:  Participants will learn to how to extract taxonomic information, functional annotations, and pathway information for each binned genomes.


Workshop Topic Highlights

 


 

Software installation

In this section you would learn how to install several metagenomics software including used for Metagenomics assembly, binning, gene prediction and annotation and strain-level analysis.
 Install WSL Fastqc (WSL, Linux ubuntu)
First install brew by either going to https://brew.sh/ and follow the instruction on how to install brew on your system.

Or copy and past the following command in your terminal:  

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"


Once you install brew then run the following code to install fastqc (will take up to 5 mins to install fastqc):

brew install fastqc

After installing fastqc test your fastqc by running: fastqc --help
Fastqc (Mac)
Obtaining sequence data

You can either bring your sequence of intrest or dowload the sequence from online resources. One of the most useful online resources to download archive sequence files is using NCBI SRA. To do so:






sudo apt install sra-toolkit

prefetch SRR29754043

fastq-dump --split-files SRR29754043   -O your/output/file

cd SRR29754043

cd your/output/folder


fastqc  *fastq   -O SRA_fastq

Trimming sequence data

There are a variety of software for Trimming (e.g, removing adapter). Adapter sequences should be removed from reads because they interfere with downstream analyses, such as alignment of reads to a reference. SRA include few changes to the sequnce files, that are not compatible with some of the analysis. That is why we are going to use these sequences throughtout the workshop. However, for triming you can use your SRA sequences for pratice.









It works with FASTQ (using phred + 33 or phred + 64 quality scores, depending on the Illumina pipeline used), either uncompressed or gzipp'ed FASTQ. Use of gzip format is determined based on the .gz extension.


For single-ended data, one input and one output file are specified, plus the processing steps. For paired-end data, two input files are specified, and 4 output files, 2 for the 'paired' output where both reads survived the processing, and 2 for corresponding 'unpaired' output where a read survived, but the partner read did not.

grep AGATGTGTATAAGAGACAG  SRR29754043_2.fastq

java -jar trimmomatic-0.39.jar PE S28_R1.gz S28_R2.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36

Removing host contamination
Bowtie2 (WSL)

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.  More about Bowtie2 here: https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml.

There are multiple ways to download and isntall Bowtie2. The two most common ways:

a) Install bowtie2 using conda





b) Install bowtie2 using a package manager (recommended!)







C) Check installation

conda install -c bioconda bowtie2

sudo apt update

sudo apt install bowtie2

bowtie2 --help

Sometimes we have host DNA and we need to remove the host DNA before doing any downstream analysis. For example, if I am looking at mice gut microbiome, I do not need mice DNA, however, during the DNA extractions some of the mice gut intestinal epithelial DNA will be sequnece as well. To remove that we are going to use Bowtie2. Bowtie2 is a refrence base alinement and has huge application in RNA and WG sequencing. To remove host DNA run below code. We will follow the instruction provided here:

https://www.metagenomics.wiki/tools/short-read/remove-host-sequences

 






 












 

 









Useful links:

Building refrence (e.g, Multilple E.coli): https://www.metagenomics.wiki/tools/bowtie2/index

https://open.bioqueue.org/home/knowledge/showKnowledge/sig/bowtie2

Archives indexes: https://bowtie-bio.sourceforge.net/bowtie2/news.shtml


bowtie2  -x 0_Host_genome/Solanaceae/Solanaceae  -1 S28_R1.gz  -2 S28_R2.gz  --un-conc-gz   S28_removed_Solanaceae  -S Mapped_and_unmapped_Solanaceae.sam  -p 20 --very-sensitive-local

bowtie2-build Solanaceae.fna  Solanaceae --threads 20

Day 2- Assembly
MEGAHIT(WSL)

MEGAHIT: there are different ways to install MEGAHIT. The esiest way is using conda. On WSL system I did not have conda (and probably most of you also do not) so I had to install conda first.

A) install conda (again diffent ways to install conda): first go to https://repo.anaconda.com/archive/ and based on your system download (copy the link) the appropriate conda. For example for my WSL  "Anaconda3-2024.06-1-Linux-x86_64.sh" is the appropriate package.

1. Open your  terminal:

WSL:   wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh

2. Run the installation script: bash Anaconda[YOUR VERSION].sh

3. Read the license agreement and follow the prompts to accept.

4. Close WSL and open it again

5. Type conda --version to see if the isntallation was successful.










6. Now use conda to install megahit.















To run MEGAHIT:


Input: metagenomics sample as paired-end fastq files _R1 and _R2

 

Important: Make sure your files are fastq files (You can use gunzip to unzip your files in terminal)!

























Binning

 

 

 

The esiest way to do this is to use galaxy MetaBAT2! To do so use your final_contigs_file from megahit and upload it at: https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/metabat2/metabat2/2.15+galaxy1

More info:https://bitbucket.org/berkeleylab/metabat/src/master/

Or install MetaBAT2 using conda on Linux:

 

Put these three files in one folder:

 

final.contigs: from megahit

S28_sorted.bai (from day 1 output from bowtie)

S28_sorted.bam (from day 1 output from bowtie)

 

 

 

 

 

 

The above code will generate a fill named depth. After generating this file, run the second code:

 

 

 

 

 

 

 

 

WSL due to memory issues cannot run this code thus I have to use Linux for this!

 

 

 

 

 

 

 

 

 

 

 

Here is the code in Linux!

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

mv S1_removed_Solanaceae.1   S1_host_removed_R1.fastq

mv S1_removed_Solanaceae.2   S1_host_removed_R2.fastq

samtools index S28_sorted.bam S28_sorted.bai

samtools sort S28_mapped_and_unmapped.bam -o S28_sorted.bam

megahit -1 SAMPLE_R1.fastq  -2 SAMPLE_R2.fastq  -t 20  -o megahit_result

MetaBat2(mac and wsl)

  conda install bioconda::megahit

 

jgi_summarize_bam_contig_depths --outputDepth depth.txt *.bam

metabat2 -i final.contigs.fa  -a depth.txt -o bins_dir/bin

Bowtie2 (Mac)

A. Check to see if you have brew by typing either brew (and hit enter or brew --help).















If you do not you have brew (command not found) then you have to install brew first. For installing brew please see installing fastqc for Mac section on how to install brew for Mac. If you have brew run next step.


B) Use brew to isntall Bowtie 2









C) Test your installation by running:






brew --help

brew install bowtie2

bowtie2

conda --version

conda install bioconda::megahit

MEGAHIT(Mac)
1. Install conda (make sure to install pkg not bash! I provided the link for pkg below)
https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.pkg


2. Check to see if your installation was successful by typing conda --version in your terminal.






(if asked you for Proceed, select y)


3. Now install megahit using:

conda --version

conda install bioconda/label/cf201901::metabat2

This software only works on Linux and did not work on WSL and Mac (for me)!

You can install the software in multiple ways!

1- Conda (both Max and WSL): https://anaconda.org/bioconda/metabat2





A. Change the file names for downstream analyses.











B. Change SAM file to BAM file.











Change Bam to sorted bam












Creating a BAM index file






Samtools

samtools view -bS Mapped_and_unmapped_Solanaceae.sam  > S28_mapped_and_unmapped.bam

Samtools(WSL-Mac)

conda install bioconda::samtools

conda install bioconda/label/cf201901::samtools

or
Resource


0_Data:



https://www.dropbox.com/scl/fo/xt23fj53v9eg8m9okvndy/AMkpQFhb6eOJ9B7_puzSfwA?rlkey=rxr6o5tn3um4b2d65stnc20et&st=53uzp1kn&dl=0




1_fastqc_results:



https://www.dropbox.com/scl/fo/deksjix5tt1ssoz5r446v/AM0Ta4zs-1iSq5GJ-grwJPo?rlkey=wo66gzt9jhvlwnwdr3xcmjsz1&st=i9tizve0&dl=0



2_Trimmomatic_removed_adaptor:



https://www.dropbox.com/scl/fo/mdxvi95c41knyry5na04l/ANGW2yk_xpwk57a3hD8qUgA?rlkey=kioqgrcqm2lsxmr3ugmyz9x7i&st=wf64wbfd&dl=0




3_Bowtie:



https://www.dropbox.com/scl/fo/a2b8noinc0sr6oa6icpdh/ALYTu_iOqg-tDekTwivWBqI?rlkey=5ekzstecfa2esawsrq5salejv&st=lb9vr5gd&dl=0




4_Sam_tools:



https://www.dropbox.com/scl/fo/clme5na8x92ao2jjlaw2p/AKgQFiO2jsJVqhS4H8oCkHg?rlkey=pely4stzvx564pe65urkkv7zx&st=atzc8brm&dl=0





5_Assembly_megahit:




https://www.dropbox.com/scl/fo/dtr14z6ijzcbndwcwum0t/AN3WiJh1lSAfGqXBROQSU6s?rlkey=jjdvfl8q5hd2xr6tsu50vofor&st=kmjuoju3&dl=0





6_Binning:




https://www.dropbox.com/scl/fo/s1fpvta4te3r85l9uffho/AJvHs-EV8diNP78dKqJpe3c?rlkey=o2b0ew12nuymln2wons5ymrt0&st=ubcgg97h&dl=0



7-CheckM


https://www.dropbox.com/scl/fo/c18a3277lmak0p0fdklet/AH7N4k4P6rrkQaADwFXMgJ4?rlkey=tbato504aw7byh56udg8lmo2h&st=fe3nyyzq&dl=0



8_Gene_Prediction


https://www.dropbox.com/scl/fo/j14zw3pynhehfsd0yvxy8/ABeOY9F5Fh8-g0GxYBR2RtI?rlkey=958zv9jeeyn7cqsnq4gfrxbf0&st=cimhhnd5&dl=0




9_Funtional_anotation_prokka



https://www.dropbox.com/scl/fo/szjbdlxu89y60o6hvu8xi/ANwCrYDtJ5KCjP_4sVj5ENM?rlkey=x9x2g5umixb57ysf0bzx9eywc&st=89xwi6nb&dl=0





11_Quantify_genes






https://www.dropbox.com/scl/fo/5mytb3ra1vbino9etnwzt/AHecX_-rQvNjrL3_xcDg7so?rlkey=4hk7szw33cm6vzead1opwflkv&st=cg0u405g&dl=0







CheckM








Prokka

brew install brewsci/bio/prokka

conda install -c conda-forge -c bioconda -c defaults prokka

brew install hmmer

brew install prodigal

brew install pplacer

pip3 install numpy

pip3 install matplotlib

pip3 install pysam


pip3 install checkm-genome




Also download the reference file and put it into your path


https://zenodo.org/records/7401545#.Y44ymHbMJD8

conda create -n checkm python=3.9

conda activate checkm

conda install -c bioconda numpy matplotlib pysam

conda install -c bioconda hmmer prodigal pplacer

pip3 install checkm-genome





Also download the reference file and put it into your path (export CHECKM_DATA_PATH=/path/to/my_checkm_data)


https://zenodo.org/records/7401545#.Y44ymHbMJD8

HTSeq

pip install HTSeq

pip install HTSeq

CheckM
Some of contigs are microbial genomes or might be viral genomes, and some are just fragments of one or multiple genomes. The idea is then to evaluate the completeness and the contamination of those bins to evaluate their quality and only consider genomes that are >50% complete and <10% contaminated.

checkm lineage_wf -x fa bins_dir/ METAG_checkm/  --threads 16 -f METAG-checkm.tsv --tab_table

Prodigal

prodigal -i my.metagenome.fna -o my.genes -a my.proteins.faa -p meta

Prokka-Funtional annotation

prokka --outdir mydir --prefix mygenome final.contigs13.fa

Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.
HTSeq-Quantif

htseq-count -r pos -t CDS -f bam S13.map.sorted.bam S13.gft > S13.count

htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with gene