Reconstructing genomes from metagenomes
March 22 2019
Shotgun sequencing is a method in which nucleic acid is collected from an environmental sample and all of it is sequenced. This technique costs slightly more than amplicon sequencing, which sequences only the 16S regions from the environmental sample; for help estimating costs, here are two sources: IMR pricing and CGB pricing. The advantage of shotgun sequencing is that it gives both the taxonomic and the functional profiles of the community, while 16S provides only the taxonomic profile.
If you have amplicon sequences and want to predict the functional profiles of the community, PICRUSt (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) can be applied. This tool predicts functional profiles from 16S data by predicting the gene families of the genomes identified from the 16S sequences, which is a great start, or an option when funding for the project is limited. More on how PICRUSt works can be found here. It’s good to note that database bias is a concern, since only a small fraction of bacterial and archaeal sequences have been identified and deposited in databases.
This blog focuses on shotgun sequenced metagenomes, which contain complete genome information. With these sequences, we can reconstruct genomes from the short sequenced reads; these reconstructed genomes are called metagenome assembled genomes (MAGs). The term distinguishes these “reconstructed” genomes from whole genome sequences of monocultures, because it’s really hard to separate different species when the sequence coverage of a genome in the original metagenome is low. However, there are now more tools for strain deconvolution of MAGs, which make it possible to say with more confidence that a reconstructed genome is likely a whole genome.
What can you do with reconstructed metagenome assembled genomes (MAGs)?
- Relate taxa to function
- Run a pangenome analysis of the genera from the environmental sample or the environment
- Identify new genes associated with the species of interest from that environment
- Identify and deposit novel genomes to a database, helping decrease database bias
The steps I describe below are reported in several papers; different pipelines likely include similar steps but with different tools. I should disclose that the bioinformatic programs under each step are ones I have used and validated in this paper. To stay up to date on the programs used for each step and their performance, another source I use is the results from the CAMI challenge. The Critical Assessment of Metagenome Interpretation (CAMI) challenge is a community-driven initiative to establish standards, evaluation parameters, and performance standards to help with software selection.
Another note: this pipeline has mainly been used for Illumina paired-end and Ion Torrent reads.
Step 1 – Quality Control
Quality control removes adaptor sequences and deletes low-quality sequences, and there are several ways to do it. This step is important because low-quality sequences can introduce bias into your analysis. Programs available to help with this step include Trimmomatic and Prinseq for trimming and filtering, and FastQC for checking read quality. The output is a set of high-quality reads with adaptor sequences removed.
Note: If you want to merge the paired-end reads, you can overlap the forward and reverse reads; this step can be skipped, since most assemblers accept both forward and reverse reads.
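To make the idea of quality trimming concrete, here is a minimal sketch of a sliding-window trim, in the spirit of what trimming tools do but much simplified. The function names, the window size of 4, and the quality cutoff of 20 are assumptions chosen for illustration, and the quality string is assumed to be Phred+33 encoded; for real data, use one of the programs above.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def sliding_window_trim(seq, qual, window=4, min_avg=20):
    """Cut the read at the first window whose mean quality drops below min_avg.

    Simplified illustration only: real trimmers apply extra rules
    (adaptor clipping, leading/trailing trimming, minimum length).
    """
    scores = phred_scores(qual)
    for start in range(0, len(scores) - window + 1):
        win = scores[start:start + window]
        if sum(win) / window < min_avg:
            return seq[:start], qual[:start]
    # No low-quality window found (or the read is shorter than the window).
    return seq, qual


# Example: the read ends in four very low-quality bases ('#' is Phred 2).
trimmed_seq, trimmed_qual = sliding_window_trim("ACGTACGTAC", "IIIIII####")
print(trimmed_seq, trimmed_qual)
```

After trimming, reads that have become very short would normally be dropped as well; that filter is left out here to keep the sketch small.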
Step 2 – Assembly
The high-quality reads output from Step 1 can be assembled into longer sequences called contigs. Input: all metagenomes from a project, all metagenomes from a treatment/host, or all metagenomes that are replicates first need to be concatenated into one file.
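Concatenating the read files can be done with a few lines of Python. This is only a sketch: the function name and the file names in the usage note are hypothetical. Files are copied in binary mode, so it works for plain FASTQ files; gzip-compressed files can also be joined this way, as long as all inputs are compressed.

```python
import shutil


def concatenate_reads(input_paths, output_path):
    """Append each input file to output_path, in the order given."""
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                # Stream the file in chunks so large read files are not loaded into memory.
                shutil.copyfileobj(src, out)
```

For paired-end data, the forward files and the reverse files would be concatenated separately and in the same sample order, for example `concatenate_reads(["s1_R1.fq", "s2_R1.fq"], "all_R1.fq")` followed by the same call for the `R2` files, so that read pairs stay in sync.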
The concatenated reads can be assembled using one of the assemblers metaSPAdes, IDBA-UD, or MEGAHIT. The output of the assembly step is a file of contigs, which are continuous sequences produced by joining reads together. If you are unsure which software to choose or how good the assembly is, assembly statistics from QUAST or MetaQUAST can be useful. Some common statistics looked at are:
- N50 (a length) and L50 (a contig count)
- Number of contigs
- Longest contig
Another statistic is the number of reads that were assembled into the final contigs; you can calculate this by aligning the reads back to the contigs with bowtie2 or BWA. It's really hard to look at these numbers and decide which assembler performed better, so I generally pick the assembler that generates lots of contigs with a high N50 length and the highest number of reads aligned back to the contigs. Start with that assembly; if the data doesn't make sense downstream, the contigs generated by the other assemblers at this step are there to look into as well.
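For readers who have not met N50 and L50 before, here is a small sketch of how they are computed from a list of contig lengths. The function name is my own; QUAST reports these values for you, so this is only meant to show what the numbers mean. N50 is the length of the contig at which the running total, going from longest to shortest, first reaches half of the assembly size, and L50 is how many contigs it took to get there.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is N50; 'count' is the number of contigs needed (L50).
            return length, count
    raise ValueError("contig_lengths is empty")


# Total length is 290, so half is 145; the two longest contigs (80 + 70) reach it.
print(n50_l50([80, 70, 50, 40, 30, 20]))  # (70, 2)
```

A higher N50 together with a lower L50 means that more of the assembly sits in long contigs.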
Step 3 – Binning
Binning is a step where sequences are grouped together based on genome signatures such as the k-mer profile of each contig, contig coverage, or GC content. Binning tools generally apply one or more genome signatures to improve the quality of the bins generated. Some binning programs are MetaBAT, GroopM, and CONCOCT; both MetaBAT and CONCOCT combine two genome signatures, k-mer frequencies and contig coverage.
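Two of the genome signatures mentioned above, GC content and k-mer frequencies, are easy to compute for a single contig. The sketch below uses hypothetical function names and k = 4 (tetranucleotides) as an assumed default. It counts k-mers on one strand only, whereas binning tools typically merge each k-mer with its reverse complement, so treat it as an illustration of the idea and not as what any particular tool does.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of A/C/G/T bases that are G or C (ambiguous bases are ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def kmer_profile(seq, k=4):
    """Relative frequency of every possible k-mer over the alphabet ACGT."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made purely of A, C, G, T contribute to the total.
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


profile = kmer_profile("ACGTACGT")
print(gc_content("ACGTACGT"), profile["ACGT"])
```

Contigs from the same genome tend to have similar profiles, which is what lets a binning tool cluster them together.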
The common steps in these binning tools are:
- Map the reads from each sample to the cross-assembled/co-assembled contigs (output from Step 2) using bowtie2. This generates a SAM/BAM file that records which read mapped to which contig, and at what position.
- Sort the output BAM files, which can be done using samtools. The sorted BAM files are the input for these binning programs.
- Run the binning command to group similar contigs together. The binning tool clusters contigs that have similar genome signatures into bins.
The result from these binning programs is generally a set of bins, each with a list of the contig sequences that belong to it.
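The mapping, sorting, and binning steps above can be sketched as a small Python helper that builds the command lines without running them. The helper name, the sample and file names, and the output prefixes are all hypothetical, and MetaBAT is used only as one example binner; the flags shown are the basic ones for bowtie2, samtools, and MetaBAT 2, so check each tool's help for the version you have installed.

```python
def binning_commands(contigs, samples, index_prefix="contigs_index",
                     depth_file="depth.txt", bin_prefix="bins/bin"):
    """Build the mapping, sorting, and binning commands as argument lists.

    samples maps a sample name to its (forward, reverse) read files.
    """
    # Index the co-assembled contigs once.
    commands = [["bowtie2-build", contigs, index_prefix]]
    sorted_bams = []
    for name, (forward, reverse) in samples.items():
        sam = f"{name}.sam"
        bam = f"{name}.sorted.bam"
        # Map the paired reads of this sample to the contigs.
        commands.append(["bowtie2", "-x", index_prefix,
                         "-1", forward, "-2", reverse, "-S", sam])
        # Sort (and index) the alignments; the sorted BAM feeds the binner.
        commands.append(["samtools", "sort", "-o", bam, sam])
        commands.append(["samtools", "index", bam])
        sorted_bams.append(bam)
    # Per-contig coverage table (this helper script ships with MetaBAT).
    commands.append(["jgi_summarize_bam_contig_depths",
                     "--outputDepth", depth_file] + sorted_bams)
    # Cluster the contigs into bins.
    commands.append(["metabat2", "-i", contigs, "-a", depth_file, "-o", bin_prefix])
    return commands


for cmd in binning_commands("contigs.fa", {"sample1": ("s1_R1.fq", "s1_R2.fq")}):
    print(" ".join(cmd))
```

Printing the commands first makes it easy to review them; once they look right, each list could be passed to `subprocess.run(cmd, check=True)`.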
Step 4 – Bins evaluation
The bins can be evaluated using CheckM. This program evaluates each bin for:
- Completeness – whether the bin has all the bacterial single-copy core genes within a phylogenetic lineage,
- Contamination – single-copy core genes that don’t belong to the phylogenetic lineage,
- Strain heterogeneity – a set of clusters within the bin that show differences in genome characteristics.
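To give a feel for how single-copy marker genes turn into completeness and contamination percentages, here is a deliberately simplified sketch. It assumes you already have, for one bin, the number of copies found for each expected marker gene (the marker names below are made up). CheckM itself uses lineage-specific markers grouped into sets, so its numbers will differ from this flat calculation.

```python
def completeness_contamination(marker_copies):
    """Estimate completeness and contamination (in percent) for one bin.

    marker_copies maps each expected single-copy marker gene to the number
    of copies found in the bin. Simplified: every marker is weighted equally.
    """
    n_markers = len(marker_copies)
    if n_markers == 0:
        raise ValueError("no marker genes given")
    # A marker counts towards completeness if it is found at least once.
    present = sum(1 for copies in marker_copies.values() if copies >= 1)
    # Every copy beyond the first suggests sequence from another genome.
    extra = sum(copies - 1 for copies in marker_copies.values() if copies > 1)
    return 100 * present / n_markers, 100 * extra / n_markers


# Hypothetical bin: three of four markers found, one of them twice.
print(completeness_contamination({"m1": 1, "m2": 1, "m3": 2, "m4": 0}))  # (75.0, 25.0)
```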
CheckM also provides other features to assess the assembled MAGs: generating lineage-specific and taxonomic-specific marker sets, other bin exploration commands, and additional scripts to look at coverage, examine unbinned data, and identify ribosomal small subunits in each bin.
Step 5 – Refining the bins
Bins that are close to complete (> 90 %), have low contamination (< 10 %), and have high strain heterogeneity can be further refined. Here are a couple of tools to help with refining bins: RefineM and dRep.
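Picking out the bins that meet the completeness and contamination thresholds is a simple filter once the evaluation results are in a table. The sketch below assumes each bin is a dictionary with `name`, `completeness`, and `contamination` keys; those field names and the example values are hypothetical, so adapt them to however you load your CheckM output. The strain heterogeneity criterion is left out because the text does not give a cutoff for it.

```python
def select_bins_for_refinement(bins, min_completeness=90.0, max_contamination=10.0):
    """Return the names of bins above the completeness and below the contamination cutoff."""
    return [
        b["name"]
        for b in bins
        if b["completeness"] > min_completeness and b["contamination"] < max_contamination
    ]


# Hypothetical evaluation results for three bins.
example_bins = [
    {"name": "bin.1", "completeness": 95.2, "contamination": 3.1},
    {"name": "bin.2", "completeness": 71.0, "contamination": 1.2},
    {"name": "bin.3", "completeness": 98.4, "contamination": 14.7},
]
print(select_bins_for_refinement(example_bins))  # ['bin.1']
```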
The resulting bins are called metagenome assembled genomes (MAGs) or genomes, depending on the quality of the sequences and the completeness of the genome. To assess the completeness of the reconstructed genomes and MAGs, you can use tools such as CheckM or BUSCO.
Step 6 – Visualization of the MAGs/genomes
The trouble is that most of the results so far are just files and statistics. Visualizing bin data is essential for explaining binning and the data. There are a couple of tools available to help with this; I prefer Anvi’o.
- Anvi’o – We have set up Anvi’o on Jetstream to run its pipelines (click here), along with instructions on how to run Anvi’o on Jetstream (click here).
- VizBin – This platform requires the fasta files (contigs) to be imported, and the software visualizes the data as a 2D scatterplot (click here).
- R plots – R is a great program for plotting data in really innovative ways, including heatmaps, bar plots, and PCA. Coming up!
Here are some research papers that have run similar pipelines to reconstruct metagenome assembled genomes, and identify novel genomes
Feel free to contact NCGAS at firstname.lastname@example.org if you have any more questions.