Long-read technologies' potential to study the microbial world
July 31 2019
Introduction to long-read technologies
High-quality draft genomes from metagenomes
Single-cell bacterial sequencing
Studying the rare taxa or low coverage genomes
Identifying CRISPR sequences from metagenomes
Microbial eukaryotes —protists
Real-time outbreak surveillance
Long-read, or third-generation, sequencing technologies have shown a lot of potential in genomic studies, overcoming the challenges of studying repetitive sequences and producing more complete genomes. To give you a general idea, long-read technologies include Oxford Nanopore, PacBio, Hi-C, and variations of these methods that produce longer read lengths than the current short-read technologies (Illumina, IonTorrent). Research studies have used either a combination of long-read and short-read technologies or a combination of long-read technologies to generate more complete eukaryotic genomes. If you are looking for more information on these technologies, here is a blogpost (link here) that explains them in more detail, with cost estimates and other considerations. We also have another blogpost (link here) on the potential of these technologies for microbiome studies. This blogpost is an extension of, and an update to, the microbiome blogpost, showcasing how these technologies have since been used in research studies to answer different research questions.
The cost comparison of these technologies has also been updated as of April 2019 and is available here.
Here are some advantages of using any of the long-read technologies over short-read technologies:
Longer reads retain repetitive elements and relatively more information about genome architecture than short reads, providing insights into both the genome and its role in the environment.
Hypervariable regions of the genome can be studied with more sensitivity.
They have been shown to generate assembly-free microbial genomes for small genomes and high-quality draft genomes for larger genomes.
Long-read tech is evolving fast: sequencing preparation, sequencing machines, and bioinformatics tools are constantly being improved to generate higher throughput and lower read error rates. Longer reads compensate for the pitfalls of short reads really well, but there are shortcomings associated with these technologies that must be considered.
The number of genomes reconstructed with long-read tech is not necessarily higher than with short-read tech, but the reconstructed genomes or metagenome-assembled genomes from long-read tech definitely have a higher quality.
Sample complexity and sequencing depth determine the quality of the data, even in long-read tech.
Due to the relatively higher read error rates in long-read tech compared to short-read tech, using bioinformatics tools to correct these errors is an absolutely necessary step. When drawing conclusions from this data, you must consider the potential of these errors to bias your results.
Don’t let these shortcomings scare you away: many studies incorporating these technologies have illustrated interesting developments in the field. In some cases, they have used a combination of short- and long-read technologies to address different research questions.
High-quality draft genomes from metagenomes
If you have worked on this topic with short-read techs before, you must be well aware of the challenges with assembly: the possibility of chimeric contigs, fragmented genomes, and rare genomes that go unstudied due to low coverage. Long-read technologies can overcome these challenges really well.
To start with, some small microbial genomes can be sequenced as one long read, removing the need to assemble and therefore any assembly bias. For others, assembly is still required, but long-read techs can help improve the quality of the draft genome. Here are some studies that used the different long-read technologies:
Hi-C - Hi-C chemistry can overcome some of these challenges by retaining the interactions between the DNA within the cell prior to cell lysis. These reads, when coupled with Illumina reads, can add more evidence to the contigs, allowing longer scaffolds to form. Here are three studies: 1) identifying antibiotic resistance genes (ARGs) from the microbiome (link to paper here), 2) reconstructing metagenome-assembled genomes (MAGs) from the cow rumen microbiome (link to paper here), and 3) studying plasmid-gene interactions (link to paper here). All these papers illustrate that through Hi-C they were able to reconstruct genomes with a relatively higher level of completeness and less contamination. However, all the papers report that during bioinformatic analysis, Hi-C reads are mapped back to the contigs (assembled from Illumina data), and these contigs are then clustered, with each cluster representing sequences that belong to one genome. Chimeric clusters are therefore possible, especially in cases with closely related species or strains, which are hard to differentiate.
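To make the clustering step more concrete, here is a minimal Python sketch of the idea, not the method used in any of the papers above. It assumes the Hi-C reads have already been mapped and summarized as a count of read pairs linking each pair of contigs; the function name `cluster_contigs`, the `min_links` threshold, and the toy counts are all hypothetical. Real tools also normalize for contig length and abundance, which this sketch skips.

```python
from collections import defaultdict


def cluster_contigs(contigs, hic_links, min_links=5):
    """Group contigs into putative genome bins from Hi-C link counts.

    contigs:   list of contig names
    hic_links: dict mapping (contig_a, contig_b) -> number of Hi-C read
               pairs with one mate on each contig
    min_links: hypothetical noise threshold; weaker links are ignored
    """
    contigs = list(contigs)
    parent = {c: c for c in contigs}

    def find(c):
        # Follow parent pointers to the cluster representative,
        # shortening the path as we go.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Merge the clusters of any two contigs joined by enough Hi-C pairs.
    for (a, b), n_pairs in hic_links.items():
        if a != b and n_pairs >= min_links:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = defaultdict(list)
    for c in contigs:
        clusters[find(c)].append(c)
    return sorted(sorted(members) for members in clusters.values())


# Toy data (made up): c1-c3 come from one genome, c4-c5 from another,
# and a weak c3-c4 link stands in for cross-talk between related strains.
contigs = ["c1", "c2", "c3", "c4", "c5"]
links = {("c1", "c2"): 40, ("c2", "c3"): 12, ("c3", "c4"): 2, ("c4", "c5"): 30}

bins_strict = cluster_contigs(contigs, links, min_links=5)
bins_loose = cluster_contigs(contigs, links, min_links=2)
print(bins_strict)
print(bins_loose)
```

With the stricter threshold the two genomes stay separate, while the looser threshold lets the weak c3-c4 link merge everything into one cluster. That is the chimeric-cluster problem in miniature: the closer two strains are, the harder it is to pick a threshold that separates them.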
Oxford Nanopore - The paper by Dr. Ed DeLong’s lab (link to paper here) shows the use of nanopore technology (ONT) in studying viral communities from seawater. With the help of the information from the long reads, they were able to produce assembly-free virus genomes from just 1 µg of DNA sample. The nanopore run produced reads with an average length of 256kbp, which is close to the full length of even the largest bacteriophage genome. Since the genome was kept intact, the team was able to study highly repetitive elements: concatemers (multiple copies of the genome in one long contiguous sequence); direct terminal repeats flanking the ends of the dsDNA sequence (used as one of the markers for identifying a complete genome); and other phage genome structures that could not be explored with short-read sequencing.
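To illustrate how a direct terminal repeat (DTR) can act as a completeness marker, here is a small Python sketch. It is a simplification under stated assumptions, not the pipeline from the paper: it only looks for an exact match between the start and the end of a sequence, whereas real nanopore reads contain errors and would need an alignment-based comparison. The function name, the `min_len` default, and the toy sequences are hypothetical.

```python
def find_direct_terminal_repeat(seq, min_len=20):
    """Return the longest prefix of seq that is repeated exactly at its end.

    Only repeats of at least min_len bases, and at most half the sequence
    length, are considered. Returns "" when no such repeat exists.
    """
    seq = seq.upper()
    longest_possible = len(seq) // 2
    # Try the longest candidate first so the first hit is the longest repeat.
    for k in range(longest_possible, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return seq[:k]
    return ""


# Toy genome (made up): the same 24 bp repeat at both ends of the sequence.
dtr = "ACGTTGCAAGGCTTAACCGGTATC"
body = "TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAA"
complete_read = dtr + body + dtr
truncated_read = dtr + body  # the repeat at the far end is missing

found = find_direct_terminal_repeat(complete_read)
missing = find_direct_terminal_repeat(truncated_read)
print(len(found), repr(missing))
```

A read that carries the repeat at both ends is a candidate full-length genome, while a read that lacks it at one end is more likely a fragment.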
10X Genomics - Dr. Bhatt’s lab used 10X Genomics and Illumina short-read technology to study human gut microbiomes and marine sediment samples (link to paper here). In the paper, they also shed light on why they selected 10X linked-read technology over Illumina’s long-read technology (SLR) for generating the data that would best help them reconstruct quality genomes. They also developed a new method that analyzes Illumina short reads and 10X reads together to reconstruct high-quality draft genomes.
Single-cell bacterial genome sequencing
You can apply whichever combination of technologies best fits your sequencing budget and the research question to be addressed. Here is an example of a paper (link to paper here) that walks through genome assembly for a set of 20 bacterial isolates to test the performance of PacBio+Illumina vs. Nanopore+Illumina. If you are interested in eukaryotes and how combination sequencing is applied, here is a link to another blogpost that walks through the different approaches (click here).
The increase in the number of these projects has produced many high-quality genomes, with some studies more thorough than others. To address this issue, in a recent letter to the editor (link to the letter here), Dr. Mick Watson and Dr. Amanda Warr from the University of Edinburgh, UK wrote about the potential pitfalls of using long-read technologies and how they must be addressed through bioinformatics tools or considered before drawing conclusions from this data. Here is a notable paragraph from the letter:
“To maximize assembly accuracy, it is important to use high-quality, high-coverage sequencing data from one of the long-read technologies. Inclusion of data from multiple technologies can help improve assembly quality. It is important to incorporate multiple rounds of assembly polishing into downstream analyses and to perform additional checks for remaining indels and errors. These additional checks should include alignment of known proteins and cDNA or mRNA sequences against the genome to check for genic indels, manual inspection of genomic alignments and, where necessary, manual fixing of errors that the correction algorithms miss. Assembly quality has a substantial impact on genome and gene annotation, and our work presented here provides further evidence that the field must not only focus on building new tools and improving existing tools for genome correction, but also undertake manual correction and curation where required.”
Their emphasis on manual curation is welcome and really comes down to getting to know the genome you are studying. As researchers want to claim “reference quality” assemblies, automated assembly alone is not adequate.
Studying the rare taxa or genomes that have low coverage depth in metagenomes
The best example for this topic is the soil microbiome. Soil microbiomes have high microbial diversity, which makes them really difficult to study. Generally, studying a soil microbiome requires relatively more samples than studying a human microbiome or a marine environment, in order to 1) get a good representation of the microbial diversity, and 2) get good coverage of the microbial genomes so they can be assembled. Even with a high number of samples (biological replicates), the assembly often still produces short, fragmented contigs with lots of unidentified sequences.
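A quick back-of-the-envelope calculation shows why rare taxa end up with low coverage. The Python sketch below assumes that an organism's share of the sequenced bases equals its share of the DNA in the sample, and that coverage is simply sequenced bases divided by genome size. The function names and all numbers are hypothetical and for illustration only.

```python
def expected_coverage(total_bases, dna_fraction, genome_size):
    """Average depth for one genome in a metagenome.

    total_bases:  bases sequenced across the whole sample
    dna_fraction: share of the sample's DNA from this organism (0 to 1)
    genome_size:  genome length in bases
    """
    return total_bases * dna_fraction / genome_size


def bases_needed(target_coverage, dna_fraction, genome_size):
    """Total bases to sequence so that this genome reaches target_coverage."""
    return target_coverage * genome_size / dna_fraction


# Hypothetical rare soil microbe: 5 Mbp genome making up 0.1% of the DNA.
depth = expected_coverage(total_bases=10e9, dna_fraction=0.001, genome_size=5e6)
required = bases_needed(target_coverage=20, dna_fraction=0.001, genome_size=5e6)
print(f"Depth from a 10 Gbp run: about {depth:.1f}x")
print(f"Bases needed for 20x: about {required / 1e9:.0f} Gbp")
```

Under these assumptions, a 10 Gbp run gives the rare microbe only about 2x depth, and reaching 20x would take roughly ten times as much sequencing. This is why the long tail of a soil community is so expensive to assemble.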
Long-read technology can overcome this challenge, since longer sequences contain enough biological information to be identified and can potentially assemble into longer, less fragmented contigs. A paper from Dr. Banfield’s lab (link to paper here) studied high-diversity terrestrial sediments using both Illumina short-read tech and Illumina long-read tech, and showed that short-read tech did not recover most of the microbes in the microbiome sample. While the reads from long-read tech did identify and provide a better assembly of the abundant organisms, Deltaproteobacteria and Aminicenantes (candidate phylum OP8), many microbial genomes still remained fragmented. In this project, the longer reads could not be assembled into longer contigs due to the low throughput of the run and the complexity of the sample. They overcame this problem by using a synteny-based approach that retains the gene-centric information to determine the genome architecture of the rare species, or the long tail of genomic data. Alternatively, they combined the long reads with the short-read assembly to successfully generate longer scaffolds. This study highlighted the following challenges with using long-read technologies:
The resolution of soil samples did not increase, due to the low throughput of Illumina long-read technology.
Assembling long reads proved challenging, limiting complete genome assembly for large microbial genomes.
Identifying CRISPR sequences from metagenomes
CRISPR-Cas systems have gained a lot of fame as gene-editing tools, but their role in bacterial immunity is still not well understood (link to editorial here). These sequences contain short palindromic repeats that are really difficult to study with short-read techs. In a recent paper from Dr. Yuzhen Ye's lab (link to paper here), they showed the effectiveness of using Illumina long-read technology (SLR) for identifying CRISPR spacer sequences from the human gut microbiome. The reason for this effectiveness is that spacer sequences have been found to be ~50 bp sequences that sit between repeats, which makes them hard to assemble from short reads. The study compared the identification of these spacer sequences between Illumina short-read and SLR methods, and found that spacer identification depended on the assembler, not just the sequencing method. For example, they compared MetaSPAdes and MegaHit assemblies in identifying these sequences. This doesn’t mean you can get away with not assembling the long-read sequences as well: in this study, they assembled the SLR reads to validate that the identified spacers belong to the CRISPR-Cas system. Canu was applied to assemble the long reads, showing that there is potential contamination through the presence of bacteriophage sequences, or the invaders that form protospacers.
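To show why a read needs to span the whole array, here is a minimal Python sketch of spacer extraction. It is not the method from the paper: it assumes the repeat copies are exact and of a known length, while real detection tools tolerate mismatches and infer the repeat length. The function name, the default lengths, and the toy sequences are all hypothetical.

```python
def find_crispr_array(seq, repeat_len=28, min_spacer=20, max_spacer=60,
                      min_repeats=3):
    """Find the first run of evenly spaced exact repeats in seq.

    Returns (repeat, spacers), or None if no array with at least
    min_repeats copies of a repeat is found.
    """
    seq = seq.upper()
    for start in range(len(seq) - repeat_len + 1):
        repeat = seq[start:start + repeat_len]
        spacers = []
        end = start + repeat_len  # index just past the current repeat copy
        while True:
            # The next copy must begin between min_spacer and max_spacer
            # bases after the current copy ends.
            nxt = seq.find(repeat, end + min_spacer,
                           end + max_spacer + repeat_len)
            if nxt == -1:
                break
            spacers.append(seq[end:nxt])
            end = nxt + repeat_len
        if len(spacers) + 1 >= min_repeats:
            return repeat, spacers
    return None


# Toy array (made up): three copies of one repeat around two spacers.
repeat = "GTTCCAATAAGACTAAAATAGAATTGAAAG"
s1 = "ACGTACGGTCAGTCCATGACCTGA"
s2 = "TTGACCGGTAACGTCAGGCTAAGCTC"
seq = "CCGGCCGGTT" + repeat + s1 + repeat + s2 + repeat + "GGCCAATT"

result = find_crispr_array(seq, repeat_len=len(repeat))
print(result)
```

The toy array is about 150 bp long, so a single short read would capture at most a couple of repeat copies, and the identical repeats give an assembler no way to order the spacers. A read that covers the whole array sidesteps that problem.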
Microbial eukaryotes —protists
Hi-C not only allows the much longer chromosomes of protists to be scaffolded and assembled (many protists have genomes much larger than the human genome), but also identifies which chromosomes come from the same cells. This metagenomic approach may replace some applications of single-cell approaches to study protists in the environment.
Real-time outbreak surveillance in public health using Oxford Nanopore
Most of you must have come across nanopore technology, especially because the sequencer is portable and small enough to carry around in your pocket. In fact, thanks to its portability, the sequencer advertises real-time analysis, and it has even been to space (link to the article here). There have been lots of other interesting use cases for the Nanopore sequencer, especially because of its portability.
Nanopore was used for real-time surveillance of Ebola (link to paper here) and for rapid diagnosis of respiratory infections (link to paper here) in less than 24 hours, most importantly in resource-limited settings.