Rarefaction curves for metagenomic datasets
September 04 2019Introduction
Plotting rarefaction curves
Why use rarefaction curves
In metagenomics studies, samples are collected from the environment. Let’s say we are interested in looking at the microbial community that is present in the lake. This would ideally mean sampling the whole lake, but that would take lots of time and $$. So, an alternative is to sample different sites in the lake – which would mean we are subsampling the microbial community at these different sites – with the hope that this would give us a good representation of the microbial community of the lake. Now, how can you show that the samples collected are indeed the best representation of the microbial community? One way to go about this is - Rarefaction curves
Rarefaction curves are a representation of the species richness for a given number of individual samples. This curve (looks like the microbial growth curve), generally grows rapidly at first where every read in the sample likely identifies as a new organism (like the exponential phase in the microbial growth curve) and slowly starts to plateau when the rare species remain to be sampled (like the stationary phase in microbial growth curve). If the rarefaction curves for the samples reach the plateau, this suggests you have a good representation of the microbial community as most of the abundant species are represented with some rare species. Rare species are important too, so to get a better representation of these species, we need more samples, increasing the sampling depth.
Before you plot rarefaction curves - Taxa identification
Taxa identification is a necessary step since rarefaction curves need this information to calculate species richness to plot the curve. For this step, you can use any taxa identification tool like - Kraken (link to program here), Centrifuge (link to program here), FOCUS (link to program here) etc to generate table as shown below, which can be imported to R to plot the curve. The table here contains the list of taxa and the corresponding number of reads identified as the taxa per sample.
|Taxa||Sample 1||Sample 2||Sample3|
If you already have this table generated, skip to the next section. If you are not sure how to get this table, follow along.
In this case, let's use Kraken for taxa identification. If you don't have this program installed, the instructions are available here (link to Kraken installation notes here). Here is how I installed this program,
Download the program from https://github.com/DerrickWood/kraken/releases
tar –xvzf v2.0.8-beta.tar.gz
export KRAKEN_DIR= /path to where you would like to install the program/
To install the program, run the command
Download the small 4GB database -
Now run Kraken commands as per the instructions in the manual (link to the manual here).
First, align the reads against the database, the below command is an example for paired-end reads,
kraken2 --preload --db $KRAKEN_DB --paired forward_1.fastq reverse_2.fastq --threads 1 --use-names --report kraken_report --output kraken.out
The report generated contains a table, with the following columns,
|Sample name||% of reads aligned||Number of reads covered by clade||Number of reads assigned to this taxa||Rank Code||NCBI Taxa ID||Scientific Name|
Since Kraken provides a report per sample, here is a script (link to GitHub) to convert the individual report to one file. Here are instructions on how to download the script and run it,
git clone https://github.com/npbhavya/Kraken2-output-manipulation.git
#Make sure to have python3 in PATH, on IU clusters we can run the command
module unload python/2.7.16
module load python/3.6.8
#Running the scripts
python kraken-multiple.py -d kraken-report/ -r F -c 2 -out kraken_report_all
The output from the script will be in the format
135621 ['210', '859', '2843', '595', '281', '1064']
468 ['80', '359', '1054', '361', '164', '299']
72275 ['66', '1838', '4664', '462', '75', '2074']
267888 ['45', '1407', '59440', '930', '120', '79']
To import this table in R, I run a quick command
sed -e "s/\[//g;s/\]//g;s/'//g;s|\t|,|g" kraken_report_all >kraken_report_all_R.csv
Great now you have the table that you can use to plot the rarefaction curves in R!
Plotting rarefaction curves in R
Once you have the table generated you can plot the rarefaction curves using R/R studio using the library vegan and the rarecurve function in the package. Here is the script to run, or you can find it here (link to script on GitHub)
#importing the file and parsing the file correctly
# Replace the kraken_final name to the actual filename.
Data=read.table("kraken_report_all_R.csv", header=TRUE, row.names = 1)
#Below set of commands, need to change based on the table
#count the number of species
S <- specnumber(Data_t)
#Rarefaction of the samples
Srare <- rarefy(Data_t2, raremax)
#plotting the rarefaction curves
plot(S, Srare, xlab = "Observed No. of Species", ylab = "Rarefied No. of Species")
rarecurve(Data_t2, step =1, sample = raremax, col = "blue", cex = 0.4, )
Look for a file called Rarefaction_curve.pdf to see the rarefaction curves. Here is the example rarefaction curve generated from the vegan package test BCI dataset.
Great, now what?
If the rarefaction curves suggest that the samples haven’t plateaued – this means that the environments can still be sampled to get a better representation of the microbial community.
- If you have more funding to sequence, then you can add more replicates.
- If not, then you can look for other studies that sampled from this environment to add to your research from Sequence Read Archive (SRA).
- Finally, there is nothing wrong with continuing with this dataset as well, just make sure to make a note of the incomplete representation of the microbial community while drawing conclusions from your research study.
If you have any questions about this content, email us at email@example.com