Rarefaction curves for metagenomic datasets
September 04 2019Introduction
Plotting rarefaction curves
Why use rarefaction curves
In metagenomics studies, samples are collected from the environment. Let’s say we are interested in looking at the microbial community that is present in the lake. This would ideally mean sampling the whole lake, but that would take lots of time and $$. So, an alternative is to sample different sites in the lake – which would mean we are subsampling the microbial community at these different sites – with the hope that this would give us a good representation of the microbial community of the lake. Now, how can you show that the samples collected are indeed the best representation of the microbial community? One way to go about this is - Rarefaction curves
Rarefaction curves are a representation of the species richness for a given number of individual samples. This curve (looks like the microbial growth curve), generally grows rapidly at first where every read in the sample likely identifies as a new organism (like the exponential phase in the microbial growth curve) and slowly starts to plateau when the rare species remain to be sampled (like the stationary phase in microbial growth curve). If the rarefaction curves for the samples reach the plateau, this suggests you have a good representation of the microbial community as most of the abundant species are represented with some rare species. Rare species are important too, so to get a better representation of these species, we need more samples, increasing the sampling depth.
Before you plot rarefaction curves - Taxa identification
Taxa identification is a necessary step since rarefaction curves need this information to calculate species richness to plot the curve. For this step, you can use any taxa identification tool like - Kraken (link to program here), Centrifuge (link to program here), FOCUS (link to program here) etc to generate table as shown below, which can be imported to R to plot the curve. The table here contains the list of taxa and the corresponding number of reads identified as the taxa per sample.
|Taxa||Sample 1||Sample 2||Sample3|
If you already have this table generated, skip to the next section. If you are not sure how to get this table, follow along.
In this case, let's use Kraken for taxa identification. If you don't have this program installed, the instructions are available here (link to Kraken installation notes here). Here is how I installed this program,
Download the program from https://github.com/DerrickWood/kraken/releases wget https://github.com/DerrickWood/kraken/archive/v1.1.1.tar.gz
tar -xvzf v1.1.1.tar.gz cd kraken-1.1.1
To install the program, run the command
Download the small 4GB database -
Now run Kraken commands as per the instructions in the manual (link to the manual here).
First, align the reads against the database, the below command is an example for paired-end reads,
kraken --preload --db $KRAKEN_DB --paired reads/Sample1_1.fa reads/Sample1_2.fa > Sample1_kraken.out
kraken --preload --db $KRAKEN_DB --paired reads/Sample2_1.fa reads/Sample2_2.fa > Sample2_kraken.out
Kraken report to get the taxonomy name
kraken-report --db $KRAKEN_DB Sample1_out >Sample1_report
kraken-report --db $KRAKEN_DB Sample2_out >Sample2_report
The report generated contains a table, with the following columns,
|Sample name||% of reads aligned||Number of reads covered by clade||Number of reads assigned to this taxa||Rank Code||NCBI Taxa ID||Scientific Name|
If you take a look at this report you will notice that the file contains duplicate scientific names for example next to Phylum (P) is Actinobacteria, and under Class (C) is Actinobacteria as well. To prevent these duplicates, I chose to extract the “Genus (G)” entries from the list and added them to a new report subset report, by running the command
grep -w "G" Sample1_report > Sample1_report_genus
grep -w "G" Sample2_report > Sample2_report_genus
To join both the table to generate one table, first, create a master list of all the taxa in both the files,
cat Sample1_report_genus Sample2_report_genus > all
sed -i 's/ //g' allcut -f 6,7 all | sort -k 1 | uniq >all_tmp
Join the reports to generate the table above but with both the samples,
join -a 1 -e0 -1 1 -2 6 -o 1.1,1.2,2.4 <(sort -k 1 all_tmp) <(sort -k 6 Sample1_report_genus)>tmp
join -a 1 -e0 -1 1 -2 6 -o 1.1,1.2,1.3,2.4 <(sort -k 1 tmp) <(sort -k 6 Sample2_report_genus)>kraken_final
Formatting file a little bit,
sed 's/ /\t/g' kraken_final | sort -k 1| uniq >kraken_final2
The final file “kraken_final2” can be used for rarefaction curves. Open the kraken_final2 file and add the column names on the top
Taxa_ID Taxa_name Sample1 Sample2
Great now you have the table that you can use to plot the rarefaction curves in R!
Plotting rarefaction curves in R
Once you have the table generated you can plot the rarefaction curves using R/R studio using the library vegan and the rarecurve function in the package. Here is the script to run, or you can find it here (link to script on GitHub)
#importing the file and parsing the file correctly
# Replace the kraken_final name to the actual filename.
Data=read.table("kraken_final", header=TRUE, row.names = 1)
#Below set of commands, need to change based on the table
#count the number of species
S <- specnumber(Data_t)
#Rarefaction of the samples
Srare <- rarefy(Data_t2, raremax)
#plotting the rarefaction curves
plot(S, Srare, xlab = "Observed No. of Species", ylab = "Rarefied No. of Species")
rarecurve(Data_t2, step =1, sample = raremax, col = "blue", cex = 0.4, )
Look for a file called Rarefaction_curve.pdf to see the rarefaction curves. Here is the example rarefaction curve generated from the vegan package test BCI dataset.
Great, now what?
If the rarefaction curves suggest that the samples haven’t plateaued – this means that the environments can still be sampled to get a better representation of the microbial community.
- If you have more funding to sequence, then you can add more replicates.
- If not, then you can look for other studies that sampled from this environment to add to your research from Sequence Read Archive (SRA).
- Finally, there is nothing wrong with continuing with this dataset as well, just make sure to make a note of the incomplete representation of the microbial community while drawing conclusions from your research study.
If you have any questions about this content, email us at email@example.com