How EviGene Works
December 17 2018
Introduction and General Philosophy
If you have talked to us aboout transcriptome assembly, or come to our workshops, you undoubtedly have heard of a program called EvidentialGene (or EviGene for short). This program is the last step in assembly for our pipeline, used to condence our 19 assemblies into one cleaned up, non-redundant transcriptome. But the number one question we get...
What is it doing?
That is an extremely valid question, and the reason I'm writing this blog. We believe in the importance of understanding what software is doing - we go through the assumptions of each Trinity step when we're teaching transcriptomes for this reason! Computational solutions are not perfect, and knowing what is going on is a great way to use your biological knowledge to double check that the software tool is appropriate for your needs and what biases you may have to deal with.
So let's do exactly what we do with Trinity when we talk about transcriptome assemblers - identify the philosophy and then walk through the steps EviGene is taking.
One important thing to keep in mind about EviGene is that it is a program designed for genome annotation primarily. We are using one part of the workflow, called tr2aacds. The "point of tr2aacds tries to keep all the valid biological sequences, and remove the not so valid ones. It errs mostly on the side of keeping in not-so valid ones, for the later step of checking against the external evidence of other species or chromosomes. (D. Gilbert)" Everything is tagged in very helpful ways, so you can always re-evaluate things that are initially dropped, or evaluate whether there is enough evidence to keep something that looks potentially novel.
This part of the workflow is also entirely "self-referential" - it only uses the sequences you give it to reduce redundancy and give the best internally validated set of transcripts and alternatives, mainly by concentrating on the CDS regions. Don Gilbert refers to this as not a final step, but an "intermediate step which separates the wheat from the chaf". Often this step is then followed by external evidence like blasting to other species, databases, and identifying domains, etc. We offer a trinotate extension to our pipeline to begin this process.
Briefly, this is the workflow of tr2aacds is as follows:
0) Overassemble RNA with different assemblers
I am not just being a nerd and 0 indexing a list - this step is really not part of EviGene. It's here we generate the input to the program - which is an important step in itself.
The goal here is to aim for right sequence not right number. I won't go into a lot of detail here, since we harp on this in other places, but there is a real benefit to running several parameters and several assemblers when trying to get back the real biological sequence.
Assemblers all have different biases - which will cause you to miss sequences or misassemble sequences if you only use one. And it is critical in early stages of analyses (such as generating your reference) to be picky about quality - low quality in this stage will effect all subsequent stages - and have a lot of influence on things such as differential expression!
Trinity was originally written to find novel cancer transcripts, with no real worry about false positives - candidates were all being checked anyway. This is why you'll get hundreds of thousands of possible transcripts when you run Trinity - there are a lot of false positives! TransABySS on the other hand is very conservative, and tries very hard to prevent chimeras, but may miss more transcripts for being too conservative. Velvet is great for not being as affected by noise in the data (i.e. lots of isoforms/paralogs), but has more chimeras. SOAP deNOVO is a great combination of all the different best parts of assembliers... but is more error prone with less coversage. So we run all of these assemblers (and sometimes RNASpades) with different kmers (see here -https://github.com/rrwick/Bandage/wiki/Effect-of-kmer-size- for why kmer variety is helpful). That way, we cover all the bases...
But we end up with 19+ assemblies and a whole lot of redundancy (we hope!). If you overassemble with all these assemblers, you need a way to deal with the redundancy, pick best representatives, and filter out misassemblies. The great part here is that it is highly unlikely that different algorithms with different biases will produce the same incorrect assembly - so we're looking for concordinance between and can use quality metrics to identify the best looking option for each "loci".
This is what EviGene (well, the tr2aacds part of it) does.
1) Find coding sequences
"Note that PacBio software, their SMRT analysis tools, that produce, or assemble, RNA transcripts from raw PacBio RNA reads, use coding sequence measures, as does Evigene, to select best transcript; they do not use a longest-transcript-is-best approach." - Don Gilbert
This is the step that can be the hardest to suss out, as it lacks a lot of documentation on the specifics. However, between talking to the creator of EviGene and looking through the code, we are able to detail the specifics here.
One of the first things you may notice when running EviGene is that you get an "input_set" folder, which has your input in three forms - aa, fa, and cds. There is a very good reason for generating these files, aside from being extremely convenient to have for subsetting at anytime.
Dr. Gilbert goes into more detail here (EviGene tr2aacds website and EviGene Quality Documentation), but the short version is this: you need to focus on coding regions (cds) when doing quality control.
Usually we have some step that looks for the longest sequence - a means of getting as much information as possible - but this can be erroneous if you aren't careful. Misassemblies can make for really long genes (which is why quast lists numbers of transcripts at different sizes!!). Misassemblies will often have inserts or deletions, which shift frame, and really mess with coding regions - which is easy to spot. Additionally, untranslated regions are often the most misassembled regions (gene fusion can be real a problem here).
By looking primarily at the coding regions, we circumvet these problems. Protein coding regions have a (soft) biological maximum and some standard characteristics that we can use to filter quality and remove the misassemblies - this data is too big to look at everything individually!
Characteristics that are characteristic of cds:
- Homology to other isoforms/paralogs/orthologs - higher in cds than in utr (cite)
- Start codon
- Stop codon
- Sensible length
Reasonable Protein Lengths (From EviGene website)
|Clade||Maximum Transcript Size|
|vertebrates*||2400 aa or 7000 cds|
|insects/crustacea*||2000 aa or 6000 cds|
|plants||1500 aa or 4500 cds|
*There are often very long muscle proteins in animals with ~10,000aa. These are not present in plants. There are other super large proteins, but these are exceptions, not the norm.
So working with the cds is important, but how is it identified in EviGene? Since we are inputting full fasta form assemblies, the first thing that needs to be done is convert the data into amino acids. There are several possible translations for many sequences given the nature of codon based readframes, so assigning quality measures to these translations is important. quality is assigned by the last three characteristics in the above chart - presence of a start and stop codon and a sensible length. When more than one possible translation occurs, the one with the best amino acid score is used.
These best amino acids sequences are then used to determine the coding region of the transcript, and the fasta format is subset to create a cds only form of each input.
2) Analyze sequences
This is the part that is explained in more detail on the EviGene website - the more analysis of these cds sequences. The basics here are to remove the redundancy first in two passes and then reduced transcripts are clustered by amino acid identity. Finally, blastn is used within clusters to find high identity within clusters.
The first step is to remove perfect duplicates, which we expect to occur in a reasonably high number. If you assemble the same data with six different kmer settings and the same assembler, chances are you will get perfect duplicates. This can also occur between assembler outputs - which is great! That's evidence that that transcript is likely real.
The removal is performed by using fastanrdb, a part of the exonerate software suite that creates non-redundant versions of a fasta database (fastanrdb github). Fastanrdb is run with default settings with the cds sequences as input.
The software also calculates a cds quality - with metrics such as utr length, gaps, aa quality, stop and start codon presense. This quality metric is used to double check that we are keeping the best cds for each set of perfect duplicates. For example - if two amino acids match, but one has almost no utr, it is worth keeping the one with a better utr for mapping or other downstream analyses. This step is performed by custom code (cdsqual.pl script).
Next, the program looks for perfect fragments - situations where the assemblers came up with identical cds sequences, but one is longer than the other. In this case, we only want to keep the longest cds, so perfect fragments are tagged. This step is done by running cd-hit-est (cd-hit documentation) on the cds sequences, using a similarity cut off of 100% - or identical and custom scripts for the header tagging (one of the real beauties of EviGene - see step 3).
Clustering is performed on the aminoacid sequences using cd-hit (cd-hit-est is for nucleotide sequences, cd-hit uses amino acid sequences), using a threshold identity of 90%. This creates groups of likely isoforms.
Blastn is used to find high identity exon sized fragments (98% identity, evalue cut off of 1e-19) to add additional evidence to isoforms that share exons.
3) Classify gene lociAll the while these above analyses are being performed, EviGene is adding a significant amount of detail to the sequence headers and log files. This last step is kind of complicated code, but the gist of the tags are as follows:
Description of Tags found on transcript headers
|aalen||amino acid conversion|
|match:*||redundant seq only - gives confidence, length, and percent identity for match|
|complete||complete reading frame, from amino acid conversion, aaqual, cdsqual|
|-utrpoor||complete reading frame with >=60bp utr|
|-utrbad||complete reading frame with >=35bp utr|
|partial||partial reading frame, from amino acid conversion|
|-utrpoor||partial reading frame with >=60bp utr|
|-utrbad||partial reading frame with >=35bp utr|
|partial5||partial reading frame with 5prime utr, from amino acid conversion, aaqual, cdsqual|
|-utrpoor||partial reading frame with 5prime utr >=60bp|
|-utrbad||partial reading frame with 5prime utr >=35bp|
|partial3||partial reading frame with 3prime utr, from amino acid conversion, aaqual, cdsqual|
|-utrpoor||partial reading frame with 3prime utr >=60bp|
|-utrbad||partial reading frame with 3prime utr >-35bp|
|perfectdup||perfect duplicate from fastanrdb|
|perfectfrag||perfect fragment from cd-hit-est|
|alt||clustering of amino acids or blast similiarity|
|-althi1||higher identity than hi|
|-alta2||high amino acid identity (98%)|
|-altmid||lower identity matches, genome mapping by DG indicates these may be paralogs/homeologs|
|-altmida2||lower nucleotide identity matches with high protein identity (98%)|
|-altmfrag||lower identity partial matches, may be paralogs/homeologs|
|-altmfraga2||lower nucleotide identity partial matches with high protein identity|
|part||partial alternatives from amino acid clustering or blast similarity|
|-parthi1||very high identity|
|-parta2||high amino acid identity (98%)|
|main||primary transcript with longest high quality cds with known alternatives|
|maina2||primary transcript with longest high quality cds with known alterantive with an amino acid similiarity (but not cds) of 98%|
|noclass||primary transcript without known alternatives|
|noclassa2||primary transcript without known alternatives, but with a 98% similarity of amino acid to another transcript (but different cds - meaning it could be a different loci/duplication in genome)|
These tags are used to classify the scripts into three categories - okay, okalt, and drop.
- okay - all are main or noclass
- okalt - all are althit (majority) or parthi
- drop - sequences dropped for various reasons
Sequences are dropped for various reasons. For example, they can be labeled "main", but still be very low quality (i.e. 33 amino acids) and therefore be dropped. This is also where you will find your perfectdups and perfectfrags.
There can be some useful stuff in here if you are really concerned about specific transcripts - particularly in the dropped alternates. When we do polyploid systems, we have to keep in mind that most programs (including EviGene) will sometimes drop very similar alternatives that may have largely different utrs (only checked when duplicates!). In the case of polyploidy, these may actually be homeologs. This is a HUGE file, due to the overassembly, but with everything tagged with information, it is usually pretty easy to search and find anything that you might be worried was dropped for any reason.
This gives you the final output folders - okay_set and drop_set, each with fa, aa, and cds formats.
After this, we usually run analyses with external evidence (pfam, blastp against similar organisms, blastp against nr, kegg, go, etc) thorugh trinotate to add external evidence to the internal evidence used to reduce the dataset here.
And that is what EviGene is doing (well, the tr2aacds part)!