The progression of an analysis
NCBI BLAST on the Type Material+Prototypes DB
Sequences closely similar to the query sequence are searched for in the DB using NCBI BLAST.
A set of 100 candidate sequences is extracted and the query is added to it.
The sequences are aligned with Kalign. Kalign is much faster than MAFFT, and after trimming the two alignments are similar. Differences between MAFFT and Kalign are rare and subtle.
After the alignment, a light correction of the query is applied: positions that appear faulty because of sequencing errors are replaced by gaps. For example, this is the case if the position is an "A" in the query but no "A" is present in the corresponding column of the alignment. The same is applied if the position contains a non-cardinal letter (anything other than A, T, C or G; FastTree does the same at the start of its work).
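The correction rule above can be sketched as follows. This is a minimal illustration and not the leBIBI code: the function name `correct_query` and the representation of the alignment as equal-length strings are assumptions.

```python
CARDINAL = set("ACGT")


def correct_query(query: str, references: list[str]) -> str:
    """Replace suspicious query positions with gaps.

    A position is replaced when the query letter is not A, C, G or T,
    or when no reference sequence carries the same letter in that column.
    All sequences are assumed to be aligned (same length).
    """
    corrected = []
    for i, base in enumerate(query.upper()):
        if base == "-":
            corrected.append("-")  # existing gaps are kept as they are
            continue
        column = {ref[i].upper() for ref in references}
        if base not in CARDINAL or base not in column:
            corrected.append("-")
        else:
            corrected.append(base)
    return "".join(corrected)
```

For instance, with the references `ACGT-A`, `ACGTTA` and `ACCTTA`, the query `ACGANA` loses its unsupported "A" and its "N".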
Columns with more than 45% gaps are then deleted. Until now, leBIBI used Gblocks, but the resulting loss of positions heavily reduced the information; Gblocks is also poorly suited to Kalign, which does not over-align at high frequencies.
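A minimal sketch of this column filter is shown below. The function name `drop_gappy_columns` and the list-of-strings alignment are illustrative assumptions; only the 45% threshold comes from the text.

```python
def drop_gappy_columns(alignment: list[str], max_gap_fraction: float = 0.45) -> list[str]:
    """Remove every column whose gap fraction exceeds max_gap_fraction."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    keep = [
        i
        for i in range(length)
        if sum(seq[i] == "-" for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    # Rebuild each sequence from the retained columns only.
    return ["".join(seq[i] for i in keep) for seq in alignment]
```

With four sequences, a column holding two gaps (50%) is dropped, while a column holding a single gap (25%) is kept.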
The few bases of the query sequence that lie at an unexpected distance from the core of the alignment are then replaced by gaps (this occurs in the case of short, poor-quality sequences).
When the query is short compared with the alignment, the phylogenetic reconstruction is affected by the number of undetermined positions, because FastTree (or the model it uses) does not tolerate too many of them. This is especially the case when some sequences share 100% identity in the region covered by the query. The query may then be extremely ill-positioned because of this reconstruction artifact, so the length of the alignment has to be adapted to the query.
The major cause is the position of the query in the SSU rDNA gene: for some species, genera or families the gene has very low variability in this region. A mitigation has been set up: the length of the alignment is driven by the length of the query as soon as the query is shorter than 90% of the alignment. This 90% threshold is set to prevent reconstruction errors while retaining some more information. The alignment length is thus adapted to the query length, corrected by the number of gapped and undetermined positions, with 5% of the query length added at both ends where possible.
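One possible reading of this rule is sketched below. The function name, the way the query length is counted (determined A/C/G/T positions) and the truncation of the 5% margin to an integer are all assumptions; only the 90% and 5% figures come from the text.

```python
def adapt_alignment_to_query(
    alignment: list[str],
    query_index: int,
    threshold: float = 0.90,
    margin_fraction: float = 0.05,
) -> list[str]:
    """Trim the alignment to the region covered by the query, plus a margin.

    Nothing is done when the query covers at least `threshold` of the columns.
    """
    query = alignment[query_index].upper()
    total = len(query)
    determined = [i for i, base in enumerate(query) if base in "ACGT"]
    if not determined:
        return alignment
    query_length = len(determined)
    if query_length >= threshold * total:
        return alignment
    # The window runs from the first to the last determined base, so internal
    # gaps and undetermined positions are included in the adapted length.
    margin = int(margin_fraction * query_length)
    start = max(0, determined[0] - margin)          # margin added "if possible"
    end = min(total, determined[-1] + 1 + margin)
    return [seq[start:end] for seq in alignment]
```

For a 40-column alignment in which the query occupies columns 10 to 29, this sketch keeps 22 columns: the 20 covered by the query and one extra column on each side.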
In extremely rare cases the process cannot be completed and is automatically replaced by a call to Gblocks.
FastTree approximate maximum likelihood
FastTree is compiled with double precision and OpenMP, and the model used is GTR+gamma.
The tree is then rooted using FastRoot with the minimum variance option.
Nodes with an SH support value less than or equal to 0.2 are suppressed (a polytomy is created). [This is a main change] The 0.8 and especially the 0.95 thresholds are described in the "Nodes" paragraph.
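Collapsing a weakly supported node into a polytomy can be sketched on a toy tree structure. The `Node` class and `collapse_low_support` function are hypothetical and ignore branch lengths; they only illustrate the 0.2 threshold given above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str = ""
    support: Optional[float] = None  # SH support; None for leaves and the root
    children: list["Node"] = field(default_factory=list)


def collapse_low_support(node: Node, threshold: float = 0.2) -> Node:
    """Remove internal nodes with support <= threshold (post-order traversal).

    The children of a removed node are attached to its parent, which turns
    the parent into a multifurcation.
    """
    new_children = []
    for child in node.children:
        collapse_low_support(child, threshold)
        weak = (
            bool(child.children)
            and child.support is not None
            and child.support <= threshold
        )
        if weak:
            new_children.extend(child.children)  # promote the grandchildren
        else:
            new_children.append(child)
    node.children = new_children
    return node
```

On the tree ((A,B)0.1,(C,D)0.97), the (A,B) node disappears and the root ends up with three children: A, B and the (C,D) clade.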
The topology of the tree is analysed to detect cases where the root is too close to the node containing the query (in other words, the query is in a root-like position).
As long as this occurs, and up to a 200-leaf tree, the tree construction is repeated with a set enriched by 50 new sequences (which therefore have a lower similarity).
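The retry logic can be sketched as a simple loop. The predicate `query_is_root_like` stands in for the whole build-and-inspect step and is purely hypothetical, as is the choice to count candidate sequences rather than leaves; the figures 100, 50 and 200 are those given in the text.

```python
from typing import Callable


def candidate_set_sizes(
    query_is_root_like: Callable[[int], bool],
    start: int = 100,
    step: int = 50,
    max_size: int = 200,
) -> list[int]:
    """Return the successive candidate-set sizes that would be tried.

    `query_is_root_like(n)` is assumed to build a tree from the n most
    similar sequences and report whether the query sits too close to the root.
    """
    sizes = [start]
    while query_is_root_like(sizes[-1]) and sizes[-1] + step <= max_size:
        sizes.append(sizes[-1] + step)
    return sizes
```

If the query stays in a root-like position at every attempt, the sizes tried are 100, 150 and 200, after which the loop stops.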
If requested, the same process is run on the supplementary database, but only the 20 most similar sequences are added to the alignment. The resulting tree will thus contain 120 leaves. The tree is enriched by these more diverse sequences, but only in the vicinity of the query branch.
Tree rendering and enrichment
Branch labels and links
This is an example of a leaf label:
Here is an example with the same links (active) as found in the tree:
Methanothermus_fervidus RT URS000014A460 ↑ Archaea~Euryarchaeota~Methanobacteria
The label carries three links: to LPSN, RNAcentral and the DDBJ taxonomy server.
Clicking on ↑ launches a new phylogeny using the species name as the query. In this case the number of leaves is 150, to extend the exploration (primary DB only).
Some species names in the NCBI taxonomy may not be found in LPSN, for various reasons. In some very rare cases the species names are taken from LPSN and may not be present in the NCBI taxonomy DB (and therefore not in the DDBJ taxonomy).
Note that the hierarchy is limited to Class-Order-Family: the genus is given by the first part of the species name and is not repeated, to avoid overly long descriptions. Note also that "candidatus" is replaced by a "c" prefix on the name, e.g. cZinderia_insecticola. This does not follow the Nomenclature code, but it helps computer processing, and these "candidatus" taxa are de facto recognized by the community, if not by the official nomenclature. In this example, TR relates to the Sequences classification.
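The naming convention for labels can be illustrated with a small helper. The function `label_name` is hypothetical and assumes the species name arrives as space-separated words.

```python
def label_name(species: str) -> str:
    """Turn a species name into a leaf-label name.

    Words are joined with underscores; a leading "Candidatus" is replaced
    by a "c" prefix attached to the genus.
    """
    parts = species.split()
    if parts and parts[0].lower() == "candidatus":
        return "c" + "_".join(parts[1:])
    return "_".join(parts)
```

Under these assumptions, "Candidatus Zinderia insecticola" becomes cZinderia_insecticola and "Methanothermus fervidus" becomes Methanothermus_fervidus.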
Nodes with an SH value less than or equal to 0.2 are replaced by a multifurcation. Nodes are yellow when the SH value is between 0.8 and 0.95, and green when it is over 0.95. The diameter of a yellow node is also proportional to its SH value within [0.8, 0.95).
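These display rules can be summarised in a short function. It is a sketch under assumptions: the name `node_style`, the `scale` factor for the diameter and the "unmarked" label for values between 0.2 and 0.8 are illustrative, and a value of exactly 0.95 is assumed to be green.

```python
from typing import Optional, Tuple


def node_style(sh: float, scale: float = 10.0) -> Tuple[str, Optional[float]]:
    """Map an SH support value to a (style, diameter) pair.

    The diameter is only defined for yellow nodes, where it is
    proportional to the SH value.
    """
    if sh <= 0.2:
        return ("collapsed", None)   # replaced by a multifurcation
    if sh >= 0.95:
        return ("green", None)       # assumption: 0.95 itself counts as green
    if sh >= 0.8:
        return ("yellow", scale * sh)
    return ("unmarked", None)        # assumption: no marker between 0.2 and 0.8
```

A node with SH 0.9 is therefore drawn yellow and larger than a node with SH 0.8, while a node with SH 0.97 is drawn green.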
So we can be confident in the green nodes (see "Color meanings") and interpret the relations accordingly. SH values > 0.95 are currently accepted as significant support; 0.9 is more debatable.
A "+" tag is added to sequences that come from the secondary database.