Optimized workflow in BIBI IV Automated ProKaryotes Phylogeny

The progression of an analysis

After a new set of tests, the mandatory minimum length of the sequence has been set to 400 bp. When the sequence is under 800 bp you should ALSO have a look at the BLAST results.
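The two length thresholds above can be sketched as a small check. This is only an illustration: the function name and the returned labels are hypothetical, and it assumes that a sequence of exactly 400 bp is accepted.

```python
MIN_LENGTH_BP = 400    # below this length the query is not analysed
BLAST_CHECK_BP = 800   # below this length the BLAST results should also be inspected

def check_query_length(sequence: str) -> str:
    """Return 'rejected', 'check_blast' or 'ok' (hypothetical labels)."""
    # Ignore whitespace and line breaks that may come from a pasted sequence.
    length = len("".join(sequence.split()))
    if length < MIN_LENGTH_BP:
        return "rejected"
    if length < BLAST_CHECK_BP:
        return "check_blast"
    return "ok"
```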

NCBI BLAST on the low-redundancy DB (essentially Type strains + Representative)

Sequences closely similar to the query sequence are searched for in the DB using NCBI BLAST.

A set of 100 candidate sequences is extracted and the query is added.
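How the 100 candidates are chosen is not detailed here. As a minimal sketch, assuming the BLAST results are available in the standard 12-column tabular format (subject identifier in column 2, bit score in column 12) and that candidates are ranked by bit score, the selection could look like this. The function name and the ranking criterion are assumptions.

```python
def top_candidates(blast_tabular: str, max_hits: int = 100) -> list:
    """Return up to max_hits distinct subject ids, best bit score first."""
    best = {}
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject, bitscore = fields[1], float(fields[11])
        # A subject may produce several hits; keep its best score only.
        if bitscore > best.get(subject, float("-inf")):
            best[subject] = bitscore
    ranked = sorted(best, key=lambda s: best[s], reverse=True)
    return ranked[:max_hits]
```

The query itself would then be appended to the extracted sequences before alignment.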

Alignment

Kalign is used. Note that Kalign is much faster than MAFFT and that, after trimming, the alignments are similar. Differences between MAFFT and Kalign are rare and not very obvious.

Trimming

After the alignment, a light correction of the query is done to replace by gaps some apparently faulty positions due to sequencing errors. For example, this is the case if the position is an "A" in the query BUT no "A" is present in the corresponding column of the alignment. The same is applied if the position contains non-cardinal letters (anything other than A, T, C or G; FastTree does the same at the beginning of its work).
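The correction described above can be sketched as follows. This is an illustration under stated assumptions, not the code used by leBIBI: the function name is hypothetical, all rows are assumed to have the same aligned length, and the column is compared against the other sequences only.

```python
CARDINAL = set("ACGT")

def correct_query(query: str, references: list) -> str:
    """Replace suspicious query positions by gaps.

    A query base is kept only if it is A, C, G or T and at least one
    reference sequence carries the same base in that column.
    """
    corrected = []
    for i, base in enumerate(query.upper()):
        column = {ref[i].upper() for ref in references}
        if base in CARDINAL and base in column:
            corrected.append(base)
        else:
            corrected.append("-")  # gap, ambiguous letter or unsupported base
    return "".join(corrected)
```

For instance, with the query `ACGTN` and two references `ACCTA`, the unsupported `G` and the ambiguous `N` both become gaps.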

The columns with more than 45% gaps are then deleted. The current version of leBIBI was using Gblocks, but the loss of positions heavily reduced the information; Gblocks is not well adapted to Kalign, which does not over-align at high frequencies.

The small set of bases in the query sequence that lie at an unexpected distance from the core of the alignment are then replaced by gaps (this occurs in the case of short and bad-quality sequences).
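The 45% gap filter can be sketched as below, assuming the alignment is held as a list of equal-length strings with `-` as the gap character. The function name is hypothetical.

```python
def drop_gappy_columns(alignment: list, max_gap_fraction: float = 0.45) -> list:
    """Remove every column whose gap fraction is above max_gap_fraction."""
    n_rows = len(alignment)
    keep = [
        i for i in range(len(alignment[0]))
        if sum(row[i] == "-" for row in alignment) / n_rows <= max_gap_fraction
    ]
    # Rebuild each row from the retained column indices only.
    return ["".join(row[i] for i in keep) for row in alignment]
```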

When the query is short compared with the alignment, the phylogenetic reconstruction is affected by the number of undetermined positions, as FastTree (or the model it uses) does not tolerate too many of them. This is especially the case when some sequences share 100% identity in the region covered by the query. The query may then be extremely badly positioned because of this reconstruction artifact, so the length of the alignment has to be adapted to the query.

The major cause is the position of the query in the SSU rDNA gene: for some species/genera/families the gene has very low variability in this region. A mitigation has been set up: the length of the alignment is driven by the length of the query as soon as the length of the query is less than 90% of the length of the alignment. This 90% threshold is set to prevent reconstruction errors while retaining some more information. The alignment length is thus adapted to the query length, corrected by the number of gapped and undetermined positions, with 5% of the query length added, if possible, at both ends.
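One possible reading of this rule is sketched below. It is an interpretation, not the actual leBIBI code. It assumes the query is the first row, that its length is counted as the number of determined (A, C, G, T) positions, and that the alignment is cropped to the span of the query plus a 5% margin on each side where columns are available. The function and parameter names are hypothetical.

```python
def adapt_to_query(alignment: list, query_index: int = 0,
                   trigger: float = 0.90, margin: float = 0.05) -> list:
    """Crop the alignment to the region covered by a short query."""
    query = alignment[query_index].upper()
    width = len(query)
    determined = [i for i, c in enumerate(query) if c in "ACGT"]
    # Leave the alignment untouched when the query covers at least 90% of it.
    if not determined or len(determined) >= trigger * width:
        return alignment
    pad = round(margin * len(determined))       # 5% of the query length
    start = max(0, determined[0] - pad)         # extend left if possible
    end = min(width, determined[-1] + 1 + pad)  # extend right if possible
    return [row[start:end] for row in alignment]
```

With a 100-column alignment and a query covering columns 40 to 79, this keeps columns 38 to 81, that is the 40 query columns plus 2 columns on each side.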

In extremely rare cases the process cannot be achieved and is replaced automatically by a call to Gblocks.

FastTree approximate maximum likelihood

FastTree is compiled with double precision and OpenMP, and the model used is GTR+gamma.

The tree is then rooted using Fastroot with the minimum variance option.

The nodes with an SH support value less than or equal to 0.2 are suppressed (a polytomy is created).
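Suppressing a weakly supported node amounts to attaching its children directly to its parent. The sketch below shows the idea on a minimal, hypothetical `Node` class; it ignores branch lengths, which a real implementation would need to carry over, and it is not the code used by leBIBI.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str = ""
    support: Optional[float] = None  # SH support of the branch above this node
    children: List["Node"] = field(default_factory=list)

def collapse_low_support(node: Node, threshold: float = 0.2) -> Node:
    """Turn internal nodes with support <= threshold into polytomies."""
    new_children = []
    for child in node.children:
        collapse_low_support(child, threshold)  # handle the subtree first
        weak = (bool(child.children)
                and child.support is not None
                and child.support <= threshold)
        if weak:
            new_children.extend(child.children)  # graft grandchildren here
        else:
            new_children.append(child)
    node.children = new_children
    return node
```

Leaves and nodes without a support value are never removed.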

The support thresholds are 0.8 and especially 0.95 (read the "Nodes" paragraph) [This is a main change]. So, we can be confident in the green nodes and interpret the relations accordingly.

Tree rendering and enrichment

Branch labels and links

This is an example of a leaf label:
[Image: example leaf label]
Links
Here is an example with the same links (active) as found in the tree:
Methanothermus_fervidus RT URS000014A460 ↑ Archaea~Euryarchaeota~Methanobacteria
It gives the three links to LPSN, RNAcentral and the DDBJ taxonomy server.

Some species names in the NCBI taxonomy may not be found in LPSN for various reasons. In some very rare cases the species names are taken from LPSN and may not be present in the NCBI taxonomy DB (and hopefully in the DDBJ taxonomy).

Note that the hierarchy is limited to Class-Order-Family, as the Genus is given by the first part of the species name and is not repeated, to prevent overly long descriptions. Note that "candidatus" is replaced by a "c" prefix to the name, e.g. cZinderia_insecticola. This does not follow the Nomenclature code, but it helps computer processing, and these "candidatus" taxa are de facto recognized by the community, if not by the official nomenclature. In this example, TR is related to the Sequences classification.
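The naming convention for labels can be illustrated with a few lines of code. This is a sketch only: the function name is hypothetical, and it assumes the input is a species name with words separated by spaces.

```python
def leaf_name(species: str) -> str:
    """Build a label name, replacing a leading 'Candidatus' by a 'c' prefix."""
    words = species.split()
    prefix = ""
    if words and words[0].lower() == "candidatus":
        prefix = "c"       # shorthand used in the tree labels
        words = words[1:]
    return prefix + "_".join(words)
```

For example, `leaf_name("Candidatus Zinderia insecticola")` gives `cZinderia_insecticola`.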

Nodes

Nodes having an SH value less than or equal to 0.2 are replaced by a multifurcation. Nodes are yellow when the SH value is between 0.8 and 0.95 and green when the SH value is over 0.95. The diameter of a yellow node is also proportional to the SH value within [0.8, 0.95).
So, we can be confident ONLY in the green nodes (see the meaning of the colors) and interpret the relations accordingly. SH values > 0.95 are currently accepted as significant support, 0.9 being more debatable.
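The color rule can be summarized in a short function. The diameters in pixels, the linear scaling, the treatment of a value of exactly 0.95 as green and the absence of a marker below 0.8 are all assumptions made for this sketch; the function name is hypothetical.

```python
from typing import Optional, Tuple

def node_marker(sh: float, min_diameter: float = 4.0,
                max_diameter: float = 10.0) -> Optional[Tuple[str, float]]:
    """Return (color, diameter) for a node, or None when no marker is drawn."""
    if sh >= 0.95:
        return ("green", max_diameter)  # strongly supported node
    if sh >= 0.8:
        # Yellow node: the diameter grows linearly with the SH value.
        fraction = (sh - 0.8) / (0.95 - 0.8)
        diameter = min_diameter + fraction * (max_diameter - min_diameter)
        return ("yellow", diameter)
    return None  # assumed: nodes below 0.8 carry no colored marker
```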

Alignment viewer

Currently not working well due to changes in CIAviever :(

This is a very global view of the alignment (see the meaning of the colors and an example). The query is always positioned on the first line.

leBIBI IV SSU-rDNA (16S) Automated ProKaryotes Phylogeny