leBIBI IV SSU-rDNA (16S) Automated ProKaryotes Phylogeny

Building the collection of Prototype sequences

Aims

Mitigate the lack of a reference sequence for a species

Sequences that correspond to a RefSeq entry are tagged "R", and sequences obtained from type strains are tagged "T". Although the large majority of species have an RT/T sequence, this is not always the case. Missing RT/T sequences are, for example, due to:

Expand the biodiversity coverage for a given species

Some variation is expected in the rDNA sequences of a given species, for example:

Get a glimpse of the currently unnamed biodiversity

We have collected the sequences of strains that are not well characterized, provided that a substantial part of their taxonomic hierarchy is known. These sequences, named for example "Clostridium sp.", are useful for exploring the hidden biodiversity, but we tried to limit the complexity of the phylogenetic trees by selecting only sequences found at a given frequency. This is done by clustering. Sometimes we obtain big clusters that may hint at new species; most of the time we only reduce the complexity, giving a minimal glimpse of the biodiversity around a known species without impairing the readability of the tree.

Methods

Gathering the sequences that belong to the same nomenclatural unit

The nomenclatural unit is the set of sequences assigned to the same lowest level of the nomenclature (species or subspecies). Sequences are also grouped together when they belong to the same genus but have no defined species (xxx sp.).
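As an illustration, the grouping could be sketched as follows. This is a minimal sketch, not the leBIBI implementation: the function names and the name-parsing rules (genus as first word, "sp." or "subsp." as plain tokens, anything after them treated as a strain label) are assumptions made for the example.

```python
from collections import defaultdict

def nomenclatural_unit(name: str) -> str:
    """Return a grouping key for an organism name (hypothetical parsing rules)."""
    tokens = name.split()
    if len(tokens) < 2:
        return name.strip()
    genus, second = tokens[0], tokens[1]
    if second == "sp.":
        # Genus known, species undefined: strain labels after "sp." are ignored.
        return f"{genus} sp."
    if "subsp." in tokens[2:]:
        i = tokens.index("subsp.", 2)
        if i + 1 < len(tokens):
            # Subspecies is the lowest level available.
            return " ".join([genus, second, "subsp.", tokens[i + 1]])
    # Otherwise the species is the lowest level.
    return f"{genus} {second}"

def group_by_unit(records):
    """Group (sequence_id, organism_name) pairs by nomenclatural unit."""
    units = defaultdict(list)
    for seq_id, name in records:
        units[nomenclatural_unit(name)].append(seq_id)
    return dict(units)
```

With this sketch, "Clostridium sp. AB1" and "Clostridium sp." fall into the same unit, while "Bacillus subtilis subsp. spizizenii" forms a unit of its own.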

Two cases of nomenclatural unit

If NbNu is the number of sequences available in a nomenclatural unit:

Random selection

The problem is that the species name of a sequence is given by the author during the deposition process and, except in the RefSeq case, is not verified and is not, or only marginally, modified afterwards. Some sequences may therefore carry a wrong species name. When the number of sequences in the nomenclatural unit is low, there is no safe way to detect faulty identifications. When the number is larger, faulty identifications do not negatively affect the choice of the prototype because they remain isolated (singletons). The least risky way to reduce the number of sequences in the low-number case is to sample at random. As the representativeness of such sequences is questionable, they are tagged "q".
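A minimal sketch of such a random selection is given below. The function name, the sample size `k` and the optional `seed` are hypothetical; the page does not state how many sequences are drawn.

```python
import random

def sample_prototypes(seq_ids, k, seed=None):
    """Pick at most k sequence ids at random and tag each one "q".

    k and seed are illustrative parameters, not values from leBIBI.
    """
    rng = random.Random(seed)
    ids = list(seq_ids)
    # If the unit is already small enough, keep everything.
    chosen = ids if len(ids) <= k else rng.sample(ids, k)
    return [(seq_id, "q") for seq_id in chosen]
```

Sampling without replacement (`random.Random.sample`) guarantees that no sequence is picked twice; fixing `seed` makes a run reproducible.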

Clustering

MMseqs2 is used with the following parameters:

mmseqs easy-linclust --kmer-per-seq-scale 0.2 --sort-results 1 --min-seq-id 0.99 -e 1.000E-09 --cov-mode 0 -c 0.99

In some cases with few sequences, MMseqs2 returns an error; random selection is then used instead.
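The call and its fallback could be wrapped as in the sketch below. The options are the ones listed above; the positional arguments (input FASTA, output prefix, temporary directory) follow the usual `mmseqs easy-linclust` usage. The function names, the sample size `k` and the decision to treat a missing executable like a clustering error are assumptions of this example.

```python
import random
import subprocess

# Options copied from the parameter line above.
MMSEQS_OPTIONS = [
    "--kmer-per-seq-scale", "0.2",
    "--sort-results", "1",
    "--min-seq-id", "0.99",
    "-e", "1.000E-09",
    "--cov-mode", "0",
    "-c", "0.99",
]

def build_command(fasta, prefix, tmp_dir, binary="mmseqs"):
    """Assemble the easy-linclust command line as an argument list."""
    return [binary, "easy-linclust", fasta, prefix, tmp_dir, *MMSEQS_OPTIONS]

def cluster_or_sample(fasta, prefix, tmp_dir, seq_ids, k, binary="mmseqs", seed=None):
    """Try clustering; on failure fall back to random selection tagged "q"."""
    try:
        subprocess.run(build_command(fasta, prefix, tmp_dir, binary),
                       check=True, capture_output=True)
        # easy-linclust writes the cluster membership table to <prefix>_cluster.tsv
        return "clustered", f"{prefix}_cluster.tsv"
    except (OSError, subprocess.CalledProcessError):
        # Assumption: a missing binary is handled like an MMseqs2 error.
        rng = random.Random(seed)
        ids = list(seq_ids)
        chosen = ids if len(ids) <= k else rng.sample(ids, k)
        return "sampled", [(s, "q") for s in chosen]
```

Returning a status string alongside the result lets the caller know which path was taken, which matters because the two paths lead to different tags.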

Sorting the clusters by decreasing number of sequences

This is done to check how representative each cluster is compared with the previous and the next one. Usually the nomenclatural unit divides into a few evident (rather big) clusters and many low-population clusters (or even singletons). The idea is to detect this change in the ordered list of clusters and to stop the process there.
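The sorting and the detection of the change can be sketched as below. The input is assumed to be a two-column, tab-separated cluster table (representative, member), as written by MMseqs2. The stopping rule shown here (stop when a cluster is smaller than `min_size`, or smaller than `min_ratio` times the previous one) and both threshold values are purely illustrative assumptions; they are not the criteria used by leBIBI.

```python
from collections import Counter

def cluster_sizes(tsv_lines):
    """Count members per representative and sort by decreasing size."""
    counts = Counter()
    for line in tsv_lines:
        line = line.strip()
        if not line:
            continue
        representative = line.split("\t")[0]
        counts[representative] += 1
    return counts.most_common()

def keep_major_clusters(sorted_clusters, min_ratio=0.5, min_size=2):
    """Walk the sorted list and stop at the first sharp drop in size.

    min_ratio and min_size are hypothetical thresholds.
    """
    kept = []
    for representative, size in sorted_clusters:
        if size < min_size:
            break
        if kept and size < min_ratio * kept[-1][1]:
            break
        kept.append((representative, size))
    return kept
```

For clusters of sizes 10, 8, 2 and 1, this sketch keeps the first two and stops at the drop from 8 to 2.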

An optimal case and a lower limit

Results

In the "Extended DB" construction, all P- and p-tagged sequences are kept, but at most 5 q-tagged sequences are kept, and only as long as the total number of sequences tagged P/p/q remains under 5. In the "compact DB" construction, only 5 sequences are included, with priority given to the P-tagged sequences; if needed, a maximum of 3 q-tagged sequences is taken.
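One possible reading of the "compact DB" rule is sketched below. The function name is hypothetical, and the ranking of p-tagged sequences between P and q is an assumption: the text only states the priority of P and the limit of 3 for q.

```python
def compact_selection(tagged, total=5, max_q=3):
    """Select at most `total` sequences from (seq_id, tag) pairs.

    Priority order P, then p, then q (the place of p is assumed);
    at most `max_q` q-tagged sequences are taken.
    """
    selected = []
    for tag, cap in (("P", total), ("p", total), ("q", max_q)):
        candidates = [item for item in tagged if item[1] == tag][:cap]
        room = total - len(selected)
        selected.extend(candidates[:room])
    return selected
```

For one P, one p and four q sequences, this keeps the P, the p and the first three q sequences.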

 

 

Laboratory of Biometry and Evolutionary Biology (LBBE)
