Building the collection of Prototype sequences
Aims
Mitigate the lack of a reference sequence for a species
Sequences that correspond to a RefSeq entry are tagged "R" and sequences from type strains are tagged "T". The great majority of species have an RT/T sequence, but not all of them. An RT/T sequence can be missing for several reasons, for example:
- The sequence may have been rejected from RNA-Central
- The sequence may have been dropped from BIBI-DB because of its length (less than 1400 bp)
- An excessive proportion of unknown positions (>2%) led to rejection
- The DB is based on RNA-Central, and the RT/T status cannot always be extracted from the data collected from the other banks
- Some errors remain in the data collected from the other banks
- Our algorithm is missing some criteria in rare situations.
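The length and unknown-position filters listed above can be sketched as follows. This is only an illustration, not the BIBI-DB code: the function names, the constants and the choice to count every character other than A, C, G, T or U as an unknown position are assumptions.

```python
MIN_LENGTH = 1400             # from the text: shorter sequences are dropped
MAX_UNKNOWN_FRACTION = 0.02   # from the text: more than 2% unknown positions is rejected


def unknown_fraction(seq: str) -> float:
    """Fraction of positions that are not A, C, G, T or U (assumed definition of 'unknown')."""
    if not seq:
        return 1.0
    known = set("ACGTU")
    unknown = sum(1 for base in seq.upper() if base not in known)
    return unknown / len(seq)


def passes_quality_filters(seq: str) -> bool:
    """True if the sequence is long enough and has few enough unknown positions."""
    return len(seq) >= MIN_LENGTH and unknown_fraction(seq) <= MAX_UNKNOWN_FRACTION
```

A sequence failing either test would be absent from the DB, which is one way an expected RT/T sequence can go missing.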
Expand the biodiversity coverage for a given species
Some variations are expected in the sequences of rDNA for a given species, for example:
- Many species have paralogs of small subunit rDNA (Escherichia coli has 7 copies in its genome)
- The paralogs of rDNA are rarely identical
- Evolution exists, even for rDNA!
- Sequencing errors exist (and very early depositions are of lower quality)
Get a glimpse of the currently unnamed biodiversity
We have also collected the sequences of strains that are not well characterized, provided that a substantial part of their taxonomic hierarchy is known. These sequences, named for example "Clostridium sp.", are useful for exploring the hidden biodiversity, but here we tried to reduce the complexity of the phylogenetic trees by selecting only sequences found at a given frequency. This is done by clustering. Sometimes we obtain big clusters that may hint at new species; most of the time we only reduce the complexity, to give a minimal glimpse of the biodiversity around a known species without impairing the readability of the tree.
Methods
Gathering the sequences that belong to the same nomenclatural unit
The nomenclatural unit is the set of sequences assigned to the same lowest level of the nomenclature (species or subspecies). Sequences that belong to the same genus but have no defined species (xxx sp.) are also gathered into one group.
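The gathering step can be illustrated with a small grouping sketch. The name-parsing rules below (second word "sp." for unnamed species, the "subsp." keyword for subspecies) and the function names are hypothetical; the real pipeline may rely on taxonomy identifiers rather than on name strings.

```python
from collections import defaultdict


def nomenclatural_unit(organism_name: str) -> str:
    """Reduce an organism name to its nomenclatural unit (hypothetical rule set)."""
    tokens = organism_name.split()
    if len(tokens) < 2:
        return organism_name.strip()
    genus, second = tokens[0], tokens[1]
    if second == "sp.":
        # Unnamed species: all sequences of the same genus are gathered together.
        return f"{genus} sp."
    if "subsp." in tokens:
        idx = tokens.index("subsp.")
        if idx + 1 < len(tokens):
            # Keep "Genus species subsp. epithet" and drop any strain label.
            return " ".join(tokens[: idx + 2])
    return f"{genus} {second}"


def group_by_unit(records):
    """records: iterable of (sequence_id, organism_name) pairs."""
    units = defaultdict(list)
    for seq_id, name in records:
        units[nomenclatural_unit(name)].append(seq_id)
    return dict(units)
```

The size of each resulting list is the NbNu value used in the next step.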
Two cases of nomenclatural unit
If NbNu is the number of sequences available in a nomenclatural unit:
- A clustering is possible if NbNu > 5
- If NbNu < 5, a random selection is done.
Random selection
The species name of a sequence is given by the author during the deposition process and, except in the RefSeq case, it is not verified and is not, or only marginally, modified afterwards. Some sequences may therefore carry a wrong species name. When the number of sequences in the nomenclatural unit is low, there is no safe way to detect faulty identifications. When the number is larger, faulty identifications do not harm the choice of the prototype because they remain isolated (singletons). For small units, the least risky way to reduce the number of sequences is to sample at random. As the representativity of such sequences is questionable, they are tagged "q".
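A minimal sketch of the choice between the two cases and of the random selection is given below. The text does not say how many sequences are sampled, so the `n_keep` parameter is hypothetical, as are the function names; the behaviour for NbNu equal to 5 is not specified in the text and is treated here as the random case by assumption.

```python
import random

CLUSTERING_MIN = 5  # from the text: clustering is possible if NbNu > 5


def choose_strategy(nb_nu: int) -> str:
    """Return 'cluster' or 'random' for a nomenclatural unit of nb_nu sequences."""
    # Assumption: NbNu == 5 is not covered by the text and falls into 'random' here.
    return "cluster" if nb_nu > CLUSTERING_MIN else "random"


def random_selection(seq_ids, n_keep=5, seed=None):
    """Sample up to n_keep sequences (hypothetical limit) and tag each one 'q'."""
    rng = random.Random(seed)
    ids = list(seq_ids)
    chosen = rng.sample(ids, min(n_keep, len(ids)))
    return [(seq_id, "q") for seq_id in chosen]
```

Passing a `seed` makes the sampling reproducible, which can help when a DB build has to be repeated.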
Clustering
MMseqs2 is used with the following parameters:
mmseqs easy-linclust --kmer-per-seq-scale 0.2 --sort-results 1 --min-seq-id 0.99 -e 1.000E-09 --cov-mode 0 -c 0.99
In some cases with low population numbers, MMseqs2 returns an error; the random selection is then done instead.
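The fallback behaviour can be sketched as below. The options are those quoted above; the positional arguments (input FASTA, output prefix, temporary directory) and the `_cluster.tsv` output name follow our understanding of the `easy-linclust` usage and are assumptions, since the source command does not show them. The function names and the `fallback` callable are hypothetical.

```python
import subprocess

# Options copied from the command line given in the text.
MMSEQS_OPTIONS = [
    "--kmer-per-seq-scale", "0.2", "--sort-results", "1",
    "--min-seq-id", "0.99", "-e", "1.000E-09",
    "--cov-mode", "0", "-c", "0.99",
]


def build_command(fasta_path, out_prefix, tmp_dir):
    """Assemble the command; the three positional arguments are assumptions."""
    return ["mmseqs", "easy-linclust", fasta_path, out_prefix, tmp_dir, *MMSEQS_OPTIONS]


def cluster_or_fallback(fasta_path, out_prefix, tmp_dir, fallback, run=subprocess.run):
    """Run the clustering; if MMseqs2 exits with an error, call fallback() instead."""
    try:
        run(build_command(fasta_path, out_prefix, tmp_dir), check=True, capture_output=True)
    except subprocess.CalledProcessError:
        # Typically seen with very small populations: use the random selection.
        return fallback()
    # Assumed name of the cluster membership table written by MMseqs2.
    return out_prefix + "_cluster.tsv"
```

The `run` parameter only exists so that the function can be exercised without MMseqs2 installed.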
Sorting the clusters by decreasing number of sequences
This is done to check the representativity of each cluster relative to the previous and next ones. Usually the nomenclatural unit divides into a few evident (rather big) clusters and many low-population clusters (or even singletons). The idea is to detect this change in the ordered list of clusters in order to stop the process.
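As an illustration, the sorting can be done from a two-column membership table (representative, member), which is the layout we assume for the MMseqs2 cluster TSV output. The function name is hypothetical.

```python
from collections import defaultdict


def sorted_clusters(tsv_lines):
    """Group members by representative and sort clusters by decreasing size.

    Each line is expected to be 'representative<TAB>member' (assumed layout).
    """
    members = defaultdict(list)
    for line in tsv_lines:
        line = line.strip()
        if not line:
            continue
        rep, member = line.split("\t")
        members[rep].append(member)
    # sorted() is stable, so clusters of equal size keep their input order.
    return sorted(members.items(), key=lambda item: len(item[1]), reverse=True)
```

The list of cluster sizes taken from this result is what the thresholds described next are applied to.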
An optimal case and a lower limit
- The first criterion to identify representative clusters is that a cluster should gather a substantial population. For each nomenclatural unit an optimum population threshold (OPT) is computed: OPT=NbNu/20 (namedDB case) or OPT=NbNu/10 (unnamedDB case). All the clusters with more than 100 sequences are also considered (this is necessary for heavily populated species where the optimum population level is too high). Conversely, a cluster with fewer than 5 sequences is never considered optimal. Note that the computation of OPT depends on the DB content: unnamedDB has a very low frequency of highly populated clusters because of the low stringency of the nomenclature and the high variability of the sequences within a nomenclatural unit.
- The clusters that satisfy this criterion are considered to be of "high quality", but consider the following cluster sizes (NbNu=1200, so OPT=60): [310,280,265,250,60,20,8,8,5,5,...,3,1,1...]. The first four clusters are of high quality (well above OPT). The fifth satisfies the OPT condition but is much smaller than the previous four, so its representativity is somewhat lower. To detect this rapid decline, which seems to reflect a change in representativity, the rule is that a cluster must have more than 1/4 of the population of the previous cluster and not less than 1/10 of the population of the biggest cluster.
- A minimal criterion is derived from the OPT threshold: when the content of a cluster is below OPT/5 (with a minimum of 3), the cluster is classified as "questionable": max(optTaille//5,3)
- Note that in the unnamed case, the parameters are optimized for highly fragmented populations.
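The criteria above can be combined into a small classification sketch. It is one possible reading of the rules, not the production code. The following points are assumptions: the OPT condition is taken as "size >= OPT", the decline rule is applied to every cluster (including those with more than 100 sequences), no cluster is rated high quality once the decline has been detected, and every cluster that is neither high quality nor questionable is rated as lower representativity. The function name is hypothetical.

```python
def classify_clusters(sizes, nb_nu, named=True):
    """Tag clusters 'P' (high), 'p' (lower) or 'q' (questionable).

    sizes must already be sorted in decreasing order; nb_nu is NbNu.
    """
    opt = nb_nu / (20 if named else 10)      # OPT threshold from the text
    questionable_below = max(opt // 5, 3)    # max(optTaille//5,3) from the text
    biggest = sizes[0] if sizes else 0
    tags = []
    declined = False
    previous = None
    for size in sizes:
        # Substantial population: reaches OPT or exceeds 100, and never below 5.
        substantial = size >= 5 and (size >= opt or size > 100)
        # Decline rule: more than 1/4 of the previous cluster, at least 1/10 of the biggest.
        steady = previous is None or (size > previous / 4 and size >= biggest / 10)
        if not steady:
            declined = True                  # assumption: the decline is final
        if substantial and not declined:
            tags.append("P")
        elif size < questionable_below:
            tags.append("q")
        else:
            tags.append("p")
        previous = size
    return tags
```

With the sizes of the example, `classify_clusters([310, 280, 265, 250, 60, 20, 8, 8, 5, 5, 3, 1, 1], 1200)` tags the first four clusters "P", the clusters of 60 and 20 sequences "p" (60 is below 250/4 = 62.5, which triggers the decline rule), and all the remaining ones "q" (below max(60//5, 3) = 12).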
Results
We can thus identify:
- The high representativity clusters, from which we take the sequence proposed by MMseqs2 as a prototype; the sequence is tagged P (capital P)
- The lower representativity clusters, from which we take the sequence proposed by MMseqs2 as a prototype; the sequence is tagged p (lowercase p)
- The questionable representativity clusters, from which we take the sequence proposed by MMseqs2 as a "Prototype"; the sequence is tagged q
For the construction of the "Extended DB", all the P and p sequences are kept, and q-tagged sequences are added only while the total number of sequences tagged P/p/q remains under 5 (so at most 5 q-tagged sequences are kept). For the "compact DB", only 5 sequences are included, with priority given to the P-tagged sequences; if needed, a maximum of 3 q-tagged sequences is taken.
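The two selection rules can be sketched as follows, under our reading of the text. The assumptions are: the input list is ordered by decreasing cluster size, q-tagged sequences are added to the Extended DB only while the total is under 5, and in the compact DB the p-tagged sequences come second in priority, between P and q. The function names and parameters are hypothetical.

```python
def extended_selection(tagged):
    """tagged: list of (seq_id, tag) pairs, ordered by decreasing cluster size."""
    kept = [item for item in tagged if item[1] in ("P", "p")]
    for item in tagged:
        # q-tagged sequences only fill the unit up to a total of 5.
        if item[1] == "q" and len(kept) < 5:
            kept.append(item)
    return kept


def compact_selection(tagged, limit=5, max_q=3):
    """Keep at most `limit` sequences: P first, then p (assumed), then up to max_q q."""
    kept = []
    for wanted, cap in (("P", limit), ("p", limit), ("q", max_q)):
        candidates = [item for item in tagged if item[1] == wanted][:cap]
        kept.extend(candidates[: limit - len(kept)])
    return kept
```

For a unit with only q-tagged sequences, this sketch keeps up to 5 of them in the Extended DB and up to 3 in the compact DB.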