The RNAcentral non-coding RNA sequence database is now the source of the sequences and of most of the associated information. It is made of unique sequences, each with a unique identifier of the form URSnnnn. One URS may gather multiple identical sequences.
RNAcentral is of interest because it is a de-duplicated, and thus reduced, collection of RNA sequences. The sequences are tested and classified according to Rfam. The only problem is that identical sequences may correspond to different taxonomic levels.
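The de-duplication issue described above can be illustrated with a minimal sketch. It is not the RNAcentral implementation: the input shape, the function names and the U-to-T normalization are assumptions made for illustration only. It groups identical sequences under one key and reports the keys that carry more than one taxonomic annotation.

```python
from collections import defaultdict


def group_identical(records):
    """Group (sequence, taxon) pairs by exact sequence.

    `records` is an iterable of (sequence, taxon_name) tuples; this input
    shape is a hypothetical one chosen for the example.
    """
    groups = defaultdict(set)
    for sequence, taxon in records:
        # Assumption: case and the RNA/DNA alphabet are normalized so that
        # identical molecules end up under the same key.
        key = sequence.upper().replace("U", "T")
        groups[key].add(taxon)
    return dict(groups)


def ambiguous_entries(groups):
    """Return the unique sequences annotated with more than one taxon."""
    return {seq: taxa for seq, taxa in groups.items() if len(taxa) > 1}


records = [
    ("ACGUACGU", "Escherichia coli"),
    ("acgtacgt", "Shigella flexneri"),
    ("GGGAAACC", "Bacillus subtilis"),
]
groups = group_identical(records)
conflicts = ambiguous_entries(groups)
```

Here the first two records collapse into a single entry that is attached to two different names, which is the situation that has to be resolved when building the taxonomy-aware databases.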
EMBL-ENA taxonomy (XML format). A MongoDB database is built from it and is used to access the data and to construct the normalized taxonomy/nomenclature hierarchy.
Type Strains DB
The sources of information concerning Type Strains sequences are now RefSeq and LPSN (DSMZ).
The version currently online is BIBI DB mk45 (December 2022).
Older versions of the DB: BIBI DB mk44 (June 2022), BIBI DB mk43 (January 2022).
Named vs unnamed
The databases contain sequences said (from their annotations) to come from a species that bears a name recognized as relevant in the nomenclature of Bacteria/Archaea.
The main source is the NCBI Taxonomy site, with some additions from LPSN. These sequences are identified as "Named" and gathered in the "Named" (sequences) databases.
The sequences corresponding to an unknown species or an unknown genus, most often recognizable by the "sp." abbreviation, are called "Unnamed", and the databases that contain these sequences are known here as "Unnamed" (sequences) databases.
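As an illustration of the Named/Unnamed split, here is a minimal sketch of a purely textual heuristic. It is an assumption made for this example, not the procedure actually used to build the databases, which relies on NCBI Taxonomy and LPSN. The function name `classify_name` and the rule itself are hypothetical.

```python
import re

# Hypothetical rule: a label is "Unnamed" when it contains the standalone
# token "sp." (e.g. "Bacillus sp. XYZ") or has no species epithet at all.
_SP_TOKEN = re.compile(r"(^|[\s_])sp\.($|[\s_])")


def classify_name(label):
    """Return "Named" or "Unnamed" for a species-level label."""
    parts = re.split(r"[\s_]+", label.strip())
    if _SP_TOKEN.search(label) or len(parts) < 2:
        return "Unnamed"
    return "Named"


labels = [
    "Escherichia_coli",
    "Bacillus sp. XYZ",
    "Vibrio_sp._AB1",
    "Lactobacillus casei subsp. casei",
]
classes = {label: classify_name(label) for label in labels}
```

The pattern requires whitespace or an underscore before "sp.", so that "subsp." is not mistaken for the unnamed marker.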
Compact vs extended
For each species, the sequences corresponding to a RefSeq genome (R) and/or to a Type species (T, "type-material" sequence) are identified. This procedure may be faulty because of the many difficulties in dealing with annotations, species-name changes, etc.
This process is completed by the definition of Prototype sequences.
For each species (or group at the same nomenclature level, even if unnamed), the corresponding sequences are clustered. A "Compact" database gathers the most relevant clusters in terms of representativity (proportion of the sequences of the species that fall in the cluster, absolute population of the cluster, statistical distribution of the population among the clusters of the species). This construction identifies a prototype "P" sequence for each such cluster. The number of "P" sequences is not limited.
Most of the clusters are not representative enough, especially because sequences are scarce for most species. The representative sequence of such a cluster is then tagged "p". If the database holds only a small number of sequences for the species, the clustering cannot be done; some of the sequences are then selected at random and tagged "Questionable", or "q".
A "compact" database contains all the RT/T/P sequences and a limited number of p and q sequences (P+p+q <= 5). An "extended" database contains a less limited number of p and q sequences (P+p+q <= 7).
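The compact/extended rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual BIBI code: the `(seq_id, tag)` input shape, the function name `select_sequences`, and the choice to fill the remaining slots with "p" sequences before "q" sequences are all hypothetical.

```python
# Upper bounds on P+p+q, as stated in the text.
LIMITS = {"compact": 5, "extended": 7}


def select_sequences(tagged, mode="compact"):
    """Select the sequences of one species for a compact or extended DB.

    `tagged` is a list of (seq_id, tag) pairs, tag in {"RT", "T", "P", "p", "q"}.
    """
    limit = LIMITS[mode]
    # All RT/T/P sequences are always kept.
    kept = [(sid, tag) for sid, tag in tagged if tag in ("RT", "T", "P")]
    # Remaining room for p and q so that P+p+q stays within the limit.
    budget = limit - sum(1 for _, tag in kept if tag == "P")
    # Assumption: "p" sequences are preferred over "q" ones.
    for wanted in ("p", "q"):
        for sid, tag in tagged:
            if tag == wanted and budget > 0:
                kept.append((sid, tag))
                budget -= 1
    return kept


species = [
    ("s1", "T"), ("s2", "P"), ("s3", "P"),
    ("s4", "p"), ("s5", "p"), ("s6", "p"), ("s7", "p"),
    ("s8", "q"), ("s9", "q"),
]
compact = select_sequences(species, "compact")
extended = select_sequences(species, "extended")
```

With this input, the compact selection keeps the T sequence, both P sequences and three p sequences, while the extended selection also admits the fourth p sequence and one q sequence.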
Looking for exhaustivity: Prototype sequences DB, low redundancy
It contains the sequences corresponding to Type-Material (RT+T : RefSeq Type-Material + Type-Strains).
If no RT/T sequence is available for a given name, 5 "Prototype" sequences are included. See Sequences classification
This DB contains 28,277 sequences corresponding to 20,798 different species-level names.
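The composition rule of this low-redundancy DB can be sketched as below. It is an illustrative assumption, not the actual build script: the input mapping, the function name `low_redundancy_entries` and the fact that prototypes are taken in input order are hypothetical.

```python
def low_redundancy_entries(by_name, n_prototypes=5):
    """Pick, for each name, its type-material sequences or else prototypes.

    `by_name` maps a species-level name to a list of (seq_id, tag) pairs.
    """
    result = {}
    for name, tagged in by_name.items():
        type_material = [sid for sid, tag in tagged if tag in ("RT", "T")]
        if type_material:
            # RT/T sequences exist: only these are kept for this name.
            result[name] = type_material
        else:
            # No type material: fall back to at most n_prototypes prototypes
            # (assumption: taken in the order they are listed).
            prototypes = [sid for sid, tag in tagged if tag in ("P", "p", "q")]
            result[name] = prototypes[:n_prototypes]
    return result


by_name = {
    "Genus_alpha": [("a1", "RT"), ("a2", "P"), ("a3", "p")],
    "Genus_beta": [("b%d" % i, "p") for i in range(1, 8)],
}
entries = low_redundancy_entries(by_name)
```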
Exploring variability: Named sequences compact/extensive DB
This corresponds to the low-redundancy Type-Material + Prototype sequences DB, with the addition of the P/p/q sequences for all the species (including those that already have a Type-Material sequence).
Using this DB makes it possible to explore the genomic variability (or the uncertainty of the sequences). Unfortunately, in very rare cases, sequences tagged "p/q" may be wrongly identified in the current databases.
Compact: Content: 67,102/78,586 sequences
Exploring diversity: Named+Unnamed sequences compact/extensive DB
This is the same database but the sequences corresponding to "sp." (genus is said to be defined, not the species level) are added. The clustering procedure is used to select a limited number of relevant prototype sequences.
Even if these "sp." prototype sequences are tagged "P/p/q", the tag is only indicative of the clustering quality.
Compact: Content: 87,196/91,014 sequences (without the selection of relevant sequences, the DB would contain more than 350,000 sequences)
Extensive: This is a big DB (380,007 sequences) containing all the sequences declared to come from a species with a "correct" (or correct-looking) name. This DB may be heavily redundant (for example, it contains more than 10,000 Escherichia_coli sequences).
Two supplementary DBs are used in the "optimized" phylogeny process:
The "Non Type-Material Prototype Sequences (compact)" contains 5 prototype sequences for each given name.
The "Non Type-Material Prototype + UnNammed (sp.) sequences (compact)" is the same with the selected prototypes for "sp." groups added.
These two DBs are not directly available, as they are only used to populate the tree around the position of the query sequence with a small number of sequences. Expert users may use a complete DB with the same overall composition, for instance the "All Named+Unnamed sequences compact" DB.