RiboDB : a prokaryotic ribosomal proteins DataBase

RiboDB v.3.0, release 11.2, (June 15 2021)
This is the latest build of the database, with more genomes, extended biodiversity, a simpler scheme, and improved final quality control of the ribosomal proteins.
You can list the genomes you want, count them and extract all the corresponding ribosomal proteins.

This is also a new version of the web interface.
There is no table of the ribosomal proteins content of the genomes here.
RIBODB V.3 CONTENT
RiboDB release 12 contains nucleic and protein sequences of ribosomal proteins from 167,908 genomes of Archaea and Bacteria. The aim of this work is to facilitate the use of ribosomal proteins in phylogeny. The database includes:
     • All bacterial and archaeal genomes from RefSeq (120,548 genomes). Note that "highly populated species" (i.e., species with more than 1,000 genome sequences in RefSeq) are represented only by the RefSeq representative strains available on the RefSeq FTP site and by a random selection of the genomes for these species (26,558).
     • 41,725 genomes collected from GenBank. For Bacteria, these are genomes bearing a species name not found in RefSeq; for Archaea, they are all the genomes not found in RefSeq (see the detailed view of the RiboDB v11 content).
QUERY CONSTRUCTION
RiboDB allows two types of queries:
     • The retrieval of information on strains and genomes for which ribosomal proteins are available in RiboDB. These queries must begin with the tag "@". For instance, "@Cyanobacteria" will return the list of all Cyanobacteria (phylum) strains/genomes contained in RiboDB, with the associated information.
           • Adding "%" at the end of a query line will return the number of strains / genomes for the corresponding taxon. For instance, "@Escherichia_coli %" will return the number of Escherichia coli strains / genomes contained in RiboDB.
     • The retrieval of ribosomal protein nucleic and protein sequences for given sets of taxa or genomes. These queries must begin with the tag "#". For instance, "#Bacillus" will return the r-prot sequences of all Bacillus (genus) genomes contained in RiboDB.
The two types of queries are mutually exclusive.
RiboDB allows multiple queries of the same type at once, by listing them on separate lines (for example, the two "@" queries below, or the two "#" queries):
     • @Bacteria
     • @Archaea
     • #Streptococcus_pneumoniae
     • #Bacillus_subtilis
For additional details see (below) "Query construction" and "Structure of the FASTA commentary line"
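
The query rules above can be sketched as a small client-side check. This is only an illustration, not part of RiboDB: the function names `build_query_batch` and `query_mode` are hypothetical, replacing spaces with underscores is an assumption based on the examples ("@Escherichia_coli"), and restricting "%" to "@" queries is an assumption based on where the text documents it.

```python
def build_query_batch(terms, mode, count_only=False):
    """Build a multi-line query text from plain taxon or genome names.

    mode is "@" (strain/genome information) or "#" (r-prot sequences).
    count_only appends " %" to each line, as documented for "@" queries.
    """
    if mode not in ("@", "#"):
        raise ValueError("mode must be '@' or '#'")
    if count_only and mode != "@":
        # Assumption: "%" is only documented for "@" queries.
        raise ValueError("'%' is only documented for '@' queries")
    lines = []
    for term in terms:
        # Assumption: names are written with underscores, as in the examples.
        term = term.strip().replace(" ", "_")
        if not term:
            continue
        lines.append(mode + term + (" %" if count_only else ""))
    return "\n".join(lines)


def query_mode(batch):
    """Return the single tag used by a batch, or raise if tags are mixed."""
    modes = {line.strip()[0] for line in batch.splitlines() if line.strip()}
    if len(modes) != 1 or not modes <= {"@", "#"}:
        raise ValueError("every line must start with the same tag, '@' or '#'")
    return modes.pop()


if __name__ == "__main__":
    print(build_query_batch(["Bacteria", "Archaea"], "@"))
    print(build_query_batch(["Escherichia coli"], "@", count_only=True))
```

Checking the batch before submitting it makes the "mutually exclusive" rule explicit: a text mixing "@" and "#" lines is rejected locally rather than sent to the server.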

Options
By default, queries target strains / genomes from RefSeq and GenBank; for highly populated species, they return the representative strain(s) AND a random subset of the collection.

Checkboxes can be used to reduce this set of strains / genomes by targeting:
     • type strain material
     • representative/reference strains / genomes
     • genomes from GenBank/RefSeq included in Ensembl! Bacteria
     • Only the representative strains for highly populated species
Targeted genomes / taxa (use # to extract r-prots or @ to extract information). If empty, launches a random test.
Targeted r-prots (delete the unwanted r-prots and the associated semicolons). R-prots are named according to Ban N, Beckmann R, Cate JHD, et al. A new system for naming ribosomal proteins. Current Opinion in Structural Biology, 2014, vol. 24, p. 165-169 (see also the Ban Lab website).
Options Selection of the subsets

Additional information
Queries allow scanning the "FASTA commentary lines" of the ribosomal proteins contained in the database using keywords. The structure of "FASTA commentary lines" is described below.

Most relevant searches target fields corresponding to:
     • Genus, Species, or lineage_report (e.g. #Sodalis_praecaptivus, @Bacillaceae-Bacillus)
     • NCBI_Species_TaxID (e.g. #~1463164, @~208962)
     • Genome_assembly_number (e.g. #GCF_900890425.1, @GCF_001027105.1)

To avoid confusion among taxonomic ranks, add "-" at the end of the taxon name when querying RiboDB on lineage report information. For instance, "@Listeria" will retrieve information on both Listeria (genus) and Listeriaceae (family). To retrieve information only on strains and genomes from the Listeria genus, use "@Listeria-".
Similarly, prefix the TaxID with "~" when querying on TaxID (e.g. "@~1312852").
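
As a rough illustration of the three most relevant search targets, the sketch below classifies a query string by its shape. The function `classify_target` and the kind labels are hypothetical. The assembly pattern accepts both "GCF_" (the prefix used in the examples) and "GCA_", the standard prefix for GenBank assemblies, which the text does not show and is included here as an assumption.

```python
import re

# Assumption: assembly numbers follow the NCBI pattern GCF_/GCA_ + 9 digits + version.
ASSEMBLY_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def classify_target(query):
    """Split a query into (tag, kind, value), with kind in
    {"taxid", "assembly", "name"}."""
    query = query.strip()
    if not query or query[0] not in "@#":
        raise ValueError("a query must begin with '@' or '#'")
    tag, term = query[0], query[1:].strip()
    if term.startswith("~"):
        # "~" marks an NCBI species TaxID, e.g. "#~1463164".
        if not term[1:].isdigit():
            raise ValueError("'~' must be followed by a numeric TaxID")
        return tag, "taxid", term[1:]
    if ASSEMBLY_RE.match(term):
        return tag, "assembly", term
    # Anything else is treated as a genus, species, or lineage keyword.
    return tag, "name", term


if __name__ == "__main__":
    for q in ("#~1463164", "@GCF_001027105.1", "@Listeria-", "#Sodalis_praecaptivus"):
        print(q, "->", classify_target(q))
```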

More generally, any information contained in "FASTA commentary lines" may be queried, but such queries can be risky or return poorly relevant results.
For instance, querying the database with "@Myco" will return information on Mycobacterium, Mycolicibacterium, Mycobacteroides, Mycolicibacter, and other Mycobacteriaceae (Actinobacteria), Mycoplasma (Mycoplasmatales), Mycoplana_dimorpha (an alphaproteobacterium), and Mycoavidus_cysteinexigens (a betaproteobacterium) strains contained in RiboDB.
Similarly, "@myco" will return information on Corynebacterium_amycolatum, Amycolatopsis, Streptomyces_antimycoticus, and Actinoplanes_awajinensis_subsp._mycoplanecinus (Actinobacteria), as well as Bacillus_mycoides, Bacillus_paramycoides, Bacillus_pseudomycoides, and Mycoplasma_mycoides (Firmicutes).
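
The behaviour described above is consistent with a case-sensitive substring search over the commentary lines. The sketch below reproduces it on a handful of hand-written lineage reports. Both the matching rule and the lineages are assumptions made for illustration; they are not taken from the RiboDB implementation or its data.

```python
# Illustrative lineage reports, written by hand for this example (not RiboDB output).
LINEAGES = [
    "Bacteria-Firmicutes-Bacilli-Bacillales-Listeriaceae-Listeria-Listeria_monocytogenes",
    "Bacteria-Firmicutes-Bacilli-Bacillales-Listeriaceae-Brochothrix-Brochothrix_thermosphacta",
    "Bacteria-Firmicutes-Bacilli-Bacillales-Bacillaceae-Bacillus-Bacillus_mycoides",
    "Bacteria-Actinobacteria-Actinobacteria-Corynebacteriales-Mycobacteriaceae-"
    "Mycobacterium-Mycobacterium_tuberculosis",
]


def species_hits(keyword, lineages=LINEAGES):
    """Return the species (last lineage field) of every lineage that
    contains keyword as a case-sensitive substring."""
    return [lin.rsplit("-", 1)[-1] for lin in lineages if keyword in lin]


if __name__ == "__main__":
    # "Listeria" also matches the family name "Listeriaceae".
    print(species_hits("Listeria"))
    # The trailing "-" only matches where "Listeria" ends a rank field.
    print(species_hits("Listeria-"))
    # Case matters: "Myco" and "myco" hit different names.
    print(species_hits("Myco"))
    print(species_hits("myco"))
```

Under this assumption the trailing "-" works because ranks are joined by "-" in the lineage report: "Listeria-" occurs in "Listeria-Listeria_monocytogenes" but not in "Listeriaceae-".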
Structure of the FASTA commentary line
FASTA commentary lines are built as follows:
     • Genus_species: e.g. Pseudomonas_aeruginosa
     • strain_ID: e.g. PAO1
     • genome_type [#T, #R, or #E] with #T = genome tagged as type strain material in RefSeq or GenBank, #R = genome tagged as reference / representative genomes in RefSeq, #E = genome listed in Ensembl! Bacteria
     • genome_assembly_number: e.g. GCF_000006765.1
     • contig_number: e.g. NC_002516.2
     • position: indicates the position of CDS on the contig, with "C" indicating that the CDS is encoded on the reverse strand, e.g. C[4781985..4782680]
     • NCBI_species_TaxID: corresponds to the species TaxID of the strain
     • Genetic_code: indicates the genetic code for the genome
     • Genome_source [#A, #B, #E] with #A = genome from RefSeq, #B = genome from GenBank not present in RefSeq, #E = genome rejected from RefSeq due to an incomplete set of rRNA coding genes, ... , but reintroduced into RiboDB
     • Protein_evidence [#V or #H] with #V = match between RiboDB and CDS annotations as ribosomal protein, #H = if the protein identified by RiboDB is annotated as ribosomal protein.
     • Lineage_report = Domain-Phylum-Class-Order-Family-Genus-Species taxonomic ranks separated by "-": e.g. Bacteria-Proteobacteria-Gammaproteobacteria-Pseudomonadales-Pseudomonadaceae-Pseudomonas-Pseudomonas_aeruginosa.
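
Two of the fields listed above have a format that is explicit enough to parse. The sketch below handles the position field and the lineage report only; the separator between the fields of a full commentary line is not described here, so whole-line parsing is deliberately left out. The names `parse_position` and `parse_lineage` are hypothetical, and the forward-strand form "[start..end]" (without the "C" prefix) is an assumption.

```python
import re
from typing import NamedTuple


class Position(NamedTuple):
    start: int
    end: int
    reverse: bool  # True when the CDS is on the reverse strand ("C" prefix)


# Assumption: forward-strand positions are written "[start..end]" without "C".
POSITION_RE = re.compile(r"^(C?)\[(\d+)\.\.(\d+)\]$")

RANKS = ("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")


def parse_position(text):
    """Parse a position such as "C[4781985..4782680]"."""
    match = POSITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised position: {text!r}")
    return Position(int(match.group(2)), int(match.group(3)), match.group(1) == "C")


def parse_lineage(report):
    """Map a "-"-separated lineage report to its seven ranks.

    Splitting at most six times keeps any hyphen inside the species name;
    a hyphen inside a higher rank would still break this simple split.
    """
    parts = report.strip().rstrip(".").split("-", len(RANKS) - 1)
    if len(parts) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} ranks, got {len(parts)}")
    return dict(zip(RANKS, parts))


if __name__ == "__main__":
    print(parse_position("C[4781985..4782680]"))
    lineage = parse_lineage(
        "Bacteria-Proteobacteria-Gammaproteobacteria-Pseudomonadales-"
        "Pseudomonadaceae-Pseudomonas-Pseudomonas_aeruginosa"
    )
    print(lineage["Genus"], lineage["Species"])
```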
See for example:

Questions: jpdotflandroisatuniv-lyon1dotfr
