RiboDB : a prokaryotic ribosomal proteins DataBase
(June 15 2021)
This is the latest database build, with more genomes, extended biodiversity, a simpler scheme, and an improved final quality control of the ribosomal proteins.
You can list the genomes you want, count them and extract all the corresponding ribosomal proteins.
This is also a new version of the web interface.
There is no table of the ribosomal proteins content of the genomes here.
RIBODB V.3 CONTENT
RiboDB release 12 contains nucleic and protein sequences of ribosomal proteins from 167,908 genomes of Archaea and Bacteria. The aim of this work is to facilitate the use of ribosomal proteins in phylogeny. The database includes:
• All bacterial and archaeal genomes from RefSeq (120,548 genomes). Note that "highly populated species" (i.e. species with more than 1,000 genome sequences in RefSeq) are represented only by the RefSeq representative strains available on the RefSeq FTP site and by a random selection of the genomes for these species (26,558).
• 41,725 genomes collected from GenBank. For Bacteria, these are genomes bearing "species" names not found in RefSeq; for Archaea, these are all the genomes not found in RefSeq (see the detailed view of the RiboDB v11 content).
RiboDB allows two types of queries:
• The retrieval of information on strains and genomes for which ribosomal proteins are available in RiboDB. These queries must begin with the tag "@". For instance, "@Cyanobacteria" will return the list of, and information on, all Cyanobacteria (phylum) strains / genomes contained in RiboDB.
at the end of a query line, will return the number of strains / genomes for the corresponding taxon. For instance
will return the number of Escherichia coli strains / genomes contained in RiboDB.
• The retrieval of ribosomal protein nucleic and protein sequences for given sets of taxa or genomes. These queries must begin with the tag "#". For instance, "#Bacillus" will return the r-prot sequences of all Bacillus (genus) genomes contained in RiboDB.
The two types of queries are mutually exclusive.
RiboDB allows multiple queries of the same type at once, by listing the queries on separate lines.
For additional details, see "Structure of the FASTA commentary line" below.
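The query rules above (one query per line, "@" for information, "#" for sequences, no mixing of the two types) can be checked before pasting a batch into the form. The following is a minimal sketch only: the function name check_query_batch is hypothetical, and the checks are an assumption about what a valid submission looks like, not a description of how the RiboDB server parses its input.

```python
def check_query_batch(text):
    """Classify a multi-line query as 'information' (@) or 'sequences' (#).

    Hypothetical helper: it only mirrors the rules stated on this page.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        # The web form launches a random test when the field is empty.
        return "empty"
    kinds = set()
    for line in lines:
        if line.startswith("@"):
            kinds.add("information")
        elif line.startswith("#"):
            kinds.add("sequences")
        else:
            raise ValueError(f"query must begin with '@' or '#': {line!r}")
    if len(kinds) > 1:
        # The two query types are mutually exclusive.
        raise ValueError("'@' and '#' queries cannot be mixed in one submission")
    return kinds.pop()


print(check_query_batch("#Bacillus\n#Sodalis_praecaptivus"))  # sequences
```

A batch that mixes "@Bacillaceae-Bacillus" with "#Bacillus" raises a ValueError in this sketch, which reflects the mutual exclusivity of the two query types.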
By default, queries target strains / genomes from RefSeq and GenBank; for highly populated species, they return the representative strain(s) AND a random subset of the collection.
Checkboxes can be used to reduce this set of strains / genomes by targeting:
• type strain material
• representative/reference strains / genomes
• genomes from GenBank/RefSeq included in Ensembl! Bacteria
• Only the representative strains for highly populated species
Targeted genomes / taxa (use # to extract r-prots or @ to extract information). If empty, launches a random test.
Targeted r-prots (delete the unwanted r-prots and the associated semicolons)
R-prots are named according to Ban N, Beckmann R, Cate JHD, et al. A new system for naming ribosomal proteins. Current Opinion in Structural Biology, 2014, vol. 24, p. 165-169 (see also the Ban Lab website).
ul1;ul2;ul3;ul4;ul5;ul6;ul10;ul11;ul13;ul14;ul15;ul16;ul18;ul22;ul23;ul24;ul29;ul30; us2;us3;us4;us5;us7;us8;us9;us10;us11;us12;us13;us14;us15;us17;us19; bs1;bs6;bs16;bs18;bs20;bs21; bl9;bl12;bl17;bl19;bl20;bl21;bl25;bl27;bl28;bl31;bl32;bl33;bl34;bl35;bl36;cs23;bTHX; al45;al46;al47;el8;el13;el14;el15;el18;el19;el20;el21;el24;el30;el31;el32;el33;el34;el37;el38;el39;el40;el41;el42_44;el43; es1;es4;es6;es8;es17;es19;es24;es25;es26;es27;es28;es30;es31;p1_p2
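Editing the semicolon-separated list by hand is error-prone when only one family of r-prots is wanted. The sketch below shows one way to trim the list by name prefix before pasting it back into the field. The function name keep_rprots is hypothetical, the input is a short excerpt of the list above, and the output format (no trailing semicolon) is an assumption about what the form accepts.

```python
def keep_rprots(field, prefixes):
    """Keep only the r-prot names that start with one of the given prefixes."""
    names = [name.strip() for name in field.split(";") if name.strip()]
    kept = [name for name in names if name.startswith(tuple(prefixes))]
    return ";".join(kept)


# Short excerpt of the default list; "ul" and "us" are the universal
# large- and small-subunit proteins in the Ban et al. naming system.
excerpt = "ul1;ul2;us2;us3;bs1;bl9;al45;el8;es1"
print(keep_rprots(excerpt, ["ul", "us"]))  # ul1;ul2;us2;us3
```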
Selection of the subsets
Type strain material
Genomes included in Ensembl! Bacteria
EXCLUDE the extraction of a random sample from highly populated species from RefSeq/GenBank
Queries allow scanning "FASTA commentary lines" of ribosomal proteins contained in the database using keywords. The structure of "FASTA commentary lines" is described below.
Most relevant searches target fields corresponding to:
• Genus, Species, or lineage_report (e.g. #Sodalis_praecaptivus, @Bacillaceae-Bacillus)
• NCBI_Species_TaxID (e.g. #~1463164, @~208962)
• Genome_assembly_number (e.g. #GCF_900890425.1, @GCF_001027105.1)
To avoid any confusion among taxonomic ranks, use "-" at the end of the taxon name when querying RiboDB on lineage report information. Using "@Listeria" will retrieve information on both Listeria (genus) and Listeriaceae (family). To retrieve information on strains and genomes from the Listeria genus only, use "@Listeria-".
Similarly, use a "~" before the number when querying on TaxID (e.g. "@~1312852").
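The effect of the trailing "-" is easier to see on concrete lineage reports. The sketch below assumes that a keyword is matched as a plain, case-sensitive substring of the FASTA commentary line; this is an assumption made for illustration, inferred from the behaviour described above, and not a statement about the actual RiboDB search code. The function name keyword_matches is hypothetical, and the two lineage strings are illustrative examples written in the Domain-...-Species format.

```python
def keyword_matches(keyword, commentary_line):
    """Assumed matching rule: case-sensitive substring search."""
    return keyword in commentary_line


# Two illustrative lineage reports from the family Listeriaceae.
listeria = ("Bacteria-Firmicutes-Bacilli-Bacillales-"
            "Listeriaceae-Listeria-Listeria_monocytogenes")
brochothrix = ("Bacteria-Firmicutes-Bacilli-Bacillales-"
               "Listeriaceae-Brochothrix-Brochothrix_thermosphacta")

for keyword in ("Listeria", "Listeria-"):
    hits = [keyword_matches(keyword, line) for line in (listeria, brochothrix)]
    print(keyword, hits)
# Listeria [True, True]
# Listeria- [True, False]
```

Without the "-", the keyword also matches inside "Listeriaceae", so every genus of the family is returned; with it, only lineages that contain the genus-level entry "Listeria-" match.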
More generally, any information contained in "FASTA commentary lines" may be queried, but such queries can be risky or of little relevance.
For instance, querying the database with
will return information on Mycobacterium, Mycolicibacterium, Mycobacteroides, Mycolicibacter, and other Mycobacteriaceae (Actinobacteria), Mycoplasma (Mycoplasmatales), Mycoplana_dimorpha (an alphaproteobacterium), and Mycoavidus_cysteinexigens (a betaproteobacterium) strains contained in RiboDB.
will return information on Corynebacterium_amycolatum, Amycolatopsis, Streptomyces_antimycoticus, and Actinoplanes_awajinensis_subsp._mycoplanecinus (Actinobacteria), Bacillus_mycoides, Bacillus_paramycoides, Bacillus_pseudomycoides, and Mycoplasma_mycoides (Firmicutes).
Structure of the FASTA commentary line
FASTA commentary lines are built as follows:
: e.g. Pseudomonas_aeruginosa
: e.g. PAO1
genome_type [#T, #R, or #E]
with #T = genome tagged as type strain material in RefSeq or GenBank, #R = genome tagged as reference / representative genome in RefSeq, #E = genome listed in Ensembl! Bacteria
Genome_assembly_number: e.g. GCF_000006765.1
: e.g. NZ_002516.2
: indicates the position of CDS on the contig, with "C" indicating that the CDS is encoded on the reverse strand, e.g. C[4781985..4782680]
NCBI_Species_TaxID: corresponds to the species TaxID of the strain
: indicates the genetic code for the genome
Genome_source [#A, #B, #E]
with #A = genome from RefSeq, #B = genome from GenBank not present in RefSeq, #E = genome rejected from RefSeq due to an incomplete set of rRNA coding genes, ... , but reintroduced into RiboDB
Protein_evidence [#V or #H]
with #V = match between RiboDB and CDS annotations as ribosomal protein, #H = if the protein identified by RiboDB is annotated as ribosomal protein.
lineage_report = Domain-Phylum-Class-Order-Family-Genus-Species taxonomic ranks separated by "-", e.g. Bacteria-Proteobacteria-Gammaproteobacteria-Pseudomonadales-Pseudomonadaceae-Pseudomonas-Pseudomonas_aeruginosa.
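Two of the fields described above have enough internal structure to be worth parsing after download: the CDS position and the lineage. The sketch below is a minimal illustration under stated assumptions. The function names parse_position and parse_lineage are hypothetical; the forward-strand position is assumed to be written as the same bracketed range without the leading "C"; and the lineage is assumed to contain exactly seven ranks with no "-" inside a taxon name. The separator between fields of the full commentary line is not described here, so the sketch works on individual fields only.

```python
import re

RANKS = ("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
# "C" marks a CDS on the reverse strand, e.g. C[4781985..4782680].
POSITION = re.compile(r"^(C?)\[(\d+)\.\.(\d+)\]$")


def parse_position(field):
    """Split a position field into strand flag, start and end coordinates."""
    match = POSITION.match(field)
    if match is None:
        raise ValueError(f"unrecognised position field: {field!r}")
    return {
        "reverse_strand": match.group(1) == "C",
        "start": int(match.group(2)),
        "end": int(match.group(3)),
    }


def parse_lineage(field):
    """Map the seven "-"-separated taxon names to their ranks."""
    parts = field.split("-")
    if len(parts) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} ranks, got {len(parts)}")
    return dict(zip(RANKS, parts))


print(parse_position("C[4781985..4782680]"))
lineage = parse_lineage(
    "Bacteria-Proteobacteria-Gammaproteobacteria-Pseudomonadales-"
    "Pseudomonadaceae-Pseudomonas-Pseudomonas_aeruginosa"
)
print(lineage["Genus"], lineage["Species"])
```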
See for example: