
Published: 12 September 2019

Genomic Encyclopedia of Bacteria and Archaea (GEBA) VI: learning from type strains

William B Whitman A , Hans-Peter Klenk B , David R Arahal C , Rosa Aznar C , George Garrity D , Michael Pester E and Philip Hugenholtz F

A Department of Microbiology, University of Georgia, Athens, GA, USA. Email: whitman@uga.edu

B School of Natural and Environmental Sciences, Newcastle University, Newcastle upon Tyne, NE1 7RU, UK

C Colección Española de Cultivos Tipo (CECT), Universidad de Valencia, 46100 Burjassot (Valencia), Spain

D Department of Microbiology and Molecular Genetics, Michigan State University and NamesforLife, LLC, East Lansing, MI, USA

E DSMZ - German Collection of Microorganisms and Cell Cultures, Inhoffenstr. 7B, 38124 Braunschweig, Germany

F Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Qld 4072, Australia

Type strains of species are one of the most valuable resources in microbiology. During the last decade, the Genomic Encyclopedia of Bacteria and Archaea (GEBA) projects at the US Department of Energy Joint Genome Institute (JGI) and their collaborators have worked towards sequencing the genomes of all the type strains of prokaryotic species. A new project, GEBA VI, extends these efforts to functional genomics, including pangenome and transcriptome sequencing and exometabolite analyses. As part of this project, investigators with interests in specific groups of prokaryotes are invited to submit samples for analysis at JGI.

What are type strains? By definition, type strains are descendants of the original isolates that were the basis for species descriptions, as defined by the International Code of Nomenclature of Prokaryotes1. They exhibit all of the relevant phenotypic and genotypic properties cited in the original published taxonomic circumscriptions. Type strains are also deposited in public culture collections and are likely to remain available for the foreseeable future. Thus, the importance of type strains in nomenclature is only secondary to their value as a biological resource. By the rules of nomenclature, a type strain cannot be identical with any other type strain. Since 1987, the difference between type strains has generally been defined genetically2. In terms of genome sequences, this level of diversity is equivalent to about 70% DNA : DNA hybridisation and 95% average nucleotide identity (ANI) among the conserved DNA3,4. In terms of phenotypic similarity, species generally possess S values as defined by numerical taxonomy of >70%, which is close to the limit of significance5. Thus, any two properly described type strains must be appreciably different. If these same criteria were applied to mammals most of the primates would be members of the same species6.
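As a rough illustration of the 95% ANI guideline mentioned above, the following minimal sketch computes ANI as a length-weighted mean of percent identities over aligned genome fragments and compares it with the threshold. The `AlignedFragment` type, the function names and the length weighting are assumptions made for this example; established ANI tools differ in how they fragment, align and average, so this is not a substitute for them.

```python
from dataclasses import dataclass

# Species-level guideline quoted in the text: about 95% ANI.
SPECIES_ANI_THRESHOLD = 95.0


@dataclass
class AlignedFragment:
    """One aligned region shared by two genomes (hypothetical input format)."""
    length: int        # aligned length in base pairs
    identity: float    # percent nucleotide identity over that region


def average_nucleotide_identity(fragments: list[AlignedFragment]) -> float:
    """Length-weighted mean percent identity over all aligned fragments.

    Weighting by length is a simplifying assumption of this sketch.
    """
    total_length = sum(f.length for f in fragments)
    if total_length == 0:
        raise ValueError("no aligned sequence to compare")
    return sum(f.length * f.identity for f in fragments) / total_length


def likely_same_species(ani: float) -> bool:
    """Apply the ~95% ANI guideline; borderline values need expert review."""
    return ani >= SPECIES_ANI_THRESHOLD


if __name__ == "__main__":
    pair = [AlignedFragment(1000, 98.0), AlignedFragment(3000, 96.0)]
    ani = average_nucleotide_identity(pair)
    print(f"ANI = {ani:.1f}%, same species: {likely_same_species(ani)}")
```

Two type strains compared in this way would be expected to fall below the threshold, because a type strain cannot belong to the same species as another type strain.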

During the last decade, the Genomic Encyclopedia of Bacteria and Archaea (GEBA) projects at the US Department of Energy Joint Genome Institute (JGI) and their collaborators have undertaken a program to sequence the genomes of all the type strains of prokaryotic species. As of February 2019, 16 232 validly named prokaryotic species have been formally described, most of which are represented by type strains. The genomes of 7647 of these have been sequenced, 3145 by the GEBA projects. Another 2300 genome sequencing projects are currently underway at JGI. In addition, the World Data Centre for Microorganisms (WDCM) of the World Federation of Culture Collections (WFCC) began an additional project called GCM2.0 to sequence the genomes of the remaining prokaryote type strains7. This project parallels the efforts of GEBA, and the projects are closely coordinated to ensure that there is no overlap in sequencing efforts. Currently, the WFCC has completed or has in progress more than 2600 genome sequences of type strains (www.gcm.wdcm.org).
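The February 2019 figures quoted above can be turned into approximate proportions. The short sketch below does the arithmetic. It uses the number of validly named species as the denominator, which is an assumption: the text notes that only most, not all, of these species are represented by type strains, so the percentages are estimates.

```python
# Counts quoted in the text (as of February 2019).
described_species = 16_232       # validly named prokaryotic species
sequenced_type_strains = 7_647   # type-strain genomes sequenced
sequenced_by_geba = 3_145        # of those, sequenced by the GEBA projects


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Assumption: every described species is counted once in the denominator.
fraction_sequenced = percent(sequenced_type_strains, described_species)
geba_share = percent(sequenced_by_geba, sequenced_type_strains)
not_yet_sequenced = described_species - sequenced_type_strains

print(f"Sequenced: {fraction_sequenced:.1f}% of described species")
print(f"GEBA share of sequenced genomes: {geba_share:.1f}%")
print(f"Species without a type-strain genome: {not_yet_sequenced}")
```

On these figures, a little under half of the described species had a sequenced type strain, and roughly two-fifths of those genomes came from the GEBA projects. The projects under way at JGI and the WFCC are not subtracted here because the text does not say how many of them were already counted as complete.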

In 2013, the GEBA project began sequencing prospective type strains during their formal description8. The WDCM provides a similar service to make genome sequencing readily available to laboratories worldwide9. Moreover, all three of the major microbial systematics journals, International Journal of Systematic and Evolutionary Microbiology, Systematic and Applied Microbiology and Antonie van Leeuwenhoek, now either require or strongly recommend including genome sequences in the descriptions of novel species. Minimum standards for the use of genome data for taxonomy of prokaryotes have also been proposed10. For these reasons, we expect that genome sequences will be included in the description of most new species in the future. This change in policy ensures that the number of type strains without sequences will no longer increase. The major challenge then is sequencing the genomes of the type strains that were previously described without a genome sequence. However, given the abovementioned initiatives, it is now reasonable to expect that the remaining type strains will be sequenced in the near future. When this goal is realised, this biological resource can be fully utilised.

Why sequence type strains? In the larger context, only a small portion of prokaryotic diversity has ever been cultured. The genomes of all known type strains are estimated to represent no more than 15% of the total diversity11. However, given the enormity of prokaryotic diversity, this is not an insignificant fraction. Our knowledge of the remaining organisms comes largely from metagenome sequencing of environmental DNA. Because the type strains are well characterised, their genome sequences provide the framework for inferring the biological properties of uncultured prokaryotes from their genome sequences. Specifically, this framework will help address a number of complementary questions. How is gene content related to function? Is the presence of certain genes and combinations of genes a strong predictor of phenotype? What properties of prokaryotes are predicted from their phylogeny? In the absence of clear understanding of the functional annotations of genes, what properties can be predicted from those of their relatives? Can probabilistic models be developed to express the likelihood of organismal properties from gene content and phylogeny? Currently, the data to address these issues do not exist at a scale that will generalise to all prokaryotic life.

There are also a number of other, very different but equally valid, reasons to sequence the genomes of microbial type strains. As more genomes become available for specific groups, the applications of genome-based systematics are revolutionising the classification of prokaryotes. Genomes provide more reliable and complete data, and allow formation of more meaningful groupings of higher taxa. For instance, genome sequences suggest that the NCBI taxonomy, which is based largely on 16S rRNA sequences, contains a large number of misclassifications within the Clostridia and Bacteroidetes (Figure 1). Already genome sequences collected in large part by the GEBA projects have led to major taxonomic revisions of the Actinobacteria, Bacteroidetes, Epsilonproteobacteria, Geodermatophilaceae and the Rhodobacteraceae13–17.

Figure 1.  Comparisons of the NCBI classification based largely on 16S rRNA sequencing and the GTDB classification based upon 120 genes obtained from genome sequences. (a) Comparison of NCBI (left) and GTDB (right) order-level classifications of the 2368 bacterial genomes assigned to the class Clostridia in the GTDB taxonomy. Genomes classified in a class other than Clostridia by NCBI are indicated in parentheses. (b) Comparison of NCBI and GTDB class-level classifications of the 2058 bacterial genomes assigned to the phylum Bacteroidetes in the GTDB taxonomy. Genomes classified in a phylum other than the Bacteroidetes by NCBI are indicated in parentheses. Figure reproduced from Park et al.12.

Because they are phylogenetically diverse, the GEBA genomes are also very useful for identifying metagenome sequences from environmental DNA18. For instance, the first thousand GEBA genomes enabled classification of more than 25 million proteins from metagenomes. The same genomes led to a >10% increase in the known protein sequence diversity in prokaryotes18,19. Most type strains were described because of their environmental, medical or commercial importance. Their genomic sequences will contribute greatly to our knowledge of the processes in which these prokaryotes play fundamental roles. Identification of prokaryotes is still a major challenge that hinders many practical applications. Genomic sequencing of type strains provides tools that greatly facilitate identification and classification.

While continuing the efforts to sequence the genomes of type strains, the newest GEBA project, GEBA VI, will go beyond genome sequencing to initiate the next stage of utilisation of this important biological resource. Because their phenotypic and physiological properties are typically well characterised, type strains are well suited for studies of functional genomics, which unite genome content with the biological properties of bacteria and archaea. The proposed studies will systematically investigate genome function of type strains by a combination of (1) genome sequencing to determine gene content, (2) transcriptomics to elucidate gene expression, (3) exometabolomics (analysis of secreted metabolites) to examine function, and (4) pangenomics to identify the core genome and evolutionary processes.

The rationale for these projects is as follows. While genome sequences provide valuable insights into phylogeny and systematics, they are only the first layer of information available from type strains. Organism function depends critically upon gene expression. Transcriptome studies will identify the highly expressed genes central to an organism’s growth and metabolism. Many of these genes are expected to be ‘character’ genes, which play critical roles in an organism’s specific adaptation to the environment20. Exometabolomics provides a different view of function, and spent culture media will be screened for nonpolar as well as polar metabolites. Nonpolar metabolites are expected to include many secondary metabolites, such as antibiotics, polyketides and phenolics. These compounds play fundamental roles in cell-cell signaling, competition with other microorganisms, metal uptake and other important biological functions. Polar metabolites are expected to include amino acids, sugars, small organic acids and other hydrophilic compounds. These analyses will identify the components of complex culture media that are consumed, providing direct evidence for the culture’s metabolism. They will also identify compounds produced during growth by incomplete metabolism or fermentation.

Lastly, the genome content of any specific strain does not fully represent the gene content of the species21,22. Sequencing of closely related strains will identify the core and pangenome as well as the nature of evolutionary processes occurring within the taxon. The core genome is of special importance in the description of a species because it encodes those properties that are conserved across all members20. Thus, it captures the phenotypic basis for the species, including those factors that are responsible for speciation and adaptation to its environment. The dispensable genome is also interesting and provides insight into strain-specific differences. For instance, in many pathogenic species, the dispensable genome encodes processes associated with host interactions. Finally, the pangenome provides important insights into the frequency of horizontal gene transfer and sources of diversity within a species23. For instance, a pangenome can be either ‘closed’ or ‘open’. The number of genes in a closed pangenome increases to a maximum value as the number of strains increases, suggesting that there is a finite limit to the genetic diversity within a species. The number of genes in an open pangenome is unbounded, and the genetic diversity is, in theory, unlimited.
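The relationship between the core, dispensable and pan genomes can be illustrated with simple set operations. The sketch below assumes that each strain has already been reduced to a set of gene-family identifiers. The strain names, gene names and function names are hypothetical, and real analyses first need an orthologue-clustering step that is not shown here.

```python
def core_and_pangenome(strain_genes: dict[str, set[str]]):
    """Split gene families into core, dispensable and pan genome.

    strain_genes maps a strain name to the set of gene families it carries
    (an assumed, pre-computed input).
    """
    gene_sets = list(strain_genes.values())
    if not gene_sets:
        raise ValueError("at least one strain is required")
    pan = set().union(*gene_sets)            # present in any strain
    core = set.intersection(*gene_sets)      # present in every strain
    dispensable = pan - core                 # present in some, not all
    return core, dispensable, pan


def pangenome_accumulation(strain_genes: dict[str, set[str]]) -> list[int]:
    """Pangenome size after adding each strain, in the given order."""
    seen: set[str] = set()
    sizes = []
    for genes in strain_genes.values():
        seen |= genes
        sizes.append(len(seen))
    return sizes


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    strains = {
        "type_strain": {"g1", "g2", "g3"},
        "strain_B": {"g1", "g2", "g4"},
        "strain_C": {"g1", "g2", "g3"},
    }
    core, dispensable, pan = core_and_pangenome(strains)
    print(sorted(core), sorted(dispensable), sorted(pan))
    print(pangenome_accumulation(strains))
```

A curve that flattens as strains are added is consistent with a closed pangenome, whereas one that keeps rising is consistent with an open one. With only a handful of strains the distinction is tentative, because the curve also depends on the order in which strains are added. Published analyses typically average over many orderings and fit a growth model.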

To realise these goals, GEBA VI welcomes contributions of samples from individual investigators (Figure 2). To participate, submit a description of your project at our website (https://gold.jgi.doe.gov/gebaVI) or send an email to whitman@uga.edu (include ‘GEBA VI’ in the subject line). The types of projects being considered are genome sequencing of type strains, pangenome sequencing of species poorly represented in the databases, and functional genomics of type strains. The GEBA VI project will support two types of experimental designs, although other good ideas are encouraged.

The first will be in-depth studies of single species, including characterisation of the pangenome or transcriptomes and exometabolomes under different growth conditions. In studies of this type, the genome of the type strain may have been sequenced in previous studies. We expect that individual investigators will be the primary contributors for these studies, and the goal is to provide sufficient data for a substantive publication. In this design, the project will support transcriptome studies of 3–4 replicates of up to four conditions. Exometabolome analyses will be dependent upon the suitability of the culture medium for analysis by mass spectrometry. They may include analysis of the spent medium of up to four conditions for nonpolar metabolites and two conditions for polar metabolites. The pangenome is expected to comprise 5–10 reference strains.

The second type of experimental design will comprise comparative studies of groups of related type strains and surveys of genome sequences, transcriptomes and exometabolomes. These designs will try to encompass collections of related species or genera. Type strains will be cultured under identical conditions for preparation of samples for the transcriptomes and exometabolomes.

Figure 2.  Workflow of investigator-initiated GEBA VI projects.

Our primary intention is to survey a large amount of phylogenetic diversity, and, for that reason, these studies will not be comprehensive. For instance, complete sampling of a pangenome may require hundreds of sequences. The transcriptome can only be partially elucidated in the limited studies proposed here. Similarly, production of various classes of exometabolites often depends critically on the cultivation conditions (e.g. media, time, temperature, limiting nutrients, oxygenation, pH/buffering). Historically, screening programs use batteries of complex media designed to trigger secondary metabolite production based on nutrient limitations, and the spent medium is extracted using a battery of solvents ranging in polarity and protonation. The cell pellet is also often extracted to recover bound metabolites. Thus, the studies proposed here are unlikely to uncover the full range of exometabolites. Nevertheless, these studies will identify candidates for subsequent in-depth studies and provide valuable comparative information.

Conflicts of interest

The authors declare no conflicts of interest.


Acknowledgements

The authors are grateful to Nikos Kyrpides, Tanja Woyke and Trent Northen for helpful discussions. This research did not receive any specific funding.


Biographies

William (Barny) Whitman is an Emeritus Professor of Microbiology at the University of Georgia, USA, and trustee of Bergey’s Manual. His research interests are prokaryotic systematics and the physiology of the methanogenic Archaea.

Hans-Peter Klenk is an Emeritus Professor of Microbial Genomics and Diversity at the University of Newcastle, Newcastle upon Tyne, UK, and former Head of the School of Biology there. His research interests are Actinobacteria, phylogenomics and prokaryotic systematics in general.

David R Arahal is an Associate Professor of Microbiology at the University of Valencia, Spain, and senior researcher at the Spanish Type Culture Collection (CECT). His research interests are taxonomy and genomics of prokaryotes, especially those from aquatic environments.

Rosa Aznar is a Full Professor of Microbiology at the University of Valencia, Spain, and Director of the Spanish Type Culture Collection (CECT). Her research interests and background focus on identification and characterisation of bacteria related to food quality, food safety and functional food.

George Garrity is a Professor in the Department of Microbiology and Molecular Genetics at Michigan State University and Co-founder of NamesforLife, LLC, a bioinformatics and publishing services company that was founded to commercialise a novel semantic technology developed at Michigan State University. Prior to joining Michigan State, he held a number of positions of increasing responsibility in the natural products screening program at Merck & Co.

Michael Pester is a Professor at the Technical University of Braunschweig and heads the Department of Microbiology at the Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures. His research interests are environmental microbiology and geomicrobiology of sulfur and nitrogen cycling.

Philip Hugenholtz is a microbiologist who has made contributions in the field of culture-independent analysis of microorganisms. He has contributed to the development and application of metagenomics, the genome-based characterisation of microbiomes, including their taxonomic classification.
