Department of Microbiology
The JHJ-Lab

Report No. TR 99-01

Julius H. Jackson

Terminologies for Gene & Protein Similarity


    This explanation of current and popular genomic jargon is prepared primarily as an aid to undergraduates at Michigan State University who are new to this area of research, and especially those beginning work in the JHJackson lab.  As such, this treatment is informal by design, in an effort to make the information maximally accessible.

    The intent of scientific vocabulary or terminology is to create a precision of definition such that any topic may be discussed with unambiguous meaning.  In the midst of the current information explosion resulting from new sequence acquisitions, terminologies are continuously being constructed to characterize genes and proteins according to knowledge or inferences about their origins and activities.  Here is a partial listing of terms that have come into common use to define or describe gene and protein similarity.  A brief definition or description accompanies each term.  These few examples should illustrate that terms used can easily become terms confused, and that an expansion of terms frequently does not expand precision of understanding.  A key source of confusion is that information is expanding much more rapidly than the knowledge derived from it.  The continued addition of new terms truly expands information ... and sometimes knowledge.


Heterologs.  {Heterologs differ in both origin and activity.}  Genes that are "unique" in activity and sequence are said to be heterologous.   Note that genes initially defined as heterologous by syntax (letter matching) may actually be homologous by activity.

Homologs.  {Homologs have common origins but may or may not have common activity.}  Genes whose similarity, determined by alignment of matching bases, meets an arbitrarily chosen threshold are termed homologous.  Similarity is a quantitative term that defines the degree of sequence match between two compared sequences.  Homology is a qualitative term that describes a relationship between genes and is inferred from the quantitative similarity.  Homology implies that the compared sequences diverged in evolution from a common origin.  Two aligned genes or segments of sequence that are homologous may have varying degrees of similarity based upon identical base matches in the alignment.  In the first sequence alignment in the following figure, the sequences are identical and therefore exhibit 39 matches out of 39 positions aligned, or 100% similarity.  In the second alignment, the aligned sequences contain 28 matches out of 39 possible.  The quantitative match or degree of similarity is then 28/39, or about 72%.  In both cases the sequences are homologous.


|||||||||||||||||||||||||||||||||||||||  39 of 39 matches


||||||       |||||| |||||||| |||||| ||     28 of 39 matches

Fig. 1

Homologous sequences are termed homologs and this term may be applied to both genes and proteins.  Homologs look similar to each other and appear to share common ancestry but they may or may not display the same activity. 
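
The percent-similarity arithmetic used with Fig. 1 can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the sequences are already aligned and of equal length, similarity is counted as identical bases per aligned position, and the function names and the example sequence are hypothetical (the sequence is not the one shown in Fig. 1).

```python
def percent_similarity(aligned_a: str, aligned_b: str) -> float:
    """Return the percentage of aligned positions with identical bases."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be the same length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b)
    return 100.0 * matches / len(aligned_a)


def substitute(seq: str, positions: set[int]) -> str:
    """Return a copy of seq with a different base at each listed position."""
    swap = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(swap[base] if i in positions else base
                   for i, base in enumerate(seq))


# Hypothetical 39-base sequence, used only to reproduce the arithmetic.
reference = "ATGGCTAGCAAAGGTGAACTGTTCACCGGTGTTGTTCCG"
# Introduce 11 mismatches, leaving 28 matching positions out of 39.
variant = substitute(reference, {6, 7, 8, 9, 10, 11, 12, 19, 28, 35, 38})

print(round(percent_similarity(reference, reference)))  # 100
print(round(percent_similarity(reference, variant)))    # 72
```

Note that the number says nothing by itself about homology; that judgment still depends on the threshold one chooses to apply.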

Analogs. {Analogs have common activity but not common origin.}  Genes or proteins that display the same activity but lack sufficient similarity to imply common origin are said to have analogous activity.  The implication is that analogous proteins followed evolutionary pathways from different origins to converge upon the same activity.  Thus, analogous genes or proteins are considered a product of convergent evolution.   Analogs have homologous activity but heterologous origins. 

Paralogs.  {Paralogs are homologs produced by gene duplication.}  Homologous genes produced by gene duplication are termed paralogous.  Paralogous genes are homologous genes that result from divergent evolution from a common ancestral gene.  Paralogous implies that gene duplication and divergence occurred within the same organism/species and divergence of sequence led to divergence of activity.  Paralogs have homologous origin but heterologous activities.  

Orthologs.  {Orthologs are homologs produced by speciation.}  When speciation follows duplication, and one homolog sorts with one species and the other copy with the other species, subsequent divergence of the duplicated sequence is associated with one or the other species.  Such species-specific homologs are termed orthologous.  Thus, orthologs are homologs from duplication that precedes speciation, followed by divergence of sequence but not activity in separate species.  Orthologs have homologous origin and homologous activity.

Xenologs.  {Xenologs are homologs resulting from horizontal gene transfer.}  The determination of whether a gene of interest was recently transferred into the current host by horizontal gene transfer is frequently non-trivial.  Occasionally the %G+C content may be so vastly different from the average gene in the current host that a conclusion of external origin is nearly inescapable.  Absent such a sore thumb, codon usage bias might provide a clue, but interpretation of such data presents challenges, especially in sorting out whether differences are significant, and whether they reflect the relative state of gene expression or actually a gene from another world.
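
The %G+C comparison mentioned above can be sketched as follows.  This is a minimal illustration, not a test for horizontal transfer: the function names, the example gene, and the host average of 50.0% are all hypothetical, and deciding how large a deviation counts as a "sore thumb" is left to the reader.

```python
def gc_percent(seq: str) -> float:
    """Return the percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


def gc_deviation(gene: str, host_gc_percent: float) -> float:
    """Return gene %G+C minus the host's average gene %G+C."""
    return gc_percent(gene) - host_gc_percent


# Hypothetical AT-rich gene fragment and a hypothetical host average.
gene = "ATTATAAATTTAGCATTAAT"
host_average = 50.0  # assumed value, for illustration only

print(gc_percent(gene))
print(gc_deviation(gene, host_average))
```

A large deviation is only a clue; as noted above, a modest difference is much harder to interpret.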

Discussion & Conclusions

For most purposes three buckets may be found both necessary and sufficient to classify genes or proteins in order to address questions beyond the evolutionary pathway(s) to the contemporary sequence.  The three buckets are:

  • heterologous 
  • homologous 
  • analogous 
These three states require only low-level knowledge from similarity measurements to classify sequences, and provide sufficient information to use in most of our applications.  Some applications may benefit from further subclassifications, but it would be advisable to weigh the benefits derived against the time required for the intended application.  Although terms such as paralogous and orthologous are used with apparent confidence to describe sets of genes, it is rarely possible to know whether genes with similarity of sequence and different activity resulted from duplication and divergence within that organism/species, or arose by recombination, or evolved to activity elsewhere and moved into the current space by horizontal transfer, etc.  A word of caution against routine dithering of sequences into sub-buckets is that more knowledge is required for accurate placement, and inaccurate placements can eventually lead to major confusion in the field.
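
The three-bucket scheme can be expressed as a small decision rule.  The sketch below follows the definitions given earlier in this report; the function name is hypothetical, and the similarity threshold is deliberately left as a required argument because the report describes it as arbitrary (the value 30.0 in the example calls is an assumption made only for illustration).

```python
def classify_pair(similarity_percent: float, same_activity: bool,
                  *, threshold_percent: float) -> str:
    """Place a pair of genes or proteins in one of the three buckets."""
    # Homologs: similarity meets the chosen threshold; activity may or may not match.
    if similarity_percent >= threshold_percent:
        return "homologous"
    # Analogs: same activity without enough similarity to imply common origin.
    if same_activity:
        return "analogous"
    # Heterologs: differ in both sequence and activity.
    return "heterologous"


# Example calls with an assumed, illustrative threshold of 30%.
print(classify_pair(72.0, False, threshold_percent=30.0))  # homologous
print(classify_pair(10.0, True, threshold_percent=30.0))   # analogous
print(classify_pair(10.0, False, threshold_percent=30.0))  # heterologous
```

The rule needs only a similarity measurement and an activity comparison, which is the low-level knowledge referred to above; placing a pair into a paralog, ortholog, or xenolog sub-bucket would require evolutionary history that this rule does not have.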

References (A starting point) 
Fitch, W. M.   Distinguishing homologous from analogous proteins.  Syst. Zool. 19:99-113 (1970)

Jensen, R.  Evolution of metabolic pathways in enteric bacteria.  In Escherichia coli and Salmonella Cellular and Molecular Biology (Neidhardt, F. C., R. Curtiss III, J. L. Ingraham, E. C. C. Lin, K. B. Low, B. Magasanik, W. S. Reznikoff, M. Riley, M. Schaechter, and H. E. Umbarger, Eds). vol 2, ch. 144, pp 2649-2662, ASM Press, Washington, DC (1996)

Koonin, E. V., R. L. Tatusov, and K. E. Rudd.  Escherichia coli protein sequences: functional and evolutionary implications.   In Escherichia coli and Salmonella Cellular and Molecular Biology (Neidhardt, F. C., R. Curtiss III, J. L. Ingraham, E. C. C. Lin, K. B. Low, B. Magasanik, W. S. Reznikoff, M. Riley, M. Schaechter, and H. E. Umbarger, Eds). vol 2, ch. 117, pp 2203-2217, ASM Press, Washington, DC (1996)

Riley, M. & B. Labedan.  Escherichia coli Gene Products: Physiological functions and common ancestries.  In Escherichia coli and Salmonella Cellular and Molecular Biology (Neidhardt, F. C., R. Curtiss III, J. L. Ingraham, E. C. C. Lin, K. B. Low, B. Magasanik, W. S. Reznikoff, M. Riley, M. Schaechter, and H. E. Umbarger, Eds). vol 2, ch. 116, pp 2118-2202, ASM Press, Washington, DC (1996)

Last updated: Thu Apr 08 17:34:35 1999 
 The information on these pages is under copyright © by member(s) of the J. H. Jackson Laboratory.