De novo gene birth

Novel genes can emerge from ancestrally non-genic regions through poorly understood mechanisms. (A) A non-genic region first gains transcription and an ORF, in either order, facilitating the birth of a de novo gene. The ORF is for illustrative purposes only, as de novo genes may also be multi-exonic, or lack an ORF, as with RNA genes. (B) Overprinting. A novel ORF is created that overlaps with an existing ORF, but in a different frame. (C) Exonization. A formerly intronic region becomes alternatively spliced as an exon, such as when repetitive sequences are acquired through retroposition and new splice sites are created through mutational processes. Overprinting and exonization may be considered as special cases of de novo gene birth.[1]
Novel genes can be formed from ancestral genes through a variety of mechanisms.[2] (A) Duplication and divergence. Following duplication, one copy experiences relaxed selection and gradually acquires novel function(s). (B) Gene fusion. A hybrid gene formed from some or all of two previously separate genes. Gene fusions can occur by different mechanisms; shown here is an interstitial deletion. (C) Gene fission. A single gene separates to form two distinct genes, such as by duplication and differential degeneration of the two copies.[3] (D) Horizontal gene transfer. Genes acquired from other species by horizontal transfer undergo divergence and neofunctionalization. (E) Retroposition. Transcripts may be reverse transcribed and integrated as an intronless gene elsewhere in the genome. This new gene may then undergo divergence.[1]

De novo gene birth is the process by which new genes evolve from DNA sequences that were ancestrally non-genic. De novo genes represent a subset of novel genes, and may be protein-coding or instead act as RNA genes.[4] The processes that govern de novo gene birth are not well understood, although several models describe possible mechanisms by which it may occur.

Although de novo gene birth may have occurred at any point in an organism's evolutionary history, ancient de novo gene birth events are difficult to detect. Most studies of de novo genes to date have thus focused on young genes, typically taxonomically restricted genes (TRGs) that are present in a single species or lineage, including so-called orphan genes, defined as genes that lack any identifiable homolog. Not all orphan genes arise de novo, however; some instead emerge through fairly well-characterized mechanisms such as gene duplication (including retroposition) or horizontal gene transfer followed by sequence divergence, or by gene fission/fusion.[5][6]

Although de novo gene birth was once viewed as a highly unlikely occurrence,[7] several unequivocal examples have now been described,[1] and some researchers speculate that de novo gene birth could play a major role in evolutionary innovation.[8][9]


As early as the 1930s, J. B. S. Haldane and others suggested that copies of existing genes may lead to new genes with novel functions.[6] In 1970, Susumu Ohno published the seminal text Evolution by Gene Duplication.[10] For some time subsequently, the consensus view was that virtually all genes were derived from ancestral genes,[11] with François Jacob famously remarking in a 1977 essay that "the probability that a functional protein would appear de novo by random association of amino acids is practically zero."[7]

In the same year, however, Pierre-Paul Grassé coined the term "overprinting" to describe the emergence of genes through the expression of alternative open reading frames (ORFs) that overlap preexisting genes.[12] These new ORFs may be out of frame with or antisense to the preexisting gene. They may also be in frame with the existing ORF, creating a truncated version of the original gene, or represent 3’ extensions of an existing ORF into a nearby ORF. The first two types of overprinting (out-of-frame and antisense ORFs) may be thought of as a particular subtype of de novo gene birth: although the new ORF overlaps a previously coding region of the genome, the primary amino-acid sequence of the new protein is entirely novel and derived from a frame that did not previously contain a gene. The first examples of this phenomenon in bacteriophages were reported in a series of studies from 1976 to 1978,[13][14][15] and since then numerous other examples have been identified in viruses, bacteria, and several eukaryotic species.[16][17][18][19][20]
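The idea of an out-of-frame overlapping ORF can be illustrated with a minimal sketch. The toy sequence, the `min_codons` cutoff, and the function names below are assumptions chosen for illustration; they do not come from the studies cited above, and real ORF prediction involves considerably more than scanning for ATG and stop codons.

```python
# Toy illustration only: the sequence and the length cutoff are invented.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end) for each ATG...stop ORF.

    All six reading frames are scanned. Coordinates are 0-based,
    end-exclusive, and relative to the strand being scanned.
    ORFs that run off the end without a stop codon are ignored.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


def out_of_frame_overlaps(orfs):
    """Pair up same-strand ORFs that overlap but use different frames."""
    pairs = []
    for a_idx, a in enumerate(orfs):
        for b in orfs[a_idx + 1:]:
            same_strand = a[0] == b[0]
            different_frame = a[1] != b[1]
            overlapping = a[2] < b[3] and b[2] < a[3]
            if same_strand and different_frame and overlapping:
                pairs.append((a, b))
    return pairs


toy_seq = "ATGGCATGCCCGATTGACTAA"
orfs = find_orfs(toy_seq)
pairs = out_of_frame_overlaps(orfs)
print(orfs)
print(pairs)
```

In this toy sequence, a short ORF in frame 2 lies entirely within a longer ORF in frame 0, so the two would be translated into unrelated amino-acid sequences from the same stretch of DNA. The pairing helper compares only ORFs on the same strand; detecting antisense overlaps would additionally require converting reverse-strand coordinates back to the forward strand.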

The phenomenon of exonization also represents a special case of de novo gene birth, in which, for example, often-repetitive intronic sequences acquire splice sites through mutation, leading to de novo exons. This was first described in 1994 in the context of Alu sequences found in the coding regions of primate mRNAs.[21] Interestingly, such de novo exons are frequently found in minor splice variants, which may allow the evolutionary “testing” of novel sequences while retaining the functionality of the major splice variant(s).[22]

Still, some researchers thought that most or all eukaryotic proteins were constructed from a constrained pool of “starter type” exons.[23] Using the sequence data available at the time, a 1991 review estimated the number of unique, ancestral eukaryotic exons at fewer than 60,000,[23] while a 1992 article estimated that the vast majority of proteins belonged to no more than 1,000 families.[24] Around the same time, however, the sequence of chromosome III of the budding yeast Saccharomyces cerevisiae was released,[25] representing the first time an entire chromosome from any eukaryotic organism had been sequenced. Sequencing of the entire yeast nuclear genome was then completed by early 1996 through a massive, collaborative international effort.[26] In his review of the yeast genome project, Bernard Dujon noted that the unexpected abundance of genes lacking any known homologs was perhaps the most striking finding of the entire project.[26]

In 2006 and 2007, a series of studies provided arguably the first documented examples of de novo gene birth that did not involve overprinting.[27][28][29] An analysis of the accessory gland transcriptomes of Drosophila yakuba and Drosophila erecta first identified 20 putative lineage-restricted genes that appeared unlikely to have resulted from gene duplication.[29] Levine and colleagues then confirmed the de novo origination of five candidate genes specific to Drosophila melanogaster and/or the closely related Drosophila simulans through a rigorous pipeline that combined bioinformatic and experimental techniques.[28] These genes were identified by combining BLAST search-based and synteny-based approaches (see below), which demonstrated the absence of the genes in closely related species.[28]

Despite their recent evolution, all five genes appear fixed in D. melanogaster, and the presence of paralogous non-coding sequences that are absent in close relatives suggests that four of the five genes may have arisen through a recent intrachromosomal duplication event.[28] Interestingly, all five were preferentially expressed in the testes of male flies[28] (see below). The three genes for which complete ORFs exist in both D. melanogaster and D. simulans showed evidence of rapid evolution and positive selection.[28] This is consistent with a recent emergence of these genes, as it is typical for young, novel genes to undergo adaptive evolution,[30][31][32] but it also makes it difficult to be completely sure that the candidates encode truly functional products. A subsequent study using methods similar to Levine et al. and an expressed sequence tag library derived from D. yakuba testes identified seven genes derived from six unique de novo gene birth events in D. yakuba and/or the closely related D. erecta.[27]

Three of these genes are extremely short (<90 bp), suggesting that they may be RNA genes,[27] although several examples of very short functional peptides have also been documented.[33][34][35][36] Around the same time as these studies in Drosophila were published, a homology search of genomes from all domains of life, including 18 fungal genomes, identified 132 fungal-specific proteins, 99 of which were unique to S. cerevisiae.[37]

Since these initial studies, many groups have identified specific cases of de novo gene birth events in diverse organisms.[38] The BSC4 gene in S. cerevisiae, identified in 2008, shows evidence of purifying selection, is expressed at both the mRNA and protein levels, and when deleted is synthetically lethal with two other yeast genes, all of which indicate a functional role for the BSC4 gene product.[39] Historically, one argument against the notion of widespread de novo gene birth has been the evolved complexity of protein folding. Interestingly, Bsc4 was later shown to adopt a partially folded state that combines properties of native and non-native protein folding.[40] Another well-characterized example in yeast is MDF1, which both represses mating efficiency and promotes vegetative growth, and is intricately regulated by a conserved antisense ORF.[41][42] In plants, the first de novo gene to be functionally characterized was QQS, an Arabidopsis thaliana gene identified in 2009 that regulates carbon and nitrogen metabolism.[43] The first functionally characterized de novo gene identified in mice, a noncoding RNA gene, was also described in 2009.[44] In primates, a 2008 informatic analysis estimated that 15 of 270 primate orphan genes had been formed de novo.[45] A 2009 report identified the first three de novo human genes, one of which is a therapeutic target in chronic lymphocytic leukemia.[46] Since then, a plethora of genome-level studies have identified large numbers of orphan genes in many organisms, although the extent to which they arose de novo, and the degree to which they can be deemed functional, remain debated.
