Earlier this week, news of a new paper about the number of protein-encoding genes surfaced on Sandwalk and Henry. The paper’s title is straightforward, “Distinguishing protein-coding and noncoding genes in the human genome,” but the concepts behind it may not be.
As mentioned in Sandwalk, the initial estimate of the number of genes in the human genome was about 30,000. That was when the first drafts of the human genome became available in June of 2000. Since then the numbers have been fluctuating, and to many it may seem like the geneticists and molecular biologists annotating the human genome are riding a roller coaster of indecision. In reality, it is not easy to calculate the exact number of genes in any genome.
Why is it not easy to calculate the number of genes? The human genome is around 3,000,000,000 bases long. That’s three thousand million, and the average human gene is 12,000 bases long! It is almost like finding a needle in a haystack, but thankfully there is some organization in the genome that helps us find genes faster. Large deserts of junk DNA exist, which helps rule out stretches where genes are unlikely to be found. And since a gene has a start and a stop, we can harness the power of computers to scan for these signals.
See, the current workflow to estimate the number of genes is to first isolate genomic DNA from the organism. The DNA is then sheared into many fragments and, depending on the cloning method, the fragments are amplified by PCR, in bacteria carrying a cloning vector, or both! Once amplified, the fragments are sequenced. This is called shotgun sequencing, the method that Craig Venter deployed to help accelerate the sequencing of the human genome. Since the fragments overlap, it is possible to join them wherever their sequences match into longer contiguous stretches called contigs, and then to build scaffolding from the contigs to figure out where the fragments fall in order. This is called the assembly of the genome.
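To make the assembly step a little more concrete, here is a minimal sketch in Python of joining fragments by their overlaps. It is a toy greedy approach, not what real assemblers do at genome scale, and the function names (`overlap`, `greedy_assemble`), the `min_len` cutoff, and the example reads are all made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b.

    Returns 0 if the best match is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the two fragments that share the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more: what is left are separate contigs
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three made-up reads that overlap end to end by four bases each.
reads = ["ATGGCGTAC", "GTACCTTGA", "TTGAGGCAT"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTTGAGGCAT']
```

If two fragments share no overlap, the function simply returns them as separate contigs, which is roughly what a gap in a real assembly looks like.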
Once most of the fragments are assembled, it is also possible to annotate the genome. To annotate means to explain what the nucleotide sequence means. If a nucleotide sequence begins with a start codon and ends with a stop codon in frame, that raises a big flag that the sequence may be a gene. There are a lot of definitions of a gene, and for the sake of this post, let’s run with the one that calls a gene any sequence of DNA that is transcribed. This segment of the genome is further scrutinized for splice sites and other regions, such as regulatory sequences, to help figure out if it’s really a gene. The sequence is also compared to other known sequences using BLAST, a tool that compares the sequence to a massive database of sequences. If any significant matches to already known genes come up, the possibility that the unknown sequence is a gene increases, based on the observation that genes are generally highly conserved throughout evolutionary time.
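The start-and-stop scan described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it looks at the forward strand only, ignores introns and splice sites, and the names `find_orfs` and `min_codons` are made up for this example. Real gene finders do far more.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Find ATG...stop stretches in the three forward reading frames.

    Returns a list of (start, end, frame) tuples, where end is exclusive
    and the stretch includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon (three bases) at a time.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # first start codon seen in this frame
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next start codon
    return orfs


# A made-up sequence with one short ORF: ATG AAA TTT TAG
dna = "CCATGAAATTTTAGGC"
for start, end, frame in find_orfs(dna):
    print(frame, dna[start:end])  # 2 ATGAAATTTTAG
```

Raising `min_codons` is the simplest way to cut down on false alarms, since short start-to-stop stretches turn up by chance all over a long random sequence.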
If the sequence meets all the criteria of a gene, it is labeled an open reading frame, or ORF. ORFs are putative genes. In order to confirm an ORF, researchers often need to turn to the wet lab to find the gene expressed as either an RNA or a protein in an organism. With 30,000 or so ORFs, the process of validating each gene is enormous and time-consuming. Not every research lab is working on confirming whether an ORF is really a gene, so that also slows down the process.
The research conducted in the paper above involved scrutinizing 22,000 ORFs from the Ensembl database. The analysis revealed a lot of orphan DNA sequences. Orphan sequences look like they encode proteins because of their open reading frames, but they are not present in the mouse and dog genomes. Just because dogs and mice don’t have the ORFs doesn’t mean the ORFs aren’t real genes. They could be primate-specific genes that arose during or after the primate lineage split from the rest of the mammals. Or the genes could be more ancient creations that were lost in the mouse and dog lineages. Either way, if the ORFs are real genes, they should also appear when compared to other primate genomes.
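The reasoning above boils down to a simple sorting exercise, which can be sketched in Python. This follows the logic as described in this post, not the paper’s actual pipeline, and the function name `classify_orfs`, the labels, and the ORF identifiers are all invented for illustration.

```python
def classify_orfs(human_orfs, hits):
    """Sort human ORFs by where a matching sequence turns up.

    hits maps a species name to the set of human ORF ids that have a
    significant match in that species' genome.
    """
    non_primates = hits["mouse"] | hits["dog"]
    primates = hits["chimpanzee"] | hits["macaque"]
    labels = {}
    for orf in sorted(human_orfs):
        if orf in non_primates:
            labels[orf] = "conserved"  # found in mouse or dog
        elif orf in primates:
            labels[orf] = "primate-specific candidate"  # orphan, but seen in a primate
        else:
            labels[orf] = "likely spurious"  # orphan with no primate match either
    return labels


# Made-up identifiers and hits, purely for illustration.
orfs = {"orf1", "orf2", "orf3"}
hits = {
    "mouse": {"orf1"},
    "dog": {"orf1"},
    "chimpanzee": {"orf1", "orf2"},
    "macaque": {"orf2"},
}
for orf, label in classify_orfs(orfs, hits).items():
    print(orf, label)
```

Here `orf1` counts as conserved, `orf2` is an orphan that still shows up in primates, and `orf3` has no match anywhere, which makes it the kind of ORF that would be dropped from the gene list.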
Comparing the ORFs to the chimpanzee and macaque genomes invalidated a total of about 5,000 ORFs that had been incorrectly added to the lists of protein-coding genes. This reduces the current estimate to roughly 20,500 protein-coding genes in the human genome. That’s not many, but evolution isn’t a numbers game. Some of the variation in the genes, as well as the patterns of regulation and expression of these genes, is what makes us human. So if you’re thinking, “Why do humans have so few genes?” don’t fret, size doesn’t matter in this case.