Determining the sequence of the human genome has been compared in significance to Neil Armstrong’s first steps on the moon and to revealing the “book of life.” At the White House announcement of the completion of a draft sequence in 2000, President Clinton described the achievement as “Without a doubt . . . the most important, most wondrous map ever produced by humankind.”
The word genome is used to describe the entire DNA sequence of an organism. The cells of all plants and animals contain two copies of the genome, one inherited from each parent, except of course for the germ cells (eggs and sperm), which contain a single copy. The DNA (deoxyribonucleic acid) is tightly coiled on chromosomes, of which there are 23 pairs in humans. If all of the DNA in a single human cell were stretched out, it would be about 2 meters in length. DNA consists of two antiparallel strands coiled into a double helix. The strands are held together by hydrogen bonds between molecular structures known as bases. The bases on each strand are linked by a sugar-phosphate backbone. The length of DNA is measured in base pairs (bp), with the entire human genome being about 3,000,000,000 bp in length.
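The 2-meter figure can be checked with a short calculation. The sketch below assumes a spacing of about 0.34 nanometers between adjacent base pairs along the double helix, a value that does not appear in the text above and is supplied here only for illustration. The function name is likewise hypothetical.

```python
# Assumption (not stated in the text): about 0.34 nm of helix length per base pair.
RISE_PER_BP_NM = 0.34
GENOME_BP = 3_000_000_000   # genome size quoted in the text
COPIES_PER_CELL = 2         # one copy inherited from each parent


def dna_length_m(base_pairs, copies=1, rise_nm=RISE_PER_BP_NM):
    """Return the stretched-out length, in meters, of `copies` genomes."""
    return base_pairs * copies * rise_nm * 1e-9  # convert nm to m


length_m = dna_length_m(GENOME_BP, COPIES_PER_CELL)
print(f"Approximate DNA length per cell: {length_m:.2f} m")
```

Under that assumption, the two copies of the genome in one cell come to roughly 2 meters, in line with the figure quoted above.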
The DNA contains all of the information required to synthesize each of the proteins present in every cell and the manner in which each is expressed. For example, although there is a subset of proteins that every cell requires to survive, there are others that are specific to liver cells, kidney cells, or nerve cells. The information (or code) for each must, however, be in every cell because all plants and animals develop from a single fertilized egg. The portions of the genome that code for proteins are referred to as genes. The DNA around each gene contains information controlling gene expression, so that liver proteins are made in the liver, kidney proteins in the kidney, and so forth. Alterations in the DNA sequence encoding a protein may result in corresponding changes to the protein sequence. These changes (mutations) are the direct cause of inherited diseases such as cystic fibrosis, muscular dystrophy, and sickle cell anemia. Other diseases such as heart disease and hypertension are believed to result from the combined effects of different mutations in different genes. Some sequence differences do not cause diseases at all, but are necessary for there to be variation in a population; otherwise, everyone would look the same. Furthermore, if there were no changes in DNA sequence, animals and plants would be unable to adapt to changing environments, and species would not evolve.
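How a single base change alters a protein can be shown with a small translation sketch. The codon table below is deliberately partial, and the DNA fragment is illustrative: it is assumed to resemble the start of the beta-globin gene, where a GAG to GTG change (glutamic acid to valine) is the substitution usually associated with sickle cell anemia. Neither the sequence nor the function name comes from the text above.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "ATG": "Met", "GTG": "Val", "CAT": "His", "CTG": "Leu",
    "ACT": "Thr", "CCT": "Pro", "GAG": "Glu", "AAG": "Lys",
}


def translate(dna):
    """Translate a coding-strand DNA string three bases (one codon) at a time."""
    usable = len(dna) - len(dna) % 3          # ignore an incomplete final codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]


# Illustrative fragment (assumption): nine codons of a coding sequence.
normal = "ATGGTGCATCTGACTCCTGAGGAGAAG"
# Replace the single base at index 19 (A -> T), turning codon GAG into GTG.
mutant = normal[:19] + "T" + normal[20:]

normal_protein = translate(normal)
mutant_protein = translate(mutant)
changed = [i for i, (a, b) in enumerate(zip(normal_protein, mutant_protein)) if a != b]

print("-".join(normal_protein))
print("-".join(mutant_protein))
print("Changed amino acid positions (0-based):", changed)
```

One changed base out of 27 changes exactly one amino acid in this fragment, which is the kind of alteration the paragraph above describes.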
The possibility of identifying each and every gene was first discussed in the mid-1980s after a few viral genomes had been sequenced and the construction of the first human gene maps had been accomplished. The U.S. National Research Council (NRC) endorsed the notion in a report in 1988. The Human Genome Project (HGP) officially started on October 1, 1990 under the oversight of the U.S. National Institutes of Health (NIH) and Department of Energy (DoE) in conjunction with the United Kingdom’s Wellcome Trust and Medical Research Council, and subsequently involved scientists from France, Germany, China, and Japan. The 15-year, $3 billion project envisioned determining the complete human genome sequence after generating detailed genetic and physical maps and developing high-throughput, efficient sequencing technologies, while also determining the genome sequences of model organisms such as Escherichia coli, yeast, a worm, the fruit fly, and the mouse.
Eventually, two parallel approaches were taken to sequence the genome. In the first, the HGP cloned fragments of human genomic DNA of about 100 to 200 kilobase pairs (kb) into bacterial artificial chromosomes (BACs), organized these clones into overlapping series, sequenced each BAC, and then assembled the sequence, using the order of the BACs to make sure the sequence was in the proper order. In the alternative approach, employed by Celera Genomics, the genome was broken into fragments of 2, 10, and 50 kb, each of which was sequenced from both ends, and the resulting reads were reassembled. Every base was sequenced 5 to 10 times from different clones, and the two end reads of each clone were a known distance apart (mate pairs). This latter approach (“shotgun sequencing”) only became feasible with the development of high-throughput automated sequencers (by Applied Biosystems) and the design of the necessary assembly algorithms. Both approaches had to handle the substantial repeat content of the human genome, which affects both the mapping of BACs and the assembly of sequence. Independent drafts of the genome sequence were completed in 2000, and an essentially complete sequence was finished in 2003. The efficiency of the shotgun approach allowed the rapid determination of the genome sequences of mouse, rat, mosquito, chicken, and many other species from 2001 to 2004. The finished human sequence has been shown to be 99.99% accurate (i.e., no more than one mistake in 10,000 bp).
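The idea behind shotgun assembly, rebuilding a long sequence from short overlapping reads, can be sketched with a toy greedy assembler. This is not the algorithm used by Celera or the HGP: it ignores repeats, sequencing errors, and mate pairs, and the 24-base "genome", read length, and function names are all made up for illustration.

```python
import random


def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))      # drop exact duplicate reads
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break                             # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Toy repeat-free "genome" (hypothetical), cut into 8-base reads every 4 bases.
genome = "ATGCGTACCTGAAGTCCGATAGGC"
reads = [genome[i:i + 8] for i in range(0, len(genome) - 7, 4)]
random.Random(0).shuffle(reads)               # the assembler sees no ordering

coverage = sum(len(r) for r in reads) / len(genome)
contigs = greedy_assemble(reads)
print(f"{len(reads)} reads, {coverage:.2f}x coverage")
print(contigs)
```

The toy genome was chosen so that no three-base stretch occurs twice, which is why simple overlap matching is enough here. Real genomes contain long repeats, which is where the mate-pair distances and higher coverage described above become necessary.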
An important result of the HGP has been the delineation of the genes encoding all of the proteins that make up the human body and those that regulate its development. These proteins are encoded by about 20,000 genes. Because humans usually regard themselves as more complex beings than flies or worms, it was surprising to find that, in comparison, the fruit fly genome has about 13,000 genes and the nematode genome about 19,000. Comparison of genes between species indicates that mammals have more regulatory genes than invertebrates, perhaps allowing for more complex developmental processes. Another surprising observation is that the protein-coding sequences account for only 1.2% of the entire genome and that nearly half of the genome is composed of repeated sequence elements. However, by comparison with the mouse genome, about 5% of the human genome can be seen to be under active selection. It is believed that most of this conserved sequence is made up of elements that regulate the expression of nearby genes. Now that the genome sequences of different mammals (mouse, rat, dog, chimpanzee), as well as chicken, frog, and fish, have been determined, it is possible to compare the human sequence to each of these species to identify parts of the sequence that have been conserved over hundreds of millions of years of evolution and that are presumably important in regulating the developmental processes that animals have in common. On the other hand, the differences between closely related species should help identify the cause of specific characteristics. For example: Why is a rat bigger than a mouse? Why are humans taller than chimpanzees?
Another important observation is that the genome sequences of any two unrelated people are 99.9% identical. The 0.1% difference is the variation that makes each of us different: taller or shorter, brown-haired or blond, and so forth. There is actually more difference within a large geographically defined population than between two populations (e.g., between the populations of northern Europe and China). Thus, the grouping of individuals by skin color has no more biological relevance than grouping by height or shoe size. Scientists are now using this variation between individuals to identify the genes that predispose people to heart disease, diabetes, hypertension, asthma, and other disorders.
Rapid technological advances are expediting these studies. Over the course of the HGP, the cost of DNA sequencing was reduced by more than 100-fold from about $10 per base to less than $0.10 per base. Scientists are now working to lower the cost to $1,000 per genome, so that personalized medicine becomes feasible for everyone. It should be possible to accurately predict the chances of an individual developing cancer or heart disease by examining their DNA sequence, rather than their family history and diet. It will also be possible to identify the best drug to treat individual patients, knowing how different drugs will act in people with variation at specific genes.
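The cost figures above can be put side by side with a rough calculation. The sketch below assumes a 3,000,000,000 bp genome read exactly once, which understates real costs because each base is normally sequenced several times. The variable names are hypothetical.

```python
GENOME_BP = 3_000_000_000       # approximate human genome size in base pairs
COST_START_PER_BASE = 10.00     # dollars per base early in the HGP (from the text)
COST_END_PER_BASE = 0.10        # dollars per base by the end of the HGP (from the text)
TARGET_PER_GENOME = 1_000       # dollars, the stated goal

# Assumption: a single pass over the genome, with no repeated coverage.
fold_reduction = COST_START_PER_BASE / COST_END_PER_BASE
genome_cost_at_end = GENOME_BP * COST_END_PER_BASE
target_per_base = TARGET_PER_GENOME / GENOME_BP
further_fold_needed = COST_END_PER_BASE / target_per_base

print(f"Reduction during the HGP: about {fold_reduction:,.0f}-fold")
print(f"One pass at $0.10 per base: about ${genome_cost_at_end:,.0f}")
print(f"Per-base cost implied by the target: ${target_per_base:.2e}")
print(f"Further reduction needed: about {further_fold_needed:,.0f}-fold")
```

Even at the lower per-base price, a single pass over the genome would cost hundreds of millions of dollars under these assumptions, so the $1,000 target implies a far larger reduction than the one achieved during the project itself.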