At present, more than 26,000 functional genes have been identified and mapped, 42% of which have unknown functions. Among the genes of known function, enzymes account for 10.28%, nucleases for 7.5%, signal transduction proteins for 12.2%, transcription factors for 6.0%, signaling molecules for 10.2%, and receptor molecules for 5.3%. Discovering and understanding the functions of these genes is of great significance for the study of gene function and for new drug screening.
The number of proteins is certainly different from the number of genes. Many transcripts are cut and rejoined (spliced) before translation, so one gene can give rise to more than one mature mRNA and therefore more than one peptide chain. Folding, modification and the assembly of different subunits add further variety. Some proteins share the same peptide chain but bind different active metal ions, which makes them different proteins. The number of proteins is therefore far greater than the number of genes.
Brief introduction to the Human Genome Project
The Human Genome Project (HGP) was first proposed by American scientists in 1985 and was officially launched in 1990. Scientists from the United States, Britain, France, Germany, Japan and China participated in the $3 billion project. According to the plan, by 2005 all the codes of the roughly 100,000 genes in the human body would be deciphered and the map of human genes drawn. In other words, the aim was to uncover the secrets of the 3 billion base pairs that make up the roughly 100,000 genes of the human body. The Human Genome Project, the Manhattan Project (atomic bomb) and the Apollo program are together called the three major scientific projects.
In 1986, the Nobel laureate Renato Dulbecco published the essay "A Turning Point in Cancer Research: Sequencing the Human Genome" (Science, 231:1055-1056). He pointed out: "If we want to know more about tumors, we must from now on pay attention to the genome of cells. ... Which species should we start with? If we want to understand human tumors, we must start with humans. ... A detailed understanding of DNA will greatly promote the study of human tumors."
What is the genome? The genome is the complete set of genes of a species. The term "human genome" has two meanings: the genetic information and the genetic material. To reveal the mystery of life, it is necessary to study the existence, structure and function of genes, and the relationships between genes, at the level of the whole genome.
The purpose of the human genome project
Why choose the human genome for research? Because human beings are the most advanced creatures produced by "evolution", studying the human genome helps us to know ourselves, to grasp the laws of birth, aging, illness and death, to diagnose and treat diseases, and to understand the origin of life.
The task is to determine the sequence of the 3 billion base pairs of human genomic DNA, find all human genes, establish their positions on the chromosomes and decipher all human genetic information.
The Human Genome Project also includes research on the genomes of Escherichia coli, yeast, nematodes, fruit flies and mice, which are called the five "model organisms".
The purpose of HGP is to decode life: to understand the origin of life, the laws of growth and development, the reasons for differences between species and between individuals, and the mechanisms of disease and of life phenomena such as longevity and aging, and to provide a scientific basis for the diagnosis and treatment of diseases.
The birth and beginning of HGP
The research on the human genome took shape in the 1970s, and reached a certain scale in many countries in the 1980s.
In 1984, at the request of the U.S. Department of Energy (DOE), White R and Mendelsohn M held a small professional meeting in Alta, Utah, to discuss the significance and prospects of determining the DNA sequence of the whole human genome (Cook-Deegan RM, 1989).
In May 1985, at a meeting in Santa Cruz, California, a proposal was put forward to determine the complete sequence of the human genome, which formed the draft of the "Human Genome Project" of the U.S. Department of Energy.
In March 1986, the feasibility of this plan was discussed in Santa Fe, New Mexico, and DOE then announced that it would implement the plan.
In 1986, the geneticist McKusick V proposed that the science of studying heredity at the level of the whole genome be called "genomics".
At the beginning of 1987, the U.S. Department of Energy and the U.S. National Institutes of Health allocated about US$5.5 million for HGP (US$166 million for the whole year).
In 1988, the National Center for Human Genome Research was established, with Watson J as its first director.
On October 1, 1990, with the approval of the U.S. Congress, HGP was officially launched in the United States. The overall plan was to invest at least $3 billion over 15 years in the analysis of the whole human genome.
In 1987, the Italian National Research Council began HGP research, which was characterized by a variety of technologies (YAC, hybrid cells, cDNA, etc.) and by regional concentration (basically limited to the Xq24-qter region).
HGP started in Britain in February 1989. Its characteristics were as follows: the Imperial Cancer Research Fund and the Medical Research Council (ICRF-MRC) were responsible for national coordination and the supervision of funds, and the Sanger Centre near Cambridge focused on accumulating experience with the nematode genome and improving large-scale DNA sequencing technology. At the same time, the "British Human Genome Resource Center" was established to screen and supply YAC libraries, specific cell lines, DNA probes, genomic DNA, cDNA libraries and the genomic DNA sequences of organisms used for comparison, and to provide information analysis. The approach can be described as "resource concentration and national overall planning".
In June 1990, the French HGP started. The Ministry of Scientific Research entrusted the National Academy of Medical Sciences with formulating the HGP, which was characterized by attention to the whole genome, cDNA and automation. The Human Polymorphism Research Center (CEPH) was established; it has had a great influence on whole-genome YAC contigs and microsatellite markers (genetic maps), and its CEPH families (80 individuals spanning three generations) are world-famous classic material for genome research.
In 1995, the HGP of the Federal Republic of Germany started and developed rapidly. It successively set up a resource center and a gene scanning and mapping center, and started the large-scale sequencing of chromosome 21.
In June 1990, the European Human Genome Research Program was adopted; it mainly funded 23 laboratories for the establishment and operation of resource centers. Denmark, Russia, Japan, South Korea, Australia and other countries also have programs.
In 1994, China's HGP was launched on the initiative of scientists including Qiang Boqin and Yang. Initially, with the support of the National Natural Science Foundation and the 863 High-tech Program, the projects "Study on the Gene Structure of Several Sites in the Chinese Human Genome" and "Study on the Location, Cloning, Structure and Function of Genes Related to Major Diseases" were carried out in succession. The Southern Gene Center was established in Shanghai in 1998, the Northern Human Genome Center in Beijing in 1999, and the Human Genome Center of the Institute of Genetics, Chinese Academy of Sciences, in 1998. In July 1999, China registered with the international Human Genome Project and took on the sequencing of a 30 Mb region on the short arm of human chromosome 3, about 1% of the whole human genome.
The Human Genome Project was initiated by the United States in 1987, and China joined the research in September 1999, undertaking 1% of the task, that is, sequencing about 30 million base pairs on human chromosome 3. This made China the only developing country participating in the project. On June 26, 2000, the working draft of the human genome was completed. Because human gene sequencing and gene patents may bring great commercial value, governments and some enterprises are actively investing in this research. For example, in 1997 AMGE Company transferred the rights to a gene related to central nervous system diseases for a profit of 392 million dollars.
Research content of HGP
The main task of HGP is human DNA sequencing, including the four maps shown in the figure below, together with sequencing technology, human genome sequence variation, functional genomics technology, comparative genomics, research on social, legal and ethical issues, bioinformatics and computational biology, and education and training.
1. Genetic map
Also known as the linkage map, it uses genetic markers with genetic polymorphism (a locus with more than one allele, each occurring in the population at a frequency higher than 1%) as "signposts", and genetic distance as the map distance. Genetic distance is the percentage of crossover recombination between two loci during meiosis; a recombination rate of 1% is called 1 cM. The genetic map creates the conditions for gene identification and gene localization. Significance: more than 6,000 genetic markers divide the human genome into more than 6,000 regions, so that linkage analysis can find evidence that a gene for a disease or phenotype lies close to a marker. The gene can then be placed in this known region, isolated and studied. For diseases, finding and analyzing the gene is the key.
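The definition of the centimorgan can be illustrated with a little arithmetic. The sketch below is only an illustration: it assumes that offspring have already been scored as recombinant or non-recombinant for a pair of loci, and the function name `map_distance_cm` and the counts are hypothetical. The simple proportionality only holds for short distances, where double crossovers are rare.

```python
def map_distance_cm(recombinants: int, total: int) -> float:
    """Approximate genetic distance in centimorgans (cM).

    A recombination rate of 1% between two loci corresponds to 1 cM.
    """
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need 0 <= recombinants <= total and total > 0")
    return 100.0 * recombinants / total


if __name__ == "__main__":
    # Hypothetical cross: 12 recombinant offspring out of 400 scored.
    print(map_distance_cm(12, 400), "cM")
```

A marker that shows a very small distance to a disease phenotype in such an analysis is the kind of evidence that places the gene in a known region of the map.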
First-generation markers: classical genetic markers, such as ABO blood group markers and HLA markers. In the middle and late 1970s came restriction fragment length polymorphism (RFLP), with about 10^5 loci. The DNA strand is cut at specific sites by restriction endonucleases, and a variation at a single "point" of the DNA can produce fragments of different lengths (allelic fragments). The polymorphism can be displayed by gel electrophoresis, and pathogenic genes can be found by linkage analysis between the fragment polymorphism and the disease phenotype, as was done for Huntington's disease. However, each digestion yields only 2-3 fragments, so the information is limited.
Second-generation markers: in 1985, minisatellite cores with a variable number of tandem repeats (VNTR) were found to provide fragments of different lengths; the repeat unit is 6 to 12 nucleotides long. The microsatellite marker system was discovered and established in 1989; its repeat unit is 2 to 6 nucleotides long, and it is also known as short tandem repeats (STR).
Third-generation markers: in 1996, Lander ES of MIT proposed the SNP genetic marker system. The mutation rate of each nucleotide is 10^-9, and the number of biallelic markers in the human genome can reach 3 million, on average about one per 1250 base pairs. Three to four adjacent markers can form 8 to 16 haplotypes.
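The figure of 8 to 16 haplotypes follows from counting: each biallelic marker has two states, so n adjacent markers can combine in at most 2^n ways. The short sketch below simply enumerates these combinations. The function name `enumerate_haplotypes` and the "0"/"1" allele coding are illustrative assumptions, and the count is an upper bound, since not every combination need occur in a real population.

```python
from itertools import product


def enumerate_haplotypes(n_markers: int) -> list[str]:
    """List every possible combination of alleles at n biallelic markers.

    Each marker carries one of two alleles, written here as "0" or "1".
    """
    return ["".join(alleles) for alleles in product("01", repeat=n_markers)]


if __name__ == "__main__":
    for n in (3, 4):
        haplotypes = enumerate_haplotypes(n)
        print(f"{n} markers: {len(haplotypes)} possible haplotypes")
```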
2. Physical map
A physical map gives the arrangement and spacing of all the genes that make up the genome, and is drawn by measuring the DNA molecules themselves. Its purpose is to arrange the genetic information about genes and their relative positions on each chromosome in a linear system. At the level of a single DNA molecule, the physical map is the order of the restriction fragments along the DNA chain, that is, the position of each restriction fragment on the chain. Because restriction endonucleases cut at specific sequences, DNA molecules with different nucleotide sequences yield different sets of fragments after digestion, forming a unique digestion pattern. The physical map is therefore one of the structural characteristics of a DNA molecule. DNA is a very large molecule, and each restriction fragment used in a sequencing reaction is only a very small part of it. Working out where these fragments lie on the DNA chain is the first problem to be solved, so the physical map is the basis of sequencing and can be regarded as a blueprint that guides it. Broadly speaking, DNA sequencing begins with making a physical map. There are many ways to make one; here a common and simple method, partial enzymatic digestion of labeled fragments, is used to illustrate the principle.
Determining the physical map of DNA by partial enzymatic hydrolysis includes two basic steps:
(1) Complete digestion: select a suitable restriction endonuclease and completely digest the radioisotope-labeled DNA chain to be analyzed. The digestion products are separated by gel electrophoresis and visualized by autoradiography; the resulting pattern gives the number and sizes of the restriction fragments that make up the DNA chain.
(2) Partial digestion: one strand of the DNA to be analyzed is labeled with a tracer isotope, and the DNA is then partially digested with the same enzyme. By controlling the reaction conditions, the enzyme's cleavage sites on the DNA chain are cut at random, so that not all sites are cleaved. The partial digestion products are likewise separated by electrophoresis and visualized by autoradiography. By comparing the autoradiographs from the two steps, the positions of the restriction fragments on the DNA chain can be deduced from the fragment sizes and the differences between them.
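The comparison of the two autoradiographs can be sketched in a few lines of code. This is a toy illustration with made-up fragment sizes, not measurements from any real gene. It assumes that the label sits at one end of the DNA, so that every labeled band in the partial digest runs from that end to one cleavage site (or to the far end of the molecule). The function name `order_fragments` is hypothetical.

```python
def order_fragments(labeled_partial_sizes, complete_digest_sizes):
    """Order restriction fragments along a DNA molecule labeled at one end.

    labeled_partial_sizes: sizes of the labeled bands from the partial
        digest; each is the distance from the labeled end to a cut site,
        and the largest is the uncut molecule.
    complete_digest_sizes: fragment sizes from the complete digest.
    Returns the fragment sizes in order, starting at the labeled end.
    """
    positions = sorted(set(labeled_partial_sizes))
    # Differences between successive labeled bands are the fragment sizes.
    ordered = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    # The ordered fragments must match the complete digest.
    if sorted(ordered) != sorted(complete_digest_sizes):
        raise ValueError("partial and complete digests are inconsistent")
    return ordered


if __name__ == "__main__":
    # Made-up example: a 10 kb molecule with cut sites 2, 5 and 9 kb
    # from the labeled end.
    print(order_fragments([10, 2, 9, 5], [1, 2, 3, 4]))
```

The complete digest alone says only which fragments exist; the differences between successive labeled bands in the partial digest are what fix their order.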
A complete physical map should include the contig map of cloned DNA fragments of the human genome in different vectors, the map of cleavage sites of restriction endonucleases that produce large fragments, the marker map of DNA fragments or specific DNA sequences (STS), the marker map of characteristic sequences found throughout the genome (such as CpG sequences, Alu sequences and isochores), and the cytogenetic map of the human genome (that is, the regions, bands and sub-bands of chromosomes, or positions expressed as a percentage of chromosome length).
The basic principle is to break the huge DNA, which cannot be tackled in one piece, into fragments and then splice them back together. Mb, kb and bp are used as map distances, and sequence-tagged sites (STS), the sequences of DNA probes, serve as the signposts. In 1998, a contiguous-clone physical map with 52,000 STS was completed, covering most regions of the human genome. A main task in constructing a physical map is to join the cloned DNA fragments that contain the sequences corresponding to STS into overlapping "contigs". Libraries of human DNA fragments carried in YAC vectors have been used to build highly representative contigs with a total coverage of 100%. In recent years, more reliable BAC, PAC and cosmid libraries have been developed.
3. Sequence map
With the completion of the genetic map and the physical map, sequencing became the most important task. DNA sequence analysis is a multi-stage process that includes DNA fragmentation, base analysis and the translation of DNA information. The sequence map of the genome is obtained by sequencing.
Basic strategy of large-scale sequencing
Clone-by-clone method: BAC clones chosen in advance from a contig are subcloned, sequenced and assembled one at a time (the publicly funded sequencing project).
Whole genome shotgun method: on the basis of some mapping information, the genome is broken directly into small fragments for random sequencing, and supercomputers then assemble the reads into contigs of large fragments (Celera, USA).
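The assembly step of the shotgun method can be illustrated with a toy example. The sketch below uses a simple greedy strategy that repeatedly merges the two reads with the longest suffix-prefix overlap. It is an illustration of the general idea only, not the algorithm used by Celera or by any production assembler, and it ignores sequencing errors, repeats and the reverse strand. The function names and the reads are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Merge reads greedily by largest overlap; return remaining contigs."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left: the contigs cannot be joined
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    # Three overlapping reads taken from the made-up sequence ATGCGTACGTTAGC.
    print(greedy_assemble(["ATGCGTA", "GTACGTT", "CGTTAGC"]))
```

Reads that share no sufficiently long overlap remain as separate contigs, which is one reason the method benefits from prior mapping information.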
4. Gene map
The gene map is built by identifying the protein-coding sequences contained in the genome and combining information such as gene sequence, location and expression pattern. Genes make up 2% to 5% of the length of the human genome, and the most important way to identify the position, structure and function of all of them is to trace mRNA, the expression product of a gene, back to its position on the chromosome.
The principle is that all biological traits and diseases are determined by structural or functional proteins, and all known proteins are encoded by mRNA. The mRNA can be copied by reverse transcriptase into cDNA, or into partial cDNA fragments called ESTs, or cDNA and cDNA fragments can be synthesized artificially according to the mRNA information. The stable cDNA or EST can then be used as a "probe" in molecular hybridization to identify the transcribed genes. An EST (expressed sequence tag) is obtained by sequencing a few hundred bp from the tail end of the mRNA, using oligo(dT) complementary to the poly(A) tail, or sequences of the cloning vector, as primers. In June 2000, EMBL held 4,229,786 ESTs.
The significance of the gene map is that it effectively reflects the spatiotemporal pattern of expression of all genes under normal or controlled conditions. From this map we can learn the level at which a gene is expressed in different tissues at different times, the expression levels of different genes in one tissue at different times, and the expression levels of different genes in different tissues at a specific time.
The Human Genome Project is an international cooperative project. Its goals are to characterize the human genome, to sequence and map the DNA of selected model organisms, to develop new technologies for genome research, to address the ethical, legal and social issues raised by human genome research, to train scientists who can use the technologies and resources developed by HGP for biological research, and to promote human health.