M W F; 10:10 - 11:00 pm

 Molecular Biology

 Douglas W. Smith

York 2722

 BIMM 100

 5254 Muir Biology Building

Fall, 2000  

 x42620; dsmith@ucsd.edu

 

 

| BIMM100 | Syllabus | Sections / Off Hrs | Grading Policy | DNASYSTEM |
| Lectures | Journal Articles | Study Qs | Lab Techniques | Exams |

 


 

 

6. Functional Genomics

 

Outline:

A. Functional Genomics
B. Finding Genes on Genomes
1. Prediction of Location: ORFs, Signals, Homologs
a. Open Reading Frames - ORFs
b. Signals - Start and Stop Codons, etc
c. Homologous Proteins
2. Location by Experiment: Northerns, cDNA Analyses
C. Determination of Gene Function
1. Prediction of Function: Homologs, Comparative Genomics
2. Function by Experiment: Gene Inactivation, Overexpression, cDNA Arrays

 

 

A. Functional Genomics

Genomics is the study of the structure and function of the Genome of an Organism.

Functional Genomics specifically focuses on the function of the Genome, and includes all information associated with Genome Functional and Structural Elements (genes, protein binding sites, complexity of the DNA, chromatin structure, ...) and with Gene Expression.

Gene Expression then includes all molecular events at the RNA level (what some people have called the Transcriptome and the processes of Transcriptomics) and at the Protein level (what more people have called the Proteome and the processes of Proteomics).

 

Functional Genomics then is the next step after determination of the Genome Sequence: what does the sequence mean (where are genes and elements), and how do the Genome genes and elements, and their products, function to give rise to particular cell types, tissues, and organisms?

Functional Genomics then first focuses on:
1. where are the genes and elements on the Genome?
2. what are the functions of these genes and elements?

 

 

B. Finding Genes on Genomes

Computational approaches, part of Bioinformatics, are very useful for Gene Prediction on new Genome Sequences.
This prediction of Gene Location on new Genome Sequences is part of the process of Annotation of the DNA Sequence.

Journal Article 2 is an excellent example of Gene Annotation done EXCLUSIVELY using computational approaches.

Annotation: addition of biological and other information to the experimental data.
Such information can be additional experimental data or it can be computer generated information.

However, computational prediction of genes and gene structures (exons and introns) is difficult, often statistical, and sometimes impossible.
For example, exons can be very small, encoding only a few amino acids. Computational approaches to prediction of Exon-Intron boundaries, however, are very unreliable for exon sizes less than about 30 bp.

Experimental approaches are still required to determine the precise structure of a specific gene (exons, introns, protein binding sites, etc).

 

1. Prediction of Location: ORFs, Signals, Homologs

a. ORFs - Open Reading Frames ... [Brown, Fig 5.2]

DNA encoding a protein must not contain, within the coding region and in the coding frame, any of the Stop Codons used in Protein Synthesis.
That is, genes and exons must be Open Reading Frames for protein synthesis from the encoded mRNA.

Note that any double-stranded DNA sequence has Six Reading Frames ... [Brown, Fig 5.1]
This is because the Genetic Code is a triplet code: 3 nucleotides encode a given amino acid
Thus, one of the DNA strands will have three Reading Frames and the other strand will have three additional Reading Frames.

Note that this also means that mRNA synthesis, and the concomitant Protein synthesis, can potentially begin at any position on a DNA molecule ... this is in fact true.
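The Six Reading Frames can be listed with a few lines of Python. This is only a minimal sketch: the function names and the short test sequence are made up for illustration, and the sequence is assumed to contain only A, C, G, and T.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(dna))


def six_reading_frames(dna):
    """Split both strands into codons, starting at offsets 0, 1 and 2."""
    frames = {}
    for strand_name, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Take complete triplets only; leftover bases at the end are dropped.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[strand_name + str(offset + 1)] = codons
    return frames


# Hypothetical 11-nucleotide example sequence.
frames = six_reading_frames("ATGGCCTAAGT")
for name, codons in frames.items():
    print(name, codons)
```

Frames "+1" to "+3" are read from the given strand, and frames "-1" to "-3" from the complementary strand, read in its own 5' to 3' direction.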

ORFs are easy to find with computers ... There are however two major problems:

1) Small Proteins
Even in prokaryotes, with no introns, what "cutoff" should be used for a minimum sized protein?
In practice, a cutoff of 100 amino acids is often used ...
However, in so doing, some true small proteins containing fewer than 100 aa are not annotated ... AND some ORFs containing more than 100 putative amino acids are annotated even though they in fact do not encode a protein !!!

2) Small Exons... [Brown, Fig 5.3]
Exons smaller than about 30 nucleotides cannot be reliably predicted by existing computer programs ... and yet such exons exist.
Missing a small Exon can result in prediction of a protein sequence that has an internal "frame shift", i.e. the protein coding frame has shifted. Such a shift changes all the amino acids after the frame shift position, resulting in major errors in prediction of the protein sequence.
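The effect of the "cutoff" can be seen in a simple ORF scan. The following Python sketch is an illustration only, not any particular annotation program: it assumes that an ORF runs from the first ATG after a Stop Codon to the next in-frame Stop Codon, it ignores introns entirely, and the function name `find_orfs` and parameter `min_aa` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(dna))


def find_orfs(dna, min_aa=100):
    """Return (strand, frame, start, end, n_aa) for each ATG...stop ORF.

    start and end are 0-based positions on the scanned strand; end is the
    position just after the stop codon. n_aa excludes the stop codon.
    ORFs with fewer than min_aa amino acids are discarded (the "cutoff").
    """
    orfs = []
    for strand_name, strand in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    n_aa = (i - start) // 3
                    if n_aa >= min_aa:
                        orfs.append((strand_name, frame + 1, start, i + 3, n_aa))
                    start = None
    return orfs


# Hypothetical test sequence: a 6-codon ORF in frame +3.
dna = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
short_cutoff = find_orfs(dna, min_aa=5)   # the small ORF is reported
usual_cutoff = find_orfs(dna)             # with the 100 aa cutoff it is missed
```

With a cutoff of 5 amino acids the small ORF is found; with the usual cutoff of 100 amino acids the same ORF is silently dropped, which is exactly the Small Proteins problem.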

 

b. Signals

Gene prediction is aided greatly by consideration of Gene Signals and Gene Content

Gene Signals include transcription start and stop signals, translation start and stop codons, exon-intron boundary nucleotides, and others.
Such signals should be at the proper locations in a given prediction of a gene.
For example, one or more Stop Codons must be present at the end of the coding sequence of a gene
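A check for a few of these signals might be sketched in Python as follows. The function name and the messages are made up, and the sketch assumes that translation starts at ATG, which is the most common start codon but not the only one.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_coding_signals(cds):
    """Report simple signal problems in a candidate coding sequence."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons, reading from the first nucleotide.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("no ATG start codon at the beginning")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no stop codon at the end")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("stop codon inside the coding sequence")
    return problems
```

An empty list means that the candidate passes these particular checks; it does not mean that the candidate is a real gene.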

Gene Content is concerned with the nucleotides present within the coding region of a gene.
These nucleotides are not a random distribution, but rather reflect the following:
1) abundance of amino acids found in proteins
Some amino acids, e.g. gly, ala, ser, are more abundant than others, e.g. cys, trp
2) the codons used to encode these amino acids: Genetic Code Redundancy
Given amino acids are encoded by between one and six codons;
see the Genetic Code - [Brown, Fig 10.7]
3) the codons specifically preferred by a given organism: Codon Bias ... [Brown, Table 5.1]
Codon bias usually reflects the %(A+T) found in the DNA of a given organism

Computer programs exist to predict genes in DNA using each of these considerations.
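As a small illustration of Gene Content measures, the Python sketch below builds the standard Genetic Code, counts how many codons encode each amino acid, and tabulates codon usage and %(A+T) for a sequence. The helper names `codon_usage` and `percent_at` are hypothetical, and real gene prediction programs combine such statistics in more elaborate ways.

```python
from collections import Counter

# Standard Genetic Code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

# Redundancy: number of codons that encode each amino acid (or stop).
codons_per_amino_acid = Counter(CODON_TABLE.values())


def codon_usage(cds):
    """Count how often each codon is used in a coding sequence."""
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


def percent_at(dna):
    """Percentage of A and T nucleotides in a DNA string."""
    return 100.0 * sum(dna.count(base) for base in "AT") / len(dna)
```

Comparing the codon usage of a candidate region with the usage seen in known genes of the same organism is one way such counts can be put to work.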

Additional computer programs exist that predict genes, and exon-intron structure, based on algorithms (methods or procedures encoded in a programming language) that involve training the program on a test set of DNA sequences.

Two important types of algorithms used in such training approaches, with examples of available programs, are:
1. Neural networks - Grail
2. Hidden Markov Models - GeneMark
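To give a feeling for how a Hidden Markov Model labels a sequence, here is a toy Python sketch. It is NOT the model used by any real gene finder: it has only two states, every probability in it is invented for illustration rather than obtained by training, and real programs use far richer models (for example, states for codon positions). The function finds the most probable state path with the Viterbi algorithm.

```python
import math

STATES = ("noncoding", "coding")
# All probabilities below are invented for illustration, not trained values.
START = {"noncoding": 0.5, "coding": 0.5}
TRANSITION = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMISSION = {
    "noncoding": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "coding": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(dna):
    """Return the most probable state path for a non-empty DNA string."""
    # score[s] = best log probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMISSION[s][dna[0]]) for s in STATES}
    back_pointers = []
    for base in dna[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s at this position.
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANSITION[p][s]))
            pointers[s] = best_prev
            new_score[s] = (score[best_prev] + math.log(TRANSITION[best_prev][s])
                            + math.log(EMISSION[s][base]))
        back_pointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical sequence: an A-rich stretch followed by a G+C-rich stretch.
path = viterbi("AAAAAAGCGCGC")
```

In a trained program, the transition and emission probabilities would be estimated from the test set of DNA sequences mentioned above.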

 

c. Homologs

Homology searches are an alternative, powerful, and complementary approach to Gene Prediction in new DNA sequence.

In Homology Searches, one searches Nucleic Acid or Protein databases with a new DNA sequence or Protein sequence, looking for very similar sequences from either the same organism or a different organism.

The primary Assumption in this approach is that Genes and Proteins evolved in time along with the evolution of the organisms.
Thus, an important protein found in humans may likely have a cognate protein in other organisms such as mouse, fruit fly (Drosophila), round worm (C. elegans), yeast (S. cerevisiae), and bacteria (E. coli).
Homology searches attempt to find such similar molecules.


Definitions:

1. Homology: Two proteins are homologs if they descended from a common ancestor.

2. Similarity: Similarity is a quantitative measure of how closely the sequences of the two molecules resemble each other.

Thus, homology is a qualitative measure; proteins are either homologs ... or they are not. One can correctly say 'These two proteins are 65% identical' or 'These two proteins are 75% similar to each other' but it is incorrect to say 'These two proteins are 50% homologous to each other'.
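The two quantitative measures can be computed from an alignment as in the Python sketch below. It assumes two already aligned, gap-free sequences of equal length, and it uses a hypothetical grouping of "similar" amino acids; real programs score similarity with substitution matrices rather than fixed groups.

```python
# Hypothetical grouping of amino acids with similar side chains; real programs
# use substitution matrices instead of fixed groups.
SIMILAR_GROUPS = ["AVLIM", "FYW", "KRH", "DE", "ST", "NQ"]


def percent_identity_and_similarity(seq_a, seq_b):
    """Compare two aligned, equal-length, gap-free protein sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    identical = similar = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    n = len(seq_a)
    return 100.0 * identical / n, 100.0 * similar / n


# Made-up six-residue example.
identity, similarity = percent_identity_and_similarity("MKVLAD", "MRVIAE")
```

Note that the function returns two percentages and nothing else: whether the proteins are homologs remains a separate yes-no conclusion that has to be argued from such numbers.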

Determination of Homology is part of Phylogenetics or Molecular Evolution.
It is often very difficult to do accurately, and usually is done statistically ... that is, one uses statistical measures of the degree of similarity found in the sequence comparison to attempt to make conclusions regarding the absolute yes-no answer of the homology question.

Note that if two Proteins can be shown via a high enough sequence similarity that they are Homologs of each other, then one can infer that the two Proteins are likely to have similar function to each other. This is a powerful approach to function determination of new Proteins.
This is discussed more fully below.

 

 

2. Location by Experiment: Northerns, cDNA Analyses

Experimental procedures for locating Genes in new DNA are basically of two types:
a. Identification via hybridization to mRNA or cDNA.
b. Identification of the 5'-end and Intron-Exon Junctions of the Gene.

a. Identification via hybridization to mRNA or cDNA.

1) Northern blots ... [Brown, Fig 5.4; see also Tech Notes 5.1]

As mentioned earlier, Northerns are the same as Southerns except that mRNA is run out on the Gel.

Thus, transcripts resulting from expression of a Gene can be detected, and localized to any given new DNA sequence, by using a labeled Probe of the same sequence as this new DNA sequence.

This methodology can also be used to distinguish Exons from Introns by appropriate Probe construction, although a more complete experimental approach is to sequence the mRNA via the cognate cDNA and compare the sequence directly with the Genomic DNA sequence.

2) Zoo blots ... [Brown, Fig 5.5]

Zoo blots are simply Southern blots of a labeled Probe from the new DNA sequence against genomic DNA restriction fragments (R.fragments) from different Organisms (the Zoo).

The point is to determine if DNA sequences that are highly similar, and hence possibly Homologous, to the new DNA sequence are present in one or more of these other Organisms.
An observed hybridization signal argues strongly that:
1) the DNA probe comes from an intragenic region
2) both organisms encode Homologous proteins that probably execute similar functions.

Zoo blots thus provide both Gene Location information as well as predictive Gene Function information, as discussed below.

 

b. Identification of the 5'-end and Intron-Exon Junctions of the Gene

1) S1 Nuclease mapping or Primer Extension

One Example of S1 nuclease mapping is shown in Brown, Fig 5.7; the following is a more common way of doing the experiment

In S1 Nuclease mapping, a DNA probe labeled at its 5'-end and which overlaps the Gene 5'-end or the 5'-end of an Exon is hybridized to mRNA from the Gene. S1 nuclease, which is specific for single-stranded nucleic acid (ssDNA or RNA), is used to digest the single-stranded DNA that is not protected by the mRNA. The resulting 5'-labeled DNA probe is then "sized" via a Southern blot. Its size pinpoints the 5'-end of the Gene or Exon.

In Primer Extension, a DNA probe labeled at its 5'-end and which is contained within the Gene is hybridized to mRNA from the Gene. The Probe DNA is used as a Primer for a Reverse Transcriptase which will extend this Primer, using the mRNA as Template, synthesizing DNA to the end of the mRNA. The resulting 5'-labeled DNA probe, identical to the one produced in the S1 Nuclease mapping approach, is then "sized" via a Southern blot, to pinpoint the 5'-end of the Gene.
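How the measured size "pinpoints" the 5'-end is simple arithmetic, sketched here in Python. The function name and the coordinates are hypothetical, and the sketch assumes a gene on the forward strand with no intron between the primer and the 5'-end of the mRNA.

```python
def transcript_start_from_primer_extension(primer_5prime_pos, product_length):
    """Genomic position of the mRNA 5'-end, for a gene on the forward strand.

    primer_5prime_pos: 1-based genomic coordinate that pairs with the labeled
        5'-end of the primer (the most downstream base covered by the primer).
    product_length: length in nucleotides of the extended, labeled DNA.
    """
    if product_length < 1:
        raise ValueError("product_length must be at least 1")
    # The product covers every position from the mRNA 5'-end to the primer 5'-end.
    return primer_5prime_pos - product_length + 1


# Made-up numbers: primer 5'-end at position 1250, product of 151 nucleotides.
start = transcript_start_from_primer_extension(1250, 151)
```

Here the 151 nucleotide product covers positions 1100 through 1250, so the 5'-end of the mRNA maps to position 1100.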

2) Exon Trapping ... [Brown, Fig 5.8]

This method is used most often to isolate Exons from new DNA, rather than simply to identify Exon-Intron boundaries ...

In Exon Trapping, an R.fragment from a new DNA sequence is cloned into a cognate R.site in an intron of a cloned Gene; Cloning Vehicles have been constructed that make this relatively easy to do. This chimeric DNA is introduced into an appropriate eukaryotic host, usually Yeast, and the cloned Gene is expressed. During processing of the initial transcript, introns are spliced out, leaving only the Exon from the cloned R.fragment behind. DNA from this mRNA is obtained via RT-PCR, and its sequence determined. Comparison of this sequence, containing only the Exon from the cloned R.fragment, with the sequence of the R.fragment itself shows where the Intron-Exon boundaries are located.

RT-PCR (see Brown, Tech Notes 5.1) is a variation on PCR in which the first polymerization step of the PCR reaction is executed by a Reverse Transcriptase, thereby converting the mRNA into a cDNA. This cDNA is then amplified further using a thermostable DNA polymerase such as Taq Polymerase.
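The sequence comparison at the end of Exon Trapping can be sketched in Python as a simple search for the trapped Exon within the cloned R.fragment. This is an illustration under strong assumptions: the Exon is assumed to occur exactly once and as one contiguous block, the sequences are made up, and the function names are hypothetical. The second function reports the two bases on each side of the Exon; most introns begin with GT and end with AG, so these flanking bases give a quick plausibility check.

```python
def locate_exon(fragment, trapped_exon):
    """Return (start, end) of the exon within the fragment, 0-based, end exclusive.

    Returns None if the exon sequence is not found as one contiguous block.
    Only the first occurrence is reported.
    """
    start = fragment.find(trapped_exon)
    if start == -1:
        return None
    return start, start + len(trapped_exon)


def flanking_dinucleotides(fragment, start, end):
    """Intron bases immediately upstream and downstream of the exon."""
    return fragment[max(0, start - 2):start], fragment[end:end + 2]


# Made-up sequences: intron end + exon + intron start.
fragment = "TTTCAG" + "GCTGAAGAT" + "GTAAGTTT"
boundaries = locate_exon(fragment, "GCTGAAGAT")
flanks = flanking_dinucleotides(fragment, *boundaries)
```

In this example the upstream flank is AG (the end of one intron) and the downstream flank is GT (the start of the next), as expected for a correctly located Exon.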

 

C. Determination of Gene Function

1. Computational Prediction of Gene Function - Finding Homologues

As mentioned above, when a highly similar Protein Sequence is found via Protein sequence comparisons to sequences maintained in Protein Databases, one is reasonably certain that a Homologous Protein has been found.

Such sequence comparisons are best done at the Protein level rather than the DNA level

Proteins whose sequences are greater than about 30% identical are nearly always homologues.
An example of homologous sequences is shown in Brown, Fig 5.9, and an example of nonrelated protein and DNA sequences is shown in Brown, Fig 5.10.
Those with sequence identities between about 20% and 30% are in the Twilight Zone: they may or may not be homologues ... one must apply biological knowledge in attempts to decide
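These rules of thumb can be written as a small Python function. The thresholds are the approximate ones given above; the wording of the returned labels, and the treatment of values below 20% as "no conclusion", are assumptions made for this sketch.

```python
def classify_by_identity(percent_identity):
    """Apply the rough identity thresholds from the text to one comparison."""
    if percent_identity > 30:
        return "probably homologues"
    if percent_identity >= 20:
        return "twilight zone: may or may not be homologues"
    # Assumption: below about 20% identity, identity alone decides nothing.
    return "no conclusion from sequence identity alone"
```

Such a function only restates the thresholds; in the Twilight Zone one still must apply biological knowledge to decide.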

Further, if the function of the Protein whose sequence was found in a Protein Database is known, then it is highly likely that the function of your putative Protein will be a similar function.

This is the basis of Annotation of Function of Genes found in Organisms whose Genome Sequence has been determined, but where the severe growth requirements of the organism are such that few if any biochemical, genetic, or physiological experiments have been done with the organism.

An example of such an Organism is Methanococcus jannaschii, the genome sequence determination of which is the topic of Journal Article 2. This Archaeon is very difficult to grow in the laboratory.

Two Types of Homologues:

Orthologues: Homologues whose genes are the "original" genes in two different organisms originating from a common ancestral gene. These genes are most often determined by:
1) most closely similar in sequence comparisons
2) gene products (proteins) are most similar in function (catalyze the "wildtype" function)

Paralogues: Homologues whose genes are "second" or "additional" genes in a given organism, in addition to the Orthologue gene. Paralogues arise evolutionarily mainly via tandem duplication DNA recombination events. Since the Orthologue provides the needed protein function, Paralogue genes are free to mutate (mutations are not selected against), yielding genes encoding Proteins of new function. As a result, Paralogue genes are often less similar in sequence comparisons to a Homologue in another organism than is the Orthologue gene.

COGs - Clusters of Orthologous Groups
Project at NCBI to define Orthologs and Paralogs in organisms whose Genomes have been determined.
COG0568 - COG for Sigma Factor 70/32. Example of COGs: many Paralogs are present, which demonstrates the difficulties in determination of Orthologs vs Paralogs.
E. coli rpoD - gene encoding Sigma-70, the main E. coli sigma factor - BLAST Protein sequence comparisons ...

 

2. Function determined by Experiment

a. Gene Inactivation - "Knockouts"

In a "knockout" mutant, a gene is completely inactivated, usually by replacing most of the gene DNA with other DNA.
This is usually done using Homologous Recombination (Brown, Fig 5.13, 5.14)

Experimentally it is easy to construct "Knockins", in which additional DNA has been added to the middle of a gene. This is often done using Transposon Mutagenesis.

In either case, a gross change in the nucleotide sequence of the gene has occurred, almost always resulting in complete inactivation of the gene.

In subsequent detailed studies of the domains of the inactivated Protein, point mutations or base substitution mutations are introduced into the gene at specific positions, a mutagenesis approach termed Site-specific Mutagenesis ... [Brown, Tech Notes 5.2]
This results in mutational changes in specific amino acids, permitting analysis of the roles of these specific amino acids.

 

b. Gene Overexpression ... [Brown, Fig 5.15]

In Gene Overexpression, the converse of Gene Inactivation, the normal expression level of a gene is dramatically elevated rather than dramatically decreased via inactivation.
This is usually done by changes in regulation of expression of the gene.

A comparison of effects on the cell or organism of a gene Knockout with those of Overexpression is often very useful for determination of function of the gene.

 

c. cDNA Arrays ... [Brown, Fig 5.21; Research Brief 5.2]

In whole Genome analyses, cDNA Microarrays are now being used to assess effects on expression of all genes due to Inactivation or Overexpression of one gene. This approach has the potential to provide complete information on the function of the inactivated gene.

One must however be able to differentiate the major effects from minor or indirect effects. Investigators are only beginning to learn how to do this.
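One simple first step in sorting major from minor effects is to keep only genes whose expression changes by more than some chosen factor. The Python sketch below does this with log2 ratios. The 2-fold default threshold, the function name, and the gene names and signal values are all made up for illustration; the sketch says nothing about whether a change is direct or indirect.

```python
import math


def changed_genes(wild_type, mutant, min_fold=2.0):
    """Return {gene: log2 ratio} for genes whose expression changed enough.

    wild_type and mutant map gene names to positive expression signals.
    A positive ratio means higher expression in the mutant.
    """
    threshold = math.log2(min_fold)
    changes = {}
    for gene, wt_signal in wild_type.items():
        ratio = math.log2(mutant[gene] / wt_signal)
        if abs(ratio) >= threshold:
            changes[gene] = ratio
    return changes


# Made-up signals for three hypothetical genes.
wild_type = {"geneA": 100.0, "geneB": 100.0, "geneC": 100.0}
mutant = {"geneA": 400.0, "geneB": 110.0, "geneC": 25.0}
changes = changed_genes(wild_type, mutant)
```

With these numbers geneA (4-fold up) and geneC (4-fold down) are reported, while the 10% change in geneB is treated as a minor effect.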

 

 








 

If you have problems or comments, send email to Doug Smith