
Human Molecular Genetics, 2002, Vol. 11, No. 16 1817-1821
© 2002 Oxford University Press

Experimental analysis of the annotation of promoters in the public database

Sharon L. Coleman, Paul R. Buckland, Bastiaan Hoogendoorn, Carol Guy, Kaye Smith and Michael C. O'Donovan*

Department of Psychological Medicine, University of Wales College of Medicine, Heath Park, Cardiff, CF14 4XN, UK

Received March 3, 2002; Accepted May 31, 2002


    ABSTRACT
 
The ability to identify and examine promoter elements is important to researchers who wish to understand how gene expression is regulated in normal and pathological states. Unfortunately, the number of human promoters that have been directly experimentally defined is small. In order to determine if promoter sequences can be identified by simply aligning mRNA and genomic sequences, we have used a reporter gene assay to assess the promoter activity of the immediate 5' region flanking 38 mRNAs mapping to chromosome 21. For comparison, we have measured the activities of 19 sequences not thought to be promoters and 39 sequences taken from the Eukaryotic Promoter Database. Our results suggest that alignment of reference mRNAs to genomic sequence allows promoters to be identified for at least 75% of genes. These data provide the first empirical evidence that the current state of annotation of the genome is sufficient to allow molecular geneticists to correctly identify promoter sequences for most genes for which reference mRNA and genomic sequences are available.


    INTRODUCTION
 
Molecular genetic studies of inherited traits have tended to focus on DNA sequences encoding protein. This reflects the widely held assumption, based upon the observed data from simple genetic disorders, that most pathogenic mutations exert their functional effects by altering the structure of the protein encoded by the mutant gene. Recent analyses of the almost complete genome sequences generated by public and private initiatives have revealed that humans possess fewer protein-coding genes than expected (1,2). This observation has re-emphasized the potential importance of gene regulation in complex disease susceptibility and other inherited phenotypes (3,4). Unfortunately, our knowledge of regulatory elements in the human genome is far from comprehensive, and, beyond their coding sequences, most genes are not well annotated. These factors are formidable obstacles to researchers who wish to study how gene expression is regulated, or how altered gene regulation can lead to pathology.

Although there are many cis-acting elements involved in regulating gene expression, promoter elements are pivotal, as they are responsible for initiating gene transcription. Unfortunately, only a small proportion of genes have had their promoters directly experimentally determined. For example, in a recent study, of the 473 genes thought to map to chromosome 22, only 20 had experimentally defined promoters (5). In the absence of experimental data, the simplest approach to identifying putative promoters is to assume that the sequence beyond the 5' end of an mRNA corresponds to the promoter of that transcript. However, the success of this approach depends upon the proportion of mRNA sequences that are full length and therefore extend to the true transcription start site. As this proportion is currently unknown, and there is no large body of experimentally derived empirical data on the accuracy of promoter annotation in the human genome, it is possible that promoter identification by mRNA alignment to genomic sequence might not successfully identify a promoter in most cases. The alignment approach is likely to be particularly problematic in the 40% of genes that have been predicted by exon-finding software to contain an entirely non-coding first exon (6).

In order to evaluate whether promoter sequences can be identified using the human sequence that is freely available in the public domain, we have undertaken a direct experimental survey of putative promoters. We have cloned the putative 5' flanking region and adjacent 5'-untranslated region (5'-UTR) of 38 genes selected at random from the list of known genes mapped to chromosome 21 (7), and screened all the clones experimentally for promoter activity using a luciferase reporter gene assay. For comparison, we also cloned and measured the promoter activity of 39 proven promoters taken randomly from the Eukaryotic Promoter Database (EPD) (13) and 19 ‘non-promoter’ sequences randomly taken from the many candidate genes for neuropsychiatric disorders that our group is studying. Our results suggest that, for at least 75% of chromosome 21 genes, the reference mRNA sequences deposited in GenBank extend into exon 1. These data provide the first direct empirical evidence that the current state of annotation of the genome is sufficient in most instances to allow molecular geneticists to correctly identify promoter sequences of known genes.


    RESULTS
 
All 39 known promoters from EPD were tested for their ability to drive transcription of the luciferase reporter gene in three cell lines (HEK293t, JEG-3 and TE671). We initially also tested 20 of the 38 putative chromosome 21 promoter sequences in the same three cell lines. From these experiments, we were able to determine that one of the cell lines (TE671) yielded no information that could not be extracted from the other two. A further 18 of the original 76 putative promoters were therefore tested only in the HEK293t and JEG-3 cell lines. The reporter gene expression data for the known and putative promoters are given in Tables 1 and 2, respectively. Data are expressed as the maximum activity in either HEK293t or JEG-3 (whichever is higher) relative to the promoterless pGL3-basic T vector, which served as the negative control. The data are also displayed in Figure 1. The chromosome 21 fragments displayed higher reporter gene activity than the EPD fragments, but this difference was not statistically significant (Mann-Whitney, P=0.66).


Table 1. Reporter gene activity of EPD promoters
 

Table 2. Reporter gene activity of chromosome 21 fragments
 


Figure 1. A comparison of the activities of fragments from EPD, chromosome 21, and control ‘non-promoter’ fragments. For each group (EPD n=39, Ch21 n=38, control n=19), the percentage of fragments showing activity greater than the value plotted is presented. Activity is measured as the magnitude of normalized luciferase activity relative to the pGL3-basic promoterless negative control vector. The graph represents maximum activity in either HEK293t or JEG-3.
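
The quantity plotted in Figure 1 can be illustrated with a minimal sketch. The function name `percent_exceeding` and the activity values below are hypothetical and are not taken from Tables 1 or 2; the sketch only shows how, for a group of fragments, the percentage exceeding each plotted activity value might be computed.

```python
def percent_exceeding(activities, thresholds):
    """For each threshold, return the percentage of fragments whose
    relative activity is strictly greater than that threshold."""
    n = len(activities)
    if n == 0:
        raise ValueError("need at least one activity value")
    return [100.0 * sum(a > t for a in activities) / n for t in thresholds]


# Hypothetical relative activities (fold over pGL3-basic); NOT data from the paper.
example_activities = [0.5, 2.0, 8.0, 12.0, 40.0, 150.0, 900.0, 3.0]
# Hypothetical activity values at which the curve is evaluated.
example_thresholds = [1, 10, 100, 1000]

example_curve = percent_exceeding(example_activities, example_thresholds)
for threshold, pct in zip(example_thresholds, example_curve):
    print(f"activity > {threshold}: {pct:.1f}% of fragments")
```

Applying the same function to each of the three groups would give one curve per group, which is the form of comparison the figure presents.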

 
There is no consensus on the definition of promoter activity in reporter gene systems. In order to determine a more conservative threshold for promoter activity than simply showing activity greater than a promoterless negative control, we tested in our assay 19 different PCR fragments representing DNA sequences that we would not expect to be promoters based upon their positions relative to the known genes mapping to the loci from which the sequences were chosen. Both the EPD and the putative chromosome 21 fragments displayed significantly higher activities than the control fragments (both at P<0.0001, Fig. 1).

The mean activity of the control fragments was 3.2 times greater than that of the pGL3-basic vector (SEM=0.6, upper limit of 99% CI=4.8, range 0.2–9.9). Accordingly, we set our definition of promoter activity at 10 times that of the pGL3-basic vector, a value above the highest activity observed for any control fragment. Seventy-five per cent (n=29) of the chromosome 21 fragments demonstrated promoter activity at or above this threshold compared with 70% (n=27) of known EPD promoters (Tables 1 and 2). Just as for the analysis of the distributions of promoter activities between the chromosome 21 and EPD groups, this difference was not statistically significant (χ²=0.48, 1 df, P=0.48).
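
The comparison of proportions above can be checked with a short sketch. It assumes a Pearson chi-square test on the 2×2 table of fragments at or above versus below the threshold, without continuity correction; the text does not state which variant was used, so this is an assumption, and the function name `chi_square_2x2` is illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]], with the P value for 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, the chi-square survival function equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Counts from the text: 29 of 38 chromosome 21 fragments and
# 27 of 39 EPD promoters reached the 10-fold threshold.
chi2, p = chi_square_2x2(29, 38 - 29, 27, 39 - 27)
print(f"chi-square = {chi2:.3f}, P = {p:.3f}")
```

Under these assumptions the statistic comes out at roughly 0.49 with P of about 0.49, which agrees closely with the reported values.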


    DISCUSSION
 
There is increasing interest in studying gene regulation in normal and pathological states. However, such investigations are hampered by the fact that the human genome is poorly annotated with regard to regulatory elements. In principle, given that their position is relatively fixed with regard to the genomic organization of a gene, promoters can be recognized by aligning mRNA and genomic sequence, with the proviso that a high proportion of mRNA species extend into their first exon. In order to determine what proportion of genes can be annotated by this simple method based upon sequence databases that are freely available in the public domain, we have undertaken a direct experimental survey of putative promoters.

We selected chromosome 21 as the source for our test promoters because it was one of the first chromosomes to be completely sequenced (7), thus guaranteeing the availability of genomic sequence flanking the selected mRNAs, and because it is of wide interest to our group, since it contains genes involved in Down's syndrome (8), Alzheimer's disease (9,10) and possibly other psychiatric disorders (11,12). Because there are no comparable data for the reporter gene activity of an extensive series of true promoters to provide a positive comparator group, we also measured the activity of a random selection of 39 known promoters from EPD (13). The distributions of activities of the known promoters and those mapping to chromosome 21 were not significantly different. Moreover, using an empirically derived categorical definition based upon the reporter gene activity of 19 ‘non-promoter’ sequences, there was no significant difference between the proportions of test (75%) and known EPD promoters (70%) showing promoter activity.

Our data therefore suggest that, as far as chromosome 21 is concerned, at least 75% of reference mRNA sequences lodged in GenBank extend into exon 1 and are sufficiently full length to allow promoter identification by sequence alignment. This estimate is conservative because, empirically, we have shown that 30% of proven promoters from EPD do not meet our criterion for promoter activity. Presumably this is because the promoters are extremely weak in cell culture, or because tissue-specific and development-specific transcription factors are missing, as are regulatory sequences outside the proximal promoter. Regardless of the explanation, our findings with the EPD promoters suggest that at least some of the chromosome 21 fragments that did not satisfy our criterion for promoter activity are also true promoters. Although we cannot specify the proportion of false negatives, it is likely to be similar to the rate found in the EPD fragments, given that the overall distributions of the activities of the chromosome 21 and EPD fragments are not significantly different.

Our finding that it is possible to correctly assign a promoter for a high proportion of mRNA species by alignment is supported indirectly by a recent computational genomics study (6), which suggested that approximately 65% of first exons on chromosomes 21 and 22 are partially coding, and that the average size of predicted partially coding first exons is 348 bp. It follows then that where reference mRNAs extend at least as far as the ATG translation start, there is a strong chance that a promoter sequence is present within the next 500 bases of 5' sequence. Thus, our direct experimental data are congruent with recent data based upon in silico prediction.

It is possible that chromosome 21 may not be representative of the genome as a whole. This is certainly true regarding the quality of finished genomic sequence, but this will only influence the proportion of mRNAs for which a genomic clone can be found, not the proportion of putative promoters displaying promoter activity. However, our data only apply to genes with known mRNAs. How our data apply to the approximately equal number of annotated genes that are based upon various methods for predicting genes (7) is unknown.

We view our findings as important and encouraging for researchers who wish to take advantage of the public genome-mapping initiatives in their search for promoter elements, but residual problems remain. The first is that, while we have shown that promoter sequence can be obtained for most known genes by selecting approximately 500 bases 5' to the most complete reported mRNA, this approach does not allow us to distinguish how much UTR (as opposed to promoter sequence) is also included in this sequence. A conservative approach would be to assume that up to 400 bases of sequence identified in this way is actually UTR. The second problem is that selecting the longest mRNA for alignment with genomic sequence does not allow identification of multiple promoters. We are currently looking to see if the use of shorter mRNAs deduced from the presence of multiple copies of an apparently 5'-truncated EST allows this. It may also be possible to use the algorithms that have been developed to identify promoters by computational genomics. At present, these do not yet allow comprehensive identification of promoters (5), and tend to be relatively weaker at predicting non-CpG-associated promoters, but both the sensitivity and specificity of the programmes available are improving (6).
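
The selection of approximately 500 bases 5' to an aligned mRNA can be sketched as follows. This is an illustrative helper only, not the procedure used in this study: the function names, the 0-based half-open coordinate convention and the toy sequence are all assumptions, and a real analysis would take the coordinates from an mRNA-to-genome alignment.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_flank(genome, mrna_start, mrna_end, strand, length=500):
    """Return up to `length` bases immediately 5' of an aligned mRNA.

    genome     : genomic sequence as a string (forward strand)
    mrna_start : 0-based index of the first aligned genomic base
    mrna_end   : 0-based index one past the last aligned genomic base
    strand     : "+" or "-", the strand on which the mRNA is encoded
    """
    if strand == "+":
        # The 5' flank lies at lower coordinates on the forward strand.
        return genome[max(0, mrna_start - length):mrna_start]
    if strand == "-":
        # The 5' flank lies at higher coordinates; report it in mRNA orientation.
        return reverse_complement(genome[mrna_end:mrna_end + length])
    raise ValueError("strand must be '+' or '-'")


# Toy sequence (hypothetical): the mRNA aligns to positions 8-12 ("GGGG").
toy_genome = "AAAACCCCGGGGTTTT"
plus_flank = upstream_flank(toy_genome, 8, 12, "+", length=4)
minus_flank = upstream_flank(toy_genome, 8, 12, "-", length=4)
print(plus_flank, minus_flank)
```

Because the alignment gives no indication of how much of the returned flank is UTR, the caveat in the paragraph above applies to any sequence obtained in this way.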


    MATERIALS AND METHODS
 
Using the databases available at NCBI, we randomly selected 5' sequence for 76 known genes that map to chromosome 21. Fragments (500–700 bp) of the 5' sequence flanking each gene (putative promoter) including UTR sequence were amplified from a single anonymous sample of DNA. The PCR products for each unique putative promoter were then pooled, and shotgun cloned into a pGL3-basic T/A cloning vector that we created from the pGL3-basic vector (Promega UK Ltd., Southampton, United Kingdom). From this procedure, we identified clones representing 20 unique putative promoters. A further 18 putative chromosome 21 promoters and 39 known promoters randomly selected from EPD (13) were PCR amplified and individually cloned into the modified pGL3-basic T/A cloning vector. All cloned sequences are listed on our website (www.uwcm.ac.uk/study/medicine/psychological_medicine/pub_data/coleman_etal.htm). PCR primers were designed using Primer3 (www.genome.wi.mit.edu/genome software/other/primer3.html). The downstream primer for each target promoter was designed to include sequence corresponding to the transcription start site in each amplicon, but not to include coding sequence, in order to avoid changing the open reading frame of the reporter gene or making a target–reporter fusion protein. Where primer design permitted, we included no more than 50 bp of known 5'-UTR and avoided inclusion of any untranslated ATG sequences. We endeavoured to include at least 35 bp of 5'-UTR, as this may contain downstream promoter elements for promoter regions without obvious TATA boxes (14). To provide a second negative control in addition to the pGL3-basic vector, we selected 21 fragments (mean size 417 bases) representing the exonic, intronic and peri-genic sequences of candidate genes of interest to our extended research group. None of these fragments was believed to contain promoter elements. Of the 21 fragments, 19 were successfully cloned. DNA to be cloned was amplified by PCR using Expand High Fidelity DNA polymerase (Roche Diagnostics Ltd., Lewes, United Kingdom) to minimize misincorporation of nucleotides, so that fragments would faithfully represent native genomic sequence in subsequent cloning stages.

Pooled or individual ligation products (PCR plus pGL3 T/A vector) were cloned into SURE 2 supercompetent cells (Stratagene, La Jolla, CA, USA) according to the manufacturer's instructions. Individual colonies were picked and PCR amplified to check the presence and orientation of insert. Preparations of selected plasmids were carried out using Qiagen chemistry and proprietary procedures (Qiagen Ltd., Crawley, United Kingdom). The identity, orientation and fidelity of cloning of the inserts were established by sequencing using Big Dye Terminator chemistry (Applied Biosystems, Warrington, United Kingdom). Unique clones faithfully representing the genomic sequence were tested in a dual reporter assay for ability to drive transcription of the luciferase reporter gene in three cell lines (HEK293t, JEG-3 and TE671). From this, we were able to determine that one of the cell lines (TE671) yielded no information that could not be extracted from the other two cell lines.

Reporter gene assays
The ability of each sequence to promote transcription of the luciferase gene was tested transiently in human cell lines HEK293t (human embryo kidney, a gift from GlaxoSmithKline, Glaxo Wellcome UK Ltd., Uxbridge, United Kingdom), TE671 (human medulloblastoma) and JEG-3 (human choriocarcinoma placenta). The latter two lines were obtained from the European Collection of Cell Cultures (ECACC). The TE671 cell line was selected, as it was listed in the ECACC catalogue as a human medulloblastoma line. TE671 is now known to be identical to the human rhabdomyosarcoma RD cell line (ECACC No.: 85111502).

Cell lines were transfected with plasmids using Lipofectamine following the manufacturer's (Gibco, Invitrogen Ltd., Paisley, United Kingdom) protocol with the modifications described below. Cell lines were cultured according to ECACC specifications at 37°C with 5% CO2. Cells were seeded into black, clear-bottomed 96-well luminometric plates (Canberra Packard, Packard BioScience Ltd., Pangbourne, United Kingdom) at approximately 80% confluence the day prior to transfection. Plates seeded with HEK293t were coated with poly-D-lysine (Sigma-Aldrich Company Ltd., Gillingham, United Kingdom) prior to seeding. Prior to transfection, all plasmids were quantitated fluorimetrically using Pico Green (Molecular Probes, Inc., Eugene, OR, USA) and a TD-700 (Turner Designs, Turner Biosystems, Inc., Sunnyvale, CA, USA) fluorimeter. We empirically determined the optimum concentration of DNA and the ratio of DNA/transfection reagent for each cell line. HEK293t were transfected with 100 ng/well DNA and 1 µl/well Lipofectamine (Gibco); TE671 were transfected with 50 ng/well DNA, 0.25 µl/well Lipofectamine and 0.75 µl/well PLUS-reagent (Gibco); and JEG-3 were transfected with 100 ng/well DNA, 0.5 µl/well Lipofectamine and 0.5 µl/well PLUS-reagent. To control for transfection efficiency, HEK293t, TE671 and JEG-3 were co-transfected with CMV-SPAP (a gift from GlaxoSmithKline) at 0.1 ng/well. Cell lines were transfected overnight in serum-free Opti-MEM (Gibco), which was replaced with complete medium (PAA Laboratories Ltd., Yeovil, Somerset, United Kingdom) containing heat-inactivated fetal calf serum, and the cells were incubated for a further 24 h. SPAP activity was measured in the culture medium, after transferring medium to a second 96-well black plate, using a phospha-light kit (Tropix, Applied Biosystems, Warrington, United Kingdom) according to the manufacturer's instructions. Luciferase activity in the remaining cells was measured in the original plate using a Luc Screen assay kit (Tropix) selected for an extended luciferase half-life of 4–5 h. Both plates were read on a TR717 scintillation counting luminometer for 1–10 s per well. Promoter activity was then normalized by dividing luciferase activity by SPAP activity.
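
The normalization and scoring steps can be summarized in a minimal sketch. The readings below are hypothetical, the function names are illustrative, and the handling of replicate wells is not specified here, so a single well per construct and cell line is assumed. The 10-fold threshold is the definition of promoter activity given in Results.

```python
PROMOTER_THRESHOLD = 10.0  # fold over pGL3-basic, as defined in Results


def normalized_activity(luciferase, spap):
    """Luciferase signal divided by the co-transfected SPAP signal."""
    if spap <= 0:
        raise ValueError("SPAP reading must be positive")
    return luciferase / spap


def best_relative_activity(fragment_wells, basic_wells):
    """Return (cell_line, fold) for the cell line giving the highest
    normalized activity relative to the promoterless pGL3-basic control.

    Both arguments map cell line name -> (luciferase, spap) readings.
    """
    folds = {}
    for line, (luc, spap) in fragment_wells.items():
        basic_luc, basic_spap = basic_wells[line]
        folds[line] = (normalized_activity(luc, spap)
                       / normalized_activity(basic_luc, basic_spap))
    top = max(folds, key=folds.get)
    return top, folds[top]


# Hypothetical luminometer readings (arbitrary units); NOT data from the paper.
fragment = {"HEK293t": (5000.0, 100.0), "JEG-3": (1200.0, 50.0)}
basic = {"HEK293t": (200.0, 100.0), "JEG-3": (100.0, 50.0)}

best = best_relative_activity(fragment, basic)
shows_promoter_activity = best[1] >= PROMOTER_THRESHOLD
print(best, shows_promoter_activity)
```

Dividing by the SPAP signal corrects each well for transfection efficiency before the comparison with the promoterless vector, so differences between wells in the amount of DNA taken up do not inflate or mask promoter activity.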


    ACKNOWLEDGEMENTS
 
This work was funded by an MRC (UK) Grant.


    FOOTNOTES
 
* To whom correspondence should be addressed. Tel: +44 2920743242; Fax: +44 2920747839; Email: odonovanmc@cardiff.ac.uk


    REFERENCES
 
1 International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

2 Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.

3 Lander, E.S. (1996) The new genomics: Global views of biology. Science, 274, 536–539.

4 Peltonen, L. and McKusick, V.A. (2001) Genomics and medicine. Dissecting human disease in the postgenomic era. Science, 291, 1224–1229.

5 Scherf, M., Klingenhoff, A., Frech, K., Quandt, K., Schneider, R., Grote, K., Frisch, M., Gailus-Durner, V., Seidel, A., Brack-Werner, R. et al. (2001) First pass annotation of promoters on human chromosome 22. Genome Res., 11, 333–340.

6 Davuluri, R.V., Grosse, I. and Zhang, M.Q. (2001) Computational identification of promoters and first exons in the human genome. Nat. Genet., 29, 412–417.

7 Hattori, M., Fujiyama, A., Taylor, T.D., Watanabe, H., Yada, T., Park, H.S., Toyoda, A., Ishii, K., Totoki, Y., Choi, D.K. et al. (2000) The DNA sequence of human chromosome 21. Nature, 405, 311–319.

8 Petersen, M.B. and Mikkelsen, M. (2000) Nondisjunction in trisomy 21: origin and mechanisms. Cytogenet. Cell. Genet., 91, 199–203.

9 Saunders, A.M. (2001) Gene identification in Alzheimer's disease. Pharmacogenomics, 2, 239–249.

10 Olson, J.M., Goddard, K.A. and Dudek, D.M. (2001) The amyloid precursor protein locus and very-late-onset Alzheimer disease. Am. J. Hum. Genet., 69, 895–899.

11 Curtis, D. (1999) Chromosome 21 workshop. Am. J. Med. Genet., 88, 272–275.

12 Straub, R.E., Lehner, T., Luo, Y., Loth, J.E., Shao, W., Sharpe, L., Alexander, J.R., Das, K., Simon, R., Fieve, R.R. et al. (1994) A possible vulnerability locus for bipolar affective disorder on chromosome 21q22.3. Nat. Genet., 8, 291–296.

13 Praz, V., Périer, R.C., Bonnard, C. and Bucher, P. (2002) The Eukaryotic Promoter Database, EPD: new entry types and links to gene expression data. Nucleic Acids Res., 30, 322–324.

14 Burke, T.W. and Kadonaga, J.T. (1997) The downstream core promoter element, DPE, is conserved from Drosophila to humans and is recognized by TAFII60 of Drosophila. Genes Dev., 11, 3020–3031.

