Human Molecular Genetics Advance Access originally published online on July 14, 2004
Human Molecular Genetics 2004 13(17):1969-1978; doi:10.1093/hmg/ddh207
Human Molecular Genetics, Vol. 13, No. 17 © Oxford University Press 2004; all rights reserved
Gene-Ontology analysis reveals association of tissue-specific 5' CpG-island genes with development and embryogenesis
1Institute of Medical Genetics, Charité University Hospital, Humboldt University, Augustenburger Platz 1, 13353 Berlin, Germany, 2EBI-Hinxton, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, 3Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany and 4Gene Mapping Center and Department of Molecular Genetics, Max Delbrück Center, Robert-Rössle-Str. 10, 13092 Berlin-Buch, Germany
Received March 31, 2004; Accepted June 29, 2004
| ABSTRACT |
|---|
A key open question in the biology of DNA methylation concerns the origin and function of CpG islands, stretches of GC-rich and relatively CpG-rich DNA sequence that often colocalize with promoter regions. All housekeeping genes, but also a substantial minority of tissue-specific genes, are associated with CpG islands. Limited experimental evidence suggests that CpG islands are associated with promoters or replication origins active during early development. Although this hypothesis is attractive for widely expressed genes, which would be expected to be expressed during early development, many tissue-specific genes also contain CpG islands. In this work, we used a genome-wide Gene-Ontology (GO)-based approach to analyze associations between GO terms and the presence of 5' CpG islands in human Reference Sequence (RefSeq) genes. We found that 19 of the 3849 GO terms with at least one annotated human sequence showed a highly significant association with the likelihood of 5' CpG islands being present in genes annotated to that term. In particular, the term development showed a highly significantly increased proportion of 5' CpG island genes. The overrepresentation of 5' CpG island genes was even more significant for tissue-specific RefSeqs annotated to development as well as many of its descendent terms. In addition, the proportion of expressed sequence tags from embryonic libraries amongst tissue-specific genes was twice as high for RefSeqs with 5' CpG islands as for those without CpG islands. These results provide strong support for previous speculations that early embryonic expression is associated with CpG islands.
| INTRODUCTION |
|---|
The CpG dinucleotide is present at about 20% of the expected frequency in the human genome. CpG dinucleotides are the major target for DNA methylation in animals, with methylation occurring at the cytosine on both strands. The depletion of the CpG dinucleotide in the human and other mammalian genomes is thought to be due to the increased mutability of methylcytosine within CpG dinucleotides, with the result being an increase in the frequency of TpG and its complementary dinucleotide CpA at the expense of CpG dinucleotides (1).
Stretches of GC-rich (65%) sequence in which the observed frequency of CpG dinucleotides is close to that expected on the basis of G+C content, termed CpG islands, are associated with the upstream region of many genes. The average size of such upstream CpG islands is about 1 kb (2), and the CpG islands generally cover all or part of the promoter and may extend into the first exon or beyond. Despite their high CpG content, the great majority of CpG islands remain unmethylated in all tissues and in all developmental stages. About 60% of human genes are associated with CpG islands, with all widely expressed (housekeeping) genes and up to 40% of tissue-specific genes being associated with CpG islands (2).
Various theories have been brought forward to explain the origin of CpG islands. As mentioned earlier, methylated CpGs have a high mutation rate towards TpG, which will tend to reduce the frequency of CpG over evolutionary time. However, CpG islands generally exist in an unmethylated state, and as a result escape CpG depletion (3). Lack of transcriptional activity at totipotent stages of development may be associated with de novo methylation, and therefore, unmethylated CpG islands may be predicted to contain promoters active during early development at the time of de novo methylation (4).
Although this provides an attractive explanation for the observation that all housekeeping genes have CpG islands (as one would expect housekeeping genes to be transcribed in early development and in germ line cells), many tissue-specific genes also contain CpG islands. A possible explanation for this could be that the promoters of genes with 5' CpG islands are active during early development. Experimental evidence is available for a limited number of tissue-specific genes with 5' CpG islands that supports this notion (5–7).
The Gene Ontology (GO) (8) has provided a dynamic, controlled vocabulary for describing gene products in any organism. GO contains three extensive subontologies describing molecular function (the biochemical activity of a gene product), biological process (the objective or biological goal to which a gene product contributes) and cellular component (the place in the cell in which the biological activity of a gene product is exerted). GO presently contains over 17 000 terms, each of which has an accession number, a name, a more detailed definition, and other information relating a term to its parent terms. Individual terms are organized as a directed acyclic graph, whereby the terms form the nodes in the ontology and the arcs the relationships (Figs. 2 and 3). More specific terms are lower in the graph and terms are related to their parent terms by is-a relationships (e.g. condensed chromosome is-a chromosome) or part-of relationships (e.g. nucleolus is part-of nucleus). In contrast to simpler hierarchical structures, one node in a directed acyclic graph may have multiple parents. In the case of GO, this allows for a more flexible, expressive and detailed description of biological functions.
The GO terms do not themselves describe specific genes or gene products; instead, collaborating databases such as the GO Annotation (GOA) database (9) generate associations of GO terms to specific gene products. Gene products are annotated at the most specific level possible, but are considered to share the attributes of all ancestor terms (true-path rule) (10). For instance, if a gene product is annotated to neurogenesis, it is understood to be implicitly annotated to organogenesis, morphogenesis and development (Fig. 3). In general, annotations to individual GO terms are not independent, because one gene product can be annotated to more than one term. For instance, a subset of genes annotated to nucleus are also annotated to regulation of transcription from Pol II promoter.
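The true-path rule can be illustrated with a minimal sketch. The parent links below are an assumption: they arrange the four terms named in the example above as a simple chain, which is not necessarily how the ontology actually connects them, and the function name `ancestors` is hypothetical.

```python
# Hypothetical parent links, assumed to form a simple chain for illustration.
PARENTS = {
    "neurogenesis": ["organogenesis"],
    "organogenesis": ["morphogenesis"],
    "morphogenesis": ["development"],
    "development": [],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            # A term may have several parents in a directed acyclic graph.
            stack.extend(parents.get(current, []))
    return seen


implicit_terms = ancestors("neurogenesis", PARENTS)
```

Under these assumed links, a gene product annotated to neurogenesis would implicitly carry the three ancestor terms as well.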
In this work, we developed a GO-based genome-wide in silico approach to investigate functional aspects of tissue-specific CpG island genes. Using data from NCBI (11), the UCSC genome database (12), UniProt/Swiss-Prot (13) and GO (8), we showed a highly significant association between tissue-specific CpG island genes and GO terms related to development and transcription regulator activity. Expressed Sequence Tag (EST) analysis of murine genes provided further support for the association between tissue-specific CpG island genes and early embryonic expression.
| RESULTS |
|---|
A subset of GO terms show highly significant deviations from the expected frequency of 5' CpG island genes among associated RefSeqs
We developed an in silico approach using data from NCBI (11), the UCSC genome database (12), UniProt/Swiss-Prot (13) and GO (8) to investigate potential correlations between the function, biological role, and cellular location of the protein products of genes and the likelihood of the genes having 5' CpG islands. The EMBOSS program newcpgreport (2,14) was used to identify CpG islands in the 2000 upstream nucleotides and the exon 1 sequence of 17 820 human Reference Sequence (RefSeq) genes. A sequence was identified as a CpG island if there was a minimum G+C content of 50% with a minimum CpGobs/CpGexp of 0.6 over at least 500 nucleotides. In the following, we will refer to such genes as 5' CpG island genes. The frequency of 5' CpG islands among all human RefSeqs using these settings was 35.4%. The definition of a CpG island is arbitrary and different settings will reveal different numbers of CpG islands. Analyses with other settings produced similar results to those described below and are presented with the supplementary online material available at http://www.charite.de/ch/medgen/cpg/.
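The thresholds stated above can be expressed as a short sliding-window check. This is a simplified sketch and not the algorithm implemented in newcpgreport: it assumes the expected CpG count in a window is (number of C × number of G) / window length, and the function names and the `step` parameter are hypothetical.

```python
def cpg_stats(seq):
    """Return (G+C fraction, CpG observed/expected ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    observed = seq.count("CG")
    # Assumed expectation under independence of C and G within the window.
    expected = (c * g) / n
    obs_exp = observed / expected if expected else 0.0
    return (c + g) / n, obs_exp


def has_cpg_island(seq, window=500, min_gc=0.5, min_obs_exp=0.6, step=1):
    """True if any window of `window` nucleotides meets both thresholds."""
    for start in range(0, len(seq) - window + 1, step):
        gc, obs_exp = cpg_stats(seq[start:start + window])
        if gc >= min_gc and obs_exp >= min_obs_exp:
            return True
    return False
```

Changing `window`, `min_gc` or `min_obs_exp` corresponds to the alternative settings mentioned in the text, which is one way to see why the number of CpG islands found depends on the definition.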
In order to examine the associations between CpG islands and GO terms, we utilized a mapping from RefSeqs to Swiss-Prot entries, because GO annotations for humans have been produced for proteins rather than genes (9). For each GO term, the number of associated genes with or without at least one CpG island was counted and a χ²-statistic was calculated.
Note that as there were 3849 individual GO terms with associated human RefSeq/Swiss-Prot entries, adjustment of estimates of statistical significance for multiple testing is needed. Therefore, in order to estimate statistical significance, data were randomized 1000 times and analyzed as mentioned earlier. The highest single χ²-value observed in 1000 runs, 20.3, was taken to be the threshold of significance. Another classical and conservative way of correcting for multiple comparisons is the Bonferroni method, in which the desired alpha value (e.g. 0.05) for the entire set of N comparisons is divided by N. In the present case, if we set the alpha value for each individual test to be 0.05/3849 (1.3×10⁻⁵), then the overall significance level of the test is guaranteed to be ≤0.05. A χ²-value of 20.0 corresponds to a P-value of about 7.7×10⁻⁶, so that the threshold for significance used in this part of the present study was somewhat more conservative than a simple Bonferroni correction.
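The arithmetic behind this comparison can be checked with a few lines. The sketch assumes one degree of freedom, for which the upper-tail probability of a χ² value x equals erfc(√(x/2)); the function name `chi2_sf_1df` is hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail P-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


n_terms = 3849
alpha = 0.05

bonferroni_alpha = alpha / n_terms   # per-test threshold, about 1.3e-5
p_at_20 = chi2_sf_1df(20.0)          # about 7.7e-6
p_at_threshold = chi2_sf_1df(20.3)   # P-value at the randomization threshold
```

Because the P-value at the randomization threshold of 20.3 lies below the Bonferroni per-test alpha, the randomization criterion is the stricter of the two.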
Nineteen GO terms were identified for which there was a significantly higher or lower than expected frequency of 5' CpG island genes (Table 1). Some of the associations seemed to reflect the notion that genes associated with housekeeping functions are often associated with 5' CpG islands (e.g. 68% of genes annotated to RNA polymerase II transcription had a 5' CpG island). Many of the GO terms with higher than expected frequencies of 5' CpG island genes were related to transcription (Table 1). On the other hand, some GO terms with highly specialized functions had significantly lower than expected frequencies of 5' CpG island genes (e.g. only 3% of genes associated with xenobiotic metabolism had CpG islands). However, not all the GO terms with significant associations lent themselves to such an explanation. In particular, the term development was associated with a higher than expected frequency of 5' CpG island genes.
GO terms related to development are significantly overrepresented for tissue-specific 5' CpG island RefSeqs
Housekeeping genes are expressed in a wide range of tissues and developmental stages, whereas the expression of tissue-specific genes is restricted to one or a few tissue types. However, any cutoff between tissue-specific and housekeeping genes is arbitrary (15). We developed two EST-based heuristic definitions for tissue-specific genes (Fig. 1). Individual EST libraries contain up to thousands of ESTs corresponding to distinct genes, and also generally contain annotations as to the tissue of origin and developmental stage. The level of expression of an individual gene in a given tissue can be inferred from the number of corresponding ESTs found in EST libraries derived from that tissue. Likewise, the range of expression of a gene over different tissues can be estimated from the range of libraries in which ESTs corresponding to the gene are found. However, EST databases and data acquisition strategies were not designed to answer many of the questions, such as tissue-specificity and differential expression, that are now being asked of them, and there are several difficulties related to EST-based analysis of gene expression patterns; for instance, a substantial proportion of EST libraries have been normalized to reduce counts of highly expressed genes with respect to those of low-copy genes (16). In addition, the coverage of different tissue types in EST libraries is extremely uneven; for instance, there are 1092 brain libraries but only 10 bone libraries in UniGene build Hs166. Nevertheless, EST analysis has recently come into wider use for the detection of genes expressed differentially among different tissues and for creating expression profiles of genes across tissue categories (16–18).
To investigate associations between tissue-specific 5' CpG island genes and GO terms, analysis and randomization were performed as mentioned earlier, except that tissue-specific genes (identified with the two heuristics described in the Materials and Methods and Fig. 1) were analyzed separately. In the data for tissue-specific genes defined according to the first heuristic (EST library count, Fig. 1A), the highest χ²-value in 1000 random runs was 21.3, and 16 GO terms were identified for which there was a more significant χ²-value. In the category biological_process, we found significantly higher than expected frequencies of CpG island genes for development, morphogenesis, synaptogenesis, and neurogenesis (Table 2). The highest χ²-statistic for the randomized analysis with tissue-specific genes defined according to the second heuristic (Category count, Fig. 1B) was 16.9; 42 GO terms had a higher χ²-value. Again, many of the terms from biological_process were related to development. In addition to the four terms mentioned above, cell differentiation, sex differentiation, brain development, and organogenesis showed significantly higher than expected proportions of CpG island genes (see supplementary online material).
There is an alternative explanation for our finding of overrepresentation of CpG islands among genes annotated to synaptogenesis, neurogenesis and brain development. As has been previously reported (19), there is an association between neurally expressed genes and CpG islands. We could confirm this finding. We identified 4578 RefSeqs for which brain was the category with the highest number of ESTs. Owing to the relative imbalance of EST libraries (mentioned earlier), these RefSeqs are not necessarily preferentially expressed in brain; rather, the EST data provide evidence of neural expression of these RefSeqs. Of these RefSeqs, 45% had a 5' CpG island compared with an overall frequency of 35.4%. Among 841 tissue-specific RefSeqs with highest expression in brain, the frequency of 5' CpG islands was 36.5% compared with an overall frequency for tissue-specific genes of 22.5%. Therefore, it appears possible that the overrepresentation of 5' CpG island genes annotated to synaptogenesis, neurogenesis, and brain development could be related to neural expression, early developmental expression, or both.
The earlier mentioned analysis concentrated on individual GO terms. The structure of GO is such that individual GO terms are considered children of more general terms (Fig. 2). If a gene product is annotated to a specific term, then it is considered to be implicitly annotated to all the parent terms of the specific term (10). We therefore extended the analysis described earlier to traverse the subgraph consisting of all the descendents of each GO term, calculating the percentage of genes annotated to terms in the subgraph that had 5' CpG islands (Fig. 3). There were 29 upper-level terms (children of the root terms of the three subontologies biological_process, molecular_function and cellular_component). For most of these subgraphs, the observed number of 5' CpG island genes was roughly the same as the expected number. Seven subgraphs had χ²-statistics with Bonferroni-corrected P-values <0.05, with the two most significant being transcriptional regulator activity (P=2.13×10⁻³⁸), and development (P=5.76×10⁻¹⁰). In a comparable analysis of tissue-specific RefSeqs defined according to number of EST libraries as mentioned previously, nine subgraphs reached significance. Remarkably, the significance of the overrepresentation of 5' CpG island genes annotated to development increased from P=5.76×10⁻¹⁰ in all RefSeqs to P=4.41×10⁻²⁵ in tissue-specific RefSeqs (Table 3).
Murine 5' CpG island genes
We were thereupon motivated to investigate whether the earlier mentioned associations reflected expression of the genes in the early phases of development. We chose to analyze mouse data because of the availability of a relatively large amount of EST data for mouse embryonic stages. Of the 878 libraries of UniGene build Mm.134, 149 were classified as embryonic (17%). The mouse has significantly fewer CpG island genes than man, with about 20% of human CpG islands being absent from the homologous mouse genes (3), and the overall frequency of murine 5' CpG island genes at the same settings as mentioned earlier was 20.5%. Despite this, there is a high correlation of the frequency of CpG island genes for individual GO terms between human and murine genes (Fig. 4). Analysis of tissue-specific RefSeq genes defined as being in the first quartile of EST library counts revealed 21 GO terms with significant χ²-statistics, including development, pattern specification, neurogenesis and angiogenesis (see supplementary online material).
In order to estimate embryonic expression levels of murine genes with and without 5' CpG islands, EST libraries were classified with respect to developmental stage using a regular-expression approach to extract the relevant information from the library annotations. Among 878 libraries, 149 were classified as embryonic (Theiler stages 1–22), representing 0.46 million of a total 3.14 million available murine ESTs. The UniGene cluster for each RefSeq was identified, and the murine RefSeqs were first classified as tissue-specific or not on the basis of the total count of EST libraries per RefSeq, similar to the analysis of human RefSeqs mentioned previously. Then the tissue-specific RefSeqs (i.e. the lowest quartile) were analyzed with respect to the proportion of ESTs derived from embryonic EST libraries. Among tissue-specific genes, the proportion was nearly twice as high for RefSeqs with 5' CpG islands as for those without 5' CpG islands (Fig. 5).
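A regular-expression classification of this kind might look like the sketch below. The pattern, the library descriptions and the function names are hypothetical: the expressions actually used to classify the UniGene library annotations are not given in the text.

```python
import re

# Hypothetical keywords that mark a library description as embryonic.
EMBRYONIC = re.compile(r"embryo|blastocyst|\bdpc\b|theiler stage", re.IGNORECASE)


def is_embryonic(description):
    """True if the free-text library description matches an embryonic keyword."""
    return bool(EMBRYONIC.search(description))


def embryonic_est_fraction(est_counts, library_descriptions):
    """Fraction of one gene's ESTs that come from embryonic libraries.

    est_counts maps library id -> number of ESTs for the gene;
    library_descriptions maps library id -> free-text annotation.
    """
    total = sum(est_counts.values())
    if total == 0:
        return 0.0
    embryonic = sum(
        count for lib, count in est_counts.items()
        if is_embryonic(library_descriptions[lib])
    )
    return embryonic / total


# Invented example data, for illustration only.
descriptions = {"L1": "Mouse embryo 10.5 dpc whole", "L2": "Adult liver"}
fraction = embryonic_est_fraction({"L1": 3, "L2": 1}, descriptions)
```

Comparing such fractions between tissue-specific RefSeqs with and without 5' CpG islands corresponds to the comparison described above.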
| DISCUSSION |
|---|
DNA methylation plays an important role in the epigenetic control of gene expression in a range of processes such as allele-specific expression in genomic imprinting and X-inactivation, control of Alu elements and other transposons, and pathological processes such as epigenetic silencing of tumor-suppressor genes (4,20–22). However, the role of CpG islands and DNA methylation in the physiological regulation of gene expression (if any) has not been fully elucidated.
One plausible explanation for the lack of methylation of CpG islands would be that binding of transcription factors to promoters located in CpG islands might reduce accessibility of the CpG island to DNA methyltransferases during the period of de novo methylation. More recently, it has been suggested that CpG islands may serve simultaneously as promoters and DNA replication origins, with the position and length of a CpG island being defined by the extent of unidirectional replication (23,24). One implication of this hypothesis is that CpG islands might be a sort of genomic footprint or trace of the replication origin event, rather than entities with their own function (23). On the other hand, experiments on a limited number of genes have suggested that CpG island sequences may in fact have an independent function in preventing de novo methylation of themselves and flanking sequences (25). These results may indicate that CpG islands may contribute to the establishment of a heritable unmethylated state that is a prerequisite for later transcriptional potential.
CpG islands and development
In the present work, we have used a genome-wide GO-based datamining approach to examine potential correlations between the GO terms and the frequency of 5' CpG islands in genes annotated to those terms. We tested the proportion of 5' CpG island genes annotated to each of the 3849 GO terms for which there was at least one annotation to a human RefSeq (GO annotations for Swiss-Prot entries were mapped to the RefSeqs as described in the Materials and Methods).
Nineteen GO terms were identified for which there was a significantly higher or lower than expected frequency of 5' CpG island genes. Notably, the term development had a significantly higher than expected proportion of CpG island genes. As mentioned, a limited amount of experimental evidence exists to suggest that tissue-specific 5' CpG island genes are expressed during early development. These observations prompted us to extend our analysis to a subset of genes defined to be tissue-specific. We developed two heuristic definitions of tissue-specificity on the basis of the distribution of ESTs belonging to the same UniGene cluster as a RefSeq (Fig. 1). Although the results of these two analyses differed slightly, they both identified multiple terms related to development and embryogenesis as having significantly higher than expected proportions of 5' CpG island genes among tissue-specific RefSeqs. In addition, the results of the analysis of the proportions of ESTs from embryonal libraries for tissue-specific murine RefSeqs with and without 5' CpG islands are compatible with the idea that tissue-specific 5' CpG island genes are expressed in early development. We suggest that these results provide strong support for previous speculations that early embryonic expression is related to 5' CpG islands.
Our results confirm and extend results by Ponger et al. (26) who compared the embryonic expression patterns of the murine homologues of 367 human genes by means of EST analysis. They found that 93% of all genes expressed in the early embryo had a start CpG island compared with 56% of other genes. Using a stricter definition of tissue-specific than the definition we have used in this work (6% of all genes were classified as tissue-specific, whereas 25% of genes were classified as tissue-specific in the present work), they found that nine of nine tissue-specific genes with early embryonic expression had a 5' CpG island.
5' CpG island genes and cellular localization
The subontology cellular component refers to the place in the cell where a gene product is active. The GO term nucleus showed a higher than expected frequency of 5' CpG island RefSeqs (P=2.57×10⁻²⁷), including genes such as NXF2 (Nuclear RNA export factor 2) and GLI3 (Zinc finger protein GLI3). The term extracellular showed a significantly lower than expected frequency of 5' CpG island genes (P=7.4×10⁻¹⁴), including interleukins and other cytokines, coagulation factors, and other genes whose protein products are active extracellularly. In addition, genes annotated to intermediate filament (intracellular structures involved in mechanically integrating the various components of the cytoplasmic space) also showed a significantly lower than expected frequency of 5' CpG island RefSeqs (Table 1).
5' CpG island genes and transcription
Many of the terms with significantly higher than expected proportions of CpG island genes were related in some way to transcription: regulation of transcription (DNA-dependent), transcription from Pol II promoter, regulation of transcription from Pol II promoter, transcription factor activity, RNA polymerase II transcription factor activity and DNA binding. To our knowledge, this has not been previously reported. The biological significance of this observation, if any, remains to be determined, although it may be related to a very basic functional role of many of the genes involved in RNA transcription.
Gene ontology for datamining
The human genome sequence can be regarded as a kind of genome anatomy that is likely to transform modern medicine and biology, just as the gross anatomy of Andreas Vesalius (1543) played a large role in the development of medicine by paving the way for such discoveries as Harvey's theory of the circulation of blood (27). In order to make use of this resource, biologists need new tools to explore the genome and other data sources. The GO consortium has provided a shared, structured vocabulary for the annotation of gene products across organisms (8,28) that has been used to predict phenotype and gene function from patterns of annotation (29–31), to predict function on the basis of protein–protein interaction data (32), to predict subcellular localization (33) and to infer biological roles for uncharacterized proteins based on gene expression time series data (34).
In the present work, we have used a genome-wide GO-based approach to analyze correlations between a characteristic DNA signal (CpG islands) and individual GO terms. The results of our analysis provide strong, if indirect, support on a genome-wide scale for the notion of a connection between early embryonic promoter activity and the presence of 5' CpG islands, and show how a GO-based analysis can be used to screen the genome to support a hypothesis for which only limited wet-lab evidence is available. A similar approach can be used to examine the roles of other DNA signals or motifs on a genome-wide scale. The association between 5' CpG islands and genes involved in aspects of transcription such as RNA polymerase II transcription factor activity was not previously recognized, and may suggest new avenues to explore for further elucidation of the nature of CpG islands.
| MATERIALS AND METHODS |
|---|
The analyses described in this report were performed with a set of Perl scripts, C programs, and a MySQL database. A more detailed explanation of algorithms and results is available online at www.charite.de/ch/medgen/cpg/, and the source code is available under the GNU General Public License upon request.
Databases, DNA Sequences and GO Annotations
The chromosome files of the hg16 release (build 34) of the human genome, as well as associated annotation tables, were downloaded from the UCSC genome database website (12). The UCSC genome database offers positional information on all available RefSeq sequences (35), which allows the transcriptional start site (TSS) and first exon sequences for each RefSeq to be extracted using Perl scripts; these sequences were then analyzed with the EMBOSS (14) program newcpgreport (algorithm described by Larsen et al. (2) using the parameters as indicated in the text).
GO annotations have been made available for UniProt/Swiss-Prot (13,36,37). Therefore, we produced a mapping of the RefSeqs to their corresponding Swiss-Prot entries on the basis of the Known Genes data set of the UCSC genome database (12), for which the linkage from Swiss-Prot to NCBI mRNA records is derived from the DR field of Swiss-Prot/TrEMBL records, and the linkage from mRNA to RefSeqs is produced from NCBI LocusLink data. There were a total of 17 567 distinct RefSeqs and 26 683 Swiss-Prot entries; 9448 RefSeqs had one associated Swiss-Prot entry, 3707 had two, and 2608 had three or more associated Swiss-Prot entries. The GO annotations for different Swiss-Prot entries that are all associated with one RefSeq may be divergent. Therefore, we weighted the GO annotations according to the number of associated Swiss-Prot entries. For instance, if gene X has two associated Swiss-Prot entries Y and Z, the GO annotations associated with Y and Z are each weighted by a factor of 1/2. Many RefSeqs are known to have alternate splice forms. We counted each distinct 5' TSS for each RefSeq separately. Splice forms with identical TSS were not counted separately. We used the GO (8) term database version of 12-2003, and the annotation file gene_association.human_goa (version 16, December 17, 2003). A total of 17 820 RefSeq sequences were extracted. 15 728 had at least one associated Swiss-Prot entry and served as the basis for further analysis.
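The weighting scheme can be sketched as follows. The text states only that each associated Swiss-Prot entry contributes with weight 1/n; summing the weights of entries that agree on a term is an assumption of this sketch, and the function and identifier names are hypothetical.

```python
from collections import defaultdict


def weighted_go_annotations(refseq_to_swissprot, swissprot_to_go):
    """Map each RefSeq to {GO term: weight}, weighting each entry by 1/n."""
    weights = defaultdict(lambda: defaultdict(float))
    for refseq, entries in refseq_to_swissprot.items():
        if not entries:
            continue
        per_entry = 1.0 / len(entries)
        for entry in entries:
            # Count each term at most once per Swiss-Prot entry.
            for term in set(swissprot_to_go.get(entry, ())):
                weights[refseq][term] += per_entry
    return {refseq: dict(terms) for refseq, terms in weights.items()}


# Gene X with two Swiss-Prot entries Y and Z, as in the example in the text.
result = weighted_go_annotations(
    {"X": ["Y", "Z"]},
    {"Y": ["nucleus", "DNA binding"], "Z": ["nucleus"]},
)
```

With this choice, a term shared by both entries receives a total weight of 1, and a term found in only one of them receives 1/2.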
Association between the frequency of 5' CpG island genes and GO terms
The 2000 nucleotides upstream of the indicated TSS and exon 1 of each RefSeq were analyzed for the presence of a CpG island. For each GO term, the number of associated 5' CpG island RefSeq genes was counted. A χ²-statistic with one degree of freedom was generated by taking the expected frequency of genes to be the total number of genes associated to the term multiplied by the overall frequency of genes having at least one CpG island amongst all RefSeqs.
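One way to compute such a statistic is sketched below. It assumes a two-cell goodness-of-fit form (genes with and without a 5' CpG island, which leaves one degree of freedom); the exact formula used in the original scripts is not given in the text, and the gene counts in the example are invented.

```python
def chi2_one_df(n_genes, n_with_island, overall_freq):
    """Chi-square statistic comparing observed and expected CpG island counts."""
    expected_with = n_genes * overall_freq
    expected_without = n_genes - expected_with
    observed_without = n_genes - n_with_island
    return (
        (n_with_island - expected_with) ** 2 / expected_with
        + (observed_without - expected_without) ** 2 / expected_without
    )


# Hypothetical term with 100 annotated genes, 68 of which have a 5' CpG
# island, tested against the overall frequency of 35.4%.
example = chi2_one_df(100, 68, 0.354)
```

A term whose genes carry CpG islands at exactly the overall frequency gives a statistic of zero, and the value grows as the observed count departs from expectation in either direction.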
We produced randomized data by assigning a CpG island status randomly to each RefSeq according to a uniform distribution on the basis of the overall frequency of CpG islands among all RefSeq sequences. Then, analysis of GO terms was repeated for the randomized data as described previously. The randomization was repeated 1000 times. Observations in the real data were taken to be significant only if the χ²-statistic was higher than the highest statistic observed in 1000 randomized experiments.
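The randomization procedure can be sketched as a maximum-statistic threshold. This is an illustration under stated assumptions rather than the original implementation: annotations are unweighted, the χ² statistic uses the two-cell form, and the function names and toy data are hypothetical.

```python
import random


def chi2_one_df(n_genes, n_with_island, freq):
    """Two-cell chi-square statistic (assumed form, one degree of freedom)."""
    expected_with = n_genes * freq
    expected_without = n_genes - expected_with
    return (
        (n_with_island - expected_with) ** 2 / expected_with
        + ((n_genes - n_with_island) - expected_without) ** 2 / expected_without
    )


def max_chi2_random_run(term_to_genes, genes, freq, rng):
    """Assign CpG island status at random and return the largest statistic."""
    status = {gene: rng.random() < freq for gene in genes}
    best = 0.0
    for members in term_to_genes.values():
        if members:
            observed = sum(status[gene] for gene in members)
            best = max(best, chi2_one_df(len(members), observed, freq))
    return best


def significance_threshold(term_to_genes, genes, freq, runs=1000, seed=0):
    """Highest statistic seen in any of `runs` randomized data sets."""
    rng = random.Random(seed)
    return max(
        max_chi2_random_run(term_to_genes, genes, freq, rng) for _ in range(runs)
    )


# Toy data: 50 hypothetical genes and two hypothetical terms.
toy_genes = ["g%d" % i for i in range(50)]
toy_terms = {"term_a": toy_genes[:10], "term_b": toy_genes}
threshold = significance_threshold(toy_terms, toy_genes, 0.354, runs=20, seed=1)
```

A term in the real data would then be called significant only if its statistic exceeds `threshold`, which controls for the number of terms tested without assuming that the tests are independent.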
Tissue distribution versus CpG Islands
By using information from NCBI's locus link database (11), it is possible to identify UniGene clusters that correspond to RefSeqs. Each UniGene cluster contains information about the library from which each EST was derived. We used this information to estimate the anatomical and developmental distribution of the expression of RefSeqs.
Two heuristics were used to find a subset of RefSeqs that we denote as tissue-specific. The first was to count the total number of EST libraries in which ESTs corresponding to each RefSeq were found. All 8282 EST libraries were analyzed from UniGene build 166. Those genes in the lowest quartile were defined to be tissue-specific or low-expression genes for the purposes of this analysis, i.e. 75% of all RefSeqs belonged to UniGene clusters whose ESTs were distributed over a greater number of EST libraries (Fig. 1A).
The second heuristic involved assigning the 8282 EST libraries to 42 different categories such as heart, brain, liver, and so on (a full list of categories is available on the accompanying website). All libraries from cancerous or diseased tissues as well as those for which an unambiguous assignment to a category was not possible were excluded from this analysis. We defined tissue-specific to mean that at least 70% of all ESTs associated to one RefSeq were found in three or fewer categories (Fig. 1B).
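Both heuristics can be written down compactly. In this sketch the handling of ties at the quartile boundary is an arbitrary assumption (genes are ordered by count, then by name), the function names are hypothetical, and the example counts are invented.

```python
def tissue_specific_by_library_count(library_counts):
    """First heuristic: genes in the lowest quartile of EST library counts."""
    ranked = sorted(library_counts, key=lambda gene: (library_counts[gene], gene))
    return set(ranked[: len(ranked) // 4])


def tissue_specific_by_category(est_per_category, max_categories=3, min_fraction=0.70):
    """Second heuristic: at least 70% of ESTs fall in three or fewer categories."""
    total = sum(est_per_category.values())
    if total == 0:
        return False
    top = sorted(est_per_category.values(), reverse=True)[:max_categories]
    return sum(top) / total >= min_fraction


lowest = tissue_specific_by_library_count({"a": 1, "b": 5, "c": 10, "d": 50})
restricted = tissue_specific_by_category({"brain": 7, "heart": 1, "liver": 1, "lung": 1})
broad = tissue_specific_by_category({"a": 2, "b": 2, "c": 2, "d": 2, "e": 2})
```

The first heuristic cannot distinguish tissue-specific genes from genes with low overall expression, which is why the text refers to that set as tissue-specific or low-expression genes.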
Analysis of the association between the presence or absence of CpG islands and GO terms was then undertaken independently for the RefSeqs classified as tissue-specific. The analysis was performed in essentially the same way as for the initial analysis of all RefSeqs as described previously.
Subgraph analysis
Subgraphs of GO were analyzed essentially as mentioned previously, but by traversing the subgraph emanating from any term (Fig. 2) and adding all RefSeqs annotated to the term or any of its descendents as a group, and applying a χ²-statistic to the group. For instance, to analyze the term central nervous system development, genes annotated to this term as well as its descendents brain development and ventral midline development were grouped together (Fig. 3). If there were multiple paths to a descendent node, the node was only counted once.
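The grouping step can be sketched as a graph traversal that visits each descendent once. The child links follow the example in the text; the gene identifiers and function names are hypothetical.

```python
CHILDREN = {
    "central nervous system development": [
        "brain development",
        "ventral midline development",
    ],
    "brain development": [],
    "ventral midline development": [],
}

# Invented annotations, for illustration only.
ANNOTATIONS = {
    "central nervous system development": {"geneA"},
    "brain development": {"geneB", "geneC"},
    "ventral midline development": {"geneC", "geneD"},
}


def subgraph_terms(term, children):
    """Return `term` and all its descendents, each visited only once."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # reached again via another path; do not count twice
        seen.add(current)
        stack.extend(children.get(current, []))
    return seen


def genes_in_subgraph(term, children, annotations):
    """Union of all genes annotated to `term` or any of its descendents."""
    genes = set()
    for t in subgraph_terms(term, children):
        genes |= annotations.get(t, set())
    return genes


group = genes_in_subgraph("central nervous system development", CHILDREN, ANNOTATIONS)
```

Collecting the genes in a set also means that a gene annotated to several terms of the same subgraph contributes to the group only once, which is a choice of this sketch rather than something stated in the text.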
Analysis of mouse data
Analysis of murine RefSeqs and EST data was performed essentially as for human data, with the exception that the murine EST libraries were classified with respect to their developmental stage. NCBI build 32 (UCSC version Mm4) and UniGene build Mm.134 were used.
| ACKNOWLEDGEMENTS |
|---|
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.
| FOOTNOTES |
|---|
* To whom correspondence should be addressed. Tel: +49 30450569124; Fax: +49 30450569915; Email: peter.robinson{at}charite.de