Publications (Supplements)

This page provides supplementary material to Brunner, H.G. & van Driel, M.A. From syndrome families to functional genomics. Nature Reviews Genetics 5, 545-551 (2004).

Copyright (C) 2004, H.G. Brunner and M.A. van Driel



Introduction:

By placing the known human phenotypes into groups, we can examine whether relationships at the phenotype level reflect shared functions at other biological levels, such as the proteome, genome or interactome.

OMIM [1] contains over 15,000 full-text records. Of these, ~5,000 describe a human (disease) phenotype, including some 2,000 syndromes. For approximately 1,200 human phenotypes the corresponding genes are known. OMIM therefore holds data that can be used to validate ideas about phenotype-to-genotype relationships. However, its full-text character makes it difficult to analyze systematically. Finding similar phenotypic descriptions in such databases therefore requires text-analysis techniques.

Data from Stickler syndrome, Pallister-Hall syndrome and several other test cases confirm the hypothesis that relationships at the phenotype level reflect relationships at the gene level: similar phenotypes are caused by genes that are involved in a similar function. An analysis of all data shows a clear relation between the phenotypic similarity scores and genetic similarity as measured by PFAM [2], GO annotations [3], and even sequence alignments. These data strongly suggest that the principles underlying phenotype groups and syndrome families are relevant to all human genetic diseases, and should be explored further.




General background for the figures below.

We have used fully automated text mining to analyze all OMIM disease records (~5,000). To allow this process to be automated, the keyword frequencies were represented as vectors, with one vector per OMIM record. Similar phenotypic descriptions have similar keyword frequencies, and therefore similar keyword vectors. We corrected for the length of the record and applied the inverse document frequency technique to compensate, at least partly, for keyword frequency differences. We subsequently determined the similarities between the text vectors.
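The vector construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used for the paper: the input format, the function names, the particular length correction (dividing counts by the record length), the logarithmic form of the inverse document frequency, and the use of cosine similarity as the comparison measure are all assumptions. The keyword lists are toy data, not OMIM content.

```python
import math
from collections import Counter


def tfidf_vectors(records):
    """Build one weighted keyword vector per record.

    `records` maps a record id to a list of keywords (hypothetical input format).
    """
    n_docs = len(records)
    # Document frequency: the number of records in which each keyword occurs.
    doc_freq = Counter()
    for keywords in records.values():
        doc_freq.update(set(keywords))
    vectors = {}
    for rec_id, keywords in records.items():
        counts = Counter(keywords)
        total = sum(counts.values())
        # Keyword count divided by record length (assumed length correction),
        # weighted by the inverse document frequency.
        vectors[rec_id] = {
            kw: (c / total) * math.log(n_docs / doc_freq[kw])
            for kw, c in counts.items()
        }
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(kw, 0.0) for kw, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy keyword lists (illustrative only).
records = {
    "A": ["cleft palate", "myopia", "hearing loss"],
    "B": ["cleft palate", "myopia", "arthritis"],
    "C": ["polydactyly", "hypothalamic hamartoma"],
}
vectors = tfidf_vectors(records)
sim_ab = cosine_similarity(vectors["A"], vectors["B"])  # overlapping keywords
sim_ac = cosine_similarity(vectors["A"], vectors["C"])  # no shared keywords: 0.0
```

In this sketch, keywords that occur in every record receive a weight of zero, which is the effect the inverse document frequency is meant to have on uninformative terms.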

The X and Y axes in the plots shown below relate to the phenotypic and genotypic similarity, respectively. Phenotypic similarity is defined as the distance between the keyword vectors for the terms in the OMIM records. Genotypic similarity has been defined in three different ways, using all known OMIM genes in Swissprot [4] (~1,200): 1) sharing at least one PFAM domain; 2) sharing at least one GO entry; 3) percentage sequence identity. In all three plots the phenotypic similarity values were placed in bins that each cover 10% of the similarity range. The highest bin (i.e. near-perfect phenotypic similarity) has very low counts and is therefore not shown.
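The binning step can be illustrated with a short Python sketch. It assumes that each phenotype pair has already been reduced to a similarity score between 0 and 1 and a yes/no flag for the genotypic criterion (for instance, a shared PFAM domain). The function names, the input format and the domain labels are hypothetical, and the pairs are toy values rather than results from the analysis.

```python
from collections import defaultdict


def shares_feature(features_a, features_b):
    """True if two genes have at least one annotation (e.g. a domain) in common."""
    return bool(set(features_a) & set(features_b))


def percentage_by_bin(pairs, n_bins=10):
    """Percentage of pairs per similarity bin that meet the genotypic criterion.

    `pairs` is an iterable of (phenotypic_similarity, criterion_met) tuples.
    Bin 4 of 10 covers similarity scores from 0.4 up to, but not including, 0.5.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for similarity, criterion_met in pairs:
        # A score of exactly 1.0 is placed in the top bin.
        b = min(int(similarity * n_bins), n_bins - 1)
        totals[b] += 1
        if criterion_met:
            hits[b] += 1
    return {b: 100.0 * hits[b] / totals[b] for b in sorted(totals)}


# Toy pairs: four fall in the 0.4-0.5 bin, one in the 0.1-0.2 bin.
pairs = [
    (0.45, shares_feature({"domain_x", "domain_y"}, {"domain_y"})),
    (0.42, shares_feature({"domain_x"}, {"domain_z"})),
    (0.48, False),
    (0.41, False),
    (0.15, False),
]
result = percentage_by_bin(pairs)  # {1: 0.0, 4: 25.0}
```

Bins with very few pairs give unstable percentages, which is consistent with the decision to leave the sparsely populated top bin out of the plots.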




Figure 1: Phenotype similarity and PFAM domain co-occurrence

The X-axis shows the phenotypic similarity bins. The Y-axis indicates the percentage of phenotype pairs whose associated genes share at least one PFAM domain.
For example, for 19% of all phenotype pairs with a similarity score between 0.4 and 0.5, the associated genes share at least one PFAM domain.


Figure 2: Phenotype similarity and GO classification co-occurrence

The X-axis shows the phenotypic similarity bins. The Y-axis indicates the percentage of phenotype pairs whose associated genes share at least one GO link, or more than one GO link (green and blue, respectively). The GO annotation links of each gene were rescaled to the sixth GO level.
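One possible reading of the rescaling step is that every GO term is replaced by its ancestor at the sixth level before the annotations of two genes are compared. The Python sketch below illustrates that reading only. It assumes a toy hierarchy in which each term has a single parent (the real GO is a directed acyclic graph in which a term can have several parents), it counts the root as level 1, and all term names and function names are hypothetical.

```python
def path_from_root(term, parent):
    """List of terms from the root down to `term` in a single-parent toy tree."""
    path = [term]
    while term in parent:
        term = parent[term]
        path.append(term)
    return path[::-1]


def rescale_to_level(term, parent, level=6):
    """Replace `term` by its ancestor at `level`; shallower terms stay unchanged."""
    path = path_from_root(term, parent)
    # Level 1 is the root (an assumption about how levels are counted).
    return path[min(level, len(path)) - 1]


def shared_go_links(terms_a, terms_b, parent, level=6):
    """Number of rescaled GO terms that two genes have in common."""
    a = {rescale_to_level(t, parent, level) for t in terms_a}
    b = {rescale_to_level(t, parent, level) for t in terms_b}
    return len(a & b)


# Toy hierarchy: a chain t1 <- t2 <- ... <- t8, with a side branch u7 under t6.
parent = {f"t{i}": f"t{i - 1}" for i in range(2, 9)}
parent["u7"] = "t6"

deep = rescale_to_level("t8", parent)      # "t6"
shallow = rescale_to_level("t3", parent)   # "t3"
n_shared = shared_go_links(["t8"], ["u7"], parent)  # 1
```

In this reading, two genes annotated with different specific terms can still count as sharing a GO link when those terms fall under the same sixth-level ancestor.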


Figure 3: Phenotype similarity and sequence similarity

The X-axis shows the phenotypic similarity bins. The Y-axis indicates the percentage of phenotype pairs whose associated genes show sequence similarity with an e-value below 1e-6 (Paracel Smith-Waterman [5], run on all known OMIM genes from Swissprot (~1,200) with the BLOSUM80 matrix and the low-complexity filter on).



References

1. Hamosh, A., et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 30, 52-55 (2002).

2. Bateman, A., et al. The Pfam protein families database. Nucleic Acids Res. 32, D138-141 (2004).

3. The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genet. 25, 25-29 (2000).

4. Boeckmann, B., et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365-370 (2003).

5. Smith, T.F. & Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197 (1981).


Last update: May 26th, 2004