Advanced Campus Services
A Novel Approach to Information Integration of Heterogeneous Bioinformatics Sources
Bioinformatics is a field that studies the information content of life. It analyzes large-scale genomic and proteomic data. As more advanced DNA sequencing technologies are developed, genome research projects generate enormous amounts of data, with volumes expanding exponentially and doubling every 12-18 months. However, each scientist may access and use only a small portion of this data, mainly because the heterogeneity of the data sources makes it practically impossible to find and retrieve all relevant information. Similar to the "accidental" and "essential" complexity of software systems articulated by Brooks in his widely cited paper ("No Silver Bullet: Essence and Accidents of Software Engineering"), the heterogeneity in the data sources is either of an "accidental" nature or of an "essential" nature. Accidental heterogeneity arises from the use of different formats and representation systems (e.g., relational databases, LDAP directories, legacy systems, XML schemas, flat files) and is being addressed through better translation systems such as IBM's Directory Integrator. Essential heterogeneity arises from the use of different vocabularies to describe concepts and relationships (metadata) and from the different contexts and emphases placed on the data by different scientists, which causes basic semantic problems (e.g., synonyms, homonyms). This project focuses on the essential heterogeneity problem. Its goal is to provide a uniform, seamless interface that facilitates a scientist's access to the growing number of heterogeneous, independently developed bioinformatics data sources, while utilizing available tools that address accidental heterogeneity.
Metadata reconciliation and integration have been studied for a long time by the database and other research communities. The traditional approach uses mediation systems to reformulate user queries so as to reconcile metadata differences based on previously known correspondences and relationships in the metadata. This approach has not been widely successful because it requires mediation agreements that are difficult to build and maintain, particularly when the number of heterogeneous data sources is constantly growing. A more recent approach is to define "standard" metadata, which has led to the creation of many competing ontologies. Further, such standards and ontology development can be a lengthy process that lags behind the speed at which genomic data repositories, and the associated essential heterogeneity, are being created. This project proposes, in collaboration with DNA microarray scientists, to investigate a novel alternative approach to mitigating the essential heterogeneity problem.
The novel alternative approach is based on the proposition that monitoring, extraction, clustering, and appropriate visualization of metadata across disparately created data and ontology sources will identify patterns of practice and highlight evolving standards, thereby facilitating researchers' discovery of relevant, geographically dispersed data sources and their ability to formulate complex queries (transparently) across these sources. The key element of this approach is to provide the "big picture" of available resources and how they may relate to the needs of the researcher, with graduated drill-down capabilities to facilitate discovery and search. The research will draw on the project team's combined expertise, including ongoing research to facilitate semantic interoperability of inter-organizational directory metadata (Self-Organizing Maps, Latent Semantic Analysis/Latent Semantic Indexing, genetic algorithms), research in bioinformatics (text mining using functional keywords, bond energy algorithm), visualization (Stereoscopic Field Analyzer), and specific microarray research investigations.
The following will drive the research and how it will be conducted:
· The focus of the research will be on the development of appropriate techniques for clustering unstructured DNA microarray metadata and ontologies, and for visualizing them at varying levels of abstraction, such that the clusters can facilitate effective discovery of data and formulation of complex queries by the scientists.
· Prototyping will be used to carry out the research, to demonstrate the feasibility of the techniques developed, and to evaluate their effectiveness. The extraction, annotation, and monitoring of metadata from publicly available microarray data sources will be prototyped, as will a web crawler. Research prototype tools accessible from the World Wide Web will be developed to enable peer evaluation and dissemination of the research.
· Experiments will be conducted to evaluate the clustering and visualization techniques for their quality and effectiveness, and for their ability to facilitate the discovery and formulation of complex queries.
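As a rough illustration of what clustering metadata across sources might look like, the following is a minimal sketch only. It does not reproduce the project's techniques (Self-Organizing Maps, Latent Semantic Analysis, genetic algorithms); it substitutes a much simpler bag-of-words cosine similarity with a greedy single-pass grouping. The source names, field names, field descriptions, stopword list, and the 0.4 threshold are all hypothetical and chosen purely for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical metadata fields from three imaginary microarray sources.
# Keys are "source.field"; values are free-text descriptions of each field.
FIELDS = {
    "srcA.organism": "species organism name of the sample",
    "srcB.species": "organism species from which sample was taken",
    "srcA.array_platform": "microarray platform chip type",
    "srcC.chip": "chip platform used for the microarray",
    "srcB.submit_date": "date the experiment was submitted",
}

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"of", "the", "from", "which", "was", "for", "used"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(fields, threshold=0.4):
    """Greedy single pass: add each field to the first existing group that
    contains a sufficiently similar member, otherwise start a new group."""
    vectors = {name: Counter(tokenize(desc)) for name, desc in fields.items()}
    clusters = []
    for name in fields:
        for group in clusters:
            if any(cosine(vectors[name], vectors[m]) >= threshold for m in group):
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    for group in cluster(FIELDS):
        print(group)
```

In this toy data, fields with different names but overlapping descriptions (for example "organism" in one source and "species" in another) end up in the same group, which is the kind of synonym relationship a researcher would want surfaced. A realistic system would need a richer representation than raw term counts, since true synonyms often share no words at all.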
The broader impacts of this research are reflected in the investigators' ongoing practice of engaging undergraduate and graduate students in all aspects of the work, including research, development, and publication; active collaboration across institutions; and integrating research with education and the educational infrastructure. The project is expected to have broad societal impact by helping to boost the productivity of scientists engaged in developing a genetic understanding of diseases and drugs to treat them. The research is also expected to have relevance in non-bioinformatics domains that are beset with similar metadata heterogeneity problems.
Last Updated: March 2, 2006