Ontology Based Approach For Clinical Diagnosis Biology Essay



Genetically evolving diseases, their symptoms and the underlying genetic interactions are described in greater or lesser detail, depending on the experience and knowledge of molecular biology experts and on the availability of laboratory tests. An ontology is a structure that stores the domain knowledge of a particular research community, so information about diseases, symptoms and gene interactions can be formally described in ontologies. I have adapted semantic similarity metrics to measure the similarity between queries and hereditary diseases annotated with biomedical ontologies published by the Open Biological and Biomedical Ontologies (OBO) Foundry, such as the Human Phenotype Ontology (HPO), the Disease Ontology (DO) and the Symptom Ontology (SO). I also introduce a statistical model that assigns weights to the resulting similarity scores, which can then be used to rank the candidate diseases.

The advantage of a knowledge base (ontology) over a database is that an ontology is a computational representation of a domain of knowledge, based upon a controlled, standardized vocabulary for describing entities and the semantic relationships between them. In effect, an ontology is a general means of expressing the knowledge of an expert in a domain. Many ontologies are structured as a directed acyclic graph (DAG), whose nodes, also called the terms of the ontology, correspond to concepts of the domain. Following the success of the Gene Ontology project in the past decade, ontologies have been developed for many fields beyond biomedical science. In recent years domain experts have published them on the web to share their knowledge with others.
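The DAG structure described above can be illustrated with a minimal Python sketch. The term names and the PARENTS mapping are hypothetical and are not taken from any published ontology; the point is only that a term may have several parents, and that all of its ancestors can be collected by following the parent links.

```python
from typing import Dict, Set

# Hypothetical toy ontology: each term maps to its direct parents ("is_a" links).
# A term may have more than one parent, which makes the structure a DAG
# rather than a tree.
PARENTS: Dict[str, Set[str]] = {
    "abnormality": set(),
    "abnormality of the eye": {"abnormality"},
    "abnormality of the nervous system": {"abnormality"},
    "optic atrophy": {"abnormality of the eye", "abnormality of the nervous system"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the term itself plus every term reachable through parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("optic atrophy", PARENTS)))
```

Collecting ancestors in this way is the basic operation on which semantic searches over an ontology are built, because a query term can then match a disease annotation through a shared, more general term.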


This approach outperforms the simpler sequence alignment approaches of traditional Bioinformatics, which do not consider the interrelationships among diseases that domain experts have identified. It also goes beyond simple term matching: it embeds a semantic web search that takes the semantic interrelationships between terms into account.

The most important role of the physician is making the clinical diagnosis. Clinical diagnostics is often challenging, especially in medical genetics, where the differential diagnosis is complicated by the large number of Mendelian and chromosomal disorders. Each of these disorders is characterized by genetically evolving diseases. Such diseases are affected not only by single-gene variations but also by genetic interactions.

The number of genetic databases has grown drastically, and experimenters must analyze these data effectively for biological purposes. New concepts such as ontologies are therefore used for the task. In this approach, the semantic web concept is used to analyze the data in order to predict diseases and support the diagnostic process. For disease prediction, the initial input is the user's genome. Based on the mutations found in the genome, the user is given a filtered list of corresponding clinical features or symptoms and indicates which of them are present. Users enter one or more features and are presented with a list of candidate diagnoses that are characterized by some or all of the features. The system then uses semantic search routines over biomedical ontologies to predict the candidate diseases that may arise. The final result is a prioritized list of diseases.

Chapter 2 - Literature Review

Modern Bioinformatics

In recent decades, Biology has been at the center of a major model-driven paradigm shift brought about by Information Technology. This is the origin of Bioinformatics. Although Biology is an informational science in many respects, the field has rapidly become more computational and analytical, because rapid progress in genetics and biochemistry research, combined with the tools provided by modern biotechnology, has produced massive volumes of genetic and sequence data.

Figure 1: Structure of DNA

Bioinformatics has been defined as a means for analyzing, comparing, graphically displaying, modeling, systemizing, storing, searching and ultimately distributing biological information, which includes sequences, structures, function and phylogeny. Bioinformatics may thus be defined as a discipline that provides the computational foundation for the biological sciences. It comprises the study of DNA structure and function, gene and protein expression, and protein production, structure and function, as well as IT applications such as genetic regulatory systems and clinical applications. Bioinformatics needs expertise from Computer Science, Mathematics, Statistics, Medicine and Biology.

Molecular Biology

Molecular Biology is the branch of Biology that deals with the nature of biological phenomena at the molecular level through the study of the carriers of genetic information, such as DNA, RNA and proteins. Conceptually it overlaps with Biology and Chemistry, and with Biochemistry in particular.[1] Molecular Biology primarily concerns itself with understanding the interactions between the various systems of a cell, including the interactions between DNA, RNA and protein biosynthesis, and how these interactions are regulated.


Although the emerging field of genetics was guided by Mendel's law of segregation and law of independent assortment in the early twentieth century, the actual mechanisms of gene reproduction, mutation and expression remained unknown. Thomas Hunt Morgan and his colleagues utilized the fruit fly, Drosophila, as a model organism to study the relationship between the gene and the chromosomes in the hereditary process. A former student of Morgan's, Hermann J. Muller, recognized the gene as a basis of life,[2] and so set out to investigate its structure. Muller discovered the mutagenic effect of X-rays on Drosophila, and utilized this phenomenon as a tool to explore the size and nature of the gene.

Following Aristotle's two features of life, Biochemistry was concerned with nutrition, or metabolism more generally, while Molecular Biology (along with its more direct predecessor, classical genetics) investigated reproduction.[2] Biochemistry traces its roots to the animal chemistry and medical chemistry of the nineteenth century (Kohler 1982). With respect to genetic material, much of the focus of Biochemistry was on proteins and enzymes. Until the discoveries about DNA in the 1940s and 1950s, biochemists rarely concentrated on the genome because evidence was lacking. The discovery of the twenty-some amino acids, the building blocks of proteins, was a major achievement of early twentieth-century biochemistry. After Watson and Crick's discovery of the structure of DNA, Biochemistry placed increased emphasis on nucleic acids.[3]

Figure 2: Friedrich Wöhler, a German chemist who was a student of Berzelius. In attempting to prepare ammonium cyanate from silver cyanide and ammonium chloride, he accidentally synthesized urea in 1828.[4][5]

Friedrich Wöhler (Figure 2) accidentally obtained urea, CO(NH2)2, one of the most important biological compounds, while attempting to prepare ammonium cyanate in a laboratory reaction. This result, obtained in 1828, was taken as a refutation of the concept of vitalism.[4] In 1833, Anselme Payen became the first to discover an enzyme, diastase (an amylase). This research was a major turning point that opened the way to biochemical research. Later, in 1896, Eduard Buchner demonstrated that a complex biochemical process, alcoholic fermentation, can take place outside a living cell, in cell extracts of yeast. With the development of new techniques such as chromatography, X-ray diffraction, NMR spectroscopy, radioisotopic labeling, molecular dynamics simulations and electron microscopy, biomedical research expanded widely in the mid-twentieth century. Once the gene had been discovered, this branch of Biochemistry became known as Molecular Biology. In the 1950s, James D. Watson, Francis Crick, Rosalind Franklin, and Maurice Wilkins were instrumental in solving the structure of DNA and suggesting its relationship with the genetic transfer of information.

Information Technology

An article published in Harvard Business Review in 1958 coined a single term for these technologies: "The new technology does not yet have a single established name. We shall call it information technology." This was the first documented use of the term information technology.[6] Information Technology is the acquisition, processing, storage and dissemination of vocal, pictorial, textual and numerical information by a microelectronics-based combination of computing and telecommunications. As the domain of Molecular Biology broadened, it combined with Information Technology; the result can be summarized as Biology + Information Technology = Bioinformatics.

Human Genetics

Figure 3: Molecular structure of the living human, showing the cell, chromosome pair, genes, DNA and bases.

The genome of Homo sapiens, the human genome, is stored on 23 chromosome pairs. Twenty-two of these are autosomal chromosome pairs, while the remaining one is the sex-determining pair.[7] The molecular structure of the human being, including the cell, chromosomes, DNA (deoxyribonucleic acid) and bases, is depicted in Figure 3. The reference sequences (the standard euchromatic human genome) used worldwide in biomedical research are supplied by the Human Genome Project. According to the Human Genome Project, the genome comprises just over three billion DNA base pairs, and over twenty-three thousand protein-coding genes have been discovered.

Deoxyribonucleic acid (DNA) is a nucleic acid that stores the instructions used in the development and functioning of living organisms. Put simply, DNA is the blueprint of a living organism. DNA is a long polymer made from repeating units called nucleotides; its structure was described by James D. Watson and Francis Crick. A single nucleotide (Figure 4) consists of a phosphate group, a sugar and a nitrogenous base. There are four types of bases: adenine, cytosine, guanine and thymine, abbreviated in Bioinformatics as A, C, G and T respectively.

According to the molecular biology literature, these nucleotides are arranged in a double helix. The double helix structure of DNA is stabilized by hydrogen bonds between the bases attached to the two strands. Without these hydrogen bonds, life as we know it could not exist on Earth. The double helix structure also makes DNA suitable for the long-term storage of genetic information.
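The hydrogen bonds between the two strands follow the standard Watson-Crick pairing rule, in which A pairs with T and C pairs with G. That rule is not spelled out in the text above, so it is stated here as background. The following minimal Python sketch, with hypothetical function names, shows how the partner strand can be derived from a sequence of the four bases under that rule.

```python
# Watson-Crick pairing: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))
    print(reverse_complement("ATGC"))
```

Because the two strands run in opposite directions, the reverse complement is the form in which the partner strand is normally written.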

Figure 4: Structure of a nucleotide. It consists of a phosphate group, a sugar and a nitrogenous base.

Genes (Figure 3) can be identified as the carriers of genetic information from ancestors to descendants. A modern definition of a gene is "a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions".[9][10] Today a large number of Molecular Biology experiments and research projects concerning genes and DNA are ongoing.

Genotype and Phenotype

Wilhelm Johannsen was a Danish botanist, plant physiologist and geneticist who introduced the terms genotype and phenotype in his paper "Om arvelighed i samfund og i rene linier" and in his handbook "Arvelighedslærens Elementer".

Genotype is the internally coded, inheritable information carried by all living organisms. This information is used as a blueprint or set of instructions for building and maintaining a living creature. These DNA instructions are found within almost all cells; they are written in the genetic code, they are copied at the time of cell division or reproduction, and they are passed on to the next generation. These DNA instructions are involved in all aspects of the life of an organism. DNA sequences control everything from the formation of protein macromolecules to the regulation of metabolism and synthesis.[12]

The human phenotype is the outward, physical manifestation of the organism: the physical parts, the sum of the atoms, molecules, macromolecules, cells, structures, energy utilization, metabolism, tissues, organs, reflexes and behaviors; anything that is part of the observable structure, function or behavior of a living organism.[11][12] From these definitions, the relationship between the two can be stated as: Phenotype = Genotype + development in the respective environment.


Many definitions of disease are encountered, and different journals and books present different aspects of disease. One definition is that a disease is any disturbance or anomaly in the normal functioning of the body that probably has a specific cause and identifiable symptoms.[19] Diseases are among the factors that threaten a properly functional life. Throughout history, epidemics have caused the extinction of whole populations. A disease is an abnormal condition affecting the body of an organism, and it is often construed to be a medical condition associated with specific symptoms and signs. Over the last century, humans have discovered many microorganisms that cause diseases in humans and animals, and have learned how to protect themselves from them by either prevention or treatment. It is hard to count the diseases identified in the human body, and categorizing them in a principled manner is difficult because so many have been discovered.

Genetic diseases

Genetic diseases are a specialization of the domain of disease. After the structure of DNA, the blueprint of the human organism, was discovered in the mid-twentieth century, a great deal of research was done[14-18] on how diseases could be inherited through this blueprint. These inherited diseases are called genetic diseases. According to OMIM, over 5000 such diseases have been discovered to date. Essentially, particular genetic patterns give rise to a particular disease: when a genome is compared with the standard sequences of the HGP, any genetic disorders found may contribute to such a disease.

Semantic Web

In May 2001, Tim Berners-Lee, James Hendler and Ora Lassila, in their Scientific American article "The Semantic Web", defined the Semantic Web as an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work together.[20]

The Semantic Web is also identified as Web 3.0 in various articles and by publishers involved in making the web more attractive to users.[21] For instance, in an article published in The New York Times in November 2007, reporter John Markoff stated that "commercial interest in Web 3.0 or the 'Semantic Web,' for the idea of adding meaning is only now emerging." This characterization of the Semantic Web as Web 3.0 caused great confusion about the relationship between the Semantic Web and the Web itself, as well as between the Semantic Web and some aspects of the so-called Web 2.0. Some researchers therefore wanted to reject the term "Web 3.0" as too business-oriented; others felt that the vision in the article was only part of the larger Semantic Web vision, and still others felt that, whatever it was called, the Semantic Web's arrival in the business section of The New York Times reflected an important coming of age.[20,21]

Semantic Web researchers, in contrast, accept that paradoxes and unanswered questions are a price that must be paid to achieve versatility. We make the language for the rules as expressive as needed to allow the Web to reason as widely as desired. The challenge of the Semantic Web, therefore, is to provide a language that expresses both data and rules for reasoning about the data and that allows rules from any existing knowledge-representation system to be exported onto the Web.

Semantic Web Technologies

It is important to review the technologies and standards used to realize the Semantic Web concept. Some of the standards used worldwide, grouped by standards body, are:

W3C (World Wide Web Consortium): RIF (Rule Interchange Format)

ISO (International Organization for Standardization): Common Logic (CLIF); ISO/IEC 11179 Metadata Registry Standard

OMG (Object Management Group): Ontology Definition Metamodel (ODM)

Tim Berners-Lee suggested separating the development of the syntax and the semantics of the Semantic Web language. The Resource Description Framework (RDF) is the syntax for Semantic Web documents, and it uses links to ontologies. The Web Ontology Language (OWL) is a language for ontology description.

The timeline [22] of the Semantic Web can be summarized as:

1994: Foundation of W3C. They develop standards such as HTML, URL, XML, HTTP, PNG, SVG, CSS

1998: Tim Berners-Lee published "Semantic Web Road Map"

1999: W3C launched groups for designing Semantic Web foundations, the first version of RDF is published

2000: American defense research institution started investigations for ontology descriptions (DAML+OIL project)

2001: "The Semantic Web" paper in Scientific American

2004: New version of RDF, ontology description language OWL

2006: Candidate recommendation of SPARQL, a query language for Semantic Web


As mentioned earlier, two important technologies for developing the Semantic Web are the eXtensible Markup Language (XML) and the Resource Description Framework (RDF). XML lets everyone create their own tags, hidden labels such as <zip code> or <alma mater> that annotate Web pages or sections of text on a page. Scripts, or programs, can make use of these tags in sophisticated ways, but the script writer has to know what the page writer uses each tag for. In short, XML allows users to add arbitrary structure to their documents but says nothing about what the structures mean [23].

Meaning is expressed by RDF, which encodes it in sets of triples, each triple being rather like the subject, verb and object of an elementary sentence. These triples can be written using XML tags. In RDF, a document makes assertions that particular things have properties with certain values. This structure turns out to be a natural way to describe the vast majority of the data processed by machines.
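The triple structure can be illustrated without any RDF library. In the minimal Python sketch below, each assertion is a (subject, predicate, object) tuple and a small matching function plays the role of a query. The "ex:" identifiers and the match function are hypothetical and chosen only for illustration; a real system would use full URIs and a query language such as SPARQL.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical triples: each one reads like a subject-verb-object sentence.
TRIPLES: List[Triple] = [
    ("ex:DiseaseA", "ex:hasSymptom", "ex:Fever"),
    ("ex:DiseaseA", "ex:associatedGene", "ex:GeneX"),
    ("ex:DiseaseB", "ex:hasSymptom", "ex:Fever"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


if __name__ == "__main__":
    # Which subjects are asserted to have the symptom ex:Fever?
    for s, _, _ in match(TRIPLES, predicate="ex:hasSymptom", obj="ex:Fever"):
        print(s)
```

Leaving one or two positions of the pattern open is what turns a set of assertions into something that can be queried.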

Knowledge represented in a formalized manner rests on a conceptualization: the objects, concepts and other entities of interest, and the relationships among them. A conceptualization is an abstract, simplified view of the world that we wish to represent for some purpose. Every knowledge base, knowledge-based system, or knowledge-level agent is committed to some conceptualization, explicitly or implicitly. An ontology is an explicit specification of a conceptualization. The term ontology derives from philosophy, where it means a systematic account of existence.

When the knowledge of a domain is represented in a declarative formalism, the set of objects that can be represented is called the universe of discourse. This set of objects, and the describable relationships among them, are reflected in the representational vocabulary with which a knowledge-based program represents knowledge. Thus, in the context of Artificial Intelligence, the ontology of a program can be described by defining a set of representational terms. In such an ontology, definitions associate the names of entities in the universe of discourse (e.g., classes, relations, functions, or other objects) with human-readable text describing what the names mean, and standardized axioms that constrain the interpretation and well-formed use of those terms. Formally, an ontology is the statement of a logical theory.

Biomedical Ontology

As mentioned earlier, an ontology represents the knowledge of a domain, such as biomedical knowledge in the biomedical domain. Researchers, domain experts and developers have defined a wide range of biomedical ontologies and published them through standardized processes. One publisher of biomedical ontologies is the Open Biological and Biomedical Ontologies (OBO) Foundry (http://obofoundry.org/). The OBO Foundry is a collaborative experiment involving developers of science-based ontologies who are establishing a set of principles for ontology development, with the goal of creating a suite of orthogonal interoperable reference ontologies in the biomedical domain. The groups developing ontologies who have expressed an interest in this goal are listed below, followed by other relevant efforts in this domain. The site http://obofoundry.org/ lists many ontologies published by people around the world (Table 1).





Table 1: Ontologies published through the OBO Foundry (title, with domain where listed)

Biological process (domain: biological process)
Cellular component
Chemical entities of biological interest
Molecular function (domain: biological function)
Phenotypic quality
PRotein Ontology (PRO)
Xenopus anatomy and development





Related Research Works

Annotating the human genome with Disease Ontology

John D Osborne, Jared Flatow, Michelle Holko, Simon M Lin, Warren A Kibbe, Lihua (Julie) Zhu, Maria I Danila, Gang Feng and Rex L Chisholm

In this research the authors used the Unified Medical Language System (UMLS) MetaMap Transfer tool (MMTx) to discover gene-disease relationships from the GeneRIF database. The human genome has been extensively annotated with Gene Ontology terms for biological functions, but only minimally computationally annotated for diseases, so they utilized a comprehensive subset of UMLS that is disease-focused and structured as a directed acyclic graph (the Disease Ontology) to filter and interpret the results from MMTx.

From disease ontology to disease-ontology lite: statistical methods to adapt a general-purpose ontology for the test of gene-ontology associations

Pan Du, Gang Feng, Jared Flatow, Jie Song, Michelle Holko, Warren A. Kibbe and Simon M. Lin

The authors proposed statistical methods to adapt the general-purpose OBO Foundry Disease Ontology (DO) for the identification of gene-disease associations. This requires a simplified definition of disease categories derived from the implicated genes. On the assumption that DO terms with similar associated genes are closely related, they grouped DO terms based on the similarity of their gene-to-DO mapping profiles. Two types of binary distance metrics are defined to measure the overall and subset similarity between DO terms. A compactness-scalable fuzzy clustering method is then applied to group similar DO terms. To reduce false clustering, the semantic similarities between DO terms are also used to constrain the clustering results. In this way the DO terms are aggregated and redundant DO terms are largely removed. Using these methods, they constructed a simplified vocabulary list from the DO called Disease Ontology Lite.
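The summary above does not give the exact form of the two binary distance metrics, so the following Python sketch uses two common stand-ins as an assumption: the Jaccard distance for overall similarity, and an overlap-based distance for subset similarity. Each DO term is represented by the set of genes mapped to it; the gene names and function names are hypothetical.

```python
from typing import Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Overall dissimilarity: 1 - |A and B| / |A or B| (assumed stand-in metric)."""
    union = a | b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(a & b) / len(union)


def subset_distance(a: Set[str], b: Set[str]) -> float:
    """Subset dissimilarity: 1 - |A and B| / min(|A|, |B|) (assumed stand-in metric).

    The value is 0 when one gene set is fully contained in the other.
    """
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0  # an empty set is trivially contained in any set
    return 1.0 - len(a & b) / smaller


if __name__ == "__main__":
    # Hypothetical gene-to-DO mapping profiles for two DO terms.
    broad_term = {"G1", "G2", "G3", "G4"}
    narrow_term = {"G1", "G2"}
    print(jaccard_distance(broad_term, narrow_term))
    print(subset_distance(broad_term, narrow_term))
```

The two measures answer different questions: a narrow term nested inside a broad one is far from it overall but at distance zero in the subset sense, which is the kind of redundancy a clustering step could then merge.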

Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation

P. W. Lord, R. D. Stevens, A. Brass and C. A. Goble

In this paper the authors investigate the use of ontological annotation to measure the similarities in knowledge content, or 'semantic similarity', between entries in a data resource. These measures allow a bioinformatician to perform a similarity measure over annotation in a manner analogous to those performed over sequences. A measure of semantic similarity for the knowledge component of bioinformatics resources should afford a biologist a new tool in their repertoire of analyses. They present the results from experiments that investigate the validity of using semantic similarity by comparison with sequence similarity.

Clinical Diagnostics in Human Genetics with Semantic Similarity Searches in Ontologies

Sebastian Köhler, Marcel H. Schulz, Peter Krawitz, Sebastian Bauer, Sandra Dölken, Claus E. Ott, Christine Mundlos, Denise Horn, Stefan Mundlos, and Peter N. Robinson

In this research they used semantic similarity metrics to measure phenotypic similarity between queries and hereditary diseases annotated with the use of the Human Phenotype Ontology. They have developed a statistical model to assign p values to the resulting similarity scores, which can be used to rank the candidate diseases. They show that this diagnosis approach outperforms simpler term-matching approaches that do not take the semantic interrelationships between terms into account.
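The idea of ranking diseases by semantic similarity can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy ontology, the disease annotations and all function names are hypothetical, the term similarity is the information content of the most informative common ancestor (a Resnik-style measure, assumed here), and the p-value model is omitted.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical toy ontology (term -> direct parents) and disease annotations.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "eye": {"root"},
    "heart": {"root"},
    "cataract": {"eye"},
    "myopia": {"eye"},
    "arrhythmia": {"heart"},
}
DISEASES: Dict[str, Set[str]] = {
    "disease_1": {"cataract", "arrhythmia"},
    "disease_2": {"myopia"},
    "disease_3": {"arrhythmia"},
}


def ancestors(term: str) -> Set[str]:
    """Return the term plus all terms reachable through parent links."""
    result = {term}
    for parent in PARENTS[term]:
        result |= ancestors(parent)
    return result


def information_content() -> Dict[str, float]:
    """IC(t) = -log(share of diseases annotated with t or one of its descendants)."""
    counts = {term: 0 for term in PARENTS}
    for terms in DISEASES.values():
        implied = set().union(*(ancestors(t) for t in terms))
        for term in implied:
            counts[term] += 1
    total = len(DISEASES)
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


def term_similarity(a: str, b: str, ic: Dict[str, float]) -> float:
    """IC of the most informative ancestor shared by a and b."""
    return max(ic[t] for t in ancestors(a) & ancestors(b))


def rank(query: Set[str]) -> List[Tuple[str, float]]:
    """Score each disease by the mean best-match similarity of the query terms."""
    ic = information_content()
    scores = []
    for disease, terms in DISEASES.items():
        best = [max(term_similarity(q, t, ic) for t in terms) for q in query]
        scores.append((disease, sum(best) / len(best)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for disease, score in rank({"cataract"}):
        print(disease, round(score, 3))
```

With the query "cataract", exact term matching would give disease_2 no credit at all, whereas here it ranks above disease_3 because "myopia" and "cataract" share the ancestor "eye". This is the sense in which semantic interrelationships between terms are taken into account.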

A semantic web approach applied to integrative bioinformatics experimentation: a biological use case with genomics data.

Lennart J. G. Post, Marco Roos, M. Scott Marshall, Roel van Driel and Timo M. Breit

The authors constructed four OWL knowledge models and two RDFS data models, transformed and mapped relevant data to the data models, linked the data models to the knowledge models using linkage statements, and ran semantic queries. Their biological use case demonstrates the relevance of these kinds of integrative bioinformatics experiments. Their findings show high startup costs for the SWEDI approach, but straightforward extension with similar data.

Finding disease specific alterations in the co-expression of genes

Dennis Kostka and Rainer Spang

In this research the authors introduce a score for differential co-expression and suggest a computationally efficient algorithm for finding high-scoring sets of genes. The use of their method is demonstrated in simulations and on real expression data from a clinical study.

An example of food ontology for diabetes control

Jaime Cantais, David Dominguez, Valeria Gigante, Loredana Laera, and Valentina Tamma

This paper describes the authors' experience in the rapid prototyping of a food ontology oriented to the nutritional and health care domain, which is used to share knowledge between the different stakeholders involved in the PIPS project.

Towards a Semantic Web for Bioinformatics - ongoing research

With the explosion of online accessible bioinformatics data and tools, systems integration has become very important for further progress. Currently, bioinformatics relies heavily on the Web. But the Web is geared towards human interaction rather than automated processing. The vision of a Semantic Web facilitates this automation by annotating web content and by providing adequate reasoning languages.

Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions

Soumya Raychaudhuri, Robert M. Plenge, Elizabeth J. Rossin, Aylwin C. Y. Ng, International Schizophrenia Consortium, Shaun M. Purcell, Pamela Sklar, Edward M. Scolnick, Ramnik J. Xavier, David Altshuler, Mark J. Daly

The authors describe a statistical method, Gene Relationships Among Implicated Loci (GRAIL), which takes a list of disease regions and automatically assesses the degree of relatedness of implicated genes using 250,000 PubMed abstracts. They first evaluated GRAIL by assessing its ability to identify subsets of highly related genes in common pathways from validated lipid and height SNP associations from recent genome-wide studies. They then tested GRAIL by assessing its ability to separate true disease regions from many false positive disease regions in two separate practical applications in human genetics. First, they took 74 nominally associated Crohn's disease SNPs and applied GRAIL to identify a subset of 13 SNPs with highly related genes. Of these, ten were convincingly validated in follow-up genotyping; genotyping results for the remaining three were inconclusive. Next, they applied GRAIL to 165 rare deletion events seen in schizophrenia cases.

Aggregation of bioinformatics data using Semantic Web technology

Susie Stephens, David LaVigna, Mike DiLascio, Joanne Luciano

The integration of disparate biomedical data continues to be a challenge for drug discovery efforts. Semantic Web technologies provide the capability to more easily aggregate data and thus can be utilized to improve the efficiency of drug discovery. The authors describe an implementation of a Semantic Web infrastructure that utilizes the scalable Oracle Resource Description Framework (RDF) Data Model as the repository and Seamark Navigator for browsing and searching the data. The paper presents a use case that identifies gene biomarkers of interest and uses the Semantic Web infrastructure to annotate the data.

Biomedical Ontologies

Olivier Bodenreider and Anita Burgun

Ontology design is an important aspect of medical informatics, and reusability is a key issue that is determined by the level of compatibility among ontology concepts and among the theories of the biomedical domain they convey. In this article, the authors examine OpenGALEN, the UMLS Semantic Network, SNOMED CT, the Foundational Model of Anatomy, and the MENELAS ontology, as well as descriptions of the biomedical domain in two general ontologies, OpenCyc and WordNet. Using the representation of Blood in each system, they examine issues in compatibility among these ontologies.