Mining Disease Related Knowledge From Biomedical Literature

Abstract. TO DO

Keywords: text mining, biomedical literature, information extraction, disease.


Introduction

The explosion in the amount of biomedical literature in recent years has created a need for biomedical text mining solutions that support effective management and reuse of this knowledge. In response, the text mining community has come forward with solutions ranging from biomedical named entity recognition (BNER) to literature-based discovery (Zweigenbaum et al., 2007; Dai et al., 2009). Although a few well-defined tasks, such as gene or protein mention recognition, have achieved a sufficient level of maturity, most biomedical text mining solutions are far from being robust and practically usable. Biomedical texts are not like general language texts. From the vocabulary to the valency of verbs, these texts are inherently complex. Moreover, because wrong information can introduce serious health-related risks, it is critical to provide information with the maximum possible accuracy. So far, most efforts in biomedical text mining have focused on developing applications that support the curation of organism databases (Zweigenbaum et al., 2007). Nevertheless, with the advent of eHealth, future solutions are likely to be more devoted to feeding knowledge from the biomedical literature to end-user medical applications.

While there exist keyword-based search systems such as Entrez, which retrieve documents from the PubMed database, information extraction based solutions, which represent the desired knowledge in the biomedical literature in a structured format that can be used by applications such as question answering, data mining and the semantic web, are much more desirable. In my PhD research, I plan to develop novel methods for mining associations between diseases and other medical concepts in biomedical literature. The goal is to make health-related end-user applications more accurate and effective by providing enriched, structured knowledge from biomedical literature, which will ultimately be used by health experts such as doctors and biomedical researchers to deliver better health care.

The remainder of this document is organised as follows. Section 2 briefly discusses the state of the art. Section 3 formally defines what I would like to do in my PhD research. In sections 4, 5 and 6, I lay out my plans for approaching the problem. Finally, I conclude with a summary of my proposal in section 7.

State of the Art

In this section, I review practices in biomedical text mining that are related to my PhD topic. Section 2.1 discusses biomedical named entity recognition (BNER), which is a fundamental step in building a biomedical text mining solution. Following that, section 2.2 discusses the work published so far on relation extraction between diseases and other medical concepts.

Biomedical Named Entity Recognition

Named entity recognition is the problem of locating the boundaries of entity mentions in a text and tagging them with their corresponding semantic types. While most of the work on BNER has focused on tagging gene and protein mentions, other concepts (such as diseases) have not drawn enough attention (Jimeno et al., 2008). Recently, Leaman et al. (2009) released a disease-annotated corpus to promote research on disease mention recognition and reported the performance of their own BNER system (which was primarily built for gene/protein mention tagging). State-of-the-art BNER systems, which are based on machine learning (ML) techniques such as conditional random fields (CRFs) and support vector machines (SVMs) (Dai et al., 2009), often combine the results of multiple classifiers (Torii et al., 2009). This approach makes the systems complex and computationally expensive. In my opinion, the relative gain in performance from using multiple classifiers instead of a single classifier is not large enough to justify their use (Torii et al., 2009). It is also not clear how these classifiers complement each other, which puts a big question mark over reliable error analysis and further improvement of performance.
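To make the notion of tagging mention boundaries with semantic types concrete, the following is a minimal sketch in Python. It assumes the common BIO labelling scheme (B = beginning, I = inside, O = outside), which this proposal does not itself prescribe. The function name `spans_to_bio`, the `Disease` label and the example sentence are made up for illustration.

```python
def spans_to_bio(tokens, spans):
    """Turn entity spans into one BIO tag per token.

    Each span is (start, end_exclusive, entity_type) over token indices.
    The BIO scheme is an assumption made for this illustration.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the mention
    return tags


# Illustrative sentence with one disease mention covering tokens 2..3.
tokens = ["Mutations", "cause", "cystic", "fibrosis", "."]
tags = spans_to_bio(tokens, [(2, 4, "Disease")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In this representation the boundary of a mention is given by the position of its `B-` tag and the run of `I-` tags that follows, while the semantic type is carried by the tag suffix.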

Extraction of relations of diseases with other concepts

There has been some interest in recent years in finding associations between genes and diseases in biomedical literature. One notable work, proposed by Gonzalez et al. (2007), was based on dependency relations and the ranking of disease-related genes and gene products, and it has inspired other directions of research such as the identification of disease candidate genes (Chen et al., 2009). However, apart from gene-disease relation extraction, there is hardly any reported work on the extraction of relations between diseases and other concepts. The only relevant work, to the best of my knowledge, is reported by Bundschus et al. (2008), who used a CRF-based approach for disease-treatment (and also disease-gene) relation extraction. However, there is considerable room for improvement in their work. First of all, they took for granted that a BNER system used for gene mention identification would, with exactly the same set of features, be equally robust for disease and treatment mention identification, which resulted in low performance in identifying disease and treatment mentions. They also did not consider syntactic dependency relations, which might provide very useful information.

Problem Statement

The problem that I intend to address during my PhD research is "how to develop novel methods for mining associations of diseases with other biomedical concepts in biomedical literature". The concepts that I am interested in considering for the extraction of relations with diseases include (i) treatments, (ii) symptoms and (iii) drugs.

To be more precise, I would like to investigate the following research questions to address the above-mentioned problem:

  1. Which features can be useful for high-performance named entity recognition for concepts such as diseases, treatments, drugs and symptoms?
  2. What are the general expressions (i.e. patterns of the contextual linguistic structures) in the biomedical literature for describing relationships between diseases and other concepts?
  3. How can we extract relationships between diseases and other concepts with high performance?

As mentioned earlier in Section 2.1, most BNER work has focused on concepts such as genes and proteins. There are compelling reasons to believe that various issues regarding the recognition of gene and protein mentions would not apply to the other concepts. Addressing the first question is therefore fundamental before approaching the other questions.

The second research question aims to find patterns of the contextual linguistic structures, which would be somewhat similar to the patterns described by Bunescu and Mooney (2006) for protein-protein interaction. These patterns will be used in searching for answers to the third research question, which will be the core part of my PhD research.


My approach to the aforementioned problem is to develop methods using supervised machine learning techniques. A number of annotated corpora (e.g. AZDC (Leaman et al., 2009) and BioText) are already freely available, and I will be able to use them for my experiments. Previous research argues that the greatest potential for improving disease-related relation extraction lies in the correct identification of concept mentions (Bundschus et al., 2008). Hence, the central piece of my approach is to build a BNER component that can identify diseases and other concepts of interest in biomedical literature with high accuracy. As mentioned in Section 2.1, the multi-classifier approaches of state-of-the-art BNER systems do not gain a performance boost that justifies the cost of a complex and computationally expensive system architecture. I believe a more pragmatic approach is to invest more effort in selecting a better and more appropriate feature set, which may lead to a single-classifier system with equally high performance whose results are easier to analyse. My initial task will therefore be to extract rich, concept-specific orthographic, contextual and linguistic features for the corresponding concept mentions and use them to train machine learning models. I am also interested in using dependency relation information.
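As an illustration of what orthographic and contextual features for a single token could look like, here is a minimal Python sketch. The particular features, the names `token_features` and `word_shape`, the window size and the `<PAD>` marker are all assumptions chosen for this example; the actual feature set is the subject of the first research question and is not fixed here.

```python
import re


def word_shape(token):
    """Collapse a token to a coarse shape, e.g. 'BRCA1' -> 'A0'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    return re.sub(r"(.)\1+", r"\1", shape)  # squeeze repeated symbols


def token_features(tokens, i, window=2):
    """Build a feature dictionary for tokens[i] (illustrative features only)."""
    token = tokens[i]
    features = {
        # orthographic features
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
        "suffix3": token[-3:].lower(),
        "shape": word_shape(token),
    }
    # contextual features: neighbouring tokens within the window
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        features[f"ctx[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return features


tokens = ["Type", "2", "diabetes", "mellitus"]
features = token_features(tokens, 1)
print(features)
```

A dictionary per token, collected into one list per sentence, is the kind of input that sequence labelling toolkits commonly accept, which is why the features are kept as plain key-value pairs.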

To simplify the task of relation extraction, I will concentrate on finding relations among entities that reside in the same sentence. The initial step is to learn patterns represented as sequences of words and phrasal chunks. These patterns, along with other textual evidence (e.g. dependency relations and cue phrases) from the words and their context, will be exploited as potential features by the machine learning models. Recently, Reichartz et al. (2009) showed that a composite kernel of a phrase grammar parse tree and a dependency parse tree achieves excellent results on relation extraction from newspaper text. I would like to investigate the potential of this strategy for my own relation extraction task.

Plan for BNER

The BNER component that I plan to develop will have the following properties:

  • Single classifier based approach
  • Use of orthographic, contextual and linguistic features
  • More emphasis on contextual features
  • Extensive tokenization of the text (e.g. separating punctuation characters and digits from alphabetic character sequences)
  • Normalization (i.e. detecting spelling variations, e.g. "localisation" vs "localization", using tools such as the SPECIALIST lexicon tools, and then representing them in lemmatized form) and simplification of the tokens (e.g. replacing all numbers with a single digit)
  • Use of external dictionaries such as the UMLS Metathesaurus (Bodenreider, 2004)
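The tokenization and simplification properties listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression handles ASCII letters and digits, the choice of "0" as the single replacement digit is arbitrary, and the spelling normalization step is left out because it depends on an external lexical resource.

```python
import re

# Alphabetic runs, digit runs and single punctuation characters become
# separate tokens. Non-ASCII letters fall into the last alternative and are
# split character by character (a known limitation of this sketch).
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")


def tokenize(text):
    """Split text so that punctuation and digits are separated from letters."""
    return TOKEN_RE.findall(text)


def simplify(token):
    """Replace any all-digit token with the single digit '0'."""
    return "0" if re.fullmatch(r"[0-9]+", token) else token


raw = "IL-2 (interleukin 2) in stage 12b"
tokens = tokenize(raw)
simplified = [simplify(t) for t in tokens]
print(tokens)
print(simplified)
```

Splitting this aggressively produces more tokens per sentence, but it means that variants such as "IL-2" and "IL 2" yield the same alphabetic and numeric tokens and differ only in the punctuation token between them.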

I intend to use CRFs for training models, as they are a state-of-the-art machine learning technique for a variety of text processing tasks including named entity recognition (Klinger and Tomanek, 2007). Model training will be preceded by various pre-processing steps (e.g. tokenization). After the mentions are identified using the learned models, some post-processing techniques (e.g. fixing parenthesis mismatches) will be applied to reduce the number of wrong identifications.
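One simple way to handle parenthesis mismatches during post-processing is to discard predicted mentions whose brackets do not balance. The Python sketch below shows this heuristic only as an assumption about how the step might be realised; repairing the boundaries instead of discarding the mention is an equally plausible alternative. The names `is_balanced` and `drop_unbalanced` are hypothetical.

```python
def is_balanced(text, pairs=(("(", ")"), ("[", "]"))):
    """Return True if every bracket in text is properly opened and closed."""
    openers = {opener for opener, _ in pairs}
    opener_of = {closer: opener for opener, closer in pairs}
    stack = []
    for ch in text:
        if ch in openers:
            stack.append(ch)
        elif ch in opener_of:
            # a closer must match the most recent unmatched opener
            if not stack or stack.pop() != opener_of[ch]:
                return False
    return not stack  # leftover openers mean the mention is unbalanced


def drop_unbalanced(mentions):
    """Keep only predicted mentions with balanced brackets."""
    return [m for m in mentions if is_balanced(m)]


predicted = [
    "acute lymphoblastic leukemia (ALL)",
    "leukemia (ALL",   # truncated right boundary
    "ALL)",            # truncated left boundary
]
kept = drop_unbalanced(predicted)
print(kept)
```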

Plan for Relation Extraction

To learn the patterns of the contextual linguistic structures of the various relations (i.e. disease-drug, disease-treatment and disease-symptom), I will carry out corpus analyses of the training data. Only sentences in which relations exist will be considered. These sentences will be parsed using a syntactic parser, and the word sequences and other properties (e.g. phrase chunks) will then be analysed to create a list of potential patterns. All these patterns and other potential features (e.g. dependency relations) will be used to train a CRF model for relation extraction.
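As a rough illustration of a word-sequence pattern, the Python sketch below replaces two annotated mentions with type placeholders and keeps the words between them. It is a simplification made for this example: it ignores phrase chunks and parser output, and the function name `pattern_between`, the placeholder format and the example sentence are hypothetical.

```python
from collections import Counter


def pattern_between(tokens, span_a, span_b):
    """Return the word sequence linking two mentions in one sentence.

    Each span is (start, end_exclusive, entity_type) over token indices.
    Mentions are replaced by placeholders such as '<Disease>'.
    """
    first, second = sorted([span_a, span_b])
    if first[1] > second[0]:
        raise ValueError("mention spans must not overlap")
    between = [t.lower() for t in tokens[first[1]:second[0]]]
    return tuple([f"<{first[2]}>"] + between + [f"<{second[2]}>"])


# Illustrative training sentence with a drug and a disease mention.
tokens = ["Metformin", "is", "used", "to", "treat", "type", "2", "diabetes", "."]
pattern = pattern_between(tokens, (5, 8, "Disease"), (0, 1, "Drug"))
print(pattern)

# Counting patterns over a corpus gives a ranked list of candidates.
pattern_counts = Counter([pattern])
print(pattern_counts.most_common(1))
```

Returning the pattern as a tuple makes it hashable, so frequent patterns across the training sentences can be counted directly.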

Another CRF model will be built using n-grams of the words of the training sentences. This second model will be used to decide which test sentences are candidates for relation extraction. Once these candidate sentences are identified, the BNER component will be used to annotate concept mentions. The sentences will then be filtered again: a sentence is discarded unless it contains at least two concept mentions, one of which is a disease and the other a non-disease concept. Finally, the first CRF model will be used to extract relations.
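The mention-based filtering step in this pipeline can be sketched as follows. The Python code covers only that filter; the two CRF models and the BNER component are outside its scope, and the mentions are supplied by hand. The `Mention` type, the entity labels and the example sentences are assumptions made for illustration.

```python
from itertools import product
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    text: str
    etype: str  # hypothetical labels: "Disease", "Drug", "Treatment", "Symptom"


def candidate_pairs(mentions: List[Mention]) -> List[Tuple[Mention, Mention]]:
    """Pair every disease mention with every non-disease mention."""
    diseases = [m for m in mentions if m.etype == "Disease"]
    others = [m for m in mentions if m.etype != "Disease"]
    return list(product(diseases, others))


def filter_sentences(annotated):
    """Keep only sentences with at least one (disease, non-disease) pair."""
    kept = []
    for sentence, mentions in annotated:
        pairs = candidate_pairs(mentions)
        if pairs:
            kept.append((sentence, pairs))
    return kept


# Hand-annotated stand-ins for the output of the BNER component.
annotated = [
    ("Metformin is used to treat type 2 diabetes.",
     [Mention("Metformin", "Drug"), Mention("type 2 diabetes", "Disease")]),
    ("Asthma and eczema often co-occur.",
     [Mention("Asthma", "Disease"), Mention("eczema", "Disease")]),
]
kept = filter_sentences(annotated)
for sentence, pairs in kept:
    print(sentence, pairs)
```

The second sentence is dropped because it contains two disease mentions but no non-disease concept, so it offers no candidate pair for the relation extraction model.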

Additionally, I intend to use a composite kernel of a phrase grammar parse tree and a dependency parse tree for relation extraction. This approach has been effective in other domains such as newspaper articles (Reichartz et al., 2009). It will therefore be interesting to see whether the CRF-based approach and the composite kernel based approach can benefit from each other.


Conclusion

I have presented my PhD research proposal in the field of biomedical text mining. The main goal is to extract disease-related knowledge from the huge amount of available biomedical literature in a structured format so that it can be utilized by health-related end-user applications. In addition, I plan to develop a BNER component specifically tuned for tagging mentions of diseases and other concepts of interest, which are not as well studied as genes and proteins in BNER.


References

  • Bodenreider, O. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267-270, January.
  • Bundschus, M., Dejori, M., Stetter, M., Tresp, V., Kriegel, H. 2008. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics, 9:207.
  • Bunescu, R., Mooney, R. 2006. Subsequence kernels for relation extraction. In Proceedings of the 19th Conference on Neural Information Processing Systems, pages 171-178.
  • Chen, J., Aronow, B., Jegga, A. 2009. Disease candidate gene identification and prioritization using protein interaction networks. BMC Bioinformatics, 10(1):73.
  • Dai, H., Chang, Y., Tsai, R., Hsu, W. 2009. New challenges for biological text-mining in the next decade. Journal of Computer Science and Technology, 25(1):169-179.
  • Gonzalez, G., Uribe, J., Tari, L., Brophy, C., Baral, C. 2007. Mining gene-disease relationships from biomedical literature: weighting protein-protein interactions and connectivity measures. Pac Symp Biocomput, pages 28-39.
  • Jimeno, A., Jiménez-Ruiz, E., Lee, V., Gaudan, S., Berlanga, R., Rebholz-Schuhmann, D. 2008. Assessment of disease named entity recognition on a corpus of annotated sentences. BMC Bioinformatics, 9(S-3).
  • Klinger, R., Tomanek, K. 2007. Classical Probabilistic Models and Conditional Random Fields. Technical Report TR07-2-013, Department of Computer Science, Dortmund University of Technology, December.
  • Leaman, R., Miller, C., Gonzalez, G. 2009. Enabling recognition of diseases in biomedical text with machine learning: Corpus and benchmark. In Proceedings of the 3rd International Symposium on Languages in Biology and Medicine, pages 82-89.
  • Reichartz, F., Korte, H., Paass, G. 2009. Composite kernels for relation extraction. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 365-368, Suntec, Singapore, August. Association for Computational Linguistics.
  • Smith, L., Tanabe, L., Ando, R., Kuo, C., et al. 2008. Overview of BioCreative II gene mention recognition. Genome Biology, 9(Suppl 2).
  • Torii, M., Hu, Z., Wu, C., Liu, H. 2009. Biotagger-GM: a gene/protein name recognition system. Journal of the American Medical Informatics Association, 16:247-255.
  • Zweigenbaum, P., Demner-Fushman, D., Yu, H., Cohen, K. 2007. Frontiers of biomedical text mining: current progress. Brief Bioinform, 8(5):358-375.