Cluster Analysis Using Ant-based Clustering

7348 words (29 pages) | Published 06/06/19

CONTENTS

1.      INTRODUCTION

2.      BACKGROUND AND MOTIVATION

2.1 CLUSTER ANALYSIS USING ANT-BASED CLUSTERING

2.1.1 ANT-BASED CLUSTERING

2.1.2 CLUSTERING SEQUENTIAL DATA USING ANT-BASED CLUSTERING

2.1.3 LONGEST COMMON SUBSEQUENCE AS A SIMILARITY MEASURE

2.2 TEMPORAL SEQUENCE PATTERN ANALYSIS IN HEALTH CARE

2.2.1 CLINICAL PATHWAYS

2.3 CVD RISK PREDICTION RESEARCH

2.3.1 PREDICT-CVD

2.3.2 ADVERSE EVENT PREDICTION IN CVD

3.   OBJECTIVES AND SIGNIFICANCE

4.   INITIAL EXPERIMENTS

4.1 METHODS

4.1.1.  ANT BASED CLUSTERING

4.1.2. LONGEST COMMON SUBSEQUENCE (LCS) SIMILARITY MEASURE

4.1.3. DATA

4.1.4. EXPERIMENTAL DESIGN AND RESULTS

5.   RESEARCH PLAN

REFERENCES

APPENDIX

1. Introduction

The health industry generates data in large volumes, which are processed using data mining and machine learning techniques to gain knowledge [5]. Currently, vast amounts of data are recorded as Electronic Health Records (EHRs), which are routinely collected and stored, and data mining plays a pivotal role in analysing these data sets. Data mining has become increasingly popular in recent years, partly because advances in computer technology enable computationally intensive methods to be carried out quickly and efficiently [4]. It is an essential step in the process of knowledge discovery from EHRs, in which data mining techniques are used to extract patterns. Data mining and its applications can help the healthcare sector in major areas such as the evaluation of treatment effectiveness, the management of patient history and the identification of risk factors related to symptoms. Anticipating a patient's future behaviour from their recorded history is one of the important applications of data mining in health care management.

The purpose of the proposed research is to process large volumes of population health data sets, which are long-term accumulations of patients' health trajectories. These long sequences of records related to chronic conditions are modelled as sequences of temporal patterns, which include health consumer services such as records of prescriptions, lab tests, hospital visits and early symptoms. These temporal patterns are used to predict adverse events, risk factors and the patient's journey in the management of chronic diseases.

The project analyses and represents the temporal event sequences of the patient's trajectory without discarding the temporal nature of the data, using data mining techniques to understand their chronic condition and its complications. The proposed study also focuses on developing a cohesive and efficient clustering algorithm to handle temporal sequence patterns, and on interpreting and utilising the outcome of clustering to predict future sequences of events, including accurately inferring meaningful progression stages.

The rest of the paper is organised as follows. Section 2 discusses previous studies and research related to ant-based clustering, temporal sequences of events and risk prediction for CVD. Section 3 outlines the main objectives and significance of the research. The methodology, computational experiments and analysis of the results are presented in Section 4. Section 5 discusses the research plan and timeline.

2. Background and Motivation

This section consists of a literature review on: 1. cluster analysis using the ant-based clustering algorithm; 2. temporal sequence pattern analysis in health care; 3. CVD research for risk prediction.

2.1 Cluster Analysis using Ant Based Clustering Algorithm

Clustering is a widely used technique for grouping a set of objects such that objects in the same group are more similar to each other than to objects in other groups. It is one of the main tools of exploratory data mining and a general approach to statistical data analysis. The main purpose of clustering is to segregate an unlabelled data set into a discrete and finite set of natural hidden data structures, rather than to provide a precise characterization of unobserved data samples drawn from the same probability distribution [40]. Algorithms based on swarm intelligence are developing as an alternative to conventional methods such as hierarchical clustering and k-means. Among the various swarm-based clustering methods, ant-based clustering is the most widely used and accepted. From a theoretical perspective, ant-based clustering methods fall into two types: the first directly mimics the clustering behaviour observed in real ant colonies, whereas the second is less directly influenced by nature, and the clustering task is reformulated as an optimization task [39].

2.1.1. Ant based Clustering

Various clustering methods based on ant behaviour have been proposed in the literature. Ant colonies provide a platform for developing effective, nature-inspired heuristics for solving clustering problems [54]. Ant-based clustering and sorting was first introduced by Deneubourg et al. as a technique for describing two types of emergent behaviour found in real ant colonies [48,50,51,52]; the same principle is applied in the data-mining context to perform clustering and topographic mapping [48]. The work of Deneubourg et al. mainly focuses on deriving a technique applicable to collective robotics and data analysis [48].

Lumer and Faieta [53] introduced various modifications to the model, which enable the handling of numerical data and improve the quality of the outcome and the convergence time. The resulting technique is widely used, partly because of its scaling behaviour. In this algorithm they introduced a short-term memory for the agents, in which each agent remembers a small number of locations where it has successfully dropped an item. When picking up a new item, this memory biases the direction in which the ant moves, so that the ant tends to move towards the location where it last dropped a similar item.

Various modifications and extensions of this algorithm have been proposed by several authors to improve its performance. An improved ant-based clustering and sorting method, which also includes multidimensional scaling, has been proposed for visual document retrieval in world wide web searches [55,48,56,57]. Combinations of ant-based clustering with the fuzzy c-means and k-means algorithms were proposed by Gu and Hall (2006) [58], Kanade and Hall (2003, 2004) [59,60] and Monmarché et al. (1999) [61] to improve the efficiency of the algorithm. According to [62,63], the algorithm has been modified to allow the simultaneous transportation of entire stacks of data items on the grid, further enhanced by Lumer and Faieta [53] through the short-term memory feature. Pheromone traces have been proposed (Ramos and Merelo 2002 [64]; Vizine et al. 2005b [57]; Montes de Oca et al. 2005 [65]) to direct ant movement towards promising grid positions, and information exchange between agents and the replacement of picking and dropping probabilities by fuzzy rules have been used by Schockaert et al. (2004a, 2004b) [66,67] to improve accuracy. In addition to these enhancements, several other studies have extended ant-based clustering according to their requirements.

2.1.2 Clustering Sequential Data using Ant Based Clustering:

Sequential data comprises sequences of different lengths with distinct characteristics, e.g., large volume, dynamic behaviour and time constraints [40]. Cluster analysis explores the patterns hidden in large amounts of sequential data in an unsupervised setting and therefore provides a critical way to overcome the current challenges. Clustering of sequential data depends on a sequence similarity, which measures the distance or similarity between each pair of sequences. Sequential data can be generated from DNA sequencing, speech processing, text mining, medical diagnosis, stock markets, customer transactions, web data mining, and robot sensor analysis, to name a few [41].

Ant-based clustering, due to its high flexibility and self-organization, has been widely applied in problem areas ranging from e-commerce to circuit design and from text mining to web mining (Jianbin et al., 2007) [42]. According to [46], the algorithm is used to extract patterns from web log files for pattern discovery and to predict the user's next request. In other similar methods, ant colony clustering is applied to segregate visitors [46,47]. According to [43,44,46], ant-based clustering is applied to pre-processed logs to extract frequent patterns for pattern discovery. The knowledge extracted from the clusters has been used to predict the user's next request using an algorithm based on the Longest Common Subsequence [44,45].

2.1.3 Longest Common Subsequence as a similarity measure

In data mining, computing the similarity between objects is an essential task for identifying regularities or building homogeneous clusters of objects. One of the most important aspects of clustering is choosing an appropriate distance measure, which is used to compare the overall features of the data [69]. Many distance measures have been proposed in the literature, each appropriate for different applications and having specific advantages and disadvantages. Similarity between time series has been modelled using various techniques, including the Euclidean distance, the Lp norm, Dynamic Time Warping (DTW) distance, the EDIT distance and the Longest Common Subsequence (LCS). Choosing the Euclidean distance as the similarity model is unrealistic, since its performance degrades rapidly in the presence of noise [70]. DTW has so far been used for one-dimensional time series. Euclidean matching completely disregards variations in the time axis, while DTW performs excessive matchings, distorting the true distance between sequences [70]. Figure 2 illustrates the differences among the three techniques, of which LCS gives the optimal result. LCS is specially adapted for continuous values and is more robust than DTW under noisy conditions [71]; it produces the most robust and intuitive correspondence between points. LCS is an improved version of the edit distance model; the basic purpose of this model is to match two given sequences by allowing them to stretch, without disturbing the order of the elements, while also allowing some elements to remain unmatched [72]. Zhang et al. [19] analyse the treatment information of chronic kidney patients, where hierarchical clustering based on the LCS distance is introduced to cluster the temporal sequences of patients.
Among the various approaches, LCS has been widely applied in biomedical research as a distance measure in trajectory and protein sequence analysis [19]. A DNA sequence clustering technique uses an advanced filtering method based on the LCS to find sequence pairs of the same type [73]. Chen Y. et al. proposed an improved LCS method that adds a semantic similarity feature, which can increase the accuracy of processing in Chinese disease mapping [74]. Park et al. [75] propose a classification method that adopts LCS as the similarity function for classifying abnormal human behavioural patterns.

In the past researchers have developed LCS methods based on their research requirement in order to address the following issues:

  • The time series generated may not be the outcome of sampling at fixed time intervals. Moreover, two time series moving in exactly the same way, with one moving at double the speed of the other, will most likely produce a large Euclidean distance.
  • The given data may contain a significant number of outliers or incorrect measurements; the similarity measure should handle such outliers.
  • Efficiency: the measure should be simple enough to allow efficient computation of the similarity.

To overcome these limitations, we adopt the LCS model, which is a variation of the EDIT similarity measure.

Figure 2: The quality of matching of LCS compared with other distance functions. The Euclidean distance performs an inflexible matching, while DTW gives many superfluous and spurious matchings in the presence of noise [70].

2.2 Temporal sequence pattern Analysis in Health Care

A temporal sequence of events is a collection of events ordered by the time at which each event was recorded. Temporal data is used to analyse weather patterns and other environmental variables, study demographic trends, extract patterns from Electronic Health Records (EHRs) for predicting disease risk factors, monitor traffic conditions and so on. The vast availability of EHR data provides an opportunity for discovering knowledge about various diseases [34]. EHRs, however, present difficulties for analysis and patient selection when used for research purposes [33]. Moreover, extracting patterns from temporal sequences of events is one of the fundamental problems in data mining [15].

To analyse the available temporal sequences of events, various data mining and machine learning techniques are used to gain knowledge. Among these, one of the promising and widely accepted techniques is sequential pattern mining, which refers to the extraction of frequently occurring patterns of events or subsequences [16]. Initially introduced by Agrawal and Srikant [6] in 1995, its original applications were in the retail industry, where it was used to predict when a customer would purchase a sequel after purchasing a certain book. The technique has also been applied in other areas, including web mining, text mining and health care. According to Wright et al., it is used for identifying temporal relationships among adverse events and medications [13]. This sort of mining is generally used to predict disease susceptibility [8,9] and in pharmacovigilance practice, which monitors the effects of medical drugs, especially to identify and evaluate previously unreported adverse reactions [11,12]. It is also used to predict readmission, an episode covering a patient's journey from the time of discharge to a subsequent admission within a specific timeframe [10]. Readmission rates are considered a quality benchmark for health care systems and are used as an outcome measure. In addition, it is useful for mining unanticipated episodes in which a certain set of events leads to unexpected outcomes; for example, taking certain combinations of medicines can lead to an adverse reaction [12]. Detecting such unanticipated events is of great value for correction or prevention, especially for life-threatening outcomes. Owing to their unexpected nature, these episodes may not occur as a regular event pattern; hence predictions of unexpected events cannot be treated as sequential pattern outcomes [35]–[37].
Jin et al. [12] introduce a technique referred to as unexpected temporal association rules (UTARs), which describes these kinds of unanticipated episodes, and provide an algorithm to discover them. Batal et al. [9] propose a novel temporal pattern mining method for classifying complex EHR data and for building classification models from complex multivariate temporal EHR data.

Wright et al. [20] determine the effectiveness of sequential pattern mining in identifying temporal relationships between medications in order to predict the next prescribed medication for a patient. Temporal event sequence mining has also been applied to predicting the survival outcome of glioblastoma (GBM) cancer patients [14]. The predictive model for GBM survival is an initial step in a long-term plan to formulate personalized treatment courses for patients. The technique helps predict which patients will survive longer than the median survival time among GBM cancer patients based on clinical and genomic factors, and also helps assess the predictive power of treatment patterns.

2.2.1 Clinical Pathways

A clinical pathway (CP) is a treatment process that represents the steps required to achieve a specific treatment objective in the patient care flow [21,22,23]. It has been shown that CPs can break functional boundaries and provide a well-defined, process-oriented view of health care. Among the various components of clinical pathway analysis, pattern mining is considered one of the most important; it aims to discover the medical behaviours that are essential to clinical pathways and to quantify them with numerical bounds. A limitation of existing techniques is that they rarely provide quantified temporal order information for critical medical behaviours. The study of Zhengxing et al. [38] uses a new process mining approach to discover a set of clinical pathway patterns for a given clinical workflow log and to explore the critical medical behaviours performed; it provides comprehensive knowledge about the quantified temporal order of medical behaviours in clinical pathways. Zhang et al. [19] analyse the treatment information of chronic kidney disease (CKD) patients to learn practice-based clinical pathways from visit histories, where a visit history is a sequence of visits containing data on visit type, diagnoses, date and procedure; this helps CKD patients manage their disease and its complications and provides a review guide for clinicians. Once the sequences are properly represented, hierarchical clustering based on the Longest Common Subsequence (LCS) distance is introduced to cluster them.

2.3. CVD Research for Risk Prediction

2.3.1 PREDICT-CVD

PREDICT is a web-based cardiovascular disease (CVD) risk assessment and management decision support system developed for primary care and implemented in 2002 [24]. PREDICT is integrated with general practice EHRs, allowing systematically coded CVD risk data to be automatically extracted from the EHR. Clinicians (chiefly General Practitioners [GPs] and their staff) use PREDICT at the time of clinic visits; it computes and displays a patient's CVD and diabetes profile, including risk scores and tailored evidence-based treatment recommendations. At the same time, the electronic CVD risk factor profile of each patient is stored on a central server, and each patient's first recorded risk assessment is identified as the baseline record [25]. The PREDICT cohort is linked to routinely collected national databases through the National Health Index (NHI) number [24,25]. Since 2002, the PREDICT software has been used by approximately 35-40% of general practices in New Zealand, mainly in the Auckland and Northland regions, covering approximately 1.6 million people and representing around 35% of the New Zealand resident population [25]. The main aim of developing this cohort was to validate the Framingham CVD score and to generate new CVD risk equations.

Framingham Risk Score

The Framingham Risk Score (FRS) is a sex-specific algorithm primarily used to estimate an individual's cardiovascular risk over a 10-year period. The FRS was initially developed using data from the Framingham Heart Study to predict the development of coronary heart disease within ten years [26]. Reddy et al. [28] used the Framingham risk score to predict the risk of CVD in non-cardiac patients. A 5-year cardiovascular risk score was calculated using the Framingham risk equation for individuals with no previous cardiovascular history [29]. An adjusted version of the Framingham score is used in current New Zealand cardiovascular risk management guidelines for high- and low-risk ethnic groups [27]. Lloyd-Jones et al. [77] investigated the Framingham risk score for estimating the 10-year risk of coronary heart disease (CHD) and for differentiating lifetime risk for CHD patients.

2.3.2 Adverse Event Prediction in CVD

An adverse event (AE) is any untoward medical occurrence in a patient related to an investigational drug or other medical activity. Adverse event prediction is the process of identifying potential adverse events of an investigational drug before they actually occur; predicting adverse events accurately represents a significant challenge in health care. Adverse effects are described as mortality, kidney disorder, reinfection, myocardial infarction or coma in patients who have undergone coronary artery bypass grafting (CABG) [31]. Geraci et al. [76] determine whether adverse events occurring after coronary artery bypass surgery can be predicted from clinical variables representing illness severity at admission. EHRs provide opportunities to leverage vast arrays of data to help prevent adverse events, improve patient outcomes and reduce hospital costs [30]. The work of Mortazavi et al. [30] on a postoperative complications prediction system allows clinicians to interpret the likelihood of an adverse event occurring, the general causes of these events, and the contributing factors for each specific patient. The work of Tsipouras et al. [32] presents the Treatment Tool, a decision support framework focused on the management of heart failure (HF) patients. It is a web-based component whose primary functions include the calculation of various risk scores and the prediction of the appearance of adverse events for treatment assessment (e.g., hospital readmission prediction). The Treatment Tool provides two functionalities: risk score calculation, and treatment prediction based on the risk of adverse events appearing.

3. Objectives and Significance

The main objective of this study is to adopt datamining approaches to advance the analysis of temporal sequences through data sources, so as to understand and predict adverse events of CVD more precisely.

Objective 1:

Representing the temporal event sequences of patients' trajectories in a way that preserves the temporal nature of the sequence data, and investigating the significance of the temporal event sequences associated with a patient's journey, for a better understanding of their chronic conditions and complications.

Objective 2:

An improved ant-based clustering algorithm with the Longest Common Subsequence (LCS) as its distance measure is introduced to handle temporal sequence data and to find meaningful clusters.

Objective 3:

Evaluate and understand the outcome of clusters to predict future events in a sequence while also accurately inferring meaningful progression stages.

4. Initial Experiments:

4.1. Methods:

4.1.1 Ant Based Clustering Algorithm

Deneubourg et al. [51] proposed the ant-based clustering and sorting algorithm, a nature-inspired heuristic first introduced as a model for explaining two types of emergent behaviour in real ant colonies. An improved version, known as the ant-based data clustering algorithm, was proposed by Lumer and Faieta [53] and resembles the ant behaviour described in [51]. In the Lumer and Faieta procedure [53], ants and data elements are initially scattered randomly on a 2-D grid on which the ants perform a random walk; at each step an ant is selected at random and can either pick up or drop an element at its current location. When an unloaded ant encounters an element, the probability of picking it up increases as the density of similar elements in the small surrounding area decreases. Accordingly, the probability of picking up an element i is defined as

$$P_{\mathrm{pick}}(i) = \left(\frac{k_p}{k_p + f(i)}\right)^{2} \qquad (1)$$

where k_p is a constant and f(i) is a local estimate of the density of elements and of their similarity to i. Likewise, the probability that an ant drops a carried element should increase with the density of similar elements in its surrounding area:

$$P_{\mathrm{drop}}(i) = \begin{cases} 2 f(i) & \text{if } f(i) < k_d \\ 1 & \text{otherwise,} \end{cases} \qquad (2)$$

where k_d is a constant and f(i) is the neighbourhood function

$$f(i) = \max\left(0,\; \frac{1}{\sigma^{2}} \sum_{j} \left(1 - \frac{d(i,j)}{\alpha}\right)\right) \qquad (3)$$

where d(i, j) is a measure of the similarity between data items i and j, α ∈ [0, 1] is a data-dependent scaling parameter, and σ² is the size of the local neighbourhood.
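For concreteness, equations (1)-(3) can be sketched as small functions. This is a hedged illustration, not code from the original studies; the function names are ours, and `dists` is assumed to hold the dissimilarities d(i, j) between item i and the items j in its local neighbourhood.

```python
def neighborhood_f(dists, sigma2, alpha):
    """Local density estimate f(i) of equation (3): dists holds d(i, j)
    for the items j in i's sigma^2-cell neighbourhood."""
    return max(0.0, sum(1.0 - d / alpha for d in dists) / sigma2)

def p_pick(f_i, k_p):
    """Probability of picking up item i, equation (1)."""
    return (k_p / (k_p + f_i)) ** 2

def p_drop(f_i, k_d):
    """Probability of dropping the carried item i, equation (2)."""
    return 2.0 * f_i if f_i < k_d else 1.0
```

Note that `p_pick` tends to 1 as the local density f(i) tends to 0 (isolated items are picked up), while `p_drop` grows with f(i) (items are dropped among similar neighbours), which is exactly the sorting pressure the algorithm relies on.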

Improved Ant based Clustering Algorithm:

Ant-based clustering is intended to improve the quality of the clusters compared with other clustering techniques. However, ant clustering takes a long time to produce good-quality clusters. We propose improvements over the existing ant-based clustering algorithm, using the Longest Common Subsequence (LCS) to handle temporal sequence data, guide the agents in selecting centroids, and direct them to continue until the number of centroids equals the number of cluster groups on the grid. An outline of the improved ant-based clustering algorithm follows:

1. Randomly scatter the data items and the agents on the 2-D grid.

2. Move each agent randomly by step-size grid cells, randomly select a data item, and calculate its pick-up score and drop score from the similarity measure (equations (1)-(3)).

3. If (pick-up score > drop score):
       the agent picks up the item and travels to a new location;
       calculate the pick-up and drop scores again (equations (1)-(3));
       if (drop score > pick-up score):
           place the carried item at the current position;
       else:
           move to a new location.
   Else:
       the agent moves to a new location without an item.

4. Termination condition:
   if (number of centroids == number of groups): stop;
   else: go to step 3.

5. Select one centroid from each group.

Figure 1: Ant-based clustering algorithm
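The pick/drop loop above can be sketched as a minimal, runnable program. This is only an illustration under assumptions of our own: the grid size, agent count and parameter defaults are arbitrary, agents jump to random cells rather than taking unit steps, and the centroid-based termination of step 4 is replaced by a fixed step budget for simplicity.

```python
import random

def ant_cluster(data, dist, grid=20, n_agents=5, steps=20000,
                k_p=0.1, k_d=0.15, sigma=3, alpha=1.0, seed=0):
    """Sketch of the ant-based clustering loop: agents move randomly and
    pick up / drop items according to equations (1)-(3)."""
    rng = random.Random(seed)
    # pos[i] is the grid cell currently holding data item i.
    pos = {i: (rng.randrange(grid), rng.randrange(grid)) for i in range(len(data))}
    agents = [{'cell': (rng.randrange(grid), rng.randrange(grid)), 'load': None}
              for _ in range(n_agents)]

    def f(i, cell):
        # Neighbourhood density of item i around `cell`, equation (3).
        x, y = cell
        s = sum(1.0 - dist(data[i], data[j]) / alpha
                for j, (px, py) in pos.items()
                if j != i and abs(px - x) <= sigma // 2 and abs(py - y) <= sigma // 2)
        return max(0.0, s / (sigma * sigma))

    for _ in range(steps):
        a = rng.choice(agents)
        a['cell'] = (rng.randrange(grid), rng.randrange(grid))  # random move
        if a['load'] is None:
            here = [i for i, c in pos.items() if c == a['cell']]
            if here:
                i = rng.choice(here)
                if rng.random() < (k_p / (k_p + f(i, a['cell']))) ** 2:  # eq. (1)
                    a['load'] = i
                    del pos[i]
        else:
            i = a['load']
            fi = f(i, a['cell'])
            if rng.random() < (2.0 * fi if fi < k_d else 1.0):           # eq. (2)
                pos[i] = a['cell']
                a['load'] = None

    for a in agents:  # force-drop anything still carried
        if a['load'] is not None:
            pos[a['load']], a['load'] = a['cell'], None
    return pos
```

After enough iterations, similar items (under `dist`) tend to accumulate in nearby cells, and the resulting spatial groups on the grid are read off as clusters.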

4.1.2. Longest Common Subsequence (LCS) Similarity Measure:

The Longest Common Subsequence (LCS) distance [69] is a variation of the EDIT distance. LCS allows time series to stretch along the time axis and does not require all elements to match, thus being less sensitive to outliers than the Lp-norms and DTW. Specifically, the LCS similarity between two real-valued sequences S1 and S2, of length m and n respectively, is computed as follows:

Definition 1:

(a) $LCSS_{\delta,\varepsilon}(S_1, S_2) = 0$, if $n = 0$ or $m = 0$;

(b) $LCSS_{\delta,\varepsilon}(S_1, S_2) = 1 + LCSS_{\delta,\varepsilon}(\mathrm{HEAD}(S_1), \mathrm{HEAD}(S_2))$, if $|S_{1,m} - S_{2,n}| < \varepsilon$ and $|m - n| \le \delta$;

(c) $LCSS_{\delta,\varepsilon}(S_1, S_2) = \max\big(LCSS_{\delta,\varepsilon}(\mathrm{HEAD}(S_1), S_2),\; LCSS_{\delta,\varepsilon}(S_1, \mathrm{HEAD}(S_2))\big)$, otherwise. $\qquad (4)$

where $\mathrm{HEAD}(S_1)$ is the subsequence $[S_{1,1}, S_{1,2}, \ldots, S_{1,m-1}]$, δ is an integer that controls the maximum distance in the time axis between two matched elements, and ε is a real number, 0 < ε < 1, that controls the maximum distance two elements may have and still be considered matched, as depicted in the following figure.

Figure 3: LCS distance [69]

The value of LCS is unbounded and depends on the length of the compared sequences. We need to normalize it, in order to support sequences of variable length. The distance derived from the LCS similarity can be defined as follows:

Definition 2.

The distance D() expressed in terms of the LCS similarity between two trajectories A and B is given by:

D ()  (A,B) =

1-LCSSδ,ε(A,B)min⁡n.m

(5)
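The recursion of Definition 1 and the normalised distance of equation (5) can be realised with a standard dynamic-programming sketch (the function names are ours, not from [69]):

```python
def lcss(s1, s2, delta, eps):
    """Dynamic-programming form of the LCSS recursion of Definition 1:
    elements s1[i], s2[j] match when |s1[i]-s2[j]| < eps and |i-j| <= delta."""
    m, n = len(s1), len(s2)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if abs(s1[i - 1] - s2[j - 1]) < eps and abs(i - j) <= delta:
                L[i][j] = 1 + L[i - 1][j - 1]      # case (b): elements match
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])  # case (c): skip one
    return L[m][n]

def lcss_distance(a, b, delta, eps):
    """Normalised LCSS distance of equation (5): 1 - LCSS / min(n, m)."""
    return 1.0 - lcss(a, b, delta, eps) / min(len(a), len(b))
```

The tabulated form runs in O(mn) time, avoiding the exponential blow-up of the naive recursion, and the division by min(n, m) bounds the distance in [0, 1] regardless of sequence length.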

 4.1.3.   Experimental design and Results:

The experimental setup is described in this section along with a comparison of the results of the performed experiments. Results from our proposed algorithm were compared, on the basis of speed, compactness and cohesiveness, with two other popular clustering algorithms, k-means and hierarchical clustering. Different similarity measures, such as the Euclidean distance and the LCS distance, were tried on the time-series dataset to obtain high-quality clusters.

4.1.3.1.   Data:

Experiments were conducted on the diabetes time series data set [69], a commonly used benchmark data set. A few combinations of subsets of the records were selected for the experiments based on various degrees of difficulty. The dataset represents the health state of diabetes patients, defined by a vector of categorical variables associated with the presence of chronic conditions. It contains several weeks' to months' worth of glucose, insulin and lifestyle data per patient, recorded as daily activities, together with a description of the problem domain. Diabetes patient records were obtained from two sources: an automatic electronic recording device and paper records with time slots. This paper uses data from a study of 70 diabetes patients, including blood glucose measurements before and after food intake, regular and intermediate-acting neutral protamine Hagedorn (NPH) insulin doses, hypoglycemic symptoms, and exercise activities over several months, which are treated as sequences of events. These events are denoted by codes; 20 different codes and their corresponding measured values are taken into account in the experiments. The following table shows some of the important events from the diabetes dataset.

33 = Regular insulin dose
34 = NPH insulin dose
57 = Unspecified blood glucose measurement
58 = Pre-breakfast blood glucose measurement
59 = Post-breakfast blood glucose measurement
60 = Pre-lunch blood glucose measurement
61 = Post-lunch blood glucose measurement
62 = Pre-supper blood glucose measurement
63 = Post-supper blood glucose measurement
64 = Pre-snack blood glucose measurement
65 = Hypoglycemic symptoms
Table: Selected event codes from the diabetes data set
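Before clustering, the raw records must be turned into per-patient temporal event sequences of these codes. The sketch below assumes a tab-separated layout of date, time, code and value; the patient-id column and the sample values are illustrative additions of ours (the original distribution ships one file per patient).

```python
import csv
from collections import defaultdict
from datetime import datetime
from io import StringIO

# Illustrative sample records: patient id, date, time, event code, value.
SAMPLE = """\
p01\t04-21-1991\t9:09\t58\t100
p01\t04-21-1991\t9:09\t33\t9
p01\t04-21-1991\t17:08\t62\t119
p02\t04-21-1991\t9:09\t58\t216
"""

def event_sequences(raw):
    """Group records into per-patient, time-ordered sequences of event codes."""
    recs = defaultdict(list)
    for pid, date, time, code, value in csv.reader(StringIO(raw), delimiter='\t'):
        ts = datetime.strptime(f"{date} {time}", "%m-%d-%Y %H:%M")
        recs[pid].append((ts, int(code), float(value)))
    for pid in recs:
        recs[pid].sort(key=lambda r: r[0])  # stable sort: ties keep file order
    return {pid: [code for _, code, _ in seq] for pid, seq in recs.items()}
```

Parsing the timestamp rather than sorting the date/time strings matters here: lexicographic order would place "17:08" before "9:09". The resulting code sequences are exactly what the LCS similarity of Section 4.1.2 compares.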

4.1.3.2 Experimental setup:

This section briefly describes the initial experiment carried out on the diabetes dataset and compares the selected algorithms. The data set consists of temporal sequences of events and their corresponding measurements, of which the pre-breakfast glucose level is considered the most important measure.

To evaluate the effectiveness of the similarity measure, the average silhouette coefficient is calculated for the temporal sequence data in the clusters. The silhouette coefficient considers both the cohesion and the separation of data points when evaluating clusters [78]. The silhouette coefficient is calculated as follows:

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \qquad (6)$$

In equation (6), a_i is the average distance of the i-th object to the other objects in its own cluster, whereas b_i is the minimum over the other clusters of the average distance of the i-th object to the objects in that cluster. The silhouette coefficient varies from −1 to +1; the closer the value is to 1, the better the clustering result. The average silhouette coefficient values for different numbers of clusters under the two distance measures are shown in the figure.

5. Research Plan:

The following figure depicts the program schedule for the project and the current state of the research tasks.

Figure 4 Research timeline

References:

[1]. Institute of Medicine (U.S.), Committee on Improving the Patient Record (eds), Dick RS, Steen EB, Detmer DE (1997) The computer-based patient record: an essential technology for health care. Revised edition. National Academy Press, Washington, DC

[2]. Stewart WF, Shah NR, Selna MJ, Paulus RA, Walker JM (2007) Bridging the inferential gap: the electronic health record and clinical evidence. Health Aff (Millwood) 26: w181–w191

[3]. Kohane IS (2011) Using electronic health records to drive discovery in disease genomics. Nat Rev Genet 12:417–428

[4]. Coorevits P, Sundgren M, Klein GO, Bahr A, Claerhout B, Daniel C et al (2013) Electronic health records: new opportunities for clinical research. J Intern Med 274(6):547–560

[5]. Kukafka R, Ancker JS, Chan C, Chelico J, Khan S, Mortoti S et al (2007) Redesigning electronic health record systems to support public health. J Biomed Inform 40:398–409

[6] R. Agarwal, R. Srikant, Mining sequential patterns, in: Proc. International Conference on Data Engineering, ICDE, IEEE Computer Society, 1995, pp. 3–14.

[7] K.-Y. Whang, J. Jeon, K. Shim, J. Srivatava, Finding event oriented patterns in Long temporal sequences, PAKDD 2003, Springer-Verlag Berlin Heidelberg 2003, LNAI 2637, pp. 15–26.

[8] Reps J, Garibaldi JM, Aickelin U, Soria D, Gibson JE, Hubbard RB. Discovering sequential patterns in a UK general practice database. In: 2012 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI). IEEE; 2012.

[9] Batal I, Valizadegan H, Cooper GF, Hauskrecht M. A pattern mining approach for classifying multivariate temporal data. In: 2011 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2011.

[10] McAullay D, Williams G, Chen J, Jin H, He H, Sparks R, et al., editors. A delivery framework for health data mining and analytics. In: Proceedings of the Twenty-Eighth Australasian Conference on Computer Science, vol. 38. Australian Computer Society, Inc.; 2005.

[11] Norén GN, Bate A, Hopstadius J, Star K, Edwards IR, editors. Temporal pattern discovery for trends and transient effects: its application to patient records. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2008.

[12] Jin H, Chen J, He H, Williams GJ, Kelman C, O’Keefe CM. Mining unexpected temporal associations: applications in detecting adverse drug reactions. IEEE Trans Inf Technol Biomed 2008;12(4):488–500.

[13] Wright, A. P., Wright, A. T., Mccoy, A. B., & Sittig, D. F. (2015). The use of sequential pattern mining to predict next prescribed medications. Journal of Biomedical Informatics, 53, 73-80. doi:10.1016/j.jbi.2014.09.003

[14] Malhotra, K., Navathe, S. B., Chau, D. H., Hadjipanayis, C., & Sun, J. (2016). Constraint based temporal event sequence mining for Glioblastoma survival prediction. Journal of Biomedical Informatics,61, 267-275. doi:10.1016/j.jbi.2016.03.020

[15] Gotz, D., Wang, F., & Perer, A. (2014). A methodology for interactive mining and visual analysis of clinical event patterns using electronic health record data. Journal of Biomedical Informatics, 48, 148-159. doi:10.1016/j.jbi.2014.01.007

[16]  M. Kamber, J. Han, Data Mining: Concepts and Techniques, second ed., Elsevier, 2006.

[17] Toma T, et al. Discovery and inclusion of SOFA score episodes in mortality prediction. J Biomed Inform 2007;40(6):649–60.

[18] Toma T, et al. Learning predictive models that use pattern discovery – a bootstrap evaluative approach applied in organ functioning sequences. J Biomed Inform 2010;43(4):578–86.

[19] Zhang, Y., Padman, R., & Wasserman, L. (2014). On Learning and Visualizing Practice-based Clinical Pathways for Chronic Kidney Disease. AMIA Annual Symposium Proceedings, 2014, 1980–1989.

[20] Aileen P. Wright, Adam T. Wright, Allison B. McCoy, Dean F. Sittig, The use of sequential pattern mining to predict next prescribed medications, Journal of Biomedical Informatics, Volume 53, 2015, Pages 73-80, ISSN 1532-0464.

[21] Lenz R, Reichert M. IT support for healthcare processes-premises, challenges, perspectives. Data Knowl Eng 2007;61(1):39–58.

[22] Quaglini S, Stefanelli M, Lanzola G, Caporusso V, Panzarasa S. Flexible guideline-based patient careflow systems. Artif Intell Med 2001;22(1):65–80.

[23] Lenz R, Blaser R, Beyer M, Heger O, Biber C, Bäumlein M, et al. IT support for clinical pathways – lessons learned. Int J Med Inform 2007;76(3):S397–402

[24] L. Bannink, S. Wells, J. Broad, T. Riddell, and R. Jackson, “Web-based assessment of cardiovascular disease risk in routine primary care practice in New Zealand: The first 18,000 patients (PREDICT CVD-1),” New Zealand Medical Journal, Vol. 119, 1245 2006.

[25] S. Wells, T. Riddell, A. Kerr, R. Pylypchuk, C. Chelimo, R. Marshall, D.J. Exeter, S. Mehta, J. Harrison, C. Kyle, C. Grey, P. Metcalf, J. Warren, T. Kenealy, P.L. Drury, M. Harwood, D. Bramley, G. Gala, and R. Jackson, “Cohort Profile: The PREDICT Cardiovascular Disease Cohort in New Zealand Primary Care (PREDICT-CVD)

[26] Wilson, P.W.; D’Agostino, R.B.; Levy, D.; Belanger, A.M.; Silbershatz, H.; Kannel, W.B. (12 May 1998). “Prediction of coronary heart disease using risk factor categories.” Circulation. 97(18): 1837–1847. PMID 9603539

[27] Wannamethee, S., Shaper, A., Lennon, L. and Morris, R. (2005). Metabolic Syndrome vs Framingham Risk Score for Prediction of Coronary Heart Disease, Stroke, and Type 2 Diabetes Mellitus. Archives of Internal Medicine, 165(22), p.2644

[28] Reddy Y.V, R. (2015). Identification of Predictable Biomarkers in Conjunction to Framingham Risk Score to Predict the Risk for Cardiovascular Disease (CVD) in Non-Cardiac Subjects. Journal of Clinical and Diagnostic Research.

[29] Anderson KM, Odell PM, Wilson PWF, Kannel WB. Cardiovascular disease risk profiles. American Heart Journal 1991;121(1, Part 2):293.

[30] Mortazavi B, Desai N, Zhang J, Coppi A, Warner F, Krumholz H, Negahban S. Prediction of Adverse Events in Patients Undergoing Major Cardiovascular Procedures. IEEE J Biomed Health Inform. 2017 Mar 1.

[31] Elizabeth B Fortescue, Katherine Kahn, David W Bates, Development and validation of a clinical prediction rule for major adverse outcomes in coronary bypass grafting, The American Journal of Cardiology, Volume 88, Issue 11, 2001, Pages 1251-1258, ISSN 0002-9149, http://dx.doi.org/10.1016/S0002-9149(01)02086-0.

[32] Tsipouras, Karvounis EC, Tzallas AT, Katertsidis NS, Goletsis Y, Frigerio M, Verde A, Trivella MG, Fotiadis DI. Adverse event prediction in patients with left ventricular assist devices. Conf Proc IEEE Eng Med Biol Soc. 2013

[33] Monroe, M., Rongjian Lan, Hanseung Lee, Plaisant, C. and Shneiderman, B. (2013). Temporal Event Sequence Simplification. IEEE Transactions on Visualization and Computer Graphics, 19(12), pp.2227-2236.

[34] Chen, E. S., & Sarkar, I. N. (2014). Mining the Electronic Health Record for Disease Knowledge. Methods in Molecular Biology Biomedical Literature Mining, 269-286. doi:10.1007/978-1-4939-0709-0_15

[35] R. Agrawal and R. Srikant, “Mining sequential patterns,” in Proc. ICDE 1995, pp. 3–14.

[36] H. Mannila, H. Toivonen, and A. I. Verkamo, “Discovery of frequent episodes in event sequences,” Data Mining Knowl. Discovery, vol. 1, no. 3, pp. 259–289, 1997.

[37] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, “FreeSpan: Frequent pattern-projected sequential pattern mining,” in Proc. KDD 2000, pp. 355–359.

[38] Zhengxing Huang, Xudong Lu, Huilong Duan, On mining clinical pathway patterns from medical behaviors, Artificial Intelligence in Medicine, Volume 56, Issue 1, 2012, Pages 35-50, ISSN 0933-3657.

[39] Handl, J. and Meyer, B. (2007). Ant-based and swarm-based clustering. Swarm Intelligence, 1(2), pp.95-113.

[40] Rui Xu and D. Wunsch, “Survey of clustering algorithms,” in IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645-678, May 2005. doi: 10.1109/TNN.2005.845141

[41] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, U.K.: Cambridge Univ. Press, 1998

[42] Janbin, C., Jie, S. and Yunfei, C. (2007) A New Ant-based Clustering Algorithm on High Dimensional Data Space, Part 12, Complex Systems Concurrent Engineering, Springer London, pp. 605–611.

[43] Kobra Etminani. Web Usage Mining: users’ navigational patterns extraction from web logs using Ant-based Clustering Method. In Proceedings of IFSA-EUSFLAT 2009, pp. 396–401, 2009.

[44] K. Devipriyaa, B. Kalpana. Users’ Navigation Pattern Discovery using Ant Based Clustering and LCS Classification. Journal of Global Research in Computer Science, 2010.

[45] Jalali, M., Mustapha, N., Sulaiman, N.B. and Mamat, A. (2008c) A web usage mining approach based on LCS algorithm in online predicting recommendation systems, 12th International Conference Information Visualization, IEEE Computer Society, Pp. 302-307.

[46] Kobra Etminani. Web Usage Mining: users’ navigational patterns extraction from web logs using Ant-based Clustering Method. In Proceedings of IFSA-EUSFLAT 2009, pp. 396–401, 2009.

[47] A. Abraham, V. Ramos, Web Usage Mining Using Artificial Ant Colony Clustering and Genetic Programming, Congress on Evolutionary Computation (CEC), IEEE 2003.

[48] Handl, J., Knowles, J. and Dorigo, M. (2006). Ant-Based Clustering and Topographic Mapping. Artificial Life, 12(1), pp.35-62.

[49] Handl, J. and Meyer, B. (2007). Ant-based and swarm-based clustering. Swarm Intelligence, 1(2), pp.95-113.

[50] Bonabeau, E., Dorigo, M., & Theraulaz, G. (1999). Swarm Intelligence: From Natural to Artificial Systems. New York: Oxford University Press.

[51] Deneubourg, J.-L., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C., & Chrétien, L. (1991). The dynamics of collective sorting: Robot-like ants and ant-like robots. In Proceedings of the First International Conference on Simulation of Adaptive Behavior: From Animals to Animats 1 (pp. 356–365). Cambridge, MA: MIT Press.

[52] Dorigo, M., Bonabeau, E., & Theraulaz, G. (2000). Ant algorithms and stigmergy. Future Generation Computer Systems, 16(8), 851–871.

[53] Lumer, E., & Faieta, B. (1994). Diversity and adaptation in populations of clustering ants. In Proceedings of the Third International Conference on Simulation of Adaptive Behavior: From Animals to Animats 3 (pp. 501–508). Cambridge, MA: MIT Press.

[54] Jafar, O. and Sivakumar, R. (2010). Ant-based Clustering Algorithms: A Brief Survey. International Journal of Computer Theory and Engineering, pp. 787–796.

[55] Handl, J., & Meyer, B. (2002). Improved ant-based clustering and sorting in a document retrieval interface. In J. J. Merelo, P. Adamidis, & H.-G. Beyer (Eds.), Lecture notes in computer science: Vol. 2439. Parallel problem solving from nature—PPSN VII (pp. 913–923). Berlin: Springer

[56] Vizine, A. L., de Castro, L. N., & Gudwin, R. R. (2005a). Text document classification using swarm intelligence. In International conference on the integration of knowledge intensive multi-agent systems (pp. 134–139). Piscataway: IEEE Press.

[57] Vizine, A. L., de Castro, L. N., Hruschka, E. R., & Gudwin, R. R. (2005b). Towards improving clustering ants: an adaptive ant clustering algorithm. Informatica, 29, 143–154.

[58] Gu, Y., & Hall, L. O. (2006). Kernel based fuzzy ant clustering with partition validity. In Proceedings of the IEEE international conference on fuzzy systems (pp. 263–267). Piscataway: IEEE Press.

[59] Kanade, P. M., & Hall, L. O. (2003). Fuzzy ants as a clustering concept. In NAFIPS 2003: 22nd international conference of the North American fuzzy information processing society (pp. 227–232). Piscataway: IEEE Press.

[60] Kanade, P. M., & Hall, L. O. (2004). Fuzzy ant clustering by centroid positioning. In Proceedings of the IEEE international conference on fuzzy systems (Vol. 1, pp. 371–376). Piscataway: IEEE Press.

[61] Monmarché, N., Slimane, M., & Venturini, G. (1999). On improving clustering in numerical databases with artificial ants. In D. Floreano, J.-D. Nicoud, & F. Mondada (Eds.), Lecture notes in artificial intelligence: Vol. 1674. Advances in artificial life: 5th European conference, ECAL 99 (pp. 626–635). Berlin: Springer.

[62] Monmarché, N., Ramat, E., Desbarats, L., & Venturini, G. (2000). Probabilistic search with genetic algorithms and ant colonies. In A. S. Wu (Ed.), Workshop on optimization by building and using probabilistic models, GECCO 2000 (pp. 209–211).

[63] Li, Q., Shi, Z., Shi, J., & Shi, Z. (2005). Swarm intelligence clustering algorithm based on attractor. In L. Wang, K. Chen, & Y.-S. Ong (Eds.), Lecture notes in computer science: Vol. 3612. Advances in natural computation, first international conference, ICNC 2005 (pp. 496–504). Berlin: Springer.

[64] Ramos, V., & Merelo, J. (2002). Self-organized stigmergic document maps: environments as a mechanism for context learning. In Proceedings of the first Spanish conference on evolutionary and bio-inspired algorithms (pp. 284–293). Mérida: Centro Univ. Mérida

[65] Montes de Oca, M. A., Garrido, L., & Aguirre, J. L. (2005). Effects of inter-agent communication in antbased clustering algorithms: a case study on communication policies in swarm systems. In A. Gelbukh & H. Terashima (Eds.), Lecture notes in artificial intelligence: Vol. 3789. MICAI 2005: advances in artificial intelligence: 4th Mexican international conference on artificial intelligence (pp. 254–263). Berlin: Springer.

[66] Schockaert, S., Cock, M. D., Cornelis, C., & Kerre, E. E. (2004a). Efficient clustering with fuzzy ants. In D. Ruan, P. D’hondt, M. D. Cock, M. Nachtegael, & E. E. Kerre (Eds.), Applied computational intelligence, proceedings of the 6th international FLINS conference (pp. 195–200). River Edge: World Scientific.

[67] Schockaert, S., Cock, M. D., Cornelis, C., & Kerre, E. E. (2004b). Fuzzy ant based clustering. In M. Dorigo, M. Birattari, C. Blum, L. M. Gambardella, F. Mondada, & T. Stützle (Eds.), Lecture notes in computer science: Vol. 3172. Ant colony optimization and swarm intelligence, 4th international workshop, ANTS 2004 (pp. 342–349). Berlin: Springer

[68] Diabetes Data Set, UCI Machine Learning Repository. Retrieved January–February 2016, from https://archive.ics.uci.edu/ml/datasets/Diabetes

[69] Aggarwal, C. C., & Reddy, C. K. (2014). Data clustering: algorithms and applications. Boca Raton: Chapman and Hall/CRC.

[70] Vlachos, M., Hadjieleftheriou, M., Gunopulos, D., & Keogh, E. (2003, August). Indexing multi-dimensional time-series with support for multiple distance measures. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 216-225). ACM.

[71] M. Vlachos, G. Kollios, and D. Gunopulos. Discovering similar multidimensional trajectories. In Proc. of ICDE, 2002.

[72] G. Das, D. Gunopulos, and H. Mannila. Finding Similar Time Series. In Proc. of the First PKDD Symp., pages 88–100, 1997.

[73] Namiki, Y., Ishida, T., & Akiyama, Y. (2013). Acceleration of sequence clustering using longest common subsequence filtering. BMC Bioinformatics, 14(Suppl 8), S7. http://doi.org/10.1186/1471-2105-14-S8-S7

[74] Chen, Y., Lu, H., & Li, L. (2017). Automatic ICD-10 coding algorithm using an improved longest common subsequence based on semantic similarity. PLoS ONE, 12(3), e0173410. http://doi.org/10.1371/journal.pone.0173410

[75] Park, K., Lin, Y., Metsis, V., Le, Z., & Makedon, F. (2010, June). Abnormal human behavioral pattern detection in assisted living environments. In Proceedings of the 3rd International Conference on PErvasive Technologies Related to Assistive Environments (p. 9). ACM.

[76] Geraci, J. M., Rosen, A. K., Ash, A. S., McNiff, K. J., & Moskowitz, M. A. (1993). Predicting the occurrence of adverse events after coronary artery bypass surgery. Annals of Internal Medicine, 118(1), 18–24.

[77] Lloyd-Jones, D. M., Wilson, P. W., Larson, M. G., Beiser, A., Leip, E. P., D’Agostino, R. B., & Levy, D. (2004). Framingham risk score and prediction of lifetime risk for coronary heart disease. The American Journal of Cardiology, 94(1), 20–24.

[78] M. Azimpour-Kivi and R. Azmi, “A webpage similarity measure for web sessions clustering using sequence alignment,” 2011 International Symposium on Artificial Intelligence and Signal Processing (AISP), Tehran, 2011, pp. 20–24. doi: 10.1109/AISP.2011.5960993
