Diabetes educators commonly explain nutrition therapy to diabetic patients by introducing food substitution. The existing categorization mechanism does not classify food efficiently for diabetic patients. Clustering, a data mining (DM) technique, can be a very useful tool for collecting food items with similar elements into groups. This paper looks at the use of K-means to cluster a food dataset into groups based on food elements using the RapidMiner tool. The output of the clustering algorithm can help recommendation system software provide patients with good recommendations for their diabetes diet.
Keywords: data mining; diabetes; data set; K-means.
Food and nutrition are key to good health. Maintaining a healthy diet is important for everyone, and especially for diabetic patients, who face several dietary limitations. Nutrition therapy is a major way to prevent, manage and control diabetes; it manages nutrition based on the belief that food provides vital medicine and maintains good health. Typically, diabetic patients need to avoid additional sugar and fat when looking for a substitute within the same food group, so an effective clustering based on the actual nutrient values is needed. The clustering will encourage diabetics to eat the widest possible variety of permitted foods so that they get the full range of trace elements and other nutrients. This paper is set out as follows. Section 2 introduces some related work on data mining and diabetic diet. Section 3 describes the data set used and summarizes its main features. The data preparation process is presented in Section 4. Section 5 describes the materials and methods used in this study. In Section 6, the conclusion is given.
2. Literature Review
Li et al. proposed an automated food ontology constructed for diabetes diet care. Their method generates an ontology skeleton with hierarchical clustering algorithms (HCA), uses intersection naming for class naming, and ranks instances by granular ranking and positioning. The study is based on a dataset from the food nutrition composition database of the Department of Health. Phanich et al. proposed a Food Recommendation System (FRS) that uses food clustering analysis for diabetic patients. The system recommends proper substitute foods in the context of nutrition and food characteristics. They used a Self-Organizing Map (SOM) and K-means clustering for the food clustering analysis, which is based on the similarity of eight nutrients that are significant for diabetic patients. Their study is based on the dataset "Nutritive values for Thai food" provided by the Nutrition Division, Department of Health, Ministry of Public Health (Thailand).
3. Dataset Description
This study is based on the dataset provided by the USDA National Nutrient Database for Standard Reference (SR). The values in the database are based on the results of laboratory analyses or are calculated by using appropriate algorithms, factors, or recipes, as indicated by the source in the Nutrient Data file. Not every food item contains a complete nutrient profile. The data set used is an abbreviated file with fewer nutrients, but all the food items are included. It contains 7540 records and 52 attributes. Tables 1, 2 and 3 list the data set attributes and their descriptions. I used the RapidMiner tool to check for missing values. Table 4 presents a sample of the data set.
4. Data Preparation
The quality of the results of the mining process depends directly on the quality of the data. I first need to prepare the data set by applying data preprocessing strategies. Data preprocessing is an important and critical step in the data mining process, and it has a huge impact on the success of a data mining project. The purpose of data preprocessing is to cleanse dirty or noisy data. Fig. 1 shows the different strategies in the data preprocessing phase. In this study I focused on data cleaning and data reduction.
Figure 1 Strategies in data preprocessing
Table 1 Description of data set attributes 1-24
Table 2 Description of data set attributes 25-48
Table 3 Description of data set attributes 49-52
Table 4 Sample of the data set
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve data quality. The aim of data cleaning is to raise the data quality to a level suitable for the clustering analysis. The methods used for data cleaning here are filling in missing values and eliminating redundant data.
It is common for a dataset to have fields that contain unknown or missing values, and there are a variety of legitimate reasons why this can happen. There are a number of methods for treating records that contain missing values:
1. Omit the incorrect field(s)
2. Omit the entire record that contains the incorrect field(s)
3. Automatically enter/correct the data with default values e.g. select the mean from the range
4. Derive a model to enter/correct the data
5. Replace all values with a global constant
Within this study both missing and unknown data have been set to zero.
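The zero-fill choice can be sketched outside RapidMiner as follows. This is only an illustration in Python, not the process used in the study; the field names and the markers treated as "missing or unknown" are assumptions made for the example.

```python
# Markers treated as "missing or unknown" (an assumption for illustration).
MISSING_MARKERS = {None, "", "NA", "?"}

def fill_missing_with_zero(records, fields):
    """Return copies of the records with missing/unknown numeric fields set to 0.0."""
    cleaned = []
    for record in records:
        row = dict(record)  # copy, so the input records are left untouched
        for field in fields:
            value = row.get(field)
            if value in MISSING_MARKERS:
                row[field] = 0.0
            else:
                row[field] = float(value)
        cleaned.append(row)
    return cleaned

# Hypothetical food records; the column names are placeholders.
foods = [
    {"name": "food A", "protein_g": "1.2", "fiber_g": ""},
    {"name": "food B", "protein_g": None, "fiber_g": "3.4"},
]
cleaned_foods = fill_missing_with_zero(foods, ["protein_g", "fiber_g"])
```

Zero filling keeps every record, but it treats an unmeasured nutrient as an absent one, which is worth keeping in mind when reading the clusters.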
Duplicate records often do not share a common key, or they contain errors that make duplicate matching a difficult task. Such errors are introduced by transcription errors, incomplete information, lack of standard formats, or any combination of these factors. The data set used in this study includes duplicate data objects. I used RapidMiner to remove the duplicates. As a result of this process, the 7540 records decreased to 7139 records.
Data reduction can be achieved in many ways; one way is feature selection. The data set contains many irrelevant features that carry almost no useful information for the data mining task. I will focus on only eight of the fifty-two attributes, as they are important for a diabetes diet.
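A minimal sketch of duplicate removal, assuming that two records count as duplicates when they agree exactly on a chosen set of attributes. The study used RapidMiner for this step; the Python function and the field names below are hypothetical.

```python
def remove_duplicates(records, key_fields):
    """Keep the first record for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Hypothetical records: the second one repeats the nutrient values of the first.
foods = [
    {"name": "food A", "protein_g": 1.2, "fiber_g": 0.0},
    {"name": "food A (copy)", "protein_g": 1.2, "fiber_g": 0.0},
    {"name": "food B", "protein_g": 0.0, "fiber_g": 3.4},
]
unique_foods = remove_duplicates(foods, ["protein_g", "fiber_g"])
```

Exact matching of this kind does not catch duplicates that differ through transcription errors; those would need approximate matching.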
The eight nutrients include:
Vitamin B1 (also known as thiamine)
Data normalization is one of the preprocessing procedures in data mining, where the attribute data are scaled so as to fall within a small specified range such as -1.0 to 1.0 or 0.0 to 1.0.
Normalization before clustering is especially needed for distance metrics, such as Euclidean distance, which are sensitive to differences in the magnitude or scale of the attributes.
K-means typically uses Euclidean distance to measure the distortion between a data object and its cluster centroid. However, the clustering results can be greatly affected by differences in scale among the dimensions from which the distances are computed. Data normalization is the linear transformation of data to a specific range. It is therefore worthwhile to enhance clustering quality by normalizing the dynamic range of the input data objects into a specific range. In this study I normalize the data to the range [0, 1]. Figure 2 shows the result of the data preprocessing.
Figure 2 Result of preprocessing (data cleaning, data reduction, data normalization)
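A linear transformation to [0, 1] is commonly done with min-max scaling, x' = (x - min) / (max - min), applied to each attribute separately. The sketch below assumes this is the transformation intended; the sample values are hypothetical.

```python
def min_max_normalize(column):
    """Linearly rescale a list of numbers to the range [0, 1]."""
    low, high = min(column), max(column)
    if high == low:
        # A constant attribute carries no scale information; map it to 0.0.
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]

# Hypothetical values of one nutrient attribute.
protein_g = [0.0, 5.0, 10.0, 20.0]
normalized = min_max_normalize(protein_g)
```

After this step every attribute spans the same range, so no single nutrient dominates the Euclidean distance simply because it is recorded in larger numbers.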
5. Data Analysis Methodology
After data preparation, the second step is to use K-means to cluster the food data set. To choose the k-value, I used the Davies-Bouldin index; the k-value is optimal when the index is smallest. For this study, I used K=19 since it gives the smallest value.
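The selection of k by the smallest Davies-Bouldin index can be sketched in plain Python. This is an illustration under assumptions, not the RapidMiner process used in the study: the k-means below uses a deterministic farthest-first seeding (chosen here so the example is reproducible), and the points are a toy data set with three obvious groups.

```python
import math

def mean_point(members):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))

def kmeans(points, k, iterations=50):
    """Lloyd's k-means with deterministic farthest-first seeding; returns labels."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Seed with the point farthest from its nearest existing centroid.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean_point(members)
    return labels

def davies_bouldin(points, labels):
    """Davies-Bouldin index; assumes at least two clusters with distinct centroids."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    centroids = {c: mean_point(m) for c, m in clusters.items()}
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = {c: sum(math.dist(p, centroids[c]) for p in m) / len(m)
               for c, m in clusters.items()}
    total = 0.0
    for i in clusters:
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in clusters if j != i)
    return total / len(clusters)

def best_k(points, candidates):
    """Return the candidate k with the smallest Davies-Bouldin index, plus all scores."""
    scores = {k: davies_bouldin(points, kmeans(points, k)) for k in candidates}
    return min(scores, key=scores.get), scores

# Toy data: three well-separated groups of normalized-looking points.
points = [
    (0.0, 0.0), (0.2, 0.0), (0.0, 0.2),
    (5.0, 0.0), (5.2, 0.0), (5.0, 0.2),
    (0.0, 5.0), (0.2, 5.0), (0.0, 5.2),
]
chosen_k, scores = best_k(points, range(2, 5))
```

On the real data the same loop would run over a wider range of candidate k-values, with the eight normalized nutrient attributes as the point coordinates.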
The final result is a set of food clusters in which foods in the same group provide approximately the same amounts of the eight nutrients. The data analysis solution RapidMiner was used to analyze the data set and cluster the food items. The whole process sequence is shown in Figure 3. Figures 4, 5 and 6 show the final result.
Figure 3 Data analysis process
Figure 4 Food items clustered into 19 clusters
Figure 5 Distribution of the 8 nutrients in clusters 0-12
Figure 6 Distribution of the 8 nutrients in clusters 13-18
5.1 K-means Evaluation
The clustering was evaluated with a performance measure based on the number of clusters.
This measure builds a derived index from the number of clusters using the formula 1 - (k / n), where k is the number of clusters and n is the number of covered examples. It is used for optimizing the coverage of a cluster result with respect to the number of clusters. Applying the K-means model to this data set gives a cluster number index of 0.997, which indicates good coverage.
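The reported value can be checked by hand from the formula. The sketch below assumes that n is the 7139 records left after duplicate removal and that k is the 19 clusters chosen above; the function name is hypothetical.

```python
def cluster_number_index(k, n):
    """Cluster number index 1 - (k / n) for k clusters over n covered examples."""
    if n <= 0 or not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return 1 - (k / n)

# k = 19 clusters, n = 7139 records (assumed to be the covered examples).
index = cluster_number_index(k=19, n=7139)
print(round(index, 3))  # 0.997
```

The index depends only on k and n, so it reflects how few clusters are used relative to the number of examples rather than how compact the clusters are.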
6. Conclusion
Data mining has been widely used in many health care fields. Diabetes diet care is one of the health problems in which data mining plays a role. This experiment was conducted on the USDA National Nutrient dataset. The results demonstrate that K-means is effective and can successfully create food groups that will help in many recommendation systems.