# Identifying Clusters in High Dimensional Data

“Ask those who remember, are mindful if you do not know.” (Holy Qur'an, 6:43)

### Removal Of Redundant Dimensions To Find Clusters In N-Dimensional Data Using Subspace Clustering

### Abstract

Data mining has emerged as a powerful tool for extracting knowledge from huge databases. Researchers have introduced several machine learning algorithms that explore databases to discover information, hidden patterns, and rules that were not known when the data was recorded. Thanks to remarkable developments in storage capacity, processing power, and algorithmic tools, practitioners are developing new and improved algorithms and techniques in several areas of data mining to discover rules and relationships among attributes in both simple and complex high-dimensional databases. Data mining is applied in a wide variety of areas, ranging from banking to marketing, from engineering to bioinformatics, and from investment to risk analysis and fraud detection. Practitioners are also analyzing and implementing artificial neural network techniques for classification and regression problems because of their accuracy and efficiency. The aim of this short research project is to develop a way of identifying clusters in high-dimensional data, as well as the redundant dimensions that create noise and hinder the identification of those clusters. The technique used in this project relies on the projections of the data points along the dimensions: it measures the intensity of projection along each dimension in order to find the clusters and the redundant dimensions in high-dimensional data.

### 1 Introduction

In numerous scientific settings, engineering processes, and business applications, ranging from experimental sensor data and process control data to telecommunication traffic observation and financial transaction monitoring, huge amounts of high-dimensional measurement data are produced and stored. While sensor equipment and large storage devices are getting cheaper day by day, data analysis tools and techniques lag behind. Clustering methods are common solutions to unsupervised learning problems, where neither expert knowledge nor helpful annotation of the data is available. In general, clustering groups data objects so that similar objects end up in the same cluster, whereas objects from different clusters are highly dissimilar. However, clustering often discloses almost no structure even when it is known that there must be groups of similar objects. In many cases, the reason is that the cluster structure is induced by only a subset of the space's dimensions, and the many additional dimensions contribute nothing but noise that hinders the discovery of the clusters. One solution is to apply clustering algorithms to the relevant subspaces only. The new question is then how to determine the relevant subspaces among the dimensions of the full space. The candidates form the power set of the set of dimensions, so a brute-force trial of all subsets is infeasible: their number grows exponentially with the original dimensionality.
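To make the exponential growth concrete, the short sketch below counts and enumerates the non-empty subsets of a set of dimensions. The function names `count_subspaces` and `enumerate_subspaces` are illustrative and do not come from the original text.

```python
from itertools import combinations


def count_subspaces(num_dims: int) -> int:
    """Number of non-empty subsets of `num_dims` dimensions."""
    return 2 ** num_dims - 1


def enumerate_subspaces(dims):
    """Yield every non-empty subset of `dims`; only practical for small inputs."""
    for size in range(1, len(dims) + 1):
        yield from combinations(dims, size)


if __name__ == "__main__":
    # Three dimensions give 7 candidate subspaces.
    print(list(enumerate_subspaces(["x", "y", "z"])))
    # The count roughly doubles with every added dimension.
    for d in (10, 20, 50):
        print(d, count_subspaces(d))
```

With 20 dimensions there are already more than a million candidate subspaces, which is why an exhaustive search is ruled out.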

In high-dimensional data, visualization and representation become more difficult as the number of dimensions increases, and the increase can sometimes create a bottleneck. As dimensions are added, the data appears to disperse towards the corners of the space. Subspace clustering addresses two problems in parallel: it determines which subspaces are relevant, so that the remaining dimensions can be marked as redundant, and it finds the cluster structures that become apparent in those subspaces. Subspace clustering is an extension of traditional clustering that automatically finds clusters present in subspaces of a high-dimensional data space. These subspaces allow the data points to be clustered better than in the original space, and the approach works even when the curse of dimensionality occurs. Most clustering algorithms have been designed to discover clusters in the full-dimensional space, so they are not effective at identifying clusters that exist within a subspace of the original data space. In addition, most clustering algorithms produce results that depend on the order in which the input records were processed [2].

Subspace clustering can identify the different clusters within subspaces of large volumes of sales data, and through them we can find which attributes are related. This can be useful for promoting sales and for planning the inventory levels of different products. It can also be used to find subspace clusters in spatial databases, and useful decisions can be taken based on the subspace clusters identified [2]. The technique used here to identify the redundant dimensions that create noise and obscure the clusters works in steps. In the first step, the data points are plotted in all dimensions. In the second step, the projections of all data points along each dimension are plotted. In the third step, the unions of the projections are plotted for all possible combinations of dimensions. Finally, the union of all projections along all dimensions is plotted and analyzed. This shows the contribution of each dimension to identifying the clusters, which is represented by the weight of projection. If a dimension contributes very little to the weight of projection, it can be considered redundant, meaning that it is not important for identifying the clusters in the given data. The details of this strategy are covered in later chapters.
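The text does not define the weight of projection at this point, so the following is only a rough sketch of the general idea under stated assumptions. It assumes that the weight of a dimension can be approximated by how unevenly the points are spread when projected onto that axis (one minus the normalized entropy of a histogram), and that a dimension scoring below an arbitrary threshold is treated as redundant. The names `projection_weight`, `find_redundant_dimensions` and `demo_points`, the bin count and the threshold are all hypothetical choices, not the author's method.

```python
import math
import random


def projection_weight(values, bins=10):
    """Score one axis projection: 0 means evenly spread, 1 means fully concentrated."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 1.0  # every point projects onto the same coordinate
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[index] += 1
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts if c)
    return 1.0 - entropy / math.log(bins)


def find_redundant_dimensions(points, threshold=0.1, bins=10):
    """Return (weights, indices of dimensions whose weight is below threshold)."""
    num_dims = len(points[0])
    weights = [
        projection_weight([p[d] for p in points], bins) for d in range(num_dims)
    ]
    redundant = [d for d, w in enumerate(weights) if w < threshold]
    return weights, redundant


def demo_points(n=400, seed=0):
    """Synthetic data: two clusters in dimensions 0 and 1, uniform noise in dimension 2."""
    rng = random.Random(seed)
    points = []
    for i in range(n):
        centre = 0.2 if i % 2 == 0 else 0.8
        points.append(
            (rng.gauss(centre, 0.03), rng.gauss(centre, 0.03), rng.uniform(0.0, 1.0))
        )
    return points


if __name__ == "__main__":
    weights, redundant = find_redundant_dimensions(demo_points())
    print([round(w, 3) for w in weights], redundant)
```

On this synthetic data the noise dimension receives a weight close to zero, while the two clustered dimensions score clearly higher. An entropy-based score is just one possible stand-in for the projection weight described in the text.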

### 2 Data Mining

### 2.1 - What is Data Mining?

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. The information can be used for many purposes, such as increasing revenue or cutting costs. The data mining process also finds hidden knowledge and relationships within the data that were not known when the data was recorded. Describing the data is the first step in data mining, followed by summarizing its attributes (such as mean and standard deviation). After that, the data is reviewed using visual tools such as charts and graphs, and meaningful relations are determined. In the data mining process, the steps of collecting, exploring and selecting the right data are critically important. Users can analyze data from different dimensions, categorize it, and summarize it. Data mining finds correlations or patterns among the fields in large databases.

Data mining has great potential to help companies focus on the most important information in their data warehouses. It can predict future trends and behaviors and allows businesses to make proactive, knowledge-driven decisions. It can answer business questions that were traditionally too time consuming to resolve. It scours databases for hidden patterns, finding predictive information that experts may miss because it lies beyond their expectations. Data mining is normally used to transform data into information or knowledge. It is commonly used in a wide range of profiling practices such as marketing, fraud detection and scientific discovery. Many companies already collect and refine their data. Data mining techniques can be implemented on existing platforms to enhance the value of information resources. Data mining tools can analyze massive databases to deliver answers to such questions.

Some other terms carry a meaning similar to data mining, such as “Knowledge Mining”, “Knowledge Extraction” or “Pattern Analysis”. Data mining can also be treated as Knowledge Discovery from Data (KDD). Others regard data mining simply as an essential step in knowledge discovery from large data. The process of knowledge discovery from data contains the following steps.

* Data cleaning (removing the noise and inconsistent data)

* Data Integration (combining multiple data sources)

* Data selection (retrieving the data relevant to analysis task from database)

* Data Transformation (transforming the data into appropriate forms for mining by performing summary or aggregation operations)

* Data mining (applying the intelligent methods in order to extract data patterns)

* Pattern evaluation (identifying the truly interesting patterns representing knowledge based on some measures)

* Knowledge representation (using knowledge representation techniques to present the mined knowledge to the user)
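The steps above can be sketched as a chain of small functions. This is a toy illustration only: the record layout (`customer`, `amount`, `channel`), the "above-average spender" pattern and every function name are assumptions made for the example, not part of the KDD definition.

```python
from collections import defaultdict


def clean(records):
    """Data cleaning: drop records with a missing or negative amount."""
    return [r for r in records if r.get("amount") is not None and r["amount"] >= 0]


def integrate(*sources):
    """Data integration: combine records from several sources."""
    return [record for source in sources for record in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: aggregate the total amount per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


def mine(totals):
    """Data mining (toy pattern): customers who spend more than the mean."""
    mean = sum(totals.values()) / len(totals)
    return {c: t for c, t in totals.items() if t > mean}


def evaluate(patterns, min_total):
    """Pattern evaluation: keep only patterns above an interest threshold."""
    return {c: t for c, t in patterns.items() if t >= min_total}


def present(patterns):
    """Knowledge representation: format the result for the user."""
    return [f"{c}: {t:.2f}" for c, t in sorted(patterns.items())]


def run_pipeline(store, web, min_total=50.0):
    """Run the seven steps in the order listed in the text."""
    records = integrate(clean(store), clean(web))
    records = select(records, ["customer", "amount"])
    return present(evaluate(mine(transform(records)), min_total))


if __name__ == "__main__":
    store = [
        {"customer": "ann", "amount": 30.0, "channel": "store"},
        {"customer": "bob", "amount": None, "channel": "store"},
    ]
    web = [
        {"customer": "ann", "amount": 25.0, "channel": "web"},
        {"customer": "cy", "amount": 10.0, "channel": "web"},
        {"customer": "bob", "amount": 5.0, "channel": "web"},
    ]
    print(run_pipeline(store, web))
```

In practice each step is far more involved, but the sketch shows how the output of one step becomes the input of the next.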

### 2.2 - Data

Data can be any facts, text, images or numbers that can be processed by a computer. Today's organizations are accumulating large and growing amounts of data in different formats and in different databases. This can include operational or transactional data such as costs, sales, inventory, payroll and accounting. It can also include non-operational data such as industry sales and forecast data. Finally, it can include metadata, which is data about the data itself, such as logical database designs and data dictionary definitions.

### 2.3 - Information

Information can be derived from data through the patterns, associations or relationships that exist in it. For example, retail point-of-sale transaction data can be analyzed to yield information about which products are being sold and when.

### 2.4 - Knowledge

Knowledge can be derived from information through historical patterns and future trends. For example, analyzing retail supermarket sales data in the light of promotional efforts can provide knowledge of customers' buying behavior. A manufacturer can then easily determine which items are most responsive to promotional efforts.

### 2.5 - Data warehouse

Advances in data capture, processing power, data transmission and storage technologies are enabling organizations to integrate their various databases into data warehouses. The process of centralizing and retrieving the data is called data warehousing. Data warehousing is a new term, but the concept is older. A data warehouse stores a massive amount of data in electronic form. Data warehousing represents an ideal way of maintaining a central repository for all organizational data. The purpose of a data warehouse is to maximize user access and analysis. Data from different sources is extracted, transformed and then loaded into the data warehouse. Users and clients can generate different types of reports and carry out business analysis by accessing the data warehouse.

Data mining is primarily used today by companies with a strong consumer focus - retail, financial, communication, and marketing organizations. It allows these organizations to evaluate associations between internal and external factors. Product positioning, price and staff skills are examples of internal factors. Economic indicators, customer demographics and competition are examples of external factors. It also allows them to calculate the impact on sales, corporate profits and customer satisfaction. Furthermore, it allows them to drill down from summary information to the detailed transactional data. Given databases of sufficient size and quality, data mining technology can generate new business opportunities.

Data mining automates the search for predictive information in huge databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data very quickly. Targeted marketing is an example of a predictive problem: data mining uses data on previous promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Data mining tools also traverse huge databases and discover previously unseen patterns in a single step. Analyzing retail sales data to find apparently unrelated products that are often purchased together is one example. Other pattern discovery problems include identifying fraudulent credit card transactions and identifying irregular data that could indicate data entry errors. When data mining tools run on high-performance parallel processing systems, they can analyze huge databases in very little time. Faster processing means that users can experiment more freely and in more detail to understand complex data. High speed and quick response make it practical for users to examine huge amounts of data. Larger databases, in turn, yield better predictions.
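The example of products that are often purchased together can be illustrated with a minimal co-occurrence count. The basket contents, the function name `frequent_pairs` and the support threshold are illustrative assumptions; real tools use more scalable association rule algorithms.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least `min_support` of baskets."""
    counts = Counter()
    for basket in baskets:
        # Count each unordered pair once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: c / total for pair, c in counts.items() if c / total >= min_support}


if __name__ == "__main__":
    baskets = [
        ["bread", "milk"],
        ["bread", "nappies", "beer"],
        ["milk", "nappies", "beer"],
        ["bread", "milk", "nappies", "beer"],
    ]
    print(frequent_pairs(baskets, min_support=0.6))
```

Here the pair of beer and nappies appears in three of the four baskets, so it is the only pair that passes a 60% support threshold.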

### 2.6 - Descriptive and Predictive Data Mining

Descriptive data mining aims to find patterns in the data that provide information about what the data contains. It describes patterns in existing data and is generally used to create meaningful subgroups such as demographic clusters. Typical descriptive techniques include summarization and visualization, clustering, and link analysis. Predictive data mining is used to forecast explicit values based on patterns determined from known results. For example, from a database of clients who have already responded to a specific offer, a model can be built that predicts which prospects are most likely to respond to the same offer. The term is usually applied to data mining projects whose goal is to identify a statistical or neural network model, or set of models, that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining to derive a (trained) model or set of models that can quickly identify transactions with a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g. determining the clusters or segments of customers), in which case drill-down descriptive and exploratory methods need to be applied. Predictive data mining is goal oriented. It can be decomposed into the following major tasks.

* Data Preparation

* Data Reduction

* Data Modeling and Prediction

* Case and Solution Analysis

### 2.7 - Text Mining

Text mining, sometimes also called text data mining, is roughly equivalent to text analytics. It is the process of deriving high-quality information from text. High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. “High quality” in text mining usually refers to some combination of relevance, novelty, and interestingness. Text mining tasks include text categorization, concept/entity extraction, text clustering, sentiment analysis, production of rough taxonomies, entity relation modeling, and document summarization.

Text mining is also described as the discovery by computer of new, previously unknown information by automatically extracting information from different written resources. Linking the extracted information together is the key step in forming new facts or new hypotheses, which can then be examined further by more conventional means of experimentation. In text mining, the goal is to discover unknown information: something that no one yet knows and so could not yet have written down. The difference between ordinary data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts. Databases are designed for programs to process automatically; text is written for people to read. Many researchers think that a full-fledged simulation of how the brain works will be needed before programs that read the way people do can be written.

### 2.8 - Web Mining

Web mining is the technique used to discover and extract information from web documents and services automatically. The interest of various research communities, the tremendous growth of information resources on the Web, and the recent interest in e-commerce have made this a very large area of research. Web mining can usually be decomposed into the following subtasks.

* Resource finding: fetching intended web documents.

* Information selection and pre-processing: selecting and preprocessing specific information from fetched web resources automatically.

* Generalization: automatically discovering general patterns at individual websites and across multiple websites.

* Analysis: validation and explanation of mined patterns.

Web mining can be divided into three main areas of interest, based on which part of the Web is mined: Web Content Mining, Web Structure Mining and Web Usage Mining. Web content mining describes the discovery of useful information from web contents, data and documents [10]. In the past the internet consisted only of different types of services and data resources. Today most data is available over the internet; even digital libraries are available on the Web. Web contents consist of several types of data, including text, images, audio, video and metadata, as well as hyperlinks. Most companies are trying to transform their business and services into electronic form and put them on the Web. As a result, company databases that previously resided on legacy systems are now accessible over the Web, so employees, business partners and even end clients can access them. Users access applications through web interfaces, which is why most companies are moving their business to the Web: the internet is capable of connecting to any other computer anywhere in the world [11]. Some web contents are hidden and hence cannot be indexed. Data generated dynamically from the results of database queries, and private data, fall into this area. Web contents include unstructured data such as free text, semi-structured data such as HTML, and fully structured data such as tables or database-generated web pages. However, most web content is unstructured text. Work on web content mining is mostly done from two points of view: the information retrieval (IR) view and the database (DB) view. “From IR view, web content mining assists and improves the information finding or filtering to the user. From DB view web content mining models the data on the web and integrates them so that the more sophisticated queries other than keywords could be performed.” [10].

In web structure mining, we are concerned with the structure of hyperlinks within the web itself, which can be called the inter-document structure [10]. It is closely related to web usage mining [14]. Pattern detection and graph mining are essentially related to web structure mining, and link analysis techniques can be used to determine the patterns in the graph. Search engines such as Google typically use web structure mining. For example, the links are mined, and one can then determine which web pages point to a particular web page. When a string is searched, the web page with the largest number of links pointing to it may come first in the list. Web pages are thus listed by rank, and the rank of a page is calculated from the ranks of the web pages that point to it [14]. Based on the web structural data used, web structure mining can be divided into two categories. The first kind deals with extracting patterns from the hyperlinks in the web. A hyperlink is a structural component that connects a web page to a different web page or a different location. The second kind deals with the document structure, using the tree-like structure to analyze and describe the HTML or XML tags within web pages.
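The idea that a page's rank is calculated from the ranks of the pages pointing to it can be sketched with a simplified PageRank-style power iteration. The text does not name a specific algorithm, so this choice, the damping factor of 0.85, the iteration count and the function name `pagerank` are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively rank pages; `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    ranks = pagerank(links)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(page, round(ranks[page], 3))
```

In this small graph, page `c` is pointed to by three other pages and ends up with the highest rank, and page `a` ranks second because its single incoming link comes from the highly ranked `c`.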

With the continuous growth of e-commerce, web services and web applications, the volume of clickstream and user data collected by web-based organizations in their daily operations has increased. Organizations can analyze such data to determine the lifetime value of clients, design cross-marketing strategies, and so on [13]. Web usage mining deals with the data generated by users' clickstreams. “The web usage data includes web server access logs, proxy server logs, browser logs, user profile, registration data, user sessions, transactions, cookies, user queries, bookmark data, mouse clicks and scrolls and any other data as a result of interaction” [10]. Web usage mining is therefore the most important task of web mining [12]. Web log databases can provide rich information about web dynamics. In web usage mining, web log records are mined to discover user access patterns, through which potential customers can be identified, the quality of internet services can be enhanced, and web server performance can be improved. Many techniques can be developed to implement web usage mining, but the success of such applications depends on what and how much valid and reliable knowledge can be discovered from the log data. Most often, web logs are cleaned, condensed and transformed before any useful and significant information is extracted from them. Web mining can be performed on web log records to find association patterns, sequential patterns and trends in web access. The overall web usage mining process can be divided into three inter-dependent stages: data collection and pre-processing, pattern discovery, and pattern analysis [13]. In the data collection and pre-processing stage, the raw data is collected, cleaned and transformed into a set of user transactions that represent the activities of each user during visits to the web site. In the pattern discovery stage, statistical, database, and machine learning operations are performed to retrieve hidden patterns representing the typical behavior of users, as well as summary statistics on web resources, sessions, and users.
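As a rough illustration of the pre-processing stage, the sketch below groups raw log records into per-user sessions. The record layout `(user, timestamp_in_seconds, url)`, the 30-minute inactivity timeout and the function name `sessionize` are assumptions for the example; the text does not prescribe a particular sessionization rule.

```python
def sessionize(log, timeout=1800):
    """Group (user, timestamp_seconds, url) records into per-user sessions."""
    sessions = {}
    for user, timestamp, url in sorted(log, key=lambda r: (r[0], r[1])):
        user_sessions = sessions.setdefault(user, [])
        last_time = user_sessions[-1][-1][0] if user_sessions else None
        if last_time is not None and timestamp - last_time <= timeout:
            # Still within the inactivity window: continue the current session.
            user_sessions[-1].append((timestamp, url))
        else:
            # First request, or too long since the last one: start a new session.
            user_sessions.append([(timestamp, url)])
    # Keep only the ordered URLs of each session.
    return {
        user: [[url for _, url in session] for session in user_sessions]
        for user, user_sessions in sessions.items()
    }


if __name__ == "__main__":
    log = [
        ("u1", 0, "/home"),
        ("u2", 10, "/home"),
        ("u1", 60, "/cart"),
        ("u1", 4000, "/home"),
    ]
    print(sessionize(log))
```

The resulting sessions are the kind of user transactions on which the pattern discovery stage can then operate.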

### 3 Classification

### 3.1 - What is Classification?

As the quantity and variety of available data increase, robust, efficient and versatile data categorization techniques are needed for exploration [16]. Classification is a method of assigning class labels to patterns. It is a data mining methodology used to predict group membership for data instances. For example, one may want to use classification to predict whether the weather on a specific day will be “sunny”, “cloudy” or “rainy”. Classification uses the attribute values found in the data of one class to distinguish it from other classes, and it is mainly concerned with the treatment of large datasets. In classification we build a model by analyzing existing data and describing the characteristics of the various classes of data. We can then use this model to predict the class of new data. Classification is a supervised machine learning procedure in which individual items are placed in a group based on quantitative information on one or more of their characteristics. Decision trees and Bayesian networks are examples of classification methods. Data mining techniques that separate similar data objects from dissimilar ones without using class labels are called clustering, which can be regarded as an unsupervised form of classification. Clustering is the process of finding similar data objects within a given dataset. The similarity can be defined in terms of a distance measure or any other parameter, depending on the need and the given data.

Classification is an ancient term as well as a modern one, since the classification of animals, plants and other physical objects is still valid today. Classification is a way of thinking about things rather than a study of the things themselves, so it draws its theory and application from the complete range of human experience and thought [18]. More broadly, classification tasks include grouping medical patients by disease, finding the set of images containing a red rose in an image database, finding the set of documents describing “classification” in a document or text database, attributing equipment malfunctions to their cause, and rating loan applicants by their likelihood of repayment. In the latter case, for example, the problem is to predict a new applicant's eligibility for a loan given old data about customers. Many techniques are used for data categorization and classification. The most common are decision tree classifiers and Bayesian classifiers.

### 3.2 - Types of Classification

There are two types of classification: supervised and unsupervised. Supervised learning is a machine learning technique for learning a function from training data. The training data consists of pairs of input objects and their desired outputs. The output of the function can be a continuous value, in which case the task is called regression, or a class label for the input object, in which case it is called classification. The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner needs to generalize from the presented data to unseen situations in a meaningful way.

Unsupervised learning is a class of machine learning problems in which one seeks to determine how the data is organized. It is distinguished from supervised learning in that the learner is given only unlabeled examples. Unsupervised learning is closely related to the problem of density estimation in statistics. However, it also covers many other techniques that are used to summarize and explain key features of the data. One form of unsupervised learning is clustering, which is covered in the next chapter. Blind source separation based on Independent Component Analysis is another example. Among neural network models, adaptive resonance theory and self-organizing maps are commonly used unsupervised learning algorithms. There are also many techniques for implementing supervised classification. We discuss the two most commonly used: decision tree classifiers and Naïve Bayesian classifiers.

### 3.2.1 - Decision Trees Classifier

There are many ways to represent classifiers, and the decision tree is probably the most widely used. It is one of the most widely used supervised learning methods for data exploration. It is easy to use, can be represented as if-then-else rules, and works well on noisy data [16]. A decision tree uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal. In machine learning and data mining, a decision tree is used as a predictive model, that is, a mapping from observations about an item to conclusions about its target value. More descriptive names for such tree models are classification tree and regression tree. In these tree structures, leaves represent classifications and branches represent the conjunctions of features that lead to those classifications. The machine learning technique for inducing a decision tree from data is called decision tree learning. Decision trees are a simple but powerful form of multiple variable analysis [15]. Classification is done by tree-like structures that apply a different test criterion to a variable at each node, and new branches and leaves are generated based on the results of those tests. A decision tree is a supervised learning system in which classification rules are constructed from the tree. Decision trees are produced by algorithms that identify various ways of splitting a data set into branch-like segments. A decision tree tries to find a strong relationship between input and target values within the dataset [15].

In classification tasks, decision trees visualize the steps taken to arrive at a classification. Every decision tree starts with a root node, which is considered to be the "parent" of every other node. Each node in the tree evaluates an attribute in the data and decides which path to follow. Typically the decision test is a comparison of a value against some constant. Classification with a decision tree is done by traversing from the root node down to a leaf node. Decision trees can represent and classify diverse types of data. The simplest and most familiar form is numerical data. It is often also necessary to organize nominal data. Nominal quantities are normally represented by a discrete set of symbols. For example, a weather condition can be described in either numeric or nominal fashion. The temperature can be quantified by saying that it is eleven degrees Celsius or fifty-two degrees Fahrenheit. The terms cold, cool, mild, warm or hot can also be used. The former is numeric data, while the latter is an example of nominal data. More precisely, the example of cold, cool, mild, warm and hot is a special type of nominal data known as ordinal data. Ordinal data carries an implicit assumption of an ordered relationship among the values. In the weather example, purely nominal descriptions such as rainy, overcast and sunny can also be added. These values have no ordering or distance measure among them.

Decision trees are trees in which each node is a question, each branch is an answer to that question, and each leaf is a result. Here is an example of a decision tree.

Roughly, the idea is that the purchasing decision depends on the number of items in stock. If we do not have much stock, we buy at almost any price. If we have a lot of stock, we buy only if the item is inexpensive. If fewer than 10 items are in stock, we buy all items if the unit price is less than £10; otherwise we buy only 10 items. If we have 10 to 40 items in stock, we check the unit price: if it is less than £5 we buy only 5 items; otherwise there is no need to buy anything expensive, since stock is already good. If we have more than 40 items in stock, we buy 5 items if and only if the price is less than £2; otherwise we buy nothing. In this way decision trees help us to make a decision at each level. Here is another example of a decision tree, representing the risk associated with rash driving.

The root node at the top of the tree shows the feature that is split first because it gives the highest discrimination. The internal nodes show decision rules on one or more attributes, while the leaf nodes are class labels. A person younger than 20 carries a very high risk, while a person older than 30 carries a very low risk. For the middle category, a person older than 20 but younger than 30, the risk depends on another attribute, the car type. If the car is a sports car, the risk is again high; if it is a family car, the risk is low.
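Both example trees can be written directly as nested if-then-else rules, which is one reason decision trees are easy to interpret. The sketch below encodes the thresholds given in the text. The function names are illustrative, and two details are assumptions because the text leaves them open: ages of exactly 20 or 30 are sent to the middle branch, and car types other than "sports" and "family" are rejected.

```python
def purchase_decision(stock: int, unit_price: float) -> str:
    """Stock example: the root splits on stock level, the next level on unit price."""
    if stock < 10:
        return "buy all" if unit_price < 10 else "buy 10"
    if stock <= 40:
        return "buy 5" if unit_price < 5 else "buy nothing"
    return "buy 5" if unit_price < 2 else "buy nothing"


def driving_risk(age: int, car_type: str) -> str:
    """Driving example: the root splits on age, the middle branch on car type."""
    if age < 20:
        return "very high"
    if age > 30:
        return "very low"
    # Assumption: ages of exactly 20 and 30 fall into this middle branch.
    if car_type == "sports":
        return "high"
    if car_type == "family":
        return "low"
    # Assumption: the text only covers sports and family cars.
    raise ValueError(f"no rule for car type {car_type!r}")


if __name__ == "__main__":
    print(purchase_decision(stock=5, unit_price=8.0))    # buy all
    print(purchase_decision(stock=25, unit_price=6.0))   # buy nothing
    print(driving_risk(age=25, car_type="sports"))       # high
```

Classifying a new case means following one path from the root to a leaf, so each call above evaluates at most two tests.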

In the field of sciences & engineering and in the applied areas including business intelligence and data mining, many useful features are being introduced as the result of evolution of decision trees.

* With the help of transformation in decision trees, the volume of data can be reduced to a more compact form that preserves the major characteristics and provides an accurate summary [16].

* Decision trees discover whether the data contains well separated classes of objects, such that the classes can be interpreted meaningfully.

* They map the data in the form of a tree so that prediction values can be generated by backtracking from the leaves to the root, and the tree can be used to predict the outcome for new data.

* Decision trees present their results in a symbolic and visual form. They are usually easy to generate, easy to understand and easy to implement. The ability to add multiple predictors in a simple way is one of the useful features of decision trees. Highly complex rules can be built incrementally in a simple and powerful way.

* Various levels of measurement, both qualitative and quantitative, can be incorporated easily in decision trees. Examples of qualitative values are “good” or “bad”. Quantitative measurements can include ordinal values like “high”, “medium” and “low” categories and interval levels.

* Various twists and turns can easily be accommodated in decision trees - unbalanced effects, nested effects, offsetting effects, interactions and nonlinearities [15].

* Decision trees are nonparametric and highly robust, and they produce similar effects regardless of the level of measurement of the fields that are used to construct the decision tree branches. For example, a decision tree of income distribution will have the same effects regardless of whether income is measured in hundreds, in thousands or even in discrete values from 1 to 10.

Decision trees normally have reasonable training time, fast application, easy interpretation and implementation, and the ability to handle a large number of features. They are very suitable for knowledge discovery because they do not make any assumptions about the underlying data distribution. The disadvantages of decision trees include the following.

* When complicated relationships exist between features within the data, decision trees are unable to handle them.

* Decision trees sometimes generate only simple axis-parallel decision boundaries.

* When a lot of data is missing from the dataset, it sometimes becomes difficult to build decision trees from that data.

There are many decision tree classifier models, including ID3 and SPRINT. ID3 uses an information theoretic approach: at each step it selects the feature that provides the greatest gain in information or, equivalently, the greatest decrease in entropy. In the first step the initial value of entropy is calculated. In the next step the feature that gives the maximum decrease in entropy is selected to serve as the root node of the decision tree. The next level of the tree is built in the same way, by choosing the feature that gives the greatest decrease in entropy, and this process is repeated until all leaf nodes contain a single class and the system entropy is zero. At this stage the leaf nodes of the decision tree hold patterns of a single class. Some nodes may remain that cannot be resolved any further. SPRINT (Scalable PaRallelizable INduction of decision Trees) is also a decision tree classifier for data mining. It is able to handle large disk resident training sets with no restriction on training set size and is easily parallelizable [16]. A list is maintained for every attribute in the dataset, and each attribute list contains the attribute value, class value and record ID.
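The entropy-based selection that ID3 performs at each node can be sketched as below. This is a minimal illustration under stated assumptions, not a full ID3 implementation: the helper names (`entropy`, `information_gain`, `best_feature`) and the tiny weather dataset are hypothetical, entropy is measured in bits, and all features are assumed to be categorical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Decrease in entropy obtained by splitting the rows on one feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """Feature with the greatest information gain; ID3 would split on it."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["yes", "yes", "no", "no"]

for f in ("outlook", "windy"):
    print(f, information_gain(rows, labels, f))
print("split on:", best_feature(rows, labels, ["outlook", "windy"]))
```

A full tree builder would apply `best_feature` recursively to each subset until the labels in a subset are all of one class.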

### 3.2.2 - Naive Bayesian Classifier

A Naive Bayesian classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions. A more descriptive name for the underlying probability model would be "independent feature model". Naive Bayesian classification is thus a statistical technique that uses estimates of probabilities to predict the class membership of sample data. In simple words, a naive Bayesian classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, if the data consists of fruits described by their colors and shapes, a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter. Even though these features depend on the existence of the other features, a naive Bayesian classifier considers all of these properties to contribute independently to the probability that this fruit is an apple. Depending on the specific nature of the probability model, naive Bayesian classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayesian models uses the method of maximum likelihood. In other words, one can work with the naive Bayesian model without believing in Bayesian probability or using any Bayesian methods. The classifier assumes that the effect of a variable's value on a given class is independent of the values of the other variables. This assumption is termed class conditional independence, and it is made to simplify the computation involved. It provides a way of computing the probability of a particular event given some set of observations. For a training pattern or sample X, the posterior probability P(H|X) of a class or hypothesis H conditioned on X can be expressed as

### Equation 3.1

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$

Where P(H) is the prior probability of hypothesis H, which is independent of X, and P(X) is the prior probability of the data sample X regardless of whether H is true or false. P(X|H) gives the probability of observing X if H is true (the likelihood), and P(H|X) gives the posterior probability of H being true given that the data sample X is observed.
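As a small worked illustration of Bayes' theorem combined with the class conditional independence assumption, the sketch below estimates P(H) and P(x_i|H) from counts and multiplies them. The function names `train` and `posterior`, the feature names and the fruit data are all hypothetical. No smoothing is applied, so the sketch assumes that at least one class gives the sample a nonzero score.

```python
from collections import Counter, defaultdict


def train(samples, labels):
    """Count class frequencies and per-class feature value frequencies."""
    priors = Counter(labels)
    counts = defaultdict(Counter)  # counts[(label, feature)][value]
    for x, h in zip(samples, labels):
        for feature, value in x.items():
            counts[(h, feature)][value] += 1
    return priors, counts


def posterior(x, priors, counts):
    """Return P(H|X) for every class H under the naive independence assumption."""
    n = sum(priors.values())
    scores = {}
    for h, n_h in priors.items():
        score = n_h / n  # P(H), estimated by maximum likelihood
        for feature, value in x.items():
            score *= counts[(h, feature)][value] / n_h  # P(x_i | H)
        scores[h] = score  # proportional to P(X|H) * P(H)
    evidence = sum(scores.values())  # P(X) under the model (assumed nonzero)
    return {h: s / evidence for h, s in scores.items()}


# Hypothetical fruit data described by color and shape.
samples = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "green", "shape": "round"},
    {"color": "yellow", "shape": "long"},
    {"color": "yellow", "shape": "long"},
    {"color": "green", "shape": "round"},
]
labels = ["apple", "apple", "apple", "other", "other", "other"]

priors, counts = train(samples, labels)
print(posterior({"color": "green", "shape": "round"}, priors, counts))
```

For the green, round sample the score for "apple" is 1/2 · 1/3 · 1 = 1/6 and for "other" it is 1/2 · 1/3 · 1/3 = 1/18, so the posterior probability of "apple" is 0.75.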

### 4 Clustering

### 4.1 - What is clustering?

Clustering is an unsupervised learning technique for grouping data. It is a data mining technique used to arrange data elements into related groups without advance knowledge of the group definitions. It is a mathematical technique designed for revealing classification structures in data collected from real world phenomena [17]. Clustering can also be defined as a data mining technique that separates the data into groups whose members belong together. Each object is assigned to the group to which it is most similar, so the process of grouping similar items together can also be called clustering. This is closely related to arranging animals and plants into families whose members are similar. Clustering does not need previous knowledge of the groups that are created or of the members that belong to those groups. It is therefore a process of partitioning a set of data or objects into a set of meaningful subclasses called clusters [17]. The items within a cluster should be very similar to each other but different from the items in other clusters, which means that the intra-cluster similarity should be very high and the inter-cluster similarity should be low. Clustering is a natural activity in every field. Humans do it from early childhood, distinguishing and categorizing different items such as cars, toys, animals and plants. Clustering has been a popular research area, and several methods and techniques have been developed to determine the natural grouping of objects.

### 4.1.1 - Cluster

The input data can contain various types of points or objects. A cluster is a set of data points or objects that are similar to each other and lie close together. If similar data points are not together, some technique is used to collect them into a cluster. The data points or objects within a cluster are very similar to each other but are usually quite different from the data points or objects of any other cluster. In other words, the similarity of data points within a cluster is high, while their similarity to the points of other clusters is very low. A cluster is a piece of data (usually a subset of the objects considered, a subset of the variables, or both) consisting of entities which are much “alike” in terms of the data, compared with the rest of the data [17]. For example, suppose the input data contains dots of three colors: red, green and blue. All red points are collected to make a red cluster, all green points are brought together to make a green cluster, and all blue points are collected to make a blue cluster, so all related dots are gathered into one cluster. Dots within the red cluster are similar to each other but different from those in the green and blue clusters, and so on. Different methodologies and techniques are used to create and identify clusters within a dataset. K-Means is the best known of these clustering methods.

### 4.1.2 - K-Means

K-Means (also known as “Moving Centers”) is the most common clustering algorithm and has been developed in many versions and programs in different countries [17]. It was introduced by MacQueen in 1967. K-Means groups data with similar characteristics or features together. These groups of data are called clusters. The data in each cluster have similar features or characteristics, which are dissimilar from those of the data in other clusters. The method is used extensively for clustering. Its distinguishing feature is its speed in practice. However, in the worst case its running time becomes exponential, which leaves a gap between practical and theoretical performance. It is a very popular algorithm for clustering, especially for high-dimensional data [25]. The K-Means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured in terms of the mean value of the objects within a cluster, which can be viewed as the cluster's center of gravity [12]. The algorithm works as follows. First it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster, and this process continues until the criterion function converges, that is, until the assignment of objects to clusters no longer changes [12]. The algorithm steps can be defined as follows.

1) It accepts as input the number of clusters to group the data into and the dataset to cluster.

2) It then creates the K initial clusters (K = number of clusters needed) by choosing K rows of data randomly from the dataset.

3) The arithmetic mean of each cluster is calculated. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. Each of the K initial clusters contains only one record, so its arithmetic mean is simply the set of values that make up that record.

4) Next, K-Means assigns each record in the dataset to exactly one of the initial clusters. Each record is assigned to the nearest cluster (the cluster to which it is most similar) using a measure of distance or similarity such as the Euclidean distance.

5) K-Means then re-calculates the arithmetic mean of every cluster in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records currently assigned to that cluster.

6) K-Means re-assigns each record in the dataset to exactly one of the new clusters. A record or data point is assigned to the nearest cluster, i.e. the cluster to which it is most similar under a measure of distance or similarity such as the Euclidean distance.

7) The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is completed. Stable clusters are formed once a new iteration of the algorithm no longer generates new clusters, because the center or arithmetic mean of each cluster is the same as the old cluster center. There are different methods for determining when stable clusters have been generated and the k-means clustering algorithm is complete.
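The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation. The function name `kmeans`, the `max_iter` and `seed` parameters and the sample points are assumptions made for the example, and the stopping rule used here (stop when no record changes cluster) is only one of the possible stability tests mentioned in step 7.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k groups.

    Returns (centers, assignment), where assignment[i] is the index of the
    cluster that points[i] belongs to.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 2: K random rows as initial centers
    assignment = None
    for _ in range(max_iter):
        # Steps 4 and 6: assign every record to its nearest center (Euclidean).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:  # step 7: clusters are stable
            break
        assignment = new_assignment
        # Steps 3 and 5: move each center to the arithmetic mean of its records.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```

Because the initial centers are chosen at random, different seeds can lead to different cluster numbering, and on less well separated data they can also lead to different final clusters.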

The K-Means algorithm can be explained by the following example, in which data points are drawn in 2 dimensions. Figure 4.1 shows the data points available in the XY plane.

In the first step, three points are taken randomly as cluster centers and the data points are associated with the nearest of those cluster centers using the Euclidean distance. The Euclidean distance between points p and q is the length of the straight line segment between p and q. In Cartesian coordinates, if p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the distance from p to q is written as below in Equation 4.1:

$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

Figure 4.2 shows three random cluster centers and the nearest data points associated with them. For convenience the clusters and their centers are differentiated by different colors.

In the next step each cluster center is moved to the mean of its assigned data points. It is important to note that every time a cluster center moves to the mean of its data points, the association of data points to cluster centers can change.

The cluster centers have been moved to the means of their data items, and because the distances to the cluster centers have changed, some data points are now associated with different cluster centers. The previous step is repeated and the cluster centers are again moved to the means of their data points. The process stops when no further movement of the cluster centers is possible.

Once the cluster centers have settled and the next iteration does not cause them to move, the k-means process is finished. From Figure 4.4 it is clear that one point at the top left previously belonged to cluster K1, but after the iteration K1 moved to a new position and, according to the Euclidean distance formula, that point now belongs to K2. The same applies to the other 2 points marked in Figure 4.4.

### Pros and Cons of K-Means

K-Means is a good algorithm, but it has both advantages and disadvantages.

* K-Means works well when the clusters in the input data are compact clouds that are well separated from one another.

* It is scalable and quite efficient for large scale datasets, with complexity O(nkt), where k is the number of clusters, n is the number of objects and t is the number of iterations required. Usually t << n and k << n.

* It can only be applied where the mean of a cluster is defined, which means it only works with numeric data. If the data is categorical, nominal or binary, this algorithm cannot be applied.

* Sometimes it is difficult to select a good value of k.

* It is a sensitive algorithm, because noise and outlier points can significantly influence the mean attribute value.

### 4.1.3 - K-Medoids

k-medoids is a clustering algorithm related to the k-means algorithm. Both algorithms are partitional, meaning that they break the dataset up into groups, and both try to minimize an error measure: the distance between the points labeled as belonging to a cluster and the point designated as the center of that cluster. k-medoids is a method in which the objective is to minimize the sum of distances to the nearest center and the geometric k-center [24]. In contrast to the k-means algorithm, k-medoids chooses actual data points as centers (medoids). It is a classical partitioning technique that clusters a data set of n objects into k clusters. It is more robust to noise and outliers than k-means. It is one of the most commonly studied clustering problems, in which the aim is to divide a given set of points into clusters so that the points within a cluster are relatively close with respect to some measure [23]. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster. The k-medoids algorithm has two main aspects that deserve attention. One is computing efficiency and the other is initialization [22]. Because of these features it is commonly used and studied in clustering. The k-medoids algorithm has the following steps.

* Randomly choose k of the n data objects as the initial medoids.

* Assign each of the remaining (non-medoid) objects to the cluster represented by the nearest, most similar medoid.

* Randomly choose a non-medoid object Orandom.

* For each current medoid Oj, compute the total cost S of swapping Oj with Orandom. This includes the cost contributions of reassigning non-medoid objects caused by the swap.

* If S < 0 then swap Oj with Orandom to create a new set of k medoids.

* Repeat the steps until no further changes occur.
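The swap-based procedure above can be sketched as follows. To keep the example deterministic and to give it a clear stopping point, this sketch tries every non-medoid object as a swap candidate in each pass instead of drawing Orandom at random. That is a deliberate variation on the listed steps, not the procedure exactly as stated. The names `total_cost` and `k_medoids`, the `seed` parameter and the sample data are hypothetical.

```python
import math
import random


def total_cost(points, medoids):
    """Sum of Euclidean distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def k_medoids(points, k, seed=0):
    """Return (medoids, labels) after swapping until no swap lowers the cost."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)  # random initial medoids
    improved = True
    while improved:
        improved = False
        for j in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Cost S of replacing medoid j with the candidate object.
                trial = medoids[:j] + [candidate] + medoids[j + 1:]
                s = total_cost(points, trial) - total_cost(points, medoids)
                if s < 0:  # the swap reduces the total distance, so keep it
                    medoids = trial
                    improved = True
    labels = [min(range(k), key=lambda c: math.dist(p, medoids[c])) for p in points]
    return medoids, labels


# Hypothetical data: two compact groups plus one outlier at (25, 25).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0), (25.0, 25.0)]
medoids, labels = k_medoids(points, k=2)
print(medoids)
print(labels)
```

Every medoid returned is one of the input points, so the outlier cannot drag a cluster center into empty space the way it would shift a mean in k-means.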

### 4.2 - Subspace Clustering

Subspace clustering is a newer form of clustering which can find different clusters in subspaces within a dataset. In high dimensional data, many dimensions are often redundant and add noise that hides the existing clusters. “Features selection eliminates the redundant and unrelated dimensions by analyzing the whole dataset. Subspace clustering algorithms searches for relevant dimensions allowing them to find clusters those exist in multiple overlapping dimensions. This is a particularly important challenge with high dimensional data where the curse of dimensionality occurs [4]”.

### 4.2.1 - What is subspace clustering?

Subspace clustering automatically identifies clusters present in subspaces of a high dimensional data space, where the data points can be clustered better than in the original space. Cluster analysis exposes groups or clusters of similar objects. “Objects are normally shown as point in multidimensional space. Similarity between objects is often determined by distance measures over the various dimensions in dataset. Changes to existing algorithms are essential to keep up the cluster quality and speed since datasets have become larger and more varied [4]”.

Conventional clustering algorithms give importance to all dimensions when learning about each object. In high dimensional data, many dimensions are often unimportant and can be considered redundant. These irrelevant and redundant dimensions can confuse clustering algorithms by hiding clusters in noisy data. In very high dimensions it is common for all of the objects in a dataset to be nearly equidistant from each other, completely masking the clusters. Feature selection methods have been used with some success to improve cluster quality. These algorithms find a subset of dimensions on which to perform clustering by removing irrelevant and redundant dimensions. Unlike feature selection methods, which examine the dataset as a whole, subspace clustering algorithms localize their search and are able to uncover clusters that exist in multiple, possibly overlapping subspaces [5].

“Another thing with which clustering algorithms fight is the curse of dimensionality [6]”. As the number of dimensions in a dataset increases, distance measures become increasingly meaningless. Additional dimensions spread out the points until, in very high dimensions, they are almost equidistant from each other. Figure 4.1 illustrates how additional dimensions spread out the points in a sample dataset. The dataset contains 20 points arbitrarily placed between 0 and 2 in each of three dimensions. Figure 4.1(a) shows the data projected onto one axis. The points are close together, with about half of them in a one unit sized bin [6]. Figure 4.1(b) shows the same data extended into the second dimension. By adding another dimension, the points are spread out along another axis, pulling them further apart. Now only about a quarter of the points fall into a unit sized bin. In Figure 4.1(c) a third dimension is added, which spreads the data further apart. A one unit sized bin now holds only about one eighth of the points. If we continue to add dimensions, the points will continue to spread out until they are all almost equally far apart and distance is no longer very meaningful. The problem is made worse when objects are related in different ways in different subsets of dimensions [6]. It is this type of relationship that subspace clustering algorithms seek to uncover. In order to find such clusters, the irrelevant features must be removed to allow the clustering algorithm to focus on only the relevant dimensions. Clusters found in lower dimensional space also tend to be more easily interpretable, allowing the user to better direct further study [6].
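The spreading effect described above can be reproduced with a short simulation. The sketch below is illustrative only. It assumes the points are drawn uniformly between 0 and 2 in every dimension, and it uses 200 points rather than 20 so that the fractions are less noisy. The helper names and the "relative contrast" measure, (max distance - min distance) / min distance, are choices made for this example and are not taken from the cited work.

```python
import math
import random


def sample_points(n, dims, rng):
    """n points placed uniformly at random in [0, 2) along every dimension."""
    return [tuple(rng.uniform(0.0, 2.0) for _ in range(dims)) for _ in range(n)]


def fraction_in_unit_bin(points):
    """Share of points whose every coordinate falls in the bin [0, 1)."""
    inside = sum(all(x < 1.0 for x in p) for p in points)
    return inside / len(points)


def relative_contrast(points):
    """(max - min) / min over all pairwise distances.

    A value near zero means the points are almost equidistant.
    """
    dists = [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)
for dims in (1, 2, 3, 10, 100):
    pts = sample_points(200, dims, rng)
    print(dims, fraction_in_unit_bin(pts), round(relative_contrast(pts), 2))
```

Under the uniform assumption the expected fraction of points in the unit bin is 1/2 per dimension, giving 1/2, 1/4 and 1/8 for one, two and three dimensions, which matches the figures quoted in the text. The relative contrast drops sharply as dimensions are added.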

Subspace clustering is also more general than feature selection in that each subspace is local to its cluster, instead of global to the whole dataset. It also helps to produce smaller descriptions of the clusters found, since clusters are defined on fewer dimensions than the original number of dimensions. An example of subspace clustering can be found in bioinformatics with DNA microarray data. One population of cells in a microarray experiment may be similar to another because they both produce chlorophyll, and thus be clustered together based on the expression levels of a certain set of genes related to chlorophyll. However, another population might be similar because the cells are regulated by the circadian clock mechanisms of the organism. In this case, they would be clustered on a different set of genes. These two relationships represent clusters in two distinct subsets of genes. Such datasets present new challenges and goals for unsupervised learning. Subspace clustering algorithms are one answer to those challenges. They excel in situations like those described above, where objects are related in multiple, different ways.

### 4.2.2 - Why subspace clustering?

Clustering is a powerful data exploration methodology which is able to identify previously unknown patterns in data. Subspace clustering is an extension of conventional clustering, based on the observation that different clusters (groups of data points) may exist in different subspaces within a given dataset. This point is particularly important for high dimensional data, where the curse of dimensionality can occur and reduce the worth of the results. Subspace clustering is important for the following reasons.

* Most of the clustering algorithms have been designed to identify clusters in the full dimensional space so they are not effective in identifying clusters that exist in the subspaces of the original data space [2].

* Many times the data records contain some missing objects. Such missing objects are normally replaced with objects taken from given distribution [2].

* The clustering results formed by most of the clustering algorithms rely a lot on the order in which input records are processed [2].

### 5 Dimensions Reduction

Feature transformation and feature selection methodologies are integrated into methodologies for clustering high dimensional data. Feature transformation methodologies try to summarize a dataset in fewer dimensions by creating combinations of the original attributes. These methodologies are very successful in exposing hidden structure in datasets [6]. However, since they preserve the relative distances between objects, they are less effective when there are large numbers of unrelated attributes that hide the clusters in a sea of noise. Also, the new features are combinations of the originals and may be very difficult to interpret in the context of the domain. Feature selection methods choose only the most relevant dimensions from a dataset to expose groups of objects that are similar on only a subset of their attributes [3]. While quite successful on many datasets, feature selection algorithms have difficulty when clusters are found in different subspaces. It is this type of data that motivated the evolution to subspace clustering algorithms. These algorithms take the concepts of feature selection one step further by selecting relevant subspaces for each cluster separately.

Feature transformation is commonly used for high dimensional datasets. These methods include techniques such as principal component analysis and singular value decomposition. The transformations generally preserve the original, relative distances between objects. In this way, they summarize the dataset by creating linear combinations of the attributes and, hopefully, expose hidden structure. Feature transformation is often a preprocessing step, allowing the clustering algorithm to use just a few of the newly created features. A few clustering methods have included the use of such transformations to identify important features and iteratively improve their clustering [4]. While often very useful, these techniques do not actually remove any of the original attributes from consideration. Thus, information from unrelated dimensions is preserved, making these techniques ineffective at exposing clusters when there are large numbers of unrelated attributes that mask the clusters. Another disadvantage of using combinations of attributes is that they are difficult to interpret, often making the clustering results less useful. Because of this, feature transformations are best suited to datasets where most of the dimensions are relevant to the clustering task, but many are highly correlated or redundant [6].

Feature selection attempts to find the attributes of a dataset that are most relevant to the data mining task at hand. It is a commonly used and powerful methodology for reducing the dimensionality of a problem to more manageable levels. Feature selection involves searching through various feature subsets and evaluating each of these subsets using some criterion [5]. The most popular search strategies are greedy sequential searches through the feature space, either forward or backward. The evaluation criteria follow one of two basic models, the wrapper model and the filter model. Wrapper model techniques evaluate the dataset using the data mining algorithm that will ultimately be employed. Thus, they “wrap” the selection process around the data mining algorithm. Algorithms based on the filter model examine inherent properties of the data to evaluate the feature subset prior to the data mining [4].

Much of the work in feature selection has been directed at supervised learning. The main difference between feature selection in supervised and unsupervised learning is the evaluation criterion. Classification accuracy is used as a measure of goodness in supervised wrapper models. The filter based approaches almost always rely on the class labels, most commonly assessing correlations between features and the class labels [3]. In the unsupervised clustering problem, there are no universally accepted measures of accuracy and no class labels. However, there are a number of methods that adapt feature selection to clustering.

### 5.1 - Principal Component Analysis

Principal component analysis (PCA) is a standard tool in modern data analysis, used in diverse fields from neuroscience to computer graphics, because it is a simple, non-parametric method for extracting relevant information from confusing data sets. With minimal effort, PCA provides a roadmap for how to reduce a complex data set to a lower dimension to expose the sometimes hidden, simplified structures that often underlie it [20]. It involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Principal component analysis is a useful statistical methodology that has found application in different fields, especially in artificial intelligence tasks such as face recognition and image compression, and it is a common methodology for finding patterns in data of high dimension [21].

Principal component analysis was invented in 1901 by Karl Pearson. It is now most commonly used as a tool in exploratory data analysis and for making predictive models. Principal component analysis is appropriate when measures have been obtained on a number of observed variables and a smaller number of artificial variables (called principal components) is needed that will account for most of the variance in the observed variables. The principal components may then be used as predictor or criterion variables in subsequent analyses.

### 5.1.1 - Variable Reduction

Principal component analysis is a variable reduction procedure. It is helpful when data has been obtained on a number of variables, usually a large number, and it is known that there is some redundancy in those variables. Variables are also referred to as attributes or dimensions. In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. Because of this redundancy, it should be possible to reduce the observed variables to a smaller number of principal components (artificial variables) that will account for most of the variance in the observed variables. Because it is a variable reduction procedure, principal component analysis is similar in many respects to exploratory factor analysis. In fact, the steps followed when conducting a principal component analysis are virtually identical to those followed when conducting an exploratory factor analysis. However, there are major theoretical differences between the two procedures, and it is important not to claim mistakenly that factor analysis is being performed when the procedure is actually principal component analysis.

### 5.1.2 - Principal Components

This section describes how principal components are computed. Technically, a principal component can be defined as a linear combination of optimally weighted observed variables. In order to understand the meaning of this definition, it is necessary to first describe how subject scores on a principal component are computed. In the course of performing a principal component analysis, it is possible to calculate a score for each subject on a given principal component. For example, in the preceding study, each subject would have scores on two components: one score on the satisfaction with supervision component, and one score on the satisfaction with pay component. The subject's actual scores on the seven questionnaire items would be optimally weighted and then summed to compute their scores on a given component.
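To make the "optimally weighted and then summed" description concrete, the sketch below computes component weights and subject scores for a tiny, made-up dataset of three variables. It is a sketch under stated assumptions, not the procedure of any particular statistics package. The weights are found by power iteration on the covariance matrix, with deflation to obtain the second component, a simple technique swapped in here so that the example needs no libraries. Practical implementations usually rely on an eigendecomposition or singular value decomposition routine instead. All function names and the data are hypothetical.

```python
import math


def center(data):
    """Subtract each variable's mean so every column has mean zero."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(matrix, v):
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


def leading_eigenvector(matrix, iterations=500):
    """Power iteration; assumes the start vector is not orthogonal to the
    leading direction and that the matrix is not all zeros."""
    v = [1.0 / 2 ** i for i in range(len(matrix))]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def principal_components(data, count):
    """Return (weights, scores): one unit weight vector per component and,
    for every subject, the weighted sum of its centered variables."""
    centered = center(data)
    cov = covariance(centered)
    d = len(cov)
    weights = []
    for _ in range(count):
        v = leading_eigenvector(cov)
        variance = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        weights.append(v)
        # Deflation: remove the variance already accounted for by v.
        cov = [[cov[i][j] - variance * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(w_i * x for w_i, x in zip(w, row)) for w in weights]
              for row in centered]
    return weights, scores


# Hypothetical scores of six subjects on three observed variables.
data = [
    [2.5, 2.4, 1.0], [0.5, 0.7, 3.0], [2.2, 2.9, 0.5],
    [1.9, 2.2, 2.5], [3.1, 3.0, 1.5], [2.3, 2.7, 2.0],
]
weights, scores = principal_components(data, count=2)
print(weights)
print(scores)
```

Each row of `scores` holds one subject's score on component 1 and component 2. The two score columns come out uncorrelated up to rounding error, and the first column has the larger variance.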

### 5.1.3 - Characteristics of Principal Components

The first component extracted in a principal component analysis accounts for a maximal amount of total variance in the observed variables. Under typical conditions, this means that the first component will be correlated with at least some of the observed variables. It may be correlated with many. The second component extracted will have two important characteristics. First, this component will account for a maximal amount of variance in the data set that was not accounted for by the first component. Again under typical conditions, this means that the second component will be correlated with some of the observed variables that did not display strong correlations with component 1. The second characteristic of the second component is that it will be uncorrelated with the first component. If the correlation between components 1 and 2 is computed, the result will be zero. The remaining components extracted in the analysis display the same two characteristics. One is each compon
