### Chapter-2

Real-life data rarely meet the requirements of data mining tools. They are usually inconsistent and noisy, and may contain redundant attributes, unsuitable formats and so on. Hence data must be prepared carefully before data mining actually starts. It is well known that the success of a data mining algorithm depends heavily on the quality of data pre-processing, which makes pre-processing one of the most important tasks in data mining. It is also a complicated task, because it usually involves large data sets. Data pre-processing sometimes takes more than 50% of the total time spent in solving a data mining problem. It is therefore crucial for data miners to choose an efficient pre-processing technique for the specific data set, one that not only saves processing time but also retains the quality of the data for the mining process.

A data pre-processing tool should help miners with many data mining activities. For example, data may be provided in different formats, as discussed in the previous chapter (flat files, database files etc.). Pre-processing may also have to deal with different formats of values, calculation of derived attributes, data filters, joined data sets etc. The data mining process generally starts with understanding the data. In this stage pre-processing tools may help with data exploration and data discovery tasks. Data pre-processing involves a lot of tedious work.

### Data pre-processing generally consists of

* Data Cleaning

* Data Integration

* Data Transformation

* Data Reduction

In this chapter we will study all these data pre-processing activities.

### 2.1 Data Understanding

In the data understanding phase the first task is to collect the initial data. The miner then proceeds with activities to become familiar with the data, to discover data quality problems, to gain first insights into the data, or to identify interesting subsets that suggest hypotheses about hidden information. The following shows the data understanding phase according to the CRISP model.

### 2.1.1 Collect Initial Data

The initial collection of data includes loading the data, if this is required for data understanding. For instance, if a specific tool is used for data understanding, it makes sense to load the data into this tool. This may already lead to initial data preparation steps. If the data are obtained from multiple data sources, integration is an additional issue.

### 2.1.2 Describe data

Here the gross or surface properties of the gathered data are examined.

### 2.1.3 Explore data

This task addresses the data mining questions that can be answered using querying, visualization and reporting. These include:

* Distribution of key attributes, for instance the goal (target) attribute of a prediction task

* Relations between pairs or small numbers of attributes

* Results of simple aggregations

* Properties of important sub-populations

* Simple statistical analyses.

### 2.1.4 Verify data quality

In this step quality of data is examined. It answers questions such as:

* Is the data complete (does it cover all the cases required)?

* Is it accurate, or does it contain errors, and if there are errors how common are they?

* Are there missing values in the data?

* If so how are they represented, where do they occur and how common are they?
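The questions about missing values can be answered with a simple count per attribute. The following is a minimal sketch, not a complete quality check: the records, the attribute names and the set of markers treated as "missing" (`None`, the empty string and `"N/A"`) are all assumptions made for illustration.

```python
# Markers assumed to represent a missing value in this sketch.
MISSING_MARKERS = {None, "", "N/A"}

def missing_report(rows, attributes):
    """Return {attribute: (missing count, missing fraction)} for the given rows."""
    report = {}
    for attr in attributes:
        missing = sum(1 for row in rows if row.get(attr) in MISSING_MARKERS)
        report[attr] = (missing, missing / len(rows))
    return report

# Hypothetical records used only to exercise the function.
rows = [
    {"name": "John Smith", "income": 42000, "city": "Leeds"},
    {"name": "Mary Jones", "income": None, "city": "N/A"},
    {"name": "", "income": 38000, "city": "York"},
    {"name": "Ann Lee", "income": None, "city": "Leeds"},
]

report = missing_report(rows, ["name", "income", "city"])
for attr, (count, fraction) in report.items():
    print(f"{attr}: {count} missing ({fraction:.0%})")
```

Such a report shows where missing values occur and how common they are. How they are represented still has to be established by inspecting the source, since the marker set above is only a guess.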

### 2.2 Data Preprocessing

The data preprocessing phase focuses on the steps that produce the data to be mined. Data preparation, or preprocessing, is one of the most important steps in data mining. Industrial practice indicates that once the data are well prepared, the mined results are much more accurate. This step is therefore critical for the success of a data mining method. Among other things, data preparation mainly involves data cleaning, data integration, data transformation, and data reduction.

### 2.2.1 Data Cleaning

Data cleaning is also known as data cleansing or scrubbing. It deals with detecting and removing inconsistencies and errors from data in order to obtain better quality data. When a single data source such as a flat file or a database is used, data quality problems arise from misspellings during data entry, missing information or other invalid data. When the data come from the integration of multiple data sources, such as data warehouses, federated database systems or global web-based information systems, the need for data cleaning increases significantly. This is because the multiple sources may contain redundant data in different formats. Consolidation of the different data formats and elimination of redundant information become necessary in order to provide access to accurate and consistent data. Good quality data must satisfy a set of quality criteria. These criteria include:

* Accuracy: Accuracy is an aggregated value over the criteria of integrity, consistency and density.

* Integrity: Integrity is an aggregated value over the criteria of completeness and validity.

* Completeness: completeness is achieved by correcting data containing anomalies.

* Validity: Validity is approximated by the amount of data satisfying integrity constraints.

* Consistency: consistency concerns contradictions and syntactical anomalies in data.

* Uniformity: it is directly related to irregularities in data.

* Density: density is the quotient of the number of missing values in the data and the total number of values that ought to be known.

* Uniqueness: uniqueness is related to the number of duplicates present in the data.

### 2.2.1.1 Terms Related to Data Cleaning

Data cleaning: data cleaning is the process of detecting, diagnosing, and editing damaged data.

Data editing: data editing means changing the values of data that are incorrect.

Data flow: data flow is defined as the passing of recorded information through succeeding information carriers.

Inliers: inliers are data values falling inside the projected range.

Outliers: outliers are data values falling outside the projected range.

Robust estimation: robust estimation is the evaluation of statistical parameters using methods that are less sensitive to the effect of outliers than more conventional methods.

### 2.2.1.2 Definition: Data Cleaning

Data cleaning is a process used to identify imprecise, incomplete, or unreasonable data and then to improve the quality through correction of the detected errors and omissions. This process may include:

* Format checks

* Completeness checks

* Reasonableness checks

* Limit checks

* Review of the data to identify outliers or other errors

* Assessment of data by subject area experts (e.g. taxonomic specialists).

Through this process suspect records are flagged, documented and subsequently checked, and finally they can be corrected. Validation checks sometimes also involve checking for compliance with applicable standards, rules, and conventions.

The general framework for data cleaning is given as:

* Define and determine error types;

* Search and identify error instances;

* Correct the errors;

* Document error instances and error types; and

* Modify data entry procedures to reduce future errors.

Different people refer to the data cleaning process by a number of terms, and which one is used is a matter of preference. These terms include: Error Checking, Error Detection, Data Validation, Data Cleaning, Data Cleansing, Data Scrubbing and Error Correction.

We use Data Cleaning to encompass three sub-processes, viz.

* Data checking and error detection;

* Data validation; and

* Error correction.

A fourth - improvement of the error prevention processes - could perhaps be added.

### 2.2.1.3 Problems with Data

Here we note some key problems with data.

Missing data: This problem occurs for two main reasons:

* Data are absent from the source where they are expected to be present.

* Sometimes data are present but not available in an appropriate form.

Detecting missing data is usually straightforward.

Erroneous data: This problem occurs when a wrong value is recorded for a real-world value (for instance, the incorrect spelling of a name). Detection of erroneous data can be quite difficult.

Duplicated data: This problem occurs for two reasons:

* Repeated entry of the same real-world entity with some different values.

* Sometimes a real-world entity may have different identifications.

Repeated records are common and frequently easy to detect. Different identifications of the same real-world entity can be very hard to identify and resolve.

Heterogeneities: When data from different sources are brought together in one analysis problem, heterogeneity may occur. Heterogeneity can be:

* Structural heterogeneity, which arises when the data structures reflect different business usage.

* Semantic heterogeneity, which arises when the meaning of the data is different in each system that is being combined.

Heterogeneities are usually very difficult to resolve because they usually involve a lot of contextual data that is not well defined as metadata.

Information dependencies between different sets of attributes are commonly present. Wrong cleaning mechanisms can further damage the information in the data. Various analysis tools handle these problems in different ways. Commercial offerings are available that assist the cleaning process, but these are often problem specific. Uncertainty in information systems is a well-recognized hard problem. The following shows a very simple example of missing and erroneous data.

Data warehouses must provide extensive support for data cleaning. They have a high probability of containing "dirty data", since they load and continuously refresh huge amounts of data from a variety of sources. Because data warehouses are used for strategic decision making, the correctness of their data is important to avoid wrong decisions. The following illustrates the ETL (Extraction, Transformation, and Loading) process for building a data warehouse.

Data transformations deal with schema or data translation and integration, and with filtering and aggregating the data to be stored in the data warehouse. All data cleaning is typically performed in a separate data staging area before the transformed data are loaded into the warehouse. A large number of tools of varying functionality are available to support these tasks, but often a significant portion of the cleaning and transformation work has to be done manually or by low-level programs that are difficult to write and maintain.

A data cleaning method should assure following:

1. It should identify and eliminate all major errors and inconsistencies, both in individual data sources and when integrating multiple sources.

2. It should be supported by tools that limit manual examination and programming effort, and it should be extensible so that it can cover additional sources.

3. It should be performed in association with schema related data transformations based on metadata.

4. Data cleaning mapping functions should be specified in a declarative way and be reusable for other data sources.

### 2.2.1.4 Data Cleaning: Phases

1. Analysis: Identifying errors and inconsistencies in the database requires a detailed analysis, which involves both manual inspection and automated analysis programs. This reveals where (most of) the problems are.

2. Defining Transformation and Mapping Rules: After the problems have been discovered, this phase defines how the solutions for cleaning the data will be automated. The analysis phase yields a number of problems, which translate into a list of activities.

Example:

- Remove all entries for J. Smith because they are duplicates of John Smith

- Find entries with `bule' in colour field and change these to `blue'.

- Find all records where the Phone number field does not match the pattern (NNNNN NNNNNN). Further steps for cleaning this data are then applied.

- Etc …

3. Verification: In this phase we check and assess the transformation plans made in phase 2. Without this step, we may end up making the data dirtier rather than cleaner. Since transformation is the step that actually changes the data, we need to be sure that the applied transformations will do so correctly. Therefore the transformation plans must be tested and examined very carefully.

Example:

- Suppose we have a very thick C++ book in which the word strict appears in all the places where it should say struct. A blind replacement of every occurrence of strict would also corrupt words such as restrict, so the rule has to be checked before it is applied.

4. Transformation: Once it is certain that the cleaning will be done correctly, apply the transformations verified in the previous step. For large databases, this task is supported by a variety of tools.

5. Backflow of Cleaned Data: In data mining the main objective is to convert and move clean data into the target system. This creates a requirement to cleanse legacy data. Cleansing can be a complicated process depending on the technique chosen, and it has to be designed carefully to achieve the removal of dirty data. Some methods to accomplish data cleansing of a legacy system include:

* Automated data cleansing

* Manual data cleansing

* The combined cleansing process
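The transformation rules of phase 2 and the verification concern of phase 3 can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the field names `colour` and `phone`, and the regular expression chosen for the pattern NNNNN NNNNNN are hypothetical.

```python
import re

# Assumed encoding of the phone pattern NNNNN NNNNNN (5 digits, space, 6 digits).
PHONE_PATTERN = re.compile(r"\d{5} \d{6}")

def fix_colour(record):
    """Rule: replace the misspelling 'bule' with 'blue' in the colour field."""
    if record["colour"] == "bule":
        return {**record, "colour": "blue"}
    return record

def has_valid_phone(record):
    """Rule: the phone field must match the pattern NNNNN NNNNNN exactly."""
    return PHONE_PATTERN.fullmatch(record["phone"]) is not None

# Hypothetical records.
records = [
    {"name": "John Smith", "colour": "bule", "phone": "01234 567890"},
    {"name": "Ann Lee", "colour": "red", "phone": "1234-5678"},
]
cleaned = [fix_colour(r) for r in records]
flagged = [r["name"] for r in cleaned if not has_valid_phone(r)]

# Verification: test a rule on sample text before applying it everywhere.
text = "strict point { int x; }; // do not restrict access"
naive = text.replace("strict", "struct")            # also damages 'restrict'
whole_word = re.sub(r"\bstrict\b", "struct", text)  # only touches the whole word

print(cleaned[0]["colour"], flagged)
print(naive)
print(whole_word)
```

The whole-word rule avoids the damage shown by the naive rule, but it would still change any legitimate use of the word strict, which is why the verification phase cannot be skipped.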

### 2.2.1.5 Missing Values

Data cleaning addresses a variety of data quality problems, including noise and outliers, inconsistent data, duplicate data, and missing values. Missing values are one important problem to be addressed. The problem occurs because many tuples may have no recorded value for several attributes. For example, consider a customer sales database consisting of a large number of records (say around 100,000) in which some records have certain fields missing; customer income, for instance, may be missing. The goal is to find a way to predict, based on the existing data, what the missing values should be so that they can be filled in. Data may be missing for the following reasons:

* Equipment malfunction

* Data inconsistent with other recorded data and thus deleted

* Data not entered due to misunderstanding

* Certain data not considered important at the time of entry

* History or changes of the data not registered

### How to Handle Missing Values?

Dealing with missing values is a recurring question, and the answer depends on the actual meaning of the data. There are various methods for handling missing entries:

1. Ignore the data row. One solution is simply to ignore the entire data row. This is generally done when the class label is missing (assuming that the data mining goal is classification), or when many attributes are missing from the row (not just one). If the percentage of such rows is high, however, performance will be poor.

2. Use a global constant to fill in for missing values. We can fill in a global constant such as "unknown", "N/A" or minus infinity. This is done because at times it just does not make sense to try to predict the missing value. For example, if the office address is missing for some customers in the customer sales database, filling it in does not make much sense. This method is simple but not foolproof.

3. Use the attribute mean. If the average income of a family is X, that value can be used to replace missing income values in the customer sales database.

4. Use the attribute mean for all samples belonging to the same class. Suppose a car pricing database classifies cars, among other things, as "Luxury" or "Low budget", and some values are missing in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value obtained if the low budget cars are factored in.

5. Use a data mining algorithm to predict the value. The value can be determined using regression, inference-based tools using a Bayesian formalism, decision trees, clustering algorithms etc.
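Methods 2 to 4 can be illustrated with a short sketch. The car records, the attribute names `segment` and `cost`, and the convention that `None` marks a missing value are assumptions made for this example only.

```python
from statistics import mean

def fill_with_constant(rows, attr, constant="unknown"):
    """Method 2: replace missing values of attr with a global constant."""
    return [{**r, attr: constant} if r[attr] is None else r for r in rows]

def fill_with_mean(rows, attr):
    """Method 3: replace missing values of attr with the mean of the known values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall} if r[attr] is None else r for r in rows]

def fill_with_class_mean(rows, attr, class_attr):
    """Method 4: replace missing values with the mean of the row's own class.

    Assumes every class has at least one known value of attr.
    """
    groups = {}
    for r in rows:
        if r[attr] is not None:
            groups.setdefault(r[class_attr], []).append(r[attr])
    class_means = {label: mean(values) for label, values in groups.items()}
    return [{**r, attr: class_means[r[class_attr]]} if r[attr] is None else r
            for r in rows]

# Hypothetical car pricing records; model C has no recorded cost.
cars = [
    {"model": "A", "segment": "Luxury", "cost": 80000},
    {"model": "B", "segment": "Luxury", "cost": 100000},
    {"model": "C", "segment": "Luxury", "cost": None},
    {"model": "D", "segment": "Low budget", "cost": 10000},
    {"model": "E", "segment": "Low budget", "cost": 14000},
]

by_constant = fill_with_constant(cars, "cost")
by_mean = fill_with_mean(cars, "cost")
by_class_mean = fill_with_class_mean(cars, "cost", "segment")
print(by_constant[2]["cost"], by_mean[2]["cost"], by_class_mean[2]["cost"])
```

With these made-up numbers the overall mean is pulled down by the low budget cars, whereas the class mean stays within the luxury price range, which is the point made in method 4.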

### 2.2.1.6 Noisy Data

Noise can be defined as a random error or variance in a measured variable. Due to randomness it is very difficult to follow a strategy for noise removal from the data. Real world data is not always faultless. It can suffer from corruption which may impact the interpretations of the data, models created from the data, and decisions made based on the data. Incorrect attribute values could be present because of following reasons

* Faulty data collection instruments

* Data entry problems

* Duplicate records

* Incomplete data

* Inconsistent data

* Incorrect processing

* Data transmission problems

* Technology limitation

* Inconsistency in naming convention

* Outliers

### How to handle Noisy Data?

The methods for removing noise from data are as follows.

1. Binning: this approach first sorts the data and partitions them into (equal-frequency) bins. The data can then be smoothed by bin means, by bin medians, by bin boundaries, etc.

2. Regression: in this method smoothing is done by fitting the data to regression functions.

3. Clustering: clustering is used to detect and remove outliers from the data.

4. Combined computer and human inspection: in this approach the computer detects suspicious values, which are then checked by human experts (e.g., to deal with possible outliers).

These methods are explained in detail below.

Binning: Binning is a data preparation activity that converts continuous data to discrete data by replacing a value from a continuous range with a bin identifier, where each bin represents a range of values. For instance, age can be changed to bins such as 20 or under, 21-40, 41-65 and over 65. Binning methods smooth a sorted data value by consulting the values around it. This is therefore called local smoothing. The binning methods and a worked example follow.

Binning Methods

* Equal-width (distance) partitioning

  * Divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B-A)/N.

  * It is the most straightforward method, but outliers may dominate the presentation.

  * Skewed data are not handled well.

* Equal-depth (frequency) partitioning

  * Divides the range (values of a given attribute) into N intervals, each containing approximately the same number of samples (elements).

  * Gives good data scaling.

  * Managing categorical attributes can be tricky.

* Smooth by bin means: each value in a bin is replaced by the mean of the bin's values.

* Smooth by bin medians: each value in a bin is replaced by the median of the bin's values.

* Smooth by bin boundaries: each value in a bin is replaced by the closest bin boundary value.
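Equal-width partitioning can be sketched as follows, using the price values of the worked example in this section. The function name and the choice to put the maximum value into the last bin are assumptions of this sketch, which also assumes that the highest value is larger than the lowest.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes high > low
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would fall into bin number N, so clamp it
        # into the last bin (index N - 1).
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(prices, 3)
print(width)
for number, contents in enumerate(bins, start=1):
    print(f"Bin {number}: {contents}")
```

Here the three intervals of width 10 receive 3, 3 and 6 values, so the bins are unevenly filled, in contrast to equal-depth partitioning.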

Example

Let the sorted data for price (in dollars) be: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:

  * Bin 1: 4, 8, 9, 15

  * Bin 2: 21, 21, 24, 25

  * Bin 3: 26, 28, 29, 34

* Smoothing by bin means (rounded to the nearest integer):

  * Bin 1: 9, 9, 9, 9 (the mean of 4, 8, 9, 15 is 9)

  * Bin 2: 23, 23, 23, 23 (the mean is 22.75)

  * Bin 3: 29, 29, 29, 29 (the mean is 29.25)

* Smoothing by bin boundaries:

  * Bin 1: 4, 4, 4, 15

  * Bin 2: 21, 21, 25, 25

  * Bin 3: 26, 26, 26, 34
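The example above can be reproduced with a short sketch. The function names are hypothetical, the bin means are rounded to the nearest integer as in the example, and a value that is equally close to both boundaries is assigned to the lower one (an assumption, since the text does not say how ties are resolved).

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size.

    Assumes the number of values is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of the bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closest bin boundary."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Ties go to the lower boundary in this sketch.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```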

Regression: Regression is a data mining technique used to fit an equation to a data set. The simplest form of regression is linear regression, which uses the formula of a straight line (y = b + wx) and determines suitable values for b and w to predict the value of y from a given value of x. More sophisticated techniques, such as multiple regression, permit the use of more than one input variable and allow the fitting of more complex models, such as a quadratic equation. Regression is described further in a subsequent chapter on prediction.
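As a minimal sketch of smoothing by linear regression, the following fits y = b + wx by ordinary least squares and replaces each noisy value by the value on the fitted line. Least squares is an assumption here, since the text does not name the fitting criterion, and the data points are made up.

```python
def fit_line(xs, ys):
    """Return (b, w) of the least-squares line y = b + w * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # assumes the xs are not all equal
    w = sxy / sxx
    b = mean_y - w * mean_x
    return b, w

# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

b, w = fit_line(xs, ys)
smoothed = [b + w * x for x in xs]
print(f"b = {b:.2f}, w = {w:.2f}")
print([round(y, 2) for y in smoothed])
```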

Clustering: Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. Clustering constitutes a major class of data mining algorithms. These algorithms automatically partition the data space into a set of regions or clusters. The goal of the process is to find all sets of similar examples in the data, in some optimal fashion. The following shows three clusters. Values that fall outside the clusters are outliers.

Combined computer and human inspection: These methods find suspicious values using computer programs, which are then verified by human experts. In this way all outliers are checked.

### 2.2.1.7 Data cleaning as a process

Data cleaning is the process of detecting, diagnosing, and editing data. It is a three-stage method involving repeated cycles of screening, diagnosing, and editing of suspected data abnormalities. Many data errors are detected incidentally during study activities. However, it is more efficient to discover inconsistencies by actively searching for them in a planned manner. It is not always immediately clear whether a data point is erroneous, and many times careful examination is required. Likewise, missing values require additional checks. Therefore, predefined rules for dealing with errors and with true missing and extreme values are part of good practice. One can screen for suspect features in survey questionnaires, databases, or analysis data sets. In small studies, with the examiner intimately involved at all stages, there may be little or no difference between a database and an analysis data set.

The diagnostic and treatment phases of cleaning need insight into the sources and types of errors at all stages of the study, during as well as after measurement. The concept of data flow is therefore crucial in this respect. After measurement, the research data go through repeated steps of being entered into information carriers, extracted, transferred to other carriers, edited, selected, transformed, summarized, and presented. It is essential to understand that errors can occur at any stage of the data flow, including during data cleaning itself. Most of these problems are due to human error.

Inaccuracy of a single data point or measurement may be tolerable, and may be related to the inherent technical error of the measurement device. Therefore the process of data cleaning must focus on those errors that are beyond small technical variations and that form a major shift within or beyond the population distribution. In turn, it must be based on an understanding of technical errors and of the expected ranges of normal values.

Some errors deserve higher priority, but which ones are most significant is highly study-specific. For instance, in most medical epidemiological studies, errors that need to be cleaned at all costs include missing gender, gender misspecification, birth date or examination date errors, duplication or merging of records, and biologically impossible results. As another example, in nutrition studies date errors lead to age errors, which in turn lead to errors in weight-for-age scoring and, further, to misclassification of subjects as under- or overweight. Errors of sex and date are particularly important because they contaminate derived variables. Prioritization is essential if the study is under time pressure or if resources for data cleaning are limited.

### 2.2.2 Data Integration

Data integration is the process of taking data from one or more sources and mapping them, field by field, onto a new data structure. The idea is to combine data from multiple sources into a coherent form. Many data mining projects require data from multiple sources because:

* Data may be distributed over different databases or data warehouses (for example, an epidemiological study that needs information about hospital admissions and car accidents).

* Sometimes data may be required from different geographic locations, or there may be a need for historical data (e.g. integrating historical data into a new data warehouse).

* There may be a need to enhance the data with additional (external) data (for improving data mining precision).

### 2.2.2.1 Data Integration Issues

There are a number of issues in data integration. Consider two database tables.

Database Table-1

Database Table-2

In the integration of these two tables a variety of issues are involved, such as:

1. The same attribute may have different names (for example, in the above tables Name and Given Name are the same attribute with different names).

2. An attribute may be derived from another (for example, the attribute Age is derived from the attribute DOB).

3. Attributes might be redundant (for example, the attribute PID is redundant).

4. Values in attributes might be different (for example, for PID 4791 the values in the second and third fields differ between the two tables).

5. Duplicate records may exist under different keys (the same record may be replicated with different key values).

Therefore schema integration and object matching can be tricky. The question is how equivalent entities from different sources are matched. This is known as the entity identification problem. Conflicts have to be detected and resolved. Integration becomes easier if unique entity keys are available in all the data sets (or tables) to be linked. Metadata can help in schema integration (examples of metadata for each attribute include the name, meaning, data type and range of values permitted for the attribute).
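Some of these issues can be made concrete with a small sketch. The two records below are hypothetical stand-ins for rows of the two tables, and the attribute mapping, the function names and the reference date are assumptions introduced for illustration only.

```python
from datetime import date

# Hypothetical metadata: source-specific attribute names and their common name.
ATTRIBUTE_MAP = {"Given Name": "Name"}

def to_common_schema(record):
    """Rename source-specific attributes to the common attribute names."""
    return {ATTRIBUTE_MAP.get(key, key): value for key, value in record.items()}

def age_on(dob, today):
    """Derive the age in whole years from the date of birth."""
    years = today.year - dob.year
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1  # birthday not yet reached in the current year
    return years

def value_conflicts(rec1, rec2):
    """Return the shared attributes whose values differ between two records."""
    return sorted(k for k in rec1.keys() & rec2.keys() if rec1[k] != rec2[k])

# Hypothetical rows describing the same person in two sources.
row1 = {"PID": 4791, "Name": "Ann Lee", "DOB": date(1980, 5, 17)}
row2 = {"PID": 4791, "Given Name": "Anne Lee", "DOB": date(1980, 5, 17)}

conflicts = value_conflicts(row1, to_common_schema(row2))
age = age_on(row1["DOB"], date(2020, 5, 16))
print(conflicts, age)
```

The sketch relies on a shared key (PID) to decide that the two rows describe the same entity. Without such a key, the entity identification problem has to be solved by comparing attribute values.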

### 2.2.2.1 Redundancy

Redundancy is another important issue in data integration. Two attributes (such as DOB and Age in the given table) may be redundant if one can be derived from the other attribute or from a set of attributes. Inconsistencies in attribute or dimension naming can also lead to redundancies in the given data sets.

### Handling Redundant Data

We can handle data redundancy problems in the following ways:

* Use correlation analysis.

* Consider different codings / representations (e.g. metric / imperial measures).

* Integrate the data carefully (manually), which can reduce or prevent redundancies (and inconsistencies).

* De-duplication (also called internal data linkage)

  * Used if no unique entity keys are available.

  * Analyzes the values in attributes to find duplicates.

* Process redundant and inconsistent data (easy if the values are the same)

  * Delete one of the values.

  * Average the values (only for numerical attributes).

  * Take the majority value (if there are more than 2 duplicates and some values are the same).
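The last two resolution strategies can be sketched as follows. The function names and the decision to return `None` when no single value has the majority are assumptions of this sketch.

```python
from collections import Counter
from statistics import mean

def resolve_by_average(values):
    """Average the conflicting values (numerical attributes only)."""
    return mean(values)

def resolve_by_majority(values):
    """Return the most frequent value, or None if the top two values tie."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear majority: leave for manual inspection
    return counts[0][0]

# Hypothetical conflicting values found in duplicate records of one entity.
weights = [70, 72]
cities = ["Leeds", "Leeds", "York"]

print(resolve_by_average(weights))
print(resolve_by_majority(cities))
print(resolve_by_majority(["Leeds", "York"]))
```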

### Correlation analysis is explained in detail here.

Correlation analysis: some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other. For numerical attributes we can evaluate the correlation between two attributes A and B by computing the correlation coefficient (also called Pearson's product moment coefficient). This is given by

r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}

Where

* n is the number of tuples,

* Ā and B̄ are the respective means of A and B,

* σA and σB are the respective standard deviations of A and B (sample standard deviations, with n - 1 in the denominator), and

* Σ(AB) is the sum of the AB cross-product.

a. The coefficient always satisfies -1 ≤ rA,B ≤ +1. If rA,B is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase. The higher the value of rA,B, the stronger the correlation between A and B. Hence a high value indicates that one of A or B may be removed as redundant.

b. If rA,B is equal to zero, A and B are uncorrelated: there is no linear relationship between them.

c. If rA,B is less than zero, then A and B are negatively correlated: as the value of one attribute increases, the value of the other decreases. This means that one attribute discourages the other.

It is important to note that correlation does not imply causality. That is, if A and B are correlated, this does not necessarily mean that A causes B or that B causes A. For example, in analyzing a demographic database we may find that the attributes representing the number of accidents and the number of car thefts in a region are correlated. This does not mean that one causes the other. Both may be causally linked to a third attribute, namely population.
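The correlation coefficient can be computed directly from the quantities named above. The sketch below uses the cross-product form (Σ(AB) - n·Ā·B̄) / ((n - 1)·σA·σB) with sample standard deviations; the choice of the n - 1 convention and the example values are assumptions.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson's product moment coefficient of two equally long value lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (n - 1 in the denominator).
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross_product = sum(x * y for x, y in zip(a, b))
    return (cross_product - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)

# Hypothetical attributes: age and birth year carry the same information.
age = [20, 30, 40, 50, 60]
birth_year = [2004, 1994, 1984, 1974, 1964]
income = [21, 35, 38, 52, 49]

r_redundant = pearson_r(age, birth_year)
r_income = pearson_r(age, income)
print(round(r_redundant, 3), round(r_income, 3))
```

A coefficient close to -1 or +1, as for age and birth year here, suggests that one of the two attributes can be dropped as redundant.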

For discrete data, a correlation relationship between two attributes can be discovered by a χ² (chi-square) test. Suppose A has c distinct values a1, a2, ..., ac and B has r distinct values b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (Ai, Bj) denote the joint event that attribute A takes the value ai and attribute B takes the value bj. Every possible (Ai, Bj) joint event has its own cell in the table. The χ² value is computed as

\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} {(O_{i,j} - E_{i,j})^2 \over E_{i,j}} .

Where

* Oi,j is the observed frequency (i.e. actual count) of the joint event (Ai, Bj), and

* Ei,j is the expected frequency, which can be computed as

E_{i,j}=\frac{\sum_{k=1}^{r} O_{i,k} \sum_{k=1}^{c} O_{k,j}}{N} \, ,

Where

* N is the number of data tuples,

* the sum of Oi,k over k is the number of tuples having value ai for A, and

* the sum of Ok,j over k is the number of tuples having value bj for B.

The larger the χ² value, the more likely the variables are related. The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.

Chi-Square Calculation: An Example

Suppose a group of 1,500 people was surveyed. The gender of each person was noted, and each person was polled on whether their preferred type of reading material was fiction or non-fiction. The observed frequency of each possible joint event is summarized in the following table (numbers in parentheses are expected frequencies). Calculate the χ² value.

| | Male | Female | Sum (row) |
|---|---|---|---|
| Fiction | 250 (90) | 200 (360) | 450 |
| Non-fiction | 50 (210) | 1000 (840) | 1050 |
| Sum (col.) | 300 | 1200 | 1500 |

The expected frequency of each cell follows from the row and column totals, for example E11 = count(male) * count(fiction) / N = 300 * 450 / 1500 = 90, and so on. The χ² value is then

\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 284.44 + 121.90 + 71.11 + 30.48 \approx 507.9

For this table the degrees of freedom are (2-1)(2-1) = 1, as the table is 2 x 2. For 1 degree of freedom, the χ² value needed to reject the hypothesis of independence at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the χ² distribution, typically available in any statistics textbook). Since the computed value is far above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are strongly correlated for the given group.
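The calculation can be checked with a few lines of code. This is a sketch under the assumption that the contingency table is given as a list of rows of observed counts; the function name is hypothetical, and the critical value 10.828 is the one quoted in the text.

```python
def chi_square(observed):
    """Pearson chi-square statistic of a contingency table of observed counts."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (count - expected) ** 2 / expected
    return chi2

# Observed counts of the 2 x 2 table in the example, row by row.
observed = [[250, 200],
            [50, 1000]]

chi2 = chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
reject_independence = chi2 > 10.828  # critical value for 1 degree of freedom at 0.001
print(round(chi2, 2), degrees_of_freedom, reject_independence)
```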

Duplication must also be detected at the tuple level. The use of denormalized tables is another source of redundancy. Redundancies may further lead to data inconsistencies (when some occurrences of a value are updated but not others).

### 2.2.2.2 Detection and resolution of data value conflicts

Another significant issue in data integration is the detection and resolution of data value conflicts. For the same entity, attribute values from different sources may differ. For example, weight can be stored in metric units in one source and in British imperial units in another. For a hotel chain, room rent in different cities may involve not only different currencies but also different services and taxes.

An attribute in one source may be stored at a lower level of abstraction than the "same" attribute in another source. For instance, the total sales in one database may refer to one branch of an electronics store, while an attribute of the same name in another database may refer to the total sales for all the stores in a specified region.

The structure of the data must also be given sufficient attention. This is to make sure that any attribute functional dependencies and referential constraints in the source system match those in the target system. For instance, in one scheme a discount may be applied to an order as a whole, whereas in another scheme it is applied to each individual line item within the order. If this is not caught before integration, items in the target system may be improperly discounted.

Semantic heterogeneity and the structure of data pose great challenges in data integration. Careful integration of the data from multiple sources can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the subsequent mining process.

### 2.2.3 Data Transformation

In data transformation, data are consolidated into a form that is suitable for mining. Data transformation involves the following:

* Data Smoothing: data smoothing is done in order to remove noise from data. Binning, regression and clustering are a few of the techniques used in data smoothing. It is a form of data cleaning and was discussed in the previous section.

* Data Aggregation: here the data are summarized or an aggregation operation is applied to them. This is generally used in constructing data cubes for analysis of the data at multiple granularities. For example, daily sales data can be aggregated to compute the monthly total sales amount. This is a form of data reduction and is discussed in the next section.

* Data Generalization: here low-level data are replaced by higher-level concepts using concept hierarchies. For example, an attribute like street can be generalized to higher-level concepts like city or country. This is also a form of data reduction and is discussed in the next section.

* Normalization: in normalization the attribute data are scaled so as to fall within a small specified range such as -1 to 1 or 0 to 1.

* Attribute Construction (feature construction): to help the mining process, new attributes are constructed from the given set of attributes and added to it.

Here in this section we will discuss normalization and attribute construction.

### 2.2.3.1 Normalization

Normalization is generally useful for classification algorithms involving neural networks or distance measurements, such as nearest neighbor classification and clustering. In neural networks, normalizing the input values for each attribute helps speed up the learning process. For distance-based methods, normalization helps prevent attributes with initially large ranges from outweighing attributes with initially smaller ranges.

The main methods of normalization are:

* Min-max normalization

* Z-score normalization

* Normalization by decimal scaling

1. Min-max normalization

It performs a linear transformation on the original data. Min-max normalization subtracts the minimum value of an attribute from each value of the attribute and then divides the difference by the range of the attribute. The result is multiplied by the new range of the attribute and finally added to the new minimum value of the attribute. These operations transform the data into a new range, generally [0,1]. When the data set carries class labels, these are set aside before normalization and rejoined to the normalized data set afterwards. Suppose the minimum and maximum values of an attribute A are min_A and max_A respectively. This normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by the following computation:

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

This normalization method preserves the relationship among data values. An out of bound error is encountered if a future input case for normalization falls outside the original data values.

Example: Suppose that the minimum and maximum value for attribute income is $12000 and $98000, respectively. We would like to map income in a range [0.0, 1.0]. Normalize value $73000 using min-max normalization.

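Solution:

Using the min-max formula with min_A = 12000, max_A = 98000, new_min_A = 0.0 and new_max_A = 1.0, the value 73,000 is mapped to

$$v' = \frac{73000 - 12000}{98000 - 12000}\,(1.0 - 0.0) + 0.0 = \frac{61000}{86000} \approx 0.709$$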

2. Z-score normalization (Zero-mean normalization):

In this normalization the values of an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' using the following computation:

$$v' = \frac{v - \mu_A}{\sigma_A}$$

Here $\mu_A$ is the mean of A and $\sigma_A$ is the standard deviation of A.

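Example: Suppose that the mean and standard deviation of the attribute income are $54000 and $16000, respectively. Normalize value $73000 using z-score normalization.

Solution:

Using the z-score formula with mean 54000 and standard deviation 16000, the value 73,000 is mapped to

$$v' = \frac{73000 - 54000}{16000} = \frac{19000}{16000} \approx 1.19$$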

3. Normalization by Decimal scaling:

This method normalizes by moving the decimal point of the values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to a value v' using the following computation:

$$v' = \frac{v}{10^{j}}$$

where j is the smallest integer such that $\max(|v'|) < 1$.

Example: Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. Normalize these values using decimal scaling.

Solution:

To normalize using decimal scaling we divide each value by 1000 (j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.

Normalization may change the data quite a bit. Therefore it is necessary to save the normalization parameters (such as the minimum and maximum, or the mean and standard deviation) so that future data can be normalized in a uniform manner.
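The three normalization methods can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are assumptions made for this example, and the decimal scaling helper assumes a non-negative exponent j. The calls at the end reuse the worked examples from this section.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling_exponent(values):
    """Smallest non-negative integer j with max(|v|) / 10**j < 1 (assumed j >= 0)."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(v, j):
    """Move the decimal point of v by j places."""
    return v / 10 ** j


# Worked examples from the text.
print(round(min_max(73000, 12000, 98000), 3))    # about 0.709
print(round(z_score(73000, 54000, 16000), 2))    # about 1.19
j = decimal_scaling_exponent([-986, 917])        # j = 3
print(j, decimal_scale(-986, j), decimal_scale(917, j))
```

The parameters passed to each function (minimum and maximum, mean and standard deviation, or j) are exactly the values that would need to be saved in order to normalize future data in the same way.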

### 2.2.3.2 Attribute Construction

In attribute construction (or feature construction), new attributes are constructed and added from the given set of attributes to help the mining process. These new attributes are constructed and added in order to get better accuracy and understanding of structure in high-dimensional data. For instance, we may wish to add the attribute area based on the attributes height and width. By combining attributes, attribute construction can discover missing information about the relationships between data attributes that can be useful for knowledge discovery.
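As a small illustration of the area example, the sketch below adds a constructed attribute to each record. The record layout (dictionaries with `height` and `width` keys) and the function name `add_area` are assumptions made for this example only.

```python
def add_area(records):
    """Return copies of the records with a constructed 'area' attribute."""
    return [{**r, "area": r["height"] * r["width"]} for r in records]


# Hypothetical records, used only to demonstrate the idea.
rooms = [{"height": 3.0, "width": 4.0}, {"height": 2.5, "width": 2.0}]
for row in add_area(rooms):
    print(row)
```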

### 2.2.4 Data Reduction

If the data set is very large, data mining and analysis can take so long that the whole exercise becomes impractical or infeasible. Data reduction is therefore of fundamental importance to machine learning and data mining. Data reduction produces a reduced representation of the data set that is much smaller in volume yet yields the same (or almost the same) analytical results. The objective is to aggregate or condense the information contained in large data sets into manageable (smaller) information nuggets. The time spent on data reduction should not, however, exceed or cancel out the time saved by mining the reduced data set. For example, in analyzing car sales we might focus on the role of the model, year and color of the cars sold. We therefore ignore the differences between two sales along the dimensions of date of sale or dealership, and analyze only the total sales of cars by model, by year and by color. Data reduction methods can include simple tabulation, aggregation (computing descriptive statistics) or more sophisticated techniques. The data reduction strategies include:

* Data Cube Aggregation

* Attribute Subset Selection

* Dimensionality Reduction

* Numerosity Reduction

* Data Discretization and Concept Hierarchy Generation.

### 2.2.4.1 Data Cube Aggregation

Consider that we have collected data for analysis. These data consist of car sales per quarter for the years 2007 to 2009. We are, however, interested in annual sales, so the data need to be aggregated to give the total sales per year rather than per quarter. This aggregation is illustrated in the following figure.

A data cube is used to represent data along some measure of interest. Even though it is called a "cube", it can be 2-dimensional, 3-dimensional or higher-dimensional. Every dimension represents an attribute in the database, and the cells in the data cube represent the measure of interest. For illustration, they could contain a count of the number of times that attribute combination occurs in the database, or the minimum, maximum, sum or average value of some attribute. Queries are performed on the cube to retrieve decision support information. For example, the following figure shows a data cube for multidimensional analysis of sales data with respect to annual sales per car model for each dealer. Each cell holds an aggregate data value, corresponding to the data point in multidimensional space.

We will discuss data cubes in detail in the chapter on data warehousing.
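The quarterly-to-annual aggregation described above can be sketched as follows. The sales figures are hypothetical values invented for this illustration, and the function name `aggregate_by_year` is likewise an assumption.

```python
from collections import defaultdict

# Hypothetical quarterly car sales (year, quarter, units sold).
quarterly_sales = [
    (2007, "Q1", 120), (2007, "Q2", 150), (2007, "Q3", 130), (2007, "Q4", 160),
    (2008, "Q1", 140), (2008, "Q2", 155), (2008, "Q3", 145), (2008, "Q4", 170),
    (2009, "Q1", 110), (2009, "Q2", 125), (2009, "Q3", 135), (2009, "Q4", 150),
]


def aggregate_by_year(rows):
    """Sum the quarterly amounts so that one total remains per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)  # twelve quarterly rows reduced to three annual totals
```

The reduced data set is a quarter of the original size, yet it still answers any question that is posed at the level of annual sales.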

### 2.2.4.2 Attribute Subset Selection (Feature Subset Selection)

The motivation of data mining is to dig out valuable information from the huge amounts of data held in very large databases. Such databases, however, generally include some redundant and irrelevant attributes, which result in low performance and high computational complexity. Feature Subset Selection (FSS) has therefore become a very important issue in data mining. It is an important component of knowledge discovery and data mining systems that helps reduce the data dimensionality. For example, if the goal is to classify customers as to whether or not they are likely to purchase a popular new car accessory when notified of a sale, attributes such as the customer's phone number are likely to be irrelevant, unlike attributes such as age or annual earnings. Domain experts can pick out the useful attributes, but this is a time-consuming and difficult task. Leaving out relevant attributes or keeping irrelevant ones may be detrimental, confusing the mining algorithm and leading to poor results. In addition, the added volume of irrelevant and redundant information can slow down the mining system. Attribute subset selection reduces the data set size by removing irrelevant attributes or dimensions. It finds a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. It also makes the mined patterns easier to understand.

How can we find a good subset of the original attributes? For n attributes there are $2^n$ possible subsets, so an exhaustive search of these $2^n$ subsets could be very expensive. Therefore heuristic methods for searching for a subset are commonly used. These methods are normally greedy in that, while searching through the attribute space, they always choose the locally optimal solution. Nevertheless, these methods are effective in practice and can give a solution close to the optimal one.

Definition: Attribute selection is a process in which a subset of M attributes out of N is chosen, complying with the constraint M ≤ N, in such a way that the characteristic space is reduced according to some criterion. Attribute selection helps ensure that the data reaching the mining phase are of good quality.

Algorithms used for attribute selection can normally be separated into two main activities: the search for attribute subsets and the evaluation of the subsets found, as shown in the accompanying figure.

Search algorithms used in the first stage can be subdivided into three main groups: exponential, random and sequential algorithms. Exponential algorithms, for instance exhaustive search, try all possible attribute combinations before returning the attribute subset. Normally they are not computationally feasible, since their running time grows exponentially with the number of available attributes.

Genetic algorithms are one example of random search methods, and their main advantage over sequential ones is that they are capable of dealing with the problem of attribute interaction.

Sequential algorithms are relatively efficient in solving many attribute selection problems, although they have the disadvantage of not taking attribute interaction into account. Two examples of sequential algorithms are forward selection and backward elimination.

Sequential forward selection starts the search for the best attribute subset with an empty set of attributes. Initially, attribute subsets with only one attribute are evaluated, and the best attribute A* is selected. This attribute A* is then combined with each of the other available attributes (pairwise), and the best subset of two attributes is selected. The search continues in this way, adding one attribute at a time to the best attribute subset already selected, until the quality of the best selected attribute subset cannot be improved further.

In contrast to forward selection, sequential backward elimination starts the search with a solution containing all attributes. At each iteration one attribute is removed from the current solution, until no further improvement in the quality of the solution can be attained.

In decision tree induction, a tree is constructed from the given data. Internal nodes denote a test on an attribute, branches correspond to the outcomes of the test, and external (leaf) nodes denote a class prediction. Irrelevant attributes never appear in the tree. The following figure shows these basic heuristic methods diagrammatically: forward selection, backward elimination and decision tree induction.
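The greedy search performed by sequential forward selection can be sketched as below. This is a minimal sketch rather than a definitive implementation: the evaluation function `score` is supplied by the caller (in practice it would be a filter or wrapper measure), and the toy scoring function and its usefulness values are hypothetical numbers chosen only to make the example run.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that most improves score(subset)."""
    selected = []
    best_score = float("-inf")
    remaining = list(attributes)
    while remaining:
        # Evaluate the current subset extended by each remaining attribute.
        candidates = [(score(selected + [a]), a) for a in remaining]
        cand_score, cand_attr = max(candidates, key=lambda c: c[0])
        if cand_score <= best_score:
            break  # no extension improves the subset, so stop
        selected.append(cand_attr)
        remaining.remove(cand_attr)
        best_score = cand_score
    return selected, best_score


# Hypothetical usefulness of each attribute, with a small cost per attribute.
USEFULNESS = {"age": 0.30, "annual_earning": 0.25, "phone_number": 0.0}


def toy_score(subset):
    return sum(USEFULNESS[a] for a in subset) - 0.05 * len(subset)


subset, quality = forward_selection(USEFULNESS, toy_score)
print(subset, round(quality, 2))
```

With these toy values the search keeps `age` and `annual_earning` and stops before adding `phone_number`, because adding it would lower the score. Backward elimination follows the same pattern in reverse, starting from the full set and removing one attribute per iteration.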

Regarding the evaluation of the generated attribute subsets, two main approaches can be implemented: the filter approach and the wrapper approach. Both approaches are independent of the algorithm used to select the candidate subsets, and they are characterized by their degree of dependence on the classification algorithm.

The wrapper approach finds an adequate subset for a previously chosen data set and a particular induction algorithm, taking into account the inductive bias of the algorithm and its interaction with the training set. The accompanying figure represents an attribute selection algorithm that uses the wrapper approach.

Unlike the wrapper approach, the filter approach tries to choose an attribute subset independently of the classification algorithm to be used, estimating attribute quality by looking only at the data. The accompanying figure presents the schema of attribute selection with a filter approach, which makes the selection in a preprocessing step based only on the training data. During this phase, the generated attribute sets can be evaluated according to some simple heuristic, such as the orthogonality of the data.

Normally, the wrapper approach has a long running time, but the number of correctly classified instances tends to be greater than that obtained by the filter approach. There are many techniques for evaluating an attribute subset with the filter approach. Among the evaluation measures, relevance and consistency deserve attention. A relevance measure quantifies how strongly two attributes are associated, that is, whether it is possible to predict the value of one attribute when the value of the other is known. Within the attribute selection context, the best evaluated attribute is the one that best predicts the class. A consistency measure evaluates an attribute subset by determining how consistent the class values are when the training instances are projected onto that subset.

An ANN (artificial neural network) can be used to construct empirical models in cases where mathematical models are unavailable but real-world data relating inputs to outputs exist. These models may then be used to predict the outputs for a set of new inputs that were not used during construction of the model. One of the main drawbacks of these methods is that the structure of the model must be specified a priori, and that a set of data is required for training and developing the model, which may not necessarily be available.

### 2.2.4.3 Dimensionality Reduction

Data sets with high dimensions present many mathematical challenges as well as some opportunities, and are bound to give rise to new theoretical developments. One of the major drawbacks of high-dimensional data sets is that, in some cases, not all the measured variables are “important” for understanding the fundamental phenomena of interest. Although some computationally expensive methods can construct predictive models with high precision from this type of data set, it is still of interest in many applications to reduce the dimension of the original data prior to any modeling.

Mathematically, the problem can be stated as follows: given the p-dimensional random variable $x = (x_1, \ldots, x_p)^T$, find a lower-dimensional representation of it, $s = (s_1, \ldots, s_k)^T$ with $k \le p$, that captures the content of the original data according to some criterion. The components of s are also called the hidden components. Different fields use different names for the p components of x: the term “variable” is mostly used in statistics, while “feature” and “attribute” are alternatives commonly used in the computer science and machine learning literature.

The objective of dimension reduction algorithms is to obtain an economical description of multivariate data. The aim is to get a compressed yet accurate representation of the data that reduces or eliminates statistically redundant components. Dimension reduction is fundamental to a range of data processing goals. Input selection for classification and regression problems is a task-specific form of dimension reduction. Visualization of high-dimensional data requires mapping to a lower dimension, generally three or fewer. Transform coding typically involves dimension reduction. The initial high-dimensional signal (e.g., image blocks) is first transformed in order to reduce the statistical dependence, and hence the redundancy, between the components. The transformed components are then scalar quantized. Dimension reduction can be imposed explicitly, by eliminating a subset of the transformed components. Alternatively, the allocation of quantization bits among the transformed components (e.g., in increasing measure according to their variance) can result in eliminating the low-variance components by assigning them zero bits.

Recently several authors have used neural network implementations of dimension reduction to signal equipment failures by novelty, or outlier, detection. In these schemes, high-dimensional sensor signals are projected onto a sub-space that best describes the signals obtained during normal operation of the monitored system. New signals are categorized as normal or abnormal according to the distance between the signal and its projection.

Traditionally, Principal Component Analysis (PCA) has been the technique of choice for dimension reduction. Wavelet spectral analysis of hyperspectral images has recently been proposed as a method for dimension reduction and, when tested for the classification of data, has shown promising results. One of the interesting features of wavelet spectral analysis is that it can ignore data anomalies because it uses low-pass filters.

### PCA dimension reduction

Principal Component Analysis (PCA) is one of the most commonly used dimension reduction techniques. It computes orthogonal projections that maximize the amount of data variance, and yields a data set in a new uncorrelated coordinate system. However, the information in hyperspectral images does not always match such projections. This rotational transform is also time-consuming because of its global nature. Moreover, it might not preserve local spectral signatures and therefore might not preserve all the information useful for a successful classification. The idea of PCA is to explain the variance-covariance structure of a set of variables through a smaller number of uncorrelated linear combinations of these variables. PCA can be posed as an optimization problem whose aim is to explain the complete data set with the least number of components. Solving it draws on Lagrange multipliers, eigenvectors and eigenvalues, matrices and their properties, the change of basis theorem and other mathematical techniques.

### The PCA algorithm consists of following main steps:

* In the first step, the input data are normalized so that each attribute falls within the same range. This step helps to make sure that attributes with large domains will not dominate attributes with smaller domains.

* PCA then computes k orthonormal vectors that provide a basis for the normalized input data. These are unit vectors, each pointing in a direction perpendicular to the others. These vectors are called “principal components”, and the input data are a linear combination of the principal components.

* The principal components are sorted in order of decreasing “significance” or strength. The principal components essentially serve as a new set of axes for the data, giving significant information about variance. That is, the sorted axes are such that the first axis shows the most variance among the data, the second axis shows the next highest variance, and so on. For example, the following figure shows the first two principal components, Y1 and Y2, for a given set of data originally mapped to the axes X1 and X2. This information helps identify groups or patterns within the data.

* Since the components are sorted in decreasing order of “significance,” we can reduce the size of the data by removing the weaker components (components with lower variance). It should be possible to reconstruct a good approximation of the original data by using the strongest principal components.

PCA is computationally inexpensive and can be applied to ordered and unordered attributes. It can also handle sparse and skewed data. Data of more than two dimensions can be handled by reducing the problem to two dimensions. The principal components may be used as inputs to multiple regression and cluster analysis. PCA tends to be better at handling sparse data than wavelet transforms. On the other hand, wavelet transforms are more suitable for data of high dimensionality than PCA.
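The steps above can be sketched for the strongest principal component only, using plain Python. This is a simplified sketch under stated assumptions: the data are only centered (any rescaling of attributes to a common range is assumed to have been done beforehand), the leading component is found by power iteration rather than a full eigendecomposition, and all function names and the sample points are invented for this example.

```python
import math


def center(data):
    """Subtract the mean of each attribute (column) from its values."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data], means


def covariance(centered):
    """Sample covariance matrix of centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Leading unit eigenvector of cov and the variance along it.

    Power iteration; assumes the data are not constant and that the start
    vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                   for i in range(d))
    return v, variance


def project(centered, component):
    """Reduce each tuple to one number: its coordinate along the component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]


# Hypothetical two-attribute data lying close to a straight line.
points = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
centered, means = center(points)
component, variance = first_component(covariance(centered))
print(component, variance)
print(project(centered, component))
```

Keeping only the projected values replaces two attributes with one while retaining most of the variance in this sample. A full implementation would compute all k components and keep the strongest few.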

### Wavelet dimension reduction

Among the many kinds of transformation, the wavelet transform has been widely chosen for developing data compression algorithms. There is a large difference between the wavelet transform and the Fourier transform. In the Fourier domain, all the elements of the basis are active for all time t, i.e., they are non-local. Consequently, Fourier series converge very slowly when approximating a localized function. The wavelet transform makes up for this deficiency. A wavelet basis function is localized in both the time domain and the frequency domain. Therefore, wavelet basis functions can provide a good approximation of a localized function with only a few terms. In general, the discrete wavelet transform (DWT) achieves better lossy compression than the discrete Fourier transform (DFT). If the same number of coefficients is retained for a wavelet transform and a Fourier transform of a given data vector, the wavelet version will provide a more accurate approximation of the original data. Hence, for an equivalent approximation, the DWT requires less space than the DFT. The general description of the automatic wavelet dimension reduction algorithm is shown in Figure 2.13.

The discrete wavelet transform is a linear signal processing technique that, when applied to a data vector X, transforms it into a numerically different vector X' of wavelet coefficients. X and X' are of the same length. In data reduction, we consider each tuple as an n-dimensional data vector, that is, $X = (x_1, x_2, \ldots, x_n)$, depicting n measurements made on the tuple from n database attributes. Although the two vectors have equal length, the technique is useful because the wavelet-transformed data can be truncated. A compressed approximation of the data can be retained by storing only a small fraction of the strongest wavelet coefficients. For example, we can retain all wavelet coefficients larger than some specified threshold and set all other coefficients to 0. The resulting data representation is very sparse, so operations that can take advantage of data sparsity are computationally very fast if performed in wavelet space. The technique also works to eliminate noise without smoothing out the main features of the data, which makes it helpful for data cleaning. Given a set of coefficients, an approximation of the original data can be constructed by applying the inverse of the DWT used.

The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed. The method is as follows:

* The length L of the input vector must be a power of 2. This condition can be met by padding the vector with zeros.

* Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.

* These two functions are applied to pairs of data points in X, i.e., to all pairs of measurements $(x_{2i}, x_{2i+1})$. This gives two sets of data of length L/2. These represent a smoothed or low-frequency version of the input data and its high-frequency content, respectively.

* The two functions are recursively applied to the sets of data obtained in the previous loop, until the resulting data sets are of length 2.

* Selected values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data.

Equivalently, we can apply a matrix multiplication to the input data in order to obtain the wavelet coefficients, where the matrix used depends on the given discrete wavelet transform. The matrix must be orthonormal, meaning that its columns are unit vectors and are mutually orthogonal, so that the matrix inverse is just its transpose. This property allows the data to be reconstructed from the smooth and smooth-difference data sets. By factoring the matrix into a product of a few sparse matrices, the resulting “fast DWT” algorithm has a complexity of O(n) for an input vector of length n. Wavelet transforms can be applied to multidimensional data, such as a data cube. This is done by first applying the transform to the first dimension, then to the second dimension, and so on. The complexity is linear with respect to the number of cells in the cube. Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes. Lossy compression by wavelets is reported to be better than JPEG compression, the current commercial standard. Wavelet transforms have many real-life applications, including the compression of fingerprint images, computer vision, analysis of time-series data, and data cleaning.
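The pyramid procedure can be sketched with the Haar wavelet, the simplest discrete wavelet transform. This is an illustrative sketch, not the only possible choice: it assumes the orthonormal Haar filters (pairwise sum and difference divided by the square root of 2), it recurses down to a single smooth coefficient (a common convention), and the function names and sample vector are assumptions made for this example.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """Haar pyramid transform; len(x) must be a power of 2."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of 2 (pad with zeros)")
    smooth, details = list(x), []
    while len(smooth) > 1:
        pairs = [(smooth[2 * i], smooth[2 * i + 1])
                 for i in range(len(smooth) // 2)]
        detail = [(a - b) / SQRT2 for a, b in pairs]   # weighted difference
        smooth = [(a + b) / SQRT2 for a, b in pairs]   # weighted average
        details = detail + details                     # coarsest details first
    return smooth + details


def haar_idwt(coeffs):
    """Inverse transform: rebuild the data from the wavelet coefficients."""
    smooth, pos = list(coeffs[:1]), 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(smooth)]
        pos += len(smooth)
        rebuilt = []
        for s, d in zip(smooth, detail):
            rebuilt += [(s + d) / SQRT2, (s - d) / SQRT2]
        smooth = rebuilt
    return smooth


def truncate(coeffs, threshold):
    """Keep only the strongest coefficients; set the others to 0."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]


x = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]   # hypothetical data vector
coeffs = haar_dwt(x)
sparse = truncate(coeffs, 1.0)
print(coeffs)
print(sparse)
print(haar_idwt(sparse))   # approximation of x from fewer non-zero values
```

Because the transform is orthonormal, applying `haar_idwt` to the untruncated coefficients reproduces the input up to floating-point rounding. The truncated coefficients give a sparser representation at the cost of an approximate reconstruction.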

### 2.2.4.4 Numerosity Reduction

Numerosity reduction aims at replacing the existing data with a much smaller representation of the data. It can be achieved using regression concepts from statistics. In linear regression, a random variable is modeled as a linear function of another variable. As already discussed, linear regression can be extended to multiple regression, in which the response variable is predicted from a multidimensional feature vector. Histograms are another approach to numerosity reduction, in which individual attribute values or ranges are represented on the x-axis and their corresponding counts on the y-axis. Clustering can also be used for numerosity reduction: the principle of grouping similar objects within a cluster is exploited to replace the actual data with cluster representatives, which are selected on the basis of a distance measure. Sampling techniques, with and without replacement, can also be employed to replace the original data with a condensed representation by drawing random samples from the original data set.

### Regression and Log Linear Models

In simple linear regression, scores on one variable are predicted from the scores on a second variable. The variable being predicted is called the criterion variable and is referred to as Y. The other variable is called the predictor variable and is referred to as X. Simple regression is the prediction method used when there is only one predictor variable. In simple linear regression, the predictions of Y, when plotted as a function of X, form a straight line.

The example data given in Table 1 are plotted in Figure 2.14. A positive relationship between X and Y can be seen in the figure.

### Table 1. Example data

| X | Y |
|------|------|
| 1.00 | 1.00 |
| 2.00 | 2.00 |
| 3.00 | 1.30 |
| 4.00 | 3.75 |
| 5.00 | 2.25 |

In linear regression, we search for the best-fitting straight line through the points. This line is known as the regression line. The black line in Figure 2.14 is the regression line and is made up of the predicted score on Y for each possible value of X. The vertical lines from the points to the regression line represent the errors of prediction. The red point is very near the regression line, so its error of prediction is small. The yellow point, on the other hand, is much higher than the regression line, and therefore its error of prediction is large.
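As an illustration, the regression line for the Table 1 data can be computed with the usual least-squares formulas. The sketch below assumes that "best-fitting" means minimizing the sum of squared errors of prediction. The function name `fit_line` is an assumption made for this example.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Example data from Table 1.
xs = [1.00, 2.00, 3.00, 4.00, 5.00]
ys = [1.00, 2.00, 1.30, 3.75, 2.25]

slope, intercept = fit_line(xs, ys)
print(round(slope, 3), round(intercept, 3))   # about 0.425 and 0.785

# Error of prediction for each point: observed Y minus predicted Y.
errors = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print([round(e, 3) for e in errors])
```

For data reduction, only the two parameters (slope and intercept) need to be stored instead of the individual data points.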

The log-linear model is one of the specialized cases of generalized linear models for Poisson-distributed data. Log-linear analysis is an extension of the two-way contingency table, in which the conditional relationship between two or more discrete, categorical variables is analyzed by taking the natural logarithm of the cell frequencies within a contingency table. Although log-linear models can be used to analyze the relationship between two categorical variables (two-way contingency tables), they are more commonly used to evaluate multi-way contingency tables that involve three or more variables. The variables investigated by these models are all treated as response variables. In other words, no distinction is made between independent and dependent variables, so these models only reveal associations between variables. If some variables are treated as dependent and others as independent, then logit or logistic regression should be used instead. Furthermore, if the variables being investigated are continuous and cannot be broken down into discrete categories, logit or logistic regression would again be the appropriate analysis.

Assume that we are interested in the relationship between gender, heart disease and body weight. We could take a sample of 200 people and determine the gender, the approximate body weight, and who does and does not have heart disease. The variable body weight is broken down into two categories: not over weight and over weight. The contingency table containing the data may look like this:

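| Body Weight | Gender | Heart Disease: Yes | Heart Disease: No | Total |
|-----------------|--------|----|----|-----|
| Not over weight | Male   | 15 | 5  | 20  |
|                 | Female | 40 | 60 | 100 |
|                 | Total  | 55 | 65 | 120 |
| Over weight     | Male   | 20 | 10 | 30  |
|                 | Female | 10 | 40 | 50  |
|                 | Total  | 30 | 50 | 80  |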

In this illustration, if we had chosen heart disease as the dependent variable and gender and body weight as the independent variables, then logit or logistic regression would have been the appropriate analysis.

Regression and log-linear models can both be used on sparse data, although their application may be limited. Both methods can handle skewed data, and regression does exceptionally well on it. Regression can be computationally intensive when applied to high-dimensional data, whereas log-linear models show good scalability for up to 10 or so dimensions.

### Histograms

Histograms are a popular form of data reduction. They use binning to approximate data distributions. A histogram is a graphical method for summarizing the distribution of a given attribute. A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets. Frequently, buckets instead represent continuous ranges of the given attribute. Normally, the width of each bucket is the same. Each bucket is shown as a rectangle whose height is equal to the count or relative frequency of the values in the bucket. If A is categorical, for example automobile model or item type, then one rectangle is drawn for each known value of A, and the resulting graph is usually called a bar chart. If A is numeric, the term histogram is preferred.

Example 2.5 The following data are a list of prices of commonly sold items at a store. The numbers have been sorted:

1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30

Equal-width: In an equal-width histogram, the width of each bucket range is uniform (such as the width of $10 for the buckets in Figure 2.16).

Equal-frequency (or equi-depth): In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (that is, each bucket contains roughly the same number of contiguous data samples).

V-Optimal: If we consider all of the possible histograms for a given number of buckets, the V-Optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where bucket weight is equal to the number of values in the bucket.

V-Optimal and MaxDiff histograms tend to be the most accurate and practical. Histograms are highly effective at approximating both sparse and dense data, as well as highly skewed and uniform data. The histograms described above for single attributes can be extended for multiple attributes. Multidimensional histograms can capture dependencies between attributes. Such histograms have been found effective in approximating data with up to five attributes. More studies are needed regarding the effectiveness of multidimensional histograms for very high dimensions. Singleton buckets are useful for storing outliers with high frequency.
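The equal-width and equal-frequency partitioning rules can be sketched on the price list of Example 2.5. This is a minimal sketch: the function names are assumptions made for this example, the equal-width helper assumes integer-valued data, and the choices of a bucket width of 10 starting at 1 and of four equal-frequency buckets are made here for illustration.

```python
from collections import Counter

# Price list from Example 2.5, written as value * count.
prices = ([1] * 2 + [5] * 5 + [8] * 2 + [10] * 4 + [12] + [14] * 3 +
          [15] * 6 + [18] * 8 + [20] * 7 + [21] * 4 + [25] * 5 +
          [28] * 2 + [30] * 3)


def equal_width(values, width, start):
    """Count values per bucket of uniform width; keys are (low, high) ranges."""
    counts = Counter((v - start) // width for v in values)
    return {(start + i * width, start + (i + 1) * width - 1): counts[i]
            for i in sorted(counts)}


def equal_frequency(values, k):
    """Split the sorted values into k contiguous buckets of near-equal size.

    Returns (smallest value, largest value, count) for each bucket.
    """
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(k):
        chunk = ordered[i * n // k:(i + 1) * n // k]
        if chunk:
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets


print(equal_width(prices, 10, 1))
print(equal_frequency(prices, 4))
```

With a width of 10, the 52 prices are summarized by three bucket counts. With four equal-frequency buckets, each bucket holds 13 values, and a frequent value such as 18 may straddle two neighboring buckets.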

### Clustering

Clustering techniques partition the objects into clusters (groups) such that objects within a cluster are similar to one another and dissimilar to the objects in other clusters. Similarity is generally defined in terms of how "close" the objects are in space, based on a distance function. There are various measures of cluster quality. For example, quality may be measured by the cluster diameter, which is the maximum distance between any two objects in the cluster, or by the centroid distance, which is the average distance of each cluster object from the cluster centroid (the "average object," or average point in space for the cluster). In data reduction, the cluster representations of the data are used to replace the actual data. The usefulness of this technique depends on the nature of the data. It is much more useful for data that can be organized into distinct clusters than for smeared data. We will discuss clustering in detail in the chapter on cluster analysis.
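The two quality measures mentioned above can be sketched in a few lines of Python. This is only an illustration under the assumption of Euclidean distance; the helper names `diameter`, `centroid`, and `centroid_distance` are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations

def diameter(points):
    """Maximum Euclidean distance between any two objects in the cluster."""
    return max(math.dist(p, q) for p, q in combinations(points, 2))

def centroid(points):
    """Average point of the cluster, computed coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def centroid_distance(points):
    """Average distance of each cluster object from the cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

cluster = [(0, 0), (0, 2), (2, 0), (2, 2)]   # hypothetical cluster
print(centroid(cluster), diameter(cluster), centroid_distance(cluster))
```

For data reduction, a cluster such as this one could be replaced by its centroid together with summary values like the two measures above.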

### Sampling

Sampling is an important data reduction technique because it allows a large data set to be represented by a much smaller random sample (or subset) of the data. Assume that a large data set, D, contains N tuples. Here we present some common ways in which D can be sampled for data reduction.

### Types of Sampling

* Simple random sampling: Each element has an equal likelihood of being included in the sample, and a list of all population elements must be available. Sample elements can be selected with or without replacement.

* Simple random sampling with replacement (SRSWR): This method is of special importance because it simplifies statistical inference by eliminating any relation (covariance) between the selected elements through the replacement process. In this method, an element can appear more than once in the sample.

* Simple random sampling without replacement (SRSWOR): In this method, sampling is done without replacement because there is no need to collect the information more than once from an element. Moreover, SRSWOR gives a smaller sampling variance than SRSWR. However, the two methods are almost the same in a large survey in which only a small fraction of the population elements is sampled.

In practice, the SRSWOR design is modified further to accommodate other theoretical and practical considerations. The common practical designs include stratified random sampling, cluster sampling, and other controlled selection procedures. These more practical designs deviate from SRSWOR in two important ways:

* The inclusion probabilities for the elements (and the joint inclusion probabilities for sets of elements) may be unequal.

* The sampling unit can be different from the population element of interest.

These designs complicate the usual methods of estimation and variance calculation and, if proper methods of analysis are not used, can lead to a bias in estimation and statistical tests. We will consider these in detail:

* Stratified random sampling classifies the population elements into strata and samples separately from each stratum. It is used for several reasons:

  * The sampling variance can be reduced if strata are internally homogeneous.

  * Separate estimates can be obtained for strata.

  * Administration of fieldwork can be organized using strata.

  * Different sampling needs can be accommodated in separate strata.

Allocation of the sample across the strata is proportionate when the sampling fraction is uniform across the strata or disproportionate when, for instance, a higher sampling fraction is applied to a smaller stratum to select a sufficient number of subjects for comparative studies. In general, the estimation process for a stratified random sample is more complicated than in SRSWOR. It is generally described as a two-step process. The first step is the calculation of the statistics—for example, the mean and its variance—separately within each stratum. These estimates are then combined based on weights reflecting the proportion of the population in each stratum. As will be discussed later, it also can be described as a one-step process using weighted statistics. The estimation simplifies in the case of proportionate stratified sampling, but the strata must be taken into account in the variance estimation. The formulation of the strata requires that information on the stratification variable(s) be available in the sampling frame. When such information is not available, stratification cannot be incorporated in the design. But stratification can be done after data are collected to improve the precision of the estimates. The so-called post-stratification is used to make the sample more representative of the population by adjusting the demographic compositions of the sample to the known population compositions. Typically, such demographic variables as age, sex, race, and education are used in post-stratification in order to take advantage of the population census data. This adjustment requires the use of weights and different strategies for variance estimation because the stratum sample size is a random variable in the post-stratified design (determined after the data are collected).
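The two-step estimation described above can be sketched as follows, assuming proportionate allocation and records given as `(stratum, value)` tuples. The function names, the 20% sampling fraction, and the toy data are all hypothetical and serve only to illustrate the idea.

```python
import random
from collections import defaultdict
from statistics import mean

def stratified_sample(records, fraction, rng):
    """Draw an SRSWOR sample with the same sampling fraction in every stratum."""
    strata = defaultdict(list)
    for stratum, value in records:
        strata[stratum].append(value)
    sizes = {s: len(values) for s, values in strata.items()}
    samples = {s: rng.sample(values, max(1, round(fraction * len(values))))
               for s, values in strata.items()}
    return samples, sizes

def stratified_mean(samples, sizes):
    """Step 1: mean within each stratum. Step 2: combine with population weights."""
    total = sum(sizes.values())
    return sum(sizes[s] / total * mean(values) for s, values in samples.items())

# Hypothetical population: 40 "young" records and 10 "senior" records.
records = [("young", 10)] * 40 + [("senior", 50)] * 10
samples, sizes = stratified_sample(records, fraction=0.2, rng=random.Random(0))
print({s: len(v) for s, v in samples.items()}, stratified_mean(samples, sizes))
```

The weights in the second step are the population shares of the strata, so a small stratum that was deliberately oversampled would not dominate the combined estimate.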

* Cluster sampling is often a practical approach to surveys because it samples by groups (clusters) of elements rather than by individual elements. It simplifies the task of constructing sampling frames, and it reduces the survey costs. Often, a hierarchy of geographical clusters is used, as described earlier. In multistage cluster sampling, the sampling units are groups of elements except for the last stage of sampling. When the numbers of elements in the clusters are equal, the estimation process is equivalent to SRSWOR. However, simple random sampling of unequal-sized clusters leads to the elements in the smaller clusters being more likely to be in the sample than those in the larger clusters. Additionally, the clusters are often stratified to accomplish certain survey objectives and field procedures, for instance, the oversampling of predominantly minority population clusters. The use of disproportionate stratification and unequal-sized clusters complicates the estimation process.

The main advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the sample size, s, rather than to the data set size, N. The other reduction techniques require at least one complete pass through the data set. When applied to data reduction, sampling is most commonly used to estimate the answer to an aggregate query. It is possible (using the central limit theorem) to determine a sufficient sample size for estimating a given function within a specified degree of error. The sample size, s, may be very small in comparison to N. Sampling is a natural choice for the progressive refinement of a reduced data set. Such a set can be further refined by simply increasing the sample size.
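The following sketch shows SRSWOR and SRSWR with Python's standard `random` module, together with the usual normal-approximation formula for the sample size needed to estimate a mean, s = (z·σ / E)². The function names and the example figures (σ = 10, margin of error E = 1, z = 1.96 for roughly 95% confidence) are assumptions made for illustration.

```python
import math
import random
from statistics import mean

def srswor(data, s, rng):
    """Simple random sample without replacement: no element is drawn twice."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample with replacement: elements may repeat."""
    return rng.choices(data, k=s)

def sample_size(sigma, margin, z=1.96):
    """Sample size so that the mean is estimated within +/- margin (normal approximation)."""
    return math.ceil((z * sigma / margin) ** 2)

rng = random.Random(42)
D = list(range(1, 10001))            # hypothetical data set with N = 10000 tuples
s = sample_size(sigma=10, margin=1)  # 385
print(s, mean(srswor(D, s, rng)), mean(srswr(D, s, rng)))
```

The required size depends on the variability of the data and the tolerated error, not on N, which is why s can be very small relative to the data set.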

### 2.4.4.5 Data Discretization and Concept Hierarchy Generation

Many real-world data mining tasks involve continuous attributes. However, many existing data mining systems cannot handle such attributes. Furthermore, even if a data mining task can handle a continuous attribute, its performance can often be improved significantly by replacing the continuous attribute with its discretized values. Data discretization is the process of converting continuous attribute values into a finite set of intervals and associating each interval with a specific data value. There are no restrictions on the discrete values associated with a given interval, except that these values must induce some ordering on the discretized attribute domain. Discretization can improve the quality of discovered knowledge and reduce the running time of data mining tasks such as association rule discovery, classification, and prediction. Some studies report a tenfold performance improvement for domains with a large number of continuous attributes, with little or no loss of accuracy. However, any discretization process generally leads to a loss of information. Thus, the goal of a good discretization algorithm is to minimize such information loss.

Discretization of continuous attributes has been studied extensively. There is a wide variety of discretization methods, ranging from naive (often referred to as unsupervised) methods such as equal-width and equal-frequency binning to much more sophisticated (often referred to as supervised) methods such as entropy-based discretization and algorithms based on Pearson's χ² or Wilks' G² statistics. Unsupervised discretization methods are not provided with class label information, whereas supervised discretization methods are supplied with a class label for each data item. Despite the wealth of literature on discretization methods, there have been very few attempts to compare them analytically. Typically, researchers compare the performance of different algorithms by reporting experimental results of running these algorithms on publicly available data sets.

Concept hierarchies are specified as trees, with the attribute values (known as base concepts) at the leaves and higher-level concepts as the interior nodes. These hierarchies embody certain implicit assumptions about the structure of the attribute domains. The primary assumption is that there is a nested sequence of equivalence relations among the leaf concepts. The first-level parent nodes represent the equivalence classes of the innermost equivalence relation. This assumption restricts hierarchies to tree structures, precluding other types of relationships among concepts.

As applied to spatial data, the hierarchical levels may illustrate spatial relationships. An example of a spatial concept hierarchy is given in Figure 2.17.

Such spatial hierarchies may be generated by consolidating adjacent spatial objects. Since spatial data contain both spatial and non-spatial features, attribute hierarchies may be provided to further aid in the extraction of general knowledge from the spatial database being examined. Although detail is lost by data generalization, the generalized data may be more meaningful and easier to interpret. Also, mining on a reduced data set requires fewer input/output operations and is more efficient than mining on a larger, ungeneralized data set. Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining as a preprocessing step, rather than during mining. An example of a concept hierarchy for the attribute price is given in Figure 2.18. More than one concept hierarchy can be defined for the same attribute in order to accommodate the needs of various users.

Manual definition of concept hierarchies can be a tedious and time-consuming task for a user or a domain expert. Fortunately, several discretization methods can be used to automatically generate or dynamically refine concept hierarchies for numerical attributes. Furthermore, many hierarchies for categorical attributes are implicit within the database schema and can be automatically defined at the schema definition level. Let's consider the generation of concept hierarchies for numerical and categorical data.

### Discretization and Concept Hierarchy Generation for Numerical Data

It is difficult and laborious to specify concept hierarchies for numerical attributes manually because of the wide diversity of possible data ranges and the frequent updates of data values. Such manual specifications can also be quite arbitrary. Concept hierarchies for numerical data can instead be constructed automatically based on data discretization. We look at the following methods:

* Binning

* Histogram analysis

* Entropy-based discretization

* Chi-square merging

* Cluster analysis

* Discretization by intuitive partitioning.

Each of these methods assumes that the values to be discretized are sorted in increasing order.

### Binning

Binning is a top-down splitting technique based on a specified number of bins. We have already discussed binning methods for data smoothing. Binning methods are also used as discretization methods for numerosity reduction and concept hierarchy generation. Binning does not use class information and is therefore an unsupervised discretization technique.
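A minimal sketch of equal-width binning used for discretization is shown below. It maps every value to the index of its bin; the function name `equal_width_bins` and the sample values are illustrative assumptions, and the code assumes that the values are not all identical (otherwise the bin width would be zero).

```python
def equal_width_bins(values, k):
    """Map each value to a bin index 0..k-1 using k equal-width bins over [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / k
    # The maximum value would land in bin k, so it is clamped into the last bin.
    return [min(int((v - low) / width), k - 1) for v in values]

values = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # hypothetical sorted attribute values
print(equal_width_bins(values, 3))
# [0, 0, 1, 1, 1, 2, 2, 2, 2]
```

Applying the same function again to the values inside one bin would produce the next, finer level of a concept hierarchy.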

### Histogram Analysis

Like binning, histogram analysis is an unsupervised discretization technique because it does not use class information. Histograms partition the values of an attribute, A, into disjoint ranges called buckets. We discussed histograms in the previous section.

### Entropy-Based Discretization

Entropy is one of the most commonly used discretization measures. Entropy-based discretization is a supervised, top-down splitting technique. It explores class distribution information in its calculation and determination of split-points. The method selects the value of an attribute A that has the minimum entropy as a split-point and recursively partitions the resulting intervals to arrive at a hierarchical discretization. This discretization forms a concept hierarchy for A.

Let D be a data set defined by a set of attributes and a class-label attribute. The class-label attribute provides the class information for each tuple. The method for entropy-based discretization of an attribute A is as follows:

1. To partition the range of attribute A, each value of A can be considered as a potential interval boundary or split-point (denoted split point). That is, a split-point for A can partition the tuples in D into two subsets satisfying the conditions A ≤ split point and A > split point, respectively, thereby creating a binary discretization.

2. Entropy-based discretization uses the class labels of tuples. To understand the intuition behind it, consider classification. Suppose we want to classify the tuples in D by partitioning on attribute A and some split-point. Ideally, we would like this partitioning to result in an exact classification of the tuples. For instance, if we had two classes, we would hope that all of the tuples of class C1 fall into one partition, and all of the tuples of class C2 fall into the other partition. However, this is unlikely. For example, the first partition may contain many tuples of C1, but also some of C2. How much more information would we still need for a perfect classification after this partitioning? This amount is called the expected information requirement. It is given by

$$\mathrm{Info}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{Entropy}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Entropy}(D_2)$$

where

* D1 and D2 correspond to the tuples in D satisfying the conditions A ≤ split point and A > split point, respectively;

* |D| is the number of tuples in D, and so on.

The entropy function for a given set is calculated based on the class distribution of the tuples in the set. For example, given m classes, C1, C2, ..., Cm, the entropy of D1 is

$$\mathrm{Entropy}(D_1) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

where pi is the probability of class Ci in D1, determined by dividing the number of tuples of class Ci in D1 by |D1|, the total number of tuples in D1. Therefore, when selecting a split-point for attribute A, we want to pick the attribute value that gives the minimum expected information requirement (i.e., min(InfoA(D))). This results in the minimum amount of expected information (still) required to perfectly classify the tuples after partitioning by A ≤ split point and A > split point. This is equivalent to choosing the attribute-value pair with the maximum information gain. Note that the value of Entropy(D2) can be computed in the same way as Entropy(D1).

"But our task is discretization, not classification!" you may exclaim. This is true. We use the split-point to partition the range of A into two intervals, corresponding to A ≤ split point and A > split point.

3. The process of determining a split-point is recursively applied to each partition obtained, until some stopping criterion is met, such as when the minimum information requirement on all candidate split-points is less than a small threshold, ε, or when the number of intervals is greater than a threshold, max interval.

Entropy-based discretization can reduce data size. Unlike the other methods mentioned here so far, entropy-based discretization uses class information. This makes it more likely that the interval boundaries (split-points) are defined to occur in places that may help improve classification accuracy. The entropy and information gain measures described here are also used for decision tree induction.
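The selection of a single split-point can be sketched as follows. This is a minimal illustration of steps 1 and 2 only (the recursion of step 3 is omitted), and the function names `entropy` and `best_split` as well as the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a set of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, Info_A(D)) with the minimum expected information."""
    pairs = list(zip(values, labels))
    n = len(pairs)
    best = None
    # The largest value is skipped because it would leave D2 empty.
    for split in sorted(set(values))[:-1]:
        d1 = [c for v, c in pairs if v <= split]
        d2 = [c for v, c in pairs if v > split]
        info = len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)
        if best is None or info < best[1]:
            best = (split, info)
    return best

# Hypothetical attribute values with two classes that separate cleanly at A <= 3.
values = [1, 2, 3, 10, 11, 12]
labels = ["C1", "C1", "C1", "C2", "C2", "C2"]
split_point, info = best_split(values, labels)
print(split_point)   # 3
```

Applying `best_split` again to the tuples on each side of the chosen split-point, until one of the stopping criteria holds, would yield the hierarchical discretization described above.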

### Interval Merging by χ² Analysis

ChiMerge is a χ²-based discretization method. The discretization methods studied so far all employ a top-down, splitting strategy. ChiMerge differs in that it uses a bottom-up approach: it finds the best neighboring intervals and merges them to form larger intervals, recursively. Because it uses class information, the method is supervised. The basic idea is that, for accurate discretization, the relative class frequencies should be fairly consistent within an interval. Therefore, if two adjacent intervals have a very similar distribution of classes, they can be merged; otherwise, they should remain separate.

ChiMerge proceeds as follows:

* Initially, each distinct value of a numerical attribute A is considered to be one interval.

* χ² tests are performed for every pair of adjacent intervals.

* The adjacent intervals with the lowest χ² value are merged, because a low χ² value for a pair indicates similar class distributions.

This merging process proceeds recursively until a predefined stopping criterion is met.

The stopping criterion is typically determined by three conditions.

1. First, merging stops when the χ² values of all pairs of adjacent intervals exceed some threshold, which is determined by a specified significance level. A very high significance level for the χ² test may cause overdiscretization, whereas a very low value may lead to underdiscretization. Typically, the significance level is set between 0.10 and 0.01.

2. The number of intervals cannot exceed a prespecified max-interval, such as 10 to 15.

3. The basis of ChiMerge is that the relative class frequencies should be fairly consistent within an interval. In practice, some inconsistency is allowed, although this should be no more than a prespecified threshold, such as 3%, which may be estimated from the training data.

This last condition can be used to remove irrelevant attributes from the data set.
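The merging loop can be sketched as below. This is a simplified illustration, not a reference implementation: it applies only the first two stopping conditions, skips cells whose expected count is zero, and uses 2.706 as the threshold (the χ² critical value for one degree of freedom at the 0.10 significance level, which fits the two-class toy data). The names `chi2` and `chimerge` are hypothetical.

```python
from collections import Counter

def chi2(a, b):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    classes = set(a) | set(b)
    n = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / n
            if expected > 0:               # simplification: skip empty cells
                stat += (row[c] - expected) ** 2 / expected
    return stat

def chimerge(values, labels, threshold, max_intervals):
    """Return the lower bounds of the intervals that remain after merging."""
    intervals = []                          # each entry: [lower_bound, class counts]
    for v, c in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][c] += 1
        else:
            intervals.append([v, Counter({c: 1})])
    while len(intervals) > 1:
        scores = [chi2(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] > threshold and len(intervals) <= max_intervals:
            break                           # all adjacent pairs differ significantly
        intervals[i][1] += intervals[i + 1][1]   # merge the most similar pair
        del intervals[i + 1]
    return [low for low, _ in intervals]

values = [1, 2, 3, 10, 11, 12]              # hypothetical data
labels = ["a", "a", "a", "b", "b", "b"]
print(chimerge(values, labels, threshold=2.706, max_intervals=6))
# [1, 10]
```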

### Cluster Analysis

Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numerical attribute, A, by partitioning the values of A into clusters or groups. Clustering takes the distribution of A into consideration, as well as the closeness of data points, and is therefore able to produce high-quality discretization results. Clustering can be used to generate a concept hierarchy for A by following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the concept hierarchy. In the former, each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy. In the latter, clusters are formed by repeatedly grouping neighboring clusters in order to form higher-level concepts. Clustering methods for data mining are studied further in subsequent chapters.

### Discretization by Intuitive Partitioning

Although the above discretization methods are useful in the generation of numerical hierarchies, many users would like to see numerical ranges partitioned into relatively uniform, easy-to-read intervals that appear intuitive or "natural." For example, annual salaries broken into ranges like ($50,000, $60,000] are often more desirable than ranges like ($51,263.98, $60,872.34], obtained by, say, some sophisticated clustering analysis. The 3-4-5 rule can be used to segment numerical data into relatively uniform, natural-seeming intervals. In general, the rule partitions a given range of data into 3, 4, or 5 relatively equal-width intervals, recursively and level by level, based on the value range at the most significant digit. We will illustrate the use of the rule with an example further below. The rule is as follows:

* If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then partition the range into 3 intervals (3 equal-width intervals for 3, 6, and 9; and 3 intervals in the grouping of 2-3-2 for 7).

* If it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range into 4 equal-width intervals.

* If it covers 1, 5, or 10 distinct values at the most significant digit, then partition the range into 5 equal-width intervals.

The rule can be recursively applied to each interval, creating a concept hierarchy for the given numerical attribute. Real-world data often contain extremely large positive and/or negative outlier values, which could distort any top-down discretization method based on minimum and maximum data values. For example, the assets of a few people could be several orders of magnitude higher than those of others in the same data set. Discretization based on the maximal asset values may lead to a highly biased hierarchy. Thus the top-level discretization can be performed based on the range of data values representing the majority (e.g., 5th percentile to 95th percentile) of the given data. The extremely high or low values beyond the top-level discretization will form distinct interval(s) that can be handled separately, but in a similar manner. The following example illustrates the use of the 3-4-5 rule for the automatic construction of a numerical hierarchy.
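As a rough sketch (not the worked example of the original text), one level of the 3-4-5 rule can be coded as follows. The function name `partition_345` and the bounds used are hypothetical, and the code assumes integer bounds that have already been rounded at the most significant digit, for instance after trimming the data to the 5th and 95th percentiles.

```python
def partition_345(low, high):
    """Return the interval boundaries for one level of the 3-4-5 rule."""
    span = high - low
    if span <= 0:
        raise ValueError("high must be greater than low")
    unit = 10 ** (len(str(span)) - 1)        # one step at the most significant digit
    distinct, remainder = divmod(span, unit)
    if remainder:
        raise ValueError("bounds must be rounded at the most significant digit")
    if distinct == 7:                        # special 2-3-2 grouping
        return [low + step * unit for step in (0, 2, 5, 7)]
    if distinct in (3, 6, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                    # 1 or 5 (a span of 10 units counts as 1)
        parts = 5
    return [low + i * span / parts for i in range(parts + 1)]

print(partition_345(-1000000, 2000000))   # 3 distinct values -> 3 intervals
print(partition_345(0, 1000000))          # 1 distinct value  -> 5 intervals
print(partition_345(0, 700))              # 7 distinct values -> 2-3-2 grouping
```

Calling the function again on each resulting interval gives the next level of the hierarchy.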

### Concept Hierarchy Generation for Categorical Data

Categorical data are discrete data. Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values. Examples include geographic location, job category, and item type. There are several methods for the generation of concept hierarchies for categorical data.

Specification of a partial ordering of attributes explicitly at the schema level by users or experts: Concept hierarchies for categorical attributes or dimensions typically involve a group of attributes. A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level. For example, a relational database or a dimension location of a data warehouse may contain the following group of attributes: street, city, province or state, and country. A hierarchy can be defined by specifying the total ordering among these attributes at the schema level, such as street < city < province or state < country.

Specification of a portion of a hierarchy by explicit data grouping: This is essentially the manual definition of a portion of a concept hierarchy. In a large database, it is unrealistic to define an entire concept hierarchy by explicit value enumeration. However, we can easily specify explicit groupings for a small portion of intermediate-level data. For example, after specifying that province and country form a hierarchy at the schema level, a user could define some intermediate levels manually, such as "{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada" and "{British Columbia, prairies Canada} ⊂ Western Canada".

Specification of a set of attributes, but not of their partial ordering: A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy. “Without knowledge of data semantics, how can a hierarchical ordering for an arbitrary set of categorical attributes be found?” Consider the following observation that since higher-level concepts generally cover several subordinate lower-level concepts, an attribute defining a high concept level (e.g., country) will usually contain a smaller number of distinct values than an attribute defining a lower concept level (e.g., street). Based on this observation, a concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy. The lower the number of distinct values an attribute has, the higher it is in the generated concept hierarchy. This heuristic rule works well in many cases. Some local-level swapping or adjustments may be applied by users or experts, when necessary, after examination of the generated hierarchy. Let's examine an example of this method.
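The distinct-value heuristic described above can be sketched in a few lines. The function name `generate_hierarchy`, the attribute names, and the sample rows are hypothetical; the result lists the attributes from the highest concept level to the lowest.

```python
def generate_hierarchy(rows, attributes):
    """Order attributes by their number of distinct values, fewest first (top level)."""
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct[a])

rows = [  # hypothetical location tuples
    {"street": "1 Main St", "city": "Calgary", "province_or_state": "Alberta", "country": "Canada"},
    {"street": "9 Elm St", "city": "Calgary", "province_or_state": "Alberta", "country": "Canada"},
    {"street": "2 Oak Ave", "city": "Edmonton", "province_or_state": "Alberta", "country": "Canada"},
    {"street": "7 Pine Rd", "city": "Regina", "province_or_state": "Saskatchewan", "country": "Canada"},
    {"street": "5 Lake Dr", "city": "Chicago", "province_or_state": "Illinois", "country": "USA"},
    {"street": "3 Bay St", "city": "Springfield", "province_or_state": "Illinois", "country": "USA"},
]
print(generate_hierarchy(rows, ["street", "city", "province_or_state", "country"]))
# ['country', 'province_or_state', 'city', 'street']
```

Because this is only a heuristic, the generated ordering should still be reviewed by a user or expert, as noted above.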