# The K Nearest Neighbor Classification Study English Language Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

K-nearest-neighbor (KNN) classification is one of the most fundamental and simple classification methods and is one of the first choices for a classification study when there is little or no prior knowledge about the distribution of the data. K-nearest-neighbor classification was developed from the need to perform discriminant analysis when reliable parametric estimates of probability densities are unknown or difficult to determine.

## BETWEEN-SAMPLE GEOMETRIC DISTANCE

The k-nearest-neighbor classifier is commonly based on the Euclidean distance between a test sample and the specified training samples. Let {\textbf x}_ibe an input sample with pfeatures (x_{i1}, x_{i2}, \ldots, x_{ip}), nbe the total number of input samples (i=1,2,\ldots,n) and pthe total number of features (j=1,2,\ldots,p). The Euclidean distance between sample {\textbf x}_iand {\textbf x}_l(l=1,2,\ldots,n) is defined as:

d({\textbf x}_i,{\textbf x}_l) = \sqrt{(x_{i1}-x_{l1})^2 + (x_{i2}-x_{l2})^2 + \cdots + (x_{ip}-x_{lp})^2} .

## CLASSIFICATION DECISION RULE AND CONFUSION MATRIX

Classification typically involves partitioning samples into training and testing categories. Let \textbf{x}_ibe a training sample and \textbf{x}be a test sample, and let \omegabe the true class of a training sample and \hat{\omega}be the predicted class for a test sample (\omega,\hat{\omega}=1,2,\ldots,\Omega). Here, \Omegais the total number of classes.

During the training process, we use only the true class \omegaof each training sample to train the classifier, while during testing we predict the class \hat{\omega}of each test sample. It warrants noting that KNN is a "supervised" classification method in that it uses the class labels of the training data. Unsupervised classification methods, or "clustering" methods, on the other hand, do not employ the class labels of the training data.

With 1-nearest neighbor rule, the predicted class of test sample \textbf{x}is set equal to the true class \omegaof its nearest neighbor, where \textbf{m}_iis a nearest neighbor to \textbf{x}if the distance

d( \textbf{m}_i, \textbf{x})={\min}_j\{ d(\textbf{m}_j, \textbf{x})\}.

For k-nearest neighbors, the predicted class of test sample \textbf{x}is set equal to the most frequent true class among knearest training samples. This forms the decision rule D:\textbf{x}\rightarrow \hat{\omega}.

The confusion matrix used for tabulating test sample class predictions during testing is denoted as \textbf Cand has dimensions \Omega \times \Omega. During testing, if the predicted class of test sample \textbf{x}is correct (i.e., \hat{\omega}=\omega), then the diagonal element c_{\omega \omega}of the confusion matrix is incremented by 1. However, if the predicted class is incorrect (i.e., \hat{\omega} \neq \omega), then the off-diagonal element c_{\omega \hat{\omega}}is incremented by 1. Once all the test samples have been classified, the classification accuracy is based on the ratio of the number of correctly classified samples to the total number of samples classified, given in the form

Acc = \frac{\sum_{\omega} ^ {\Omega} c_{\omega \omega}} {n_{total}}

where c_{\omega \omega}is a diagonal element of \textbf Cand n_{total}is the total number of samples classified.

## EXAMPLE

Consider a machine learning study to classify 19 samples with 2 features, Xand Y. Table 1 lists the pair-wise Euclidean distance between the 19 samples. Assume that sample {\textbf x}_{11}is being used as a test sample while all remaining samples are used for training. For k=4, the four samples closest to sample {\textbf x}_{11}are sample {\textbf x}_{10}(blue class label), sample {\textbf x}_{12}(red class label), {\textbf x}_{13}(red class label), and {\textbf x}_{14}(red class label).

Table 1. Euclidean distance matrix D listing all possible pairwise Euclidean distances between 19 samples.

{\textbf x}_1

{\textbf x}_2

{\textbf x}_3

{\textbf x}_4

{\textbf x}_5

{\textbf x}_6

{\textbf x}_7

{\textbf x}_8

{\textbf x}_9

{\textbf x}_{10}

{\textbf x}_{11}

{\textbf x}_{12}

{\textbf x}_{13}

{\textbf x}_{14}

{\textbf x}_{15}

{\textbf x}_{16}

{\textbf x}_{17}

{\textbf x}_{18}

{\textbf x}_2

1.5

{\textbf x}_3

1.4

1.6

{\textbf x}_4

1.6

1.4

1.3

{\textbf x}_5

1.7

1.4

1.5

1.5

{\textbf x}_6

1.3

1.4

1.4

1.5

1.4

{\textbf x}_7

1.6

1.3

1.4

1.4

1.5

1.8

{\textbf x}_8

1.5

1.4

1.6

1.3

1.7

1.6

1.4

{\textbf x}_9

1.4

1.3

1.4

1.5

1.2

1.4

1.3

1.5

{\textbf x}_{10}

2.3

2.4

2.5

2.3

2.6

2.7

2.8

2.7

3.1

{\textbf x}_{11}

2.9

2.8

2.9

3.0

2.9

3.1

2.9

3.1

3.0

1.5

{\textbf x}_{12}

3.2

3.3

3.2

3.1

3.3

3.4

3.3

3.4

3.5

3.3

1.6

{\textbf x}_{13}

3.3

3.4

3.2

3.2

3.3

3.4

3.2

3.3

3.5

3.6

1.4

1.7

{\textbf x}_{14}

3.4

3.2

3.5

3.4

3.7

3.5

3.6

3.3

3.5

3.6

1.5

1.8

0.5

{\textbf x}_{15}

4.2

4.1

4.1

4.1

4.1

4.1

4.1

4.1

4.1

4.1

1.7

1.6

0.3

0.5

{\textbf x}_{16}

4.1

4.1

4.1

4.1

4.1

4.1

4.1

4.1

4.1

4.1

1.6

1.5

0.4

0.5

0.4

{\textbf x}_{17}

5.9

6.2

6.2

5.8

6.1

6.0

6.1

5.9

5.8

6.0

2.3

2.3

2.5

2.3

2.4

2.5

{\textbf x}_{18}

6.1

6.3

6.2

5.8

6.1

6.0

6.1

5.9

5.8

6.0

3.1

2.7

2.6

2.3

2.5

2.6

3.0

{\textbf x}_{19}

6.0

6.1

6.2

5.8

6.1

6.0

6.1

5.9

5.8

6.0

3.0

2.9

2.7

2.4

2.5

2.8

3.1

0.4

Figure 2 shows an X-Y scatterplot of the 19 samples plotted as a function of their Xand Yvalues. One can notice that among the four samples closest to test sample {\textbf x}_{11}(labeled green), 3/4 of the class labels are for Class A (red color), and therefore, the test sample is assigned to Class A.

http://www.scholarpedia.org/wiki/images/thumb/1/13/Knn_sample_plot.png/300px-Knn_sample_plot.png

http://www.scholarpedia.org/wiki/skins/common/images/magnify-clip.png

X-Y Scatterplot of the 19 samples for which pair-wise Euclidean distances are listed in Table 1. Among the 4 nearest neighbors of the test sample, the most frequent class label colour is red, and thus the test sample is assigned to the red class.

## PERFORMANCE ASSESSMENT WITH CROSS-VALIDATION

A basic rule in classification analysis is that class predictions are not made for data samples that are used for training or learning. If class predictions are made for samples used in training or learning, the accuracy will be artificially biased upward. Instead, class predictions are made for samples that are kept out of training process.

The performance of most classifiers is typically evaluated through cross-validation, which involves the determination of classification accuracy for multiple partitions of the input samples used in training. For example, during 5-fold (\kappa=5)cross-validation training, a set of input samples is split up into 5 partitions \mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_5having equal sample sizes to the extent possible. The notion of ensuring uniform class representation among the partitions is called stratified cross-validation, which is preferred. To begin, for 5-fold cross-validation, samples in partitions \mathcal{D}_2, \mathcal{D}_3, \ldots, \mathcal{D}_5are first used for training while samples in partition \mathcal{D}_1are used for testing. Next, samples in groups \mathcal{D}_1, \mathcal{D}_3, \ldots, \mathcal{D}_5are used for training and samples in partition \mathcal{D}_2used for testing. This is repeated until each partitions have been used singly for testing. It is also customary to re-partition all of the input samples e.g. 10 times in order to get a better estimate of accuracy.

The algorithm given on the next page was applied to several commonly used data sets where the fold value varied in order to assesââ‚¬â„¢ performance (accuracy) as a function of the size of the cross validation partitions.

## ALGORITHM

begin

initialize the n \times ndistance matrix \textbf{D}, initialize the \Omega \times \Omegaconfusion matrix \textbf{C}, set t \leftarrow 0, TotAcc \leftarrow 0, and set NumIterationsequal to the desired number of iterations (re-partitions).

calculate distances between all the input samples and store in n \times nmatrix \textbf{D}. (For a large number of samples, use only the lower or upper triangular of \textbf{D}for storage since it is a square symmetric matrix.)

for t \leftarrow 1 to NumIterationsdo

set \textbf{C} \leftarrow 0, and n_{total} \leftarrow 0.

partition the input samples into \kappaequally-sized groups.

assign samples in the foldth partition to testing, and use the remaining samples for training.

Set the number of samples used for testing as n_{test}.

set n_{total} \leftarrow n_{total}+n_{test}.

for i \leftarrow 1 to n_{test}do

for test sample \textbf{x}_idetermine the kclosest training samples based on the calculated distances.

determine \hat{\omega}, the most frequent class label among the kclosest training samples.

increment confusion matrix \textbf{C}by 1 in element c_{\omega,\hat{\omega}}, where \omegais the true and \hat{\omega}the predicted class label for test sample \textbf{x}_i. If \omega =\hat{\omega}then the increment of +1will occur on the diagonal of the confusion matrix, otherwise, the increment will occur in an off-diagonal.

determine the classification accuracy using Acc = \frac{\sum_j^{\Omega}c_{jj}}{n_{total}}where c_{jj}is a diagonal element of the confusion matrix \textbf{C}.

calculate TotAcc = TotAcc + Acc.

calculate AvgAcc = TotAcc/NumIterations

end