# High-Dimensional Regularized Discriminant Analysis


Abstract

Regularized discriminant analysis (RDA) is a widely used classifier that determines which features discriminate between two or more groups. Despite its popularity and generality, however, RDA becomes impractical and loses interpretability in classification problems with high-dimensional, small-sample data, where the number of features far exceeds the training sample size. [1] To address this flaw, high-dimensional regularized discriminant analysis (HDRDA) is introduced. The performance and computational runtime of HDRDA are analyzed by applying HDRDA and other traditional classifiers to six real high-dimensional datasets. HDRDA is shown to be superior to multiple sparse and regularized classifiers in both classification accuracy and computational cost, especially as the number of features increases.

Keywords: Regularized discriminant analysis, High-dimensional classification, dimensionality reduction

## 1.  Introduction

The focus of this research paper is on classification scenarios involving small-sample, high-dimensional datasets, in which the number of features $p$ far exceeds the training sample size $N$, that is, $N \ll p$. In this scenario, discriminant analysis classifiers, namely linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), become incalculable because the class and pooled covariance-matrix estimators are singular, so their matrix inverses do not exist. Introducing a weighted average of the class and pooled covariance matrices together with a regularization component yields regularized discriminant analysis (RDA), which estimates the class covariance matrices more accurately and stabilizes their inverses. This, however, causes the RDA classifier to lose interpretability, and it does not resolve the impracticality of RDA for high-dimensional datasets: the large number of features causes the number of computations to grow at a polynomial rate, making matrix calculations extremely expensive. Tuning the parameters of RDA requires the matrix inverse and determinant of each class covariance matrix to be computed across multiple cross-validation folds for each candidate tuning-parameter pair, which only intensifies the computational burden.

To address these flaws, the high-dimensional regularized discriminant analysis (HDRDA) classifier is introduced. The RDA classifier is reparametrized, a biased covariance-matrix estimator is employed, and the resulting estimator is shrunk toward a scaled identity matrix to obtain positive definiteness. The pooling parameter introduced in the HDRDA classifier preserves interpretability by identifying how each training observation contributes to the estimation of each class covariance matrix.

## 2.  Preliminaries

2.1   Discriminant Analysis

Discriminant analysis is a classifier in which two or more groups or populations are known a priori and new observations are classified into one of the known groups based on their measured characteristics. The procedure starts with training data of known group membership, where each population $\pi_i$ has a prior probability $p_i$ representing the expected proportion of observations that belong to it. There are three common choices for defining the priors:

Case 1: We assume equal priors, $\hat{p}_i = \frac{1}{g}$, where $g$ is the number of known groups in the training data, if all population sizes are expected to be equal.

Case 2: We select arbitrary priors to represent the relative population sizes, subject only to $\hat{p}_1 + \hat{p}_2 + \dots + \hat{p}_g = 1$.

Case 3: We estimate the priors from the ratio of the number of observations $n_i$ from population $\pi_i$ in the training data to the total sample size, $\hat{p}_i = n_i / N$, which requires that $n_1 + n_2 + \dots + n_g = N$.
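As a concrete illustration, the three choices of priors can be sketched in Python; the helper names below are my own, not from the paper:

```python
import numpy as np

def equal_priors(g):
    """Case 1: equal priors, p_i = 1/g for each of the g groups."""
    return np.full(g, 1.0 / g)

def estimated_priors(counts):
    """Case 3: priors estimated from group sizes, p_i = n_i / N."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

# Case 2 simply fixes arbitrary weights that sum to one, e.g. [0.2, 0.3, 0.5].
print(equal_priors(4))                 # four groups, each with prior 0.25
print(estimated_priors([10, 30, 60]))  # N = 100, priors 0.1, 0.3, 0.6
```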

Once the priors are identified, Bartlett's test is used to determine whether the variance-covariance matrices are homogeneous across the population groups. This determines whether linear or quadratic discriminant analysis should be used.

Case 1: If the variance-covariance matrices are homogeneous, so that

${\Sigma}_1 = {\Sigma}_2 = \dots = {\Sigma}_g = \Sigma$

then the variance-covariance matrix does not depend on the population, and linear discriminant analysis is used.

Case 2: If the variance-covariance matrices are heterogeneous, so that $\Sigma_i \ne \Sigma_j$ for some pair of populations $i \ne j$, then the variance-covariance matrices depend on the population, and quadratic discriminant analysis is used.

Once the method to be used is identified, the parameters of the conditional probability density function $f(\mathbf{x} \mid \pi_i)$ are estimated under the assumptions that the data from group $i$ share a common mean vector $\mu_i$ and a common variance-covariance matrix $\Sigma$, and that the observations are independently sampled from a multivariate normal distribution.

Then, discriminant functions are computed to classify a new observation into one of the known population groups, and cross-validation is used to estimate the probability of misclassification.

2.2   Linear Discriminant Analysis (LDA)

Discriminant analysis estimates the prior probabilities, population means, and variance-covariance matrix from the training data: the population means and the variance-covariance matrix are estimated by the sample mean vectors and the pooled sample variance-covariance matrix, respectively.

If the variance-covariance matrices are homogeneous as in Case 1 above, the probability density function of $\mathbf{x}$ in population $\pi_i$ is, by assumption, multivariate normal with mean vector $\mu_i$ and variance-covariance matrix $\Sigma$:

$f(\mathbf{x} \mid \pi_i) = \frac{1}{(2\pi)^{p/2} \left|\Sigma\right|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \mu_i)^T \Sigma^{-1} (\mathbf{x} - \mu_i) \right]$

Each observation is classified into the population for which $p_i\, f(\mathbf{x} \mid \pi_i)$ is largest.

In linear discriminant analysis, the linear score function is used to make decisions. It is a function of the pooled variance-covariance matrix and the population mean $\mu_i$ for each of the $g$ populations:

$s_i^{L}(\mathbf{x}) = -\frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \mu_i^T \Sigma^{-1} \mathbf{x} + \log p_i = d_{i0} + \sum_{j=1}^{p} d_{ij} x_j + \log p_i$

where $d_{i0} = -\frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i$ and $d_{ij}$ is the $j$-th element of $\mu_i^T \Sigma^{-1}$.

Dropping the prior term gives the linear discriminant function

$d_i^{L}(\mathbf{x}) = -\frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \mu_i^T \Sigma^{-1} \mathbf{x} = d_{i0} + \sum_{j=1}^{p} d_{ij} x_j$

Sample units are classified into the population group with the largest linear score function; in other words, they are classified into the population for which the posterior probability of membership is maximized.
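To make the decision rule concrete, here is a minimal sketch of the linear score function with sample estimates plugged in; the function name and data are illustrative, not from the paper:

```python
import numpy as np

def linear_scores(x, means, pooled_cov_inv, priors):
    """Linear scores s_i(x) = -0.5 * mu_i' S^-1 mu_i + mu_i' S^-1 x + log p_i.
    The observation x is assigned to the group with the largest score."""
    scores = []
    for mu, p in zip(means, priors):
        a = pooled_cov_inv @ mu              # S^-1 mu_i, shared by both terms
        scores.append(-0.5 * mu @ a + a @ x + np.log(p))
    return np.array(scores)

# Two 2-D groups sharing an identity pooled covariance matrix.
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
scores = linear_scores(np.array([3.5, 4.2]), means, np.eye(2), [0.5, 0.5])
label = int(np.argmax(scores))  # -> 1: the point is closest to the (4, 4) group
```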

For heterogeneous variance-covariance matrices, the matrices depend on the population, and the quadratic score function is used for quadratic discriminant analysis:

$s_i^{Q}(\mathbf{x}) = -\frac{1}{2}\log\left|\Sigma_i\right| - \frac{1}{2}(\mathbf{x} - \mu_i)^T \Sigma_i^{-1} (\mathbf{x} - \mu_i) + \log p_i$

In this equation, the mean vectors and variance-covariance matrices are unknown and are therefore replaced by their estimates from the training data, which gives

$\hat{s}_i^{Q}(\mathbf{x}) = -\frac{1}{2}\log\left|S_i\right| - \frac{1}{2}(\mathbf{x} - \bar{\mathbf{x}}_i)^T S_i^{-1} (\mathbf{x} - \bar{\mathbf{x}}_i) + \log p_i$

As in the classification decision rule for linear discriminant analysis, sample units are classified into the population group for which the quadratic score function is maximized.

Equivalently, with equal priors, the QDA classifier is defined as

$D_{\mathit{QDA}}(\mathbf{x}) = \arg\min_k \, (\mathbf{x} - \bar{\mathbf{x}}_k)^T \hat{\Sigma}_k^{-1} (\mathbf{x} - \bar{\mathbf{x}}_k) + \log\left|\hat{\Sigma}_k\right|$

where $\hat{\Sigma}_k$ is the maximum-likelihood estimator (MLE) of $\Sigma_k$ and $\hat{\Sigma} = N^{-1}\sum_{k=1}^{K} n_k \hat{\Sigma}_k$ is the MLE of $\Sigma$.
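A corresponding sketch of the quadratic score with class-specific covariance estimates; the data and helper names are again my own:

```python
import numpy as np

def quadratic_score(x, mean, cov, prior):
    """s_i(x) = -0.5 log|S_i| - 0.5 (x - xbar_i)' S_i^-1 (x - xbar_i) + log p_i."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)   # stable log-determinant
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(cov, diff) + np.log(prior)

def qda_classify(x, means, covs, priors):
    """Assign x to the class whose quadratic score is largest."""
    scores = [quadratic_score(x, m, c, p) for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))

# Two groups whose covariance matrices differ, so QDA rather than LDA applies.
means = [np.zeros(2), np.array([3.0, 0.0])]
covs = [np.eye(2), 4.0 * np.eye(2)]
k = qda_classify(np.array([2.9, 0.1]), means, covs, [0.5, 0.5])  # -> 1
```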

## 3.  System Overview and Methodology

3.1   Methodology

The performance of the HDRDA classifier is compared to that of other classifiers on two metrics: average classification error rate and running time, using six high-dimensional datasets.

The runtimes of the classifiers are measured and plotted to observe how much improvement HDRDA provides over the RDA classifier.

3.2   Software Package Description

R is an open-source programming language and environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing and widely used by statisticians and researchers across disciplines for analyzing data and developing statistical software. Version 3.3.1 of R is used to conduct this experiment. The penalizedLDA package is used to implement the penalized linear discriminant analysis of Witten and Tibshirani (2011). [5] The sparsediscrim package is used to implement two variants of diagonal linear discriminant analysis that use an improved mean estimator and an improved variance estimator, respectively. The random forest classifier from the randomForest package is used as a benchmark against which the other classifiers are compared.

3.3   Datasets

3.3.1   Chiaretti et al. (2004) Dataset

The Chiaretti dataset contains the gene expression profiles of acute lymphoblastic leukemia (ALL) patients, obtained from 128 patients using Affymetrix human 95Av2 arrays. 33 samples containing more than 90% blast cells were used for gene-expression analysis, and leukemia samples from another 18 patients were used to test the three-gene model developed by gene-expression profiling. [2]

3.3.2   Chowdary et al. (2006) Dataset

The Chowdary dataset includes 52 matched pairs of colon and breast tumor tissues profiled using Affymetrix U133A arrays and ribonucleic acid (RNA) amplification. Matched frozen and RNAlater preservative suspension tissues were obtained from Genomics Collaborative Inc. (Cambridge, MA) and Proteogenex (Los Angeles, CA) after prospective collection by multiple agencies with institutional review board approval. Rapidly frozen samples (but not RNAlater-preserved tissues) were confirmed by pathology to contain at least 70% tumor cell content. Each RNAlater-preserved tissue is directly adjacent to its rapidly frozen counterpart, with the portion taken for pathological validation located between the two, so the frozen and RNAlater-preserved tissues form mirror-image pairs. [3]

3.3.3   Nakayama et al. (2007) Dataset

For this dataset, 105 gene expression samples were obtained from 10 types of soft-tissue tumors through an oligonucleotide microarray. This includes samples from synovial sarcoma (16 samples), myxoid/round cell liposarcoma (19 samples), lipoma (3 samples), well-differentiated liposarcoma (3 samples), dedifferentiated liposarcoma (15 samples), myxofibrosarcoma (15 samples), leiomyosarcoma (6 samples), malignant nerve sheath tumor (3 samples), fibrosarcoma (4 samples), and malignant fibrous histiocytoma (21 samples). [4] Following Witten and Tibshirani (2011), we used the five tumor types having at least 15 observations for the analysis. [5]

3.3.4   Shipp et al. (2002) Dataset

Shipp et al. note that diffuse large B-cell lymphoma (DLBCL) is the most common lymphoid malignancy in adults and is curable in fewer than 50% of patients. Prognostic models based on pretreatment features, such as the International Prognostic Index (IPI), are currently used to predict the outcome of DLBCL. Customized cDNA ("lymphochip") microarrays were used to take 6817 gene-expression measurements from 58 DLBCL samples in order to study the effectiveness of cyclophosphamide, adriamycin, vincristine, and prednisone-based chemotherapy in cancer patients. 32 of the samples represent cured cases, and the remaining 26 originate from patients with fatal or refractory disease. [6]

3.3.5   Singh et al. (2002) Dataset

The Singh dataset includes 235 radical prostatectomy specimens collected between 1995 and 1997. Oligonucleotide microarrays containing probes for approximately 12,600 genes and expressed sequence tags were used to profile the samples. 102 of the 235 specimens were deemed high quality in terms of interpretability, 52 of which represent prostate tumor cases and the remaining 50, non-tumor prostate samples. [7]

3.3.6   Tian et al. (2003) Dataset

Affymetrix U95Av2 microarrays were used to extract expression profiles for 2625 genes from the plasma cells of multiple myeloma patients, and molecular determinants of osteolytic lesions were identified by subjecting the plasma cells to biochemical and immunohistochemical tests. Magnetic resonance imaging (MRI) detected focal bone lesions in 136 of the myeloma patients and detected no lesions in the remaining 36. [8]

3.4   Definition of HDRDA

The high-dimensional regularized discriminant analysis classifier is defined through the covariance-matrix estimator $\hat{\Sigma}_k(\lambda)$ and its interpretation as a linear combination of the cross-products of the training observations. The convex combination is defined as

$\hat{\Sigma}_k(\lambda) ≔ (1 - \lambda)\hat{\Sigma}_k + \lambda\hat{\Sigma}$

for $k = 1, \dots, K$, where $\lambda \in [0, 1]$ is the pooling parameter. The value of $\lambda$ determines the degree to which the estimate of each class covariance matrix is shrunk toward the pooled estimate: $\lambda = 0$ yields the class-specific estimators used in QDA, while $\lambda = 1$ yields the pooled estimator.

Centering each observation $x_i$ by its class sample mean and expressing the convex combination in terms of the $x_i$, we obtain

$\hat{\Sigma}_k(\lambda) = \left(1 - \lambda + \frac{\lambda n_k}{N}\right)\hat{\Sigma}_k + \frac{\lambda}{N}\sum_{k' \ne k} n_{k'}\hat{\Sigma}_{k'}$

$= \left(\frac{1 - \lambda}{n_k} + \frac{\lambda}{N}\right)\sum_{i=1}^{N} I(y_i = k)\, x_i x_i^T + \frac{\lambda}{N}\sum_{i=1}^{N} I(y_i \ne k)\, x_i x_i^T$

$= \sum_{i=1}^{N} c_{ik}(\lambda)\, x_i x_i^T$

where $c_{ik}(\lambda) = \lambda N^{-1} + (1 - \lambda)\, n_k^{-1}\, I(y_i = k)$.
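The equivalence between the convex combination and the per-observation weights $c_{ik}(\lambda)$ can be checked numerically; the simulated data below are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_k, p, lam = 3, 10, 4, 0.3
N = K * n_k
X = rng.normal(size=(N, p))
y = np.repeat(np.arange(K), n_k)

# Center each observation by its class sample mean, as in the derivation.
Xc = X.copy()
for k in range(K):
    Xc[y == k] -= X[y == k].mean(axis=0)

# MLE class covariances (divide by n_k) and the pooled MLE.
Sk = [Xc[y == k].T @ Xc[y == k] / n_k for k in range(K)]
S_pooled = sum(n_k * Sk[k] for k in range(K)) / N

for k in range(K):
    convex = (1 - lam) * Sk[k] + lam * S_pooled
    # Same estimator written as sum_i c_ik(lambda) x_i x_i', with
    # c_ik = lambda/N + (1 - lambda)/n_k * I(y_i = k).
    c = lam / N + (1 - lam) / n_k * (y == k)
    weighted = (Xc * c[:, None]).T @ Xc
    assert np.allclose(convex, weighted)
```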

This shows that $\lambda$ controls the weight with which each of the $N$ training observations, from all $K$ classes, contributes to the estimate of the variance-covariance matrix $\Sigma_k$, rather than the estimate being formed from only the $n_k$ observations of a single class. Therefore,

$\hat{\Sigma}_k(\lambda) = \sum_{i=1}^{N} c_{ik}(\lambda)\, x_i x_i^T$

represents a covariance-matrix estimator that "borrows" from the pooled estimator

$\hat{\Sigma} = N^{-1}\sum_{k=1}^{K} n_k \hat{\Sigma}_k$

to estimate $\Sigma_k$.

The estimate of $\Sigma_k$ is further improved, and its inverse stabilized, by introducing an eigenvalue adjustment:

$\hat{\Sigma}_k ≔ \alpha_k \hat{\Sigma}_k(\lambda) + \gamma I_p$

where $\alpha_k \ge 0$ and $\gamma \ge 0$ is an eigenvalue shrinkage constant.

In this formulation, the pooling parameter $\lambda$ represents the amount of estimation information "borrowed" from $\hat{\Sigma}$ to estimate $\Sigma_k$, the shrinkage parameter $\gamma$ represents the degree of eigenvalue shrinkage, and the constant $\alpha_k$ gives flexibility to the covariance-matrix estimators. [1]

Substituting $\hat{\Sigma}_k ≔ \alpha_k \hat{\Sigma}_k(\lambda) + \gamma I_p$ into the QDA classifier discussed in the previous section defines the HDRDA classifier:

$D_{\mathit{HDRDA}}(\mathbf{x}) = \arg\min_k \, (\mathbf{x} - \bar{\mathbf{x}}_k)^T \hat{\Sigma}_k^{+} (\mathbf{x} - \bar{\mathbf{x}}_k) + \log\left|\hat{\Sigma}_k\right|$

where $\hat{\Sigma}_k^{+}$ denotes the Moore-Penrose pseudoinverse of $\hat{\Sigma}_k$.
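The definition can be implemented directly, albeit naively, by forming the $p \times p$ regularized estimates and inverting them; this is exactly the cost that the properties discussed next avoid when $p \gg N$. The sketch below is my own (with $\alpha_k = 1$, and $\gamma > 0$ so the pseudoinverse reduces to an ordinary inverse):

```python
import numpy as np

def hdrda_decision(x, X, y, lam=0.5, gamma=0.1):
    """Naive HDRDA rule: argmin_k (x - xbar_k)' T_k^-1 (x - xbar_k) + log|T_k|
    with T_k = Sigma_k(lambda) + gamma * I_p (alpha_k = 1).  Since gamma > 0,
    T_k is nonsingular, so no pseudoinverse machinery is needed here."""
    classes = np.unique(y)
    N, p = X.shape
    means = {k: X[y == k].mean(axis=0) for k in classes}
    mles = {k: np.cov(X[y == k].T, bias=True) for k in classes}  # MLE: divide by n_k
    pooled = sum((y == k).sum() * mles[k] for k in classes) / N
    scores = []
    for k in classes:
        T = (1 - lam) * mles[k] + lam * pooled + gamma * np.eye(p)
        diff = x - means[k]
        _, logdet = np.linalg.slogdet(T)
        scores.append(diff @ np.linalg.solve(T, diff) + logdet)
    return int(classes[np.argmin(scores)])

# Two simulated classes in 5 dimensions with well-separated centers.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (10, 5)), rng.normal(3.0, 1.0, (10, 5))])
y = np.repeat([0, 1], 10)
k = hdrda_decision(np.full(5, 2.8), X, y)  # -> 1, the class centered at 3
```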

3.5   Properties of HDRDA

The equation above can be decomposed into two components. The first consists of matrix operations applied to low-dimensional matrices, and the second consists of operations on the null space of $\hat{\Sigma}$. For all classes, the matrix operations applied to the null space of $\hat{\Sigma}$ yield a constant quadratic form and can therefore be omitted. [1] As $p$ grows far beyond $N$, this property yields substantial computational savings, because the omitted component consists of determinants and inverses of high-dimensional matrices. [1]

The lemmas below, which appear in Ramey et al. (2016) [1], provide the basis for deriving the HDRDA decision rule:

Lemma 1.

Let $\hat{\Sigma}_k$ and $\hat{\Sigma}$ be the MLEs of $\Sigma_k$ and $\Sigma$, respectively, and let $\hat{\Sigma}_k(\lambda)$ be defined as

$\hat{\Sigma}_k(\lambda) ≔ (1 - \lambda)\hat{\Sigma}_k + \lambda\hat{\Sigma}$

Then the null space of $\hat{\Sigma}$ is contained in the null space of $\hat{\Sigma}_k(\lambda)$ for each $k$. [1]

Lemma 2.

Let $\hat{\Sigma} = U D U^T$ be the eigendecomposition of $\hat{\Sigma}$ as above, and suppose that $\mathrm{rank}(\hat{\Sigma}) = q \le p$. Then $U$ may be partitioned as $U = [U_1 \; U_2]$, where the columns of $U_1 \in \mathbb{R}^{p \times q}$ span the column space of $\hat{\Sigma}$ and the columns of $U_2 \in \mathbb{R}^{p \times (p - q)}$ span its null space. [1]

Lemma 3.

Let $U_2$ be defined as above. Then, for all $k$, the component of the quadratic form lying in the null space of $\hat{\Sigma}$ is constant across classes and can be omitted from the decision rule. [1]

With these lemmas, the discriminant function

$D_{\mathit{HDRDA}}(\mathbf{x}) = \arg\min_k \, (\mathbf{x} - \bar{\mathbf{x}}_k)^T \hat{\Sigma}_k^{+} (\mathbf{x} - \bar{\mathbf{x}}_k) + \log\left|\hat{\Sigma}_k\right|$

can be reduced to the following decision rule for HDRDA:

$D_{\mathit{HDRDA}}(\mathbf{x}) = \arg\min_k \, (\mathbf{x} - \bar{\mathbf{x}}_k)^T U_1 W_k^{-1} U_1^T (\mathbf{x} - \bar{\mathbf{x}}_k) + \log\left|W_k\right|$

where $W_k = \alpha_k U_1^T \hat{\Sigma}_k(\lambda)\, U_1 + \gamma I_q$. This allows the required operations to be performed on $W_k \in \mathbb{R}^{q \times q}$, instead of computing the inverses and determinants of $p \times p$ covariance matrices. [1]
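The dimension reduction rests on the block structure of the regularized estimator in the eigenbasis of $\hat{\Sigma}$: the null-space block is exactly $\gamma I_{p-q}$ and the cross block vanishes, so all inverse and determinant work reduces to the $q \times q$ block. This can be verified numerically on simulated data of my own (with $\alpha_k = 1$):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, K, lam, gamma = 12, 50, 2, 0.4, 0.2   # p >> N, so Sigma_hat is singular
n_k = N // K
X = rng.normal(size=(N, p))
y = np.repeat(np.arange(K), n_k)

# Class-centered data, MLE class covariances, and the pooled MLE.
Xc = np.vstack([X[y == k] - X[y == k].mean(axis=0) for k in range(K)])
Sk = [Xc[y == k].T @ Xc[y == k] / n_k for k in range(K)]
pooled = sum(n_k * S for S in Sk) / N

# Eigendecomposition of the pooled estimator; its rank q is at most N - K.
vals, U = np.linalg.eigh(pooled)
keep = vals > 1e-8
U1, U2 = U[:, keep], U[:, ~keep]
q = U1.shape[1]

for k in range(K):
    T = (1 - lam) * Sk[k] + lam * pooled + gamma * np.eye(p)
    # Null-space block is the constant gamma * I_(p-q); cross block vanishes,
    # so only W_k = U1' T U1, a q x q matrix, needs inverting.
    assert np.allclose(U2.T @ T @ U2, gamma * np.eye(p - q), atol=1e-8)
    assert np.allclose(U1.T @ T @ U2, 0.0, atol=1e-8)
```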

## 5.  Results and Analysis

• Timing Comparisons between RDA and HDRDA
• Classification Study
• Simulation Study

## 6.  Conclusion

## References

[1]      J. A. Ramey, C. K. Stein, P. D. Young, and D. M. Young. High-Dimensional Regularized Discriminant Analysis. arXiv preprint arXiv:1602.01182, 2016.

[2]      Chiaretti, S., Li, X., Gentleman, R., Vitale, A., Vignetti, M., Mandelli, F., Ritz, J., Foa, R., 2004. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood 103 (7), 2771-2778.

[3]      Chowdary, D., Lathrop, J., Skelton, J., Curtin, K., Briggs, T., Zhang, Y., Yu, J., Wang, Y., Mazumder, A., Feb. 2006. Prognostic Gene Expression Signatures Can Be Measured in Tissues Collected in RNAlater Preservative. The Journal of Molecular Diagnostics 8 (1), 31-39.

[4]      Nakayama, R., Nemoto, T., Takahashi, H., Ohta, T., Kawai, A., Seki, K., Yoshida, T., Toyama, Y., Ichikawa, H., Hasegawa, T., Apr. 2007. Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma. Modern Pathology 20 (7), 749-759.

[5]      Witten, D. M., Tibshirani, R., Aug. 2011. Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (5), 753-772.

[6]      Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C. T., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G. S., Ray, T. S., Koval, M. A., Last, K. W., Norton, A., Lister, T. A., Mesirov, J., Neuberg, D. S., Lander, E. S., Aster, J. C., Golub, T. R., Jan. 2002. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine 8 (1), 68-74.

[7]      Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff, P. W., Golub, T. R., Sellers, W. R., Mar. 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 (2), 203-209.

[8]      Tian, E., Zhan, F., Walker, R., Rasmussen, E., Ma, Y., Barlogie, B., Shaughnessy, Jr., J. D., Dec. 2003. The Role of the Wnt-Signaling Antagonist DKK1 in the Development of Osteolytic Lesions in Multiple Myeloma. New England Journal of Medicine 349 (26), 2483-2494.
