High-Dimensional Regularized Discriminant Analysis
Regularized discriminant analysis (RDA) is a widely used classifier that determines which features discriminate between two or more groups. Despite its popularity and generalizability, RDA becomes impractical and loses interpretability in classification problems with high-dimensional, small-sample data, where the number of features far exceeds the training sample size. To address this flaw, high-dimensional regularized discriminant analysis (HDRDA) is introduced. The performance and computational runtime of HDRDA are analyzed by applying HDRDA and other traditional classifiers to six real high-dimensional datasets. It is demonstrated that HDRDA is superior to multiple sparse and regularized classifiers in both classification accuracy and computational cost, especially as the number of features increases.
Keywords: Regularized discriminant analysis, High-dimensional classification, dimensionality reduction
The focus of this research paper is on classification scenarios involving small-sample, high-dimensional datasets in which the number of features p far exceeds the training sample size N, that is, $N \ll p$. In this scenario, the discriminant analysis classifiers, namely linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), cannot be computed. This is because the class and pooled covariance matrix estimators are singular, so their inverses do not exist. Introducing a weighted average of the class and pooled covariance matrices and a regularization component to discriminant analysis results in regularized discriminant analysis (RDA), which estimates each class covariance matrix more accurately and stabilizes its inverse. This, however, causes the RDA classifier to lose interpretability and does not resolve the flaw that it is impractical for high-dimensional datasets. The large number of features causes the number of computations to grow at a polynomial rate, which makes matrix calculations extremely expensive. Tuning RDA requires the inverse and determinant of each class covariance matrix to be computed across multiple cross-validation folds for each candidate tuning-parameter pair, which only intensifies the computational burden.
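To illustrate why the covariance estimators become singular, the following minimal sketch (plain Python; the toy data and helper names are hypothetical) computes the maximum-likelihood sample covariance of N = 2 observations in p = 3 dimensions. Its rank is at most N - 1, so its determinant is zero and no inverse exists.

```python
def sample_covariance(X):
    """Maximum-likelihood covariance (divisor N) of the rows of X."""
    n, p = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    return [
        [sum((x[r] - means[r]) * (x[c] - means[c]) for x in X) / n
         for c in range(p)]
        for r in range(p)
    ]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


# Hypothetical toy data: N = 2 observations, p = 3 features.
X = [[1.0, 2.0, 3.0], [3.0, 1.0, 5.0]]
cov = sample_covariance(X)
singular = abs(det3(cov)) < 1e-12  # True: the matrix cannot be inverted
```

The same rank argument applies to the pooled estimator whenever the total number of training observations is smaller than p.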
To address the flaws of regularized discriminant analysis, the high-dimensional regularized discriminant analysis (HDRDA) classifier is introduced. The RDA classifier is reparametrized, a biased covariance-matrix estimator is employed, and the resulting estimator is shrunk towards a scaled identity matrix to ensure positive definiteness. The pooling parameter introduced in the HDRDA classifier aids interpretability by showing how each training observation contributes to the estimate of each class covariance matrix.
2.1 Discriminant Analysis
Discriminant analysis is a classifier in which two or more clusters or populations are known a priori and new observations are classified into one of the known groups based on their measured characteristics. Accordingly, the procedure starts with training data with known group memberships. Each population $\pi_i$ is assigned a prior probability $p_i$ that represents the expected proportion of observations belonging to it. There are commonly three choices for defining priors:
Case 1: We assume equal priors if all the population sizes are expected to be equal, with $\hat{p}_i = 1/g$, where g is the number of known groups in the training data.
Case 2: We select arbitrary priors to represent the relative population sizes, where $\hat{p}_1 + \hat{p}_2 + \cdots + \hat{p}_g = 1$.
Case 3: We estimate the priors based on the proportion of observations $n_i$ from each population $\pi_i$ in the training data. This therefore requires that $\hat{p}_i = n_i / N$, where $N = n_1 + \cdots + n_g$.
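The equal-prior and estimated-prior choices can be sketched in a few lines of plain Python. The function names and the toy labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def equal_priors(labels):
    """Case 1: each of the g groups receives prior 1 / g."""
    groups = sorted(set(labels))
    return {group: 1 / len(groups) for group in groups}


def estimated_priors(labels):
    """Case 3: prior of group i is n_i / N, its share of the training data."""
    counts = Counter(labels)
    total = len(labels)
    return {group: n / total for group, n in counts.items()}


# Hypothetical training labels: three observations from "a", one from "b".
labels = ["a", "a", "a", "b"]
priors_equal = equal_priors(labels)
priors_estimated = estimated_priors(labels)
```

In either case the priors sum to one, which is also the only constraint on arbitrarily chosen priors.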
Once the priors are identified, Bartlett’s test is used to determine whether the variance-covariance matrices are homogeneous across all population groups involved. This determines whether linear or quadratic discriminant analysis should be used.
Case 1: If the variance-covariance matrices are homogeneous, then $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_g = \Sigma$. This means that the variance-covariance matrices do not depend on the population. In this case, linear discriminant analysis is used.
Case 2: If the variance-covariance matrices are heterogeneous, then $\Sigma_i \neq \Sigma_j$ for some $i \neq j$; the variance-covariance matrices depend on the population, and quadratic discriminant analysis is used.
Once the method to be used is identified, the parameters of the conditional probability density function $f(x \mid \pi_i)$ are estimated under the assumptions that the data from group $i$ have a common mean vector $\mu_i$ and a common variance-covariance matrix $\Sigma$ (or $\Sigma_i$ in the heterogeneous case). The observations are also assumed to be independently sampled and multivariate normally distributed.
Then, discriminant functions are computed to classify a new observation into one of the known population groups, and cross-validation is used to estimate the probability of misclassification.
2.2 Linear Discriminant Analysis (LDA)
Discriminant analysis uses estimates of the prior probabilities, population means, and variance-covariance matrix obtained from the training data as follows:
$$\hat{p}_i \ (i = 1, \ldots, g), \qquad \hat{\mu}_i = \bar{x}_i, \qquad \hat{\Sigma} = S_p$$
That is, the population means and the variance-covariance matrix are estimated by the sample mean vectors $\bar{x}_i$ and the pooled variance-covariance matrix $S_p$, respectively.
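If the variance-covariance matrices are homogeneous as in Case 1 above, the probability density function of $x$ in population $\pi_i$ is multivariate normal by assumption with mean vector $\mu_i$ and variance-covariance matrix $\Sigma$. This is represented as follows (here $\pi$ without a subscript is the mathematical constant):
$$f(x \mid \pi_i) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu_i)^{\top} \Sigma^{-1} (x - \mu_i)\right]$$
Each observation is classified to the population for which $p_i f(x \mid \pi_i)$ is the highest.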
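In linear discriminant analysis, the Linear Score Function is used to make decisions. It is a function of the pooled variance-covariance matrix and $\mu_i$, the population mean for each of the $g$ populations, represented as the following:
$$s_i^{L}(x) = -\frac{1}{2}\mu_i^{\top}\Sigma^{-1}\mu_i + \mu_i^{\top}\Sigma^{-1}x + \log p_i$$
The Linear Discriminant Function becomes
$$d_i^{L}(x) = -\frac{1}{2}\mu_i^{\top}\Sigma^{-1}\mu_i + \mu_i^{\top}\Sigma^{-1}x$$
so that $s_i^{L}(x) = d_i^{L}(x) + \log p_i$. In practice, $\mu_i$ and $\Sigma$ are replaced by the sample mean vectors and the pooled variance-covariance matrix estimated from the training data.
Sample units with measurements $x = (x_1, \ldots, x_p)^{\top}$ are classified into the population group that has the largest Linear Score Function; in other words, they are classified into the population such that the posterior probability of membership is maximized.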
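The following is a minimal sketch of classification by the linear score, written in plain Python for two features so that the matrix inverse can be computed by hand. It assumes the standard form of the score, minus one half of mu' S^-1 mu, plus mu' S^-1 x, plus log p_i; the class means, pooled covariance matrix, and priors are hypothetical values rather than estimates from any dataset discussed here.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_score(x, mean, pooled_cov_inv, prior):
    """Linear score: -1/2 mu' S^-1 mu + mu' S^-1 x + log p."""
    s_inv_mu = matvec(pooled_cov_inv, mean)  # S^-1 mu (S is symmetric)
    return -0.5 * dot(mean, s_inv_mu) + dot(s_inv_mu, x) + math.log(prior)


def classify_lda(x, means, pooled_cov, priors):
    """Return the group with the largest linear score, plus all scores."""
    s_inv = inv2(pooled_cov)
    scores = {g: linear_score(x, means[g], s_inv, priors[g]) for g in means}
    return max(scores, key=scores.get), scores


# Hypothetical parameters for two groups sharing one covariance matrix.
means = {"A": [0.0, 0.0], "B": [2.0, 2.0]}
pooled_cov = [[1.0, 0.0], [0.0, 1.0]]
priors = {"A": 0.5, "B": 0.5}

label_near_a, _ = classify_lda([0.5, 0.2], means, pooled_cov, priors)
label_near_b, _ = classify_lda([1.5, 1.8], means, pooled_cov, priors)
```

Because the score is linear in x, the boundary between any two groups is a hyperplane; the prior only shifts that boundary through the log term.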
2.3 Quadratic Discriminant Analysis (QDA)
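For heterogeneous variance-covariance matrices, the matrices depend on the population. Quadratic Score Functions are used for quadratic discriminant analysis:
$$s_i^{Q}(x) = -\frac{1}{2}\log|\Sigma_i| - \frac{1}{2}(x - \mu_i)^{\top}\Sigma_i^{-1}(x - \mu_i) + \log p_i$$
In this equation, the mean vectors and the variance-covariance matrices are unknown and are therefore replaced by their estimates from the training data, which gives
$$\hat{s}_i^{Q}(x) = -\frac{1}{2}\log|\hat{\Sigma}_i| - \frac{1}{2}(x - \bar{x}_i)^{\top}\hat{\Sigma}_i^{-1}(x - \bar{x}_i) + \log \hat{p}_i$$
As with the classification rule for linear discriminant analysis, sample units are classified into the population group for which the quadratic score function is maximized.
With equal priors, maximizing the quadratic score is equivalent to minimizing twice its negative, so the QDA classifier is defined as:
$$D_{QDA}(x) = \arg\min_{i} \left\{ (x - \bar{x}_i)^{\top}\hat{\Sigma}_i^{-1}(x - \bar{x}_i) + \log|\hat{\Sigma}_i| \right\}$$
where $\hat{\Sigma}_i$ is the maximum-likelihood estimator (MLE) for $\Sigma_i$.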
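As an illustration of the quadratic score, the sketch below (plain Python, two features, hypothetical parameters) classifies points between two groups that share the same mean but differ in covariance. It assumes the standard score of minus one half of log|S_i|, minus one half of the Mahalanobis distance, plus log p_i. A linear rule could not separate these groups, whereas the quadratic rule can.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = det2(m)
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def quadratic_score(x, mean, cov, prior):
    """Quadratic score: -1/2 log|S| - 1/2 (x-m)' S^-1 (x-m) + log p."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    cov_inv = inv2(cov)
    mahalanobis = sum(
        diff[r] * cov_inv[r][c] * diff[c] for r in range(2) for c in range(2)
    )
    return -0.5 * math.log(det2(cov)) - 0.5 * mahalanobis + math.log(prior)


def classify_qda(x, params, priors):
    """Return the group with the largest quadratic score."""
    scores = {
        g: quadratic_score(x, mean, cov, priors[g])
        for g, (mean, cov) in params.items()
    }
    return max(scores, key=scores.get)


# Hypothetical groups: identical means, different spreads.
params = {
    "tight": ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    "wide": ([0.0, 0.0], [[9.0, 0.0], [0.0, 9.0]]),
}
priors = {"tight": 0.5, "wide": 0.5}

label_center = classify_qda([0.1, 0.1], params, priors)
label_far = classify_qda([4.0, 4.0], params, priors)
```

Each score requires the inverse and determinant of a class covariance matrix, which is the step that breaks down when that matrix is singular.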
3. System Overview and Methodology
The performance of the HDRDA classifier is compared with that of other classifiers on two metrics, average classification error rate and running time, using six high-dimensional datasets.
The runtimes of the classifiers are measured and plotted to show how much improvement HDRDA provides over the RDA classifier.
3.2 Software Package Description
R is a programming language and open-source software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. It is widely used among statisticians and researchers of various disciplines for analyzing data and developing statistical software. Version 3.3.1 of the software is used to conduct this experiment. The penalizedLDA package is used to implement penalized linear discriminant analysis. The sparsediscrim package is used to implement two variants of diagonal linear discriminant analysis that use an improved mean estimator and an improved variance estimator, respectively. The random forest classifier from the randomForest package is used as a benchmark against which the other classifiers are compared.
3.3.1 Chiaretti et al. (2004) Dataset
The Chiaretti dataset contains the gene expression profiles of acute lymphoblastic leukemia (ALL) patients. The profiles were obtained from 128 patients using Affymetrix human 95Av2 arrays. Thirty-three samples containing more than 90% blast cells were used for gene expression analysis. Leukemia samples from another 18 patients were used to test the three-gene model developed by gene expression profiling.
3.3.2 Chowdary et al. (2006) Dataset
The Chowdary dataset includes 52 matched pairs of colon and breast tumor tissues profiled using Affymetrix U133A arrays and ribonucleic acid (RNA) amplification. Matched frozen and RNAlater-preserved tissues were obtained from Genomics Collaborative Inc. (Cambridge, MA) and Proteogenex (Los Angeles, CA) following prospective collection by multiple agencies with institutional review board approval. Rapidly frozen samples (but not RNAlater-preserved tissues) were confirmed by pathology to contain at least 70% tumor cell content. The RNAlater-preserved tissue was directly adjacent to the rapidly frozen tumor, and the portion used for pathological validation was located between the two, so that the frozen and RNAlater-preserved tissues form a mirror-image pair.
3.3.3 Nakayama et al. (2007) Dataset
For this dataset, 105 gene expression samples were obtained from 10 types of soft-tissue tumors through an oligonucleotide microarray. This includes samples from synovial sarcoma (16 samples), myxoid/round cell liposarcoma (19 samples), lipoma (3 samples), well-differentiated liposarcoma (3 samples), dedifferentiated liposarcoma (15 samples), myxofibrosarcoma (15 samples), leiomyosarcoma (6 samples), malignant nerve sheath tumor (3 samples), fibrosarcoma (4 samples), and malignant fibrous histiocytoma (21 samples). Following Witten and Tibshirani (2011), we restricted the analysis to the five tumor types with at least 15 observations.
3.3.4 Shipp et al. (2002) Dataset
Shipp et al. (2002) studied diffuse large B-cell lymphoma (DLBCL), the most common lymphoid malignancy in adults, which is curable in less than 50% of patients. Prognostic models based on pretreatment characteristics, such as the International Prognostic Index (IPI), are currently used to predict the outcome of DLBCL. Customized cDNA (lymphochip) microarrays were used to take 6817 gene-expression measurements from 58 DLBCL samples to study the effectiveness of cyclophosphamide, adriamycin, vincristine, and prednisone (CHOP)-based chemotherapy. Thirty-two of the samples represent cured cases, and the remaining 26 come from patients with fatal or refractory disease.
3.3.5 Singh et al. (2002) Dataset
The Singh dataset includes 235 radical prostatectomy specimens collected between 1995 and 1997. Oligonucleotide microarrays containing probes for approximately 12,600 genes and expressed sequence tags were used to measure expression. Of the 235 specimens, 102 were deemed high quality in terms of interpretability: 52 prostate tumor samples and 50 non-tumor prostate samples.
3.3.6 Tian et al. (2003) Dataset
Affymetrix U95Av2 microarrays were used to extract expression profiles for 2625 genes. Molecular determinants of osteolytic lesions were identified by subjecting the plasma cells to biochemical and immunohistochemical tests. Magnetic resonance imaging (MRI) detected focal bone lesions in 136 myeloma patients and detected none in 36 myeloma patients.
3.4 Definition of HDRDA
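The high-dimensional regularized discriminant analysis classifier can be defined by presenting the covariance-matrix estimator $\hat{\Sigma}_k(\lambda)$ and its interpretation as a linear combination of the cross-products of the training observations. For each class $k = 1, \ldots, K$, the convex combination is defined as follows:
$$\hat{\Sigma}_k(\lambda) = (1 - \lambda)\hat{\Sigma}_k + \lambda\hat{\Sigma}$$
where $\hat{\Sigma}_k$ is the class covariance-matrix estimator, $\hat{\Sigma}$ is the pooled estimator, and $\lambda \in [0, 1]$ is the pooling parameter. The degree of shrinkage of each class covariance-matrix estimate toward the pooled estimate is determined by $\lambda$: $\lambda = 0$ leads to QDA whereas $\lambda = 1$ leads to LDA.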
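The pooling step can be sketched as an element-wise convex combination of a class covariance matrix and the pooled covariance matrix. The function name, the symbol `lam` for the pooling parameter, and the matrices below are hypothetical; the two endpoint checks mirror the QDA and LDA limits.

```python
def pool_covariance(class_cov, pooled_cov, lam):
    """Return (1 - lam) * class_cov + lam * pooled_cov, element by element."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("the pooling parameter must lie in [0, 1]")
    return [
        [(1 - lam) * c + lam * p for c, p in zip(class_row, pooled_row)]
        for class_row, pooled_row in zip(class_cov, pooled_cov)
    ]


# Hypothetical 2x2 covariance estimates.
class_cov = [[2.0, 0.0], [0.0, 2.0]]
pooled_cov = [[4.0, 1.0], [1.0, 4.0]]

qda_limit = pool_covariance(class_cov, pooled_cov, 0.0)  # class estimate only
lda_limit = pool_covariance(class_cov, pooled_cov, 1.0)  # pooled estimate only
partial = pool_covariance(class_cov, pooled_cov, 0.25)
```

Intermediate values of the pooling parameter trade the flexibility of a per-class estimate against the stability of the pooled one.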
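Suppose each training observation $x_i$ ($i = 1, \ldots, N$), with class label $y_i$, is centered by its class sample mean. Expressing the convex combination of the class estimate $\hat{\Sigma}_k$ and the pooled estimate $\hat{\Sigma}$ in terms of the $x_i$ gives
$$\hat{\Sigma}_k(\lambda) = \sum_{i=1}^{N} c_{ik}(\lambda)\, x_i x_i^{\top}, \qquad c_{ik}(\lambda) = \frac{\lambda}{N} + \frac{1 - \lambda}{n_k}\, I(y_i = k)$$
where $I(\cdot)$ is the indicator function. This shows that $c_{ik}(\lambda)$ is the weight of the contribution of each of the $N$ observations from all $K$ classes in estimating the variance-covariance matrix $\Sigma_k$, as opposed to formulating it with only the $n_k$ observations from a single class. Therefore, $\hat{\Sigma}_k(\lambda)$ represents a covariance-matrix estimator that “borrows” from $\hat{\Sigma}$ to estimate $\Sigma_k$.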
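A quick numerical check of this interpretation is sketched below. It assumes the weights take the form lam / N + (1 - lam) / n_k for observations in class k and lam / N otherwise, as in Ramey et al. (2016), and that both covariance estimators are maximum-likelihood estimates (divisors n_k and N). The toy data and function names are hypothetical.

```python
def center_by_class(X, y):
    """Subtract each observation's class sample mean."""
    centered = []
    for x, label in zip(X, y):
        rows = [xi for xi, yi in zip(X, y) if yi == label]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        centered.append([v - m for v, m in zip(x, mean)])
    return centered


def scatter(rows, divisor):
    """Sum of cross-products x x^T over rows, divided by divisor."""
    p = len(rows[0])
    return [[sum(x[r] * x[c] for x in rows) / divisor for c in range(p)]
            for r in range(p)]


def pooled_by_weights(Xc, y, k, lam):
    """Weighted sum of cross-products over all N centered observations."""
    n_total, p = len(Xc), len(Xc[0])
    n_k = sum(1 for label in y if label == k)
    out = [[0.0] * p for _ in range(p)]
    for x, label in zip(Xc, y):
        weight = lam / n_total + ((1 - lam) / n_k if label == k else 0.0)
        for r in range(p):
            for c in range(p):
                out[r][c] += weight * x[r] * x[c]
    return out


def pooled_by_convex_combination(Xc, y, k, lam):
    """(1 - lam) * class MLE covariance + lam * pooled MLE covariance."""
    class_rows = [x for x, label in zip(Xc, y) if label == k]
    class_cov = scatter(class_rows, len(class_rows))
    pooled_cov = scatter(Xc, len(Xc))
    p = len(Xc[0])
    return [[(1 - lam) * class_cov[r][c] + lam * pooled_cov[r][c]
             for c in range(p)] for r in range(p)]


# Hypothetical training data: five observations, two classes, two features.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
y = [0, 0, 0, 1, 1]
Xc = center_by_class(X, y)

by_weights = pooled_by_weights(Xc, y, k=0, lam=0.3)
by_combination = pooled_by_convex_combination(Xc, y, k=0, lam=0.3)
max_gap = max(abs(by_weights[r][c] - by_combination[r][c])
              for r in range(2) for c in range(2))
```

The two routes agree up to floating-point error, which makes explicit that observations outside class k still contribute with weight lam / N.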
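The estimation of $\Sigma_k$ is improved and its inverse is stabilized by introducing an eigenvalue adjustment as follows:
$$\tilde{\Sigma}_k = \alpha_k \hat{\Sigma}_k(\lambda) + \gamma I_p$$
where $\alpha_k \geq 0$, $I_p$ is the $p \times p$ identity matrix, and $\gamma \geq 0$ is an eigenvalue shrinkage constant.
This shows that the pooling parameter $\lambda$ represents the amount of estimation information “borrowed” from $\hat{\Sigma}$ to estimate $\Sigma_k$, and the shrinkage parameter $\gamma$ represents the degree of eigenvalue shrinkage. The constant $\alpha_k$ gives flexibility to the covariance-matrix estimators.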
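The effect of the eigenvalue adjustment can be seen on a small singular matrix. The sketch below assumes the adjusted estimator has the form alpha * S + gamma * I, as in Ramey et al. (2016); the matrix and the two parameter choices are hypothetical and only illustrate how a positive gamma restores invertibility.

```python
def shrink_eigenvalues(cov, alpha, gamma):
    """Return alpha * cov + gamma * I, which shifts every eigenvalue by gamma."""
    p = len(cov)
    return [
        [alpha * cov[r][c] + (gamma if r == c else 0.0) for c in range(p)]
        for r in range(p)
    ]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Hypothetical singular estimate with eigenvalues 0 and 2.
singular_cov = [[1.0, 1.0], [1.0, 1.0]]

# alpha = 1 keeps the estimate and adds a ridge-like term: eigenvalues 0.5, 2.5.
ridge_like = shrink_eigenvalues(singular_cov, alpha=1.0, gamma=0.5)

# alpha = 1 - gamma gives a convex combination with the identity matrix.
convex = shrink_eigenvalues(singular_cov, alpha=0.5, gamma=0.5)

det_before = det2(singular_cov)
det_ridge_like = det2(ridge_like)
det_convex = det2(convex)
```

Since the determinant is the product of the eigenvalues, a strictly positive determinant for this positive semidefinite input confirms that the adjusted matrix is positive definite.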
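Substituting the adjusted estimator $\tilde{\Sigma}_k$ into the QDA classifier discussed in the previous section gives the definition of the HDRDA classifier as follows:
$$D_{HDRDA}(x) = \arg\min_{k} \left\{ (x - \bar{x}_k)^{\top}\tilde{\Sigma}_k^{-1}(x - \bar{x}_k) + \log|\tilde{\Sigma}_k| \right\}$$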
3.5 Properties of HDRDA
The equation above can be decomposed into two components. The first component involves matrix operations on low-dimensional matrices, and the second component involves the null space of the pooled estimator $\hat{\Sigma}$. For all classes, the matrix operations applied to the null space of $\hat{\Sigma}$ result in a constant quadratic form, which can therefore be omitted. As p increases and significantly exceeds N, this property substantially reduces computational cost because the omitted component would otherwise require determinants and inverses of high-dimensional matrices.
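The lemmas that appear in Ramey et al. (2016) provide the basis for the decision rule for HDRDA. Let $\hat{\Sigma}_k$ and $\hat{\Sigma}$ be the MLEs of $\Sigma_k$ and $\Sigma$, respectively. Let $\hat{\Sigma} = U D U^{\top}$ be the eigendecomposition of $\hat{\Sigma}$, and suppose that $\operatorname{rank}(\hat{\Sigma}) = q \leq p$. Partition $U = (U_1, U_2)$ so that the $q$ columns of $U_1$ correspond to the nonzero eigenvalues, collected in the $q \times q$ diagonal matrix $D_q$, and the columns of $U_2$ span the null space of $\hat{\Sigma}$. Let $W_k$ be defined as
$$W_k = \alpha_k \left\{ (1 - \lambda)\, U_1^{\top}\hat{\Sigma}_k U_1 + \lambda D_q \right\} + \gamma I_q$$
Then, for all $k = 1, \ldots, K$,
$$\tilde{\Sigma}_k = U \begin{pmatrix} W_k & 0 \\ 0 & \gamma I_{p-q} \end{pmatrix} U^{\top}$$
With these lemmas, the discriminant function can be simplified to derive the following decision rule for HDRDA:
$$D_{HDRDA}(x) = \arg\min_{k} \left\{ (x - \bar{x}_k)^{\top} U_1 W_k^{-1} U_1^{\top} (x - \bar{x}_k) + \log|W_k| \right\}$$
This allows us to carry out the operations on the $q \times q$ matrices $W_k$, instead of computing the inverses and determinants of the $p \times p$ covariance matrices.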
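As a sketch of why the low-dimensional decision rule follows, assume (in the notation of Ramey et al. (2016), which is an assumption wherever it is not defined in this text) that $\tilde{\Sigma}_k = U\,\mathrm{diag}(W_k, \gamma I_{p-q})\,U^{\top}$ with $U = (U_1, U_2)$ orthogonal and $\gamma > 0$. The inverse and determinant of a block-diagonal matrix are taken block by block, which gives:

```latex
\begin{aligned}
(x - \bar{x}_k)^{\top}\tilde{\Sigma}_k^{-1}(x - \bar{x}_k)
  &= (x - \bar{x}_k)^{\top} U_1 W_k^{-1} U_1^{\top} (x - \bar{x}_k)
   + \gamma^{-1}\,\lVert U_2^{\top}(x - \bar{x}_k) \rVert^{2} \\
\log|\tilde{\Sigma}_k|
  &= \log|W_k| + (p - q)\log\gamma
\end{aligned}
```

The term $(p - q)\log\gamma$ is the same for every class. If the null-space term is also constant across classes, as stated in the text, both can be dropped from the minimization, leaving only computations with the $q \times q$ matrices $W_k$.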
5. Results and Analysis
- Timing Comparisons between RDA and HDRDA
- Classification Study
- Simulation Study
 Ramey, J. A., Stein, C. K., Young, P. D., Young, D. M., 2016. High-Dimensional Regularized Discriminant Analysis. arXiv preprint arXiv:1602.01182.
 Chiaretti, S., Li, X., Gentleman, R., Vitale, A., Vignetti, M., Mandelli, F., Ritz, J., Foa, R., 2004. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood 103 (7), 2771-2778.
 Chowdary, D., Lathrop, J., Skelton, J., Curtin, K., Briggs, T., Zhang, Y., Yu, J., Wang, Y., Mazumder, A., Feb. 2006. Prognostic Gene Expression Signatures Can Be Measured in Tissues Collected in RNAlater Preservative. The Journal of Molecular Diagnostics 8 (1), 31-39.
 Nakayama, R., Nemoto, T., Takahashi, H., Ohta, T., Kawai, A., Seki, K., Yoshida, T., Toyama, Y., Ichikawa, H., Hasegawa, T., Apr. 2007. Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma. Modern Pathology 20 (7), 749-759.
 Witten, D. M., Tibshirani, R., Aug. 2011. Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (5), 753-772.
 Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C. T., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G. S., Ray, T. S., Koval, M. A., Last, K. W., Norton, A., Lister, T. A., Mesirov, J., Neuberg, D. S., Lander, E. S., Aster, J. C., Golub, T. R., Jan. 2002. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine 8 (1), 68-74.
 Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff, P. W., Golub, T. R., Sellers, W. R., Mar. 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 (2), 203-209.
 Tian, E., Zhan, F., Walker, R., Rasmussen, E., Ma, Y., Barlogie, B., Shaughnessy, Jr., J. D., Dec. 2003. The Role of the Wnt-Signaling Antagonist DKK1 in the Development of Osteolytic Lesions in Multiple Myeloma. New England Journal of Medicine 349 (26), 2483-2494.