
Implementing Principal Component Analysis with R


Note: The following article is an excerpt from the book Mastering Text Mining with R, written by Ashish Kumar and Avinash Paul. The book gives a comprehensive view of the text mining process and shows how you can leverage the power of R to analyze textual data and get unique insights out of it.

In this article, we aim to explain the concept of dimensionality reduction, or variable reduction, using Principal Component Analysis.


Principal Component Analysis (PCA) reveals the internal structure of a dataset in the way that best explains its variance. PCA identifies patterns in the data and reduces the dimensionality of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space onto a smaller subspace in order to decrease computational cost. PCA computes new features, called principal components; these are uncorrelated linear combinations of the original features, oriented in the directions of highest variability. The key step is to map the set of features into a matrix, M, and compute its eigenvalues and eigenvectors. Eigenvectors provide simple solutions to problems that can be modeled as linear transformations along axes: stretching, compressing, or flipping. Eigenvalues give the magnitude of those eigenvectors, that is, how strongly the data is stretched in each direction. Eigenvectors with larger eigenvalues are selected for the new feature space because they capture more information about the data distribution than eigenvectors with smaller eigenvalues. The first principal component has the greatest possible variance (the largest eigenvalue); each subsequent principal component has the greatest variance possible subject to being uncorrelated with the preceding ones. In general, the nth PC is the linear combination of maximum variance that is uncorrelated with all previous PCs.

PCA comprises the following steps:

  1. Compute the n-dimensional mean of the given dataset.
  2. Compute the covariance matrix of the features.
  3. Compute the eigenvectors and eigenvalues of the covariance matrix.
  4. Rank/sort the eigenvectors by descending eigenvalue.
  5. Choose x eigenvectors with the largest eigenvalues.

Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space.

PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. This methodology also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis.

Using R for PCA

R has two inbuilt functions for performing PCA: prcomp() and princomp(). Both expect the dataset to be organized with variables in columns and observations in rows, in a structure such as a data frame. Both return the transformed data as a data frame, with the principal components in columns. The two functions differ slightly in implementation: internally, princomp() performs PCA using an eigen decomposition of the covariance matrix, whereas prcomp() uses singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function. Each function returns a list of class "prcomp" or "princomp", respectively.

The information returned and the terminology used are summarized in the following table:

prcomp()   princomp()  Description
sdev       sdev        Standard deviations of the principal components
rotation   loadings    Matrix of variable loadings (the eigenvectors)
center     center      Variable means that were subtracted
scale      scale       Variable scalings that were applied
x          scores      The principal component scores (the rotated data)
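To see the two interfaces side by side, here is a minimal comparison on simulated data (the matrix and seed are arbitrary). One detail worth knowing: princomp() computes variances with divisor n, while prcomp() uses n - 1, so their standard deviations differ by a constant factor:

```r
# Compare the two base-R PCA interfaces on the same data.
set.seed(1)
m <- matrix(rnorm(300), nrow = 100, ncol = 3)

p1 <- prcomp(m)    # SVD-based; loadings in p1$rotation, scores in p1$x
p2 <- princomp(m)  # eigen-based; loadings in p2$loadings, scores in p2$scores

class(p1)  # "prcomp"
class(p2)  # "princomp"

# princomp() divides variances by n, prcomp() by n - 1, so rescaling
# princomp()'s standard deviations makes the two agree:
p1$sdev
p2$sdev * sqrt(100 / 99)
```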

Here’s a list of the functions available in different R packages for performing PCA:

  • PCA(): FactoMineR package
  • acp(): amap package
  • prcomp(): stats package
  • princomp(): stats package
  • dudi.pca(): ade4 package
  • pcaMethods: This package from Bioconductor has various convenient methods to compute PCA

Understanding the FactoMineR package

FactoMineR is an R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions in the package handle not only quantitative data but also categorical data. Apart from PCA, correspondence analysis and multiple correspondence analysis can also be performed using this package:

library(FactoMineR)

data <- replicate(10, rnorm(1000))

result.pca <- PCA(data[, 1:9], scale.unit = TRUE, graph = TRUE)

print(result.pca)

The analysis was performed on 1,000 individuals, described by nine variables. The results are available in the following objects:

(Output of print(result.pca), listing the objects available in the result)

The eigenvalues, the percentage of variance, and the cumulative percentage of variance:

(Output: table of eigenvalues with percentage and cumulative percentage of variance)

Amap package

amap is another R package that provides tools for clustering and PCA. Its name is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which performs PCA on a data frame.

This function is akin to princomp() and prcomp(), except that it has a slightly different graphical representation.

For more intricate details, refer to the CRAN-R resource page: https://cran.r-project.org/web/packages/amap/amap.pdf

library(amap)

acp(data, center = TRUE, reduce = TRUE)

Additionally, weight vectors can be provided as arguments. We can perform a robust PCA by using the acpgen() function in the amap package:

acpgen(data, h1, h2, center = TRUE, reduce = TRUE, kernel = "gaussien")

K(u, kernel = "gaussien")

W(x, h, D = NULL, kernel = "gaussien")

acprob(x, h, center = TRUE, reduce = TRUE, kernel = "gaussien")

Proportion of variance

We aim to construct the components and then choose the minimum number of components that explains the variance of the data with high confidence.

R provides the prcomp() function in the stats package to estimate principal components.

Let's learn how to use this function to estimate the proportion of variance explained by each component:

pca_base <- prcomp(data)

print(pca_base)

The pca_base object contains the standard deviation and rotations of the vectors. Rotations are also known as the principal components of the data. Let’s find out the proportion of variance each component explains:

pr_variance <- (pca_base$sdev^2 / sum(pca_base$sdev^2)) * 100

pr_variance

[1] 11.678126 11.301480 10.846161 10.482861 10.176036  9.605907  9.498072
[8]  9.218186  8.762572  8.430598

pr_variance signifies the proportion of variance explained by each component in descending order of magnitude.

Let’s calculate the cumulative proportion of variance for the components:

cumsum(pr_variance)

[1]  11.67813  22.97961  33.82577  44.30863  54.48467  64.09057  73.58864
[8]  82.80683  91.56940 100.00000

The first eight components together explain about 83% of the variance in the data.
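The choice of how many components to keep can also be automated with cumsum(); a small sketch, where the 80% threshold is an arbitrary choice for illustration:

```r
# Pick the smallest number of components whose cumulative proportion
# of variance reaches a chosen threshold (here, 80%).
set.seed(7)
data <- replicate(10, rnorm(1000))

pca_base <- prcomp(data)
pr_variance <- (pca_base$sdev^2 / sum(pca_base$sdev^2)) * 100

n_components <- which(cumsum(pr_variance) >= 80)[1]
n_components
```

On real (correlated) data, far fewer components are usually needed; the simulated columns here are independent, so the variance is spread almost evenly across all ten.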

Scree plot

If you wish to plot the variances against the number of components, you can use the screeplot() function on the fitted model:

screeplot(pca_base)

(Output: scree plot of the variances of the principal components)

To summarize, we saw how easy it is to implement PCA using the rich functionality offered by different R packages.

If this article has caught your interest, make sure to check out Mastering Text Mining with R, which contains many interesting techniques for text mining and natural language processing using R.

