Why R is perfect for Statistical Analysis

Note: This article is taken from Machine Learning with R written by Brett Lantz. This book will help you learn specialized machine learning techniques for text mining, social network data, and big data.

In this post, we will explore different statistical analysis techniques and how they can be implemented easily and efficiently using the R language.

Introduction

The R language, as the descendant of the statistics language S, has become the preferred computing language in the field of statistics. Moreover, because its community actively contributes to the field, a newly developed statistical method is very likely to be implemented first in R. As a result, a large number of statistical methods are available in the R language.

Statistical methods in R can be grouped into descriptive statistics and inferential statistics:

Descriptive statistics: These are used to summarize the characteristics of the data. The user can use the mean and standard deviation to describe numerical data, and use frequencies and percentages to describe categorical data.
Inferential statistics: Based on the pattern within sample data, the user can infer the characteristics of the population. The methods related to inferential statistics cover hypothesis testing, estimation, correlation, and relationship modeling. Inference can be further extended to forecasting, prediction, and estimation of unobserved values either in or associated with the population being studied.
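
As a minimal sketch of the two categories, the following uses the built-in mtcars dataset, which is chosen here purely for illustration and is not part of the original recipes. The hypothesized mean of 20 in the t-test is likewise an arbitrary assumption.

```r
# Descriptive statistics: summarize the data we actually have.
mpg_mean <- mean(mtcars$mpg)              # center of a numerical variable
mpg_sd   <- sd(mtcars$mpg)                # spread of a numerical variable
cyl_freq <- table(mtcars$cyl)             # frequency of each category
cyl_pct  <- prop.table(cyl_freq) * 100    # the same counts as percentages

# Inferential statistics: use the sample to reason about the population.
# One-sample t-test of H0: the population mean of mpg equals 20
# (20 is an arbitrary value picked for this sketch).
mpg_test <- t.test(mtcars$mpg, mu = 20)
mpg_test$conf.int   # 95% confidence interval for the population mean
mpg_test$p.value    # evidence against H0
```

The first half only describes the 32 cars in the dataset; the second half makes a statement about the population those cars are assumed to be drawn from.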

In the following recipes, we will discuss examples of data sampling, probability distribution, univariate descriptive statistics, correlations and multivariate analysis, linear regression and multivariate analysis, the exact binomial test, Student’s t-test, the Kolmogorov-Smirnov test, the Wilcoxon rank sum and signed rank tests, Pearson’s chi-squared test, one-way ANOVA, and two-way ANOVA.

Data sampling with R

Sampling is a method of selecting a subset of data from a statistical population, so that the characteristics of the subset can be used to estimate those of the whole population. The following recipe will demonstrate how to generate samples in R.

Perform the following steps to understand data sampling in R:

To generate a random sample from a given population, the user can simply use the sample function. With only a vector as input, it returns a random permutation of the whole vector:

> sample(1:10)

To specify the number of items returned, the user can set the size argument:

> sample(1:10, size = 5)

Moreover, the sample function can also generate Bernoulli trials by sampling from 0 and 1 with replace = TRUE (the default is FALSE):

> sample(c(0,1), 10, replace = TRUE)

If we want to simulate a coin-flipping trial, where the outcome is Head or Tail, we can use:

> outcome <- c("Head","Tail")

> sample(outcome, size=1)

To generate the results of 100 flips, we can use:

> sample(outcome, size=100, replace=TRUE)

The sample function is also useful for selecting random observations from a dataset. For example, to select 10 observations from AirPassengers:

> sample(AirPassengers, size=10)

How it works

As we saw in the preceding demonstration, the sample function can generate random samples from a specified population. The number of records returned can be set by the user through the size argument. By setting the replace argument to TRUE and sampling from a population that contains only 0 and 1, you can generate Bernoulli trials.
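
Three common variations on the recipe can be sketched as follows: making a draw reproducible with set.seed, weighting the outcomes with the prob argument of sample, and sampling rows of a data frame by index. The seed value, the 70/30 weights, and the use of the built-in mtcars dataset are assumptions made for illustration only.

```r
# Reproducibility: the same seed gives the same draw.
set.seed(123)                       # arbitrary seed
first <- sample(1:10, size = 5)
set.seed(123)
second <- sample(1:10, size = 5)
identical(first, second)            # TRUE

# A biased coin: prob gives the weight of each outcome
# (0.7 and 0.3 are illustrative values).
flips <- sample(c("Head", "Tail"), size = 1000,
                replace = TRUE, prob = c(0.7, 0.3))
table(flips) / length(flips)        # observed proportions

# Sampling rows of a data frame: draw row indices, then subset.
rows <- sample(nrow(mtcars), size = 5)
mtcars_subset <- mtcars[rows, ]
```

Because replace defaults to FALSE, the five row indices are distinct, so no observation is selected twice.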

Operating a probability distribution in R

Probability distributions and statistical analysis are closely related. In statistical analysis, analysts make predictions about a population that is usually assumed to follow a certain probability distribution. Therefore, if the data selected for a prediction does not follow the probability distribution assumed in the experiment design, the results can be refuted. In other words, probability provides the justification for statistics. The following examples demonstrate how to work with probability distributions in R.

Perform the following steps:

For a normal distribution, the user can use dnorm. The following call returns the height of the standard normal curve at 0:

> dnorm(0)

Output:

[1] 0.3989423

Then, the user can change the mean and the standard deviation through the arguments:

> dnorm(0,mean=3,sd=5)

Output:

[1] 0.06664492

Next, plot the graph of a normal distribution with the curve function:

> curve(dnorm,-3,3)

In contrast to dnorm, which returns the height of a normal curve, the pnorm function returns the area under the curve to the left of a given value:

> pnorm(1.5)

Output:

[1] 0.9331928

Alternatively, to get the area to the right of a given value, you can set the lower.tail option to FALSE:

> pnorm(1.5, lower.tail=FALSE)

Output:

[1] 0.0668072

To plot the graph of pnorm, the user can use the curve function:

> curve(pnorm(x), -3,3)

To calculate the quantiles of a specific distribution, you can use qnorm. The qnorm function can be treated as the inverse of pnorm: it returns the z-score for a given probability:

> qnorm(0.5)

Output:

[1] 0

> qnorm(pnorm(0))

Output:

[1] 0

To generate random numbers from a normal distribution, one can use the rnorm function and specify the number of generated numbers. Also, one can define optional arguments, such as the mean and standard deviation:

> set.seed(50)

> x = rnorm(100,mean=3,sd=5)

> hist(x)

For the uniform distribution, the runif function generates random numbers. The user can set the range of the generated numbers through the min and max arguments. In the following example, the user generates 100 random numbers between 0 and 5:

> set.seed(50)

> y = runif(100,0,5)

> hist(y)

Lastly, if you would like to test the normality of the data, the most widely used test for this is the Shapiro-Wilk test. Here, we demonstrate how to perform a test of normality on the samples from the normal and uniform distributions, respectively:

> shapiro.test(x)

Output:

Shapiro-Wilk normality test

data: x

W = 0.9938, p-value = 0.9319

> shapiro.test(y)

Shapiro-Wilk normality test

data: y

W = 0.9563, p-value = 0.002221

How it works

In this recipe, we first introduce dnorm, the probability density function of the normal distribution, which returns the height of the normal curve at a given value. When no other arguments are specified, dnorm assumes the standard normal distribution, with a mean of 0 and a standard deviation of 1, and the input value can then be read as a standard score, or z-score. We then use the curve function to draw the normal distribution.
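
To make the "height of the curve" concrete, the sketch below writes the normal density formula out by hand and compares it with dnorm. The helper name normal_density is hypothetical and introduced only for this illustration.

```r
# Hand-written normal density (hypothetical helper, for illustration):
# f(x) = 1 / (sd * sqrt(2 * pi)) * exp(-(x - mean)^2 / (2 * sd^2))
normal_density <- function(x, mean = 0, sd = 1) {
  1 / (sd * sqrt(2 * pi)) * exp(-(x - mean)^2 / (2 * sd^2))
}

# Agrees with dnorm for the two calls used in the recipe.
all.equal(normal_density(0), dnorm(0))
all.equal(normal_density(0, mean = 3, sd = 5), dnorm(0, mean = 3, sd = 5))

# A density is a height, not a probability; the total area under it is 1.
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value
```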

After this, we introduce pnorm, the cumulative distribution function. The pnorm function returns the area under the normal curve to the left of a given value. In addition, pnorm can be used to calculate a p-value from a normal distribution: the upper-tail p-value is obtained by subtracting the returned number from 1, or by setting the lower.tail option to FALSE. Similarly, one can use the curve function to plot the cumulative distribution.
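
As a short sketch of the p-value calculation, the upper-tail area can be obtained either as the complement of what pnorm returns or directly with lower.tail = FALSE. The z-score of 1.5 is reused from the recipe; the two-sided p-value is an addition for illustration.

```r
z <- 1.5  # test statistic, reusing the value from the recipe

# Upper-tail p-value, computed two equivalent ways.
upper_by_complement <- 1 - pnorm(z)
upper_by_option     <- pnorm(z, lower.tail = FALSE)

# Two-sided p-value: the area in both tails beyond |z|.
two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)
```

For values far in the tail, lower.tail = FALSE is generally the safer choice, because subtracting a number very close to 1 from 1 loses floating-point precision.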

In contrast to pnorm, qnorm returns the z-score of a given probability. Therefore, as the example shows, applying qnorm to the output of pnorm returns the original input value.
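
The inverse relationship can be checked on several probabilities at once, as in the sketch below. The probabilities 0.025 and 0.975 are chosen here because they give the familiar critical values of about -1.96 and 1.96; they do not appear in the original recipe.

```r
p <- c(0.025, 0.5, 0.975)

# Quantiles (z-scores) of the standard normal distribution.
z <- qnorm(p)

# Round trip: pnorm undoes qnorm, up to floating-point tolerance.
round_trip_ok <- isTRUE(all.equal(pnorm(z), p))

# For a non-standard normal, the quantile is just the z-score
# shifted by the mean and scaled by the standard deviation.
q_direct <- qnorm(0.975, mean = 3, sd = 5)
q_manual <- 3 + 5 * qnorm(0.975)
```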

Next, we show how to use the rnorm function to generate random samples from a normal distribution, and the runif function to generate random samples from a uniform distribution. In rnorm, one has to specify how many numbers to generate, and may also add optional arguments, such as the mean and standard deviation. Plotting the result with the hist function should show a roughly bell-shaped histogram. For runif, specifying the minimum and maximum returns a vector of numbers between the two. We can again use the hist function to plot the samples; this histogram is not bell-shaped, which suggests that the sample does not come from a normal distribution.

Finally, we demonstrate how to test data normality with the Shapiro-Wilk test, applying it to both the normal and the uniform samples. Each result reports a p-value: the probability of seeing a sample at least this far from normal if the data really did come from a normal distribution. If the p-value is higher than 0.05, we fail to reject the null hypothesis of normality, so the data is consistent with a normal distribution (as with x, where the p-value is 0.9319). If the p-value is lower than 0.05, we reject normality and conclude that the sample does not come from a normal distribution (as with y, where the p-value is 0.002221).
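
The decision rule can be wrapped in a small function, sketched below. The helper name normality_verdict and the wording of its return values are hypothetical; the 0.05 threshold is the conventional one used in the text, not a fixed rule.

```r
# Recreate the two samples from the recipe.
set.seed(50)
x <- rnorm(100, mean = 3, sd = 5)
set.seed(50)
y <- runif(100, min = 0, max = 5)

# Hypothetical helper: apply the Shapiro-Wilk test and report the decision.
# alpha is the significance level (0.05 by convention).
normality_verdict <- function(sample_data, alpha = 0.05) {
  result <- shapiro.test(sample_data)
  if (result$p.value < alpha) {
    "reject normality"
  } else {
    "no evidence against normality"
  }
}

normality_verdict(x)
normality_verdict(y)
```

Note that the second outcome is worded as a lack of evidence rather than as proof: a high p-value means the test found nothing inconsistent with a normal distribution, not that the data is certainly normal.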

We have shown how the R language can be used to perform statistical analysis easily and efficiently, starting with its simplest forms: data sampling and probability distributions.

If you liked this article, please be sure to check out Machine Learning with R which consists of useful machine learning techniques with R.

Amarabha Banerjee