This article is taken from Machine Learning with R, written by Brett Lantz. This book will help you learn specialized machine learning techniques for text mining, social network data, and big data.
In this post, we will explore different statistical analysis techniques and how they can be implemented easily and efficiently using the R language.
The R language, as the descendant of the statistics language S, has become the preferred computing language in the field of statistics. Moreover, because of its active community in the field, a newly discovered statistical method is very likely to be implemented in R first. As a result, a large number of statistical methods can be applied using the R language.
Statistical methods in R can be grouped into two categories: descriptive statistics and inferential statistics.
In the following recipes, we will discuss examples of data sampling, probability distributions, univariate descriptive statistics, correlations and multivariate analysis, linear regression and multivariate analysis, the Exact Binomial Test, Student's t-test, the Kolmogorov-Smirnov test, the Wilcoxon Rank Sum and Signed Rank test, Pearson's Chi-squared Test, One-way ANOVA, and Two-way ANOVA.
Sampling is a method of selecting a subset of data from a statistical population, so that the characteristics of the subset can be used to estimate those of the whole population. The following recipe demonstrates how to generate samples in R.
Perform the following steps to understand data sampling in R:
# Generate a random permutation of the integers 1 to 10
> sample(1:10)
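# Draw five values from 1 to 10 without replacement
> sample(1:10, size = 5)
# Draw ten values from {0, 1} with replacement (Bernoulli trials)
> sample(c(0,1), 10, replace = TRUE)
# Simulate a single coin toss, then 100 tosses
> outcome <- c("Head","Tail")
> sample(outcome, size=1)
> sample(outcome, size=100, replace=TRUE)
# Draw ten observations from the built-in AirPassengers dataset
> sample(AirPassengers, size=10)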
As the preceding demonstration shows, the sample function generates random samples from a specified population. The number of records returned is set with the size argument. By default, sampling is done without replacement; setting the replace argument to TRUE allows the same element to be drawn more than once, which makes it possible to simulate Bernoulli trials (draws from a population containing only 0 and 1).
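As a minimal sketch of the points above, the following block fixes the random seed so that the draws are reproducible, and uses the prob argument of sample to simulate a biased coin. The seed value and the 0.7/0.3 weights are arbitrary choices for illustration and are not part of the original recipe.

```r
# Fix the seed so that the random draws are reproducible (123 is an arbitrary choice)
set.seed(123)

# A random permutation of 1..10: every value appears exactly once
perm <- sample(1:10)

# Five values drawn without replacement: duplicates are not possible
subset5 <- sample(1:10, size = 5)

# 1,000 tosses of a biased coin; the 0.7/0.3 weights are illustrative
outcome <- c("Head", "Tail")
tosses <- sample(outcome, size = 1000, replace = TRUE, prob = c(0.7, 0.3))

# Observed proportions should be close to the weights given in prob
print(prop.table(table(tosses)))
```

Calling set.seed before sampling is what makes the result repeatable; without it, each run returns a different sample.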
Probability distributions and statistical analysis are closely related. In statistical analysis, analysts make predictions about a population that is usually assumed to follow a particular probability distribution. If the data selected for a prediction does not follow the distribution assumed in the experiment design, the results can be refuted. In other words, probability provides the justification for statistics. The following examples demonstrate how to work with probability distributions in R.
Perform the following steps:
> dnorm(0)
Output:
[1] 0.3989423
> dnorm(0,mean=3,sd=5)
Output:
[1] 0.06664492
> curve(dnorm,-3,3)
> pnorm(1.5)
Output:
[1] 0.9331928
> pnorm(1.5, lower.tail=FALSE)
Output:
[1] 0.0668072
> curve(pnorm(x), -3,3)
> qnorm(0.5)
Output:
[1] 0
> qnorm(pnorm(0))
Output:
[1] 0
> set.seed(50)
> x = rnorm(100,mean=3,sd=5)
> hist(x)
> set.seed(50)
> y = runif(100,0,5)
> hist(y)
> shapiro.test(x)
Output:
Shapiro-Wilk normality test
data: x
W = 0.9938, p-value = 0.9319
> shapiro.test(y)
Output:
Shapiro-Wilk normality test
data: y
W = 0.9563, p-value = 0.002221
In this recipe, we first introduce dnorm, the probability density function of the normal distribution, which returns the height of the normal curve at a given value. When a single input is specified, it is treated as a standard score, or z-score: with no other arguments, the function assumes a standard normal distribution with a mean of 0 and a standard deviation of 1. The mean and sd arguments select a different normal distribution, and the curve function plots the density over a given range.
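To make the meaning of "height of a normal curve" concrete, here is a small sketch that recomputes the two dnorm values from the recipe directly from the normal density formula. The variable names are hypothetical and chosen only for this illustration.

```r
# Density of the standard normal distribution at 0, computed by hand
manual_std <- 1 / sqrt(2 * pi)

# Density of N(mean = 3, sd = 5) at 0, using the normal density formula
mu <- 3
sigma <- 5
manual_gen <- exp(-(0 - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))

# Both comparisons check the hand-computed values against dnorm
print(all.equal(dnorm(0), manual_std))
print(all.equal(dnorm(0, mean = mu, sd = sigma), manual_gen))
```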
After this, we introduce pnorm, the cumulative distribution function. The pnorm function returns the area under the normal curve to the left of a given value. It can also be used to calculate a p-value from a normal distribution: the upper-tail probability is obtained by subtracting the result from 1, or by setting the lower.tail option to FALSE. Similarly, the curve function can be used to plot the cumulative distribution.
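The following sketch illustrates the two equivalent ways of getting an upper-tail probability from pnorm, using the same value of 1.5 as the recipe. The two-sided p-value at the end is an added illustration that assumes a symmetric z-test; it does not appear in the original steps.

```r
z <- 1.5

# Area to the left of z (lower tail)
lower <- pnorm(z)

# Area to the right of z, computed in two equivalent ways
upper_a <- 1 - pnorm(z)
upper_b <- pnorm(z, lower.tail = FALSE)

# Two-sided p-value for a z statistic, using the symmetry of the normal curve
two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)

print(c(lower = lower, upper = upper_b, two_sided = two_sided))
```

For values far out in the tail, lower.tail = FALSE is generally the safer choice, because subtracting a number very close to 1 from 1 loses floating-point precision.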
In contrast to pnorm, qnorm returns the z-score for a given probability; it is the inverse of pnorm. The example therefore shows that applying qnorm to the output of pnorm returns the original input value.
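As a short sketch of this inverse relationship, the block below runs the round trip for several z-scores and then uses qnorm for a typical task: finding the critical value of a two-sided 95% interval. The chosen z-scores and the 0.975 probability are illustrative assumptions.

```r
# Round trip: qnorm undoes pnorm for any z-score
z <- c(-1.96, 0, 1.5)
round_trip <- qnorm(pnorm(z))
print(all.equal(round_trip, z))

# Critical value for a two-sided 95% interval (the 0.975 quantile)
crit <- qnorm(0.975)
print(crit)

# qnorm also accepts mean and sd; the median of N(3, 5) is its mean
med <- qnorm(0.5, mean = 3, sd = 5)
print(med)
```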
Next, we show how to use the rnorm function to generate random samples from a normal distribution, and the runif function to generate random samples from a uniform distribution. In rnorm, one has to specify how many numbers to generate, and may also add optional arguments such as the mean and standard deviation. Plotting the result with the hist function should then show a bell-shaped histogram. For the runif function, specifying the minimum and maximum returns a vector of sample numbers between the two. We can again use the hist function to plot the samples, but this histogram is not bell-shaped, which suggests that the sample does not come from a normal distribution.
Finally, we demonstrate how to test data normality with the Shapiro-Wilk test. Here, we conduct the normality test on both the normal and the uniform samples. Each test result reports a p-value, which is the probability of observing data at least this far from normality if the sample really came from a normal distribution. If the p-value is higher than 0.05, we cannot reject the hypothesis that the sample comes from a normal distribution. If the p-value is lower than 0.05, we reject that hypothesis and conclude that the sample does not come from a normal distribution.
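The decision rule above can be sketched as a small helper that reads the p-value from the object returned by shapiro.test. The function name looks_normal and the default threshold of 0.05 are hypothetical choices made for this illustration.

```r
# Same seed and parameters as in the recipe
set.seed(50)
x <- rnorm(100, mean = 3, sd = 5)
set.seed(50)
y <- runif(100, 0, 5)

# Hypothetical helper: TRUE when the Shapiro-Wilk test does not
# reject normality at significance level alpha
looks_normal <- function(v, alpha = 0.05) {
  shapiro.test(v)$p.value > alpha
}

print(looks_normal(x))
print(looks_normal(y))
```

With the p-values reported in the recipe (0.9319 for x and 0.002221 for y), the helper would return TRUE for x and FALSE for y. A TRUE result only means that normality was not rejected; it is not proof that the data is normal.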
We have shown how to use the R language to perform simple forms of statistical analysis easily and efficiently.
If you liked this article, please be sure to check out Machine Learning with R which consists of useful machine learning techniques with R.