Statistical Analysis with R

Take control of your data and produce superior statistical analysis with R.

  • An easy introduction for people who are new to R, with plenty of strong examples for you to work through
  • This book will take you on a journey to learn R as the strategist for an ancient Chinese kingdom!
  • A step by step guide to understand R, its benefits, and how to use it to maximize the impact of your data analysis
  • A practical guide to conduct and communicate your data analysis with R in the most effective manner

 

Retracing and refining a complete analysis

For demonstration purposes, it will be assumed that a fire attack was chosen as the optimal battle strategy. Throughout this segment, we will retrace the steps that led us to this decision. Meanwhile, we will make sure to organize and clarify our analyses so they can be easily communicated to others.

Suppose we determined our fire attack will take place 225 miles away in Anding, which houses 10,000 Wei soldiers. We will deploy 2,500 soldiers for a period of 7 days and assume that they are able to successfully execute the plans. Let us return to the beginning to develop this strategy with R in a clear and concise manner.

Time for action – first steps

To begin our analysis, we must first launch R and set our working directory:

  1. Launch R.
  2. The R console will be displayed.
  3. Set your R working directory using the setwd(dir) function. The following code is a hypothetical example. Your working directory should be a relevant location on your own computer.

    > #set the R working directory using setwd(dir)
    > setwd("/Users/johnmquick/rBeginnersGuide/")

  4. Verify that your working directory has been set to the proper location using the getwd() command:

    > #verify the location of your working directory
    > getwd()
    [1] "/Users/johnmquick/rBeginnersGuide"

What just happened?

We prepared R to begin our analysis by launching the software and setting our working directory. At this point, you should be very comfortable completing these steps.
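
The working directory steps above can be tried without touching your own folders. The following is a minimal sketch, assuming that a temporary directory (the hypothetical projectDir) stands in for a real project location; it is not part of the original exercise.

```r
# hypothetical example: a temporary directory stands in for a project folder
projectDir <- tempdir()

# setwd() invisibly returns the previous working directory, so we keep it
previousDir <- setwd(projectDir)

# verify the change; normalizePath() resolves symbolic links so that
# the two paths can be compared reliably
currentDir <- normalizePath(getwd())
identical(currentDir, normalizePath(projectDir))

# restore the original working directory
setwd(previousDir)
```

Saving the value returned by setwd(dir) makes it easy to return to where you started once the analysis is finished.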

Time for action – data setup

Next, we need to import our battle data into R and isolate the portion pertaining to past fire attacks:

  1. Copy the battleHistory.csv file into your R working directory. This file contains data from 120 previous battles between the Shu and Wei forces.
  2. Read the contents of battleHistory.csv into an R variable named battleHistory using the read.table(…) command:

    > #read the contents of battleHistory.csv into an R variable
    > #battleHistory contains data from 120 previous battles between the Shu and Wei forces
    > battleHistory <- read.table("battleHistory.csv", TRUE, ",")

  3. Create a subset using the subset(data, …) function and save it to a new variable named subsetFire:

    > #use the subset(data, ...) function to create a subset of the battleHistory dataset
    > #that contains data only from battles in which the fire attack strategy was employed
    > subsetFire <- subset(battleHistory, battleHistory$Method == "fire")

  4. Verify the contents of the new subset. Note that the console should return 30 rows, all of which contain fire in the Method column:

    > #display the fire attack data subset
    > subsetFire
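
If battleHistory.csv is not at hand, the subsetting step can be sketched on a small stand-in. The data frame toyBattles and its Days column are invented for illustration; only the Method column name comes from the text.

```r
# hypothetical stand-in for battleHistory; all values are invented
toyBattles <- data.frame(
  Method = c("fire", "ambush", "fire", "surround"),
  Days = c(7, 3, 10, 5)
)

# keep only the rows in which the fire attack strategy was employed
toyFire <- subset(toyBattles, toyBattles$Method == "fire")

# subset() evaluates its condition inside the data frame,
# so the column can also be named directly
toyFireShort <- subset(toyBattles, Method == "fire")

# both forms select the same rows
identical(toyFire, toyFireShort)

# verify that only fire attacks remain
nrow(toyFire)
all(toyFire$Method == "fire")
```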

What just happened?

We imported our dataset and then created a subset containing our fire attack data. However, we used a slightly different function, called read.table(…), to import our external data into R.

read.table(…)

Up to this point, we have always used the read.csv() function to import data into R. However, you should know that there are often many ways to accomplish the same objectives in R. For instance, read.table(…) is a generic data import function that can handle a variety of file types. While it accepts several arguments, the following three are required to properly import a CSV file, like the one containing our battle history data:

  • file: the name of the file to be imported, along with its extension, in quotes
  • header: whether or not the file contains column headings; TRUE for yes, FALSE (default) for no
  • sep: the character used to separate values in the file, in quotes

Using these arguments, we were able to import the data in our battleHistory.csv into R. Since our file contained headings, we used a value of TRUE for the header argument and because it is a comma-separated values file, we used "," for our sep argument:

> battleHistory <- read.table("battleHistory.csv", TRUE, ",")

This is just one example of how a different technique can be used to achieve a similar outcome in R. We will continue to explore new methods in our upcoming activities.
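
To see that the two import functions can produce the same result, consider the following sketch. It assumes a small, invented CSV file written to a temporary location (csvPath is hypothetical) rather than the real battleHistory.csv.

```r
# write a small, invented CSV file to a temporary location
csvPath <- tempfile(fileext = ".csv")
writeLines(c("Method,Result", "fire,Victory", "ambush,Defeat"), csvPath)

# positional arguments, as used in the text: file, header, sep
byPosition <- read.table(csvPath, TRUE, ",")

# the same call with the arguments named, which is easier to read
byName <- read.table(csvPath, header = TRUE, sep = ",")

# read.csv() uses header = TRUE and sep = "," by default
viaCsv <- read.csv(csvPath)

# compare the three imports
all.equal(byPosition, byName)
all.equal(byName, viaCsv)
```

Naming the arguments is optional, but it spares the reader from having to remember the order in which read.table(…) expects them.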

Pop quiz

  1. Suppose you wanted to import the following dataset, named newData, into R. Which of the following read.table(…) calls would be best to use?

    4,5
    5,9
    6,12

    1. read.table("newData", FALSE, ",")
    2. read.table("newData", TRUE, ",")
    3. read.table("newData.csv", FALSE, ",")
    4. read.table("newData.csv", TRUE, ",")

Time for action – data exploration

To begin our analysis, we will examine the summary statistics and correlations of our data. These will give us an overview of the data and inform our subsequent analyses:

  1. Generate a summary of the fire attack subset using summary(object):

    > #generate a summary of the fire subset
    > summaryFire <- summary(subsetFire)
    > #display the summary
    > summaryFire

    Before calculating correlations, we will have to convert our nonnumeric data from the Method, SuccessfullyExecuted, and Result columns into numeric form.

  2. Recode the Method column using as.numeric(data):

    > #represent categorical data numerically using as.numeric(data)
    > #recode the Method column into Fire = 1
    > numericMethodFire <- as.numeric(subsetFire$Method) - 1

  3. Recode the SuccessfullyExecuted column using as.numeric(data):

    > #recode the SuccessfullyExecuted column into N = 0 and Y = 1
    > numericExecutionFire <- as.numeric(subsetFire$SuccessfullyExecuted) - 1

  4. Recode the Result column using as.numeric(data):

    > #recode the Result column into Defeat = 0 and Victory = 1
    > numericResultFire <- as.numeric(subsetFire$Result) - 1

    With the Method, SuccessfullyExecuted, and Result columns coded into numeric form, let us now add them back into our fire dataset.

  5. Save the data in our recoded variables back into the original dataset:

    > #save the data in the numeric Method, SuccessfullyExecuted, and Result columns back into the fire attack dataset
    > subsetFire$Method <- numericMethodFire
    > subsetFire$SuccessfullyExecuted <- numericExecutionFire
    > subsetFire$Result <- numericResultFire

  6. Display the numeric version of the fire attack subset by entering subsetFire at the console. Notice that all of the columns now contain numeric data.

  7. Having replaced our original text values in the Method, SuccessfullyExecuted, and Result columns with numeric data, we can now calculate all of the correlations in the dataset using the cor(data) function:

    > #use cor(data) to calculate all of the correlations in the fire attack dataset
    > cor(subsetFire)

    Note that the warning message and NA values in our correlation output result from the fact that our Method column contains only a single value, so its standard deviation is zero. This is irrelevant to our analysis and can be ignored.
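
The effect of a single-valued column on cor(data) can be reproduced on a small scale. The following sketch uses an invented data frame (toyData, with hypothetical Execution and Duration columns); its numbers are not taken from the battle data.

```r
# invented data; Method is constant, like the recoded Method column
toyData <- data.frame(
  Method = c(1, 1, 1, 1),
  Execution = c(0, 1, 1, 0),
  Duration = c(10, 4, 6, 12)
)

# cor() warns that the standard deviation is zero for Method;
# suppressWarnings() hides that warning for this demonstration
corMatrix <- suppressWarnings(cor(toyData))

# correlations that involve the constant column are NA
corMatrix["Method", "Execution"]

# correlations between the remaining columns are unaffected
corMatrix["Execution", "Duration"]

# leaving the constant column out avoids the warning altogether
cleanMatrix <- cor(toyData[, c("Execution", "Duration")])
cleanMatrix
```

Dropping the constant column is one possible way to get a tidier correlation table; ignoring the NA entries, as the text does, works equally well.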

What just happened?

Initially, we calculated summary statistics for our fire attack dataset using the summary(object) function. From this information, we can derive the following useful insights about our past battles:

  • The rating of the Shu army’s performance in fire attacks has ranged from 10 to 100, with a mean of 45
  • Fire attack plans have been successfully executed 10 out of 30 times (33%)
  • Fire attacks have resulted in victory 8 out of 30 times (27%)
  • Successfully executed fire attacks have resulted in victory 8 out of 10 times (80%), while unsuccessful attacks have never resulted in victory
  • The number of Shu soldiers engaged in fire attacks has ranged from 100 to 10,000 with a mean of 2,052
  • The number of Wei soldiers engaged in fire attacks has ranged from 1,500 to 50,000 with a mean of 12,333
  • The duration of fire attacks has ranged from 1 to 14 days with a mean of 7
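
Proportions such as the ones listed above can also be computed directly rather than read off the summary output. The following is a minimal sketch on six invented battles (toyFire is hypothetical), so its rates differ from the real figures.

```r
# invented data using the column names and codes from the text
toyFire <- data.frame(
  SuccessfullyExecuted = c("Y", "Y", "N", "N", "Y", "N"),
  Result = c("Victory", "Defeat", "Defeat", "Defeat", "Victory", "Defeat")
)

# the mean of a logical vector is the proportion of TRUE values
executionRate <- mean(toyFire$SuccessfullyExecuted == "Y")
victoryRate <- mean(toyFire$Result == "Victory")

# victory rate among successfully executed attacks only
executedOnly <- subset(toyFire, SuccessfullyExecuted == "Y")
victoryRateWhenExecuted <- mean(executedOnly$Result == "Victory")

# cross-tabulate execution against result
table(toyFire$SuccessfullyExecuted, toyFire$Result)
```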

Next, we recoded the text values in our dataset's Method, SuccessfullyExecuted, and Result columns into numeric form. After adding the data from these variables back into our original dataset, we were able to calculate all of its correlations. This allowed us to learn even more about our past battle data:

  • The performance rating of a fire attack has been highly correlated with successful execution of the battle plans (0.92) and the battle’s result (0.90), but not strongly correlated with the other variables.
  • The execution of a fire attack has been moderately negatively correlated with the duration of the attack, such that a longer attack is associated with a lesser chance of success (-0.46).
  • The numbers of Shu and Wei soldiers engaged are highly correlated with each other (0.74), but not strongly correlated with the other variables.
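
The recoding that made these correlations possible relies on text columns being stored as factors, whose levels are numbered alphabetically starting from 1. The sketch below uses invented values to show the idea. Note that since R 4.0.0, read.table(…) and read.csv() no longer convert text to factors by default, so on a recent version of R the text column may first need to be wrapped in factor(data), as in the second half of the sketch.

```r
# invented values standing in for the SuccessfullyExecuted column
executed <- factor(c("N", "Y", "Y", "N"))

# the levels are sorted alphabetically: N is level 1, Y is level 2
levels(executed)

# as.numeric() returns the level numbers; subtracting 1 gives N = 0, Y = 1
numericExecuted <- as.numeric(executed) - 1
numericExecuted

# a plain character vector has no levels, so convert it to a factor first
executedText <- c("N", "Y", "Y", "N")
numericFromText <- as.numeric(factor(executedText)) - 1
numericFromText
```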

The insights gleaned from our summary statistics and correlations put us in a prime position to begin developing our regression model.

Pop quiz

  1. Which of the following is a benefit of adding a text variable back into its original dataset after it has been recoded into numeric form?
    1. Calculation functions can be executed on the recoded variable.
    2. Calculation functions can be executed on the other variables in the dataset.
    3. Calculation functions can be executed on the entire dataset.
    4. There is no benefit.
