Statistical Analysis with R

Take control of your data and produce superior statistical analysis with R.

  • An easy introduction for people who are new to R, with plenty of strong examples for you to work through
  • This book will take you on a journey to learn R as the strategist for an ancient Chinese kingdom!
  • A step by step guide to understand R, its benefits, and how to use it to maximize the impact of your data analysis
  • A practical guide to conduct and communicate your data analysis with R in the most effective manner

 

Time for action — creating a line chart

The ever-popular line chart, or line graph, depicts relationships as a continuous series of connected data points. Line charts are particularly useful for visualizing specific values and trends over time. Just as a line chart is an extension of a scatterplot in the non-digital realm, a line chart in R is created using an extended form of the plot(…) function. Let us explore how to extend the plot(…) function to create line charts in R:

  1. Use the type argument within the plot(…) function to create a line chart that depicts a single relationship between two variables:

    > #create a line chart that depicts the durations of past fire attacks
    > #get the data to be used in the chart
    > lineFireDurationDataX <- 1:30
    > lineFireDurationDataY
    > #customize the chart
    > lineFireDurationMain
    > lineFireDurationLabX
    > lineFireDurationLabY
    > #use the type argument to connect the data points with a line
    > lineFireDurationType <- "o"
    > #use plot(...) to create and display the line chart
    > plot(x = lineFireDurationDataX, y = lineFireDurationDataY,
    main = lineFireDurationMain, xlab = lineFireDurationLabX,
    ylab = lineFireDurationLabY, type = lineFireDurationType)

  2. Your chart will be displayed in the graphic window, as follows:

What just happened?

We expanded our use of the plot(…) function to generate a line chart and encountered a new data notation in the process. Let us review these features.

type

In the plot(…) function, the type argument determines what kind of line, if any, should be used to connect a chart’s data points. The type argument receives one of several character values, all of which are listed as follows:

  • p: only points are plotted; this is the default value when type is undefined
  • l: only lines are drawn, without any points
  • o: both lines and points are drawn, with the lines overlapping the points
  • b: both lines and points are drawn, with the lines broken where they intersect with points
  • c: only lines are drawn, but they are broken where points would occur
  • s: only lines are drawn, in step formation; each step runs horizontally first, then vertically
  • S: (uppercase) only lines are drawn, in step formation; each step runs vertically first, then horizontally
  • h: vertical lines are drawn to represent each point
  • n: neither points nor lines are drawn
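To see how these values differ, the following minimal sketch draws the same small dataset once per type value. The battle numbers and durations are made-up stand-ins, not the book's data, and the 3-by-3 grid layout is just one convenient way to compare the results.

```r
# Hypothetical data: made-up durations (in days) for six battles
battle <- 1:6
days <- c(3, 7, 4, 9, 5, 8)

# The nine type values described in the list above
types <- c("p", "l", "o", "b", "c", "s", "S", "h", "n")

# Arrange the charts in a 3-by-3 grid and remember the previous layout
oldPar <- par(mfrow = c(3, 3))

# Draw the same data once for each type value
for (lineType in types) {
  plot(x = battle, y = days, type = lineType,
       main = paste0("type = \"", lineType, "\""),
       xlab = "Battle Number", ylab = "Duration in Days")
}

# Restore the previous layout
par(oldPar)
```

Comparing the "s" and "S" panels side by side is probably the quickest way to see how the two step formations differ.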

Our chart, which represented the duration of past fire attacks, featured a line that overlapped the plotted points. First, we defined our desired line type in an R variable:

> lineFireDurationType <- "o"

Then the type argument was placed within our plot(…) function to generate the line chart:

> plot(lineFireDurationDataX, lineFireDurationDataY,
main = lineFireDurationMain, xlab = lineFireDurationLabX,
ylab = lineFireDurationLabY,
type = lineFireDurationType)
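If you do not have the battle data at hand, the same workflow can be sketched with stand-in values. Everything in the following example is an assumption made for illustration: the variable names, the randomly generated durations, and the title and label text are hypothetical and do not come from the original dataset.

```r
# Hypothetical stand-in data: 30 made-up battle durations, in days
set.seed(42)              # makes the made-up values reproducible
battleNumber <- 1:30      # generic identification numbers for the battles
durationDays <- sample(5:20, size = 30, replace = TRUE)

# Hypothetical chart text
chartMain <- "Duration of Past Fire Attacks (made-up data)"
chartLabX <- "Battle Number"
chartLabY <- "Duration in Days"

# "o" draws both points and lines, with the lines overlapping the points
chartType <- "o"

# Create and display the line chart
plot(x = battleNumber, y = durationDays,
     main = chartMain, xlab = chartLabX, ylab = chartLabY,
     type = chartType)
```

Changing chartType to another value, such as "l" or "b", redraws the same data with a different kind of line.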

Number-colon-number notation

You may have noticed that we specified a vector for the x-axis data in our plot(…) function.

> lineFireDurationDataX <- 1:30

This vector used number-colon-number notation, which enumerates a range of values between the number that precedes the colon and the number that follows it. Starting from the first number, it adds one repeatedly and stops at the last value that is less than or equal to the number after the colon. For example, the code > 14:21 would yield eight whole numbers, beginning with 14 and ending with 21, as follows:

[1] 14 15 16 17 18 19 20 21

Furthermore, the code > 14.2:21 would yield seven values, beginning with 14.2 and ending with 20.2, as follows:

[1] 14.2 15.2 16.2 17.2 18.2 19.2 20.2

Number-colon-number notation is a useful way to enumerate a series of values without having to type each one individually. It can be used in any circumstance where a series of values is acceptable input into an R function.

Number-colon-number notation can also enumerate values from high to low. For instance, 21:14 would yield a sequence of values beginning with 21 and ending with 14.

Since we do not have exact dates or other identifying information for our 30 past battles, we simply enumerated the numbers 1 through 30 on the x-axis. This had the effect of assigning a generic identification number to each of our past battles, which in turn allowed us to plot the duration of each battle on the y-axis.
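The examples in this section can be tried directly in the console. The following sketch stores each sequence in a variable (the variable names are hypothetical) and also illustrates one point worth checking when an endpoint is computed: the colon operator is evaluated before arithmetic operators such as minus, so parentheses change the result.

```r
ascending  <- 14:21     # whole numbers from 14 up to 21
fractional <- 14.2:21   # starts at 14.2 and stops before exceeding 21
descending <- 21:14     # counts down by one instead of up

print(ascending)
print(fractional)
print(descending)
print(length(ascending))   # how many values were generated

# The colon is evaluated before the minus sign, so these two differ
n <- 5
print(1:n - 1)     # same as (1:n) - 1: the values 1 to 5, each reduced by one
print(1:(n - 1))   # the range from 1 to n - 1
```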

Pop quiz

  1. Which of the following is the type argument capable of?
    1. Drawing a line to connect or replace the points on a scatterplot.
    2. Drawing vertical or step lines.
    3. Drawing no points or lines.
    4. All of the above.
  2. What would the following line of code yield in the R console?

    > 1:50

    1. A sequence of 50 whole numbers, in order from 1 to 50.
    2. A sequence of 50 whole numbers, in order from 50 to 1.
    3. A sequence of 50 random numbers, in order from 1 to 50.
    4. A sequence of 50 random numbers, in order from 50 to 1.

Time for action — creating a box plot

A box plot is a useful way to convey a collection of summary statistics for a dataset. This type of graph depicts a dataset’s minimum and maximum, as well as its lower quartile, median, and upper quartile, in a single diagram. Let us look at how box plots are created in R:

  1. Use the boxplot(…) function to create a box plot.

    > #create a box plot that depicts the number of soldiers required to launch a fire attack
    > #get the data to be used in the plot
    > boxplotFireShuSoldiersData
    > #customize the plot
    > boxPlotFireShuSoldiersLabelMain
    > boxPlotFireShuSoldiersLabelX
    > boxPlotFireShuSoldiersLabelY
    > #use boxplot(...) to create and display the box plot
    > boxplot(x = boxplotFireShuSoldiersData,
    main = boxPlotFireShuSoldiersLabelMain,
    xlab = boxPlotFireShuSoldiersLabelX,
    ylab = boxPlotFireShuSoldiersLabelY)

  2. Your plot will be displayed in the graphic window, as shown in the following:

  3. Use the boxplot(…) function to create a box plot that compares multiple datasets.

    > #create a box plot that compares the number of soldiers required across the battle methods
    > #get the data formula to be used in the plot
    > boxplotAllMethodsShuSoldiersData <- battleHistory$ShuSoldiers ~ battleHistory$Method
    > #customize the plot
    > boxPlotAllMethodsShuSoldiersLabelMain
    > boxPlotAllMethodsShuSoldiersLabelX
    > boxPlotAllMethodsShuSoldiersLabelY
    > #use boxplot(...) to create and display the box plot
    > boxplot(formula = boxplotAllMethodsShuSoldiersData,
    main = boxPlotAllMethodsShuSoldiersLabelMain,
    xlab = boxPlotAllMethodsShuSoldiersLabelX,
    ylab = boxPlotAllMethodsShuSoldiersLabelY)

  4. Your plot will be displayed in the graphic window, as shown in the following:

What just happened?

We just created two box plots using R’s boxplot(…) function, one with a single box and one with multiple boxes.

boxplot(…)

We started by generating a single box plot that was composed of a dataset, main title, and x and y labels. The basic format for a single box plot is as follows:

boxplot(x = dataset)

The x argument contains the data to be plotted. Technically, only x is required to create a box plot, although you will often include additional arguments. Our boxplot(…) function used the main, xlab, and ylab arguments to display text on the plot, as shown:

> boxplot(x = boxplotFireShuSoldiersData,
main = boxPlotFireShuSoldiersLabelMain,
xlab = boxPlotFireShuSoldiersLabelX,
ylab = boxPlotFireShuSoldiersLabelY)
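As a self-contained sketch, the single box plot can be reproduced with stand-in values. The soldier counts, variable names, and label text below are made up for illustration. The example also stores the value that boxplot(…) returns invisibly, which holds the statistics behind the drawing. With R's default settings, a value that lies far from the box is drawn as an individual point, so a whisker may stop short of the true minimum or maximum.

```r
# Hypothetical data: made-up soldier counts for ten fire attacks
soldiers <- c(1200, 1500, 1800, 2000, 2100, 2400, 2600, 3000, 3300, 9000)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
print(fivenum(soldiers))

# Create the box plot and keep the statistics that boxplot(...) returns
singleBox <- boxplot(x = soldiers,
                     main = "Soldiers Required for a Fire Attack (made-up data)",
                     xlab = "Fire Attack",
                     ylab = "Number of Soldiers")

print(singleBox$stats)   # whisker ends, hinges, and median as drawn
print(singleBox$out)     # values drawn individually as outliers
```

In this made-up dataset, the value 9000 is reported in singleBox$out rather than at the end of the upper whisker.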

Next, we created a multiple box plot that compared the number of Shu soldiers deployed by each battle method. The main, xlab, and ylab arguments remained from our single box plot; however, our multiple box plot used the formula argument instead of x. Here, a formula allows us to break a dataset down into separate groups, thus yielding multiple boxes.

The basic format for a multiple box plot is as follows:

boxplot(formula = dataset ~ group)

In our case, we took our entire Shu soldier dataset (battleHistory$ShuSoldiers) and separated it by battle method (battleHistory$Method):

> boxplotAllMethodsShuSoldiersData <- battleHistory$ShuSoldiers ~ battleHistory$Method

Once incorporated into the boxplot(…) function, this formula resulted in a plot that contained four distinct boxes—ambush, fire, head to head, and surround:

> boxplot(formula = boxplotAllMethodsShuSoldiersData,
main = boxPlotAllMethodsShuSoldiersLabelMain,
xlab = boxPlotAllMethodsShuSoldiersLabelX,
ylab = boxPlotAllMethodsShuSoldiersLabelY)
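The grouped version can be sketched the same way. The data frame below is a hypothetical stand-in for battleHistory with made-up soldier counts, and the label text is likewise assumed. This sketch also passes the data frame through the data argument of boxplot(…), a common alternative to writing battleHistory$ in front of each column name in the formula.

```r
# Hypothetical stand-in for battleHistory: five made-up battles per method
battles <- data.frame(
  Method = factor(rep(c("ambush", "fire", "head to head", "surround"),
                      each = 5)),
  ShuSoldiers = c(1000, 1200, 1500, 1700, 2000,    # ambush
                  2500, 3000, 3200, 3600, 4000,    # fire
                  8000, 9000, 9500, 11000, 12000,  # head to head
                  5000, 5500, 6000, 6800, 7500)    # surround
)

# The formula reads as "ShuSoldiers grouped by Method"
groupedBox <- boxplot(ShuSoldiers ~ Method, data = battles,
                      main = "Soldiers Required by Battle Method (made-up data)",
                      xlab = "Battle Method",
                      ylab = "Number of Soldiers")

print(groupedBox$names)   # one box per battle method
print(groupedBox$n)       # number of observations behind each box
```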

Pop quiz

  1. Which of the following best describes the result of the following code?

    > boxplot(x = a)

    1. A single box plot of the a dataset.
    2. A single box plot of the x dataset.
    3. A multiple box plot of the a dataset that is grouped by x.
    4. A multiple box plot of the x dataset that is grouped by a.
  2. Which of the following best describes the result of the following code?

    > boxplot(formula = a ~ b)

    1. A single box plot of the a dataset.
    2. A single box plot of the b dataset.
    3. A multiple box plot of the a dataset that is grouped by b.
    4. A multiple box plot of the b dataset that is grouped by a.
