[box type=”note” align=”” class=”” width=””]This article is extracted from the book Machine Learning with R written by Brett Lantz. This book will methodically take you through stages to apply machine learning for data analysis using R.[/box]
In this article, we will explore time series analysis, a popular analysis method, and its practical implementation in R.
When we think about time, we think about years, months, days, hours, minutes, and seconds. Almost any dataset has some attributes recorded over time, especially data related to stocks, sales, purchases, profit, and loss. For example, the price of a stock on the stock exchange can be observed at different points on a given day, month, or year. In any industry domain where sales are an important factor, you can see time series in sales, discounts, customers, and so on. Other domains include, but are not limited to, statistics, economics and budgets, processes and quality control, finance, weather forecasting or any other kind of forecasting, transport, logistics, astronomy, patient studies, and census analysis. In simple words, a time series contains data or observations in time order, spaced at equal intervals.
Time series analysis means finding meaning in time-related data in order to predict what will happen next or to forecast trends on the basis of observed values. There are many methods to fit a time series, smooth out random variation, and gain insights from the dataset.
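As a minimal sketch of what smoothing random variation can look like, the following applies a centered moving average to the built-in AirPassengers dataset. The window length of 3 months and the variable names are assumptions chosen for illustration; they do not come from the book.

```r
# Smooth AirPassengers with a centered 3-month moving average.
# The window length of 3 is an arbitrary choice for illustration.
window_len <- 3
weights <- rep(1 / window_len, window_len)

# stats::filter() applies the weights as a linear filter;
# sides = 2 centers the window on each observation.
smoothed <- stats::filter(AirPassengers, weights, sides = 2)

# The first and last values are NA because the window
# would extend past the ends of the series.
head(smoothed)
```

A longer window gives a smoother curve but leaves more missing values at both ends of the series.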
When you look at time series data you can see the following:
- Trend: A long-term increase or decrease in the observations or data.
- Pattern: A regular, repeating change (often called seasonality), such as a sudden spike in sales due to Christmas or some other festival, or an increase in drug consumption due to some condition. This type of variation has a fixed time duration and can therefore be predicted for future periods as well.
- Cycle: Can be thought of as a pattern whose duration is not fixed; the data rises and falls without a regular period. Such time series involve large fluctuations in the data.
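One way to see the trend and the repeating pattern separately is the decompose function from R's stats package. This is only a sketch on the built-in AirPassengers dataset. The multiplicative type is an assumption made here because the seasonal swings in that series grow as passenger numbers rise. Note that decompose models a fixed seasonal pattern, so irregular cycles end up in the random component.

```r
# Split AirPassengers into trend, seasonal, and random components.
# type = "multiplicative" is an assumption for this dataset.
parts <- decompose(AirPassengers, type = "multiplicative")

# One seasonal factor per calendar month; the same 12 values repeat every year.
round(parts$figure, 2)

# The trend is a moving average, so it is NA at both ends of the series.
head(parts$trend, 12)

# Draw the observed, trend, seasonal, and random panels.
plot(parts)
```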
How to do it
Many of the datasets available with R are of the time series type. Using the class command, you can check whether a dataset is a time series or not. We will look at the AirPassengers dataset, which shows monthly air passengers in thousands from 1949 to 1960. We will also create a new time series of our own.
Perform the following commands in RStudio or R Console:
> class(AirPassengers)
Output: "ts"
> start(AirPassengers)
Output: 1949 1
> end(AirPassengers)
Output: 1960 12
> summary(AirPassengers)
Output:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  104.0   180.0   265.5   280.3   360.5   622.0
In the next recipe, we will create a time series of our own and print it out.
Suppose the share price of some company, in the range of 2,500 to 4,000, is recorded monthly from 2011 to 2016. Run the following code in R; because the values are sampled at random, your output will differ from the one shown:
> my_vector = sample(2500:4000, 72, replace=T)
> my_series = ts(my_vector, start=c(2011,1), end=c(2016,12), frequency = 12)
> my_series
Output:
      Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
2011 2888 3894 3675 3113 3421 3870 2644 2677 3392 2847 2543 3147
2012 2973 3538 3632 2695 3475 3971 2695 2963 3217 2836 3525 2895
2013 3984 3811 2902 3602 3812 3631 2625 3887 3601 2581 3645 3324
2014 3830 2821 3794 3942 3504 3526 3932 3246 3787 2894 2800 2732
2015 3326 3659 2993 2765 3881 3983 3813 3172 2667 3517 3445 2805
2016 3668 3948 2779 2881 3285 2733 3203 3329 3854 3285 3800 2563
How it works
In the first recipe, we called the class function on the AirPassengers dataset and saw that it is of class ts (ts stands for time series). The start and end functions give the time of the first and last observations as a year and a period within that year; here, 1949 1 and 1960 12 mean January 1949 and December 1960. The frequency function tells us the number of observations per unit of time, which indicates the interval of observations: 1 means annual, 4 means quarterly, 12 means monthly, and so on.
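The frequency function is described above but not run in the recipe. The following sketch shows it on AirPassengers alongside a few related functions from the stats package (deltat, cycle, and window); the variable names are chosen here for illustration.

```r
# Number of observations per year: 12 for monthly data.
obs_per_year <- frequency(AirPassengers)
obs_per_year

# Time between two observations, in years (1/12 for monthly data).
deltat(AirPassengers)

# Position of each observation within its year (1 = January, ..., 12 = December).
head(cycle(AirPassengers), 12)

# Extract a sub-series: the twelve monthly values for 1960.
last_year <- window(AirPassengers, start = c(1960, 1), end = c(1960, 12))
last_year
```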
In the next recipe, we want to generate samples between 2,500 and 4,000 to represent the price of a share. Using the sample function, we can create a sample; it takes the range as the first argument and the number of samples required as the second argument. The last argument, replace, decides whether duplicates are allowed in the sample. We stored the sample in my_vector. Now we create a time series using the ts function. The ts function takes the vector as its first argument, followed by start and end, which give the period for which the time series is being constructed. The frequency argument specifies the number of observations per unit of time; here it is 12 because the prices are recorded monthly, so the 72 samples cover the six years from 2011 to 2016.
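Because sample draws values at random, every run of the recipe produces a different series. A minimal sketch of a reproducible variant follows; the call to set.seed and the seed value 42 are additions made here for illustration and are not part of the original recipe.

```r
# Fix the random number generator so reruns give the same values.
# The seed 42 is arbitrary.
set.seed(42)

# 72 monthly prices between 2,500 and 4,000; duplicates are allowed.
my_vector <- sample(2500:4000, 72, replace = TRUE)

# Monthly series from January 2011 to December 2016.
my_series <- ts(my_vector, start = c(2011, 1), end = c(2016, 12), frequency = 12)

class(my_series)      # "ts"
length(my_series)     # 72, that is, 6 years of 12 months
range(my_series)      # smallest and largest sampled price
```

Checking the length against the number of years times the frequency is a quick way to confirm that start, end, and frequency are consistent with the vector supplied.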
To summarize, we looked at how R can be used to inspect and create time series data as a first step in time series analysis.
If you would like to learn more useful machine learning techniques in R, be sure to check out Machine Learning with R.