Categories: Tutorials

Gathering and analyzing stock market data with R, Part 2


Welcome to the second installment of this series. The previous post covered collecting real-time stock market data using R. This second part looks at a few ways to analyze historical stock market data using R. If you are just interested in learning how to analyze historical data, the first blog isn’t necessary. The code accompanying these blogs is located here.

To begin, we need some data. The lines of code below load the ‘quantmod’ library, a very useful R package for financial analysis, and then use quantmod to gather data on a list of stock symbols:

library(quantmod)
syms <- read.table("NYSE.txt", header = TRUE, sep = "\t")
smb <- grep("[A-Z]{4}", syms$Symbol, perl = FALSE, value = TRUE)
getSymbols(smb)

I find the getSymbols() function somewhat awkward for gathering data on multiple companies, because it creates a separate object for each company in the package’s ‘xts’ format. That is convenient if you plan to use quantmod’s own tools for analysis, but I prefer other kinds of tools, so the data needs some reshaping before I can analyze it:

mat <- c()
stocks <- c()
stockList <- list()
names <- c()
for(i in 1:length(smb)){
  temp <- get(smb[i])                        # grab the xts object created by getSymbols()
  names <- c(names, smb[i])
  stockList[[i]] <- as.numeric(getPrice(temp))
  len <- length(attributes(temp)$index)
  if(len < 1001) next                        # skip stocks with fewer than 1001 trading days
  stocks <- c(stocks, smb[i])
  temp2 <- temp[(len-1000):len]              # keep the most recent 1001 observations
  vex <- as.numeric(getPrice(temp2))
  mat <- rbind(mat, vex)
}

The code above loops through the objects created by the getSymbols() function. Using the get() function from the ‘base’ package, each symbol string is used to grab that symbol’s data. For every stock, the loop records the symbol in a vector and adds a vector of prices to the growing list of stock data. If the series contains at least 1,001 trading days, the most recent 1,001 observations are also added to a matrix. The reason for this distinction is that we will be looking at two methods of analysis: one requires all of the series to be the same length, and the other is length-agnostic, so series that are too short will only be analyzed with the second method. Check out the following script:

names(stockList) <- names
stock.mat <- as.matrix(mat)
row.names(stock.mat) <- stocks
colnames(stock.mat) <- as.character(index(temp2))
save(stock.mat, stockList, file = "StockData.rda")
rm(list = ls())

The above script names the data properly and saves the data to an R data file. The final line of code cleans the workspace because the getSymbols() function leaves quite a mess. The data is now in the correct format for us to begin our analysis.
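If you want to confirm that the reshaping worked before moving on, a quick sanity check might look like the sketch below; it simply re-loads the file we just saved:

load("StockData.rda")
dim(stock.mat)               # long-history stocks by 1,001 trading days
length(stockList)            # every symbol gathered, regardless of history length
head(row.names(stock.mat))   # a few of the symbols that made the 1,001-day cut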

It is worth pointing out that what I am about to show won’t get you an A in most statistics or economics classes, because I am going to take a very practical approach with little regard for the proper assumptions. Although these assumptions are important, when all you need is an accurate forecast, it is easier to get away with models that are not entirely theoretically sound. We are not trying to make arguments about causality or association; we are trying to guess the direction of the market:

library(mclust)
library(vars)
load("StockData.rda")

In this first example of analysis, I put forth a clustering-based Vector Autoregression (VAR) method of my own design. The block above loads the necessary packages and the data we just created; the first step of the method is the clustering itself:

cl <- Mclust(stock.mat, G = 1:9)
stock.mat <- cbind(stock.mat, cl$classification)

The first thing to do is identify the clusters that exist within the stock market data. In this case, we use a model-based clustering method, which assumes the data are draws from a mixture of probability distributions. This allows the clusters to be based on the covariance of companies’ stock prices instead of just grouping together companies with similar nominal prices. The Mclust() function selects the model and number of clusters that score best on the Bayesian Information Criterion (BIC). You will likely have to restrict the number of clusters, as a common complaint about model-based clustering is a ‘more clusters are always better’ result.

The data is separated into clusters to make the VAR technique computationally realistic. One of the nice things about VAR is how few assumptions must be met in order to include a time series in the analysis. Also, VAR regresses several time series against one another and against themselves at the same time, which may capture more of the covariance needed to produce reliable forecasts. We are looking at over 1,000 time series, which is too many for a single VAR, so the clustering is used to group the series into smaller sets, each with its own VAR. A quick look at the cluster assignments, sketched below, shows how the stocks were divided.
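This is a rough check using the model object we just created; the exact counts will depend on your own data pull:

summary(cl)                  # which mixture model was selected and how many clusters
table(cl$classification)     # how many stocks fall into each cluster

With the cluster assignments in hand, we can pull out a single cluster and fit a VAR to it: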

cluster <- stock.mat[stock.mat[,1002] == 6, 1:1001]    # pull the series assigned to cluster 6
ts <- ts(t(cluster))
fit <- VAR(ts[1:(1001-10),], p = 10)                   # hold out the last ten days
preds <- predict(fit, n.ahead = 10)
forecast <- preds$fcst$TEVA
plot.ts(ts[950:1001, 8], ylim = c(36,54))              # column 8 corresponds to TEVA in this run
lines(y = forecast[,1], x = (52-9):52, col = "blue")   # point forecast over the last ten plotted days
lines(y = forecast[,2], x = (52-9):52, col = "red", lty = 2)   # lower bound
lines(y = forecast[,3], x = (52-9):52, col = "red", lty = 2)   # upper bound

The code above takes the time series that belong to the ‘6’ cluster and fits a VAR that looks back ten steps. We cut off the last ten days of data and use the VAR to predict those ten days, then plot the predicted values against the actual values to see whether the predictions are behaving sensibly. The resulting plot shows that the predictions are not perfect but will probably work well enough. A quick numeric check is sketched below.
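To put a rough number on the fit, one could compare the point forecast with the ten held-out days. This is a sketch that assumes, as in the plot above, that TEVA is the eighth series in this cluster:

actual <- ts[992:1001, 8]                        # the ten days held back from the VAR fit
rmse <- sqrt(mean((forecast[,1] - actual)^2))    # root mean squared error of the point forecast
rmse

Satisfied that the approach works well enough, we can run the same procedure on every cluster: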

for(i in 1:8){
  assign(paste0("cluster.", i), stock.mat[stock.mat[,1002] == i, 1:1001])
  assign(paste0("ts.", i), ts(t(get(paste0("cluster.", i)))))
  temp <- get(paste0("ts.", i))
  assign(paste0("fit.", i), VAR(temp, p = 10))
  assign(paste0("preds.", i), predict(get(paste0("fit.", i)), n.ahead = 10))
}

stock.mat <- cbind(stock.mat, 0)                   # column 1003 will hold the VAR forecasts
for(j in 1:8){
  pred.vec <- c()
  temp <- get(paste0("preds.", j))
  for(i in temp$fcst){                             # one forecast matrix per stock in the cluster
    cast <- i[10, ]                                # the ten-day-ahead row (fcst, lower, upper, CI)
    pred.vec <- c(pred.vec, cast[1])               # keep only the point forecast
  }
  stock.mat[stock.mat[,1002] == j, 1003] <- pred.vec
}

The loops above perform a VAR on each of the 8 clusters with more than one member. After these VARs are performed, a ten-day forecast is carried out. The value of each stock at the end of the ten-day forecast is then appended onto the end of the stock data matrix:

stock.mat <- stock.mat[stock.mat[,1002] != 9,]     # drop the cluster that was not run through a VAR
stock.mat <- cbind(stock.mat, (stock.mat[,1003] - stock.mat[,1001]) / stock.mat[,1001] * 100)
stock.mat <- stock.mat[order(-stock.mat[,1004]),]  # sort by forecasted percentage change
stock.mat[1:10, 1004]                              # top ten forecasted gainers
rm(list = ls())

The final lines of code calculate each stock’s forecasted percentage change after 10 days and then display the top 10 stocks by that measure. The workspace is then cleared. The second method, which is length-agnostic, fits a separate ARIMA model to every series in stockList:

load("StockData.rda")
library(forecast)
forecasts<- c()
names<- c()
for(i in1:length(stockList)){
mod<-auto.arima(stockList[[i]])
cast<- forecast(mod)
cast<-cast$mean[ten]
temp<- c(as.numeric(stockList[[i]][length(stockList[[i]])]),as.numeric(cast))
forecasts<-rbind(forecasts,temp)
names<- c(names,names(stockList[i]))
}
forecasts<- matrix(ncol=2,forecasts)
forecasts<-cbind(forecasts,(forecasts[,2]- forecasts[,1])/forecasts[,1]*ten0)
colnames(forecasts)<- c("Price","Forecast","% Change")
row.names(forecasts)<- names
forecasts<- forecasts[order(-forecasts[,3]),]
rm(list =ls())

The final bit of code is simpler. Using the ‘forecast’ package’s auto.arima() function, we fit an ARIMA model to each stock in our stockList. The auto.arima() function is a must-have for forecasters using R: it selects the ARIMA specification that scores best on a chosen measure of statistical accuracy. The default is the corrected Akaike Information Criterion (AICc), which will work fine for our purposes. Once the forecasts are complete, the script ranks all of the stocks by their forecasted percentage change over a ten-day horizon, so the top performers appear first.
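If you want to see what auto.arima() actually selected for a particular series, a quick sketch like the following can help. It assumes ‘TEVA’ is among the symbols gathered earlier; substitute any symbol from your own stockList:

mod <- auto.arima(stockList[["TEVA"]])   # fit an ARIMA model to a single price series
summary(mod)                             # shows the selected (p,d,q) orders along with AIC, AICc, and BIC
plot(forecast(mod, h = 10))              # ten-day forecast with prediction intervals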

These blogs have discussed how to gather and analyze stock market data using R. I hope they have been informative and will help you with data analysis in the future.

About the author

Erik Kappelman is a transportation modeler for the Montana Department of Transportation. He is also the CEO of Duplovici, a technology consulting and web design company.

