
This two-part blog series walks through a set of R scripts used to collect and analyze data from the New York Stock Exchange. Collecting data from the stock market in real time can be valuable in at least two ways. First, historical intraday trading data itself has resale value: many companies around the Web sell it. Second, the data can be used to make quick investment decisions. Strategies like day trading and short selling rely on riding waves in the stock market that might last only a few hours or minutes. So, if a person collected daily trading data for long enough, the data would eventually become valuable and could be sold.

While almost any programming language can be used to collect data from the Internet, using R to collect stock market data is particularly convenient if R will also be used to analyze the data and make predictions from it. Additionally, I find R to be an intuitive scripting language suited to a wide range of solutions. I will first discuss how to create a script that collects intraday trading data, then discuss using R to collect historical daily trading data, and finally discuss analyzing this data and making predictions from it. There is a lot of ground to cover, so this post is split into two parts. All of the code and accompanying files can be found in this repository. So, let's get started.

If you don’t have the R binaries installed, go ahead and get them, as they are necessary for following along. Additionally, I would highly recommend using RStudio for development projects centered on R. Although RStudio certainly has flaws, in my opinion it is the best choice.

library(httr)
library(jsonlite)
source('DataFetch.R')

The above three lines load the required packages and source the file containing the functions that actually collect the data. Libraries are a common feature in R. Before you try to do something too complex, check whether an existing library already performs the operation. The R community is extensive and thriving, which makes using R for development that much better.
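As a hedge against running on a fresh R installation, a common pattern (my addition, not part of the original script) is to install a package only when it is absent before loading it:

```r
# Install httr and jsonlite only if they are not already available,
# then load them. Guards against running on a machine without them.
for (pkg in c("httr", "jsonlite")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}
```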

Sys.sleep(55*60)

frame.list <- list()

ticker <- function(rest.time){
  ptm <- proc.time()
  df <- data.frame(get.data(), Date = date())
  timer.time <- proc.time() - ptm
  # Subtract the query's elapsed time so data points stay evenly spaced;
  # never sleep a negative amount if the query ran long.
  Sys.sleep(max(0, as.numeric(rest.time - timer.time[3])))
  return(list(df))
}

The first line above pauses the script until it is time for the stock market to open. I start this script before I go to work in the morning, so 55*60 is about how many seconds pass between when I leave for work and when the market opens. We then initialize an empty list. If you are new to R, you will notice the use of an arrow (<-) instead of an equals sign for assignment. Although the equals sign does work, many people, including me, use the arrow. This list is going to hold the dataframes containing the stock data that is created throughout the day. We then define the ticker function, which is used to repeatedly call the functions that retrieve the data and then return the data in the form of a dataframe.
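Rather than hard-coding 55*60 seconds, the wait could be computed from the clock. The helper below is an illustrative sketch of my own, not part of the original script; it assumes the machine's clock is set to the exchange's time zone and that the market opens at 09:30:

```r
# Sleep until the given opening time on the current day.
# Assumes the system clock is in the exchange's time zone (Eastern Time).
sleep.until.open <- function(open.time = "09:30:00") {
  now <- Sys.time()
  open <- as.POSIXct(paste(format(now, "%Y-%m-%d"), open.time))
  wait <- as.numeric(difftime(open, now, units = "secs"))
  if (wait > 0) Sys.sleep(wait)  # if the market is already open, don't sleep
}
```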

for(i in 1:80){
  frame.list <- c(suppressWarnings(ticker(5*60)), frame.list)
}
save(frame.list, file = "RealTimeData.rda")

The ticker function takes the number of seconds to wait between queries to the market as its only argument. This number is reduced by the length of time the query itself takes, which keeps the timing of the data points consistent. The ticker function is called eighty times at five-minute intervals, and the results are prepended to the list of dataframes. After the for-loop completes, the data is saved in R's native .rda format.
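Once the loop finishes, frame.list holds one dataframe per snapshot. For later analysis it can be convenient to stack them into a single long dataframe; a minimal sketch (this helper is my addition, not in the repository):

```r
# Stack a list of per-snapshot dataframes into one long dataframe.
# Each element is assumed to have the same columns (Symbol, Price, Date).
stack.frames <- function(frame.list) {
  do.call(rbind, frame.list)
}

# Toy example with two fake snapshots:
snap1 <- data.frame(Symbol = "IBM", Price = 120.5, Date = "t1")
snap2 <- data.frame(Symbol = "IBM", Price = 121.0, Date = "t2")
stack.frames(list(snap1, snap2))  # one dataframe with 2 rows, 3 columns
```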

Now let’s look at the functions that fetch the data, located in DataFetch.R. R code can become pretty verbose, so it is good to get in the habit of segmenting your code into multiple files. The functions used to fetch data are displayed below. We will start with the parse.data function because it is the workhorse; the get.data function is more of a controller.

parse.data <- function(symbols, range){
  base.URL <- "http://finance.google.com/finance/info?client=ig&q="
  start <- min(range)
  end <- max(range)
  # Build a comma-separated query string like "NYSE:IBM,NYSE:GE,..."
  symbol.string <- paste0("NYSE:", symbols[start], ",")
  for(i in (start+1):end){
    temp <- paste0("NYSE:", symbols[i], ",")
    symbol.string <- paste(symbol.string, temp, sep = "")
  }
  URL <- paste(base.URL, symbol.string, sep = "")

  data <- GET(URL)
  now <- date()
  bin <- content(data, "raw")
  writeBin(bin, "data.txt")

  # The response begins with a "//" guard prefix, so skip the first two
  # lines and rebuild the JSON array by hand before parsing it.
  conn <- file("data.txt", open = "r")
  linn <- readLines(conn)
  jstring <- "["
  for(i in 3:length(linn)){
    jstring <- paste0(jstring, linn[i])
  }
  close(conn)
  file.remove("data.txt")
  obj <- fromJSON(jstring)

  return(data.frame(Symbol = obj$t, Price = as.numeric(obj$l)))
}

The first function takes a list of stock symbols and the list indices of the symbols that are to be queried. The function then builds a string in the proper format for querying Google Finance for the latest price information on the chosen symbols. The query is performed using the httr package, which handles HTTP requests. The response from the web request is shuttled through a few formats in order to get the data into an easy-to-use shape. The function then returns a dataframe containing the symbols and prices.
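As an aside, the symbol string can also be built without the loop, using paste0's collapse argument. This one-liner is my own sketch, equivalent to the loop above up to the trailing comma, which the API tolerated:

```r
# Build "NYSE:AAA,NYSE:BBB,..." in one vectorized call instead of a loop.
build.symbol.string <- function(symbols, range) {
  paste0("NYSE:", symbols[min(range):max(range)], collapse = ",")
}

build.symbol.string(c("IBM", "GE", "F"), 1:3)  # "NYSE:IBM,NYSE:GE,NYSE:F"
```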

get.data <- function(){
  syms <- read.csv("NYSE.txt", header = TRUE, sep = "\t")
  # Keep only symbols containing a run of three or four capital letters.
  sb <- grep("[A-Z]{4}|[A-Z]{3}", syms$Symbol, perl = FALSE, value = TRUE)
  result <- c()
  list.seq <- seq(1, 2901, 100)
  for(i in 1:(length(list.seq)-1)){
    # Stop each chunk one short of the next start to avoid duplicate rows.
    range <- list.seq[i]:(list.seq[i+1] - 1)
    result <- rbind(result, parse.data(sb, range))
  }
  return(droplevels.data.frame(na.omit(result)))
}

The get.data function above is called by the ticker function. It serves as a controller for the parse.data function, requesting prices in chunks so that each query stays small enough. It also reads the symbol list in from the “NYSE.txt” file, a simple list of stocks in the New York Stock Exchange and their symbols. The symbols are then put through a regex filter that drops symbols not matching the format Google Finance expects.
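To make the filtering step concrete, here is what that pattern keeps and drops on a few toy symbols (example mine). Note that it matches any symbol containing a run of three or four capital letters anywhere in the string:

```r
# Keeps symbols with a run of 3-4 capitals; drops 1- and 2-letter tickers.
syms <- c("IBM", "AAPL", "GE", "F", "BRK.B")
grep("[A-Z]{4}|[A-Z]{3}", syms, value = TRUE)
# IBM and AAPL match; GE and F are dropped (runs shorter than 3);
# BRK.B also matches because it contains the 3-letter run "BRK".
```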

Gathering intraday data from the stock market using R, or any language, is obviously somewhat of a pain; however, if properly executed, the results can be quite useful and valuable. I hope you read part two of this blog series, where we use R to gather and analyze historical stock market data.

About the author

Erik Kappelman is a transportation modeler for the Montana Department of Transportation. He is also the CEO of Duplovici, a technology consulting and web design company.

Erik Kappelman wears many hats including blogger, developer, data consultant, economist, and transportation planner. He lives in Helena, Montana and works for the Department of Transportation as a transportation demand modeler.
