
This article is an excerpt from a book by Dr. Param Jeet and Prashant Vats titled Learning Quantitative Finance with R. This book will help you learn about various algorithmic trading techniques and ways to optimize them using the tools available in R.

In this tutorial we will learn how logistic regression is used to forecast market direction.

Market direction is very important for investors and traders. Predicting it is challenging because market data is noisy. Over any given horizon the market moves either up or down, so the outcome is binary. A logistic regression model helps us fit a model to this binary behavior and forecast market direction. Logistic regression is a probabilistic model: it assigns a probability to each event. We are going to use the quantmod package. The next three commands load the package into the workspace, import data into R from the Yahoo repository, and extract only the closing price from the data:

>library("quantmod")

>getSymbols("^DJI",src="yahoo")

>dji<- DJI[,"DJI.Close"]

The input data to the logistic regression is constructed from different indicators, such as moving average, standard deviation, RSI, MACD, Bollinger Bands, and so on, which have some predictive power for market direction, that is, Up or Down. These indicators can be constructed using the following commands:

>avg10<- rollapply(dji,10,mean)

>avg20<- rollapply(dji,20,mean)

>std10<- rollapply(dji,10,sd)

>std20<- rollapply(dji,20,sd)

>rsi5<- RSI(dji,5,"SMA")

>rsi14<- RSI(dji,14,"SMA")

>macd12269<- MACD(dji,12,26,9,"SMA")

>macd7205<- MACD(dji,7,20,5,"SMA")

>bbands<- BBands(dji,20,"SMA",2)
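To make the rolling indicators concrete, here is a minimal base-R sketch of what a 10-day rolling mean and standard deviation compute. The synthetic price vector and the helper `roll_stat` are illustrative assumptions, not part of the book's code, and the sketch assumes a trailing window (each value summarizes the current observation and the nine before it).

```r
# Synthetic closing prices (hypothetical data, for illustration only)
set.seed(42)
price <- 10000 + cumsum(rnorm(60, mean = 0, sd = 50))

# Hypothetical helper: apply `fun` over a trailing window of `n` observations.
# The first n - 1 positions have no full window and are left as NA.
roll_stat <- function(x, n, fun) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= n) out[i] <- fun(x[(i - n + 1):i])
  }
  out
}

avg10 <- roll_stat(price, 10, mean)
std10 <- roll_stat(price, 10, sd)

# The value at position 10 summarizes prices 1..10
print(head(cbind(price, avg10, std10), 12))
```

The leading NA values are worth noticing: any row that lacks a full window cannot be used for model fitting.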

The following commands create the variable direction, which takes the value 1 for Up and 0 for Down. The direction is Up when the current price is greater than the price 20 days earlier and Down when the current price is less than the price 20 days earlier:

>direction<- NULL

>direction[dji> Lag(dji,20)] <- 1

>direction[dji< Lag(dji,20)] <- 0
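The labeling rule can be sketched on a plain numeric vector, which makes the edge cases easier to see. The synthetic prices and the names `k` and `lagged` are assumptions made for this illustration; the book's code works on the xts object directly.

```r
# Synthetic closing prices (hypothetical data, for illustration only)
set.seed(1)
price <- 10000 + cumsum(rnorm(100, sd = 50))

k <- 20                                    # look-back in trading days
lagged <- c(rep(NA, k), head(price, -k))   # price k days earlier

# 1 = Up, 0 = Down; positions without a lagged price stay NA
direction <- rep(NA_real_, length(price))
direction[which(price > lagged)] <- 1
direction[which(price < lagged)] <- 0

table(direction, useNA = "ifany")
```

Under this rule the first 20 observations have no label, and a price exactly equal to the lagged price would be neither Up nor Down.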

Now we have to bind all columns consisting of price and indicators, which is shown in the following command:

>dji<- cbind(dji,avg10,avg20,std10,std20,rsi5,rsi14,macd12269,macd7205,bbands,direction)

The dimension of the dji object can be calculated using dim(). I applied dim() to dji and saved the output in dm. dm stores two values: the first is the number of rows and the second is the number of columns in dji. Column names can be extracted using colnames(). The third command extracts the name of the last column. Next I replaced that column name with a more descriptive name, Direction:

>dm<- dim(dji)

>dm

[1] 2493   16

>colnames(dji)[dm[2]]

[1] "..11"

>colnames(dji)[dm[2]] <- "Direction"

>colnames(dji)[dm[2]]

[1] "Direction"

We have extracted the Dow Jones Index (DJI) data into the R workspace. Now, to implement logistic regression, we should divide the data into two parts. The first part is in-sample data and the second part is out-sample data.

In-sample data is used for the model building process and out-sample data is used for evaluation purposes. Evaluating on data the model has not seen helps to detect overfitting, and so to control the variance and bias in the model. The next four lines are for in-sample start, in-sample end, out-sample start, and out-sample end dates:

>issd<- "2010-01-01"

>ised<- "2014-12-31"

>ossd<- "2015-01-01"

>osed<- "2015-12-31"

The following two commands are to get the row number for the dates, that is, the variable isrow extracts row numbers for the in-sample date range and osrow extracts the row numbers for the out-sample date range:

>isrow<- which(index(dji) >= issd& index(dji) <= ised)

>osrow<- which(index(dji) >= ossd& index(dji) <= osed)

The variables isdji and osdji are the in-sample and out-sample datasets respectively:

>isdji<- dji[isrow,]

>osdji<- dji[osrow,]
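The date-based split can be checked on a small example. The sketch below uses a short run of hypothetical daily dates in place of the xts index, and converts the boundary strings to Date objects explicitly; both choices are assumptions made for illustration.

```r
# Hypothetical daily dates and values standing in for the xts index and data
dates  <- seq(as.Date("2014-12-25"), as.Date("2015-01-07"), by = "day")
values <- seq_along(dates)

issd <- as.Date("2010-01-01"); ised <- as.Date("2014-12-31")
ossd <- as.Date("2015-01-01"); osed <- as.Date("2015-12-31")

# Row numbers that fall inside each date range
isrow <- which(dates >= issd & dates <= ised)
osrow <- which(dates >= ossd & dates <= osed)

is_values <- values[isrow]
os_values <- values[osrow]

# The two index sets should not overlap
length(intersect(isrow, osrow))
```

Because the out-sample period starts the day after the in-sample period ends, no observation is used for both fitting and evaluation.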

If you look at the in-sample data, that is, isdji, you will see that the scale of each column is different: a few columns are in the scale of 100, a few others are in the scale of 10,000, and a few others are in the scale of 1. Differences in scale can distort your results, because variables on a larger scale tend to dominate the fit. So before moving ahead, you should consider standardizing the dataset. I will use the following formula:

$$\text{standardized data} = \frac{X - \text{mean}(X)}{\text{std}(X)} \qquad (6.1)$$

The mean and standard deviation of each column can be computed using apply():

>isme<- apply(isdji,2,mean)

>isstd<- apply(isdji,2,sd)

A matrix of ones with the same dimensions as the in-sample data is generated using the following command; it is going to be used for normalization:

>isidn<- matrix(1,dim(isdji)[1],dim(isdji)[2])

Use formula 6.1 to standardize the data:

>norm_isdji<-  (isdji - t(isme*t(isidn))) / t(isstd*t(isidn))

The preceding line also standardizes the direction column, that is, the last column. We don’t want direction to be standardized so I replace the last column again with variable direction for the in-sample data range:

>dm<- dim(isdji)

>norm_isdji[,dm[2]] <- direction[isrow]
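The matrix-of-ones construction simply repeats each column's mean and standard deviation down every row. As a sketch, the same standardization can be written with sweep(); the synthetic matrix `X` and its column names are assumptions made for this illustration.

```r
set.seed(7)
# Hypothetical in-sample matrix with columns on very different scales
X <- cbind(price = rnorm(50, 15000, 800),
           rsi   = rnorm(50, 50, 15),
           macd  = rnorm(50, 0, 1))

mu    <- apply(X, 2, mean)
sigma <- apply(X, 2, sd)

# Approach from the text: expand the column statistics with a matrix of ones
ones   <- matrix(1, nrow(X), ncol(X))
z_ones <- (X - t(mu * t(ones))) / t(sigma * t(ones))

# Equivalent, more compact form using sweep()
z_sweep <- sweep(sweep(X, 2, mu, "-"), 2, sigma, "/")

all.equal(z_ones, z_sweep, check.attributes = FALSE)
round(colMeans(z_sweep), 10)   # each column now has mean 0
apply(z_sweep, 2, sd)          # and standard deviation 1
```

Either form works; sweep() avoids allocating the helper matrix and makes the column-wise intent explicit.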

Now we have created all the data required for model building. You should build a logistic regression model on the in-sample data; it will help you to predict market direction. First, I created a formula which has Direction as the dependent variable and all other columns as independent variables. Then I used a generalized linear model, that is, glm(), which takes the formula, the family, and the dataset as arguments:

>formula<- paste("Direction ~ .",sep="")

>model<- glm(formula,family="binomial",norm_isdji)

A summary of the model can be viewed using the following command:

>summary(model)
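For readers who want to try glm() without downloading market data, the following sketch fits the same kind of model on synthetic data. The data-generating process, the predictor names `x1` and `x2`, and the coefficient values are all assumptions made for illustration.

```r
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)

# Hypothetical data-generating process: Up (1) is more likely when x1 is high
p  <- plogis(0.5 + 2 * x1 - 1 * x2)
df <- data.frame(Direction = rbinom(n, 1, p), x1 = x1, x2 = x2)

# "Direction ~ ." regresses Direction on every other column of df
model <- glm(Direction ~ ., family = binomial, data = df)

coef(model)      # signs should agree with the generating process above
summary(model)   # standard errors, z values and deviance
```

In the summary, the coefficient table is the part to read first: a small p-value suggests that the corresponding indicator carries information about direction.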

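Next use predict() to compute fitted values on the same dataset. By default, predict() on a glm object returns values on the scale of the linear predictor (the log-odds), not probabilities: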

>pred<- predict(model,norm_isdji)

Once you have the fitted values, convert them to probabilities using the following command, which applies the logistic function. The output will lie in the range (0,1):

>prob<- 1 / (1+exp(-(pred)))
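The manual transformation is the logistic function, which base R also provides as plogis(). The sketch below, on synthetic data chosen only for illustration, checks that the manual formula, plogis(), and predict() with type = "response" give the same probabilities.

```r
set.seed(99)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(1.5 * x))   # hypothetical binary outcome
fit <- glm(y ~ x, family = binomial)

log_odds    <- predict(fit)                    # default type = "link"
prob_manual <- 1 / (1 + exp(-log_odds))        # transformation used in the text
prob_plogis <- plogis(log_odds)                # built-in logistic function
prob_direct <- predict(fit, type = "response") # probabilities in one step

all.equal(prob_manual, prob_plogis)
all.equal(prob_manual, prob_direct)
range(prob_manual)
```

Asking predict() for type = "response" is a convenient shortcut when the log-odds themselves are not needed.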

The figure shown below is plotted using the following commands. The first line of the code shows that we divide the figure into two rows and one column, where the first figure is for prediction of the model and the second figure is for probability:

>par(mfrow=c(2,1))

>plot(pred,type="l")

>plot(prob,type="l")

head() can be used to look at the first few values of the variable:

>head(prob)

2010-01-04 2010-01-05 2010-01-06 2010-01-07

0.8019197  0.4610468  0.7397603  0.9821293

The following figure shows the above-defined variable pred, which is a real number, and its conversion between 0 and 1, which represents probability, that is, prob, using the preceding transformation:


Figure 6.1: Prediction and probability distribution of DJI

The probabilities lie in the range (0,1), and so does our vector prob. Now, to classify them as one of the two classes, I considered the direction Up (1) when prob is greater than 0.5 and Down (0) when prob is less than or equal to 0.5. This assignment can be done using the following commands. prob > 0.5 generates TRUE for points where the probability is greater than 0.5, and pred_direction[prob > 0.5] assigns 1 to all such points. Similarly, the next statement assigns 0 when the probability is less than or equal to 0.5:

>pred_direction<- NULL

>pred_direction[prob> 0.5] <- 1

>pred_direction[prob<= 0.5] <- 0
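The same thresholding can be written in a single vectorized step with ifelse(). The probabilities below are made-up values for illustration, including one exactly at the cutoff.

```r
prob <- c(0.80, 0.46, 0.74, 0.98, 0.50)   # hypothetical probabilities

# Vectorized rule: Up (1) above the 0.5 cutoff, Down (0) otherwise
pred_direction <- ifelse(prob > 0.5, 1, 0)
pred_direction
```

A probability of exactly 0.5 is classified as Down, which matches the `<=` in the second assignment above.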

Once we have the predicted direction, we should check model accuracy: how often our model has predicted Up as Up and Down as Down. There will be some cases where it predicted the opposite of what happened, such as predicting Down when the direction is actually Up and vice versa. We can use confusionMatrix() from the caret package, which gives a matrix as an output. All diagonal elements are correct predictions and off-diagonal elements are errors. One should aim to reduce the off-diagonal elements in a confusion matrix:

>install.packages('caret')

>library(caret)

>matrix<- confusionMatrix(factor(pred_direction,levels=c(0,1)),factor(as.numeric(norm_isdji$Direction),levels=c(0,1)))

>matrix

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 362  35
         1  42 819

               Accuracy : 0.9388
                 95% CI : (0.9241, 0.9514)
    No Information Rate : 0.6789
    P-Value [Acc > NIR] : <2e-16
                  Kappa : 0.859
 Mcnemar's Test P-Value : 0.4941
            Sensitivity : 0.8960
            Specificity : 0.9590
         Pos Pred Value : 0.9118
         Neg Pred Value : 0.9512
             Prevalence : 0.3211
         Detection Rate : 0.2878
   Detection Prevalence : 0.3156
      Balanced Accuracy : 0.9275

The preceding table shows that about 94% of predictions are correct: 362+819 = 1181 correct predictions out of 1258 (the sum of all four values). In-sample accuracy above 80% is generally considered good; however, 80% is not a fixed threshold, and one has to choose this value based on the dataset and industry.

Now you have implemented the logistic regression model, which classified 94% of the in-sample data correctly, and you need to test its generalization power. One should test this model and its accuracy using out-sample data. The first step is to standardize the out-sample data using formula (6.1). Here the mean and standard deviation should be the same as those used for in-sample normalization:

>osidn<- matrix(1,dim(osdji)[1],dim(osdji)[2])

>norm_osdji<-  (osdji - t(isme*t(osidn))) / t(isstd*t(osidn))

>norm_osdji[,dm[2]] <- direction[osrow]
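Reusing the in-sample statistics matters because the out-sample data must not influence anything the model was fitted on. A minimal sketch with synthetic matrices (the names `train` and `test` are assumptions made for illustration) shows the idea using scale() with explicit centering and scaling vectors.

```r
set.seed(11)
train <- matrix(rnorm(200, mean = 100, sd = 10), ncol = 2)  # hypothetical in-sample data
test  <- matrix(rnorm(40,  mean = 105, sd = 10), ncol = 2)  # hypothetical out-sample data

# Statistics come from the in-sample (training) data only
mu    <- colMeans(train)
sigma <- apply(train, 2, sd)

# scale() accepts explicit center and scale vectors
norm_train <- scale(train, center = mu, scale = sigma)
norm_test  <- scale(test,  center = mu, scale = sigma)

colMeans(norm_train)  # approximately 0 by construction
colMeans(norm_test)   # generally not 0: the test set was not used to compute mu
```

The standardized out-sample columns will usually not have mean 0 and standard deviation 1, and that is expected.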

Next we use predict() on the out-sample data and use this value to calculate probability:

>ospred<- predict(model,norm_osdji)

>osprob<- 1 / (1+exp(-(ospred)))

Once probabilities are determined for the out-sample data, you should classify them as either Up or Down using the following commands. confusionMatrix() here will generate a matrix for the out-sample data:

>ospred_direction<- NULL

>ospred_direction[osprob> 0.5] <- 1

>ospred_direction[osprob<= 0.5] <- 0

>osmatrix<- confusionMatrix(factor(ospred_direction,levels=c(0,1)),factor(as.numeric(norm_osdji$Direction),levels=c(0,1)))

>osmatrix

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 115  26
         1  12  99

               Accuracy : 0.8492
                 95% CI : (0.7989, 0.891)

This shows 85% accuracy on the out-sample data. A realistic trading model also accounts for trading cost and market slippage, which decrease the winning odds significantly.
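The headline statistics can be reproduced by hand from the four counts in each confusion matrix. The sketch below copies the counts reported above into plain matrices; the helper functions `accuracy`, `sensitivity`, and `specificity` are written for this illustration and are not caret functions.

```r
# Counts copied from the two confusion matrices in the text
# (rows = prediction, columns = reference)
labels <- list(Prediction = c("0", "1"), Reference = c("0", "1"))
in_sample  <- matrix(c(362, 42, 35, 819), nrow = 2, dimnames = labels)
out_sample <- matrix(c(115, 12, 26, 99),  nrow = 2, dimnames = labels)

# Share of observations on the diagonal
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Class "0" is treated as the positive class, as in the printed output
sensitivity <- function(cm) cm["0", "0"] / sum(cm[, "0"])
specificity <- function(cm) cm["1", "1"] / sum(cm[, "1"])

round(accuracy(in_sample), 4)     # 0.9388, as reported
round(accuracy(out_sample), 4)    # 0.8492, as reported
round(sensitivity(in_sample), 4)  # 0.896
round(specificity(in_sample), 4)  # 0.959
```

The drop from 94% in-sample to 85% out-sample accuracy is the kind of gap the out-sample test is designed to reveal.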

In this tutorial we saw how a logistic regression model can be fitted to the binary behavior of the market and used to forecast market direction.

If you enjoyed this excerpt, check out the book Learning Quantitative Finance with R to dive deep into the vast world of algorithmic and machine-learning based trading.
