Today we will learn to develop an options trading web app using the Q-learning algorithm, and we will also evaluate the resulting model.

Developing an options trading web app using Q-learning

Algorithmic trading is the process of using computers programmed to follow a defined set of instructions for placing trades in order to generate profits at a speed and frequency that is impossible for a human trader. The defined sets of rules are based on timing, price, quantity, or any mathematical model.

Problem description

Through this project, we will predict the price of an option on a security N days in the future according to the current set of observed features derived from the time to expiration, the price of the security, and volatility. The question is: what model should we use for option pricing? There are actually many; the Black-Scholes partial differential equation (PDE) is one of the most widely recognized.

In mathematical finance, the Black-Scholes equation is a PDE governing the price evolution of a European call or a European put under the Black-Scholes model. For a European call or put on an underlying stock paying no dividends, the equation is:

$$\frac{\partial V}{\partial t} + \frac{1}{2}\sigma^{2}S^{2}\frac{\partial^{2} V}{\partial S^{2}} + rS\frac{\partial V}{\partial S} - rV = 0$$

Where V is the price of the option as a function of stock price S and time t, r is the risk-free interest rate, and σ is the volatility of the stock. One of the key financial insights behind the equation is that anyone can perfectly hedge the option by buying and selling the underlying asset in just the right way, without any risk. This hedge implies that there is only one right price for the option, as returned by the Black-Scholes formula.

Consider a call option on IBM with a January maturity and an exercise price of $95, and suppose you also write a January IBM put option with an exercise price of $85. Let us focus on the call options of a given security, IBM. The following chart plots the daily price of the IBM stock and its derivative call option for May 2014, with a strike price of $190:

Figure 1: IBM stock and call $190 May 2014 pricing in May-Oct 2013

Now, what will the profit and loss be for this position if IBM is selling at $87 on the option maturity date? Alternatively, what if IBM is selling at $100? Well, it is not easy to compute or predict the answer. However, in options trading, the price of an option depends on a few parameters:

  • Time to expiration of the option (time decay)
  • The price of the underlying security
  • The volatility of returns of the underlying asset

A pricing model usually does not consider the variation in the trading volume of the underlying security. Therefore, some researchers have included it in the option trading model. As we have described, any RL-based algorithm should have an explicit state (or states), so let us define the state of an option using the following four normalized features:

  • Time decay (timeToExp): This is the time to expiration, normalized to the range (0, 1).
  • Relative volatility (volatility): This is the relative variation of the price of the underlying security within a trading session. It differs from the more complex volatility of returns defined in, for example, the Black-Scholes model.
  • Volatility relative to volume (vltyByVol): This is the relative volatility of the price of the security adjusted for its trading volume.
  • Relative difference between the current price and the strike price (priceToStrike): This measures the ratio of the difference between the price and the strike price to the strike price.
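
The four features above can be sketched in a few lines of Scala. This is only an illustration under stated assumptions: the Session record, the min-max normalize helper, the definition of relative volatility as 1 - low / high, and the time-decay formula are hypothetical choices, not the code used by the application.

```scala
object OptionFeatures {
  // Hypothetical record for one trading session (not a type from the article).
  final case class Session(high: Double, low: Double, close: Double, volume: Double)

  // Min-max scaling to [0, 1]; a constant series maps to all zeros.
  def normalize(xs: Vector[Double]): Vector[Double] = {
    val (lo, hi) = (xs.min, xs.max)
    if (hi == lo) xs.map(_ => 0.0) else xs.map(x => (x - lo) / (hi - lo))
  }

  // Returns one Array(timeToExp, volatility, vltyByVol, priceToStrike) per session.
  def features(sessions: Vector[Session], strike: Double): Vector[Array[Double]] = {
    val vol = sessions.map(s => 1.0 - s.low / s.high) // assumed definition
    val volByVolume = sessions.zip(vol).map { case (s, v) => v / s.volume }
    val toStrike = sessions.map(s => 1.0 - strike / s.close)
    val n = sessions.size
    val decay = Vector.tabulate(n)(i => (i + 1).toDouble / n) // assumed formula
    val cols = Vector(decay, normalize(vol), normalize(volByVolume), normalize(toStrike))
    Vector.tabulate(n)(i => cols.map(_(i)).toArray)
  }
}
```

Each row of the result holds the four values in the same order as the list, so it can be handed to a discretization step as is.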

The following graph shows the four normalized features that can be used for the IBM option strategy:

Figure 2: Normalized relative stock price volatility, volatility relative to trading volume, and price relative to strike price for the IBM stock

Now let us look at the stock and option price datasets. Two files, IBM.csv and IBM_O.csv, contain the IBM stock prices and option prices, respectively. The stock price dataset has the date, the opening price, the high and low prices, the closing price, the trade volume, and the adjusted closing price. A snapshot of the dataset is shown in the following figure:

Figure 3: IBM stock data

On the other hand, IBM_O.csv has 127 option prices for IBM Call 190 Oct 18, 2014. A few values are 1.41, 2.24, 2.42, 2.78, 3.46, 4.11, 4.51, 4.92, 5.41, 6.01, and so on. At this point, can we develop a predictive model using the Q-learning algorithm that helps us answer the previously mentioned question: can it tell us how to make the maximum profit on the IBM option by utilizing all the available features?

Well, we know how to implement Q-learning, and we know what option trading is.

Implementing an options trading web application

The goal of this project is to create an options trading web application that creates a Q-learning model from the IBM stock data. The app then extracts the output from the model as a JSON object and shows the result to the user. Figure 4 shows the overall workflow:

Figure 4: Workflow of the options trading Scala web

The compute API prepares the input for the Q-learning algorithm, and the algorithm starts by extracting the data from the files to build the option model. Then it performs operations on the data such as normalization and discretization, and passes the result to the Q-learning algorithm to train the model. After that, the compute API gets the model from the algorithm, extracts the best policy data, and serializes it as JSON to be returned to the web browser. The implementation of the options trading strategy using Q-learning consists of the following steps:

  • Describing the property of an option
  • Defining the function approximation
  • Specifying the constraints on the state transition

Creating an option property

Considering the market volatility, we need to be a bit more realistic, because any longer-term prediction is quite unreliable: it would fall outside the constraints of the discrete Markov model. So, suppose we want to predict the price for the next two days, that is, N = 2. That means the price of the option two days in the future is the value of the reward (profit or loss). So, let us encapsulate the following four parameters:

  • timeToExp: Time left until expiration as a percentage of the overall duration of the option
  • volatility: Normalized relative volatility of the underlying security for a given trading session
  • vltyByVol: Volatility of the underlying security for a given trading session relative to the trading volume for the session
  • priceToStrike: Price of the underlying security relative to the strike price for a given trading session

The OptionProperty class defines the property of a traded option on a security. The constructor creates the property for an option:

class OptionProperty(timeToExp: Double, volatility: Double, vltyByVol: Double, priceToStrike: Double) {
  // Feature vector in a fixed order, used later for discretization
  val toArray = Array[Double](timeToExp, volatility, vltyByVol, priceToStrike)

  require(timeToExp > 0.01, s"OptionProperty time to expiration found $timeToExp required > 0.01")
}
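
As a quick illustration of the validation, the sketch below wraps the construction in a Try so that a rejected time to expiration becomes a Failure rather than an uncaught exception. The OptionPropertyDemo object, the build helper, and the sample feature values are hypothetical and only serve this example.

```scala
import scala.util.Try

object OptionPropertyDemo {
  // Local copy of the class so that the example is self-contained.
  final class OptionProperty(timeToExp: Double, volatility: Double, vltyByVol: Double, priceToStrike: Double) {
    require(timeToExp > 0.01, s"OptionProperty time to expiration found $timeToExp required > 0.01")
    val toArray: Array[Double] = Array(timeToExp, volatility, vltyByVol, priceToStrike)
  }

  // require throws IllegalArgumentException, which Try turns into a Failure.
  def build(timeToExp: Double): Try[OptionProperty] =
    Try(new OptionProperty(timeToExp, 0.4, 0.3, 0.6))
}
```

Calling build(0.5) yields a Success, while build(0.0) yields a Failure carrying the require message.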

Creating an option model

Now we need to create an OptionModel to act as the container and the factory for the properties of the option. It takes the following parameters and creates a list of option properties, propsList, by accessing the data source of the four features described earlier:

  • The symbol of the security, symbol.
  • The strike price for the option, strikePrice.
  • The source of the data, src.
  • The minimum time decay or time to expiration, minTDecay (named minExpT in the constructor). Out-of-the-money options expire worthless, and in-the-money options have a very different price behavior as they get closer to the expiration. Therefore, the last minTDecay trading sessions prior to the expiration date are not used in the training process.
  • The number of steps (or buckets), nSteps, used in approximating the values of each feature. For instance, an approximation of four steps creates four buckets: (0, 25), (25, 50), (50, 75), and (75, 100).
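
The bucket approximation in the last item can be sketched as follows. The Buckets object and its bucket function are hypothetical names for this example, and clamping a value of exactly 1.0 into the last bucket is a choice made here, not something the article specifies.

```scala
object Buckets {
  // Maps a normalized value in [0, 1] to a bucket index in 0 until nSteps.
  // Clamping x = 1.0 into the last bucket is a choice made for this sketch.
  def bucket(x: Double, nSteps: Int): Int = {
    require(nSteps > 1, s"nSteps found $nSteps required > 1")
    require(x >= 0.0 && x <= 1.0, s"x found $x required in [0, 1]")
    math.min((x * nSteps).toInt, nSteps - 1)
  }
}
```

With four steps, the values 0.10, 0.30, 0.60, and 0.99 fall into buckets 0, 1, 2, and 3, which correspond to the four percentage ranges listed above.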

Then it assembles the OptionProperty instances and computes the normalized minimum time to the expiration of the option. Next, it approximates the value of the option by discretizing the actual values into multiple levels from an array of option prices; finally, it returns a map of an array of levels for the option price and accuracy. Here is the constructor of the class:

class OptionModel(symbol: String, strikePrice: Double, src: DataSource, minExpT: Int, nSteps: Int)

Inside this class, the parameters are first validated using the check() method, which checks the following:

  • strikePrice: A positive price is required
  • minExpT: This has to be between 2 and 16
  • nSteps: Requires a minimum of two steps

Here’s the invocation of this method:

check(strikePrice,  minExpT, nSteps)

The definition of the preceding method is shown in the following code:

def check(strikePrice: Double, minExpT: Int, nSteps: Int): Unit = {
  require(strikePrice > 0.0, s"OptionModel.check price found $strikePrice required > 0")
  require(minExpT > 2 && minExpT < 16, s"OptionModel.check Minimum expiration time found $minExpT required ]2, 16[")
  require(nSteps > 1, s"OptionModel.check, number of steps found $nSteps required > 1")
}

Once the preceding constraints are satisfied, the list of option properties, named propsList, is created as follows:

val propsList = (for {
  price <- src.get(adjClose)
  volatility <- src.get(volatility)
  nVolatility <- normalize[Double](volatility)
  vltyByVol <- src.get(volatilityByVol)
  nVltyByVol <- normalize[Double](vltyByVol)
  priceToStrike <- normalize[Double](price.map(p => 1.0 - strikePrice / p))
} yield {
  // One OptionProperty per trading session, indexed by n
  nVolatility.zipWithIndex.foldLeft(List[OptionProperty]()) {
    case (xs, (v, n)) =>
      val normDecay = (n + minExpT).toDouble / (price.size + minExpT)
      new OptionProperty(normDecay, v, nVltyByVol(n), priceToStrike(n)) :: xs
  }
}).getOrElse(List.empty[OptionProperty])

In the preceding code block, the factory uses the zipWithIndex Scala method to represent the index of the trading sessions. All feature values are normalized over the interval (0, 1), including the time decay (or time to expiration) of the option, normDecay.
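
To see how the time decay behaves, here is a small sketch of the normDecay formula in isolation. The TimeDecay object and the sample sizes used in the note after it are hypothetical; the formula itself is the one used in the factory above.

```scala
object TimeDecay {
  // Same formula as in the factory: session index n, shifted by minExpT.
  def normDecay(n: Int, numSessions: Int, minExpT: Int): Double =
    (n + minExpT).toDouble / (numSessions + minExpT)

  // All values for a series of numSessions trading sessions.
  def all(numSessions: Int, minExpT: Int): Vector[Double] =
    Vector.tabulate(numSessions)(normDecay(_, numSessions, minExpT))
}
```

For 10 sessions and minExpT = 6, the first session gets 6 / 16 = 0.375 and the last gets 15 / 16 = 0.9375, so every value stays strictly inside (0, 1) and above the 0.01 threshold required by OptionProperty.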

The quantize() method of the OptionModel class converts the normalized value of each option property into an array of bucket indices. It returns a map of profit and loss for each bucket keyed on the array of bucket indices:

def quantize(o: Array[Double]): Map[Array[Int], Double] = {
  // Maps each encoded state id back to its array of bucket indices
  val mapper = new mutable.HashMap[Int, Array[Int]]
  val acc: NumericAccumulator[Int] = propsList.view.map(_.toArray)
    .map(toArrayInt(_)).map(ar => {
      val enc = encode(ar)
      mapper.put(enc, ar)
      enc
    })
    .zip(o)
    .foldLeft(new NumericAccumulator[Int]) {
      case (_acc, (t, y)) => _acc += (t, y); _acc
    }
  // Average profit and loss per bucket: sum / number of occurrences
  acc.map { case (k, (n, sum)) => (k, sum / n) }
    .map { case (k, v) => (mapper(k), v) }.toMap
}

The method also creates a mapper instance to index the array of buckets. An accumulator, acc, of type NumericAccumulator extends mutable.HashMap[T, (Int, Double)] and stores, for each bucket, the tuple (number of occurrences of features in the bucket, sum of the increases or decreases of the option price).
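
The accumulate-then-average idea can be shown with an immutable map. This is a sketch of the technique only, not the NumericAccumulator class itself; the BucketAverage object and its method names are hypothetical.

```scala
object BucketAverage {
  // Folds (stateId, priceChange) pairs into stateId -> (count, sum).
  def accumulate(samples: Seq[(Int, Double)]): Map[Int, (Int, Double)] =
    samples.foldLeft(Map.empty[Int, (Int, Double)]) { case (acc, (id, x)) =>
      val (count, sum) = acc.getOrElse(id, (0, 0.0))
      acc.updated(id, (count + 1, sum + x))
    }

  // Average price change per state: sum / count.
  def average(samples: Seq[(Int, Double)]): Map[Int, Double] =
    accumulate(samples).map { case (id, (count, sum)) => id -> sum / count }
}
```

For the samples (7, 0.5), (7, 1.5), and (3, -0.25), state 7 accumulates (2, 2.0) and averages to 1.0, while state 3 keeps its single value of -0.25.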

The toArrayInt method converts the value of each option property (timeToExp, volatility, and so on) into the index of the appropriate bucket. The array of indices is then encoded to generate the id or index of a state. The method updates the accumulator with the number of occurrences and the total profit and loss for a trading session for the option. It finally computes the reward on each action by averaging the profit and loss on each bucket. The definitions of encode(), toArrayInt(), and NumericAccumulator are given in the following code:

// Encodes the bucket indices as a single integer in base nSteps
private def encode(arr: Array[Int]): Int =
  arr.foldLeft((1, 0)) { case ((s, t), n) => (s * nSteps, t + s * n) }._2

// Converts each normalized feature value into its bucket index
private def toArrayInt(feature: Array[Double]): Array[Int] =
  feature.map(x => (nSteps * x).floor.toInt)

final class NumericAccumulator[T] extends mutable.HashMap[T, (Int, Double)] {
  // Increments the count and adds x to the running sum for the key
  def +=(key: T, x: Double): Option[(Int, Double)] = {
    val newValue =
      if (contains(key)) (get(key).get._1 + 1, get(key).get._2 + x)
      else (1, x)
    super.put(key, newValue)
  }
}
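
A worked example makes the encoding easier to follow. The sketch below repeats the encode fold with nSteps passed as an explicit parameter and adds a decode helper; the StateEncoding object and decode are hypothetical additions for illustration.

```scala
object StateEncoding {
  // Encodes bucket indices as a base-nSteps number, least significant digit first.
  def encode(arr: Array[Int], nSteps: Int): Int =
    arr.foldLeft((1, 0)) { case ((s, t), n) => (s * nSteps, t + s * n) }._2

  // Inverse of encode for a known number of features.
  def decode(id: Int, nSteps: Int, numFeatures: Int): Array[Int] =
    Array.iterate(id, numFeatures)(_ / nSteps).map(_ % nSteps)
}
```

With nSteps = 4, the bucket indices (1, 2, 3, 0) encode to 1 + 2 * 4 + 3 * 16 + 0 * 64 = 57, and decoding 57 gives the same four indices back. Because each index acts as a digit, two different arrays of valid bucket indices never share a state id.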

Finally, and most importantly, once the preceding constraints are satisfied (you can modify these constraints, though), instantiating the OptionModel class generates a list of OptionProperty elements if the constructor succeeds; otherwise, it generates an empty list.

Putting it all together

Because we have implemented the Q-learning algorithm, we can now develop the options trading application using Q-learning. However, first we need to load the data using the DataSource class (we will see its implementation later on). Then we can create an option model from the data for a given stock with default strike and minimum expiration time parameters, using OptionModel, which defines the model for a traded option on a security. Then we have to create the model for the profit and loss on an option given the underlying security.

The profit and loss are adjusted to produce positive values. The application then instantiates the Q-learning class, a generic parameterized class that implements the Q-learning algorithm. The Q-learning model is initialized and trained during the instantiation of the class, so it can be in the correct state for the runtime prediction.

Therefore, the class instances have only two states: successfully trained and failed training. Then the model is returned to be processed and visualized.

So, let us create a Scala object and name it QLearningMain. Then, inside the QLearningMain object, define and initialize the following parameters:

  • name: Used to indicate the reinforcement algorithm's name (for our case, it's Q-learning)
  • STOCK_PRICES: File that contains the stock data
  • OPTION_PRICES: File that contains the available option data
  • STRIKE_PRICE: Option strike price
  • MIN_TIME_EXPIRATION: Minimum expiration time for the option recorded
  • QUANTIZATION_STEP: Steps used in discretization or approximation of the value of the security
  • ALPHA: Learning rate for the Q-learning algorithm
  • DISCOUNT (gamma): Discount rate for the Q-learning algorithm
  • MAX_EPISODE_LEN: Maximum number of states visited per episode
  • NUM_EPISODES: Number of episodes used during training
  • MIN_COVERAGE: Minimum coverage allowed during the training of the Q-learning model
  • NUM_NEIGHBHBOR_STATES: Number of states accessible from any other state (the identifier is spelled this way in the code)
  • REWARD_TYPE: Maximum reward or Random

Tentative initializations for each parameter are given in the following code:

val name: String = "Q-learning"
// Files containing the historical prices for the stock and option
val STOCK_PRICES = "/static/IBM.csv"
val OPTION_PRICES = "/static/IBM_O.csv"
// Run configuration parameters
val STRIKE_PRICE = 190.0 // Option strike price
val MIN_TIME_EXPIRATION = 6 // Min expiration time for option recorded
val QUANTIZATION_STEP = 32 // Quantization step (Double => Int)
val ALPHA = 0.2 // Learning rate
val DISCOUNT = 0.6 // Discount rate used in Q-Value update equation
val MAX_EPISODE_LEN = 128 // Max number of iteration for an episode
val NUM_EPISODES = 20 // Number of episodes used for training.
val NUM_NEIGHBHBOR_STATES = 3 // No. of states from any other state
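
To make the roles of ALPHA and DISCOUNT concrete, here is a minimal sketch of the standard tabular Q-learning update. It assumes the textbook form of the rule; the QLearning class used in this article is not shown here and may differ in detail, and the QUpdate object is a hypothetical name.

```scala
object QUpdate {
  // One tabular Q-learning step:
  //   Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
  def update(q: Double, reward: Double, maxNextQ: Double, alpha: Double, gamma: Double): Double =
    q + alpha * (reward + gamma * maxNextQ - q)
}
```

With alpha = 0.2 and gamma = 0.6, a current value of 1.0, a reward of 2.0, and a best next value of 5.0 give 1.0 + 0.2 * (2.0 + 3.0 - 1.0) = 1.8. A smaller alpha moves the estimate more slowly, and a smaller gamma gives less weight to future rewards.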

Now the run() method accepts as input the reward type (Maximum reward in our case), the quantization step (QUANTIZATION_STEP), alpha (the learning rate, ALPHA), and gamma (the discount rate for the Q-learning algorithm, DISCOUNT). It displays the distribution of values in the model. Additionally, it displays the estimated Q-value for the best policy on a scatter plot (we will see this later). Here is the workflow of the method:

  • First, it extracts the stock prices from the IBM.csv file
  • Then it creates an option model with createOptionModel, using the stock prices and the quantization step, quantizeR (see the quantize method for more, and the main method invocation later)
  • The option prices are extracted from the IBM_O.csv file
  • After that, another model, model, is created using the option model to evaluate it on the option prices, oPrices
  • Finally, the estimated Q-value (that is, Q-value = value * probability) is displayed on a scatter plot using the display method

Combining the preceding steps, here is the run() method:

private def run(rewardType: String, quantizeR: Int, alpha: Double, gamma: Double): Int = {
  // Load the stock prices and build the option model
  val sPath = getClass.getResource(STOCK_PRICES).getPath
  val src = DataSource(sPath, false, false, 1).get
  val option = createOptionModel(src, quantizeR)

  // Load the option prices
  val oPricesSrc = DataSource(OPTION_PRICES, false, false, 1).get
  val oPrices = oPricesSrc.extract.get

  // Train the model and plot the estimated Q-values of the best policy
  val model = createModel(option, oPrices, alpha, gamma)
  model.map(m => {
    if (rewardType != "Random")
      display(m.bestPolicy.EQ, m.toString, s"$rewardType with quantization order $quantizeR")
    ...

Here is the createOptionModel() method, which creates an option model (see the OptionModel class):

private def createOptionModel(src: DataSource, quantizeR: Int): OptionModel =
  new OptionModel("IBM", STRIKE_PRICE, src, MIN_TIME_EXPIRATION, quantizeR)

Then the createModel() method creates a model for the profit and loss on an option given the underlying security. Note that the option prices are quantized using the quantize() method defined earlier. Then a constraining method is used to limit the number of actions available to any given state. This simple implementation identifies the neighboring states as all the states within a predefined radius of the given state.

Finally, it uses the input data to train the Q-learning model. The minimum value of the profit, which is a loss, is computed so that the maximum loss is converted to a null profit. Note that the profit and loss are adjusted to produce positive values. Now let us see the definition of this method:

def createModel(ibmOption: OptionModel, oPrice: Seq[Double], alpha: Double, gamma: Double): Try[QLModel] = {
  val qPriceMap = ibmOption.quantize(oPrice.toArray)
  val numStates = qPriceMap.size

  // States reachable from state n: all states within NUM_NEIGHBHBOR_STATES of it
  val neighbors = (n: Int) => {
    def getProximity(idx: Int, radius: Int): List[Int] = {
      val idx_max =
        if (idx + radius >= numStates) numStates - 1
        else idx + radius
      val idx_min =
        if (idx < radius) 0
        else idx - radius
      Range(idx_min, idx_max + 1).filter(_ != idx).foldLeft(List[Int]())((xs, n) => n :: xs)
    }
    getProximity(n, NUM_NEIGHBHBOR_STATES)
  }

  val qPrice: DblVec = qPriceMap.values.toVector
  val profit: DblVec = normalize(zipWithShift(qPrice, 1).map {
    case (x, y) => y - x
  }).get
  val maxProfitIndex = profit.zipWithIndex.maxBy(_._1)._2

  val reward = (x: Double, y: Double) => Math.exp(30.0 * (y - x))
  val probabilities = (x: Double, y: Double) =>
    if (y < 0.3 * x) 0.0
    else 1.0

  println(s"$name Goal state index: $maxProfitIndex")
  if (!QLearning.validateConstraints(profit.size, neighbors))
    throw new IllegalStateException("QLearningEval Incorrect states transition constraint")

  val instances = qPriceMap.keySet.toSeq.drop(1)
  val config = QLConfig(alpha, gamma, MAX_EPISODE_LEN, NUM_EPISODES, 0.1)
  val qLearning = QLearning[Array[Int]](config, Array[Int](maxProfitIndex), profit, reward, probabilities, instances, Some(neighbors))
  val modelO = qLearning.getModel

  if (modelO.isDefined) {
    val numTransitions = numStates * (numStates - 1)
    println(s"$name Coverage ${modelO.get.coverage} for $numStates states and $numTransitions transitions")
    val profile = qLearning.dump
    println(s"$name Execution profile\n$profile")
    display(qLearning)
    Success(modelO.get)
  }
  else Failure(new IllegalStateException(s"$name model undefined"))
}
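
The neighbor constraint is the part of createModel that is easiest to test in isolation. The sketch below reproduces the same clipping and ordering logic as getProximity in a standalone function; the Neighbors object and the explicit numStates parameter are hypothetical names for this example.

```scala
object Neighbors {
  // States within `radius` of `idx`, clipped to [0, numStates - 1], excluding idx itself.
  // Listed in descending order, matching the prepend-based fold in createModel.
  def proximity(idx: Int, radius: Int, numStates: Int): List[Int] = {
    val lo = math.max(idx - radius, 0)
    val hi = math.min(idx + radius, numStates - 1)
    (hi to lo by -1).filter(_ != idx).toList
  }
}
```

For 126 states and a radius of 3, state 71 can move to 74, 73, 72, 70, 69, and 68, while state 124, which sits near the upper boundary, can only move to 125, 123, 122, and 121.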

Note that if the preceding invocation cannot create an option model, the code fails with a message stating that the model is undefined. Nonetheless, remember that the minCoverage used in the following line is important, considering the small dataset we used (because the algorithm will converge very quickly):

val config = QLConfig(alpha, gamma, MAX_EPISODE_LEN, NUM_EPISODES, 0.0)

Although we have already stated that it is not assured that the model creation and training will be successful, a naive rule of thumb is to use a very small minCoverage value, between 0.0 and 0.22. If the preceding invocation is successful, the model is trained and ready for making predictions. The display method is then used to display the estimated Q-value (value * probability) in a scatter plot. Here is the method:

private def display(eq: Vector[DblPair], results: String, params: String): Unit = {
  import org.scalaml.plots.{ScatterPlot, BlackPlotTheme, Legend}
  val labels = Legend(name, s"Q-learning config: $params", "States", "States")
  ScatterPlot.display(eq, labels, new BlackPlotTheme)
}

Hang on and do not lose patience! We are finally ready to see a simple run and inspect the result. So let us do it:

def main(args: Array[String]): Unit = {
  run("Maximum reward", QUANTIZATION_STEP, ALPHA, DISCOUNT)
}

The run prints output similar to the following:

Action: state 71 => state 74
Action: state 71 => state 73
Action: state 71 => state 72
Action: state 71 => state 70
Action: state 71 => state 69
Action: state 71 => state 68
...
Instance: [I@... - state: 124
Action: state 124 => state 125
Action: state 124 => state 123
Action: state 124 => state 122
Action: state 124 => state 121
Q-learning Coverage 0.1 for 126 states and 15750 transitions
Q-learning Execution profile
Q-Value -> 5.572310105096295, 0.013869013819834967, 4.5746487300071825,
0.4037703812585325, 0.17606260549479869, 0.09205272504875522,
0.023205692430068765, 0.06363082458984902, 50.405283888218435...
Model: Success(Optimal policy: Reward - 1.00,204.28,115.57,6.05,637.58,71.99,12.34,0.10,4939.71,521.30,402.73, with coverage: 0.1)

Evaluating the model

The preceding output shows the transitions from one state to another. For a coverage of 0.1, the Q-learning model had 15,750 transitions for 126 states to reach goal state 37 with optimal rewards. The training set is quite small, and only a few buckets have actual values, so the size of the training set has an impact on the number of states. Q-learning will converge too fast for a small training set (like the one in this example).
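
The transition count in the output is simple arithmetic: every ordered pair of distinct states counts once, as in the numTransitions expression in createModel. The sketch below repeats that calculation and, for comparison, counts only the transitions allowed by a radius-based neighbor constraint; the TransitionCount object and the constrained helper are hypothetical additions.

```scala
object TransitionCount {
  // Every ordered pair of distinct states counts as one potential transition.
  def numTransitions(numStates: Int): Int = numStates * (numStates - 1)

  // Transitions that remain when each state may only reach states within `radius`.
  def constrained(numStates: Int, radius: Int): Int =
    (0 until numStates).map { i =>
      math.min(i + radius, numStates - 1) - math.max(i - radius, 0)
    }.sum
}
```

For 126 states, numTransitions gives 126 * 125 = 15,750, the figure printed in the log. Assuming the radius of 3 from createModel, constrained gives 744, which shows how strongly the neighbor constraint narrows the search.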

However, for a larger training set, Q-learning will take time to converge; it will provide at least one value for each bucket created by the approximation. Also, by looking at those values alone, it is difficult to understand the relation between Q-values and states.

So what if we can see the Q-values per state? Why not! We can see them on a scatter plot:

Figure 5: Q-value per state

Now let us display the profile of the log of the Q-value (QLData.value) as the recursive search (or training) progresses for different episodes or epochs. The test uses a learning rate α = 0.1 and a discount rate γ = 0.9 (see more in the deployment section):

Figure 6: Profile of the logarithmic Q-Value for different epochs during Q-learning training

The preceding chart illustrates the fact that the Q-value for each profile is independent of the order of the epochs during training. However, the number of iterations to reach the goal state depends on the initial state, which is selected randomly in this example. To get more insights, inspect the output in your editor or access the API endpoint at http://localhost:9000/api/compute. Now, what if we display the distribution of values in the model and the estimated Q-value for the best policy on a scatter plot for the given configuration parameters?

Figure 7: Maximum reward with quantization 32 with the QLearning

The final evaluation consists of evaluating the impact of the learning rate and discount rate on the coverage of the training:

Figure 8: Impact of the learning rate and discount rate on the coverage of the training

The coverage decreases as the learning rate increases. This result confirms the general rule of using learning rate < 0.2. A similar test to evaluate the impact of the discount rate on the coverage is inconclusive.

We learned to develop a real-life application for options trading using a reinforcement learning algorithm called Q-learning.

You read an excerpt from a book written by Md. Rezaul Karim, titled Scala Machine Learning Projects. In this book, you will learn to develop, build, and deploy research or commercial machine learning projects in a production-ready environment.
