(For more resources related to this topic, see here.)
Recursive partitions
The name of the library package rpart, shipped along with R, stands for Recursive Partitioning . The package was first created by Terry M Therneau and Beth Atkinson , and is currently maintained by Brian Ripley . We will first have a peek at means recursive partitions are.
A complex and contrived relationship is generally not identifiable by linear models. In the previous chapter, we saw the extensions of the linear models in piecewise, polynomial, and spline regression models. It is also well known that if the order of a model is larger than 4, then interpretation and usability of the model becomes more difficult. We consider a hypothetical dataset, where we have two classes for the output Y and two explanatory variables in X1 and X2. The two classes are indicated by filled-in green circles and red squares. First, we will focus only on the left display of Figure 1: A complex classification dataset with partitions , as it is the actual depiction of the data. At the outset, it is clear that a linear model is not appropriate, as there is quite an overlap of the green and red indicators. Now, there is a clear demarcation of the classification problem accordingly, as X1 is greater than 6 or not. In the area on the left side of X1=6, the mid-third region contains a majority of green circles and the rest are red squares. The red squares are predominantly identifiable accordingly, as the X2 values are either lesser than or equal to 3 or greater than 6. The green circles are the majority values in the region of X2 being greater than 3 and lesser than 6. A similar story can be built for the points on the right side of X1 greater than 6. Here, we first partitioned the data according to X1 values first, and then in each of the partitioned region, we obtained partitions according to X2 values. This is the act of recursive partitioning.
Figure 1: A complex classification dataset with partitions
Let us obtain the preceding plot in R.
Time for action – partitioning the display plot
We first visualize the CART_Dummy dataset and then look in the next subsection at how CART gets the patterns, which are believed to exist in the data.
- Obtain the dataset CART_Dummy from the RSADBE package by using data( CART_Dummy).
- Convert the binary output Y as a factor variable, and attach the data frame with CART_Dummy$Y .
attach(CART_Dummy)
In Figure 1: A complex classification dataset with partitions , the red squares refer to 0 and the green circles to 1.
- Initialize the graphics windows for the three samples by using par(mfrow= c(1,2)).
- Create a blank scatter plot:
plot(c(0,12),c(0,10),type="n",xlab="X1",ylab="X2").
- Plot the green circles and red squares:
points(X1[Y==0],X2[Y==0],pch=15,col="red") points(X1[Y==1],X2[Y==1],pch=19,col="green") title(main="A Difficult Classification Problem")
- Repeat the previous two steps to obtain the identical plot on the right side of the graphics window.
- First, partition according to X1 values by using abline(v=6,lwd=2).
- Add segments on the graph with the segment function:
segments(x0=c(0,0,6,6),y0=c(3.75,6.25,2.25,5),x1=c(6,6,12,12),
y1=c(3.75,6.25,2.25,5),lwd=2) title(main="Looks a Solvable Problem Under Partitions")
What just happened?
A complex problem is simplified through partitioning! A more generic function, segments, has nicely slipped in our program, which you may use for many other scenarios.
Now, this approach of recursive partitioning is not feasible all the time! Why? We seldom deal with two or three explanatory variables and data points as low as in the preceding hypothetical example. The question is how one creates recursive partitioning of the dataset. Breiman, et. al. (1984) and Quinlan (1988) have invented tree building algorithms, and we will follow the Breiman, et. al. approach in the rest of book. The CART discussion in this book is heavily influenced by Berk (2008).
Splitting the data
In the earlier discussion, we saw that partitioning the dataset can benefit a lot in reducing the noise in the data. The question is how does one begin with it? The explanatory variables can be discrete or continuous. We will begin with the continuous (numeric objects in R) variables.
For a continuous variable, the task is a bit simpler. First, identify the unique distinct values of the numeric object. Let us say, for example, that the distinct values of a numeric object, say height in cms, are 160, 165, 170, 175, and 180. The data partitions are then obtained as follows:
- data[Height data[Height>160,]
- data[Height data[Height>165,]
- data[Height data[Height>170,]
- data[Height data[Height>175,]
The reader should try to understand the rationale behind the code, and certainly this is just an indicative one.
Now, we consider the discrete variables. Here, we have two types of variables, namely categorical and ordinal . In the case of ordinal variables, we have an order among the distinct values. For example, in the case of the economic status variable, the order may be among the classes Very Poor, Poor, Average, Rich, and Very Rich. Here, the splits are similar to the case of continuous variable, and if there are m distinct orders, we consider m -1 distinct splits of the overall data. In the case of a categorical variable with m categories, for example the departments A to F of the UCBAdmissions dataset, the number of possible splits becomes 2^{m-1}-1. However, the benefit of using software like R is that we do not have to worry about these issues.
The first tree
In the CART_Dummy dataset, we can easily visualize the partitions for Y as a function of the inputs X1 and X2. Obviously, we have a classification problem, and hence we will build the classification tree.
Time for action – building our first tree
The rpart function from the library rpart will be used to obtain the first classification tree. The tree will be visualized by using the plot options of rpart, and we will follow this up with extracting the rules of a tree by using the asRules function from the rattle package.
- Load the rpart package by using library(rpart).
- Create the classification tree with CART_Dummy_rpart .
- Visualize the tree with appropriate text labels by using plot(CART_Dummy_rpart); text(CART_Dummy_rpart).
Figure 2: A classification tree for the dummy dataset
Now, the classification tree flows as follows. Obviously, the tree using the rpart function does not partition as simply as we did in Figure 1: A complex classification dataset with partitions , the working of which will be dealt within the third section of this chapter. First, we check if the value of the second variable X2 is less than 4.875. If the answer is an affirmation, we move to the left side of the tree; the right side in the other case. Let us move to the right side. A second question asked is whether X1 is lesser than 4.5 or not, and then if the answer is yes it is identified as a red square, and otherwise a green circle. You are now asked to interpret the left side of the first node. Let us look at the summary of CART_Dummy_rpart.
- Apply the summary, an S3 method, for the classification tree with summary( CART_Dummy_rpart).
That one is a lot of output!
Figure 3: Summary of a classification tree
Our interests are in the nodes numbered 5 to 9! Why? The terminal nodes, of course! A terminal node is one in which we can’t split the data any further, and for the classification problem, we arrive at a class assignment as the class that has a majority count at the node. The summary shows that there are indeed some misclassifications too. Now, wouldn’t it be great if R gave the terminal nodes asRules. The function asRules from the rattle package extracts the rules from an rpart object. Let’s do it!
- Invoke the rattle package library(rattle) and using the asRules function, extract the rules from the terminal nodes with asRules(CART_Dummy_rpart).
The result is the following set of rules:
Figure 4: Extracting “rules” from a tree!
We can see that the classification tree is not according to our “eye-bird” partitioning. However, as a final aspect of our initial understanding, let us plot the segments using the naïve way. That is, we will partition the data display according to the terminal nodes of the CART_Dummy_rpart tree.
- The R code is given right away, though you should make an effort to find the logic behind it. Of course, it is very likely that by now you need to run some of the earlier code that was given previously.
abline(h=4.875,lwd=2) segments(x0=4.5,y0=4.875,x1=4.5,y1=10,lwd=2) abline(h=1.75,lwd=2) segments(x0=3.5,y0=1.75,x1=3.5,y1=4.875,lwd=2) title(main="Classification Tree on the Data Display")
It can be easily seen from the following that rpart works really well:
Figure 5: The terminal nodes on the original display of the data
What just happened?
We obtained our first classification tree, which is a good thing. Given the actual data display, the classification tree gives satisfactory answers.
We have understood the “how” part of a classification tree. The “why” aspect is very vital in science, and the next section explains the science behind the construction of a regression tree, and it will be followed later by a detailed explanation of the working of a classification tree.
The construction of a regression tree
In the CART_Dummy dataset, the output is a categorical variable, and we built a classification tree for it. The same distinction is required in CART, and we thus build classification trees for binary random variables, where regression trees are for continuous random variables. Recall the rationale behind the estimation of regression coefficients for the linear regression model. The main goal was to find the estimates of the regression coefficients, which minimize the error sum of squares between the actual regressand values and the fitted values. A similar approach is followed here, in the sense that we need to split the data at the points that keep the residual sum of squares to a minimum. That is, for each unique value of a predictor, which is a candidate for the node value, we find the sum of squares of y’s within each partition of the data, and then add them up. This step is performed for each unique value of the predictor, and the value, which leads to the least sum of squares among all the candidates, is selected as the best split point for that predictor. In the next step, we find the best split points for each of the predictors, and then the best split is selected across the best split points across the predictors. Easy!
Now, the data is partitioned into two parts according to the best split. The process of finding the best split within each partition is repeated in the same spirit as for the first split. This process is carried out in a recursive fashion until the data can’t be partitioned any further. What is happening here? The residual sum of squares at each child node will be lesser than that in the parent node.
At the outset, we record that the rpart function does the exact same thing. However, as a part of cleaner understanding of the regression tree, we will write raw R codes and ensure that there is no ambiguity in the process of understanding CART. We will begin with a simple example of a regression tree, and use the rpart function to plot the regression function. Then, we will first define a function, which will extract the best split given by the covariate and dependent variable. This action will be repeated for all the available covariates, and then we find the best overall split. This will be verified with the regression tree. The data will then be partitioned by using the best overall split, and then the best split will be identified for each of the partitioned data. The process will be repeated until we reach the end of the complete regression tree given by the rpart. First, the experiment!
The cpus dataset available in the MASS package contains the relative performance measure of 209 CPUs in the perf variable. It is known that the performance of a CPU depends on factors such as the cycle time in nanoseconds (syct), minimum and maximum main memory in kilobytes (mmin and mmax), cache size in kilobytes (cach), and minimum and maximum number of channels (chmin and chmax). The task in hand is to model the perf as a function of syct, mmin, mmax, cach, chmin, and chmax. The histogram of perf—try hist(cpus$perf)—will show a highly skewed distribution, and hence we will build a regression tree for the logarithm transformation log10(perf).
Time for action – the construction of a regression tree
A regression tree is first built by using the rpart function. The getNode function is introduced, which helps in identifying the split node at each stage, and using it we build a regression tree and verify that we had the same tree as returned by the rpart function.
- Load the MASS library by using library(MASS).
- Create the regression tree for the logarithm (to the base 10) of perf as a function of the covariates explained earlier, and display the regression tree:
cpus.ltrpart
The regression tree will be indicated as follows:
Figure 6: Regression tree for the “perf” of a CPU
We will now define the getNode function. Given the regressand and the covariate, we need to find the best split in the sense of the sum of squares criterion. The evaluation needs to be done for every distinct value of the covariate. If there are m distinct points, we need m -1 evaluations. At each distinct point, the regressand needs to be partitioned accordingly, and the sum of squares should be obtained for each partition. The two sums of squares (in each part) are then added to obtain the reduced sum of squares. Thus, we create the required function to meet all these requirements.
- Create the getNode function in R by running the following code:
getNode xu[i]] partL
The getNode function gives the best split for a given covariate. It returns a list consisting of four objects:
- xnode, which is a datum of the covariate x that gives the minimum residual sum of squares for the regressand y
- The value of the minimum residual sum of squares
- The vector of the residual sum of squares for the distinct points of the vector x
- The vector of the distinct x values
We will run this function for each of the six covariates, and find the best overall split. The argument na.rm=TRUE is required, as at the maximum value of x we won’t get a numeric value.
- We will first execute the getNode function on the syct covariate, and look at the output we get as a result:
> getNode(cpus$syct,log10(cpus$perf))$xnode [1] 48 > getNode(cpus$syct,log10(cpus$perf))$minss [1] 24.72 > getNode(cpus$syct,log10(cpus$perf))[[3]] [1] 43.12 42.42 41.23 39.93 39.44 37.54 37.23 36.87 36.51 36.52 35.92 34.91 [13] 34.96 35.10 35.03 33.65 33.28 33.49 33.23 32.75 32.96 31.59 31.26 30.86 [25] 30.83 30.62 29.85 30.90 31.15 31.51 31.40 31.50 31.23 30.41 30.55 28.98 [37] 27.68 27.55 27.44 26.80 25.98 27.45 28.05 28.11 28.66 29.11 29.81 30.67 [49] 28.22 28.50 24.72 25.22 26.37 28.28 29.10 33.02 34.39 39.05 39.29 > getNode(cpus$syct,log10(cpus$perf))[[4]] [1] 1500 1100 900 810 800 700 600 480 400 350 330 320 300 250 240 [16] 225 220 203 200 185 180 175 167 160 150 143 140 133 125 124 [31] 116 115 112 110 105 100 98 92 90 84 75 72 70 64 60 [46] 59 57 56 52 50 48 40 38 35 30 29 26 25 23 17
The least sum of squares at a split for the best split value of the syct variable is 24.72, and it occurs at a value of syct greater than 48. The third and fourth list objects given by getNode, respectively, contain the details of the sum of squares for the potential candidates and the unique values of syct. The values of interest are highlighted. Thus, we will first look at the second object from the list output for all the six covariates to find the best split among the best split of each of the variables, by the residual sum of squares criteria.
- Now, run the getNode function for the remaining five covariates:
getNode(cpus$syct,log10(cpus$perf))[[2]] getNode(cpus$mmin,log10(cpus$perf))[[2]] getNode(cpus$mmax,log10(cpus$perf))[[2]] getNode(cpus$cach,log10(cpus$perf))[[2]] getNode(cpus$chmin,log10(cpus$perf))[[2]] getNode(cpus$chmax,log10(cpus$perf))[[2]] getNode(cpus$cach,log10(cpus$perf))[[1]] sort(getNode(cpus$cach,log10(cpus$perf))[[4]],decreasing=FALSE)
The output is as follows:
Figure 7: Obtaining the best “first split” of regression tree
The sum of squares for cach is the lowest, and hence we need to find the best split associated with it, which is 24. However, the regression tree shows that the best split is for the cach value of 27. The getNode function says that the best split occurs at a point greater than 24, and hence we take the average of 24 and the next unique point at 30. Having obtained the best overall split, we next obtain the first partition of the dataset.
- Partition the data by using the best overall split point:
cpus_FS_R =27,] cpus_FS_L
The new names of the data objects are clear with _FS_R indicating the dataset obtained on the right side for the first split, and _FS_L indicating the left side. In the rest of the section, the nomenclature won’t be further explained.
- Identify the best split in each of the partitioned datasets:
getNode(cpus_FS_R$syct,log10(cpus_FS_R$perf))[[2]] getNode(cpus_FS_R$mmin,log10(cpus_FS_R$perf))[[2]] getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[2]] getNode(cpus_FS_R$cach,log10(cpus_FS_R$perf))[[2]] getNode(cpus_FS_R$chmin,log10(cpus_FS_R$perf))[[2]] getNode(cpus_FS_R$chmax,log10(cpus_FS_R$perf))[[2]] getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[1]] sort(getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[4]], decreasing=FALSE) getNode(cpus_FS_L$syct,log10(cpus_FS_L$perf))[[2]] getNode(cpus_FS_L$mmin,log10(cpus_FS_L$perf))[[2]] getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[2]] getNode(cpus_FS_L$cach,log10(cpus_FS_L$perf))[[2]] getNode(cpus_FS_L$chmin,log10(cpus_FS_L$perf))[[2]] getNode(cpus_FS_L$chmax,log10(cpus_FS_L$perf))[[2]] getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[1]] sort(getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[4]], decreasing=FALSE)
The following screenshot gives the results of running the preceding R code:
Figure 8: Obtaining the next two splits
Thus, for the first right partitioned data, the best split is for the mmax value as the mid-point between 24000 and 32000; that is, at mmax = 28000. Similarly, for the first left-partitioned data, the best split is the average value of 6000 and 6200, which is 6100, for the same mmax covariate. Note the important step here. Even though we used cach as the criteria for the first partition, it is still used with the two partitioned data. The results are consistent with the display given by the regression tree, Figure 6: Regression tree for the “perf” of a CPU . The next R program will take care of the entire first split’s right side’s future partitions.
- Partition the first right part cpus_FS_R as follows:
cpus_FS_R_SS_R =28000,] cpus_FS_R_SS_L
Obtain the best split for cpus_FS_R_SS_R and cpus_FS_R_SS_L by running the following code:
cpus_FS_R_SS_R =28000,] cpus_FS_R_SS_L decreasing=FALSE)
For the cpus_FS_R_SS_R part, the final division is according to cach being greater than 56 or not (average of 48 and 64). If the cach value in this partition is greater than 56, then perf (actually log10(perf)) ends in the terminal leaf 3, else 2. However, for the region cpus_FS_R_SS_L, we partition the data further by the cach value being greater than 96.5 (average of 65 and 128). In the right side of the region, log10(perf) is found as 2, and a third level split is required for cpus_FS_R_SS_L with cpus_FS_R_SS_L_TS_L. Note that though the final terminal leaves of the cpus_FS_R_SS_L_TS_L region shows the same 2 as the final log10(perf), this may actually result in a significant variability reduction of the difference between the predicted and the actual log10(perf) values. We will now focus on the first main split’s left side.
Figure 9: Partitioning the right partition after the first main split
- Partition cpus_FS_L accordingly, as the mmax value being greater than 6100 or otherwise:
cpus_FS_L_SS_R =6100,] cpus_FS_L_SS_L
The rest of the partition for cpus_FS_L is completely given next.
- The details will be skipped and the R program is given right away:
cpus_FS_L_SS_R =6100,] cpus_FS_L_SS_L decreasing=FALSE) cpus_FS_L_SS_R_TS_R decreasing=FALSE)
We will now see how the :
Figure 10: Partitioning the left partition after the first main split
We leave it to you to interpret the output arising from the previous action.
What just happened?
Using the rpart function from the rpart library we first built the regression tree for log10(perf). Then, we explored the basic definitions underlying the construction of a regression tree and defined the getNode function to obtain the best split for a pair of regressands and a covariate. This function is then applied for all the covariates, and the best overall split is obtained; using this we get our first partition of the data, which will be in agreement with the tree given by the rpart function. We then recursively partitioned the data by using the getNode function and verified that all the best splits in each partitioned data are in agreement with the one provided by the rpart function.
The reader may wonder if the preceding tedious task was really essential. However, it has been the experience of the author that users/readers seldom remember the rationale behind using direct code/functions for any software after some time. Moreover, CART is a difficult concept and it is imperative that we clearly understand our first tree, and return to the preceding program whenever the understanding of a science behind CART is forgotten.
Summary
We began with the idea of recursive partitioning and gave a legitimate reason as to why such an approach is practical. The CART technique is completely demystified by using the getNode function, which has been defined appropriately depending upon whether we require a regression or a classification tree.
Resources for Article :
Further resources on this subject:
- Organizing, Clarifying and Communicating the R Data Analyses [Article]
- Graphical Capabilities of R [Article]
- Customizing Graphics and Creating a Bar Chart and Scatterplot in R [Article]