In this article by Gopi Subramanian, author of the book Python Data Science Cookbook, you will learn why bagging methods based on decision tree algorithms are very popular among the data science community.
Rotation Forest
The claim to fame of most of these methods is that they need almost no data preparation compared to other methods, they can deliver very good results, and they can be handed to software engineers as a black-box tool.
By design, bagging lends itself nicely to parallelization. Hence, these methods can be easily applied on a very large dataset in a cluster environment.
The decision tree algorithms split the input data into various regions at each level of the tree. Thus, they perform implicit feature selection. Feature selection is one of the most important tasks in building a good model. By providing implicit feature selection, trees are at an advantageous position as compared to other techniques. Hence, bagging with decision trees comes with this advantage.
Almost no data preparation is needed for decision trees. For example, consider the scaling of attributes: attribute scaling has no impact on the structure of a tree. Missing values also do not affect decision trees, and the effect of outliers on a tree is minimal. We don't have to do explicit feature transformations to accommodate feature interactions.
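As a quick illustration of this scaling invariance, the following minimal sketch (not part of the recipe) fits one decision tree on the raw attributes and another on a standardized copy of the same data; with a fixed random_state, the two trees should produce identical predictions:

# Illustrative only: attribute scaling should not change a decision tree's predictions
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
import numpy as np

x, y = make_classification(n_samples=200, n_features=10, random_state=7)
x_scaled = StandardScaler().fit_transform(x)

tree_raw = DecisionTreeClassifier(random_state=7).fit(x, y)
tree_scaled = DecisionTreeClassifier(random_state=7).fit(x_scaled, y)

# Both trees should classify every instance the same way
print(np.all(tree_raw.predict(x) == tree_scaled.predict(x_scaled)))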
One of the major complaints against tree-based methods is the difficulty with pruning the trees to avoid overfitting. Big trees tend to also fit the noise present in the underlying data and hence lead to a low bias and high variance. However, when we grow a lot of trees and the final prediction is an average of the output of all the trees in the ensemble, we avoid these problems.
In this article, we will see a powerful tree-based ensemble method called rotation forest. A typical random forest requires a large number of trees to be a part of its ensemble in order to achieve good performance. Rotation forest can achieve similar or better performance with fewer trees. Additionally, the authors of this algorithm claim that the underlying estimator need not be a tree; in this way, it is positioned as a general framework for building ensembles, similar to gradient boosting.
The algorithm
Random forest and bagging give impressive results with very large ensembles; increasing the number of estimators improves the accuracy of these methods. On the contrary, rotation forest is designed to work with a smaller number of trees.
Let’s write down the steps involved in building a rotation forest. The number of trees required in the forest is typically specified by the user.
Let T be the number of trees required to be built.
We start with iterating from one through T, that is, we build T trees.
For each tree T, perform the following steps:
- Split the attributes in the training set into K nonoverlapping subsets of equal size.
- We now have K nonoverlapping subsets of attributes, each containing roughly n/K of the attributes. For each of the K subsets, we proceed to do the following:
- Take a bootstrap sample of 75% of the data and use the bootstrapped sample for the further steps.
- Run a principal component analysis on the ith subset of attributes and retain all the principal components. For every feature j in the ith subset, we have a principal component; let's denote it as aij, the principal component for the jth attribute in the ith subset.
- Store the principal components for the subset.
- Create a rotation matrix of size n x n, where n is the total number of attributes. Arrange the principal components in the matrix such that the components match the positions of the features in the original training dataset.
- Project the training dataset onto the rotation matrix using matrix multiplication.
- Build a decision tree with the projected dataset.
- Store the tree and rotation matrix.
A quick note about PCA: PCA is an unsupervised method. In multivariate problems, PCA is used to reduce the dimension of the data with minimal information loss or, in other words, retain maximum variation in the data. In PCA, we find the directions in the data with the most variation, that is, the eigenvectors corresponding to the largest eigenvalues of the covariance matrix and project the data onto these directions.
With a dataset (n x m) with n instances and m dimensions, PCA projects it onto a smaller subspace (n x d), where d << m.
A point to note is that PCA is computationally very expensive.
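To make the PCA step concrete, here is a small illustrative sketch (not part of the recipe) that fits PCA on a toy dataset, retains all the components as rotation forest does, and inspects their shape and the variance they capture:

# Illustrative only: fit PCA, keep all components, and project the data
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

x, y = make_classification(n_samples=500, n_features=10, random_state=7)

pca = PCA()                               # no n_components given, so all are retained
pca.fit(x)

print(pca.components_.shape)              # (10, 10): one row per principal component
print(pca.explained_variance_ratio_)      # fraction of variance captured by each direction

x_projected = pca.transform(x)            # rotate the data into the component basis
print(x_projected.shape)                  # (500, 10)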
Programming rotation forest in Python
Now let's write the Python code to implement rotation forest and test it on a classification problem. We will generate a classification dataset to demonstrate rotation forest. To our knowledge, there is no Python implementation available for rotation forest. We will leverage scikit-learn's implementation of the decision tree classifier and use the train_test_split method for the bootstrapping.
Refer to the following link to learn more about scikit-learn:
http://scikit-learn.org/stable/
First, we will write the necessary code to implement rotation forest and apply it to a classification problem.
We will start with loading all the necessary libraries. Let's leverage the make_classification method from the sklearn.datasets module to generate the training data. We will follow it with a method to select a random subset of attributes, called get_random_subset:
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
import numpy as np

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,
            n_informative = informative_features, n_redundant = redundant_features,
            n_repeated = repeated_features,random_state=7)
    return x,y

def get_random_subset(iterable,k):
    subsets = []
    iteration = 0
    np.random.shuffle(iterable)
    subset = 0
    limit = len(iterable)/k
    while iteration < limit:
        if k <= len(iterable):
            subset = k
        else:
            subset = len(iterable)
        subsets.append(iterable[-subset:])
        del iterable[-subset:]
        iteration+=1
    return subsets
We will write the build_rotationtree_model function, where we will build fully-grown trees and proceed to evaluate the forest’s performance using the model_worth function:
def build_rotationtree_model(x_train,y_train,d,k):
    models = []
    r_matrices = []
    feature_subsets = []
    for i in range(d):
        x,_,_,_ = train_test_split(x_train,y_train,test_size=0.3,random_state=7)
        # Features ids
        feature_index = range(x.shape[1])
        # Get subsets of features
        random_k_subset = get_random_subset(feature_index,k)
        feature_subsets.append(random_k_subset)
        # Rotation matrix
        R_matrix = np.zeros((x.shape[1],x.shape[1]),dtype=float)
        for each_subset in random_k_subset:
            pca = PCA()
            x_subset = x[:,each_subset]
            pca.fit(x_subset)
            for ii in range(0,len(pca.components_)):
                for jj in range(0,len(pca.components_)):
                    R_matrix[each_subset[ii],each_subset[jj]] = pca.components_[ii,jj]
        x_transformed = x_train.dot(R_matrix)
        model = DecisionTreeClassifier()
        model.fit(x_transformed,y_train)
        models.append(model)
        r_matrices.append(R_matrix)
    return models,r_matrices,feature_subsets
def model_worth(models,r_matrices,x,y):
    predicted_ys = []
    for i,model in enumerate(models):
        x_mod = x.dot(r_matrices[i])
        predicted_y = model.predict(x_mod)
        predicted_ys.append(predicted_y)
    predicted_matrix = np.asmatrix(predicted_ys)
    final_prediction = []
    for i in range(len(y)):
        pred_from_all_models = np.ravel(predicted_matrix[:,i])
        non_zero_pred = np.nonzero(pred_from_all_models)[0]
        is_one = len(non_zero_pred) > len(models)/2
        final_prediction.append(is_one)
    print classification_report(y, final_prediction)
Finally, we will write a main function used to invoke the functions that we have defined:
if __name__ == "__main__":
    x,y = get_data()
    # plot_data(x,y)
    # Divide the data into Train, dev and test
    x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)
    x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)
    # Build a bag of models
    models,r_matrices,features = build_rotationtree_model(x_train,y_train,25,5)
    model_worth(models,r_matrices,x_train,y_train)
    model_worth(models,r_matrices,x_dev,y_dev)
Walking through our code
Let's start with our main function. We will invoke get_data to get our predictor attributes and the response attribute. In get_data, we will leverage the make_classification method to generate the training data for our recipe:
def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,
            n_informative = informative_features, n_redundant = redundant_features,
            n_repeated = repeated_features,random_state=7)
    return x,y
Let's look at the parameters passed to the make_classification method. The first parameter is the number of instances required; in this case, we say we need 500 instances. The second parameter is how many attributes per instance are required; we say that we need 30. The third parameter, flip_y, randomly flips the class labels of 3% of the instances. This is done to introduce some noise into our data. The next parameter specifies how many of these 30 features should be informative enough to be used in our classification. We have specified that 60% of our features, that is, 18 out of 30, should be informative. The next parameter is about redundant features. These are generated as a linear combination of the informative features in order to introduce correlation among the features. Finally, the repeated features are duplicate features, which are drawn randomly from both the informative and redundant features.
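As a quick sanity check (illustrative only), we can inspect the shape of the generated data and the class balance before building any models:

import numpy as np

x, y = get_data()
print(x.shape)            # (500, 30): 500 instances, 30 attributes
print(np.bincount(y))     # counts of the two class labels, roughly balanced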
Let’s split the data into training and testing sets using train_test_split. We will reserve 30% of our data to test:
# Divide the data into Train, dev and test
x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)
Once again, we will leverage train_test_split to split our test data into dev and test sets:
x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)
With the data divided to build, evaluate, and test the model, we will proceed to build our models:
models,r_matrices,features = build_rotationtree_model(x_train,y_train,25,5)
We will invoke the build_rotationtree_model function to build our rotation forest. We will pass our training data, that is, the predictors, x_train, and the response variable, y_train, along with the total number of trees to be built (25 in this case) and, finally, the number of features in each subset (5 in this case).
Let’s jump into this function:
models = []
r_matrices = []
feature_subsets = []
We will begin by declaring three lists to store the decision trees, the rotation matrix for each tree, and the subset of features used in each iteration. We will then proceed to build each tree in our ensemble.
As a first order of business, we will sample the data so that each tree sees only a portion of it. The test_size of 0.3 retains 70% of the records; the original algorithm description suggests a 75% bootstrap sample:
x,_,_,_ = train_test_split(x_train,y_train,test_size=0.3,random_state=7)
We will leverage the train_test_split function from scikit-learn for the bootstrapping. We will then decide the feature subsets:
# Features ids
feature_index = range(x.shape[1])
# Get subsets of features
random_k_subset = get_random_subset(feature_index,k)
feature_subsets.append(random_k_subset)
The get_random_subset function takes the feature indices and the required size of each subset, k, as parameters and returns a list of nonoverlapping feature subsets.
In this function, we will shuffle the feature index. The feature index is an array of numbers starting from zero and going up to the number of features in our training set:
np.random.shuffle(iterable)
Let’s say that we have ten features and our k value is five, indicating that we need subsets with five nonoverlapping feature indices and we need to do two iterations. We will store the number of iterations needed in the limit variable:
limit = len(iterable)/k
while iteration < limit:
    if k <= len(iterable):
        subset = k
    else:
        subset = len(iterable)
    iteration+=1
If the required subset size, k, is less than or equal to the number of attributes remaining in our iterable, we take the last k entries of the iterable. As we shuffled the iterable, we will get a different set of features each time:
subsets.append(iterable[-subset:])
On selecting a subset, we will remove it from the iterable as we need nonoverlapping sets:
del iterable[-subset:]
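The following illustrative call (not part of the recipe) shows how get_random_subset behaves for ten feature indices and a k value of five; note that the function mutates the list it is given, so we pass it a copy:

feature_index = list(range(10))
subsets = get_random_subset(list(feature_index), 5)
print(subsets)    # for example, [[3, 7, 0, 9, 2], [5, 1, 8, 6, 4]]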
With all the subsets ready, we will declare our rotation matrix:
# Rotation matrix
R_matrix = np.zeros((x.shape[1],x.shape[1]),dtype=float)
As you can see, our rotation matrix is of size n x n, where n is the number of attributes in our dataset. You can see that we have used the shape attribute to declare this matrix filled with zeros:
for each_subset in random_k_subset:
    pca = PCA()
    x_subset = x[:,each_subset]
    pca.fit(x_subset)
For each of the subsets of features, we will proceed to do a principal component analysis on the data restricted to those features.
We will fill our rotation matrix with the component values:
for ii in range(0,len(pca.components_)):
    for jj in range(0,len(pca.components_)):
        R_matrix[each_subset[ii],each_subset[jj]] = pca.components_[ii,jj]
For example, let's say that we have three attributes in each subset out of a total of six attributes. For illustration, let's say that our subsets are as follows:
2,4,6 and 1,3,5
Our rotation matrix, R, is of size 6 x 6. Assume that we want to fill the rotation matrix for the first subset of features. We will have three principal components, one each for attributes 2, 4, and 6, each of size 1 x 3.
The output of PCA from scikit-learn is a matrix of size components X features. We will go through each component value in the for loop. At the first run, our feature of interest is two, and the cell (0,0) in the component matrix output from PCA gives the value of contribution of feature two to component one. We have to find the right place in the rotation matrix for this value. We will use the index from the component matrices, ii and jj, with the subset list to get the right place in the rotation matrix:
R_matrix[each_subset[ii],each_subset[jj]] = pca.components_[ii,jj]
With ii and jj both at zero, each_subset[0] and each_subset[0] will put us in cell (2,2) in the rotation matrix. As we go through the loop, the next component value in cell (0,1) in the component matrix will be placed in cell (2,4) in the rotation matrix and the last one in cell (2,6) in the rotation matrix. This is done for all the attributes in the first subset. Let's go to the second subset; here the first attribute is one. The cell (0,0) of the component matrix corresponds to the cell (1,1) in the rotation matrix.
Proceeding this way, you will notice that the attribute component values are arranged in the same order as the attributes themselves.
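The following toy sketch (illustrative only, using zero-based subsets [0, 2, 4] and [1, 3, 5] over six random attributes) mirrors the loop above and shows how each subset's 3 x 3 component matrix lands in the rows and columns of its own attributes, leaving the rest of the rotation matrix at zero:

import numpy as np
from sklearn.decomposition import PCA

x = np.random.rand(100, 6)
subsets = [[0, 2, 4], [1, 3, 5]]

R_matrix = np.zeros((6, 6))
for each_subset in subsets:
    pca = PCA()
    pca.fit(x[:, each_subset])
    for ii in range(len(pca.components_)):
        for jj in range(len(pca.components_)):
            R_matrix[each_subset[ii], each_subset[jj]] = pca.components_[ii, jj]

# Rows and columns 0, 2, 4 hold the first subset's components;
# rows and columns 1, 3, 5 hold the second subset's components
print(np.round(R_matrix, 2))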
With our rotation matrix ready, let’s project our input onto the rotation matrix:
x_transformed = x_train.dot(R_matrix)
It’s time now to fit our decision tree:
model = DecisionTreeClassifier()
model.fit(x_transformed,y_train)
Finally, we will store our models and the corresponding rotation matrices:
models.append(model)
r_matrices.append(R_matrix)
With our model built, let's proceed to see how good our model is on both the training and dev datasets using the model_worth function:
model_worth(models,r_matrices,x_train,y_train)
model_worth(models,r_matrices,x_dev,y_dev)
Let’s see our model_worth function:
for i,model in enumerate(models):
    x_mod = x.dot(r_matrices[i])
    predicted_y = model.predict(x_mod)
    predicted_ys.append(predicted_y)
In this function, we will perform prediction using each of the trees that we built. However, before doing the prediction, we will project our input using the rotation matrix. We will store all our prediction output in a list called predicted_ys. Let's say that we have 100 instances to predict and ten models in our ensemble; for each instance, we have ten predictions. We will store these as a matrix for convenience:
predicted_matrix = np.asmatrix(predicted_ys)
Now, we will proceed to give a final classification for each of our input records:
final_prediction = []
for i in range(len(y)):
    pred_from_all_models = np.ravel(predicted_matrix[:,i])
    non_zero_pred = np.nonzero(pred_from_all_models)[0]
    is_one = len(non_zero_pred) > len(models)/2
    final_prediction.append(is_one)
We will store our final prediction in a list called final_prediction. We will go through each of the predictions for our instances. Let's say that we are at the first instance (i=0 in our for loop); pred_from_all_models stores the output from all the trees in our model. It's an array of zeros and ones indicating the class that each model assigned to this instance.
We will make another array out of it, non_zero_pred, which has only those entries from the parent arrays that are non-zero.
Finally, if the length of this non-zero array is greater than half the number of models that we have, we say that our final prediction is one for the instance of interest. What we have accomplished here is the classic voting scheme.
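Here is a small illustration (not part of the recipe) of this voting scheme with three models and four instances, where each row of the prediction matrix holds one model's zero/one predictions:

import numpy as np

predicted_matrix = np.asmatrix([[1, 0, 1, 1],
                                [1, 0, 0, 1],
                                [0, 0, 1, 1]])
n_models = predicted_matrix.shape[0]

final_prediction = []
for i in range(predicted_matrix.shape[1]):
    pred_from_all_models = np.ravel(predicted_matrix[:, i])
    non_zero_pred = np.nonzero(pred_from_all_models)[0]
    final_prediction.append(len(non_zero_pred) > n_models / 2)

print(final_prediction)    # [True, False, True, True]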
Let’s look at how good our models are now by calling a classification report:
print classification_report(y, final_prediction)
We can now inspect how well our model has performed on the training set, and then check its performance on the dev dataset.
References
More information about rotation forest can be obtained from the following paper:
Juan J. Rodríguez, Ludmila I. Kuncheva, and Carlos J. Alonso, Rotation Forest: A New Classifier Ensemble Method, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
The paper also claims that when rotation forest was compared to bagging, AdaBoost, and random forest on 33 datasets, rotation forest outperformed the other three algorithms.
Similar to gradient boosting, the authors of the paper claim that rotation forest is an overall framework and the underlying estimator need not be a decision tree. Work is in progress on testing other estimators, such as Naïve Bayes and neural networks.
Summary
In this article, we learned why bagging methods based on decision tree algorithms are very popular among the data science community, and we saw how to implement one such method, rotation forest, in Python.