
Note: This article is an excerpt from the book Ensemble Machine Learning, which serves as a beginner's guide to combining powerful machine learning algorithms to build optimized models.

In this article, we will look at different methods of selecting features from a dataset and discuss four types of feature selection algorithms, along with their implementation in Python using the scikit-learn (sklearn) library:

  1. Univariate selection
  2. Recursive Feature Elimination (RFE)
  3. Principal Component Analysis (PCA)
  4. Choosing important features (feature importance)

We will explain the first three algorithms and their implementations briefly, and then discuss choosing important features (feature importance) in detail, as it is a widely used technique in the data science community.

Univariate selection

Statistical tests can be used to select those features that have the strongest relationships with the output variable.

The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.

The following example uses the chi squared (chi^2) statistical test for non-negative features to select four of the best features from the Pima Indians onset of diabetes dataset:

#Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)

#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's feature selection algorithm
from sklearn.feature_selection import SelectKBest
#Import chi2 for performing chi square test
from sklearn.feature_selection import chi2

#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

#Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)

#Create array from data values
array = dataframe.values

#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]

#We will select the features using chi square
test = SelectKBest(score_func=chi2, k=4)

#Fit the function for ranking the features by score
fit = test.fit(X, Y)

#Summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)

#Apply the transformation on to the dataset
features = fit.transform(X)

#Summarize the selected features
print(features[0:5,:])

You can see the scores for each attribute and the four attributes chosen (those with the highest scores): plas, test, mass, and age.

Scores for each feature:

[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393   181.304]

Selected Features:

[[148. 0. 33.6 50. ]

[85. 0. 26.6 31. ]

[183. 0. 23.3 32. ]

[89. 94. 28.1 21. ]

[137. 168. 43.1 33. ]]
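
If you would rather read off the selected attribute names programmatically than match scores by eye, the fitted selector exposes a boolean mask via its get_support() method. The following minimal sketch assumes the fit object and the names list defined in the example above:

#Map the selection mask back to the attribute names (assumes 'fit' and 'names' from above)
selected = [name for name, keep in zip(names[0:8], fit.get_support()) if keep]
print(selected)   #Expected output: ['plas', 'test', 'mass', 'age']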

Recursive Feature Elimination

RFE works by recursively removing attributes and building a model on attributes that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. You can learn more about the RFE class in the scikit-learn documentation.

The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter too much as long as it is skillful and consistent:

#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's feature selection algorithm
from sklearn.feature_selection import RFE
#Import LogisticRegression to use as the estimator for RFE
from sklearn.linear_model import LogisticRegression

#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

#Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)

#Create array from data values
array = dataframe.values

#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]

#Feature extraction
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)

print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

After execution, we will get:

Num Features: 3

Selected Features: [ True False False False False   True  True False]

Feature Ranking: [1 2 3 5 6 1 1 4]

You can see that RFE chose the top three features as preg, mass, and pedi. These are marked True in the support_ array and marked with a 1 in the ranking_ array.
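
As a quick sanity check, you can map the support_ mask back to the attribute names rather than counting positions by hand. This is a minimal sketch that assumes the fitted fit object and the names list from the example above:

#Print the attributes that RFE ranked 1 (assumes 'fit' and 'names' from above)
for name, selected, rank in zip(names[0:8], fit.support_, fit.ranking_):
    if selected:
        print(name, rank)   #Should list preg, mass, and pedi, each with rank 1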

Principal Component Analysis

PCA uses linear algebra to transform the dataset into a compressed form. Generally, it is considered a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result.

In the following example, we use PCA and select three principal components:

#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's PCA algorithm
from sklearn.decomposition import PCA

#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

#Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)

#Create array from data values
array = dataframe.values

#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]

#Feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)

#Summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

You can see that the transformed dataset (three principal components) bears little resemblance to the source data:

Explained Variance: [ 0.88854663   0.06159078  0.02579012]

[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [ -2.26488861e-02  -9.72210040e-01  -1.41909330e-01   5.78614699e-02
    9.46266913e-02  -4.69729766e-02  -8.16804621e-04  -1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]
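
Note that the example above only fits the PCA model and prints its components; to obtain the compressed dataset itself, you still need to apply the transformation. A minimal sketch, assuming the fitted fit object and the input array X from above:

#Project the input data onto the three principal components
X_reduced = fit.transform(X)
print(X_reduced.shape)     #Should print (768, 3): 768 instances described by 3 components instead of 8
print(X_reduced[0:5,:])    #Inspect the first five transformed rows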

Choosing important features (feature importance)

Feature importance is the technique used to select features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let’s understand it in detail.

Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection—mean decrease impurity and mean decrease accuracy.

A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is known as impurity. For classification, it is typically either the Gini impurity or information gain/entropy, and for regression trees, it is the variance. Thus, when training a tree, we can compute how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged and the features ranked according to this measure.
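
To make that averaging step concrete, here is a small illustrative sketch (not part of the example that follows) which trains a forest on sklearn's bundled iris data and reproduces the forest-level importances by averaging the impurity decrease recorded in each individual tree:

#Illustration: forest importance is the average of the per-tree impurity decrease
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

#Average the normalized impurity decrease over all trees in the forest
manual_importance = np.mean([tree.feature_importances_ for tree in forest.estimators_], axis=0)

print(forest.feature_importances_)   #Importances reported by sklearn
print(manual_importance)             #Should closely match the values above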

Let's see how to do feature selection using a random forest classifier and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset, which is available for free from Kaggle (you will need to sign up to Kaggle to be able to download it). You can download the training dataset, train.csv.zip, from https://www.kaggle.com/c/otto-group-product-classification-challenge/data and place the unzipped train.csv file in your working directory.

This dataset describes more than 61,000 products with 93 obfuscated attributes, grouped into 10 product categories (for example, fashion, electronics, and so on). The input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 10 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).
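
Multiclass logarithmic loss penalizes confident but wrong probability estimates: for each product it takes the negative log of the probability assigned to the true category and averages over all products. A small sketch of how it can be computed with sklearn's log_loss, using made-up values purely for illustration:

#Illustration of multiclass log loss on made-up predictions for three products and three classes
from sklearn.metrics import log_loss

y_true = [0, 2, 1]                     #True class index for each product
y_prob = [[0.7, 0.2, 0.1],             #Predicted probability for each class
          [0.1, 0.1, 0.8],
          [0.3, 0.4, 0.3]]

print(log_loss(y_true, y_prob))        #Lower is better; 0 would mean perfect predictions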

We will start with importing all of the libraries:

#Import the supporting libraries

#Import pandas to load the dataset from csv file
from pandas import read_csv

#Import numpy for array based operations and calculations
import numpy as np

#Import Random Forest classifier class from sklearn
from sklearn.ensemble import RandomForestClassifier

#Import the SelectFromModel feature selector class from sklearn
from sklearn.feature_selection import SelectFromModel

np.random.seed(1)

Let’s define a method to split our dataset into training and testing data; we will train our dataset on the training part and the testing part will be used for evaluation of the trained model:

#Function to create Train and Test set from the original dataset
def getTrainTestData(dataset, split):
    np.random.seed(0)
    training = []
    testing = []
    np.random.shuffle(dataset)
    shape = np.shape(dataset)
    trainlength = np.uint16(np.floor(split*shape[0]))
    for i in range(trainlength):
        training.append(dataset[i])
    for i in range(trainlength, shape[0]):
        testing.append(dataset[i])
    training = np.array(training)
    testing = np.array(testing)
    return training, testing

We also need to add a function to evaluate the accuracy of the model; it will take the predicted and actual output as input to calculate the percentage accuracy:

#Function to evaluate model performance
def getAccuracy(pre, ytest):
    count = 0
    for i in range(len(ytest)):
        if ytest[i] == pre[i]:
            count += 1
    acc = float(count)/len(ytest)
    return acc

Now it is time to load the dataset. We will load the train.csv file; this file contains more than 61,000 training instances. We will use 50,000 instances for our example, of which 35,000 will be used to train the classifier and 15,000 to test its performance:

#Load dataset as pandas data frame
data = read_csv('train.csv')

#Extract attribute names from the data frame
feat = data.keys()
feat_labels = feat.values

#Extract data values from the data frame
dataset = data.values

#Shuffle the dataset
np.random.shuffle(dataset)

#We will select 50000 instances to train the classifier
inst = 50000

#Extract 50000 instances from the dataset
dataset = dataset[0:inst,:]

#Create Training and Testing data for performance evaluation
train, test = getTrainTestData(dataset, 0.7)

#Split data into input and output variables
Xtrain = train[:,0:94]
ytrain = train[:,94]
shape = np.shape(Xtrain)
print("Shape of the dataset ", shape)

#Print the size of Data in MBs
print("Size of Data set before feature selection: %.2f MB" % (Xtrain.nbytes/1e6))

Let's take note of the data size here; our training set contains 35,000 instances with 94 attributes, so the dataset is quite large. Let's see:

Shape of the dataset (35000, 94)

Size of Data set before feature selection: 26.32 MB

As you can see, we have 35,000 rows and 94 columns in our dataset, which is more than 26 MB of data.

In the next code block, we will configure our random forest classifier; we will use 250 trees with a maximum depth of 30, and the number of random features will be 7. The other hyperparameters will be sklearn's defaults:

#Let's select the test data for model evaluation purposes
Xtest = test[:,0:94]
ytest = test[:,94]

#Create a random forest classifier with the following parameters
trees = 250
max_feat = 7
max_depth = 30
min_sample = 2

clf = RandomForestClassifier(n_estimators=trees,
                             max_features=max_feat,
                             max_depth=max_depth,
                             min_samples_split=min_sample,
                             random_state=0,
                             n_jobs=-1)

#Train the classifier and calculate the training time
import time
start = time.time()
clf.fit(Xtrain, ytrain)
end = time.time()

#Let's note down the model training time
print("Execution time for building the Tree is: %f" % (float(end) - float(start)))

pre = clf.predict(Xtest)

Let's see how much time is required to train the model on the training dataset:

Execution time for building the Tree is: 2.913641

#Evaluate the model performance for the test data
acc = getAccuracy(pre, ytest)
print("Accuracy of model before feature selection is %.2f" % (100*acc))

The accuracy of our model is:

Accuracy of model before feature selection is 98.82

As you can see, we are getting very good accuracy, as we are classifying almost 99% of the test data into the correct categories. This means we are classifying about 14,823 instances out of 15,000 into the correct classes.

So, now my question is: should we go for further improvement? Well, why not? We should definitely go for more improvements if we can; here, we will use feature importance to select features. As you know, in the tree-building process, we use impurity measurement for node selection: the attribute that yields the lowest impurity is chosen as the node in the tree. We can use similar criteria for feature selection, giving more importance to the features that reduce impurity the most, and this is exposed through the feature_importances_ attribute in sklearn. Let's find out the importance of each feature:

#Once we have trained the model we will rank all the features
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)

('id', 0.33346650420175183)

('feat_1', 0.0036186958628801214)

('feat_2', 0.0037243050888530957)

('feat_3', 0.011579217472062748)

('feat_4', 0.010297382675187445)

('feat_5', 0.0010359139416194116)

('feat_6', 0.00038171336038056165)

('feat_7', 0.0024867672489765021)

('feat_8', 0.0096689721610546085)

('feat_9', 0.007906150362995093)

('feat_10', 0.0022342480802130366)

As you can see here, each feature has a different importance based on its contribution to the final prediction.

We will use these importance scores to rank our features; in the following part, we will select those features that have an importance of more than 0.01 for model training:

#Select features which have higher contribution in the final prediction
sfm = SelectFromModel(clf, threshold=0.01)
sfm.fit(Xtrain, ytrain)

Here, we will transform the input dataset according to the selected feature attributes. In the next code block, we will transform the dataset. Then, we will check the size and shape of the new dataset:

#Transform input dataset
Xtrain_1 = sfm.transform(Xtrain)
Xtest_1 = sfm.transform(Xtest)

#Let's see the size and shape of the new dataset
print("Size of Data set after feature selection: %.2f MB" % (Xtrain_1.nbytes/1e6))
shape = np.shape(Xtrain_1)
print("Shape of the dataset ", shape)

Size of Data set after feature selection: 5.60 MB
Shape of the dataset (35000, 20)

Do you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the dataset from 26.32 MB to 5.60 MB, a reduction of about 80 percent from the original dataset.

In the next code block, we will train a new random forest classifier with the same hyperparameters as earlier and test it on the testing dataset. Let’s see what accuracy we get after modifying the training set:

#Model training time
start = time.time()
clf.fit(Xtrain_1, ytrain)
end = time.time()
print("Execution time for building the Tree is: %f" % (float(end) - float(start)))

#Let's evaluate the model on the test data
pre = clf.predict(Xtest_1)
count = 0
acc2 = getAccuracy(pre, ytest)
print("Accuracy after feature selection %.2f" % (100*acc2))

Execution time for building the Tree is: 1.711518
Accuracy after feature selection 99.97

Can you see that? We have achieved 99.97 percent accuracy with the modified dataset, which means we are classifying 14,996 instances into the correct classes, while previously we were classifying only 14,823 instances correctly.

This is a huge improvement achieved through the feature selection process; we can summarize all the results in the following table:

Evaluation criteria    Before feature selection    After feature selection
Number of features     94                          20
Size of dataset        26.32 MB                    5.60 MB
Training time          2.91 seconds                1.71 seconds
Accuracy               98.82 percent               99.97 percent

The preceding table shows the practical advantages of feature selection. You can see that we have reduced the number of features significantly, which reduces the model complexity and the dimensions of the dataset. Training time is shorter after the reduction in dimensions, and, at the end, we have overcome the overfitting issue, achieving higher accuracy than before.

To summarize, in this article we explored four ways of performing feature selection in machine learning.

If you found this post useful, do check out the book Ensemble Machine Learning to learn more about stacked generalization, among other techniques.


