Classifier Construction

In this article by Pratik Joshi, author of the book Python Machine Learning Cookbook, we will build a simple classifier using supervised learning, and then go on to build a logistic regression classifier.

Building a simple classifier

In the field of machine learning, classification refers to the process of using the characteristics of data to separate it into a certain number of classes. A supervised learning classifier builds a model using labeled training data, and then uses this model to classify unknown data. Let’s take a look at how to build a simple classifier.

How to do it…

Before we begin, make sure that you have imported the numpy and matplotlib.pyplot packages (as np and plt, respectively, which is how the code below refers to them). After this, let’s create some sample data:
```
X = np.array([[3,1], [2,5], [1,8], [6,4], [5,2], [3,5], [4,7], [4,-1]])
```
Let’s assign some labels to these points:
```
y = [0, 1, 1, 0, 0, 1, 1, 0]
```
As we have only two classes, the list y contains 0s and 1s. In general, if you have N classes, the values in y range from 0 to N-1. Let’s separate the data into classes based on the labels:
```
class_0 = np.array([X[i] for i in range(len(X)) if y[i]==0])
class_1 = np.array([X[i] for i in range(len(X)) if y[i]==1])
```
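The same split can also be written with NumPy boolean indexing. The following is a minimal sketch, assuming the labels are first converted to an array; the names labels, class_0_alt, and class_1_alt are introduced here for illustration and are not part of the original recipe:

```python
import numpy as np

X = np.array([[3,1], [2,5], [1,8], [6,4], [5,2], [3,5], [4,7], [4,-1]])
y = [0, 1, 1, 0, 0, 1, 1, 0]

# Hypothetical names: convert the label list to an array so it can act as a mask
labels = np.array(y)

# Keep only the rows of X whose label matches the class
class_0_alt = X[labels == 0]
class_1_alt = X[labels == 1]

print(class_0_alt.shape, class_1_alt.shape)
```

Both versions select the same rows; the mask form avoids the explicit loop over indices.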
To get an idea about our data, let’s plot this, as follows:
```
plt.figure()
plt.scatter(class_0[:,0], class_0[:,1], color='black', marker='s')
plt.scatter(class_1[:,0], class_1[:,1], color='black', marker='x')
plt.show()
```
This is a scatterplot where we use squares and crosses to plot the points. In this context, the marker parameter specifies the shape that you want to use. We use squares to denote points in class_0 and crosses to denote points in class_1. If you run this code, you will see the following figure:

Earlier, when we created class_0 and class_1, we simply used the mapping between X and y to split the points into two arrays. If you were asked to inspect the datapoints visually and draw a separating line, what would you do? You would simply draw a line between the two groups. Let’s go ahead and do this:
```
line_x = range(10)
line_y = line_x
```

We just created a line with the mathematical equation y = x. Let’s plot this, as follows:

```
plt.figure()
plt.scatter(class_0[:,0], class_0[:,1], color='black', marker='s')
plt.scatter(class_1[:,0], class_1[:,1], color='black', marker='x')
plt.plot(line_x, line_y, color='black', linewidth=3)
plt.show()
```

If you run this code, you should see the following figure:

There’s more…

We built a really simple classifier using the following rule: the input point (a, b) belongs to class_0 if a is greater than or equal to b; otherwise, it belongs to class_1. If you inspect the points one by one, you will see that this is true. That’s it! You just built a linear classifier that can classify unknown data. It’s a linear classifier because the separating boundary is a straight line. If the boundary were a curve, it would be a nonlinear classifier.
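The rule can be written down directly as code. The following is a minimal sketch; classify_point is a hypothetical helper name introduced here for illustration and does not appear in the original recipe:

```python
import numpy as np

X = np.array([[3,1], [2,5], [1,8], [6,4], [5,2], [3,5], [4,7], [4,-1]])
y = [0, 1, 1, 0, 0, 1, 1, 0]

def classify_point(a, b):
    """Hypothetical helper: class 0 if the point lies on or below the line y = x, else class 1."""
    return 0 if a >= b else 1

# Apply the rule to every sample point and compare with the labels
predictions = [classify_point(a, b) for a, b in X]
print(predictions == y)

# The same rule also classifies points that were not in the sample data
print(classify_point(7, 3), classify_point(2, 9))
```

Checking the predictions against y confirms that the hand-drawn line separates all eight sample points.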

This approach worked fine because there were only a few points, and we could inspect them visually. What if there are thousands of points? How do we generalize this process? Let’s discuss this in the next section.

Building a logistic regression classifier

Despite the word regression in its name, logistic regression is used for classification. Given a set of datapoints, our goal is to build a model that can draw linear boundaries between our classes. It extracts these boundaries by solving a set of equations derived from the training data. Let’s see how to do that in Python:

We will use the logistic_regression.py file that is already provided to you as a reference. Assuming that you have imported the necessary packages (numpy as np, matplotlib.pyplot as plt, and linear_model from sklearn), let’s create some sample data along with training labels:
```
X = np.array([[4, 7], [3.5, 8], [3.1, 6.2], [0.5, 1], [1, 2], [1.2, 1.9], [6, 2], [5.7, 1.5], [5.4, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
```
Here, we assume that we have three classes.
Let’s initialize the logistic regression classifier:
```
classifier = linear_model.LogisticRegression(solver='liblinear', C=100)
```
There are a number of input parameters that can be specified for the preceding function, but two important ones are solver and C. The solver parameter specifies the type of solver that the algorithm will use to solve the system of equations. The C parameter is the inverse of the regularization strength: a lower value means stronger regularization.
Let’s train the classifier:
```
classifier.fit(X, y)
```
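Once the classifier is trained, it can be checked against the training data and used on new points. The following is a minimal sketch rather than the book’s code. As an assumption, it wraps the liblinear-based model in OneVsRestClassifier, because some scikit-learn releases may warn about or reject liblinear when it is given more than two classes directly. The names ovr_classifier, train_accuracy, centroids, and centroid_predictions are introduced here for illustration:

```python
import numpy as np
from sklearn import linear_model
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[4, 7], [3.5, 8], [3.1, 6.2], [0.5, 1], [1, 2], [1.2, 1.9],
              [6, 2], [5.7, 1.5], [5.4, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Assumption: an explicit one-vs-rest wrapper around the article's settings
base = linear_model.LogisticRegression(solver='liblinear', C=100)
ovr_classifier = OneVsRestClassifier(base)
ovr_classifier.fit(X, y)

# Fraction of training points that the model labels correctly
train_accuracy = ovr_classifier.score(X, y)

# Predict the class of one new point per cluster (here, each cluster's mean)
centroids = np.array([X[y == k].mean(axis=0) for k in np.unique(y)])
centroid_predictions = ovr_classifier.predict(centroids)

print(train_accuracy)
print(centroid_predictions)
```

The three clusters in this sample are well separated, so the model should label every training point, and each cluster mean, with the expected class.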
Let’s draw datapoints and boundaries:
```
plot_classifier(classifier, X, y)
```
We need to define this function:
```
def plot_classifier(classifier, X, y):
    # define ranges to plot the figure
    x_min, x_max = min(X[:, 0]) - 1.0, max(X[:, 0]) + 1.0
    y_min, y_max = min(X[:, 1]) - 1.0, max(X[:, 1]) + 1.0
```
The preceding values indicate the range of values that we want to use in our figure. These values usually range from the minimum value to the maximum value present in our data. We add some buffers, such as 1.0 in the preceding lines, for clarity.
In order to plot the boundaries, we need to evaluate the function across a grid of points and plot it. Let’s go ahead and define the grid:
```
    # denotes the step size that will be used in the mesh grid
    step_size = 0.01

    # define the mesh grid
    x_values, y_values = np.meshgrid(np.arange(x_min, x_max, step_size), np.arange(y_min, y_max, step_size))
```
The x_values and y_values variables contain the grid of points where the function will be evaluated.

Let’s compute the output of the classifier for all these points:

```
    # compute the classifier output
    mesh_output = classifier.predict(np.c_[x_values.ravel(), y_values.ravel()])

    # reshape the array
    mesh_output = mesh_output.reshape(x_values.shape)
```
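To see what the grid, ravel, and reshape steps do, it can help to run them on a tiny grid. The following is a standalone sketch with a much coarser step than the 0.01 used above; demo_x, demo_y, grid_points, and fake_output are hypothetical names, and fake_output merely stands in for the classifier’s predictions:

```python
import numpy as np

# A tiny grid: 3 x-positions and 2 y-positions
demo_x, demo_y = np.meshgrid(np.arange(0, 3, 1.0), np.arange(0, 2, 1.0))
print(demo_x.shape)  # (2, 3): one row per y-position, one column per x-position

# Flatten both grids and pair them column-wise: one (x, y) row per grid point
grid_points = np.c_[demo_x.ravel(), demo_y.ravel()]
print(grid_points.shape)  # (6, 2)

# Stand-in for classifier.predict: one output value per grid point
fake_output = np.arange(len(grid_points))

# Reshape the flat outputs back into the shape of the grid for plotting
print(fake_output.reshape(demo_x.shape))
```

The classifier expects one row per point, which is why the grid is flattened before predict and reshaped afterward.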

Let’s plot the boundaries using colored regions:
```
    # Plot the output using a colored plot
    plt.figure()

    # choose a color scheme
    plt.pcolormesh(x_values, y_values, mesh_output, cmap=plt.cm.Set1)
```
The plt.pcolormesh function draws a pseudocolor plot: it takes the 2D grid points and the value associated with each of them, and fills the corresponding regions using a color scheme. You can find all the color scheme options at http://matplotlib.org/examples/color/colormaps_reference.html.
Let’s overlay the training points on the plot:
```
    # overlay the training points, colored by their labels
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='black', linewidth=2, cmap=plt.cm.Paired)

    # specify the boundaries of the figure
    plt.xlim(x_values.min(), x_values.max())
    plt.ylim(y_values.min(), y_values.max())

    # specify the ticks on the X and Y axes
    plt.xticks((np.arange(int(min(X[:, 0])-1), int(max(X[:, 0])+1), 1.0)))
    plt.yticks((np.arange(int(min(X[:, 1])-1), int(max(X[:, 1])+1), 1.0)))

    plt.show()
```
Here, plt.scatter plots the points on the 2D graph. X[:, 0] selects the first column of X, that is, the x-coordinates of all the points, and X[:, 1] selects the second column (the y-coordinates). The c=y parameter indicates the color sequence: we use the target labels to map to colors using cmap, so that each class gets its own color. The limits of the display figure are set using plt.xlim and plt.ylim. To mark the axes with values, we use plt.xticks and plt.yticks, which make it easier to see where the points are located. In the preceding code, we want the ticks to lie between the minimum and maximum values with a buffer of 1 unit. We also want these ticks to be integers, so we use the int() function, which truncates the values toward zero.
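The column slicing and the tick computation can be checked on their own, without drawing anything. The following is a small sketch that uses the same sample data; first_column, second_column, x_ticks, and y_ticks are names introduced here for illustration:

```python
import numpy as np

X = np.array([[4, 7], [3.5, 8], [3.1, 6.2], [0.5, 1], [1, 2], [1.2, 1.9],
              [6, 2], [5.7, 1.5], [5.4, 2.2]])

# All rows, one column at a time
first_column = X[:, 0]   # x-coordinates of all points
second_column = X[:, 1]  # y-coordinates of all points

# int() truncates toward zero, so int(0.5 - 1) gives 0 rather than -1
x_ticks = np.arange(int(min(X[:, 0]) - 1), int(max(X[:, 0]) + 1), 1.0)
y_ticks = np.arange(int(min(X[:, 1]) - 1), int(max(X[:, 1]) + 1), 1.0)

print(x_ticks)
print(y_ticks)
```

Note that np.arange excludes its stop value, so the last tick is one unit below the upper bound passed in.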
If you run this code by calling plot_classifier(classifier, X, y) once the function is fully defined, you should see the following output:

Let’s see how the C parameter affects our model. The C parameter indicates the penalty for misclassifying training points. If we set this to 1.0, we will get the following figure:
If we set C to 10000, we get the following figure:

As we increase C, there is a higher penalty for misclassification. Hence, the boundaries separate the training classes more accurately.
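One hedged way to observe the effect of C without the figures is to compare the size of the learned weights: weaker regularization (a larger C) lets the weights grow, which produces sharper boundaries. The following sketch is an illustration under stated assumptions and is not the book’s code. It wraps the model in OneVsRestClassifier, as an assumption about what your scikit-learn version accepts for more than two classes, and weight_norm is a hypothetical helper:

```python
import numpy as np
from sklearn import linear_model
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[4, 7], [3.5, 8], [3.1, 6.2], [0.5, 1], [1, 2], [1.2, 1.9],
              [6, 2], [5.7, 1.5], [5.4, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

def weight_norm(C):
    """Hypothetical helper: fit with the given C and return the overall size of the weights."""
    model = OneVsRestClassifier(
        linear_model.LogisticRegression(solver='liblinear', C=C))
    model.fit(X, y)
    # Stack the weight vectors of the per-class binary models
    coefs = np.vstack([est.coef_ for est in model.estimators_])
    return np.linalg.norm(coefs)

norm_low_c = weight_norm(1.0)
norm_high_c = weight_norm(10000)

print(norm_low_c, norm_high_c)
```

On this sample, the weights for C=10000 should come out noticeably larger than those for C=1.0, which matches the idea that a higher C relaxes the regularization.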

Summary

We successfully employed supervised learning to build a simple classifier. We then constructed a logistic regression classifier and saw how the results change as we tweak C, the parameter that controls the regularization strength.