This is an introductory post on scikit-learn where we will learn basic terminology and functionality of this amazing Python package. We will also explore basic principles of machine learning and how machine learning can be done with sklearn.

What is scikit-learn (sklearn)?

  • scikit-learn is a Python library for machine learning.
  • It has an efficient implementation of various machine learning and data mining algorithms.
  • It is easy to use and accessible to everybody – it is open source and released under a commercially usable BSD license.
  • Data scientists love Python, and most in industry use the following as their data science stack:

               numpy + pandas + sklearn

Dependencies

  • Python (>= 2.6)
  • numpy (>= 1.6.1)
  • scipy (>= 0.9)
  • matplotlib (for some tasks)

Installation

Mac – pip install -U numpy scipy scikit-learn

Linux – sudo apt-get install build-essential python-dev python-setuptools python-numpy python-scipy libatlas-dev libatlas3gf-base

After you have installed sklearn and all its dependencies, you are ready to dive further.

Input data

Most machine learning algorithms implemented in sklearn expect the input data in the form of a numpy array of shape [nSamples, nFeatures].

nSamples is the number of samples in the data. Each sample is an observation or an instance of the data. A sample can be a text document, a picture, a row in a database or a csv file – anything you can describe with a fixed set of quantitative traits.

nFeatures is the number of features or distinct traits that describe each sample quantitatively. Features can be real-valued, boolean or discrete.

The data can be very high dimensional, e.g. with hundreds of thousands of features, and it can be sparse, i.e. most of the feature values are zero.
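As a minimal sketch (with made-up numbers, just to illustrate the layout), a dataset of 3 samples with 2 features each is simply a 2-dimensional numpy array:

   import numpy as np

   # 3 samples, each described by 2 features (hypothetical values)
   X = np.array([[5.0, 3.2],
                 [4.7, 2.9],
                 [6.1, 3.5]])
   X.shape
   >> (3, 2)   # [nSamples, nFeatures]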

Example

As an example, we will look at the Iris dataset, which comes with sklearn and every other ML package that I know of!

   from sklearn.datasets import load_iris
   iris = load_iris()
   input = iris.data
   output = iris.target

How many samples and features does this dataset have?

Since the input data is a numpy array, we can access its shape using the following:

   nSamples = input.shape[0]
   nFeatures = input.shape[1]
   >> nSamples = 150
   >> nFeatures = 4

This dataset has 150 samples, where each sample has 4 features. Let’s look at the names of the target output:

   iris.target_names
   >> array(['setosa','versicolor', 'virginica'], dtype='|S10')

To get a better idea of the data, let’s look at a sample:

   input[0]
   >> array([5.1, 3.5, 1.4, 0.2])
   output[0]
   >> 0

The data is given as a numpy array of shape (150,4) which consists of the measurements of physical traits for three species of irises. The features include:

  • sepal length in cm
  • sepal width in cm
  • petal length in cm
  • petal width in cm

The target values {0,1,2} denote three species:

  • Setosa
  • Versicolour
  • Virginica
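The feature names are stored on the dataset object itself, so you can check them directly:

   iris.feature_names
   >> ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']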

Here is the basic idea of machine learning.

The basic setting for a supervised machine learning model is as follows:

  • We have a labeled training set, i.e., samples with known values of the target.
  • We are given an unlabeled test set, i.e., samples for which the target values are unknown.
  • The goal is to build a model that trains on the labeled data to predict the output for the unlabeled data.

Supervised learning is further broken down into two categories: classification and regression.

  • In classification, the target value is discrete.
  • In regression, the target value is continuous (a toy illustration follows below).
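As a toy illustration (with made-up values), a classification target is an array of discrete labels, while a regression target is an array of continuous numbers:

        import numpy as np

        y_classification = np.array([0, 2, 1, 0])        # discrete class labels
        y_regression = np.array([3.7, 12.1, 0.5, 8.9])   # continuous values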

There are various machine learning methods that can be used to build a supervised learning model, for example decision trees, k-nearest neighbors, SVM, linear and logistic regression, random forests, and more. I won't discuss these methods and their differences in this post. Instead, I will give an illustration of using sklearn for predictive modeling with a classification model and a regression model.

Iris Example continued (Classification):

We saw that data is a numpy array of shape (150,4) consisting of measurements of physical traits for three iris species.

Goal

The task is to build a machine learning model to predict the species of a sample given the values of the features.

We will split the iris set into a training and a test set. The model will be built on a training set and evaluated on the test set. Before we do that, let’s look at the general outline of a machine learning model in sklearn.

Outline of sklearn models:

The basic outline of a sklearn model is given by the following pseudocode.

        input = labeled data
        X_train = input.features
        Y_train = input.target
        algorithm = sklearn.ClassImplementingTheAlgorithm(parameters of the algorithm)
        fitting = algorithm.fit(X_train, Y_train)
        X_test = unlabeled set
        prediction = algorithm.predict(X_test)


Here, as before, the labeled training data is in the form of a numpy array with X_train as the array of feature values and Y_train as the corresponding target values. In sklearn, different machine learning algorithms are implemented as classes and we will choose the class corresponding to the algorithm we want to use. Each class has a method called fit which fits the input training data to estimate the parameters of the algorithm. Now with these estimated parameters, the predict method computes the estimated value of the target for the test examples.
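To make this concrete, here is a minimal sketch of the same fit/predict pattern using logistic regression on made-up data; the class name (and its parameters) is the only thing that changes from one algorithm to another.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # made-up labeled training data: 4 samples with 2 features each
        X_train = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.5]])
        Y_train = np.array([0, 0, 1, 1])

        algorithm = LogisticRegression()             # class implementing the algorithm
        fitting = algorithm.fit(X_train, Y_train)    # estimate the model parameters

        X_test = np.array([[0.5, 1.0], [2.5, 0.2]])  # unlabeled samples
        prediction = algorithm.predict(X_test)       # array of predicted labels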

sklearn model on iris data:

Following the general outline of the sklearn model, we will now build a model on iris data to predict the species.

        from sklearn.datasets import load_iris
        iris = load_iris()
        X = iris.data
        Y = iris.target

        from sklearn import cross_validation
        X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X,Y, test_size=0.4)

        from sklearn.neighbors import KNeighborsClassifier
        algorithm = KNeighborsClassifier(n_neighbors=5)
        fitting = algorithm.fit(X_train, Y_train)
        prediction = algorithm.predict(X_test)

The iris data set is split into a training and a test set using the cross_validation module from sklearn: 60% of the data forms the training set and the remaining 40% forms the test set. train_test_split picks the training and test examples randomly. We used the K-nearest neighbors algorithm to build this model; there is no reason for choosing this method other than simplicity. The prediction of the sklearn model is a label from {0,1,2} for each test case.

Let’s check how well this model performed:

        from sklearn.metrics import accuracy_score
        accuracy_score(Y_test, prediction)
        >> 0.97

Regression:

We will discuss the simplest example of fitting a line through the data.

                # Create some simple data
                import numpy as np
                np.random.seed(0)
                X = np.random.random(size=(20, 1))
                y = 3 * X.squeeze() + 2 + np.random.normal(size=20)


                # Fit a linear regression to it
                from sklearn.linear_model import LinearRegression
                model = LinearRegression(fit_intercept=True)
                model.fit(X, y)
                
                print ("Model coefficient: %.5f, and intercept: %.5f"% (model.coef_, model.intercept_))
                >> Model coefficient: 3.93491, and intercept: 1.46229
                
                # model prediction
                X_test = np.linspace(0, 1, 100)[:, np.newaxis]
                y_test = model.predict(X_test)

Thus we get predicted values of the target (which are continuous).
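If matplotlib is installed (it was listed above as an optional dependency), a quick plot is one way to inspect the fitted line; this is just a sketch:

                # visualize the training data and the fitted line
                import matplotlib.pyplot as plt
                plt.scatter(X.squeeze(), y, label='data')
                plt.plot(X_test.squeeze(), y_test, color='red', label='fitted line')
                plt.legend()
                plt.show()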

We gave simple models based on the sklearn implementations of the K-nearest neighbors algorithm and linear regression. You can try other models; the Python code will be the same for most methods in sklearn, except for the name of the algorithm.
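For example, a decision tree can be swapped in for K-nearest neighbors by changing only the class; a minimal sketch, reusing X_train, Y_train, X_test, Y_test and accuracy_score from the iris classification example above:

        from sklearn.tree import DecisionTreeClassifier

        algorithm = DecisionTreeClassifier()         # different algorithm, same interface
        fitting = algorithm.fit(X_train, Y_train)
        prediction = algorithm.predict(X_test)
        accuracy_score(Y_test, prediction)           # evaluate exactly as before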

Discover more Machine Learning content and tutorials on our dedicated Machine Learning page.

About the Author

Janu Verma is a Quantitative Researcher at the Buckler Lab, Cornell University, where he works on problems in bioinformatics and genomics. His background is in mathematics and machine learning and he leverages tools from these areas to answer questions in biology.

He holds a Masters in Theoretical Physics from the University of Cambridge, UK, and he dropped out of a mathematics PhD program (after 3 years) at Kansas State University.

He has held research positions at Indian Statistical Institute – Delhi, Tata Institute of Fundamental Research – Mumbai and at JN Center for Advanced Scientific Research – Bangalore.

He is a voracious reader and an avid traveler. He hangs out at the local coffee shops, which serve as his office away from office. He writes about data science, machine learning and mathematics at Random Inferences.
