In this article by Luca Massaron and Alberto Boschetti, the authors of the book Python Data Science Essentials – Second Edition, we will cover the steps for installing Python, look at the different installation packages, and glance at the essential packages that constitute a complete Data Science Toolbox.
Whether you are an eager learner of data science or a well-grounded data science practitioner, you can take advantage of this essential introduction to Python for data science. You can use it to the fullest if you already have at least some previous experience in basic coding, in writing general-purpose computer programs in Python, or in some other data-analysis-specific language such as MATLAB or R.
Introducing data science and Python
Data science is a relatively new knowledge domain, though its core components have been studied and researched for many years by the computer science community. Its components include linear algebra, statistical modelling, visualization, computational linguistics, graph analysis, machine learning, business intelligence, and data storage and retrieval.
Data science is a new domain and you have to take into consideration that currently its frontiers are still somewhat blurred and dynamic. Since data science is made of various constituent sets of disciplines, please also keep in mind that there are different profiles of data scientists depending on their competencies and areas of expertise.
In such a situation, what can be the best tool of the trade that you can learn and effectively use in your career as a data scientist? We believe that the best tool is Python, and we intend to provide you with all the essential information that you will need for a quick start.
In addition, other tools such as R and MATLAB provide data scientists with specialized tools to solve specific problems in statistical analysis and matrix manipulation in data science. However, only Python really completes your data scientist skill set. This multipurpose language is suitable for both development and production alike; it can handle small- to large-scale data problems and it is easy to learn and grasp no matter what your background or experience is.
Created in 1991 as a general-purpose, interpreted, and object-oriented language, Python has slowly and steadily conquered the scientific community and grown into a mature ecosystem of specialized packages for data processing and analysis. It allows you to have uncountable and fast experimentations, easy theory development, and prompt deployment of scientific applications.
At present, the core Python characteristics that render it an indispensable data science tool are as follows:
- It offers a large, mature system of packages for data analysis and machine learning. It guarantees that you will get all that you may need in the course of a data analysis, and sometimes even more.
- Python can easily integrate different tools and offers a truly unifying ground for different languages, data strategies, and learning algorithms that can be fitted together easily and which can concretely help data scientists forge powerful solutions. There are packages that allow you to call code in other languages (in Java, C, FORTRAN, R, or Julia), outsourcing some of the computations to them and improving your script performance.
- It is very versatile. No matter what your programming background or style is (object-oriented, procedural, or even functional), you will enjoy programming with Python.
- It is cross-platform; your solutions will work perfectly and smoothly on Windows, Linux, and Mac OS systems. You won’t have to worry all that much about portability.
- Although interpreted, it is undoubtedly fast compared to other mainstream data analysis languages such as R and MATLAB (though it is not comparable to C, Java, and the newly emerged Julia language). Moreover, there are static compilers such as Cython, which translates Python code into C, and just-in-time compilers such as PyPy, which compiles it to machine code at runtime, for higher performance.
- It can work with large in-memory data because of its minimal memory footprint and excellent memory management. The memory garbage collector will often save the day when you load, transform, dice, slice, save, or discard data using various iterations and reiterations of data wrangling.
- It is very simple to learn and use. After you grasp the basics, there’s no better way to learn more than by immediately starting with the coding.
- Moreover, the number of data scientists using Python is continuously growing: new packages and improvements are released by the community every day, making the Python ecosystem an increasingly prolific and rich environment for data science.
First, let’s proceed to introduce all the settings you need in order to create a fully working data science environment to test the examples and experiment with the code that we are going to provide you with.
Python is an open source, object-oriented, and cross-platform programming language.
Compared to some of its direct competitors (for instance, C++ or Java), Python is very concise. It allows you to build a working software prototype in a very short time. Yet it has become the most used language in the data scientist’s toolbox not just because of that. It is also a general-purpose language, and it is very flexible due to a variety of available packages that solve a wide spectrum of problems and necessities.
Python 2 or Python 3?
There are two main branches of Python: 2.7.x and 3.x. At the time of writing this article, the Python foundation (www.python.org) is offering downloads for Python version 2.7.11 and 3.5.1. Although the third version is the newest, the older one is still the most used version in the scientific area, since a few packages (check on the website py3readiness.org for a compatibility overview) won’t run otherwise yet.
In addition, there is no immediate backward compatibility between Python 3 and 2. In fact, if you try to run some code developed for Python 2 with a Python 3 interpreter, it may not work. Major changes have been made to the newest version, and that has affected past compatibility. Some data scientists, having built most of their work on Python 2 and its packages, are reluctant to switch to the new version.
We intend to address a larger audience of data scientists, data analysts and developers, who may not have such a strong legacy with Python 2. Thus, we agreed that it would be better to work with Python 3 rather than the older version. We suggest using a version such as Python 3.4 or above. After all, Python 3 is the present and the future of Python. It is the only version that will be further developed and improved by the Python foundation and it will be the default version of the future on many operating systems.
Anyway, if you are currently working with version 2 and you prefer to keep on working with it, you can still run the examples. In fact, for the most part, our code will simply work on Python 2 once it is preceded by these imports:
from __future__ import (absolute_import, division, print_function, unicode_literals)
from builtins import *
from future import standard_library
standard_library.install_aliases()
The from __future__ import commands should always occur at the beginning of your scripts or else you may experience Python reporting an error.
As described in the Python-future website (python-future.org), these imports will help convert several Python 3-only constructs to a form compatible with both Python 3 and Python 2 (and in any case, most Python 3 code should just simply work on Python 2 even without the aforementioned imports).
To run the imports above successfully, the future package must be available on your system. If it is not, install it (version >= 0.15.2) by executing the following command from a shell:
$> pip install -U future
If you’re interested in understanding the differences between Python 2 and Python 3 further, we recommend reading the wiki page offered by the Python foundation itself: wiki.python.org/moin/Python2orPython3.
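As a small illustration of what the `from __future__` imports above switch on, the following sketch inspects the standard-library `__future__` module and reports, for each feature, the release in which it became optional and the release in which it became the default. The helper name `feature_releases` is ours, chosen only for this example.

```python
import __future__

# The four features named in the import line shown earlier.
FEATURES = ("absolute_import", "division", "print_function", "unicode_literals")


def feature_releases(names=FEATURES):
    """Map each feature to ((optional major, minor), (mandatory major, minor))."""
    releases = {}
    for name in names:
        feature = getattr(__future__, name)
        releases[name] = (
            feature.getOptionalRelease()[:2],
            feature.getMandatoryRelease()[:2],
        )
    return releases


if __name__ == "__main__":
    for name, (optional, mandatory) in feature_releases().items():
        print("{}: optional since {}, default since {}".format(name, optional, mandatory))
```

Every one of these features is already the default behavior in Python 3, which is why the imports are harmless there and only change behavior under Python 2.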
Novice data scientists who have never used Python (who likely don’t have the language readily installed on their machines) need to first download the installer from the main website of the project, www.python.org/downloads/, and then install it on their local machine.
We will now cover the steps that give you full control over what is installed on your machine. This is very useful when you have to set up single machines to deal with different tasks in data science. Be warned, however, that a step-by-step installation really takes time and effort. Installing a ready-made scientific distribution instead lessens the burden of installation and may be well suited for first starting and learning, because it saves you time and sometimes even trouble, though it will put a large number of packages (most of which we won’t use) on your computer all at once.
This being a multiplatform programming language, you’ll find installers for machines that either run on Windows or Unix-like operating systems.
Please remember that some of the latest versions of most Linux distributions (such as CentOS, Fedora, Red Hat Enterprise, and Ubuntu) have Python 2 packaged in the repository. In such a case and in the case that you already have a Python version on your computer (since our examples run on Python 3), you first have to check what version you are exactly running. To do such a check, just follow these instructions:
- Open a Python shell: type python in the terminal, or click on any Python icon you find on your system.
- Then, after having Python started, to test the installation, run the following code in the Python interactive shell or REPL:
>>> import sys
>>> print(sys.version_info)
- If you can read that your Python version has the major=2 attribute, it means that you are running a Python 2 instance. Otherwise, if the attribute is valued 3, or if the print statement reports back something like v3.x.x (for instance v3.5.1), you are running the right version of Python and you are ready to move forward.
To clarify the operations we have just mentioned, when a command is given in the terminal command line, we prefix the command with $>. Otherwise, if it’s for the Python REPL, it’s preceded by >>>.
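The same check can also be placed at the top of a script so that it fails early with a readable message. This is only a minimal sketch; the function name `running_python3` is a hypothetical helper introduced for this example.

```python
import sys


def running_python3(version_info=sys.version_info):
    """Return True when the interpreter's major version is 3 or higher."""
    return version_info[0] >= 3


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    print("Running Python {}.{}".format(major, minor))
    if not running_python3():
        # The examples in this article target Python 3.
        print("This is a Python 2 interpreter; consider switching to Python 3.")
```

Passing the version as a parameter keeps the helper easy to try out with other values, for instance `running_python3((2, 7, 11))`.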
The installation of packages
Python won’t come bundled with all you need, unless you take a specific premade distribution. Therefore, to install the packages you need, you can use either pip or easy_install. Both tools run in the command line and make the process of installation, upgrade, and removal of Python packages a breeze. To check which tools have been installed on your local machine, run the following command:
To install pip, follow the instructions given at pip.pypa.io/en/latest/installing.html.
Alternatively, you can also run this command:
If both of these commands end up with an error, you need to install any one of them.
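If you prefer to check from Python rather than from the shell, a sketch along the following lines looks each tool up on the executable search path. It relies only on the standard library (`shutil.which`, available from Python 3.3); the function name `find_tools` is our own.

```python
import shutil


def find_tools(names=("pip", "easy_install")):
    """Map each tool name to its executable path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print("{}: {}".format(name, path or "not found"))
```

A tool reported as not found may still be installed under another name (for example with a version suffix), so treat the result as a first hint rather than a definitive answer.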
We recommend that you use pip because it is thought of as an improvement over easy_install. Moreover, easy_install is going to be dropped in the future and pip has important advantages over it. It is preferable to install everything using pip because:
- It is the preferred package manager for Python 3. Starting with Python 2.7.9 and Python 3.4, it is included by default with the Python binary installers.
- It provides an uninstall functionality.
- It rolls back and leaves your system clear if, for whatever reason, the package installation fails.
Using easy_install in spite of pip’s advantages makes sense if you are working on Windows because pip won’t always install pre-compiled binary packages. Sometimes it will try to build the package’s extensions directly from C source, thus requiring a properly configured compiler (and that’s not an easy task on Windows). This depends on whether the package is distributed as eggs (pip cannot directly use their binaries and needs to build from their source code) or wheels (in this case, pip can install binaries if available, as explained here: pythonwheels.com/). By contrast, easy_install will always install available binaries from eggs and wheels. Therefore, if you are experiencing unexpected difficulties installing a package, easy_install can save your day (at some price anyway, as we just mentioned in the list).
The most recent versions of Python should already have pip installed by default. Therefore, you may have it already installed on your system. If not, the safest way is to download the get-pip.py script from bootstrap.pypa.io/get-pip.py and then run it using the following:
$> python get-pip.py
The script will also install setuptools from pypi.python.org/pypi/setuptools, which also contains easy_install.
You’re now ready to install the packages you need in order to run the examples provided in this article. To install the <package-name> generic package, you just need to run this command:
$> pip install <package-name>
Alternatively, you can run the following command:
$> easy_install <package-name>
Note that in some systems, pip might be named as pip3 and easy_install as easy_install-3 to stress the fact that both operate on packages for Python 3. If you’re unsure, check the version of Python pip is operating on with:
$> pip -V
For easy_install, the command is slightly different:
$> easy_install --version
After running the install command, the <package-name> package and all its dependencies will be downloaded and installed. If you’re not certain whether a library has been installed or not, just try to import a module inside it. If the Python interpreter raises an ImportError, it can be concluded that the package has not been installed.
This is what happens when the NumPy library has been installed:
>>> import numpy
This is what happens if it’s not installed:
>>> import numpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named numpy
In the latter case, you’ll need to first install it through pip or easy_install.
Take care that you don’t confuse packages with modules. With pip, you install a package; in Python, you import a module. Sometimes, the package and the module have the same name, but in many cases, they don’t match. For example, the sklearn module is included in the package named scikit-learn.
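The import test and the package/module distinction can be combined in a short sketch. It uses `importlib.util.find_spec` from the standard library to look a module up without actually importing it. The function name `is_installed` is hypothetical, and apart from scikit-learn, the entries in the mapping are well-known examples that do not come from this article.

```python
import importlib.util

# Package name used with pip -> module name used with import.
# Only scikit-learn/sklearn is mentioned in the article; the others are
# added here purely for illustration.
PACKAGE_TO_MODULE = {
    "scikit-learn": "sklearn",
    "Pillow": "PIL",
    "beautifulsoup4": "bs4",
}


def is_installed(module_name):
    """Return True if a top-level module can be located on this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for package, module in PACKAGE_TO_MODULE.items():
        status = "available" if is_installed(module) else "missing"
        print("{} (import {}): {}".format(package, module, status))
```

The helper is meant for top-level names only; for a dotted name, `find_spec` first imports the parent package and may raise an error if the parent is missing.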
Finally, to search and browse the Python packages available for Python, look at pypi.python.org.
More often than not, you will find yourself in a situation where you have to upgrade a package because either the new version is required by a dependency or it has additional features that you would like to use. First, check the version of the library you have installed by glancing at the __version__ attribute, as shown in the following example, numpy:
>>> import numpy
>>> numpy.__version__  # 2 underscores before and after
'1.9.2'
Now, if you want to update it to a newer release, say the 1.11.0 version, you can run the following command from the command line:
$> pip install -U numpy==1.11.0
Alternatively, you can use the following command:
$> easy_install --upgrade numpy==1.11.0
Finally, if you’re interested in upgrading it to the latest available version, simply run this command:
$> pip install -U numpy
You can alternatively run the following command:
$> easy_install --upgrade numpy
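When deciding in code whether an installed version is older than the one you need, comparing the version strings directly is a common pitfall, because strings are compared character by character. The sketch below converts plain dotted versions into tuples of integers first. Both function names are ours, and the approach assumes purely numeric versions such as 1.11.0; a suffix such as rc1 would make `int()` raise a ValueError.

```python
def version_tuple(version):
    """Convert a plain dotted version such as '1.11.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def needs_upgrade(installed, required):
    """Return True when the installed version is older than the required one."""
    return version_tuple(installed) < version_tuple(required)


if __name__ == "__main__":
    # Numeric comparison: 1.9.2 is older than 1.11.0.
    print(needs_upgrade("1.9.2", "1.11.0"))
    # Naive string comparison gets it wrong, because '9' sorts after '1'.
    print("1.9.2" < "1.11.0")
```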
As you’ve read so far, creating a working environment is a time-consuming operation for a data scientist. You first need to install Python and then, one by one, you can install all the libraries that you will need (sometimes, the installation procedures may not go as smoothly as you’d hoped for earlier).
If you want to save time and effort and want to ensure that you have a fully working Python environment that is ready to use, you can just download, install, and use a scientific Python distribution. Apart from Python, such distributions also include a variety of preinstalled packages, and sometimes they even have additional tools and an IDE. A few of them are very well known among data scientists, and in the following content, you will find some of the key features of each of these distributions.
We suggest that you promptly download and install a scientific distribution, such as Anaconda (which is the most complete one).
Anaconda (continuum.io/downloads) is a Python distribution offered by Continuum Analytics that includes nearly 200 packages, among them NumPy, SciPy, pandas, Jupyter, Matplotlib, Scikit-learn, and NLTK. It’s a cross-platform distribution (Windows, Linux, and Mac OS X) that can be installed on machines with other existing Python distributions and versions. Its base version is free, whereas add-ons that contain advanced features are charged separately. Anaconda introduces conda, a binary package manager, as a command-line tool to manage your package installations. As stated on the website, Anaconda’s goal is to provide an enterprise-ready Python distribution for large-scale processing, predictive analytics, and scientific computing.
Leveraging conda to install packages
If you’ve decided to install an Anaconda distribution, you can take advantage of the conda binary installer we mentioned previously. Anyway, conda is an open source package management system, and consequently it can be installed separately from an Anaconda distribution.
You can test immediately whether conda is available on your system. Open a shell and type:
$> conda -V
If conda is available, its version will be displayed; otherwise an error will be reported. If conda is not available, you can quickly install it on your system by going to conda.pydata.org/miniconda.html and installing the Miniconda software suitable for your computer. Miniconda is a minimal installation that only includes conda and its dependencies.
conda can help you manage two tasks: installing packages and creating virtual environments. In this section, we will explore how conda can help you easily install most of the packages you may need in your data science projects.
Before starting, please make sure you have the latest version of conda at hand:
$> conda update conda
Now you can install any package you need. To install the <package-name> generic package, you just need to run the following command:
$> conda install <package-name>
You can also install a particular version of the package just by pointing it out:
$> conda install <package-name>=1.11.0
Similarly you can install multiple packages at once by listing all their names:
$> conda install <package-name-1> <package-name-2>
If you just need to update a package that you previously installed, you can keep on using conda:
$> conda update <package-name>
You can update all the available packages simply by using the --all argument:
$> conda update --all
Finally, conda can also uninstall packages for you:
$> conda remove <package-name>
If you would like to know more about conda, you can read its documentation at conda.pydata.org/docs/index.html. In summary, as a main advantage, it handles binaries even better than easy_install (by always providing a successful installation on Windows without any need to compile the packages from source) but without its problems and limitations. With the use of conda, packages are easy to install (and installation is always successful), update, and even uninstall. On the other hand, conda cannot install directly from a git server (so it cannot access the latest version of many packages under development) and it doesn’t cover all the packages available on PyPI as pip does.
Enthought Canopy (enthought.com/products/canopy) is a Python distribution by Enthought Inc. It includes more than 200 preinstalled packages, such as NumPy, SciPy, Matplotlib, Jupyter, and pandas. This distribution is targeted at engineers, data scientists, quantitative and data analysts, and enterprises. Its base version is free (which is named Canopy Express), but if you need advanced features, you have to buy a front version. It’s a multiplatform distribution and its command-line install tool is canopy_cli.
PythonXY (python-xy.github.io) is a free, open source Python distribution maintained by the community. It includes a number of packages, which include NumPy, SciPy, NetworkX, Jupyter, and Scikit-learn. It also includes Spyder, an interactive development environment inspired by the MATLAB IDE. The distribution is free. It works only on Microsoft Windows, and its command-line installation tool is pip.
WinPython (winpython.sourceforge.net) is also a free, open-source Python distribution maintained by the community. It is designed for scientists, and includes many packages such as NumPy, SciPy, Matplotlib, and Jupyter. It also includes Spyder as an IDE. It is free and portable. You can put WinPython into any directory, or even into a USB flash drive, and at the same time maintain multiple copies and versions of it on your system. It works only on Microsoft Windows, and its command-line tool is the WinPython Package Manager (WPPM).
Explaining virtual environments
Whether you have chosen to install a stand-alone Python or a scientific distribution, you may have noticed that you are bound on your system to the Python version you have installed. The only exception, for Windows users, is the WinPython distribution, since it is a portable installation and you can have as many different installations as you need.
A simple solution to break free of such a limitation is to use virtualenv, a tool to create isolated Python environments. By using different Python environments, you can easily achieve these things:
- Testing any new package installation or doing experimentation on your Python environment without any fear of breaking anything in an irreparable way. In this case, you need a version of Python that acts as a sandbox.
- Having at hand multiple Python versions (both Python 2 and Python 3), geared with different versions of installed packages. This can help you in dealing with different versions of Python for different purposes (for instance, some of the packages we are going to present on Windows OS only work using Python 3.4, which is not the latest release).
- Taking a replicable snapshot of your Python environment easily and having your data science prototypes work smoothly on any other computer or in production. In this case, your main concern is the immutability and replicability of your working environment.
You can find documentation about virtualenv at virtualenv.readthedocs.io/en/stable, though we are going to provide you with all the directions you need to start using it immediately. In order to take advantage of virtualenv, you first have to install it on your system:
$> pip install virtualenv
After the installation completes, you can start building your virtual environments. Before proceeding, you have to make a few decisions:
- If you have more versions of Python installed on your system, you have to decide which version to pick up. Otherwise, virtualenv will take the Python version virtualenv was installed by on your system. In order to set a different Python version, you have to pass the -p argument followed by the version of Python you want or the path of the Python executable to be used (for instance, -p python2.7 or a path to a Python executable such as -p c:\Anaconda2\python.exe).
- With virtualenv, when required to install a certain package, it will install it from scratch, even if it is already available at a system level (in the Python directory you created the virtual environment from). This default behavior makes sense because it allows you to create a completely separate, empty environment. In order to save disk space and limit the installation time of all the packages, you may instead decide to take advantage of the packages already available on your system by using the argument --system-site-packages.
- You may want to be able to move your virtual environment around later across Python installations, even among different machines. Therefore, you may want to make the functioning of all of the environment’s scripts relative to the path it is placed in by using the argument --relocatable.
After deciding on the Python version, the linking to existing global packages, and the relocatability of the virtual environment, you just launch the command from a shell to get started. Declare the name you would like to assign to your new environment:
$> virtualenv clone
virtualenv will just create a new directory using the name you provided, in the path from which you actually launched the command. To start using it, you just enter the directory and type activate:
$> cd clone
$> activate
At this point, you can start working on your separated Python environment, installing packages and working with code.
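For readers who want to see the same idea from within Python, the sketch below creates an isolated environment programmatically. Note that it uses the standard-library venv module (available from Python 3.3) rather than virtualenv itself, so it is a related technique and not the tool described above. The function name `create_environment` is ours, and the environment is created in a temporary directory only so that the example cleans up after itself.

```python
import os
import tempfile
import venv


def create_environment(target_dir, use_system_packages=False):
    """Create a bare environment with venv and return the path of its config file."""
    # with_pip=False keeps the sketch fast and offline; pass True to bootstrap pip.
    venv.create(target_dir, system_site_packages=use_system_packages, with_pip=False)
    return os.path.join(target_dir, "pyvenv.cfg")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        config_path = create_environment(os.path.join(scratch, "clone"))
        with open(config_path) as config:
            # The config file records the base interpreter and whether
            # system-wide packages are visible inside the environment.
            print(config.read())
```

The `use_system_packages` parameter plays a role similar to the --system-site-packages argument discussed earlier: when it is False, the new environment starts empty.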
If you need to install multiple packages at once, you can rely on a special pip command, pip freeze, which lists all the packages (and their versions) you have installed on your system. You can record the entire list in a text file with this command:
$> pip freeze > requirements.txt
After saving the list in a text file, just take it into your virtual environment and install all the packages in a breeze with a single command:
$> pip install -r requirements.txt
Each package will be installed according to the order in the list (packages are listed in a case-insensitive sorted order). If a package requires other packages that are later in the list, that’s not a big deal because pip automatically manages such situations. So if your package requires NumPy and NumPy is not yet installed, pip will install it first.
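To make the structure of such a file concrete, here is a minimal sketch that reads name==version lines into a dictionary. The function name `parse_requirements` and the sample content are ours; the sample merely reuses version numbers quoted elsewhere in this article, and the parser deliberately ignores anything that is not an exact pin.

```python
def parse_requirements(text):
    """Parse 'name==version' lines, as written by pip freeze, into a dict."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, separator, version = line.partition("==")
        if separator:  # keep only exact pins of the form name==version
            pins[name.strip().lower()] = version.strip()
    return pins


# Illustrative file content, not the output of a real pip freeze run.
SAMPLE = """\
numpy==1.11.0
pandas==0.18.1
scipy==0.17.1
"""

if __name__ == "__main__":
    for name, version in sorted(parse_requirements(SAMPLE).items()):
        print("{} is pinned to {}".format(name, version))
```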
When you’re finished installing packages and using your environment for scripting and experimenting, in order to return to your system defaults, just issue this command:
$> deactivate
If you want to remove the virtual environment completely, after deactivating and getting out of the environment’s directory, you just have to get rid of the environment’s directory itself by a recursive deletion. For instance, on Windows you just do this:
$> rd /s /q clone
On Linux and Mac, the command will be:
$> rm -r -f clone
If you are working extensively with virtual environments, you should consider using virtualenvwrapper, which is a set of wrappers for virtualenv that helps you manage multiple virtual environments easily. It can be found at bitbucket.org/dhellmann/virtualenvwrapper. If you are operating on a Unix system (Linux or OS X), another solution worth mentioning is pyenv (which can be found at https://github.com/yyuu/pyenv). It lets you set your main Python version, allows installation of multiple versions, and creates virtual environments. Its peculiarity is that it does not depend on Python being installed and works perfectly at the user level (no need for sudo commands).
conda for managing environments
If you have installed the Anaconda distribution, or you have tried conda using a Miniconda installation, you can also take advantage of the conda command to run virtual environments as an alternative to virtualenv. Let’s see in practice how to use conda for that. We can check what environments we have available like this:
$> conda info -e
This command will report to you what environments you can use on your system based on conda. Most likely, your only environment will be just “root”, pointing to your Anaconda distribution’s folder.
As an example, we can create an environment based on Python version 3.4, having all the necessary Anaconda-packaged libraries installed. That makes sense, for instance, for using the package Theano together with Python 3 on Windows (because of an issue we will explain in a few paragraphs). In order to create such an environment, just do:
$> conda create -n python34 python=3.4 anaconda
The command asks for a particular Python version (3.4) and requires the installation of all packages available in the Anaconda distribution (the argument anaconda). It names the environment python34 using the argument -n. The complete installation should take a while, given the large number of packages in the Anaconda installation. After the installation has completed, you can activate the environment:
$> activate python34
If you need to install additional packages to your environment, when activated, you just do:
$> conda install -n python34 <package-name1> <package-name2>
That is, you make the list of the required packages follow the name of your environment. Naturally, you can also use pip install, as you would do in a virtualenv environment.
You can also use a file instead of listing all the packages by name yourself. You can create a list in an environment using the list argument and piping the output to a file:
$> conda list -e > requirements.txt
Then, in your target environment, you can install the entire list using:
$> conda install --file requirements.txt
You can even create an environment based on a requirements list:
$> conda create -n python34 python=3.4 --file requirements.txt
Finally, after having used the environment, to close the session, you simply do this:
Contrary to virtualenv, there is a specialized argument in order to completely remove an environment from your system:
$> conda remove -n python34 --all
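From inside a running script, you can also ask which conda environment is active. The sketch below assumes that conda's activation scripts export the CONDA_DEFAULT_ENV environment variable, which is our understanding of how activation works; the function name `active_conda_environment` is hypothetical.

```python
import os


def active_conda_environment(environ=None):
    """Return the name of the active conda environment, or None if there is none."""
    # Assumption: activating an environment sets CONDA_DEFAULT_ENV.
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


if __name__ == "__main__":
    name = active_conda_environment()
    print("Active conda environment: {}".format(name or "none detected"))
```

Accepting the environment mapping as a parameter makes it possible to try the helper with a plain dictionary, for example `active_conda_environment({"CONDA_DEFAULT_ENV": "python34"})`.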
A glance at the essential packages
We mentioned that the two most relevant characteristics of Python are its ability to integrate with other languages and its mature package system, which is well embodied by PyPI (the Python Package Index: pypi.python.org/pypi), a common repository for the majority of Python open source packages that is constantly maintained and updated.
The packages that we are now going to introduce are strongly analytical and they will constitute a complete Data Science Toolbox. All the packages are made up of extensively tested and highly optimized functions for both memory usage and performance, ready to achieve any scripting operation with successful execution. A walkthrough on how to install them is provided next.
Partially inspired by similar tools present in R and MATLAB environments, we will together explore how a few selected Python commands can allow you to efficiently handle data and then explore, transform, experiment, and learn from the same without having to write too much code or reinvent the wheel.
NumPy, which is Travis Oliphant’s creation, is the true analytical workhorse of the Python language. It provides the user with multidimensional arrays, along with a large set of functions to perform a multitude of mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions, which implement mathematical vectors and matrices. Characterized by optimal memory allocation, arrays are useful not just for storing data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems:
- Website: www.numpy.org
- Version at the time of print: 1.11.0
- Suggested install command: pip install numpy
As a convention largely adopted by the Python community, when importing NumPy, it is suggested that you alias it as np:
import numpy as np
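As a brief taste of the multidimensional arrays and vectorized operations described above, consider the following sketch. It assumes NumPy is installed, and the numbers are arbitrary values chosen only for illustration.

```python
import numpy as np

# A 2 x 3 array: two rows (observations) and three columns (features).
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Vectorized arithmetic applies to every element without an explicit loop.
scaled = data * 10

# Reductions along an axis: column means (axis=0) and row sums (axis=1).
column_means = data.mean(axis=0)
row_sums = data.sum(axis=1)

# Matrix product of the array with its transpose gives a 2 x 2 matrix.
gram = data.dot(data.T)

print(data.shape)
print(scaled)
print(column_means)
print(row_sums)
print(gram)
```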
An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy completes NumPy’s functionalities, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier transformation, and much more:
- Website: www.scipy.org
- Version at time of print: 0.17.1
- Suggested install command: pip install scipy
The pandas package deals with everything that NumPy and SciPy cannot do. Thanks to its specific data structures, namely DataFrames and Series, pandas allows you to handle complex tables of data of different types (which is something that NumPy’s arrays cannot do) and time series. Thanks to Wes McKinney’s creation, you will be able to load data from a variety of sources easily and smoothly. You can then slice, dice, handle missing elements, add, rename, aggregate, reshape, and finally visualize your data at will:
- Website: pandas.pydata.org
- Version at the time of print: 0.18.1
- Suggested install command: pip install pandas
Conventionally, pandas is imported as pd:
import pandas as pd
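The operations listed above can be sketched on a tiny table. The column names and figures here are invented for the example:

```python
import numpy as np
import pandas as pd

# A tiny table mixing strings, numbers, and a missing value (made-up data)
df = pd.DataFrame({
    "city": ["Rome", "Milan", "Rome", "Turin"],
    "sales": [10.0, np.nan, 30.0, 20.0],
})

# Handle the missing element by replacing it with the column mean
df["sales"] = df["sales"].fillna(df["sales"].mean())

# Slice: keep only the rows matching a condition
rome_only = df[df["city"] == "Rome"]

# Aggregate: total sales per city
totals = df.groupby("city")["sales"].sum()

print(rome_only)
print(totals)
```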
Started as part of the SciKits (SciPy Toolkits), Scikit-learn is the core of data science operations in Python. It offers all that you may need in terms of data preprocessing, supervised and unsupervised learning, model selection, validation, and error metrics. Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau. Since 2013, it has been taken over by the researchers at INRIA (French Institute for Research in Computer Science and Automation):
- Website: scikit-learn.org/stable
- Version at the time of print: 0.17.1
- Suggested install command: pip install scikit-learn
Note that the imported module is named sklearn.
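The following sketch strings together preprocessing, a supervised model, validation on held-out data, and an error metric. It assumes a Scikit-learn release of 0.18 or later, where train_test_split lives in sklearn.model_selection (in the version listed above, it was found in sklearn.cross_validation). The choice of dataset and model is only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small example dataset bundled with the package
X, y = load_iris(return_X_y=True)

# Hold out 30% of the observations for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Preprocessing (standardization) chained with a supervised learner
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Error metric computed on data the model has not seen during fitting
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```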
A scientific approach requires the fast experimentation of different hypotheses in a reproducible fashion. Initially named IPython and limited to working only with the Python language, Jupyter was created by Fernando Perez to address the need for an interactive Python command shell (available in the shell, in the web browser, and through an application interface), with graphical integration, customizable commands, rich history (in the JSON format), and computational parallelism for enhanced performance. Jupyter is our favoured choice; we use it to illustrate operations with scripts and data, and their results, clearly and effectively:
- Website: jupyter.org
- Version at the time of print: 1.0.0 (ipykernel = 4.3.1)
- Suggested install command: pip install jupyter
Originally developed by John Hunter, matplotlib is a library that contains all the building blocks that are required to create quality plots from arrays and to visualize them interactively.
You can find all the MATLAB-like plotting frameworks inside the pylab module:
- Website: matplotlib.org
- Version at the time of print: 1.5.1
- Suggested install command: pip install matplotlib
You can simply import what you need for your visualization purposes with the following command:
import matplotlib.pyplot as plt
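A short sketch of a plot built from arrays follows. The Agg backend line and the output file name sine_cosine.png are assumptions made so that the script also runs on a machine without a display; in a notebook you would normally leave out the backend line:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

# One hundred evenly spaced points between 0 and 2*pi
x = np.linspace(0, 2 * np.pi, 100)

# Draw two curves on the same axes and label them
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()

# Write the figure to disk (hypothetical file name)
fig.savefig("sine_cosine.png")
```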
Previously part of SciKits, statsmodels was conceived as a complement to SciPy’s statistical functions. It features generalized linear models, discrete choice models, time series analysis, and a series of descriptive statistics as well as parametric and nonparametric tests:
- Website: statsmodels.sourceforge.net
- Version at the time of print: 0.6.1
- Suggested install command: pip install statsmodels
Beautiful Soup, a creation of Leonard Richardson, is a great tool to scrape data from HTML and XML files retrieved from the Internet. It works incredibly well, even in the case of tag soups (hence the name), which are collections of malformed, contradictory, and incorrect tags. After choosing your parser (the HTML parser included in Python’s standard library works fine), thanks to Beautiful Soup, you can navigate through the objects in the page and extract text, tables, and any other information that you may find useful:
- Website: www.crummy.com/software/BeautifulSoup
- Version at the time of print: 4.4.1
- Suggested install command: pip install beautifulsoup4
Note that the imported module is named bs4.
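As a small sketch, using the parser from the standard library and an invented snippet of sloppy HTML, you can extract the links from a page as follows:

```python
from bs4 import BeautifulSoup

# Deliberately sloppy HTML: the <h1> and <li> tags are never closed
html = """
<html><body>
<h1>Price list
<ul>
<li><a href="/apples">Apples</a>
<li><a href="/pears">Pears</a>
</ul>
</body></html>
"""

# Parse with the HTML parser shipped in Python's standard library
soup = BeautifulSoup(html, "html.parser")

# Navigate the parsed page and collect the text and target of every link
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]
print(links)
```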
Developed by the Los Alamos National Laboratory, NetworkX is a package specialized in the creation, manipulation, analysis, and graphical representation of real-life network data (it can easily operate with graphs made up of a million nodes and edges). Besides specialized data structures for graphs and fine visualization methods (2D and 3D), it provides the user with many standard graph measures and algorithms, such as the shortest path, centrality, components, communities, clustering, and PageRank:
- Website: networkx.github.io
- Version at the time of print: 1.11
- Suggested install command: pip install networkx
Conventionally, NetworkX is imported as nx:
import networkx as nx
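A few of the measures mentioned above can be sketched on a toy graph. The node names and edges are invented for the example:

```python
import networkx as nx

# Build a small undirected graph from a list of edges (toy data)
graph = nx.Graph()
graph.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"),
                      ("A", "D"), ("D", "E")])

# Shortest path between two nodes
path = nx.shortest_path(graph, source="A", target="E")

# Degree centrality: fraction of the other nodes each node is connected to
centrality = nx.degree_centrality(graph)

# Number of connected components in the graph
n_components = nx.number_connected_components(graph)

print(path)
print(centrality)
print(n_components)
```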
The Natural Language Toolkit (NLTK) provides access to corpora and lexical resources and to a complete suite of functions for statistical Natural Language Processing (NLP), ranging from tokenizers to part-of-speech taggers and from tree models to named-entity recognition. Initially, Steven Bird and Edward Loper created the package as an NLP teaching infrastructure for their course at the University of Pennsylvania. Now, it is a fantastic tool that you can use to prototype and build NLP systems:
- Website: www.nltk.org
- Version at the time of print: 3.2.1
- Suggested install command: pip install nltk
Gensim, programmed by Radim Rehurek, is an open source package that is suitable for the analysis of large textual collections with the help of parallel distributable online algorithms. Among advanced functionalities, it implements Latent Semantic Analysis (LSA), topic modelling by Latent Dirichlet Allocation (LDA), and Google’s word2vec, a powerful algorithm that transforms text into vector features that can be used in supervised and unsupervised machine learning:
- Website: radimrehurek.com/gensim
- Version at the time of print: 0.12.4
- Suggested install command: pip install gensim
PyPy is not a package; it is an alternative implementation of Python 2.7.8 that supports most of the commonly used Python standard packages (unfortunately, NumPy is currently not fully supported). As an advantage, it offers enhanced speed and memory handling. Thus, it is very useful for heavy duty operations on large chunks of data and it should be part of your big data handling strategies:
- Website: pypy.org/
- Version at time of print: 5.1
- Download page: pypy.org/download.html
XGBoost is a scalable, portable, and distributed gradient boosting library (a tree ensemble machine learning algorithm). Initially created by Tianqi Chen from the University of Washington, it has been enriched by a Python wrapper by Bing Xu and an R interface by Tong He (you can read the story behind XGBoost directly from its principal creator at homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html). XGBoost is available for Python, R, Java, Scala, Julia, and C++, and it can work on a single machine (leveraging multithreading) as well as in both Hadoop and Spark clusters:
- Version at the time of print: 0.4
- Download page: github.com/dmlc/xgboost
Detailed instructions for installing XGBoost on your system can be found at this page:
The installation of XGBoost on both Linux and MacOS is quite straightforward, whereas it is a little bit trickier for Windows users.
On a Posix system you just have
For this reason, we provide specific installation steps to get XGBoost working on Windows:
- First download and install Git for Windows (git-for-windows.github.io).
- Then you need a MinGW compiler present on your system. You can download it from www.mingw.org according to the characteristics of your system.
- From the command line, execute:
$> git clone --recursive https://github.com/dmlc/xgboost
$> cd xgboost
$> git submodule init
$> git submodule update
- Then, still from the command line, copy the configuration for 64-bit systems to be the default one:
$> copy makemingw64.mk config.mk
- Alternatively, you can just copy the plain 32-bit version:
$> copy makemingw.mk config.mk
- After copying the configuration file, you can run the compiler, setting it to use four threads in order to speed up the compiling procedure:
$> mingw32-make -j4
- In MinGW, the make command comes with the name mingw32-make. If you are using a different compiler, the previous command may not work; then you can simply try:
$> make -j4
- Finally, if the compiler completes its work without errors, you can install the package into your Python environment as follows:
$> cd python-package
$> python setup.py install
After following all the preceding instructions, if you try to import XGBoost in Python and yet it doesn’t load and results in an error, it may well be that Python cannot find the MinGW’s g++ runtime libraries.
You just need to find the location on your computer of MinGW’s binaries (in our case, it was in C:\mingw-w64\mingw64\bin; just modify the next code to put yours) and place the following code snippet before importing XGBoost:
import os
# Raw string, so that \b in the path is not read as a backspace character
mingw_path = r'C:\mingw-w64\mingw64\bin'
os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
import xgboost as xgb
As with many other projects under continuous development, the preceding installation commands may temporarily fail to work, depending on the state of the XGBoost project at the time you try them. Usually, waiting for an update of the project or opening an issue with the authors of the package solves the problem.
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Basically, it provides you with all the building blocks you need to create deep neural networks. Created by academics (an entire development team; you can read their names on their most recent paper at arxiv.org/pdf/1605.02688.pdf), Theano has been used for large scale and intensive computations since 2007:
- Release at the time of print: 0.8.2
In spite of many installation problems experienced by users in the past (especially Windows users), the installation of Theano should be straightforward, the package now being available on PyPI:
$> pip install Theano
If you want the most up-to-date version of the package, you can get it by cloning it from GitHub:
$> git clone git://github.com/Theano/Theano.git
Then you can proceed with direct Python installation:
$> cd Theano
$> python setup.py install
To test your installation, you can run the following from the shell/CMD and check the reports:
$> pip install nose
$> pip install nose-parameterized
$> nosetests theano
If you are working on a Windows OS and the previous instructions don’t work, you can try these steps using the conda command provided by the Anaconda distribution:
- Install TDM GCC x64 (this can be found at tdm-gcc.tdragon.net)
- Open an Anaconda prompt interface and execute:
$> conda update conda
$> conda update --all
$> conda install mingw libpython
$> pip install git+git://github.com/Theano/Theano.git
Theano needs libpython, which isn’t yet compatible with Python version 3.5. So if your Windows installation is not working, this is a likely cause. Theano does install cleanly on Python version 3.4, however. Our suggestion in this case is to create a virtual Python environment based on version 3.4 and to install and use Theano only in that environment. Directions on how to create virtual environments are provided in the paragraph about virtualenv and conda create.
In addition, Theano’s website provides some information to Windows users; it could support you when everything else fails:
An important requirement for Theano to scale out on GPUs is to install the NVIDIA CUDA drivers and SDK for code generation and execution on the GPU. If you do not know much about the CUDA Toolkit, you can start from this web page in order to understand more about the technology being used:
Therefore, if your computer has an NVIDIA GPU, you can find all the necessary instructions in order to install CUDA using this tutorial page from NVIDIA itself:
Keras is a minimalist and highly modular neural networks library, written in Python and capable of running on top of either Theano or TensorFlow (the open source software library for numerical computation released by Google). Keras was created by François Chollet, a machine learning researcher working at Google:
- Version at the time of print: 1.0.3
Suggested installation from PyPI:
$> pip install keras
As an alternative, you can install the latest available version (which is advisable since the package is in continuous development) using the command:
$> pip install git+git://github.com/fchollet/keras.git
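Once the installations are done, a quick check of what can actually be imported may save some time. The following is only a sketch based on the standard library; the helper name is_installed is our own, and the list uses the import names, which in some cases (sklearn, bs4) differ from the names used with pip:

```python
import importlib.util

# Import names of the packages presented above (not the pip names)
modules = ["numpy", "scipy", "pandas", "sklearn", "matplotlib",
           "statsmodels", "bs4", "networkx", "nltk", "gensim",
           "xgboost", "theano", "keras"]

def is_installed(module_name):
    """Return True if the module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Report the status of each package
for name in modules:
    print(name, "OK" if is_installed(name) else "missing")
```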
In this article, we performed a lot of installations, from Python packages to examples. They were installed either directly or by using a scientific distribution. We also introduced Jupyter notebooks and demonstrated how you can have access to the data run in the tutorials.