Getting started with Python Web Scraping

This article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. The book contains step-by-step tutorials on how to leverage Python programming techniques for ethical web scraping.

The amount of data available on the web is consistently growing both in quantity and in form. Businesses require this data to make decisions, particularly with the explosive growth of machine learning tools which require large amounts of data for training. Much of this data is available via Application Programming Interfaces, but at the same time a lot of valuable data is still only available through the process of web scraping.

Python is the programming language of choice for many who build systems to perform scraping. It is an easy-to-use language with a rich ecosystem of tools for other tasks. In this article, we will focus on the fundamentals of setting up a scraping environment and performing basic requests for data with several tools of the trade.

Setting up a Python development environment

If you have not used Python before, it is important to have a working development environment. The recipes in this book are all in Python; some are interactive examples, but most are implemented as scripts run by the Python interpreter. This recipe shows you how to set up an isolated development environment with virtualenv and manage project dependencies with pip. We also get the code for the book and install it into the Python virtual environment.

Getting ready

We will exclusively be using Python 3.x, and specifically in my case 3.6.1. Mac and Linux normally have Python version 2 installed, while Windows systems do not, so it is likely that Python 3 will need to be installed in any case. You can find references for Python installers at www.python.org. You can check Python’s version with python --version
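The same check can be made from inside Python, which is handy at the top of a script. The following is a minimal sketch; the helper name is_python3 is made up for this illustration.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3 or later."""
    return version_info[0] >= 3


# Print the version of the interpreter that is running this code
print(sys.version.split()[0])
print(is_python3())
```

If is_python3() returns False, the script is being run by a Python 2 interpreter and the recipes will not work as written.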



pip comes installed with Python 3.x, so we will omit instructions on its installation. Additionally, all command line examples in this book are run on a Mac. For Linux users the commands should be identical. On Windows, there are alternate commands (like dir instead of ls), but these alternatives will not be covered.

How to do it

We will be installing a number of packages with pip. These packages are installed into a Python environment, where they can conflict with the versions of other packages that are already present. A good practice for following along with the recipes in the book is therefore to create a new virtual Python environment, so that the packages we use work properly.

Virtual Python environments are managed with the virtualenv tool. This can be installed with the following command:

~ $ pip install virtualenv

Collecting virtualenv

Using cached virtualenv-15.1.0-py2.py3-none-any.whl

Installing collected packages: virtualenv

Successfully installed virtualenv-15.1.0

Now we can use virtualenv. But before that let’s briefly look at pip. This command installs Python packages from PyPI, a package repository with tens of thousands of packages. We just used the install subcommand of pip, which ensures a package is installed. We can also see all currently installed packages with pip list:

~ $ pip list

alabaster (0.7.9)

amqp (1.4.9)

anaconda-client (1.6.0)

anaconda-navigator (1.5.3)

anaconda-project (0.4.1)

aniso8601 (1.3.0)

Packages can also be uninstalled using pip uninstall followed by the package name. I’ll leave it to you to give it a try.
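The information that pip list prints can also be read from within Python. The sketch below uses the standard library's importlib.metadata module, which assumes Python 3.8 or later (newer than the 3.6.1 used in this article); the function name installed_packages is made up for this illustration.

```python
from importlib import metadata  # in the standard library since Python 3.8


def installed_packages():
    """Return sorted (name, version) pairs for the installed distributions."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata.get('Name')
        if name:  # skip entries with broken or missing metadata
            found.add((name, dist.version))
    return sorted(found, key=lambda pair: pair[0].lower())


# Print the first few entries in the same "name (version)" style as pip list
for name, version in installed_packages()[:5]:
    print(f'{name} ({version})')
```

The output depends on the environment the code runs in, so inside a fresh virtual environment the list will be much shorter than in a full Anaconda installation.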

Now back to virtualenv. Using virtualenv is very simple. Let’s use it to create an environment and install the code from GitHub. Let’s walk through the steps:

  1. Create a directory to represent the project and enter the directory.
~ $ mkdir pywscb

~ $ cd pywscb
  2. Initialize a virtual environment folder named env:
pywscb $ virtualenv env

Using base prefix '/Users/michaelheydt/anaconda'

New python executable in /Users/michaelheydt/pywscb/env/bin/python

copying /Users/michaelheydt/anaconda/bin/python => /Users/michaelheydt/pywscb/env/bin/python

copying /Users/michaelheydt/anaconda/bin/../lib/libpython3.6m.dylib => /Users/michaelheydt/pywscb/env/lib/libpython3.6m.dylib

Installing setuptools, pip, wheel...done.
  3. This creates an env folder. Let’s take a look at what was installed.
pywscb $ ls -la env

total 8

drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 .

drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 ..

drwxr-xr-x 16 michaelheydt staff 544 Jan 18 15:38 bin

drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 include

drwxr-xr-x 4 michaelheydt staff 136 Jan 18 15:38 lib

-rw-r--r-- 1 michaelheydt staff 60 Jan 18 15:38 pip-selfcheck.json
  4. Now we activate the virtual environment. This command uses the content in the env folder to configure Python. After this, all Python activities are relative to this virtual environment.
pywscb $ source env/bin/activate

(env) pywscb $
  5. We can check that python is indeed using this virtual environment with the following command:
(env) pywscb $ which python

/Users/michaelheydt/pywscb/env/bin/python
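The interpreter itself can also report whether it is running inside a virtual environment. The following is a minimal sketch rather than an official recipe: it relies on older virtualenv releases setting sys.real_prefix, and on newer releases (and the standard library venv module) making sys.prefix differ from sys.base_prefix. The function name in_virtual_environment is made up for this illustration.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases record the original installation in sys.real_prefix;
    # newer ones, like venv, expose it as sys.base_prefix instead.
    base_prefix = getattr(sys, 'real_prefix', None) or getattr(sys, 'base_prefix', sys.prefix)
    return sys.prefix != base_prefix


# The path printed here corresponds to what "which python" reports in the shell
print(sys.executable)
print(in_virtual_environment())
```

Run with the environment activated, this should print a path under the env folder; after deactivating, it falls back to the system interpreter.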

With our virtual environment created, let’s clone the book’s sample code and take a look at its structure.

(env) pywscb $ git clone https://github.com/PacktBooks/PythonWebScrapingCookbook.git

Cloning into 'PythonWebScrapingCookbook'...

remote: Counting objects: 420, done.

remote: Compressing objects: 100% (316/316), done.

remote: Total 420 (delta 164), reused 344 (delta 88), pack-reused 0

Receiving objects: 100% (420/420), 1.15 MiB | 250.00 KiB/s, done.

Resolving deltas: 100% (164/164), done.

Checking connectivity... done.

This created a PythonWebScrapingCookbook directory.

(env) pywscb $ ls -l

total 0

drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 PythonWebScrapingCookbook

drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 env

Let’s change into it and examine the content.

(env) PythonWebScrapingCookbook $ ls -l

total 0

drwxr-xr-x 15 michaelheydt staff 510 Jan 18 16:21 py

drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 www

There are two directories. Most of the Python code is in the py directory. The www directory contains some web content that we will use from time to time with a local web server. Let’s look at the contents of the py directory:

(env) py $ ls -l

total 0

drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 01

drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 03

drwxr-xr-x 21 michaelheydt staff 714 Jan 18 16:21 04

drwxr-xr-x 10 michaelheydt staff 340 Jan 18 16:21 05

drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 06

drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 07

drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 08

drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 09

drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 10

drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 11

drwxr-xr-x 8 michaelheydt staff 272 Jan 18 16:21 modules

Code for each chapter is in the numbered folder matching the chapter (there is no code for chapter 2 as it is all interactive Python).

Note that there is a modules folder. Some of the recipes throughout the book use code in those modules. Make sure that your Python path points to this folder. On Mac and Linux you can set this in your .bash_profile file (and in the environment variables dialog on Windows):

export PYTHONPATH="/users/michaelheydt/dropbox/packt/books/pywebscrcookbook/code/py/modules"
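If you would rather not change your shell profile, a script can also add the modules folder to the import path at runtime. The following is a minimal sketch under stated assumptions: the helper name ensure_on_path is made up, and the relative path assumes the repository was cloned into the current directory, so adjust it to your own location.

```python
import os
import sys


def ensure_on_path(directory):
    """Prepend directory to sys.path unless it is already there."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


# Hypothetical location; change it to wherever you cloned the repository
modules_dir = ensure_on_path(os.path.join('PythonWebScrapingCookbook', 'py', 'modules'))
print(modules_dir in sys.path)
```

Unlike PYTHONPATH, this change only lasts for the running process, so it has to be repeated in every script that needs the modules.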

The contents of each folder generally follow a numbering scheme matching the sequence of the recipes in the chapter. The following are the contents of the chapter 6 folder:

(env) py $ ls -la 06

total 96

drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 .

drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:26 ..

-rw-r--r-- 1 michaelheydt staff 902 Jan 18 16:21 01_scrapy_retry.py

-rw-r--r-- 1 michaelheydt staff 656 Jan 18 16:21 02_scrapy_redirects.py

-rw-r--r-- 1 michaelheydt staff 1129 Jan 18 16:21 03_scrapy_pagination.py

-rw-r--r-- 1 michaelheydt staff 488 Jan 18 16:21 04_press_and_wait.py

-rw-r--r-- 1 michaelheydt staff 580 Jan 18 16:21 05_allowed_domains.py

-rw-r--r-- 1 michaelheydt staff 826 Jan 18 16:21 06_scrapy_continuous.py

-rw-r--r-- 1 michaelheydt staff 704 Jan 18 16:21 07_scrape_continuous_twitter.py

-rw-r--r-- 1 michaelheydt staff 1409 Jan 18 16:21 08_limit_depth.py

-rw-r--r-- 1 michaelheydt staff 526 Jan 18 16:21 09_limit_length.py

-rw-r--r-- 1 michaelheydt staff 1537 Jan 18 16:21 10_forms_auth.py

-rw-r--r-- 1 michaelheydt staff 597 Jan 18 16:21 11_file_cache.py

-rw-r--r-- 1 michaelheydt staff 1279 Jan 18 16:21 12_parse_differently_based_on_rules.py

In the recipes I’ll state that we’ll be using the script in <chapter directory>/<recipe filename>.

Now, just to be complete, if you want to get out of the Python virtual environment, you can exit using the following command:

(env) py $ deactivate

py $

Checking which python again, we can see that it has switched back:

py $ which python

/Users/michaelheydt/anaconda/bin/python

Scraping Python.org with Requests and Beautiful Soup

In this recipe we will install Requests and Beautiful Soup and scrape some content from www.python.org. We’ll install both of the libraries and get some basic familiarity with them. We’ll come back to them both in subsequent chapters and dive deeper into each.

Getting ready

In this recipe, we will scrape the upcoming Python events from https://www.python.org/events/python-events/. The following is an example of the Python.org Events page (it changes frequently, so your experience will differ):

[Screenshot: the Python.org Events page]

We will need to ensure that Requests and Beautiful Soup are installed. We can do that with the following:

pywscb $ pip install requests

Downloading/unpacking requests

Downloading requests-2.18.4-py2.py3-none-any.whl (88kB): 88kB downloaded

Downloading/unpacking certifi>=2017.4.17 (from requests)

Downloading certifi-2018.1.18-py2.py3-none-any.whl (151kB): 151kB

downloaded

Downloading/unpacking idna>=2.5,<2.7 (from requests)

Downloading idna-2.6-py2.py3-none-any.whl (56kB): 56kB downloaded

Downloading/unpacking chardet>=3.0.2,<3.1.0 (from requests)

Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB): 133kB downloaded

Downloading/unpacking urllib3>=1.21.1,<1.23 (from requests)

Downloading urllib3-1.22-py2.py3-none-any.whl (132kB): 132kB downloaded

Installing collected packages: requests, certifi, idna, chardet, urllib3

Successfully installed requests certifi idna chardet urllib3

Cleaning up...

pywscb $ pip install bs4

Downloading/unpacking bs4

Downloading bs4-0.0.1.tar.gz

Running setup.py (path:/Users/michaelheydt/pywscb/env/build/bs4/setup.py)

egg_info for package bs4
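Before moving on, it can be worth confirming that the libraries are importable from the active environment. The sketch below uses importlib.util.find_spec from the standard library; the function name missing_modules is made up for this illustration. The name lxml is included because the recipe passes 'lxml' to BeautifulSoup as the parser, and lxml is a separate package that the two commands above do not install.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means everything the recipe needs can be imported
print(missing_modules(['requests', 'bs4', 'lxml']))
```

If lxml is reported as missing, either install it with pip install lxml or pass 'html.parser' (Python's built-in parser) to BeautifulSoup instead.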

How to do it

Now let’s learn to scrape a couple of events. For this recipe we will start by using interactive Python.

  1. Start it with the ipython command:
$ ipython

Python 3.6.1 |Anaconda custom (x86_64)| (default, Mar 22 2017,

19:25:17)

Type "copyright", "credits" or "license" for more information.

IPython 5.1.0 -- An enhanced Interactive Python.

? -> Introduction and overview of IPython's features.

%quickref -> Quick reference.

help -> Python's own help system.

object? -> Details about 'object', use 'object??' for extra

details.

In [1]:
  2. Next we import Requests:
In [1]: import requests
  3. We now use requests to make an HTTP GET request for the following URL: https://www.python.org/events/python-events/
In [2]: url = 'https://www.python.org/events/python-events/'

In [3]: req = requests.get(url)
  4. That downloaded the page content, which is stored in the response object req. We can retrieve the content using the .text property. This prints the first 200 characters:
In [4]: req.text[:200]

Out[4]: '<!doctype html>\n<!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9"> <![endif]-->\n<!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9"> <![endif]-->\n<!--[if IE 8]> <h'

We now have the raw HTML of the page. We can now use Beautiful Soup to parse the HTML and retrieve the event data.

  5. First import Beautiful Soup:
In [5]: from bs4 import BeautifulSoup
  6. Now we create a BeautifulSoup object and pass it the HTML.
In [6]: soup = BeautifulSoup(req.text, 'lxml')
  7. Now we tell Beautiful Soup to find the main <ul> tag for the recent events, and then to get all the <li> tags below it.
In [7]: events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
  8. And finally we can loop through each of the <li> elements, extracting the event details, and print each to the console:
In [13]: for event in events:
    ...:     event_details = dict()
    ...:     event_details['name'] = event.find('h3').find("a").text
    ...:     event_details['location'] = event.find('span', {'class': 'event-location'}).text
    ...:     event_details['time'] = event.find('time').text
    ...:     print(event_details)
    ...:

{'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan. 2018'}

{'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan. 2018'}

{'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb. 2018'}

{'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb. 2018'}

{'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb. 2018'}

{'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb. 2018'}

This entire example is available in the 01/01_events_with_requests.py script file. The following is its content and it pulls together all of what we just did step by step:

import requests
from bs4 import BeautifulSoup


def get_upcoming_events(url):
    # Download the events page
    req = requests.get(url)

    # Parse the HTML and collect the <li> items under the events list
    soup = BeautifulSoup(req.text, 'lxml')
    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')

    # Extract the name, location, and time of each event and print them
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)


get_upcoming_events('https://www.python.org/events/python-events/')

You can run this using the following command from the terminal:

$ python 01_events_with_requests.py

{'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan. 2018'}

{'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan. 2018'}

{'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb. 2018'}

{'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb. 2018'}

{'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb. 2018'}

{'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb. 2018'}

How it works

We will dive into the details of both Requests and Beautiful Soup in the next chapter, but for now let’s just summarize a few key points about how this works. The following are the important points about Requests:

  1. Requests is used to execute HTTP requests. We used it to make a GET request for the URL of the events page.
  2. The response object returned by Requests holds the results of the request. This is not only the page content, but also many other items about the result, such as the HTTP status code and headers.
  3. Requests is used only to get the page; it does not do any parsing.
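To make the second point concrete, here is a minimal sketch of reading the status code and headers alongside the page content. The function name fetch_page and the 10-second timeout are assumptions made for this illustration, not part of the recipe's code.

```python
import requests


def fetch_page(url, timeout=10):
    """Return (status_code, content_type, html) for a GET request to url."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    content_type = resp.headers.get('Content-Type')
    return resp.status_code, content_type, resp.text
```

Calling fetch_page with the events page URL used earlier would return the status code, the Content-Type header, and the same HTML that req.text held in the interactive session. Checking the status first avoids handing an error page to the parser.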

We use Beautiful Soup to parse the HTML and to find content within it. To understand how this worked, consider the HTML that starts the Upcoming Events section of the page:

[Image: HTML for the start of the Upcoming Events section]

We used the power of Beautiful Soup to:

  1. Find the <ul> element representing the section, which is found by looking for a <ul> with a class attribute that has a value of list-recent-events.
  2. From that object, we find all the <li> elements.

Each of these <li> tags represents a different event. We iterate over each of them, making a dictionary from the event data found in child HTML tags:

  1. The name is extracted from the <a> tag that is a child of the <h3> tag
  2. The location is the text content of the <span> with a class of event-location
  3. And the time is the text content of the <time> tag.
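The steps above can be tried without a network connection by running them against a small piece of HTML. The following is a sketch under stated assumptions: SAMPLE_HTML is hand-written to mirror the structure described here and is not the actual python.org markup, the function name parse_events is made up, and 'html.parser' (Python's built-in parser) is swapped in for 'lxml' so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Hand-written sample for illustration; the real page markup may differ
SAMPLE_HTML = """
<ul class="list-recent-events">
  <li>
    <h3><a href="#">PyCascades 2018</a></h3>
    <p>
      <time>22 Jan. - 24 Jan. 2018</time>
      <span class="event-location">Vancouver, BC, Canada</span>
    </p>
  </li>
</ul>
"""


def parse_events(html):
    """Return a list of dictionaries, one per <li> in the events list."""
    soup = BeautifulSoup(html, 'html.parser')
    events_list = soup.find('ul', {'class': 'list-recent-events'})
    if events_list is None:  # find() returns None when nothing matches
        return []
    results = []
    for item in events_list.find_all('li'):
        results.append({
            'name': item.find('h3').find('a').text,
            'location': item.find('span', {'class': 'event-location'}).text,
            'time': item.find('time').text,
        })
    return results


print(parse_events(SAMPLE_HTML))
```

Returning a list instead of printing inside the loop is a design choice for this sketch: it keeps the parsing separate from the output, and the None check avoids an AttributeError if the page layout changes and the list is no longer found.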

To summarize, we saw how to set up a Python environment for effective data scraping from the web and also explored ways to use Beautiful Soup to perform preliminary data scraping for ethical purposes.

If you liked this post, be sure to check out Web Scraping with Python, which contains step-by-step tutorials on using Python for ethical web scraping.
