
In this article by Richard Lawson, author of the book Web Scraping with Python, we will first cover a browser extension called Firebug Lite to examine a web page, which you may already be familiar with if you have a web development background. Then, we will walk through three approaches to extract data from a web page using regular expressions, Beautiful Soup and lxml. Finally, the article will conclude with a comparison of these three scraping alternatives.


Analyzing a web page

To understand how a web page is structured, we can try examining the source code. In most web browsers, the source code of a web page can be viewed by right-clicking on the page and selecting the View page source option:

[Screenshot: viewing the source of the example country page in a web browser]

The data we are interested in is found in this part of the HTML:

<table>

<tr id="places_national_flag__row"><td class="w2p_fl"><label   for="places_national_flag"     id="places_national_flag__label">National Flag:       </label></td><td class="w2p_fw"><img         src="/places/static/images/flags/gb.png" /></td><td           class="w2p_fc"></td></tr>

…

<tr id="places_neighbours__row"><td class="w2p_fl"><label   for="places_neighbours"     id="places_neighbours__label">Neighbours: </label></td><td       class="w2p_fw"><div><a href="/iso/IE">IE </a></div></td><td         class="w2p_fc"></td></tr></table>

This lack of whitespace and formatting is not an issue for a web browser to interpret, but it is difficult for us. To help us interpret this table, we will use the Firebug Lite extension, which is available for all web browsers at https://getfirebug.com/firebuglite. Firefox users can install the full Firebug extension if preferred, but the features we will use here are included in the Lite version.

Now, with Firebug Lite installed, we can right-click on the part of the web page we are interested in scraping and select Inspect with Firebug Lite from the context menu, as shown here:

[Screenshot: the Inspect with Firebug Lite option in the browser context menu]

This will open a panel showing the surrounding HTML hierarchy of the selected element:

[Screenshot: the Firebug Lite panel showing the HTML hierarchy around the selected element]

In the preceding screenshot, the country area was clicked on, and the Firebug panel makes it clear that the area figure is included within a <td> element of class w2p_fw, which is the child of a <tr> element of ID places_area__row. We now have all the information needed to scrape the area data.

Three approaches to scrape a web page

Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular Beautiful Soup module, and finally with the powerful lxml module.

Regular expressions

If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/2/howto/regex.html.
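As a quick refresher, re.findall() returns every non-overlapping match of a pattern, and when the pattern contains a capture group, only the captured text is returned. This toy example (unrelated to our web page) demonstrates the same non-greedy capture group technique we will use below:

>>> import re
>>> re.findall('<b>(.*?)</b>', '<b>foo</b> and <b>bar</b>')
['foo', 'bar']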

To scrape the area using regular expressions, we will first try matching the contents of the <td> element, as follows:

>>> import re

>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'

>>> html = download(url)

>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)

['<img src="/places/static/images/flags/gb.png" />',

'244,820 square kilometres',

'62,348,447',

'GB',

'United Kingdom',

'London',

'<a href="/continent/EU">EU</a>',

'.uk',

'GBP',

'Pound',

'44',

'@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA',

'^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2}[A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2})|([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$',

'en-GB,cy-GB,gd',

'<div><a href="/iso/IE">IE </a></div>']

This result shows that the <td class="w2p_fw"> tag is used for multiple country attributes. To isolate the area, we can select the second element, as follows:

>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]

'244,820 square kilometres'

This solution works but could easily fail if the web page is updated. Consider what happens if this table changes so that the data we want is no longer in the second matching element. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data in the future, we want our solution to be as robust against layout changes as possible. To make this regular expression more robust, we can include the parent <tr> element, which has an ID, so it ought to be unique:

>>> re.findall('<tr id="places_area__row"><td   class="w2p_fl"><label for="places_area"     id="places_area__label">Area: </label></td><td       class="w2p_fw">(.*?)</td>', html)

['244,820 square kilometres']

This iteration is better; however, there are many other ways the web page could be updated that would still break the regular expression. For example, double quotation marks might be changed to single ones, extra space could be added between the <td> tags, or the area_label could be changed. Here is an improved version that tries to support these various possibilities:

>>> re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)[0]

'244,820 square kilometres'

This regular expression is more future-proof but is difficult to construct and is becoming unreadable. Also, there are still other minor layout changes that would break it, such as a title attribute being added to the <td> tag.

From this example, it is clear that regular expressions provide a simple way to scrape data but are too brittle and will easily break when a web page is updated. Fortunately, there are better solutions.

Beautiful Soup

Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate content. If you do not already have it installed, the latest version can be installed using this command:

pip install beautifulsoup4
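To confirm that the installation worked and see which version was installed, you can print the package's version string; the exact output will depend on the release available when you run it:

python -c "import bs4; print bs4.__version__"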

The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Most web pages do not contain perfectly valid HTML and Beautiful Soup needs to decide what is intended. For example, consider this simple web page of a list with missing attribute quotes and closing tags:


<ul class=country>
    <li>Area
    <li>Population
</ul>

If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping. Let us see how Beautiful Soup handles this:

>>> from bs4 import BeautifulSoup

>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'

>>> # parse the HTML

>>> soup = BeautifulSoup(broken_html, 'html.parser')

>>> fixed_html = soup.prettify()

>>> print fixed_html

<html>

   <body>

       <ul class="country">

           <li>Area</li>

           <li>Population</li>

       </ul>

   </body>

</html>

Here, BeautifulSoup was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. Now, we can navigate to the elements we want using the find() and find_all() methods:

>>> ul = soup.find('ul', attrs={'class':'country'})

>>> ul.find('li') # returns just the first match

<li>Area</li>

>>> ul.find_all('li') # returns all matches

[<li>Area</li>, <li>Population</li>]

Beautiful Soup overview

Here are the common methods and parameters you will use when scraping web pages with Beautiful Soup:

  • BeautifulSoup(markup, builder): This method creates the soup object. The markup parameter can be a string or file object, and builder is the library that parses the markup parameter.
  • find_all(name, attrs, text, **kwargs): This method returns a list of elements matching the given tag name, dictionary of attributes, and text. The contents of kwargs are used to match attributes.
  • find(name, attrs, text, **kwargs): This method is the same as find_all(), except that it returns only the first match. If no element matches, it returns None.
  • prettify(): This method returns the parsed HTML in an easy-to-read format with indentation and line breaks.

For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Now, using these techniques, here is a full example to extract the area from our example country:

>>> from bs4 import BeautifulSoup

>>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239'

>>> html = download(url)

>>> soup = BeautifulSoup(html)

>>> # locate the area row

>>> tr = soup.find(attrs={'id':'places_area__row'})

>>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the area tag

>>> area = td.text # extract the text from this tag

>>> print area

244,820 square kilometres

This code is more verbose than regular expressions but easier to construct and understand. Also, we no longer need to worry about problems in minor layout changes, such as extra whitespace or tag attributes.
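As mentioned in the overview, find() also accepts keyword arguments for matching attributes, so the same lookup can be written a little more compactly (a minor variation on the preceding example, reusing its soup object; class_ is used because class is a reserved word in Python):

>>> tr = soup.find('tr', id='places_area__row')
>>> td = tr.find('td', class_='w2p_fw')
>>> print td.text
244,820 square kilometres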

Lxml

Lxml is a Python wrapper on top of the libxml2 XML parsing library written in C, which makes it faster than Beautiful Soup but also harder to install on some computers. The latest installation instructions are available at http://lxml.de/installation.html.
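Depending on your platform, a prebuilt package may be available, in which case a single pip command is usually enough. This is a typical command rather than the official instructions, and it also installs the separate cssselect package needed for the CSS selector examples later in this article:

pip install lxml cssselect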

As with Beautiful Soup, the first step is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML:

>>> import lxml.html

>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'

>>> tree = lxml.html.fromstring(broken_html) # parse the HTML

>>> fixed_html = lxml.html.tostring(tree, pretty_print=True)

>>> print fixed_html

<ul class="country">

   <li>Area</li>

   <li>Population</li>

</ul>

As with BeautifulSoup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags.

After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup. Instead, we will use CSS selectors here and in future examples, because they are more compact. Also, some readers will already be familiar with them from their experience with jQuery selectors.

Here is an example using the lxml CSS selectors to extract the area data:

>>> tree = lxml.html.fromstring(html)

>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]

>>> area = td.text_content()

>>> print area

244,820 square kilometres

The key line is the one with the CSS selector, which finds a table row element with the places_area__row ID and then selects the child table data tag with the w2p_fw class.
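For comparison, here is a roughly equivalent query written directly in XPath, reusing the tree object parsed above (a sketch; the CSS selector form remains the one used in the rest of this article):

>>> td = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0]
>>> print td.text_content()
244,820 square kilometres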

CSS selectors

CSS selectors are patterns used for selecting elements. Here are some examples of common selectors you will need:

  • Select any tag: *
  • Select by tag <a>: a
  • Select by class of "link": .link
  • Select by tag <a> with class "link": a.link
  • Select by tag <a> with ID "home": a#home
  • Select by child <span> of tag <a>: a > span
  • Select by descendant <span> of tag <a>: a span
  • Select by tag <a> with attribute title of "Home": a[title=Home]

The CSS3 specification was produced by the W3C and is available for viewing at http://www.w3.org/TR/2011/REC-css3-selectors-20110929/.

Lxml implements most of CSS3, and details on unsupported features are available at https://pythonhosted.org/cssselect/#supported-selectors.

Note that, internally, lxml converts the CSS selectors into an equivalent XPath.
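If you are curious what that translation looks like, the lxml.cssselect module exposes it. Here is a quick illustration, assuming the cssselect package is installed and tree is the parsed document from the previous example:

from lxml.cssselect import CSSSelector

sel = CSSSelector('tr#places_area__row > td.w2p_fw')
print sel.path                      # the XPath expression lxml evaluates internally
print sel(tree)[0].text_content()   # same result as tree.cssselect(...)[0]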

Comparing performance

To help evaluate the trade-offs of the three scraping approaches described in this article, it would help to compare their relative efficiency. Typically, a scraper would extract multiple fields from a web page. So, for a more realistic comparison, we will implement extended versions of each scraper that extract all the available data from a country’s web page. To get started, we need to return to Firebug to check the format of the other country features, as shown here:

[Screenshot: the Firebug panel showing the other country rows, each with an ID starting with places_ and ending with __row]

Firebug shows that each table row has an ID starting with places_ and ending with __row. Then, the country data is contained within these rows in the same format as the earlier area example. Here are implementations that use this information to extract all of the available country data:

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent',
          'tld', 'currency_code', 'currency_name', 'phone',
          'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')

 

import re

def re_scraper(html):

   results = {}

   for field in FIELDS:

       results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).groups()[0]

   return results

 

from bs4 import BeautifulSoup

def bs_scraper(html):

   soup = BeautifulSoup(html, 'html.parser')

   results = {}

   for field in FIELDS:

       results[field] = soup.find('table').find('tr', id='places_%s__row' % field).find('td', class_='w2p_fw').text

   return results

 

import lxml.html

def lxml_scraper(html):

   tree = lxml.html.fromstring(html)

   results = {}

   for field in FIELDS:

       results[field] = tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content()

   return results
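Before timing these implementations, a quick sanity check (a sketch, assuming the download() helper used earlier) confirms that each one extracts the expected area value:

html = download('http://example.webscraping.com/places/view/United-Kingdom-239')
for scraper in (re_scraper, bs_scraper, lxml_scraper):
    assert scraper(html)['area'] == '244,820 square kilometres'

Note that the three result dictionaries are not identical for every field: for attributes that contain markup, such as continent and neighbours, the regular expression version returns the raw HTML shown in the earlier findall() output, while the Beautiful Soup and lxml versions return only the text.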

Scraping results

Now that we have complete implementations for each scraper, we will test their relative performance with this snippet:

import time

NUM_ITERATIONS = 1000 # number of times to test each scraper

html = download('http://example.webscraping.com/places/view/United-Kingdom-239')

for name, scraper in [('Regular expressions', re_scraper),

   ('BeautifulSoup', bs_scraper),

   ('Lxml', lxml_scraper)]:

   # record start time of scrape

   start = time.time()

   for i in range(NUM_ITERATIONS):

       if scraper == re_scraper:

           re.purge()

       result = scraper(html)

        # check scraped result is as expected

       assert(result['area'] == '244,820 square kilometres')

   # record end time of scrape and output the total

   end = time.time()

   print '%s: %.2f seconds' % (name, end - start)

This example will run each scraper 1000 times, check whether the scraped results are as expected, and then print the total time taken. Note the call to re.purge() in the loop; by default, the regular expression module caches searches, and this cache needs to be cleared to make a fair comparison with the other scraping approaches.

Here are the results from this script on my computer:

$ python performance.py

Regular expressions: 5.50 seconds

BeautifulSoup: 42.84 seconds

Lxml: 7.06 seconds

The results on your computer will quite likely differ because of the hardware used; however, the relative differences between the approaches should be similar. The results show that Beautiful Soup is over six times slower than the other two approaches when used to scrape our example web page. This result could be anticipated because lxml and the regular expression module were written in C, while Beautiful Soup is pure Python. An interesting finding is that lxml performed nearly as well as regular expressions, even though lxml has the additional overhead of parsing the input into its internal format before searching for elements. When scraping many features from a web page, this initial parsing overhead is proportionally smaller and lxml becomes even more competitive. It really is an amazing module!

Overview

The following table summarizes the advantages and disadvantages of each approach to scraping:

Scraping approach   | Performance | Ease of use | Ease to install
--------------------|-------------|-------------|------------------------
Regular expressions | Fast        | Hard        | Easy (built-in module)
Beautiful Soup      | Slow        | Easy        | Easy (pure Python)
Lxml                | Fast        | Easy        | Moderately difficult

If the bottleneck to your scraper is downloading web pages rather than extracting data, it would not be a problem to use a slower approach, such as Beautiful Soup. Or, if you just need to scrape a small amount of data and want to avoid additional dependencies, regular expressions might be an appropriate choice. However, in general, lxml is the best choice for scraping, because it is fast and robust, while regular expressions and Beautiful Soup are only useful in certain niches.

Adding a scrape callback to the link crawler

Now that we know how to scrape the country data, we can integrate this into the link crawler. To allow reusing the same crawling code to scrape multiple websites, we will add a callback parameter to handle the scraping. A callback is a function that will be called after a certain event (in this case, after a web page has been downloaded). This scrape callback will take url and html as parameters and optionally return a list of further URLs to crawl. Adding this in Python is simple; here is the implementation:

def link_crawler(..., scrape_callback=None):

   …

   links = []

   if scrape_callback:

       links.extend(scrape_callback(url, html) or [])

       …

The new lines for the scrape callback are shown in the preceding snippet. Now, this crawler can be used to scrape multiple websites by customizing the function passed to scrape_callback.

Here is a modified version of the lxml example scraper that can be used for the callback function:

def scrape_callback(url, html):

   if re.search('/view/', url):

       tree = lxml.html.fromstring(html)

        row = [tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content() for field in FIELDS]

       print url, row

This callback function would scrape the country data and print it out. Usually, when scraping a website, we want to reuse the data, so we will extend this example to save results to a CSV spreadsheet, as follows:

import csv

class ScrapeCallback:

   def __init__(self):

       self.writer = csv.writer(open('countries.csv', 'w'))

        self.fields = ('area', 'population', 'iso', 'country', 'capital',
                       'continent', 'tld', 'currency_code', 'currency_name',
                       'phone', 'postal_code_format', 'postal_code_regex',
                       'languages', 'neighbours')

       self.writer.writerow(self.fields)

 

   def __call__(self, url, html):

       if re.search('/view/', url):

           tree = lxml.html.fromstring(html)

           row = []

           for field in self.fields:

                row.append(tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content())

           self.writer.writerow(row)      

To build this callback, a class was used instead of a function so that the state of the csv writer could be maintained. This csv writer is instantiated in the constructor, and then written to multiple times in the __call__ method. Note that __call__ is a special method that is invoked when an object is "called" as a function, which is how scrape_callback is used in the link crawler. This means that scrape_callback(url, html) is equivalent to calling scrape_callback.__call__(url, html). For further details on Python's special class methods, refer to https://docs.python.org/2/reference/datamodel.html#special-method-names.
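To make the role of __call__ concrete, here is a tiny standalone illustration, unrelated to the crawler:

class Adder:
    def __init__(self, amount):
        self.amount = amount

    def __call__(self, value):
        return value + self.amount

add_five = Adder(5)
print add_five(10)            # invokes add_five.__call__(10) and prints 15
print add_five.__call__(10)   # the explicit form prints 15 as well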

This code shows how to pass this callback to the link crawler:

link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=ScrapeCallback())

Now, when the crawler is run with this callback, it will save results to a CSV file that can be viewed in an application such as Excel or LibreOffice:

[Screenshot: the generated countries.csv file opened in a spreadsheet application]

Success! We have completed our first working scraper.
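If you would rather check the results from Python than open a spreadsheet, a short sketch like the following prints each stored record, assuming the crawl has finished and countries.csv is in the current directory:

import csv

with open('countries.csv') as f:
    for row in csv.reader(f):
        print row[:3]   # print the first few fields of each record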

Summary

In this article, we walked through a variety of ways to scrape data from a web page. Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and BeautifulSoup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, and we will use it in future examples.
