To perform this task, usually three basic steps are followed:
- Explore the website to find out where the desired information is located in the HTML DOM tree
- Download as many web pages as needed
- Parse downloaded web pages and extract the information from the places found in the exploration step
The exploration step is performed manually with the aid of some tools that make it easier to locate the information and reduce the development time in next steps. The download and parsing steps are usually performed in an iterative cycle since they are interrelated. This is because the next page to download may depend on a link or similar in the current page, so not every web page can be downloaded without previously looking into the earlier one.
This article will show an example covering the three steps mentioned and how this could be done using python with some development. The code that will be displayed is guaranteed to work at the time of writing, however it should be taken into account that it may stop working in future if the presentation format changes. The reason is that web scraping depends on the DOM tree to be stable enough, that is to say, as happens with regular expressions, it will work fine for slight changes in the information being parsed. However, when the presentation format is completely changed, the web scraping scripts have to be modified to match the new DOM tree.
Let’s say you are a fan of Pack Publishing article network and that you want to keep a list of the titles of all the articles that have been published until now and the link to them. First of all, you will need to connect to the main article network page (http://www.packtpub.com/article-network) and start exploring the web page to have an idea about where the information that you want to extract is located.
Many ways are available to perform this task such as view the source code directly in your browser or download it and inspect it with your favorite editor. However, HTML pages often contain auto-generated code and are not as readable as they should be, so using a specialized tool might be quite helpful. In my opinion, the best one for this task is the Firebug add-on for the Firefox browser.
With this add-on, instead of looking carefully in the code looking for some string, all you have to do is press the Inspect button, move the pointer to the area in which you are interested and click. After that, the HTML code for the area marked and the location of the tag in the DOM tree will be clearly displayed. For example, the links to the different pages containing all the articles are located inside a right tag,
and, in every page, the links to the articles are contained as list items in an unnumbered list. In addition to this, the links URLs, as you probably have noticed while reading other articles, start with http://www.packtpub.com/article/
So, our scraping strategy will be
- Get the list of links to all pages containing articles
- Follow all links so as to extract the article information in all pages
One small optimization here is that main article network page is the same as the one pointed by the first page link, so we will take this into account to avoid loading the same page twice when we develop the code.
Before parsing any web page, the contents of that page must be downloaded. As usual, there are many ways to do this:
- Creating your own HTTP requests using urllib2 standard python library
- Using a more advanced library that provides the capability to navigate through a website simulating a browser such as mechanize.
In this article mechanize will be covered as it is the easiest choice. mechanize is a library that provides a Browser class that lets the developer to interact with a website in a similar way a real browser would. In particular it provides methods to open pages, follow links, change form data and submit forms.
Recalling the scraping strategy in our previous version, the first thing we would like to do is to download the main article network web page. To do that we will create a Browser class instance and then open the main article network page:
>>> import mechanize
>>> BASE_URL = "http://www.packtpub.com/article-network"
>>> br = mechanize.Browser()
>>> data = br.open(BASE_URL).get_data()
>>> links = scrape_links(BASE_URL, data)
Where the result of the open method is an HTTP response object, the get_data method returns the contents of the web page. The scrape_links function will be explained later. For now, as pointed out in the introduction section, bear in mind that the downloading and parsing steps are usually performed iteratively since some contents to be downloaded depends on the parsing done in some kind of initial contents such as in this case.