
Note: This article is an excerpt from a book written by Alberto Boschetti and Luca Massaron, titled Python Data Science Essentials – Second Edition. This book will take you beyond the fundamentals of data science by diving into the world of data visualizations, web development, and deep learning.

In this article, we will learn how to download a web page and scrape it with Beautiful Soup.

This process happens more often than you might expect, and it's a very popular topic of interest in data science. For example:

  • Financial institutions scrape the Web to extract fresh details and information about the companies in their portfolio. Newspapers, social networks, blogs, forums, and corporate websites are the ideal targets for these analyses.
  • Advertisement and media companies analyze sentiment and the popularity of many pieces of the Web to understand people’s reactions.
  • Companies specialized in insight analysis and recommendation scrape the Web to understand patterns and model user behaviors.
  • Comparison websites use the web to compare prices, products, and services, offering the user an updated synoptic table of the current situation.

Unfortunately, understanding websites is very hard work, since each website is built and maintained by different people, with different infrastructures, locations, languages, and structures. The only aspect they have in common is the standard exposed language, which, most of the time, is HTML.

That's why the vast majority of the web scrapers available today are only able to understand and navigate HTML pages in a general-purpose way. One of the most widely used web parsers is Beautiful Soup. It's written in Python, it's very stable and simple to use, and it's able to cope with errors and pieces of malformed code in an HTML page (always remember that web pages are often human-made products and prone to errors).

A complete description of Beautiful Soup would require an entire book; here we will cover just a few bits. First of all, Beautiful Soup is not a crawler. In order to download a web page, we should use, for example, the urllib library.

Let’s now download the code behind the William Shakespeare page on Wikipedia:

In:
import urllib.request

url = 'https://en.wikipedia.org/wiki/William_Shakespeare'
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
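
Depending on the site and its policies, the default urllib user agent may be rejected; if urlopen raises an HTTP error, one option is to pass a custom User-Agent header to Request. This is a small addition of ours, not part of the original excerpt, and the header value below is only a placeholder:

import urllib.request

url = 'https://en.wikipedia.org/wiki/William_Shakespeare'
# Placeholder User-Agent: identify your scraper honestly in real projects
headers = {'User-Agent': 'my-scraper/0.1 (contact: you@example.com)'}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
print(response.status)  # 200 means the page was downloaded successfully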

It’s time to instruct Beautiful Soup to read the resource and parse it using the HTML parser:

In:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response, 'html.parser')

Now the soup is ready and can be queried. To extract the title, we can simply ask for the title attribute:

In: soup.title

Out: <title>William Shakespeare - Wikipedia, the free encyclopedia</title>
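
If you just want the text inside the tag rather than the whole element, a small addition of ours (not in the original excerpt) is to read the tag's string attribute:

print(soup.title.string)
# William Shakespeare - Wikipedia, the free encyclopedia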

As the output of soup.title shows, the whole title tag is returned, allowing a deeper investigation of the nested HTML structure. What if we want to know the categories associated with the Wikipedia page of William Shakespeare? It can be very useful to create a graph of the entry, simply by recursively downloading and parsing the adjacent pages. We should first manually analyze the HTML page itself to figure out which HTML tag contains the information we're looking for. Remember the "no free lunch" theorem in data science here: there are no auto-discovery functions, and furthermore, things can change if Wikipedia modifies its format.

After a manual analysis, we discover that the categories are inside a div named "mw-normal-catlinks"; excluding the first link, all the others are fine. Now it's time to program. Let's put what we've observed into code, printing, for each category, the title of the linked page and the relative link to it:

In:
section = soup.find_all(id='mw-normal-catlinks')[0]
for catlink in section.find_all("a")[1:]:
    print(catlink.get("title"), "->", catlink.get("href"))

Out:
Category:William Shakespeare -> /wiki/Category:William_Shakespeare
Category:1564 births -> /wiki/Category:1564_births
Category:1616 deaths -> /wiki/Category:1616_deaths
Category:16th-century English male actors -> /wiki/Category:16th-century_English_male_actors
Category:English male stage actors -> /wiki/Category:English_male_stage_actors
Category:16th-century English writers -> /wiki/Category:16th-century_English_writers

We've used the find_all method twice to find all the HTML tags matching the argument. In the first case, we were specifically looking for an element with a given ID; in the second case, we were looking for all the "a" tags inside it.
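
As a side note that is not in the original excerpt, the same query can also be written as a CSS selector through Beautiful Soup's select method (assuming a reasonably recent bs4 installation), combining the ID lookup and the tag lookup in a single expression:

for catlink in soup.select('#mw-normal-catlinks a')[1:]:
    print(catlink.get("title"), "->", catlink.get("href"))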

Given this output, it's possible to use the same code with the new URLs to recursively download the Wikipedia category pages and, from there, reach the ancestor categories.
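
A rough sketch of that idea follows; it is our own illustration rather than the book's code, and the function names, the depth limit, and the one-second pause are arbitrary choices:

import time
import urllib.request
from bs4 import BeautifulSoup

BASE = 'https://en.wikipedia.org'

def get_categories(url):
    # Download one page and return (title, href) pairs for its category links
    response = urllib.request.urlopen(urllib.request.Request(url))
    soup = BeautifulSoup(response, 'html.parser')
    sections = soup.find_all(id='mw-normal-catlinks')
    if not sections:
        return []
    return [(a.get("title"), a.get("href"))
            for a in sections[0].find_all("a")[1:]]

def crawl_categories(start_url, max_depth=2):
    # Breadth-first walk over category pages, up to an arbitrary depth limit
    seen, frontier = set(), [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        for title, href in get_categories(url):
            print("  " * depth + str(title), "->", href)
            frontier.append((BASE + href, depth + 1))
        time.sleep(1)  # keep the request rate low
    return seen

crawl_categories('https://en.wikipedia.org/wiki/William_Shakespeare')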

A final note about scraping: always remember that this practice is not always allowed, and even when it is, remember to tune down the rate of the downloads (at high rates, the website's server may think you're running a small-scale DoS attack and will probably blacklist/ban your IP address). For more information, read the terms and conditions of the website, or simply contact the administrators. Downloading data from sites where copyright laws are in place is bound to get you into real legal trouble. That's also why most companies that employ web scraping use external vendors for this task, or have a special arrangement with the site owners.
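
A minimal way to act on this advice, again our own sketch rather than the book's code, is to check the site's robots.txt with the standard library's robotparser and to pause between requests:

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://en.wikipedia.org/robots.txt')
rp.read()

url = 'https://en.wikipedia.org/wiki/William_Shakespeare'
# '*' stands for a generic user agent; use your scraper's own name in practice
if rp.can_fetch('*', url):
    time.sleep(1)  # pause between requests to keep the load on the server low
    # ... download and parse the page as shown above ...
else:
    print('robots.txt disallows fetching', url)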

We learned about the process of data scraping and its practical use cases in industry, and finally we put the theory into practice by learning how to scrape data from the web with the help of Beautiful Soup.

If you enjoyed this excerpt, check out the book Python Data Science Essentials – Second Edition to learn more about libraries such as pandas and NumPy that provide you with all the tools to load and effectively manage your data.
