
Small manual tasks, such as scanning through information sources in search of small bits of relevant information, are in fact automatable. Instead of performing these repetitive tasks over and over, we can use computers to do them and focus our own efforts on what humans are good at: high-level analysis and decision making based on the results. This tutorial shows how to use the Python language to automate common business tasks that can be greatly sped up if a computer is doing them.

The code files for this article are available on GitHub. This tutorial is an excerpt from the book Python Automation Cookbook, written by Jaime Buelta.

The internet and the World Wide Web (WWW) are the most prominent sources of information today. In this article, we will learn to retrieve and process that information programmatically. The Python requests module makes these operations very easy to perform.

We’ll cover the following recipes:

  • Downloading web pages
  • Parsing HTML
  • Crawling the web
  • Accessing password-protected pages
  • Speeding up web scraping

Downloading web pages

The basic ability to download a web page involves making an HTTP GET request against a URL. This is the basic operation of any web browser.  We’ll see in this recipe how to make a simple request to obtain a web page.

Install the requests module:

$ echo "requests==2.18.3" >> requirements.txt
$ source .venv/bin/activate
(.venv) $ pip install -r requirements.txt

We will download an example page, http://www.columbia.edu/~fdc/sample.html, because it is a straightforward HTML page that is easy to read in text mode.

How to download web pages

  1. Import the requests module:
>>> import requests
  2. Make a request to the URL, which will take a second or two:
>>> url = 'http://www.columbia.edu/~fdc/sample.html'
>>> response = requests.get(url)
  3. Check the returned object status code:
>>> response.status_code
200
  4. Check the content of the result:
>>> response.text
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n<head>\n
...
FULL BODY
...
<!-- close the <html> begun above -->\n'
  5. Check the outgoing request headers and the returned response headers:
>>> response.request.headers
{'User-Agent': 'python-requests/2.18.4', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> response.headers
{'Date': 'Fri, 25 May 2018 21:51:47 GMT', 'Server': 'Apache', 'Last-Modified': 'Thu, 22 Apr 2004 15:52:25 GMT', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding,User-Agent', 'Content-Encoding': 'gzip', 'Content-Length': '8664', 'Keep-Alive': 'timeout=15, max=85', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html', 'Set-Cookie': 'BIGipServer~CUIT~www.columbia.edu-80-pool=1764244352.20480.0000; expires=Sat, 26-May-2018 03:51:47 GMT; path=/; Httponly'}

The operation of requests is very simple: perform the operation, GET in this case, over the URL. This returns a result object that can be analyzed. Its main elements are the status_code and the body content, which can be presented as text.
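As a minimal sketch of wrapping these steps into a reusable function (the helper name download_page and the error handling are our own additions, not part of the recipe), you could write something like this:

import requests


def download_page(url):
    '''Download a URL and return its body as text'''
    response = requests.get(url)
    # Turn 4xx/5xx status codes into exceptions instead of silently
    # returning an error page
    response.raise_for_status()
    return response.text


if __name__ == '__main__':
    text = download_page('http://www.columbia.edu/~fdc/sample.html')
    print(text[:200])  # print the first characters of the body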

The full request can be checked in the request field:

>>> response.request
<PreparedRequest [GET]>
>>> response.request.url
'http://www.columbia.edu/~fdc/sample.html'

You can check out the full requests documentation for more information.

Parsing HTML

We’ll use the excellent Beautiful Soup module to parse the HTML text into a memory object that can be analyzed. Note that the package to install is beautifulsoup4, the latest available version, which works with Python 3. Add the package to your requirements.txt and install the dependencies in the virtual environment:

$ echo "beautifulsoup4==4.6.0" >> requirements.txt
$ pip install -r requirements.txt

How to perform HTML parsing

  1. Import BeautifulSoup and requests:
>>> import requests
>>> from bs4 import BeautifulSoup
  2. Set up the URL of the page to download and retrieve it:
>>> URL = 'http://www.columbia.edu/~fdc/sample.html'
>>> response = requests.get(URL)
>>> response
<Response [200]>
  3. Parse the downloaded page:
>>> page = BeautifulSoup(response.text, 'html.parser')
  4. Obtain the title of the page. Note that it is the same as what’s displayed in the browser:
>>> page.title
<title>Sample Web Page</title>
>>> page.title.string
'Sample Web Page'
  5. Find all the h3 elements in the page, to determine the existing sections:
>>> page.find_all('h3')
[<h3><a name="contents">CONTENTS</a></h3>, <h3><a name="basics">1. Creating a Web Page</a></h3>, <h3><a name="syntax">2. HTML Syntax</a></h3>, <h3><a name="chars">3. Special Characters</a></h3>, <h3><a name="convert">4. Converting Plain Text to HTML</a></h3>, <h3><a name="effects">5. Effects</a></h3>, <h3><a name="lists">6. Lists</a></h3>, <h3><a name="links">7. Links</a></h3>, <h3><a name="tables">8. Tables</a></h3>, <h3><a name="install">9. Installing Your Web Page on the Internet</a></h3>, <h3><a name="more">10. Where to go from here</a></h3>]

6. Extract the text of the Links section. Stop when you reach the next <h3> tag:

>>> link_section = page.find('a', attrs={'name': 'links'})
>>> section = []
>>> for element in link_section.next_elements:
...     if element.name == 'h3':
...         break
...     section.append(element.string or '')
...
>>> result = ''.join(section)
>>> result
'7. Links\n\nLinks can be internal within a Web page (like to\nthe Table of ContentsTable of Contents at the top), or they\ncan be to external web pages or pictures on the same website, or they\ncan be to websites, pages, or pictures anywhere else in the world.\n\n\n\nHere is a link to the Kermit\nProject home pageKermit\nProject home page.\n\n\n\nHere is a link to Section 5Section 5 of this document.\n\n\n\nHere is a link to\nSection 4.0Section 4.0\nof the C-Kermit\nfor Unix Installation InstructionsC-Kermit\nfor Unix Installation Instructions.\n\n\n\nHere is a link to a picture:\nCLICK HERECLICK HERE to see it.\n\n\n'

Notice that there are no HTML tags; it’s all raw text.

The first step is to download the page. Then, the raw text can be parsed, as in step 3. The resulting page object contains the parsed information.

BeautifulSoup allows us to search for HTML elements. It can search for the first one with .find() or return a list with .find_all(). In step 6, the code searches for a specific <a> tag that has a particular attribute, name="links". After that, it keeps iterating over .next_elements until it finds the next h3 tag, which marks the end of the section.

The text of each element is extracted and finally composed into a single text. Note the or '' that avoids storing None, which is returned when an element has no text.
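Putting the previous steps together, a small helper function (the name get_section_text is our own, not part of the recipe) can extract any named section in one call:

import requests
from bs4 import BeautifulSoup


def get_section_text(url, section_name):
    '''Return the plain text of the section anchored at <a name=section_name>'''
    page = BeautifulSoup(requests.get(url).text, 'html.parser')
    anchor = page.find('a', attrs={'name': section_name})
    parts = []
    for element in anchor.next_elements:
        if element.name == 'h3':
            # The next <h3> marks the start of the following section
            break
        parts.append(element.string or '')
    return ''.join(parts)


print(get_section_text('http://www.columbia.edu/~fdc/sample.html', 'links'))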

Crawling the web

Given the hyperlinked nature of web pages, starting from a known place and following links to other pages is a very important tool in the arsenal when scraping the web.

To do so, we crawl a site looking for a small phrase and print any paragraph that contains it. We will search only pages that belong to the same site, that is, only URLs starting with www.somesite.com. We won’t follow links to external sites.

We’ll use a prepared example, available in the GitHub repo. Download the whole site and run the included script:

$ python simple_delay_server.py

This serves the site at the URL http://localhost:8000. You can check it in a browser. It’s a simple blog with three entries. Most of it is uninteresting, but we added a couple of paragraphs that contain the keyword python.

How to crawl the web

  1. The full script, crawling_web_step1.py, is available in GitHub. The most relevant bits are displayed here:
...
def process_link(source_link, text):
    logging.info(f'Extracting links from {source_link}')
    parsed_source = urlparse(source_link)
    result = requests.get(source_link)
    # Error handling. See GitHub for details
    ...
    page = BeautifulSoup(result.text, 'html.parser')
    search_text(source_link, page, text)
    return get_links(parsed_source, page)

def get_links(parsed_source, page):
    '''Retrieve the links on the page'''
    links = []
    for element in page.find_all('a'):
        link = element.get('href')
        # Validate is a valid link. See GitHub for details
        ...
        links.append(link)
    return links

 

  2. Search for references to python, to return a list of URLs that contain it, plus the paragraph where it appears. Notice there are a couple of errors because of broken links:
$ python crawling_web_step1.py http://localhost:8000/ -p python
Link http://localhost:8000/: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/files/5eabef23f63024c20389c34b94dee593-1.html: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/files/33714fc865e02aeda2dabb9a42a787b2-0.html: --> This is the actual bit with a python reference that we are interested in.
Link http://localhost:8000/files/archive-september-2018.html: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/index.html: --> A smaller article , that contains a reference to Python
  3. Another good search term is crocodile. Try it out:
$ python crawling_web_step1.py http://localhost:8000/ -p crocodile

Let’s see each of the components of the script:

  1. A loop that goes through all the found links, in the main function (see the full script in GitHub).
  2. Downloading and parsing the link, in the process_link function:

It downloads the file, and checks that the status is correct to skip errors such as broken links. It also checks that the type (as described in Content-Type) is an HTML page, to skip PDFs and other formats. Finally, it parses the raw HTML into a BeautifulSoup object.
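The error handling is elided from the listing above; a rough sketch of the checks just described (the helper name looks_like_html is ours, and the exact code in the repo may differ) could be:

import logging


def looks_like_html(result):
    '''Return True when the response succeeded and carries an HTML body'''
    if result.status_code != 200:
        # Skip broken links and other errors
        logging.error('Error retrieving %s: %s', result.url, result.status_code)
        return False
    if 'html' not in result.headers.get('Content-Type', ''):
        # Skip PDFs and other non-HTML formats
        logging.info('Link %s is not an HTML page, skipping', result.url)
        return False
    return True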

It also parses the source link using urlparse, so that later, in step 4, it can skip all the references to external sources. urlparse divides a URL into its component elements:

>>> from urllib.parse import urlparse
>>> urlparse('http://localhost:8000/files/b93bec5d9681df87e6e8d5703ed7cd81-2.html')
ParseResult(scheme='http', netloc='localhost:8000', path='/files/b93bec5d9681df87e6e8d5703ed7cd81-2.html', params='', query='', fragment='')
  3. It searches for the text, in the search_text function:

It searches the parsed object for the specified text. Note that the search is done as a regex and only over the text content. It prints the resulting matches, including source_link, which references the URL where the match was found:

for element in page.find_all(text=re.compile(text)):
    print(f'Link {source_link}: --> {element}')
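Wrapped as a complete function, a minimal search_text along these lines (a sketch; the exact code in the repo may differ slightly) would be:

import re


def search_text(source_link, page, text):
    '''Print every text node of the parsed page that matches the search regex'''
    for element in page.find_all(text=re.compile(text)):
        print(f'Link {source_link}: --> {element}')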
  4. The get_links function retrieves all the links on a page:

It searches the parsed page for all <a> elements and retrieves their href attributes, but only for elements that have an href attribute whose value is a fully qualified URL (starting with http). This removes links that are not URLs, such as '#' links, or links that are internal to the page.

An extra check verifies that they have the same source as the original link; only then are they registered as valid links. The netloc attribute allows us to detect whether a link comes from the same domain as the parsed URL generated in step 2.

Finally, the links are returned and added to the loop described in step 1.
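The validation elided from the get_links listing could look roughly like this (a sketch based on the description above; the code in the repo may differ in detail):

from urllib.parse import urlparse


def get_links(parsed_source, page):
    '''Retrieve the links on the page, keeping only links within the same site'''
    links = []
    for element in page.find_all('a'):
        link = element.get('href')
        if not link:
            # Skip <a> tags without an href attribute
            continue
        if not link.startswith('http'):
            # Skip links that are not URLs, such as '#' anchors
            continue
        if urlparse(link).netloc != parsed_source.netloc:
            # Skip links to external sites
            continue
        links.append(link)
    return links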

Accessing password-protected pages

Sometimes a web page is not open to the public, but protected in some way. The most basic option is to use HTTP basic authentication, which is integrated into virtually every web server and uses a user/password scheme.

We can test this kind of authentication in https://httpbin.org.

It has a path, /basic-auth/{user}/{password}, which forces authentication, with the user and password stated. This is very handy for understanding how authentication works.

How to access password-protected pages

  1. Import requests:
>>> import requests
  2. Make a GET request to the URL with the right credentials. Notice that the path of the URL defines the expected credentials as user and psswd:
>>> requests.get('https://httpbin.org/basic-auth/user/psswd', 
                 auth=('user', 'psswd'))
<Response [200]>
  3. Use the wrong credentials to return a 401 status code (Unauthorized):
>>> requests.get('https://httpbin.org/basic-auth/user/psswd', 
                 auth=('user', 'wrong'))
<Response [401]>
  4. The credentials can also be passed directly in the URL, separated by a colon and followed by an @ symbol before the server, like this:
>>> requests.get('https://user:psswd@httpbin.org/basic-auth/user/psswd')
<Response [200]>
>>> requests.get('https://user:wrong@httpbin.org/basic-auth/user/psswd')
<Response [401]>
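If several requests go to the same protected site, the credentials can also be stored on a requests Session, so they are sent automatically with every call; a minimal sketch:

>>> import requests
>>> session = requests.Session()
>>> session.auth = ('user', 'psswd')
>>> session.get('https://httpbin.org/basic-auth/user/psswd')
<Response [200]>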

Speeding up web scraping

Most of the time spent downloading information from web pages is spent waiting. A request goes from our computer to whatever server will process it, and until the response is composed and comes back to our computer, we cannot do much about it.

During the execution of the recipes in the book, you’ll notice there’s a wait involved in requests calls, normally of around one or two seconds. But computers can do other stuff while waiting, including making more requests at the same time. In this recipe, we will see how to download a list of pages in parallel and wait until they are all ready. We will use an intentionally slow server to show the point.

We’ll adapt the code that crawls and searches for keywords, making use of the futures capabilities of Python 3 to download multiple pages at the same time.

A future is an object that represents the promise of a value. You receive a future object immediately, while the code is executed in the background; only when its .result() is specifically requested does the code block until the value is available.

To generate a future, you need a background engine, called an executor. Once it is created, you submit a function and its parameters to it and get a future back. The retrieval of the result can be delayed as long as necessary, which allows generating several futures in a row and waiting until all are finished, executing them in parallel, instead of creating one, waiting until it finishes, creating another, and so on.

There are several ways to create an executor; in this recipe, we’ll use ThreadPoolExecutor, which will use threads.
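As a minimal, self-contained illustration of the executor and futures pattern (independent of the crawler; the slow_download function below just simulates a slow request), consider:

>>> import time
>>> from concurrent.futures import ThreadPoolExecutor
>>> def slow_download(url):
...     time.sleep(1)  # simulate a slow request
...     return f'content of {url}'
...
>>> executor = ThreadPoolExecutor(max_workers=3)
>>> futures = [executor.submit(slow_download, url)
...            for url in ('page1', 'page2', 'page3')]
>>> [future.result() for future in futures]  # blocks until all three finish
['content of page1', 'content of page2', 'content of page3']
>>> executor.shutdown()

The three calls run in parallel, so the whole list is ready in roughly one second instead of three.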

We’ll use a prepared example, available in the GitHub repo. Download the whole site and run the included script:

$ python simple_delay_server.py -d 2

This serves the site at the URL http://localhost:8000. You can check it in a browser. It’s a simple blog with three entries. Most of it is uninteresting, but we added a couple of paragraphs that contain the keyword python. The parameter -d 2 makes the server intentionally slow, simulating a bad connection.

How to speed up web scraping

  1. Write the following script, speed_up_step1.py. The full code is available in GitHub.
  2. Notice the differences in the main function. Also, there’s an extra parameter (the number of concurrent workers), and the function process_link now returns the source link.
  3. Run the crawling_web_step1.py script to get a time baseline. Notice the output has been removed here for clarity:
$ time python crawling_web_step1.py http://localhost:8000/
... REMOVED OUTPUT
real 0m12.221s
user 0m0.160s
sys 0m0.034s
  4. Run the new script with one worker, which is slower than the original one:
$ time python speed_up_step1.py -w 1
... REMOVED OUTPUT
real 0m16.403s
user 0m0.181s
sys 0m0.068s
  5. Increase the number of workers:
$ time python speed_up_step1.py -w 2
... REMOVED OUTPUT
real 0m10.353s
user 0m0.199s
sys 0m0.068s
  6. Adding more workers decreases the time:
$ time python speed_up_step1.py -w 5
... REMOVED OUTPUT
real 0m6.234s
user 0m0.171s
sys 0m0.040s

The main engine that creates the concurrent requests is the main function. Notice that the rest of the code is basically untouched (other than returning the source link in the process_link function).

This is the relevant part of the code that handles the concurrent engine:

with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
    while to_check:
        futures = [executor.submit(process_link, url, to_search)
                   for url in to_check]
        to_check = []
        for data in concurrent.futures.as_completed(futures):
            link, new_links = data.result()
            checked_links.add(link)
            for link in new_links:
                if link not in checked_links and link not in to_check:
                    to_check.append(link)

            max_checks -= 1
            if not max_checks:
                return

 

The with context creates a pool of workers, specifying their number. Inside, a list of futures containing all the URLs to retrieve is created. The .as_completed() function returns the futures as they finish, and then some work is done to obtain the newly found links and check whether they need to be added for retrieval. This process is similar to the one presented in the Crawling the web recipe.

The process starts again and repeats until enough links have been retrieved or there are no links left to retrieve.

In this post, we learned to use the power of Python to automate web scraping tasks. To understand how to automate monotonous tasks with Python 3.7, check out our book, Python Automation Cookbook.
