This article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. The book contains step-by-step tutorials on how to leverage Python programming techniques for ethical web scraping.

A common practice in scraping is downloading, storing, and further processing media content (files that are neither web pages nor data files). This media can include images, audio, and video. To store the content correctly, whether locally or in a service like S3, we need to know the type of the media, and it isn’t enough to trust the file extension in the URL. Hence, we will learn how to download media and determine its type from information supplied by the web server.

Another common task is generating thumbnails of images, videos, or even a page of a website. We will examine several techniques for generating thumbnails and taking screenshots of website pages. These are often used on a new website as thumbnail links to scraped media that is stored locally.

Finally, we often need to transcode media, such as converting non-MP4 videos to MP4, or changing the bit rate or resolution of a video. Another scenario is extracting only the audio from a video file. We won’t look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg. It’s a simple step from there to also transcode video with ffmpeg.

Downloading media content from the web

Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content.

Getting ready

There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of the scenarios in this chapter with downloading and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. Also, the example for this recipe is in the 04/01_download_image.py file.

How to do it

Here is how we proceed with the recipe:

  1. The URLUtility class can download content from a URL. The code in the recipe’s file is the following:
import const

from util.urls import URLUtility

util = URLUtility(const.ApodEclipseImage())

print(len(util.data))
  2. When running this you will see the following output:
Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
171014

The example reads 171014 bytes of data.

How it works

The URL is returned by the function const.ApodEclipseImage(), which acts as a constant in the const module:

def ApodEclipseImage():
    return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg"

The constructor of the URLUtility class has the following implementation:

def __init__(self, url, readNow=True):
    """ Construct the object, parse the URL, and download now if
    specified"""
    self._url = url
    self._response = None
    self._parsed = urlparse(url)
    if readNow:
        self.read()

The constructor stores the URL, parses it, and downloads the file with the read() method. The following is the code of the read() method:

def read(self):
    self._response = urllib.request.urlopen(self._url)
    self._data = self._response.read()

This method uses urlopen to get a response object, then reads the stream and stores the bytes in the object’s _data attribute. That data can then be retrieved using the data property:

@property
def data(self):
    self.ensure_response()
    return self._data

The code then simply reports on the length of that data, with the value of 171014.
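
To see how these pieces fit together, here is a minimal sketch of a class with the same shape. The name SimpleURLUtility is hypothetical, and the body of ensure_response() is an assumption, since the excerpt does not show it; the real URLUtility in the book's solution may differ.

```python
import urllib.request
from urllib.parse import urlparse


class SimpleURLUtility:
    """Hypothetical, trimmed-down stand-in for the URLUtility class."""

    def __init__(self, url, readNow=True):
        self._url = url
        self._response = None
        self._data = None
        self._parsed = urlparse(url)
        if readNow:
            self.read()

    def read(self):
        # The with-block closes the connection once the body has been read;
        # the response headers remain available afterwards.
        with urllib.request.urlopen(self._url) as response:
            self._response = response
            self._data = response.read()

    def ensure_response(self):
        # Assumed behavior: download lazily if read() has not run yet.
        if self._response is None:
            self.read()

    @property
    def data(self):
        self.ensure_response()
        return self._data
```

With this sketch, passing readNow=False defers the download until the data property is first accessed.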

There’s more

This class will be used for other tasks such as determining content types, filename, and extensions for those files. We will examine parsing of URLs for filenames next.

Parsing a URL with urllib to get the filename

When downloading content from a URL, we often want to save it to a file, and it is often good enough to use a name found in the URL. But a URL consists of several components, so how do we find the actual filename, especially when a number of parameters follow it?

Getting ready

We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py.

How to do it

Execute the recipe’s file with your Python interpreter. It will run the following code:

util = URLUtility(const.ApodEclipseImage())
print("The filename is: " + util.filename_without_ext)

This results in the following output:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The filename is: BT5643s

How it works

In the constructor for URLUtility, there is a call to urllib.parse.urlparse. The following demonstrates using the function interactively:

>>> parsed = urlparse(const.ApodEclipseImage())
>>> parsed
ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='')

The ParseResult object contains the various components of the URL. The path element contains the path and the filename. The call to the .filename_without_ext property returns just the filename without the extension:

@property
def filename_without_ext(self):
    filename = os.path.splitext(os.path.basename(self._parsed.path))[0]
    return filename

The call to os.path.basename returns only the filename portion of the path (including the extension). os.path.splitext() then separates the filename from the extension, and the function returns the first element of the resulting tuple (the filename).
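
The same technique can be sketched as a standalone function. This version is an illustration rather than the book's code: it uses posixpath instead of os.path so that URL paths are split on forward slashes on every platform, and the query string and fragment in the sample URL are made up to show that they do not end up in the filename.

```python
import posixpath
from urllib.parse import urlparse


def filename_without_ext(url):
    """Return the last path segment of a URL, minus its extension."""
    # urlparse keeps the query string and fragment out of .path
    path = urlparse(url).path
    return posixpath.splitext(posixpath.basename(path))[0]


# The "?size=large#top" suffix is hypothetical, added for illustration.
print(filename_without_ext(
    "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg?size=large#top"))
# prints: BT5643s
```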

There’s more

It may seem odd that this does not also return the extension as part of the filename. This is because we cannot assume that the content that we received actually matches the implied type from the extension. It is more accurate to determine this using headers returned by the web server. That’s our next recipe.

Determining the type of content for a URL

When performing a GET request for content from a web server, the web server will return a number of headers, one of which identifies the type of the content from the perspective of the web server. In this recipe we learn to use that header to determine what the web server considers the type of the content.

Getting ready

We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py.

How to do it

We proceed as follows:

  1. Execute the script for the recipe. It contains the following code:
util = URLUtility(const.ApodEclipseImage())
print("The content type is: " + util.contenttype)
  2. With the following result:
Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The content type is: image/jpeg

How it works

The .contenttype property is implemented as follows:

@property
def contenttype(self):
    self.ensure_response()
    return self._response.headers['content-type']

The .headers property of the _response object is a dictionary-like object holding the headers. The content-type key retrieves the content type specified by the server; header lookups are case-insensitive. The call to the ensure_response() method simply ensures that the .read() method has been executed.
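
One caveat is that servers often append parameters to this header, such as a charset, so the raw value is not always a bare media type. The headers object returned by urlopen is an http.client.HTTPMessage, a subclass of email.message.Message, which provides get_content_type() for this purpose. The following sketch builds a Message by hand (with an invented header value) so that it runs without a network connection; the media_type function is a hypothetical helper, not part of URLUtility.

```python
from email.message import Message


def media_type(headers):
    """Return the lowercased media type without parameters such as charset."""
    return headers.get_content_type()


# Hand-built headers standing in for response.headers (values are made up).
headers = Message()
headers["Content-Type"] = "text/html; charset=UTF-8"

print(headers["content-type"])  # prints: text/html; charset=UTF-8
print(media_type(headers))      # prints: text/html
```

Using the bare media type matters when the value is later used as a dictionary key, because a key such as image/jpeg would not match a header value that carries extra parameters.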

There’s more

The headers in a response contain a wealth of information. If we look more closely at the headers property of the response, we can see the following headers are returned:

>>> response = urllib.request.urlopen(const.ApodEclipseImage())
>>> for header in response.headers: print(header)
Date
Server
Last-Modified
ETag
Accept-Ranges
Content-Length
Connection
Content-Type
Strict-Transport-Security

And we can see the values for each of these headers.

>>> for header in response.headers: print(header + " ==> " + response.headers[header])
Date ==> Tue, 26 Sep 2017 19:31:41 GMT
Server ==> WebServer/1.0
Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT
ETag ==> "547bb44-29c06-5581275ce2b86"
Accept-Ranges ==> bytes
Content-Length ==> 171014
Connection ==> close
Content-Type ==> image/jpeg
Strict-Transport-Security ==> max-age=31536000; includeSubDomains

Many of these we will not examine in this book, but for the unfamiliar it is good to know that they exist.

Determining the file extension from a content type

It is good practice to use the content-type header to determine the type of content, and to determine the extension to use for storing the content as a file.

Getting ready

We again use the URLUtility object that we created. The recipe’s script is 04/04_determine_file_extension_from_contenttype.py.

How to do it

Proceed by running the recipe’s script. An extension for the media type can be found using the .extension_from_contenttype property:

util = URLUtility(const.ApodEclipseImage())

print("Filename from content-type: " + util.extension_from_contenttype)

print("Filename from url: " + util.extension_from_url)

This results in the following output:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
Filename from content-type: .jpg
Filename from url: .jpg

This reports the extension determined from the content type as well as the one taken from the URL. These can differ, but in this case they are the same.

How it works

The following is the implementation of the .extension_from_contenttype property:

@property
def extension_from_contenttype(self):
    self.ensure_response()
    map = const.ContentTypeToExtensions()
    if self.contenttype in map:
        return map[self.contenttype]
    return None

The first line ensures that we have read the response from the URL. The function then uses a Python dictionary, defined in the const module, that maps content types to extensions:

def ContentTypeToExtensions():
    return {
        "image/jpeg": ".jpg",
        "image/jpg": ".jpg",
        "image/png": ".png"
    }

If the content type is in the dictionary, then the corresponding value will be returned. Otherwise, None is returned. Note the corresponding property, .extension_from_url:

@property
def extension_from_url(self):
    ext = os.path.splitext(os.path.basename(self._parsed.path))[1]
    return ext

This uses the same technique as the .filename_without_ext property to parse the URL, but instead returns the [1] element, which represents the extension instead of the base filename.
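
The two properties can be combined into a single decision about which extension to use. The following is a sketch under stated assumptions, not the book's implementation: choose_extension and KNOWN_EXTENSIONS are hypothetical names, the dictionary mirrors the one shown above, and the fallback to the standard library's mimetypes module is an addition whose results depend on the platform's MIME tables.

```python
import mimetypes
import posixpath
from urllib.parse import urlparse

# Mirrors the mapping returned by const.ContentTypeToExtensions()
KNOWN_EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/jpg": ".jpg",
    "image/png": ".png",
}


def choose_extension(content_type, url):
    """Prefer the server's content type; fall back to the URL's extension."""
    # Drop parameters such as "; charset=..." and normalize the case
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type in KNOWN_EXTENSIONS:
        return KNOWN_EXTENSIONS[media_type]
    # Assumption: the standard library's table is an acceptable second opinion
    guessed = mimetypes.guess_extension(media_type) if media_type else None
    if guessed:
        return guessed
    # Last resort: whatever extension the URL path carries, or None
    return posixpath.splitext(urlparse(url).path)[1] or None


print(choose_extension("image/jpeg",
                       "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg"))
# prints: .jpg
```

Checking the explicit dictionary first keeps the result predictable for the types we care about, while the fallbacks avoid returning None for types the dictionary does not list.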

To summarize, we discussed how to download media content from the web with Python and how to determine its filename, content type, and file extension.

If you liked our post, be sure to check out Web Scraping with Python, which gives more information on performing web scraping efficiently with Python.
