[box type=”note” align=”” class=”” width=””]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box]
A common practice in scraping is the download, storage, and further processing of media content (non-web pages or data files). This media can include images, audio, and video. To store the content correctly, whether locally or in a service like S3, we need to know the media type, and it isn’t enough to trust the file extension in the URL. Hence, we will learn how to download media and correctly determine its type from information supplied by the web server.
Another common task is the generation of thumbnails of images, videos, or even a page of a website. We will examine several techniques for generating thumbnails and taking screenshots of web pages. These are often used on a new website as thumbnail links to the scraped media, which is stored locally.
Finally, there is often a need to transcode media, such as converting non-MP4 videos to MP4, or changing the bit-rate or resolution of a video. Another scenario is to extract only the audio from a video file. We won’t look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg. It’s a simple step from there to also transcode video with ffmpeg.
Downloading media content from the web
Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content.
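As a rough illustration of how simple the download step can be, here is a minimal sketch that streams a URL straight to a file using only the standard library. The function name download_to_file and the 30-second timeout are assumptions made for this example; they are not part of the book's code.

```python
import shutil
import urllib.request


def download_to_file(url, path, timeout=30):
    """Stream the body of `url` into the file at `path`; return the byte count."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(path, "wb") as out:
            # copyfileobj copies in chunks, so large media never sits fully in memory
            shutil.copyfileobj(response, out)
            return out.tell()
```

Calling download_to_file with the image URL used in this recipe and a local path of your choice should write the image bytes to disk and return how many were written.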
Getting ready
There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of the scenarios in this chapter that involve downloading content and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. The example for this recipe is in the 04/01_download_image.py file.
How to do it
Here is how we proceed with the recipe:
- The URLUtility class can download content from a URL. The code in the recipe’s file is the following:
import const
from util.urls import URLUtility
util = URLUtility(const.ApodEclipseImage())
print(len(util.data))
- When running this you will see the following output:
Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
171014
The example reads 171014 bytes of data.
How it works
The URL is defined as a constant const.ApodEclipseImage() in the const module:
def ApodEclipseImage():
    return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg"
The constructor of the URLUtility class has the following implementation:
def __init__(self, url, readNow=True):
    """ Construct the object, parse the URL, and download now if specified"""
    self._url = url
    self._response = None
    self._parsed = urlparse(url)
    if readNow:
        self.read()
The constructor stores the URL, parses it, and downloads the file with the read() method. The following is the code of the read() method:
def read(self):
    self._response = urllib.request.urlopen(self._url)
    self._data = self._response.read()
This method uses urlopen to get a response object, then reads the stream and stores the bytes in the _data attribute of the object. That data can then be retrieved using the data property:
@property
def data(self):
    self.ensure_response()
    return self._data
The code then simply reports on the length of that data, with the value of 171014.
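To see how these pieces fit together, the following is a trimmed-down, hypothetical stand-in for the class. The name MiniURLUtility is invented for this sketch, and the body of ensure_response() is an assumption, since the excerpt only describes what that method does rather than showing it.

```python
import urllib.request
from urllib.parse import urlparse


class MiniURLUtility:
    """Hypothetical, trimmed-down stand-in for the book's URLUtility."""

    def __init__(self, url, read_now=True):
        self._url = url
        self._response = None
        self._data = None
        self._parsed = urlparse(url)
        if read_now:
            self.read()

    def read(self):
        # The with block closes the connection once the body has been read;
        # the response headers remain available afterwards.
        with urllib.request.urlopen(self._url) as response:
            self._response = response
            self._data = response.read()

    def ensure_response(self):
        # Assumed behaviour: download lazily if nothing has been read yet
        if self._response is None:
            self.read()

    @property
    def data(self):
        self.ensure_response()
        return self._data
```

With read_now=False, nothing is downloaded until the data property is first accessed, which is the situation ensure_response() guards against.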
There’s more
This class will be used for other tasks such as determining content types, filename, and extensions for those files. We will examine parsing of URLs for filenames next.
Parsing a URL with urllib to get the filename
When downloading content from a URL, we often want to save it to a file, and it is often good enough to name that file after the name found in the URL. But a URL consists of several components, so how can we find the actual filename, especially when it is followed by query parameters?
Getting ready
We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py.
How to do it
Execute the recipe’s file with your Python interpreter. It will run the following code:
util = URLUtility(const.ApodEclipseImage())
print("The filename is: " + util.filename_without_ext)
This results in the following output:
Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The filename is: BT5643s
How it works
In the constructor for URLUtility, there is a call to urllib.parse.urlparse. The following demonstrates using the function interactively:
>>> from urllib.parse import urlparse
>>> parsed = urlparse(const.ApodEclipseImage())
>>> parsed
ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='')
The ParseResult object contains the various components of the URL. The path element contains the path and the filename. The call to the .filename_without_ext property returns just the filename without the extension:
@property
def filename_without_ext(self):
    filename = os.path.splitext(os.path.basename(self._parsed.path))[0]
    return filename
The call to os.path.basename returns only the filename portion of the path (including the extension). os.path.splitext() then separates the filename from the extension, and the property returns the first element of the resulting tuple (the filename).
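The same technique copes with URLs that carry parameters after the filename, because urlparse keeps the query string and fragment out of the path component. Here is a small sketch; the helper name split_url_filename and the query string and fragment appended to the URL are made up for illustration.

```python
import os
from urllib.parse import urlparse


def split_url_filename(url):
    """Return (name, extension) for the last path segment of `url`."""
    path = urlparse(url).path  # query string and fragment are excluded here
    return os.path.splitext(os.path.basename(path))


# The query string and fragment below are invented for this example.
name, ext = split_url_filename(
    "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg?width=800&v=2#top"
)
print(name, ext)  # BT5643s .jpg
```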
There’s more
It may seem odd that this does not also return the extension as part of the filename. This is because we cannot assume that the content that we received actually matches the implied type from the extension. It is more accurate to determine this using headers returned by the web server. That’s our next recipe.
Determining the type of content for a URL
When performing a GET request for content from a web server, the web server will return a number of headers, one of which identifies the type of the content from the perspective of the web server. In this recipe we learn to use that header to determine what the web server considers the type of the content to be.
Getting ready
We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py.
How to do it
We proceed as follows:
- Execute the script for the recipe. It contains the following code:
util = URLUtility(const.ApodEclipseImage())
print("The content type is: " + util.contenttype)
- With the following result:
Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The content type is: image/jpeg
How it works
The .contenttype property is implemented as follows:
@property
def contenttype(self):
    self.ensure_response()
    return self._response.headers['content-type']
The .headers property of the _response object is a dictionary-like object holding the response headers. The content-type key retrieves the content type specified by the server. The call to the ensure_response() method simply ensures that the .read() method has been executed.
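One detail worth keeping in mind is that a Content-Type value may carry parameters, such as a charset, after the media type. The headers object returned by urlopen is an http.client.HTTPMessage, a subclass of email.message.Message, so its get_content_type() method can strip those parameters. The sketch below builds a Message by hand so that it runs without a network connection; the helper name media_type and the header value are assumptions for this example.

```python
from email.message import Message


def media_type(headers):
    """Return the bare media type, e.g. 'image/jpeg', without parameters."""
    # get_content_type() lowercases the value and drops parameters such as charset
    return headers.get_content_type()


# Build headers by hand so the sketch runs without a network connection.
headers = Message()
headers["Content-Type"] = "Text/HTML; charset=UTF-8"
print(headers["content-type"])  # Text/HTML; charset=UTF-8
print(media_type(headers))      # text/html
```

Note that get_content_type() falls back to text/plain when the header is missing, so check for the header explicitly if that distinction matters.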
There’s more
The headers in a response contain a wealth of information. If we look more closely at the headers property of the response, we can see the following headers are returned:
>>> response = urllib.request.urlopen(const.ApodEclipseImage())
>>> for header in response.headers: print(header)
Date
Server
Last-Modified
ETag
Accept-Ranges
Content-Length
Connection
Content-Type
Strict-Transport-Security
And we can see the values for each of these headers.
>>> for header in response.headers: print(header + " ==> " + response.headers[header])
Date ==> Tue, 26 Sep 2017 19:31:41 GMT
Server ==> WebServer/1.0
Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT
ETag ==> "547bb44-29c06-5581275ce2b86"
Accept-Ranges ==> bytes
Content-Length ==> 171014
Connection ==> close
Content-Type ==> image/jpeg
Strict-Transport-Security ==> max-age=31536000; includeSubDomains
Many of these we will not examine in this book, but for the unfamiliar it is good to know that they exist.
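As one illustration of how these values can be put to use, the sketch below converts the Last-Modified and Content-Length strings into a datetime and an integer with the standard library. The header values are copied from the output above and placed in a hand-built Message object, which is an assumption made so that the example runs offline.

```python
from email.message import Message
from email.utils import parsedate_to_datetime

# Header values copied from the output above; built by hand to stay offline.
headers = Message()
headers["Last-Modified"] = "Thu, 31 Aug 2017 20:26:32 GMT"
headers["Content-Length"] = "171014"

# Header values are always strings, so convert them before comparing
last_modified = parsedate_to_datetime(headers["Last-Modified"])
size = int(headers["Content-Length"])

print(last_modified.isoformat())  # 2017-08-31T20:26:32+00:00
print(size)                       # 171014
```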
Determining the file extension from a content type
It is good practice to use the content-type header to determine the type of content, and to determine the extension to use for storing the content as a file.
Getting ready
We again use the URLUtility object that we created. The recipe’s script is 04/04_determine_file_extension_from_contenttype.py.
How to do it
Proceed by running the recipe’s script. An extension for the media type can be found using the .extension_from_contenttype property:
util = URLUtility(const.ApodEclipseImage())
print("Filename from content-type: " + util.extension_from_contenttype)
print("Filename from url: " + util.extension_from_url)
This results in the following output:
Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
Filename from content-type: .jpg
Filename from url: .jpg
This reports both the extension determined from the content type and the one taken from the URL. These can be different, but in this case they are the same.
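When the two values disagree, or when one of them is missing, some rule is needed to pick the extension to store the file under. The following is a minimal sketch of one such rule, preferring the server-reported type. The function names and the ".bin" default are assumptions for this example and do not appear in the book's code.

```python
def choose_extension(from_contenttype, from_url, default=".bin"):
    """Prefer the server-reported type, then the URL, then a default."""
    # None and the empty string are both treated as missing
    return from_contenttype or from_url or default


def local_filename(name, from_contenttype, from_url):
    """Join a base filename with the chosen extension."""
    return name + choose_extension(from_contenttype, from_url)


print(local_filename("BT5643s", ".jpg", ".jpg"))  # BT5643s.jpg
print(local_filename("photo", ".png", ".jpg"))    # photo.png
print(local_filename("download", None, ""))       # download.bin
```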
How it works
The following is the implementation of the .extension_from_contenttype property:
@property
def extension_from_contenttype(self):
    self.ensure_response()
    map = const.ContentTypeToExtensions()
    if self.contenttype in map:
        return map[self.contenttype]
    return None
The first line ensures that we have read the response from the URL. The function then uses a Python dictionary, defined in the const module, that maps content types to extensions:
def ContentTypeToExtensions():
    return {
        "image/jpeg": ".jpg",
        "image/jpg": ".jpg",
        "image/png": ".png"
    }
If the content type is in the dictionary, then the corresponding value will be returned. Otherwise, None is returned. Note the corresponding property, .extension_from_url:
@property
def extension_from_url(self):
    ext = os.path.splitext(os.path.basename(self._parsed.path))[1]
    return ext
This uses the same technique as the .filename_without_ext property to parse the URL, but instead returns the [1] element, which represents the extension instead of the base filename.
To summarize, we discussed how to effectively scrape media content such as audio, video, and images from the web using Python.
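A hand-written dictionary only covers the types listed in it. One possible extension, sketched here as an alternative rather than as the book's approach, is to fall back to the standard library's mimetypes table for anything the dictionary does not list. The names PREFERRED and extension_for are assumptions for this example.

```python
import mimetypes

# Explicit overrides win, so the result for common types stays predictable.
PREFERRED = {"image/jpeg": ".jpg", "image/jpg": ".jpg", "image/png": ".png"}


def extension_for(content_type):
    """Map a Content-Type header value to a file extension, or None."""
    # Drop parameters such as "; charset=..." and normalize the case
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in PREFERRED:
        return PREFERRED[media_type]
    # Fall back to the standard library's table; returns None if unknown
    return mimetypes.guess_extension(media_type)


print(extension_for("image/jpeg"))                 # .jpg
print(extension_for("IMAGE/PNG; charset=binary"))  # .png
```

Keeping the explicit dictionary in front of the fallback matters because the extension that mimetypes reports for a given type can vary between Python versions and platforms.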
If you liked our post, be sure to check out Web Scraping with Python, which gives more information on performing web scraping efficiently with Python.