As part of the reconnaissance phase of every web penetration test, we will need to browse every link included in a web page and keep a record of every file it displays. There are tools that help us automate and accelerate this task; they are called web crawlers or web spiders. These tools browse a web page, follow all links and references to external files, sometimes fill in forms and send them to the server, save all requests and responses made, and give us the opportunity to analyze them offline.
In this article by Gilberto Najera Gutierrez, the author of Kali Linux Web Penetration Testing Cookbook, we will cover the use of some crawlers included in Kali Linux.
Wget is part of the GNU project and is included in most major Linux distributions, including Kali Linux. It has the ability to recursively download a web page for offline browsing, including the conversion of links and downloading of non-HTML files.
In this recipe, we will use wget to download the pages associated with an application on our vulnerable_vm.
All recipes in this article require the vulnerable_vm to be running. In this particular scenario, it will have the IP address 192.168.56.102.
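Before starting, it is worth confirming that the target application is reachable from our Kali Linux machine. As a quick check, assuming the IP address above, a HEAD request to the BodgeIt application should return an HTTP response such as 200 OK:
curl -I http://192.168.56.102/bodgeit/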
wget http://192.168.56.102/bodgeit/
As you can see, it only downloaded the index.html file, which is the starting page of the application, to the current directory.
We will need to use some options to tell wget to save all downloaded files to a specific directory and to copy all the files contained in the URL we set as a parameter. Let's first create a directory to save the files:
mkdir bodgeit_offline
wget -r -P bodgeit_offline/ http://192.168.56.102/bodgeit/
As mentioned earlier, wget is a tool created to download HTTP content. With the -r parameter, we made it act recursively, that is, follow all the links in every page it downloads and download those pages as well. The -P option allows us to set the directory prefix, that is, the directory where wget will start saving the downloaded content; by default, this is set to the current path.
There are other useful options to consider when using wget, such as limiting the recursion depth, converting links for offline browsing, downloading page requisites (images and stylesheets), and waiting between requests; a combined example is shown below.
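As a minimal sketch (assuming the same target URL and output directory as before), the following command limits the recursion to two levels with -l, converts links for local browsing with -k, downloads page requisites with -p, and waits one second between requests with -w:
wget -r -l 2 -k -p -w 1 -P bodgeit_offline/ http://192.168.56.102/bodgeit/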
As stated on HTTrack's official website (http://www.httrack.com):
It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer.
In this recipe, we will be using HTTrack to download the entire content of an application’s site.
HTTrack is not installed by default in Kali Linux, so we will need to install it:
apt-get update
apt-get install httrack
mkdir bodgeit_httrack
cd bodgeit_httrack
httrack http://192.168.56.102/bodgeit/
It is important to include the trailing /; if it is omitted, httrack will return a 404 error because there is no bodgeit file in the root of the server.
HTTrack creates a full static copy of the site; this means that all dynamic content, such as responses to forms or other user inputs, won't be available. Inside the folder where we downloaded the site, we can see the following structure:
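A quick way to inspect the result is to list the directory we created (we are still inside bodgeit_httrack after the cd above). The exact contents depend on the HTTrack version, but we would typically expect a project index.html, an hts-cache directory, an hts-log.txt log file, and a folder named after the host (192.168.56.102) containing the mirrored pages:
ls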
HTTrack also has an extensive collection of options that allow us to customize its behavior to better fit our requirements. Some useful modifiers control the mirror depth, the output directory, and which file types are accepted or refused; a combined example is shown below.
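As a rough sketch of how these modifiers combine (the output directory name bodgeit_httrack_filtered is just an example), -r3 limits the mirror depth to three levels, -O sets the output path, and the quoted +/- patterns accept or refuse matching file types:
httrack http://192.168.56.102/bodgeit/ -O bodgeit_httrack_filtered -r3 "+*.css" "+*.js" "-*.zip"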
Downloading a full site to a directory on our computer leaves us with a static copy of the information; this means that we have the output produced by the different requests, but we have neither the requests themselves nor the server's responses (headers, status codes, and so on). To keep a record of this information, we have spiders such as the one integrated into OWASP ZAP.
In this recipe, we will use ZAP’s spider to crawl a directory in our vulnerable_vm and check the information that it captures.
For this recipe, we need to have the vulnerable_vm running, OWASP ZAP running, and the browser configured to use ZAP as a proxy.
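ZAP's local proxy listens on 127.0.0.1:8080 by default, so the browser's proxy settings should point to that host and port. As a quick sanity check (assuming that default address), we can send a request through the proxy from the command line; if everything is configured correctly, it will print the status code and the request will appear in ZAP's Sites tree:
curl -s -o /dev/null -w "%{http_code}\n" -x http://127.0.0.1:8080 http://192.168.56.102/bodgeit/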
On the right-hand side of the panel, we can see the full request made, including the parameters used (bottom half).
In the top half, we see the response header, including the server banner and the session cookie, and in the bottom, we have the complete HTML response.
Like any other crawler, ZAP's spider follows every link it finds on every page within the requested scope and the links inside it. Additionally, the spider follows form responses, redirects, and URLs included in the robots.txt and sitemap.xml files, and then stores all requests and responses for later analysis and use.
After crawling a website or directory, we may want to use the stored requests to perform some tests. Using ZAP's capabilities, we will be able, among other things, to replay and modify those requests and to use them as the starting point for further scans.
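For example, ZAP also exposes a REST API on the same proxy port, which lets us retrieve the stored information from the command line. A minimal sketch, assuming the default address of 127.0.0.1:8080 (recent ZAP versions require an apikey parameter, which can be viewed or disabled under Tools | Options | API), would be:
curl "http://127.0.0.1:8080/JSON/core/view/urls/"
This returns, in JSON format, the list of URLs that ZAP recorded while proxying and spidering the site.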
In this article, we studied crawlers and spiders, two of the most important tools in the reconnaissance phase of a web penetration test.