As part of the reconnaissance phase of every web penetration test, we will need to browse every link included in a web page and keep a record of every file it displays. There are tools that help us automate and accelerate this task; they are called web crawlers or web spiders. These tools browse a web page, follow all links and references to external files, sometimes fill in forms and send them to the server, save all requests and responses made, and give us the opportunity to analyze them offline.
In this article by Gilberto Najera Gutierrez, the author of Kali Linux Web Penetration Testing Cookbook, we will cover the use of some crawlers included in Kali Linux.
Wget is part of the GNU project and is included in most major Linux distributions, including Kali Linux. It has the ability to recursively download a web page for offline browsing, including the conversion of links and downloading of non-HTML files.
In this recipe, we will use wget to download the pages associated with an application on our vulnerable_vm.
All recipes in this article require the vulnerable_vm to be running. In this particular scenario, it will have the IP address 192.168.56.102.
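Before starting, it is worth confirming that the target application is reachable from our Kali Linux machine. As a quick check, assuming the IP address above, a HEAD request to the BodgeIt application should return an HTTP response such as 200 OK:
curl -I http://192.168.56.102/bodgeit/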
wget http://192.168.56.102/bodgeit/
As you can see, it only downloaded the index.html file, which is the starting page of the application, to the current directory.
We will need to use some options to tell wget to save all downloaded files to a specific directory and to copy all the files contained in the URL we set as a parameter. Let's first create a directory to save the files:
mkdir bodgeit_offline
wget -r -P bodgeit_offline/ http://192.168.56.102/bodgeit/
As mentioned earlier, wget is a tool created to download HTTP content. With the -r parameter, we made it act recursively, that is, follow all the links in every page it downloads and download those pages as well. The -P option allows us to set the directory prefix, that is, the directory where wget will start saving the downloaded content; by default, this is set to the current path.
There are other useful options to consider when using wget, such as limiting the recursion depth, converting links for offline browsing, downloading page requisites (images and stylesheets), and waiting between requests; a combined example is shown below.
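As a minimal sketch (assuming the same target URL and output directory as before), the following command limits the recursion to two levels with -l, converts links for local browsing with -k, downloads page requisites with -p, and waits one second between requests with -w:
wget -r -l 2 -k -p -w 1 -P bodgeit_offline/ http://192.168.56.102/bodgeit/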
As stated on HTTrack's official website (http://www.httrack.com):
It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer.
In this recipe, we will be using HTTrack to download the entire content of an application’s site.
HTTrack is not installed by default in Kali Linux, so we will need to install it:
apt-get update
apt-get install httrack
mkdir bodgeit_httrack
cd bodgeit_httrack
httrack http://192.168.56.102/bodgeit/
It is important to include the trailing /; if it is omitted, httrack will return a 404 error because there is no bodgeit file in the root of the server.
HTTrack creates a full static copy of the site; this means that all dynamic content, such as responses to forms or other user inputs, won't be available. Inside the folder where we downloaded the site, we can see the following structure:
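A quick way to inspect the result is to list the directory we created (we are still inside bodgeit_httrack after the cd above). The exact contents depend on the HTTrack version, but we would typically expect a project index.html, an hts-cache directory, an hts-log.txt log file, and a folder named after the host (192.168.56.102) containing the mirrored pages:
ls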
HTTrack also has an extensive collection of options that allow us to customize its behavior to better fit our requirements. Some useful modifiers control the mirror depth, the output directory, and which file types are accepted or refused; a combined example is shown below.
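As a rough sketch of how these modifiers combine (the output directory name bodgeit_httrack_filtered is just an example), -r3 limits the mirror depth to three levels, -O sets the output path, and the quoted +/- patterns accept or refuse matching file types:
httrack http://192.168.56.102/bodgeit/ -O bodgeit_httrack_filtered -r3 "+*.css" "+*.js" "-*.zip"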
Downloading a full site to a directory on our computer leaves us with a static copy of the information; this means that we have the output produced by the different requests, but we have neither the requests themselves nor the server's responses (headers, status codes, and so on). To keep a record of this information, we have spiders such as the one integrated into OWASP ZAP.
In this recipe, we will use ZAP’s spider to crawl a directory in our vulnerable_vm and check the information that it captures.
For this recipe, we need to have the vulnerable_vm running, OWASP ZAP running, and the browser configured to use ZAP as a proxy.
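ZAP's local proxy listens on 127.0.0.1:8080 by default, so the browser's proxy settings should point to that host and port. As a quick sanity check (assuming that default address), we can send a request through the proxy from the command line; if everything is configured correctly, it will print the status code and the request will appear in ZAP's Sites tree:
curl -s -o /dev/null -w "%{http_code}\n" -x http://127.0.0.1:8080 http://192.168.56.102/bodgeit/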
On the right-hand side of the panel, we can see the full request made, including the parameters used (bottom half).
In the top half, we see the response header, including the server banner and the session cookie, and in the bottom, we have the complete HTML response.
Like any other crawler, ZAP's spider follows every link it finds on every page within the requested scope and the links inside it. Additionally, the spider follows form responses, redirects, and URLs included in the robots.txt and sitemap.xml files, and then stores all requests and responses for later analysis and use.
After crawling a website or directory, we may want to use the stored requests to perform some tests. Using ZAP's capabilities, we will be able, among other things, to replay and modify those requests and to use them as the starting point for further scans.
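For example, ZAP also exposes a REST API on the same proxy port, which lets us retrieve the stored information from the command line. A minimal sketch, assuming the default address of 127.0.0.1:8080 (recent ZAP versions require an apikey parameter, which can be viewed or disabled under Tools | Options | API), would be:
curl "http://127.0.0.1:8080/JSON/core/view/urls/"
This returns, in JSON format, the list of URLs that ZAP recorded while proxying and spidering the site.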
In this article, we studied crawlers and spiders, two of the most important tools in the reconnaissance phase of a web penetration test.