8 min read

(For more resources related to this topic, see here.)

 

With the knowledge you have gained, we are now ready to develop our first bot, which will be a simple bot that gathers data (documents) based on a list of URLs and datasets (field and field values) that we will require.

First, let’s start by creating our bot package directory. So, create a directory called WebBot so that the files in our project_directory/lib directory look like the following:

'-- project_directory|-- lib | |-- HTTP (our existing HTTP package) | | '-- (HTTP package files here) | '-- WebBot | |-- bootstrap.php| |-- Document.php | '-- WebBot.php |-- (our other files)'-- 03_webbot.php

As you can see, we have a very clean and simple directory and file structure that any programmer should be able to easily follow and understand.

The WebBot class

Next, open the file WebBot.php file and add the code from the project_directory/lib/WebBot/WebBot.php file:

In our WebBot class, we first use the __construct() method to pass the array of URLs (or documents) we want to fetch, and the array of document fields are used to define the datasets and regular expression patterns. Regular expression patterns are used to populate the dataset values (or document field values). If you are unfamiliar with regular expressions, now would be a good time to study them. Then, in the __construct() method, we verify whether there are URLs to fetch or not. If there , we set an error message stating this problem.

Next, we use the __formatUrl() method to properly format URLs we fetch data. This method will also set the correct protocol: either HTTP or HTTPS ( Hypertext Transfer Protocol Secure ). If the protocol is already set for the URL, for example http://www.[dom].com, we ignore setting the protocol. Also, if the class configuration setting conf_force_https is set to true, we force the HTTPS protocol again unless the protocol is already set for the URL.

We then use the execute() method to fetch data for each URL, set and add the Document objects to the array of documents, and track document statistics. This method also implements fetchdelay logic that will delay each fetch by x number of seconds if set in the class configuration settings conf_delay_between_fetches. We also include the logic that only allows distinct URL fetches, meaning that, if we have already fetched data for a URL we won’t fetch it again; this eliminates duplicate URL data fetches. The Document object is used as a container for the URL data, and we can use the Document object to use the URL data, the data fields, and their corresponding data field values.

In the execute() method, you can see that we have performed a HTTPRequest::get() request using the URL and our default timeout value—which is set with the class configuration settings conf_default_timeout. We then pass the HTTPResponse object that is returned by the HTTPRequest::get() method to the Document object. Then, the Document object uses the data from the HTTPResponse object to build the document data.

Finally, we include the getDocuments() method, which simply returns all the Document objects in an array that we can use for our own purposes as we desire.

The WebBot Document class

Next, we need to create a class called Document that can be used to store document data and field names with their values. To do this we will carry out the following steps:

  1. We first pass the data retrieved by our WebBot class to the Document class.
  2. Then, we define our document’s fields and values using regular expression patterns.
  3. Next, add the code from the project_directory/lib/WebBot/Document.php file.

    Our Document class accepts the HTTPResponse object that is set in WebBot class’s execute() method, and the document fields and document ID.

  4. In the Document __construct() method, we set our class properties: the HTTP Response object, the fields (and regular expression patterns), the document ID, and the URL that we use to fetch the HTTP response.
  5. We then check if the HTTP response successful (status code 200), and if it isn’t, we set the error with the status code and message.
  6. Lastly, we call the __setFields() method.

The __setFields() method parses out and sets the field values from the HTTP response body. For example, if in our fields we have a title field defined as $fields = [‘title’ => ‘<title>(.*)</title>’];, the __setFields() method will add the title field and pull all values inside the <title>*</title> tags into the HTML response body. So, if there were two title tags in the URL data, the __setField() method would add the field and its values to the document as follows:

['title'] => [ 0 => 'title x', 1 => 'title y' ]

If we have the WebBot class configuration variable—conf_include_document_field_raw_values—set to true, the method will also add the raw values (it will include the tags or other strings as defined in the field’s regular expression patterns) as a separate element, for example:

['title'] => [ 0 => 'title x', 1 => 'title y', 'raw' => [ 0 => '<title>title x</title>', 1 => '<title>title y</title>' ] ]

The preceding code is very useful when we want to extract specific data (or field values) from URL data.

To conclude the Document class, we have two more methods as follows:

  • getFields(): This method simply returns the fields and field values
  • getHttpResponse(): This method can be used to get the HTTPResponse object that was originally set by the WebBot execute() method

This will allow us to perform logical requests to internal objects if we wish.

The WebBot bootstrap file

Now we will create a bootstrap.php file (at project_directory/lib/WebBot/) to load the HTTP package and our WebBot package classes, and set our WebBot class configuration settings:

<?php namespace WebBot; /** * Bootstrap file * * @package WebBot */ // load our HTTP package require_once './lib/HTTP/bootstrap.php'; // load our WebBot package classes require_once './lib/WebBot/Document.php'; require_once './lib/WebBot/WebBot.php'; // set unlimited execution time set_time_limit(0); // set default timeout to 30 seconds WebBotWebBot::$conf_default_timeout = 30; // set delay between fetches to 1 seconds WebBotWebBot::$conf_delay_between_fetches = 1; // do not use HTTPS protocol (we'll use HTTP protocol) WebBotWebBot::$conf_force_https = false; // do not include document field raw values WebBotWebBot::$conf_include_document_field_raw_values = false;

We use our HTTP package to handle HTTP requests and responses. You have seen in our WebBot class how we use HTTP requests to fetch the data, and then use the HTTP Response object to store the fetched data in the previous two sections. That is why we need to include the bootstrap file to load the HTTP package properly.

Then, we load our WebBot package files. Because our WebBot class uses the Document class, we load that class file first.

Next, we use the built-in PHP function set_time_limit() to tell the PHP interpreter that we want to allow unlimited execution time for our script. You don’t necessarily have to use unlimited execute time. However, for testing reasons, we will use unlimited execution time for this example.

Finally, we set the WebBot class configuration settings. These settings are used by the WebBot object internally to make our bot work as we desire. We should always make the configuration settings as simple as possible to help other developers understand. This means we should also include detailed comments in our code to ensure easy usage of package configuration settings.

We have set up four configuration settings in our WebBot class. These are static and public variables, meaning that we can set them from anywhere after we have included the WebBot class, and once we set them they will remain the same for all WebBot objects unless we change the configuration variables. If you do not understand the PHP keyword static, now would be a good time to research this subject.

  • The first configuration variable is conf_default_timeout. This variable is used to globally set the default timeout (in seconds) for all WebBot objects we create. The timeout value tells the HTTPRequest class how long it continue trying to send a request before stopping and deeming it as a bad request, or a timed-out request. By default, this configuration setting value is set to 30 (seconds).
  • The second configuration variable—conf_delay_between_fetches—is used to set a time delay (in seconds) between fetches (or HTTP requests). This can be very useful when gathering a lot of data from a website or web service. For example, say, you had to fetch one million documents from a website. You wouldn’t want to unleash your bot with that type of mission without fetch delays because you could inevitably cause—to that website—problems due to massive requests. By default, this value is set to 0, or no delay.
  • The third WebBot class configuration variable—conf_force_https—when set to true, can be used to force the HTTPS protocol. As mentioned earlier, this will not override any protocol that is already set in the URL. If the conf_force_https variable is set to false, the HTTP protocol will be used. By default, this value is set to false.
  • The fourth and final configuration variable—conf_include_document_field_raw_values—when set to true, will force the Document object to include the raw values gathered from the ‘ regular expression patterns. We’ve discussed configuration settings in detail in the WebBot Document Class section earlier in this article. By default, this value is set to false.

Summary

In this article you have learned how to get started with building your first bot using HTTP requests and responses.

Resources for Article :


Further resources on this subject:


LEAVE A REPLY

Please enter your comment!
Please enter your name here