Sphinx: Index Searching

Sphinx Search Beginner’s Guide

Implement full-text search with lightning speed and accuracy using Sphinx

Client API implementations for Sphinx

Sphinx comes with a number of native searchd client API implementations. Some third-party open source implementations for Perl, Ruby, and C++ are also available.

All APIs provide the same set of methods and implement the same network protocol. As a result, they all work in a similar fashion.

All examples in this article use the PHP implementation of the Sphinx API. However, you can just as easily use other programming languages. Sphinx is used with PHP more widely than with any other language.

Search using client API

Let’s see how we can use the native PHP implementation of the Sphinx API to search. We will add a searchd section to the configuration and then create a PHP file that searches the index using the Sphinx client API implementation for PHP.

Time for action – creating a basic search script

  1. Add the searchd config section to /usr/local/sphinx/etc/sphinx-blog.conf:

    source blog {
        # source options
    }

    index posts {
        # index options
    }

    indexer {
        # indexer options
    }

    # searchd options (used by search daemon)
    searchd
    {
        listen = 9312
        log = /usr/local/sphinx/var/log/searchd.log
        query_log = /usr/local/sphinx/var/log/query.log
        max_children = 30
        pid_file = /usr/local/sphinx/var/log/searchd.pid
    }

  2. Start the searchd daemon (as root user):
    $ sudo /usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/sphinx-blog.conf

  3. Copy the sphinxapi.php file (the class with PHP implementation of Sphinx API) from the sphinx source directory to your working directory:
    $ mkdir /path/to/your/webroot/sphinx
    $ cd /path/to/your/webroot/sphinx
    $ cp /path/to/sphinx-0.9.9/api/sphinxapi.php ./
  4. Create a simple_search.php script that uses the PHP client API class to search the Sphinx-blog index, and execute it in the browser:
    <?php
    require_once('sphinxapi.php');
    // Instantiate the sphinx client
    $client = new SphinxClient();
    // Set search options
    $client->SetServer('localhost', 9312);
    $client->SetConnectTimeout(1);
    $client->SetArrayResult(true);

    // Query the index
    $results = $client->Query('php');

    // Output the matched results in raw format
    print_r($results['matches']);
  5. The output of the given code, as seen in a browser, will be similar to what’s shown in the following screenshot:

    [Screenshot: browser output of simple_search.php]

What just happened?

First, we added the searchd configuration section to our sphinx-blog.conf file. The following options were added to the searchd section:

  • listen: This option specifies the IP address and port that searchd will listen on. It can also specify a Unix-domain socket path. This option was introduced in v0.9.9 and should be used instead of the deprecated port option. If the port part is omitted, the default port 9312 is used.

    Examples:

    • listen = localhost
    • listen = 9312
    • listen = localhost:9898
    • listen = 192.168.1.25:4000
    • listen = /var/run/sphinx.s
  • log: Name of the file where all searchd runtime events will be logged. This is an optional setting and the default value is “searchd.log”.
  • query_log: Name of the file where all search queries will be logged. This is an optional setting and the default value is empty, that is, do not log queries.
  • max_children: The maximum number of concurrent searches to run in parallel. This is an optional setting and the default value is 0 (unlimited).
  • pid_file: Name of the file that holds the searchd process ID. This is a mandatory setting. The file is created on startup and contains the head daemon process ID while the daemon is running. It is unlinked when the daemon stops.
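
To make the listen forms listed above concrete, here is a minimal PHP sketch that classifies a listen value the way the description reads: a leading slash means a Unix-domain socket, a bare number is a port, and a missing port falls back to 9312. The function name describe_listen() is hypothetical and is not part of Sphinx; it handles only the forms shown in the examples above.

```php
<?php
// Hypothetical helper (not part of Sphinx): classifies a listen value
// according to the forms described in the text.
function describe_listen(string $value): array
{
    $value = trim($value);

    // A value starting with "/" is treated as a Unix-domain socket path.
    if ($value !== '' && $value[0] === '/') {
        return array('type' => 'unix', 'path' => $value);
    }

    // A bare number is a port; no specific host is given.
    if (preg_match('/^\d+$/', $value)) {
        return array('type' => 'tcp', 'host' => null, 'port' => (int) $value);
    }

    // Otherwise host or host:port; the port defaults to 9312 when omitted.
    $parts = explode(':', $value, 2);
    $port  = isset($parts[1]) ? (int) $parts[1] : 9312;

    return array('type' => 'tcp', 'host' => $parts[0], 'port' => $port);
}

print_r(describe_listen('localhost'));
print_r(describe_listen('192.168.1.25:4000'));
```

The sketch is only a reading aid for the configuration syntax; searchd does its own parsing.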

Once we had added the searchd configuration options, we started the searchd daemon as the root user, passing the path of the configuration file as an argument. If no configuration file is specified, searchd uses /usr/local/sphinx/etc/sphinx.conf by default.

After a successful startup, searchd listens on all network interfaces, including all the configured network cards on the server, at port 9312. If we want searchd to listen on a specific interface then we can specify the hostname or IP address in the value of the listen option:

listen = 192.168.1.25:9312

The listen setting defined in the configuration file can be overridden in the command line while starting searchd by using the -l command line argument.

There are other (optional) arguments that can be passed to searchd as seen in the following screenshot:

[Screenshot: searchd command-line arguments]

searchd needs to be running all the time when we are using the client API. The first thing you should always check is whether searchd is running or not, and start it if it is not running.
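
One way to automate that check from PHP is to try opening a TCP connection to the host and port that searchd listens on. The sketch below uses PHP's built-in fsockopen(); the function name searchd_is_reachable() is hypothetical, and the host and port are assumed to match the configuration used in this article. It only shows that something accepts connections on that port, not that the process is actually searchd.

```php
<?php
// Hypothetical helper: returns true if a TCP connection to $host:$port
// can be opened within $timeout seconds.
function searchd_is_reachable(string $host = 'localhost', int $port = 9312, float $timeout = 1.0): bool
{
    $errno  = 0;
    $errstr = '';

    // Suppress the warning fsockopen() emits when the connection is refused.
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        return false;
    }

    fclose($socket);
    return true;
}

if (!searchd_is_reachable()) {
    echo "Nothing is listening on localhost:9312; start searchd first.\n";
}
```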

We then created a PHP script to search the sphinx-blog index. To search a Sphinx index, we need to use the Sphinx client API. As we are working with a PHP script, we copied the PHP client implementation class (sphinxapi.php), which ships with the Sphinx source, to our working directory so that we can include it in our script. However, you can keep this file anywhere on the file system as long as you can include it in your PHP script.

Throughout this article we will use /path/to/your/webroot/sphinx as the working directory and create all PHP scripts there. We will refer to this directory simply as webroot.

We initialized the SphinxClient class and then used the following class methods to set up the Sphinx client API:

  • SphinxClient::SetServer($host, $port): This method sets the searchd hostname and port. All subsequent requests use these settings unless this method is called again with different parameters. The default host is localhost and the default port is 9312.
  • SphinxClient::SetConnectTimeout($timeout): This sets the maximum time allowed to spend trying to connect to the server before giving up.
  • SphinxClient::SetArrayResult($arrayresult): This is a PHP client API-specific method. It specifies whether the matches should be returned as an array or a hash. The default value is false, which means that matches will be returned in a PHP hash format, where document IDs are the keys and other information (attributes, weight) are the values. If $arrayresult is true, the matches will be returned as a plain array with complete per-match information.
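
The difference between the two result formats is easiest to see side by side. The following sketch uses hand-written sample data shaped like the matches element in each mode; the document IDs, weights, and the author_id attribute are made up for illustration, and match_ids() is a hypothetical helper rather than part of the API.

```php
<?php
// Sample data only: shape of 'matches' when SetArrayResult(false) is in effect.
// Document IDs are the keys.
$hash_matches = array(
    7 => array('weight' => 2, 'attrs' => array('author_id' => 1)),
    3 => array('weight' => 1, 'attrs' => array('author_id' => 2)),
);

// Sample data only: shape of 'matches' when SetArrayResult(true) is in effect.
// Each entry carries its own 'id'.
$array_matches = array(
    array('id' => 7, 'weight' => 2, 'attrs' => array('author_id' => 1)),
    array('id' => 3, 'weight' => 1, 'attrs' => array('author_id' => 2)),
);

// Hypothetical helper: extracts the document IDs from either format.
function match_ids(array $matches, bool $arrayresult): array
{
    if (!$arrayresult) {
        return array_keys($matches);
    }
    $ids = array();
    foreach ($matches as $match) {
        $ids[] = $match['id'];
    }
    return $ids;
}

print_r(match_ids($hash_matches, false));
print_r(match_ids($array_matches, true));
```

Both calls yield the same list of IDs; the array format is simply easier to iterate over when each match should be self-describing.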

After that, querying the index was straightforward using the SphinxClient::Query($query) method. It returned an array with the matched results, as well as other information such as the error, fields in the index, attributes in the index, total records found, time taken for the search, and so on. The actual results are in the $results['matches'] variable. We can loop over the results, and it is a simple job to fetch the actual document's content by document ID and display it.
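
As a rough sketch of how such a result could be handled, the snippet below walks over a hand-written array shaped like the value Query() returns when SetArrayResult(true) is set. The sample values are invented, and summarize_results() is a hypothetical helper. With a real client, Query() returns false on failure and GetLastError() reports the reason.

```php
<?php
// Hypothetical helper: turns a Query()-style result into a one-line summary.
function summarize_results($results): string
{
    if ($results === false) {
        return 'Query failed';
    }
    if (empty($results['matches'])) {
        return 'No results found';
    }

    // Collect the document IDs of the returned matches.
    $ids = array();
    foreach ($results['matches'] as $match) {
        $ids[] = $match['id'];
    }

    return sprintf(
        '%d of %d matches shown (%s sec): %s',
        count($ids),
        $results['total_found'],
        $results['time'],
        implode(', ', $ids)
    );
}

// Sample data only, shaped like a result in array mode.
$sample = array(
    'error'       => '',
    'warning'     => '',
    'total'       => 2,
    'total_found' => 2,
    'time'        => '0.001',
    'matches'     => array(
        array('id' => 7, 'weight' => 2, 'attrs' => array()),
        array('id' => 3, 'weight' => 1, 'attrs' => array()),
    ),
);

echo summarize_results($sample), "\n";
```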

Matching modes

When a full-text search is performed on the Sphinx index, different matching modes can be used by Sphinx to find the results. The following matching modes are supported by Sphinx:

  • SPH_MATCH_ALL—This is the default mode and it matches all query words, that is, only records that match all of the queried words will be returned.
  • SPH_MATCH_ANY—This matches any of the query words.
  • SPH_MATCH_PHRASE—This matches query as a phrase and requires a perfect match.
  • SPH_MATCH_BOOLEAN—This matches query as a Boolean expression.
  • SPH_MATCH_EXTENDED—This matches query as an expression in Sphinx internal query language.
  • SPH_MATCH_EXTENDED2—This matches query using the second version of Extended matching mode. This supersedes SPH_MATCH_EXTENDED as of v0.9.9.
  • SPH_MATCH_FULLSCAN—In this mode the query terms are ignored and no text-matching is done, but filters and grouping are still applied.
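
To get a feel for how the first three modes differ, here is a toy illustration in plain PHP that applies "all", "any", and "phrase" semantics to a single string. It is not how Sphinx matches internally (there is no index, stemming, or ranking here), and the function names demo_tokenize() and demo_matches() are hypothetical.

```php
<?php
// Toy tokenizer: lowercases and splits on non-word characters.
function demo_tokenize(string $text): array
{
    return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Toy matcher illustrating the semantics of three matching modes.
function demo_matches(string $query, string $document, string $mode): bool
{
    $q = demo_tokenize($query);
    $d = demo_tokenize($document);

    switch ($mode) {
        case 'all':    // every query word must occur in the document
            return count(array_diff($q, $d)) === 0;
        case 'any':    // at least one query word must occur
            return count(array_intersect($q, $d)) > 0;
        case 'phrase': // query words must occur consecutively, in order
            $n = count($q);
            for ($i = 0; $i + $n <= count($d); $i++) {
                if (array_slice($d, $i, $n) === $q) {
                    return true;
                }
            }
            return false;
    }
    throw new InvalidArgumentException("Unknown mode: $mode");
}

$doc = 'PHP is a popular scripting language for web programming';
var_dump(demo_matches('php programming', $doc, 'all'));
var_dump(demo_matches('php programming', $doc, 'phrase'));
var_dump(demo_matches('scripting language', $doc, 'phrase'));
```

In this sample, "php programming" matches in "all" mode because both words occur, but not in "phrase" mode because they are not adjacent; "scripting language" matches as a phrase.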

Time for action – searching with different matching modes

  1. Create a PHP script display_results.php in your webroot with the following code:

    <?php
    // Database connection credentials
    $dsn = 'mysql:dbname=myblog;host=localhost';
    $user = 'root';
    $pass = '';

    // Instantiate the PDO (PHP 5 specific) class
    try {
        $dbh = new PDO($dsn, $user, $pass);
    } catch (PDOException $e) {
        // Without a connection the statements below cannot be prepared
        echo 'Connection failed: ' . $e->getMessage();
        exit;
    }
    // PDO statement to fetch the post data
    $query = "SELECT p.*, a.name FROM posts AS p " .
        "LEFT JOIN authors AS a ON p.author_id = a.id " .
        "WHERE p.id = :post_id";
    $post_stmt = $dbh->prepare($query);

    // PDO statement to fetch the post's categories
    $query = "SELECT c.name FROM posts_categories AS pc " .
        "LEFT JOIN categories AS c ON pc.category_id = c.id " .
        "WHERE pc.post_id = :post_id";
    $cat_stmt = $dbh->prepare($query);

    // Function to display the results in a nice format
    function display_results($results, $message = null)
    {
        global $post_stmt, $cat_stmt;
        if ($message) {
            print "<h3>$message</h3>";
        }
        if (!isset($results['matches'])) {
            print "No results found<hr />";
            return;
        }
        foreach ($results['matches'] as $result) {
            // Get the data for this document (post) from db
            $post_stmt->bindParam(':post_id',
                $result['id'],
                PDO::PARAM_INT);
            $post_stmt->execute();
            $post = $post_stmt->fetch(PDO::FETCH_ASSOC);

            // Get the categories of this post
            $cat_stmt->bindParam(':post_id',
                $result['id'],
                PDO::PARAM_INT);
            $cat_stmt->execute();
            $categories = $cat_stmt->fetchAll(PDO::FETCH_ASSOC);

            // Output title, author and categories
            print "Id: {$post['id']}<br />" .
                "Title: {$post['title']}<br />" .
                "Author: {$post['name']}";
            $cats = array();
            foreach ($categories as $category) {
                $cats[] = $category['name'];
            }
            if (count($cats)) {
                print "<br />Categories: " . implode(', ', $cats);
            }
            print "<hr />";
        }
    }

  2. Create a PHP script search_matching_modes.php in your webroot with the following code:
    <?php
    // Include the api class
    require_once('sphinxapi.php');
    // Include the file which contains the function to display results
    require_once('display_results.php');

    $client = new SphinxClient();
    // Set search options
    $client->SetServer('localhost', 9312);
    $client->SetConnectTimeout(1);
    $client->SetArrayResult(true);

    // SPH_MATCH_ALL mode will be used by default
    // and we need not set it explicitly
    display_results(
        $client->Query('php'),
        '"php" with SPH_MATCH_ALL');

    display_results(
        $client->Query('programming'),
        '"programming" with SPH_MATCH_ALL');

    display_results(
        $client->Query('php programming'),
        '"php programming" with SPH_MATCH_ALL');

    // Set the mode to SPH_MATCH_ANY
    $client->SetMatchMode(SPH_MATCH_ANY);

    display_results(
        $client->Query('php programming'),
        '"php programming" with SPH_MATCH_ANY');

    // Set the mode to SPH_MATCH_PHRASE
    $client->SetMatchMode(SPH_MATCH_PHRASE);

    display_results(
        $client->Query('php programming'),
        '"php programming" with SPH_MATCH_PHRASE');

    display_results(
        $client->Query('scripting language'),
        '"scripting language" with SPH_MATCH_PHRASE');

    // Set the mode to SPH_MATCH_FULLSCAN
    // (the query terms are ignored in this mode)
    $client->SetMatchMode(SPH_MATCH_FULLSCAN);

    display_results(
        $client->Query('php'),
        '"php" with SPH_MATCH_FULLSCAN');
  3. Execute search_matching_modes.php in a browser (http://localhost/sphinx/search_matching_modes.php).

 
