Got questions on Sphinx, the open source search engine? Not sure if it’s the right tool for you? You’re in the right place – we’ve put together an FAQ on Sphinx. It should help you make the right decision about the software that powers your search.

If you’ve got questions on the other kind of Sphinx, we recommend you look here instead.

What is Sphinx?

Sphinx is a full-text search engine, generally run standalone, that provides fast, relevant, and efficient full-text search functionality to third-party applications. It was created especially to facilitate searches on SQL databases and integrates very well with languages such as PHP, Python, Perl, Ruby, and Java.

What are the major features of Sphinx?

Some of the major features of Sphinx include:

  • High indexing speed (up to 10 MB/sec on modern CPUs)
  • High search speed (average query is under 0.1 sec on 2 to 4 GB of text collection)
  • High scalability (up to 100 GB of text, up to 100 Million documents on a single CPU)
  • Supports distributed searching (since v.0.9.6)
  • Supports MySQL (MyISAM and InnoDB tables are both supported) and PostgreSQL natively
  • Supports phrase searching
  • Supports phrase proximity ranking, providing good relevance
  • Supports English and Russian stemming
  • Supports any number of document fields (weights can be changed on the fly)
  • Supports document groups
  • Supports stopwords, that is, words from a given list of very common terms are left out of the index so that only the more relevant words are indexed
  • Supports different search modes (“match extended”, “match all”, “match phrase” and “match any” as of v.0.9.5)
  • Generic XML interface which greatly simplifies custom integration
  • Pure-PHP (that is, NO module compiling and so on) search client API

Which operating systems does Sphinx run on?

Sphinx was developed and tested mostly on UNIX-based systems. All modern UNIX-based operating systems with an ANSI-compliant compiler should be able to compile and run Sphinx without any issues. Sphinx has also been reported to run without problems on the following operating systems:

  • Linux (Kernel 2.4.x and 2.6.x of various distributions)
  • Microsoft Windows 2000 and XP
  • FreeBSD 4.x, 5.x, 6.x
  • NetBSD 1.6, 3.0
  • Solaris 9, 11
  • Mac OS X

What does the configure command do?

The configure command gathers details about the machine it runs on and checks for all dependencies. If any dependency is missing, it reports an error.

Which are the various options for the configure command?

There are many options that can be passed to the configure command but we will take a look at a few important ones:

  • --prefix=/path: This option specifies the path where the Sphinx binaries are installed.
  • --with-mysql=/path: Sphinx needs to know where to find MySQL’s include and library files. It auto-detects this most of the time, but if for any reason it fails, you can supply the path here.
  • --with-pgsql=/path: Same as --with-mysql but for PostgreSQL.
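
To make the options above concrete, here is a small Python sketch that assembles a configure command line from them. The function name and the paths are hypothetical and only for illustration; the three flags are the ones listed above.

```python
import shlex


def build_configure_command(prefix=None, mysql_path=None, pgsql_path=None):
    """Assemble the argument list for ./configure from the options above."""
    args = ["./configure"]
    if prefix:
        args.append(f"--prefix={prefix}")
    if mysql_path:
        args.append(f"--with-mysql={mysql_path}")
    if pgsql_path:
        args.append(f"--with-pgsql={pgsql_path}")
    return args


# Hypothetical paths; adjust them to your own system.
command = build_configure_command(
    prefix="/usr/local/sphinx",
    mysql_path="/usr/local/mysql",
)
print(shlex.join(command))
```

Options you leave out are simply not passed, so configure falls back to its defaults and its own auto-detection.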

What is full-text search?

Full-text search is one of the techniques for searching a document or database stored on a computer. While searching, the search engine examines all of the words stored in the document and tries to match the search query against those words. Because every word (the full text) of the document is examined, the technique is called full-text search.

Full-text search excels at searching large volumes of unstructured text quickly and effectively. It returns pages based on how well they match the user’s query.
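
The idea can be sketched with a toy inverted index in Python. This is a minimal illustration of the general technique, not how Sphinx is implemented internally; the stopword list, the sample documents, and all function names are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical stopword list: very common words that are not indexed.
STOPWORDS = {"the", "an", "a", "for", "and", "of"}


def tokenize(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


def build_index(documents):
    """Map every indexed word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents that contain every non-stopword in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


documents = {
    1: "The quick brown fox",
    2: "A quick search engine for the web",
    3: "Brown bears and the fox",
}
index = build_index(documents)
print(sorted(search_all(index, "quick fox")))
```

A query is answered by looking words up in the index and intersecting the id sets, so the documents themselves are never rescanned at search time.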

What are the advantages of full-text search?

The following points are some of the major advantages of full-text search:

  • It is quicker than traditional searches as it benefits from an index of words that is used to look up records instead of doing a full table scan
  • It gives results that can be sorted by relevance to the searched phrase or term, with sophisticated ranking capabilities to find the best documents or records
  • It performs very well on huge databases with millions of records
  • It can skip common words (stopwords) such as “the”, “an”, and “for”

When should you use full-text search?

You should use full-text search when:

  • There is a high volume of free-form text data to be searched
  • There is a need for highly optimized search results
  • There is a demand for flexible search querying

Why use Sphinx for full-text search?

If you’re looking for a good Database Management System (DBMS), there are plenty of options available with support for full-text indexing and searches, such as MySQL, PostgreSQL, and SQL Server. There are also external full-text search engines, such as Lucene and Solr. Let’s see the advantages of using Sphinx over the DBMS’s full-text searching capabilities and other external search engines:

  • It has a higher indexing speed. It is 50 to 100 times faster than MySQL FULLTEXT and 4 to 10 times faster than other external search engines.
  • It also has a higher searching speed, although this depends heavily on the mode (Boolean vs. phrase) and on any additional processing. It is up to 500 times faster than MySQL FULLTEXT in cases involving a large result set with GROUP BY, and more than two times faster in searching than other external search engines.
  • Relevancy is among the key features one expects when using a search engine, and Sphinx performs very well in this area. It has phrase-based ranking in addition to classic statistical BM25 ranking.
  • Last but not least, Sphinx has better scalability. It can be scaled vertically (utilizing many CPUs and many HDDs) or horizontally (utilizing many servers), and this comes out of the box with Sphinx. One of the biggest known Sphinx clusters holds over 3 billion records and more than 2 terabytes of data.

What are indexes?

Indexes in Sphinx are a bit different from the indexes we have in databases. The data that Sphinx indexes is a set of structured documents, and each document has the same set of fields. This is very similar to SQL, where each row in a table corresponds to a document and each column to a field. Sphinx builds a special data structure that is optimized for answering full-text search queries. This structure is called an index, and the process of creating an index from the data is called indexing. Indexes in Sphinx can also contain attributes that are highly optimized for filtering. These attributes are not full-text indexed and do not contribute to matching. However, they are very useful for narrowing the results down based on attribute values. There can be different types of indexes suited to different tasks. The index type currently implemented in Sphinx is designed for maximum indexing and searching speed.
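
The difference between fields and attributes can be shown with a small Python sketch. It only illustrates the concept under simplifying assumptions: the Document class, the sample data, and the attribute names author_id and year are hypothetical, and matching is done by a naive scan rather than a real index.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: int
    fields: dict                                     # full-text searched
    attributes: dict = field(default_factory=dict)  # used only for filtering


def matches(doc, word):
    """True if the word occurs in any full-text field of the document."""
    word = word.lower()
    return any(word in text.lower().split() for text in doc.fields.values())


def search(docs, word, **filters):
    """Match on fields first, then keep documents whose attributes equal the filters."""
    return [
        doc.doc_id
        for doc in docs
        if matches(doc, word)
        and all(doc.attributes.get(name) == value for name, value in filters.items())
    ]


docs = [
    Document(1, {"title": "Sphinx search basics", "body": "indexing with sphinx"},
             {"author_id": 7, "year": 2010}),
    Document(2, {"title": "Sphinx in production", "body": "scaling search"},
             {"author_id": 9, "year": 2011}),
    Document(3, {"title": "Cooking pasta", "body": "boil water"},
             {"author_id": 7, "year": 2011}),
]

print(search(docs, "sphinx"))
print(search(docs, "sphinx", year=2011))
```

The keyword decides which documents match; the attribute values only narrow that set down and never add to it.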

What are multi-value attributes (MVA)?

MVAs are a special type of attribute in Sphinx that make it possible to attach multiple values to every document. These attributes are especially useful in cases where each document can have multiple values for the same property (field).
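
As a rough Python illustration, think of blog posts where each post carries several tag ids. The data, the attribute name tag_ids, and the function below are hypothetical, and the sketch assumes that a filter accepts a document when any of its values is in the wanted set.

```python
# Each document has one multi-value attribute: a set of tag ids.
posts = {
    1: {"title": "Intro to search", "tag_ids": {1, 5}},
    2: {"title": "Scaling out", "tag_ids": {2}},
    3: {"title": "Ranking tips", "tag_ids": {5, 7}},
}


def filter_by_mva(docs, attr, wanted):
    """Keep documents whose multi-value attribute shares at least one value with wanted."""
    wanted = set(wanted)
    return sorted(doc_id for doc_id, doc in docs.items() if doc[attr] & wanted)


print(filter_by_mva(posts, "tag_ids", [5]))
print(filter_by_mva(posts, "tag_ids", [2, 7]))
```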

How does weighting help?

Weighting decides which documents get priority over others and appear at the top of the results. In Sphinx, weighting depends on the search mode. Weight can also be referred to as ranking. There are two major parts used in weighting functions:

  • Phrase rank: This is based on the length of the Longest Common Subsequence (LCS) of search words between the document body and the query phrase. This means that documents in which the queried phrase matches perfectly will have the highest phrase rank, equal to the number of words in the query.
  • Statistical rank: This is based on the BM25 function, which takes only word frequencies into account. A word that appears only once in a document contributes little weight, while a word that appears many times in the document contributes more; words that are rare across the whole collection also weigh more than very common ones. The BM25 weight is a floating point number between 0 and 1.
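
The phrase rank idea can be sketched in Python as follows. This is a simplified illustration, not Sphinx's actual ranking code: it assumes that the common sequence must be contiguous and in the same order in both the query and the document, and the function name is made up for the example.

```python
def phrase_rank(query_words, doc_words):
    """Length of the longest run of query words found contiguously, in order, in the document."""
    best = 0
    # prev[j] holds the length of the common run ending at the previous
    # query word and at document word j - 1.
    prev = [0] * (len(doc_words) + 1)
    for query_word in query_words:
        cur = [0] * (len(doc_words) + 1)
        for j, doc_word in enumerate(doc_words, start=1):
            if query_word == doc_word:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


query = "quick brown fox".split()
print(phrase_rank(query, "the quick brown fox jumps".split()))
print(phrase_rank(query, "quick red fox brown".split()))
```

A document containing the exact phrase scores as many points as the query has words, while a document that merely contains the same words scattered around scores lower.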

What is index merging, exactly?

Index merging means combining two existing indexes into one, which is more efficient than indexing all the data from scratch. In this technique we define a delta index in the Sphinx configuration file. The delta index always receives the new data to be indexed, while the main index acts as an archive and holds data that never changes.
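
Here is a toy Python model of the main plus delta scheme. It is only a sketch of the idea: the indexes are plain dictionaries rather than real Sphinx index files, the function names are hypothetical, and it assumes the main and delta indexes never contain the same document id.

```python
def index_rows(rows):
    """Build a toy index mapping each word to the set of row ids containing it."""
    index = {}
    for row_id, text in rows.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(row_id)
    return index


def merge(main, delta):
    """Fold the delta index into a copy of the main index without re-reading any rows."""
    merged = {word: set(ids) for word, ids in main.items()}
    for word, ids in delta.items():
        merged.setdefault(word, set()).update(ids)
    return merged


rows = {1: "old news", 2: "old archive", 3: "fresh news"}
last_indexed_id = 2  # rows up to this id are already in the main index

main = index_rows({i: t for i, t in rows.items() if i <= last_indexed_id})
delta = index_rows({i: t for i, t in rows.items() if i > last_indexed_id})

merged = merge(main, delta)
print(sorted(merged["news"]))
```

Only the small delta has to be rebuilt when new data arrives; the merge then produces the same result as indexing everything again.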

What is SphinxQL?

Programmers normally issue search queries using one or more client libraries that relate to the database on which the search is to be performed. Some programmers may also find it easier to write an SQL query than to use the Sphinx Client API library.

SphinxQL is used to issue search queries in the form of SQL queries. These queries can be fired from any client of the database in question, and they return results in the same way a normal query would. Currently, the MySQL binary network protocol is supported, which enables Sphinx to be accessed with the regular MySQL API.
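
As a hedged illustration, the Python sketch below builds a simple SphinxQL statement with a placeholder for the search text. The index name articles and the helper function are hypothetical. Actually sending the query is left as a comment, because it needs a running Sphinx server and a MySQL client library, and the host and port depend on your own configuration.

```python
import re


def build_search_query(index_name, text, limit=10):
    """Return a SphinxQL statement and its parameters for a full-text match."""
    # Index names cannot be passed as parameters, so validate them strictly.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", index_name):
        raise ValueError(f"invalid index name: {index_name!r}")
    sql = f"SELECT id FROM {index_name} WHERE MATCH(%s) LIMIT {int(limit)}"
    return sql, (text,)


sql, params = build_search_query("articles", "full text search")
print(sql)
print(params)

# With a MySQL client library (an assumption, for example a DB-API driver),
# the statement would be executed like any other query:
#     cursor.execute(sql, params)
#     rows = cursor.fetchall()
```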

What do you mean by Geo-distance search?

In a geo-distance search, you find the geographic coordinates that lie near a base anchor point. You can use this technique to find places close to a given location. It is useful in many applications, such as hotel search, property search, restaurant search, and tourist destination search.

Sphinx makes it very easy to perform a geo-distance search by providing an API method wherein you can set the anchor point (if you have latitude and longitude in your index) and all searches performed thereafter will return the results with a magic attribute “@geodist” holding the values of distance from the anchor point. You can then filter or sort your results based on this attribute.
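
To show what such a distance attribute represents, here is a self-contained Python sketch that computes distances from an anchor point, then filters and sorts by them. It does not call the Sphinx API. The haversine formula, the Earth radius value, the result in meters, and the sample places are all assumptions chosen for the example.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (assumed value)


def geodist(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby(places, anchor, max_distance_m):
    """Return (name, distance) pairs within max_distance_m, nearest first."""
    results = []
    for name, (lat, lon) in places.items():
        distance = geodist(anchor[0], anchor[1], lat, lon)
        if distance <= max_distance_m:
            results.append((name, distance))
    return sorted(results, key=lambda item: item[1])


# Hypothetical hotels as (latitude, longitude) in degrees.
places = {
    "Hotel A": (48.8606, 2.3376),
    "Hotel B": (48.8530, 2.3499),
    "Hotel C": (51.5074, -0.1278),
}
anchor = (48.8584, 2.2945)

for name, distance in nearby(places, anchor, 5000):
    print(f"{name}: {distance:.0f} m")
```

Filtering by a maximum distance and sorting by the computed value mirrors what you would do with the distance attribute returned by the search.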

 

