Text Search Features
If you think that text search is just a basic thing and nothing more than returning results that matched words in a user query, then think again! There are many technical details that a good search implementation will give you control over to affect how well this fundamental capability works, like text analysis and relevancy ranking. But there are also a variety of ancillary features to look for that make a big difference such as result highlighting and faceting.
- Text analysis: This is the processing of the original text into indexed terms, and there’s a lot to it. Being able to configure the tokenization of words could mean that searching for “Mac” will be found if the word “MacBook” is in the text. And then there’s synonym processing so that users can search for similar words. You might want both a common language dictionary and also hand-picked ones for your data. There’s the ability to smartly handle desired languages instead of the pervasive English. And then there’s stemming which normalizes word variations so that for example “working” and “work” can be indexed the same. Yet another variation of text analysis is phonetic indexing to find words that sound-like the search query.
- Relevancy ranking: This is the logic behind ranking the search results that closest match the query. In Lucene/Solr, there are a variety of factors in an overall equation with the opportunity to adjust factors based on matching certain fields, certain documents, or using field values in a configurable equation. By comparison, the commercial Endeca platform allows configuration of a variety of matching rules that behaves more like a strict sort.
- Query features & syntax: From boolean logic to grouping to phrases to fuzzy-searches, to score boosting… there are a variety of queries that you might perform and combine. Many apps would prefer to hide this from users but some may wish to expose it for “advanced” searches.
- Result highlighting: Displaying a text snippet of a matched document containing the matched word in context is clearly very useful. We have all seen this in Google.
- Query spell correction (i.e. “did you mean”): Instead of unhelpfully returning no results (or very few), the search implementation should be able to try and suggest variation(s) of the search that will yield more results. This feature is customarily based on the actual indexed words and not a language dictionary. The variations might be based on the so-called edit-distance which is basically the number of alterations needed, or it might be based on phonetic matching.
- Faceted navigation: This is a must-have feature which enables search results to include aggregated values for designated fields that users can subsequently choose to filter the results on. It is commonly used on e-commerce sites to the left of the results to navigate products by various attributes like price ranges, vendors, etc.
- Term-suggest (AKA search auto-complete): As seen on Google, as you start typing a word in the search box, it suggests possible completions of the word. These are relevancy sorted and filtered to those that are also found with any words prior to the word you are typing.
- Sub-string indexing: In some cases, it is needed to match arbitrary sub-strings of words instead of being limited to complete words. Unlike what happens with an SQL like clause, the data is indexed in such a way for this to be quick.
- Geo-location search: Given the coordinates to a location on the globe with records containing such coordinates, you should be able to search for matching records from a user-specified coordinate. An extension to Solr allows a radial based search with appropriate ranking, but it is also straight-forward to box the search based on a latitude & longitude.
- Field/facet suggestions: The Endeca search platform can determine that your search query matches some field values used for faceting and then offer a convenient filter for them. For example, given a data set of employees, the search box could have a pop-up suggestion that the word in the search box matches a department code and then offer the choice of navigating to those matching records. This can be easier and faster than choosing facets to filter on, especially if there are a great number of facet-able fields. Solr doesn’t have this feature but it would not be a stretch to implement it based on its existing foundation.
- Clustering: This is another aid to navigating search results besides faceting. Search result clustering will dynamically divide the results into multiple groups called clusters, based on statistical correlation of terms in common. It is a bit of an exotic feature, but is useful with lots of results with lots of text information and after any faceted navigation is done if applicable.
So that’s quite a list and there are other features you may find too. This should give you a list of features to look for in whatever you choose. Some features are obviously more important to you than others.
How NOT to implement text search: the SQL “like” clause
Perhaps “back in the day” you implemented search by simply adding like “%searchword%” in the where clause of your SQL (the author is guilty as charged!) But of course, this has serious problems such as:
- It is very slow, especially given a data set of decent size and any amount of load on the server. A database does a simple brute-force search.
- There is no concept of relevancy (e.g. a match score). A record simply matched or not. You are forced to sort on one of your columns.
- It is too literal. Even if the search is case insensitive, any extra punctuation can screw it up, or it may match parts of words when you only wanted to match whole words.
So the bottom line is don’t do it! There are smarter approaches to this problem. Probably, the only situation you would do this is if you had a particular database column holding a limited number of short values and you have it indexed. Searches should go quickly and it’s very easy to implement this approach.