Why fuzzy search?
Fuzzy search is important for the cases where users do not know the exact spelling of the entities they are looking for, such as a person name, a restaurant name, a book title, or a disease. As example, few people know the exact spelling of our former California governor, Schwarrzeneggar, despite his huge popularity in Hollywood and politics. The increasing number of mobile devices and applications make fuzzy search more important due to the "fat fingers problem," i.e., it is painful and error prone to tap in keywords on a small mobile device. Being able to support fuzzy search can make it easier for users to find relevant results.
What is "fuzzy"?
Whenever we talk about fuzzy search, the first question we have to answer is: "What is the meaning of fuzzy?" There are different ways to measure the similarity and fuzziness between two keywords or strings, including edit distance (Levenshtein distance), cosine, jaccard, and dice. In this blog I will focus on edit distance, which is a commonly used method suitable for keyword typos. Formally, the distance distance between two strings is the minimum number of single character operations (insertion, deletion, and substitution) to transform one string to another string. For instance, the edit distance between "Schwarzenaggar" and "Schwarrzeneggar" is 2 (one deletion and one substitution).
A main problem in fuzzy search is: given a query keyword such as "Schwarzenaggar", how to find the keywords that are similar to the query keyword, i.e., their edit distance to the query keyword is within a threshold? A main challenge is that we need to do such a search very efficiently on large amounts of data (millions or tens of millions keywords).
One way to solve the problem is to use grams of strings. For a small integer q, the q-grams of a string are its substrings with the length q. For instance, the 3-grams of "Schwarrzeneggar" are "Sch", "chw", "hwa", "war", etc. An important observation is that, if two strings are very similar, they should have enough common grams. In our running example, the following are the 3-grams of the two strings:
- Schwarzenaggar: Sch, schw, hwa, war, rze, zen, ena, nag, agg, gga, gar
- Schwarrzeneggar: Sch, scw, hwa, war, arr, rrz, rze, zen, ene, neg, egg, gga, gar
Clearly these two sets share a lot of common grams. At UC Irvne, I have a research project called Flamingo, in which we have developed efficient indexes and search algorithms do fuzzy search using this method.
Gram-based methods are suitable for fuzzy search on complete keywords, but these methods are not suitable for fuzzy search on the fly as a user is typing a keyword. The main reason is that the prefix from the user could be short (e.g, 4-5 characters), and it does not have enough grams to give pruning power to find similar keywords.
Another class methods to do fuzzy search are based on trie. Intuitively, we use a tree structure called trie to index all the keywords. As a user types in one character in the keyword, we traverse the trie downward to reach the next trie node. When doing fuzzy search, we can find multiple trie nodes that are similar to the query prefix. Such nodes are called "active nodes." As the user types in one more character, we can use the active nodes of the prefix to incrementally compute the active nodes of the prefix. We have published a paper to describe this technique, which has been patented. A big advantage of this method is that it is very suitable for instant search, since the trie structure has a good pruning power for prefixes.
Fuzzy search in other systems
One way to solve the problem is to use grams of strings. For a small integer q, the q-grams of a
- Lucene: It supports fuzzy search by providing a special kind of queries called FuzzyQuery. Realizing the low performance of the old versions, there is an effort to make it faster in the new 4.0 release.
- Oracle: It supports fuzzy matching, but with a low performance.
Comparing Bimaple and Lucene on Instant Fuzzy Search
We published a white paper to compare our instant fuzzy search engine with Lucene. A conclusion is that our engine is 10-40 times faster than Lucene for instant, fuzzy search.