Saturday, March 31, 2012

Location-Based Search

An increasing number of applications and services are location based, such as Yelp, Foursquare, and Foodspotting. In addition, mobile searches tend to have a location component, since users often use their mobile phone to search for information based on where they are. All these trends make location-based search more important. Formally speaking, a location-based search query specifies a spatial condition (such as downtown San Francisco) and a few keywords (such as "Ghirardelli"), and wants to retrieve documents or objects that satisfy both conditions (e.g., finding all stores mentioning Ghirardelli in downtown San Francisco.

Technical Challenges

Location-based search requires to deal with two types of data: keywords and geospatial data. For keywords, a classic index structure is inverted index, using which we can easily find documents that contain keywords. For geospatial data, there are many existing index structures, such as R-tree, R*-tree, quad-tree, and Hilbert curve. In order to answer location-based queries, we need integrate these two types of index structures to efficiently find answers that match both conditions.

As mentioned in our earlier posts, instant search and fuzzy search are also very important for location-based search, especially when users issue queries from their mobile devices. However, supporting this type of search is even more challenging, since we need to also consider how to use index structure to handle prefix condition (e.g., a prefix query "Ghira") and fuzzy condition (e.g., "Girardeli"with two typos).

Location-Based Search in Lucene

Although Lucene was initially designed and implemented to support traditional keyword search, it can be extended to support location-based search by using a technique called Geo Hash encoding. Grant Ingersoll wrote an excellent article on how to do it. Many popular search engines are taking such an approach or something similar.

One limitation of this approach is that we cannot easily extend it to enable more powerful capabilities such as instant search or fuzzy search. One metaphor is that a pair of basketball shoes can be used to play soccer, but they cannot give you a good performance, since they are not designed for that purpose!

Bimaple's Location-Based Search

At Bimaple we developed a novel technology that seamlessly integrates several index structures to enable location-based search with the powerful search capabilities. We have published a white paper to compare our solution with Lucene, and shown that ours is 10-100 times faster.

Fuzzy Search

Why fuzzy search?

Fuzzy search is important for the cases where users do not know the exact spelling of the entities they are looking for, such as a person name, a restaurant name, a book title, or a disease. As example, few people know the exact spelling of our former California governor, Schwarrzeneggar, despite his huge popularity in Hollywood and politics. The increasing number of mobile devices and applications make fuzzy search more important due to the "fat fingers problem," i.e., it is painful and error prone to tap in keywords on a small mobile device. Being able to support fuzzy search can make it easier for users to find relevant results.

What is "fuzzy"?

Whenever we talk about fuzzy search, the first question we have to answer is: "What is the meaning of fuzzy?" There are different ways to measure the similarity and fuzziness between two keywords or strings, including edit distance (Levenshtein distance), cosine, jaccard, and dice. In this blog I will focus on edit distance, which is a commonly used method suitable for keyword typos. Formally, the distance distance between two strings is the minimum number of single character operations (insertion, deletion, and substitution) to transform one string to another string. For instance, the edit distance between "Schwarzenaggar" and "Schwarrzeneggar" is 2 (one deletion and one substitution).

A main problem in fuzzy search is: given a query keyword such as "Schwarzenaggar", how to find the keywords that are similar to the query keyword, i.e., their edit distance to the query keyword is within a threshold? A main challenge is that we need to do such a search very efficiently on large amounts of data (millions or tens of millions keywords).

Gram-based Methods

One way to solve the problem is to use grams of strings. For a small integer q, the q-grams of a string are its substrings with the length q. For instance, the 3-grams of "Schwarrzeneggar" are "Sch", "chw", "hwa", "war", etc. An important observation is that, if two strings are very similar, they should have enough common grams. In our running example, the following are the 3-grams of the two strings:
  • Schwarzenaggar: Sch, schw, hwa, war, rze, zen, ena, nag, agg, gga, gar
  • Schwarrzeneggar: Sch, scw, hwa, war, arr, rrz, rze, zen, ene, neg, egg, gga, gar
Clearly these two sets share a lot of common grams. At UC Irvne, I have a research project called Flamingo, in which we have developed efficient indexes and search algorithms do fuzzy search using this method.

Trie-based Methods

Gram-based methods are suitable for fuzzy search on complete keywords, but these methods are not suitable for fuzzy search on the fly as a user is typing a keyword. The main reason is that the prefix from the user could be short (e.g, 4-5 characters), and it does not have enough grams to give pruning power to find similar keywords.

Another class methods to do fuzzy search are based on trie. Intuitively, we use a tree structure called trie to index all the keywords. As a user types in one character in the keyword, we traverse the trie downward to reach the next trie node. When doing fuzzy search, we can find multiple trie nodes that are similar to the query prefix. Such nodes are called "active nodes." As the user types in one more character, we can use the active nodes of the prefix to incrementally compute the active nodes of the prefix. We have published a paper to describe this technique, which has been patented. A big advantage of this method is that it is very suitable for instant search, since the trie structure has a good pruning power for prefixes.

Fuzzy search in other systems

One way to solve the problem is to use grams of strings. For a small integer q, the q-grams of a
  • Lucene: It supports fuzzy search by providing a special kind of queries called FuzzyQuery. Realizing the low performance of the old versions, there is an effort to make it faster in the new 4.0 release.
  • Oracle: It supports fuzzy matching, but with a low performance.

Comparing Bimaple and Lucene on Instant Fuzzy Search

We published a white paper to compare our instant fuzzy search engine with Lucene. A conclusion is that our engine is 10-40 times faster than Lucene for instant, fuzzy search.