Wednesday, May 16, 2012

Hosted search service to support instant and error-tolerant search

We are glad to announce our hosted search service at http://service.bimaple.com. The service features an advanced instant and error-tolerant search engine, which is a result of many years of academic research and commercial development. Our service (beta) allows developers to upload your data and launch a powerful search interface.

Functionality: The hosted search service allows you to do the following:
  • Create and configure an index, and upload your data;
  • Configure searchable attributes;
  • Enable instant search (search as you type);
  • Allow error-correction on the fly;
  • Achieve a high query throughput;
  • Support real-time updates.

Motivation and Technology: Our motivation is based on the observation that emerging applications have new search requirements that cannot be met by existing solutions such as Lucene and Sphinx. Specifically:
  • More users are accessing information from mobile devices, which have the “fat fingers problem,” i.e, tapping is hard and error prone.
  • Fuzzy search is important when users do not remember the exact spelling of keywords such as a name or a restaurant.
  • Data in many domains (especially social data) is very dynamic, and requires real-time indexing.
  • The increasing popularity of Google Instant shows the power of instant search. At the same time, there were no good solutions for developers to do this type of search on large amounts of data with high efficiency and good ranking.

In order to meet these requirements, we develop a hosted service using our own search engine and technology, which is a result of many years of both academic research and commercial development. We use state-of-the-art in-memory index structures and search algorithms specifically designed to support these powerful features.

Comparisons with Existing Solutions:  Compared to open source solutions such as Lucene, our engine supports instant, fuzzy search on large amounts of data with a high efficiency to ensure high-quality results.  The engine also allows real-time indexes for dynamic data. A detailed comparison with Lucene is available in this white paper.

We also did a comparison with Amazon Cloud Search. For both services, we uploaded 2 million records, and each record has 5 keywords on the average.  We used Jmeter to test the performance of both services of instant queries (i.e., prefix queries). Here are the results.
Index construction timeSearch throughputAverage response timeTime to update 200,000 recordsError correction
Amazon Cloud Search47 minutes752/minute74 ms720 secondsNot supported
Bimaple3 minutes762/minute73 ms21 secondsSupported


Our initial comparison shows several advantages of our hosted search: (1) Significantly lower index construction time; (2) Much faster incremental updates of the indexes; and (3) Efficient support of error correction (versus no error correction at all), all without compromising search throughput and response time.

Pricing: Our service is initially free for customers with less than 100MB data for an index. If your data is more than this limit, please contact us.

For more information, please email contact@bimaple.com.

Saturday, March 31, 2012

Location-Based Search

An increasing number of applications and services are location based, such as Yelp, Foursquare, and Foodspotting. In addition, mobile searches tend to have a location component, since users often use their mobile phone to search for information based on where they are. All these trends make location-based search more important. Formally speaking, a location-based search query specifies a spatial condition (such as downtown San Francisco) and a few keywords (such as "Ghirardelli"), and wants to retrieve documents or objects that satisfy both conditions (e.g., finding all stores mentioning Ghirardelli in downtown San Francisco.

Technical Challenges

Location-based search requires to deal with two types of data: keywords and geospatial data. For keywords, a classic index structure is inverted index, using which we can easily find documents that contain keywords. For geospatial data, there are many existing index structures, such as R-tree, R*-tree, quad-tree, and Hilbert curve. In order to answer location-based queries, we need integrate these two types of index structures to efficiently find answers that match both conditions.

As mentioned in our earlier posts, instant search and fuzzy search are also very important for location-based search, especially when users issue queries from their mobile devices. However, supporting this type of search is even more challenging, since we need to also consider how to use index structure to handle prefix condition (e.g., a prefix query "Ghira") and fuzzy condition (e.g., "Girardeli"with two typos).

Location-Based Search in Lucene

Although Lucene was initially designed and implemented to support traditional keyword search, it can be extended to support location-based search by using a technique called Geo Hash encoding. Grant Ingersoll wrote an excellent article on how to do it. Many popular search engines are taking such an approach or something similar.

One limitation of this approach is that we cannot easily extend it to enable more powerful capabilities such as instant search or fuzzy search. One metaphor is that a pair of basketball shoes can be used to play soccer, but they cannot give you a good performance, since they are not designed for that purpose!

Bimaple's Location-Based Search

At Bimaple we developed a novel technology that seamlessly integrates several index structures to enable location-based search with the powerful search capabilities. We have published a white paper to compare our solution with Lucene, and shown that ours is 10-100 times faster.

Fuzzy Search

Why fuzzy search?

Fuzzy search is important for the cases where users do not know the exact spelling of the entities they are looking for, such as a person name, a restaurant name, a book title, or a disease. As example, few people know the exact spelling of our former California governor, Schwarrzeneggar, despite his huge popularity in Hollywood and politics. The increasing number of mobile devices and applications make fuzzy search more important due to the "fat fingers problem," i.e., it is painful and error prone to tap in keywords on a small mobile device. Being able to support fuzzy search can make it easier for users to find relevant results.

What is "fuzzy"?

Whenever we talk about fuzzy search, the first question we have to answer is: "What is the meaning of fuzzy?" There are different ways to measure the similarity and fuzziness between two keywords or strings, including edit distance (Levenshtein distance), cosine, jaccard, and dice. In this blog I will focus on edit distance, which is a commonly used method suitable for keyword typos. Formally, the distance distance between two strings is the minimum number of single character operations (insertion, deletion, and substitution) to transform one string to another string. For instance, the edit distance between "Schwarzenaggar" and "Schwarrzeneggar" is 2 (one deletion and one substitution).

A main problem in fuzzy search is: given a query keyword such as "Schwarzenaggar", how to find the keywords that are similar to the query keyword, i.e., their edit distance to the query keyword is within a threshold? A main challenge is that we need to do such a search very efficiently on large amounts of data (millions or tens of millions keywords).

Gram-based Methods

One way to solve the problem is to use grams of strings. For a small integer q, the q-grams of a string are its substrings with the length q. For instance, the 3-grams of "Schwarrzeneggar" are "Sch", "chw", "hwa", "war", etc. An important observation is that, if two strings are very similar, they should have enough common grams. In our running example, the following are the 3-grams of the two strings:
  • Schwarzenaggar: Sch, schw, hwa, war, rze, zen, ena, nag, agg, gga, gar
  • Schwarrzeneggar: Sch, scw, hwa, war, arr, rrz, rze, zen, ene, neg, egg, gga, gar
Clearly these two sets share a lot of common grams. At UC Irvne, I have a research project called Flamingo, in which we have developed efficient indexes and search algorithms do fuzzy search using this method.

Trie-based Methods

Gram-based methods are suitable for fuzzy search on complete keywords, but these methods are not suitable for fuzzy search on the fly as a user is typing a keyword. The main reason is that the prefix from the user could be short (e.g, 4-5 characters), and it does not have enough grams to give pruning power to find similar keywords.

Another class methods to do fuzzy search are based on trie. Intuitively, we use a tree structure called trie to index all the keywords. As a user types in one character in the keyword, we traverse the trie downward to reach the next trie node. When doing fuzzy search, we can find multiple trie nodes that are similar to the query prefix. Such nodes are called "active nodes." As the user types in one more character, we can use the active nodes of the prefix to incrementally compute the active nodes of the prefix. We have published a paper to describe this technique, which has been patented. A big advantage of this method is that it is very suitable for instant search, since the trie structure has a good pruning power for prefixes.

Fuzzy search in other systems

One way to solve the problem is to use grams of strings. For a small integer q, the q-grams of a
  • Lucene: It supports fuzzy search by providing a special kind of queries called FuzzyQuery. Realizing the low performance of the old versions, there is an effort to make it faster in the new 4.0 release.
  • Oracle: It supports fuzzy matching, but with a low performance.

Comparing Bimaple and Lucene on Instant Fuzzy Search

We published a white paper to compare our instant fuzzy search engine with Lucene. A conclusion is that our engine is 10-40 times faster than Lucene for instant, fuzzy search.

Sunday, July 17, 2011

Instant Search: Why and How?


Instant Search (often known as search as you type, type-ahead search, and auto-completion) is becoming increasingly popular in many Web applications. Such a system can find relevant answers as a user types in a keyword query character by character. The quick feedback can not only help the user formulate a query, but also help them find answers more easily. Such features are especially important for mobile users, since it is not easy to tap in keywords in such devices. On 9/08/2010, Google Instant was officially released on their blog, and which has been shown to save 2-5 seconds per Web search. Its success made many developers think about how to support this kind of information-search paradigm in their own applications. This blog is to give an overview of instant search.

Classification of Instant Search

There are two types of instant search: multi-user-based instant search and single-user-based instant search.

(1) Multi-User-Based (e.g., Google Instant and the "Search" box on the Apple homepage): In such a setting, we have a server that can be queried from many users concurrently. Therefore, achieving high query throughput is extremely very important. Studies have shown that in order to achieve an "instant speed," from the time a user presses a keystroke to the time they see the final results, the total time should be within 100 milliseconds. In a client-server environment, the total time should include the round-trip network time between the client machine and the server, the search time on the server, and the time of executing Javascript functions. Therefore, each search on the server should be done very efficiently. Since many users can query the server at the same time, the server has to answer each of the concurrent queries within milliseconds in order to guarantee a high QPS (queries per second). To achieve such a high speed, the server typically uses in-memory data structures and algorithms. Since the data and index are queried by many users, it is worth having enough memory to store them.

At the same time, this setting has a big advantage that many users are querying the same server, which can keep track of query logs and the past queries of each user. Such information can be used by the server to do powerful query predictions and improve its ranking function.

(2) Single-User-Based (e.g., Microsoft Desktop Search and Google Desktop Search):
In this setting, a single user is searching on their own data such as email in Outlook and files on the disk. Since there is only one user of the search, the requirement on a high speed becomes less. In addition, the operating system can only allocate a limited amount of memory for the search process, so the search has to use disk-based index structures and algorithms, which can already achieve an acceptable performance for a single user.

Query Prediction vs Searching Directly on Data

There is a fundamental difference between Google Instant and other instant-search systems such as the iPubmed system on more than 20 million medical publications.

(a) Google Instant heavily utilizes the user's personal search history and query logs of many users to first predict what the user is likely to query. For instance, suppose a user types in a prefix "california." Based on the large amount of search query history and the user's profile information, the engine predicts four possible queries, such as "california," "california lottery," "california dmv," and "california pizza kitchen," and uses the first prediction to query the backend engine to find results. Such a prediction is possible since millions of users are providing queries to the search engine, which can use the information to come up with accurate predictions.
Google Instant: Predication then search.

While this approach is very effective for popular engines such as Google, it is not feasible for many applications where such query log data or user profile information is not available.

(b) The iPubmed system does not assume any past query history. Instead, as a user types in a keyword query character by character, the engine will issue a search on the backend data "on the fly." Such a search engine can be built in many applications where query logs are not available apriori. In the case when query logs are indeed available, the engine can utilize the information to improve the ranking. Many other programs such as Microsoft Desktop Search follow into the same category.

Instant Search Using Lucene

As one of the most popular open source search engines, Lucene supports instant search by enabling prefix queries. For instance, a user can provide a prefix such as "mo," and the engine can find records with a keyword having "mo" as a prefix, such as "mob", "move", and "movie." Internally the engine finds the first few completions (whole keywords) of the prefix, then uses these keywords to find matching documents.

While this approach can be efficient, its main limitation is poor ranking, since it simply ignores the relevance of a document in the search process. The documents matching the first few complete keywords might not be the most relevant results to the query. For instance, it can find documents containing the keyword "mob", even though more relevant documents are those related to "movie."

Miscellaneous:

Wednesday, June 29, 2011

Introducing Bimaple

Bimaple is a company specialized in improving user experiences of Web search.

Our patent-pending search technology makes your data more discoverable and search interface more user-friendly.