This article is an adaptation of our original publication at the SIGIR e-commerce workshop. This work was done by eBay's Search Backend Engineering team members Vishnusaran Ramaswamy, Roberto Konow, Andrew Trotman, Jon Degenhardt, and Nick Whyte.
Search engines are implemented as large distributed systems in which every machine has a limited CPU and memory budget. Any improvement in efficiency can potentially be translated into a reduction in hardware, networking, and operating costs. Tremendous research and engineering effort has gone into addressing performance challenges: reducing memory requirements through better data compression, reducing CPU usage with early-termination techniques, and using massively parallel execution engines are just a few of the approaches that have been studied extensively in the past. In this article, we focus on one technique originally designed to improve data compression and reduce the size of the data that is loaded into main memory: document id reordering.
At the heart of search
Before we go into more detail, let us explain some basic concepts about the search engine's main data structure: the inverted index. The inverted index is an old and simple, yet very efficient, data structure that is at the heart of most search engines and is used to support various search tasks. From a collection of documents, an inverted index stores, for each unique term $t$ (or word), a list of postings. Each posting stores a pair $\langle d, w(t,d) \rangle$, where $d$ is a unique document identifier (doc_id) and $w(t,d)$ is a relevance measure of the term $t$ with respect to document $d$ (often the number of times term $t$ occurs in document $d$). Posting lists can be extended to store additional information, such as the positions of the term within the document. Posting lists are usually stored sorted by doc_id and processed one document at a time. To help with compression, doc_ids are often difference encoded: since the list of doc_ids $\langle d_1, d_2, d_3, \dots, d_n \rangle$ is a strictly monotonically increasing sequence, it can be stored as the list of d-gaps $\langle d_1, d_2 - d_1, d_3 - d_2, \dots, d_n - d_{n-1} \rangle$, which are smaller integers. These d-gaps are then compressed using variable-width integer codes such as variable byte encoding, PForDelta, QMX, or similar.
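To make the d-gap and variable byte ideas concrete, here is a minimal sketch, not eBay's production encoder, of compressing a strictly increasing doc_id list as variable-byte-coded d-gaps (the doc_ids are invented for illustration):

```cpp
// Minimal sketch: compress a doc_id list as variable-byte coded d-gaps.
// Illustrative only; production codecs (PForDelta, QMX, ...) differ.
#include <cstdint>
#include <iostream>
#include <vector>

// Emit one integer as vByte: 7 payload bits per byte; the high bit
// marks the final byte of each encoded value.
void vbyte_encode(uint32_t value, std::vector<uint8_t>& out) {
    while (value >= 128) {
        out.push_back(static_cast<uint8_t>(value & 127));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value | 128));
}

// Store each doc_id as its gap from the previous one, then vByte it.
std::vector<uint8_t> compress_postings(const std::vector<uint32_t>& ids) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t id : ids) {
        vbyte_encode(id - prev, out);  // the d-gap, not the raw id
        prev = id;
    }
    return out;
}

int main() {
    std::vector<uint32_t> postings = {3, 5, 20, 21, 23, 76, 77, 78};
    std::vector<uint8_t> bytes = compress_postings(postings);
    // Small gaps fit in one byte each, which is why any reordering
    // that shrinks gaps also shrinks the index.
    std::cout << postings.size() << " postings -> " << bytes.size()
              << " bytes\n";
}
```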
In practice, search engines can assign doc_ids in a number of different ways: at random, based on document similarity, in the order documents are indexed (collection order), based on a global measure of quality such as PageRank, or, for web documents, in URL order. Others have noted that document reordering not only reduces index size, but can also decrease query processing time. This has motivated several authors to study doc_id assignment strategies that optimize compression and query latency.
Ok, now how do we search?
Query processing involves a number of steps, such as query parsing, query rewriting, and the computation of complex machine-learned ranking functions that may include hundreds or thousands of features derived from the documents, the query, and the user. To rank efficiently, it is common to separate query processing into multiple stages. The first stage is responsible for identifying which documents must be ranked, and subsequent stages rank those documents. In the first stage, a simple and fast retrieval filter, such as a Boolean expression, is often used.
A conjunctive Boolean query of the form "Last AND Jedi" requires an intersection calculation, whereas a disjunctive query of the form "Episode OR VIII" requires a union calculation. Union queries are solved by linearly traversing the postings lists of all terms in the expression and returning every document containing either term. Efficiently resolving intersection queries requires a more complicated traversal of the postings lists and has been studied for decades. It has been proven that the optimal algorithm for intersecting two sorted sets of lengths $m$ and $n$, with $m \leq n$, runs in $O(m\log\frac{n}{m})$ time. The most popular algorithm for solving intersections is Set Versus Set (also known as Small Versus Small).
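To illustrate, here is a minimal, textbook-style sketch of SvS intersection (not eBay's implementation): the shorter list drives the loop, and a galloping search brackets each candidate in the longer list before a binary search finishes the lookup.

```cpp
// Minimal sketch of Set-versus-Set intersection of two sorted posting
// lists, with galloping (doubling) search on the longer list.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

std::vector<uint32_t> svs_intersect(const std::vector<uint32_t>& small,
                                    const std::vector<uint32_t>& large) {
    std::vector<uint32_t> result;
    std::size_t lo = 0;
    for (uint32_t d : small) {
        // Gallop: double the step until we are at or past d in 'large'.
        std::size_t step = 1, hi = lo;
        while (hi < large.size() && large[hi] < d) {
            lo = hi;
            hi += step;
            step <<= 1;
        }
        hi = std::min(hi, large.size());
        // Binary search for d inside the bracketed range [lo, hi).
        lo = std::lower_bound(large.begin() + lo, large.begin() + hi, d) -
             large.begin();
        if (lo < large.size() && large[lo] == d) result.push_back(d);
    }
    return result;
}

int main() {
    std::vector<uint32_t> last = {2, 9, 33, 80, 81};
    std::vector<uint32_t> jedi = {1, 2, 5, 9, 14, 33, 50, 80, 99};
    for (uint32_t d : svs_intersect(last, jedi)) std::cout << d << ' ';
    std::cout << '\n';  // prints: 2 9 33 80
}
```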
A popular e-commerce search technique to improve precision is to constrain a query to a particular set of categories in a category tree. This can be done automatically by a trained classifier, or manually by the user. For example, the results of the query iphone case can be constrained so that all the resulting items belong to the category "Cell Phones & Accessories > Cases, Covers & Skins." These categories also form natural partitions, clustering items according to popular search dimensions.
Want to search faster? Reorder your docs!
Document reordering is a well-studied technique in web search. Most prior work has focused on reordering documents to achieve better compression. The approach is to perform text clustering on the collection to find similar documents and then assign doc_ids to minimize the d-gaps in the posting list. Fabrizio Silvestri explored a much simpler idea that takes advantage of an implicit organization of the documents in the Web. In his work, he proposes to assign doc_ids sequentially according to the document URL and showed that this simple technique achieved competitive results when compared to text clustering techniques. In essence, he roughly clustered on topic, because different sites and different parts of the same sites are usually topically related.
Our approach is motivated by Silvestri's work: we present a simple, though not provably optimal, document id reordering technique. It takes advantage of the structured nature of documents in e-commerce, specifically that items for sale are usually classified and categorized before being listed.
e-Commerce search is more structured
E-commerce search is a much more structured and constrained scenario than web search. In e-commerce, much of the document content is given by item properties such as price, brand, model, category, color, and so on. It is natural to treat these features as filtering components of a search, and it is consequently common practice to generate posting lists for those kinds of features. For example, by generating posting lists for each instance of the feature "brand" (e.g., brand:apple, brand:samsung), the user can easily constrain their search to just the items made by a given manufacturer, and indeed users expect to be able to do so.
Category is a particularly interesting property of e-commerce items (and queries) because it not only divides the inventory into distinct types of products, but is also commonly used to improve the precision of the results. Given a user query, the search engine can constrain the results to just those from the most relevant category. This can be done automatically by training query-to-category classifiers, or by allowing the user to manually select a category. Figure 1 shows an example user query, "star wars," constrained to the category "art" on ebay.com.
If the search engine creates posting lists for each category, the first stage of query processing can be improved significantly, since it can perform a direct Boolean intersection between the (possibly term-expanded) user query and the posting list for the given category. This cuts down the size of the recall base, which increases the efficiency of the search engine, and because it restricts results to the most likely category, it also removes noise from the results list, increasing precision. This is the motivation for document id reordering based on item category.
We reorder the collection so that items belonging to the same category are assigned contiguous doc_ids. This reduces the size of the d-gaps in posting lists, which leads to better compression. But that is not the only benefit: because posting lists are sorted by doc_id, the reordering creates implicit category "divisions" within each posting list.
Figure 2 illustrates this. On the top left, the collection of documents is shown in "collection order," where the distinct shades of gray represent different categories. The top right gives example posting lists for words ($w_1,w_2,w_3$). The bottom left of the figure shows the collection category reordered where, for example, collection ordered $d_2$ becomes category ordered $d_3$. The bottom right shows the effect of document reordering on the posting lists; they are now implicitly divided by category.
Figure 2: Document reordering diagram. Different gray levels represent different categories. On the left, we show a sketch of how documents are assigned new doc_ids. On the right, the effect on posting lists.
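As a concrete illustration of the remapping step, here is a minimal sketch (not eBay's indexing pipeline) that stable-sorts documents by category and assigns new doc_ids in that order; the struct, field names, and categories are invented for the example:

```cpp
// Minimal sketch: assign contiguous doc_ids per category by stable-
// sorting on the category label and numbering documents in that order.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Document {
    uint32_t old_id;       // assumed to run 1..N with no holes
    std::string category;  // e.g. a leaf category in the category tree
};

// Returns remap[old_id] = new_id. Items sharing a category end up with
// contiguous new doc_ids; stable sort keeps collection order within one.
std::vector<uint32_t> category_reorder(std::vector<Document> docs) {
    std::stable_sort(docs.begin(), docs.end(),
                     [](const Document& a, const Document& b) {
                         return a.category < b.category;
                     });
    std::vector<uint32_t> remap(docs.size() + 1);
    uint32_t next_id = 1;
    for (const Document& d : docs) remap[d.old_id] = next_id++;
    return remap;
}

int main() {
    std::vector<Document> docs = {
        {1, "toys"}, {2, "art"}, {3, "toys"}, {4, "art"}, {5, "art"}};
    std::vector<uint32_t> remap = category_reorder(docs);
    for (uint32_t old_id = 1; old_id <= 5; ++old_id)
        std::cout << old_id << " -> " << remap[old_id] << '\n';
    // The "art" items (old 2, 4, 5) become new ids 1, 2, 3: contiguous.
}
```

The indexer would then rewrite every posting list (and the forward index) through this remap table.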
This document reordering scheme not only helps compression, but also decreases latency: because the d-gaps are smaller, decompressing the integers is faster, since fewer operations are generally required to decode a smaller integer. Equally, since similar documents are stored together, fewer skips are needed. Note that when a query is category constrained, the results must lie within a consecutive part of the postings list.
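On an uncompressed, doc_id-sorted view of a posting list, that consecutive part can be located with two binary searches, assuming the doc_id range [first, last] of each category was recorded during reordering. This is a hedged sketch; in a real index the lists are compressed and traversed via skip pointers, so the lookup is less direct:

```cpp
// Minimal sketch: restrict a doc_id-sorted posting list to a category's
// contiguous doc_id range [first, last] with two binary searches.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

std::pair<std::size_t, std::size_t> category_slice(
    const std::vector<uint32_t>& postings, uint32_t first, uint32_t last) {
    auto lo = std::lower_bound(postings.begin(), postings.end(), first);
    auto hi = std::upper_bound(lo, postings.end(), last);
    return {static_cast<std::size_t>(lo - postings.begin()),
            static_cast<std::size_t>(hi - postings.begin())};
}

int main() {
    std::vector<uint32_t> postings = {3, 5, 20, 21, 23, 76, 77, 78};
    // Pretend the target category was assigned doc_ids 20 through 30.
    auto [begin, end] = category_slice(postings, 20, 30);
    for (std::size_t i = begin; i < end; ++i)
        std::cout << postings[i] << ' ';
    std::cout << '\n';  // prints: 20 21 23
}
```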
Show me the numbers
We implemented this idea and conducted our experiments using eBay's search engine. We selected 12 million random items from our dataset and constructed two indexes:
- Random Index: documents were assigned doc_ids at random.
- Reordered Index: documents were sorted by category and then assigned doc_ids.
To evaluate the performance of our document reordering technique, we use two sets of queries:
- General Queries: approximately 3 million queries from production logs of one eBay data center. These queries included user-issued queries as well as system-issued queries (such as those from the eBay public APIs). No country-specific filtering was performed, so queries are in many languages.
- User Queries: a random sample of 57,058 queries exactly as entered into the search box by ebay.com users.
3.2% Smaller
For the purpose of this work, we constructed a special index that uses variable byte encoding to compress doc_ids. We see a reduction of 6.1% in the average number of bytes required to represent a doc_id using this encoding scheme. This represents a 3.2% space reduction of the complete index. Table 1 presents a summary of the results. It shows that the number of d-gaps equal to 1 has increased by 70%, that the average d-gap has decreased by 67.5%, and that the average number of bits required to represent d-gaps is reduced by 28%. In practice, the actual size reduction in the index will depend on the integer encoding scheme.
| | Random | Reordered | Change |
|---|---|---|---|
| d-gaps = 1 | $890 \times 10^6$ | $1,510 \times 10^6$ | +70% |
| Avg. $\log_2(\text{d-gap})$ | 5.73 | 4.11 | -28% |
| Avg. d-gap | 1966 | 639 | -67.5% |
| Avg. bytes per d-gap (vByte) | 1.30 | 1.22 | -6.1% |
| Index size | 29.67 | 28.72 | -3.2% |
Table 1: Space savings and bits per d-gap obtained before and after applying document reordering.
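For the curious, here is a small sketch of how statistics like those in Table 1 can be gathered from posting lists: it counts d-gaps equal to 1 and accumulates $\log_2$(d-gap) to estimate the average number of bits per gap. The toy list is invented and does not reproduce the table's numbers:

```cpp
// Minimal sketch: accumulate Table 1-style d-gap statistics over
// posting lists (gaps equal to 1, average log2(gap), average gap).
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

struct GapStats {
    uint64_t gaps = 0, ones = 0;
    double sum_log2 = 0.0, sum_gap = 0.0;

    void add_list(const std::vector<uint32_t>& doc_ids) {
        uint32_t prev = 0;
        for (uint32_t id : doc_ids) {
            uint32_t gap = id - prev;
            prev = id;
            ++gaps;
            if (gap == 1) ++ones;
            sum_log2 += std::log2(static_cast<double>(gap));
            sum_gap += gap;
        }
    }
};

int main() {
    GapStats stats;
    stats.add_list({3, 5, 20, 21, 23, 76, 77, 78});  // toy posting list
    std::cout << "d-gaps = 1:      " << stats.ones << '\n'
              << "avg log2(d-gap): " << stats.sum_log2 / stats.gaps << '\n'
              << "avg d-gap:       " << stats.sum_gap / stats.gaps << '\n';
}
```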
26.9% Faster!
The first experiment measured throughput using the General Queries, mirroring the production environment. We computed the average number of queries per second (QPS) that could be resolved when the CPU was held at 80% utilization (the other 20% is reserved at eBay for processes such as index maintenance). We found that, on average, the Reordered Index could process about 30% more queries per second than the Random Index.
Table 2 shows the average search time per query (in milliseconds) at the mean, median, 95th percentile, and 99th percentile. In all cases we see a substantial improvement; mean latency improved by 26.9%, while 95th percentile latency is almost halved compared to the Random Index.
| | Random | Reordered | Latency Reduction |
|---|---|---|---|
| Mean | 22.43 | 16.4 | 26.9% |
| Median | 4.35 | 3.21 | 28.3% |
| 95th Percentile | 57 | 30.2 | 47% |
| 99th Percentile | 375 | 224 | 40% |
Table 2: Comparison of random doc_id assignment versus category doc_id reordering. Mean, median, 95th, and 99th percentiles of query processing times in milliseconds for general queries.
We also evaluated the impact of applying category constraints to the queries. The results for the User Queries set are shown in Table 3. The first part shows the latency (in milliseconds) when no category constraint is used; here the Reordered Index improved mean query latency by 47% and the 95th percentile by 41%. The second part shows the effect of enabling category constraints on the queries: mean query latency dropped by 55% when the Reordered Index was used, with a similar effect at the 95th percentile. Clearly, both unconstrained and category-constrained queries are materially improved.
Without Category Constraint:

| | Random | Reordered | Latency Reduction |
|---|---|---|---|
| Mean | 26.8 | 14.3 | 47% |
| Median | 5.9 | 3.5 | 42% |
| 95th Percentile | 85.8 | 50.8 | 41% |

With Category Constraint:

| | Random | Reordered | Latency Reduction |
|---|---|---|---|
| Mean | 18.9 | 8.4 | 55% |
| Median | 3.2 | 1.7 | 48% |
| 95th Percentile | 61.6 | 27.6 | 55% |
Table 3: Execution time in milliseconds for user queries with and without category constraint enforcement.
Cool, but why?
Categories naturally cluster both document terms and query terms. Items satisfying a multi-term AND query will tend to come from a small number of categories. Ordering posting lists by category will put documents satisfying these queries near each other both in posting lists and in the forward index. This should improve CPU cache hit rates and even improve the inherent efficiency of the posting list processing algorithms. The latter effect would result from having larger clusters of posting list entries that are either matched or skipped than in a random assignment of the documents.
Query expansion is likely to compound these effects, especially in an e-commerce environment. As in most search engines, we make extensive use of query expansion to improve recall and precision. Rewritten queries often form a complex Boolean expression involving many posting lists and nested AND/OR constructs. Expansions involve not only text keywords, but also the structured meta-data associated with products. For example, the term "black" may expand to "color:black" and "samsung" to "brand:samsung."
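As a toy illustration of that kind of structured expansion (using the article's own "black" and "samsung" examples, with an invented expansion table), a rewritten query can be rendered as an AND of OR groups:

```cpp
// Toy sketch of structured query expansion: plain keywords widen into
// OR groups over structured fields, and the groups are ANDed together.
// The expansion table is invented for the example.
#include <iostream>
#include <map>
#include <string>
#include <vector>

std::string expand_query(const std::vector<std::string>& terms) {
    static const std::map<std::string, std::vector<std::string>> expansions =
        {{"black", {"color:black"}}, {"samsung", {"brand:samsung"}}};
    std::string out;
    for (const std::string& t : terms) {
        if (!out.empty()) out += " AND ";
        auto it = expansions.find(t);
        if (it == expansions.end()) {
            out += t;  // no expansion known for this term
        } else {
            out += "(" + t;
            for (const std::string& e : it->second) out += " OR " + e;
            out += ")";
        }
    }
    return out;
}

int main() {
    std::cout << expand_query({"black", "samsung", "case"}) << '\n';
    // prints: (black OR color:black) AND (samsung OR brand:samsung) AND case
}
```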
To examine these effects, we used the CPU profiling tool perf while processing a portion of the General Queries collection, to identify exactly where the improvement was most noticeable. The functions that iterate (and skip) through posting lists consumed about 5% of total CPU time, and the improvement was especially pronounced there. We also saw improvements in the doc_id variable byte decoding section. Finally, we analyzed how cache misses were affected by the reordered index: overall cache misses dropped by 7%, and last-level cache (LLC) misses by 13%. These numbers show that ordering documents by category yielded a significant improvement in overall CPU cache hit rates, and they are consistent with our hypothesis for the latency improvements. Additional analysis is still needed to quantify the effects on posting list processing.
The probability that any given cluster of posting list entries will be referenced by a query is far from uniform in the General Queries collection; it more likely follows something like a Zipfian curve. This should reduce the number of CPU cache entries filled with posting list data while processing a query, shrinking the CPU cache working set for the portion of query processing dedicated to posting lists. That, in turn, frees a larger share of the CPU cache for the other functions doing query processing work, improving their cache hit rates as well.
The above discussion focuses on the recall component of query processing. As mentioned earlier, document reordering also better co-locates forward index data for items satisfying the recall expression for a query. Forward index data is used extensively in the ranking component of query processing. As such, this also has potential to improve CPU cache hit rates.
Summary
Our results show that document id reordering by category reduces mean latency per query by 27% and 99th percentile latency by 40%. Latency improvements are seen both with and without category constraints applied. It also reduces the index size by 3.2%. For more details, you can find our SIGIR e-commerce workshop paper here.
Acknowledgments
We thank Prof. Alistair Moffat and Prof. Gonzalo Navarro for their help and comments during the development of this project.