Mark Twain is quoted as having said, “Some German words are so long that they have a perspective.”
Although eBay users are unlikely to search using fearsome beasts like “rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz”, which stands for the “beef labeling supervision duties delegation law”, we do frequently see compound words in our users’ queries. While some might look for “damenlederhose”, others might be searching for the same thing (women’s leather pants) using the decompounded forms “damen lederhose” or “damen leder hose”. And even though a German teacher would tell you only “damenlederhose” or “damen lederhose” are correct, the users’ expectation is to see the same results regardless of which form is used.
This scenario exists on the seller side as well. That is, people selling on eBay might describe their item using one or more of these forms. In such cases, what should a search engine do? While the problem might seem simple at first, German word-breaking (or decompounding, as it is also known) is anything but.
How to find compound synonyms?
Writing a program to figure out how to break a German compound word for one of the world’s largest online marketplaces poses challenges:
- The word could have been created using a product name, and new product names can appear all the time.
- Syntactically, the word can be split into many valid components, but not all splits are useful or make sense. For instance, “aktivkohlefiltermagnetventil”, meaning “active carbon filter magnetic valve” (a valve used in water treatment plants), can be split in at least three different ways, all of which consist of valid dictionary words:
- aktiv (active) kohlefilter (carbon filter) magnetventil (magnetic valve) or
- aktiv (active) kohle (carbon) filter (filter) magnetventil (magnetic valve) or
- aktiv (active) kohle (carbon) filter (filter) magnet (magnetic) ventil (valve)
However, the only acceptable split that is synonymous with the compound word “aktivkohlefiltermagnetventil” is “aktivkohlefilter magnetventil”.
- Some strings look like they might be compound words, but in fact are not. For instance, a tokenizer might look at the word “beiden” (“both”) and decide it’s made up of the two separate words “bei” (“at”) and “den” (“the”), but that would be incorrect.
- There is also the problem of how small or large each of the splits can be. Consider the three forms below; which are most useful?
- granitpflastersteine (granite paving stones)
- granit pflastersteine (granite paving stones)
- granit pflaster steine (granite paving stones)
- A smaller concern is the cost of processing possible splits online, i.e., after the user types in her search query.
Dictionary-based approaches tend to split German compound words into every morphologically valid form, and the best split among them is hard to determine algorithmically.
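To make the over-splitting problem concrete, here is a minimal sketch of a naive dictionary-based splitter. The function name `all_splits` and the tiny `TOY_VOCABULARY` are assumptions made for illustration; a real dictionary would be far larger, and this is not the approach eBay ended up using.

```python
def all_splits(word, vocabulary):
    """Return every way to segment `word` into entries of `vocabulary`."""
    if not word:
        return [[]]  # one way to split the empty string: no parts
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in vocabulary:
            # Keep this prefix and recursively split the remainder.
            for rest in all_splits(word[end:], vocabulary):
                results.append([prefix] + rest)
    return results


# Hypothetical toy dictionary, just large enough for the examples in the text.
TOY_VOCABULARY = {
    "aktiv", "kohle", "filter", "kohlefilter", "aktivkohlefilter",
    "magnet", "ventil", "magnetventil",
    "bei", "den", "beiden",
}

splits = all_splits("aktivkohlefiltermagnetventil", TOY_VOCABULARY)
for parts in splits:
    print(" ".join(parts))

# A non-compound such as "beiden" is also (wrongly) offered as "bei den".
print(all_splits("beiden", TOY_VOCABULARY))
```

Even with this small vocabulary, the splitter returns several candidates for the valve example and happily breaks “beiden” apart. Nothing in the dictionary says which candidate is actually synonymous with the compound.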
Wisdom of the crowds
Faced with these challenges, we turned to our users for help. We can learn a lot simply by watching millions of German users listing millions of items and performing hundreds of millions of searches. Essentially, whenever users changed from a compound to a decompounded form of a query (or vice versa), we asked: did their subsequent actions on our site indicate that they were pleased with the change?
First, we mined our logs to find instances where users changed their search terms by breaking up or joining adjacent terms. For instance, people typed in “damen schuhe” and then changed it to “damenschuhe”. We analyzed how many and what kind of results we were showing for each search, and how much those users liked the results (based on whether they clicked on any of them). Based on how frequently users made such changes, we were able to collect millions of candidate compound words and their decompounded forms. We did the same for item titles, as sellers often include multiple forms of a compound word to make sure buyers can find their items.
To ensure a good user experience, we then filtered out model numbers, single-character splits, and words without relevant inventory. This whittled the set of synonymous compound word pairs down to a few hundred thousand.
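The mining step can be sketched roughly as follows. This is an illustration under stated assumptions, not the production pipeline: the log format (`before`, `after`, `clicked` tuples), the helper names, the "contains a digit" test for model numbers, and the `min_count` threshold are all hypothetical, and the inventory-relevance filter is left out.

```python
from collections import Counter


def compound_pairs(before, after):
    """Return (compound, decompounded) pairs for two queries that differ only in spacing."""
    a, b = before.split(), after.split()
    if "".join(a) != "".join(b):
        return []  # the rewrite changed more than whitespace
    pairs = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Grow a group of tokens on each side until both cover the same characters.
        group_a, group_b = [a[i]], [b[j]]
        len_a, len_b = len(a[i]), len(b[j])
        i, j = i + 1, j + 1
        while len_a != len_b:
            if len_a < len_b:
                group_a.append(a[i]); len_a += len(a[i]); i += 1
            else:
                group_b.append(b[j]); len_b += len(b[j]); j += 1
        if len(group_a) == 1 and len(group_b) > 1:
            pairs.append((group_a[0], " ".join(group_b)))
        elif len(group_b) == 1 and len(group_a) > 1:
            pairs.append((group_b[0], " ".join(group_a)))
    return pairs


def looks_usable(compound, decompounded):
    """Assumed filters: drop single-character splits and anything containing digits."""
    if any(len(part) == 1 for part in decompounded.split()):
        return False
    return not any(ch.isdigit() for ch in compound)


def mine_candidates(rewrites, min_count=2):
    """Count pairs from rewrites that led to a click and keep the frequent ones."""
    counts = Counter()
    for before, after, clicked in rewrites:
        if not clicked:
            continue
        for pair in compound_pairs(before.lower(), after.lower()):
            if looks_usable(*pair):
                counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_count}


# Hypothetical log excerpt: (query before, query after, did the user click a result?)
sample_rewrites = [
    ("damen schuhe", "damenschuhe", True),
    ("damenschuhe rot", "damen schuhe rot", True),
    ("iphone 6 s", "iphone 6s", True),
    ("iphone 6 s", "iphone 6s", True),
    ("granit pflastersteine", "granitpflastersteine", False),
]
candidates = mine_candidates(sample_rewrites)
print(candidates)
```

In this toy run only the “damenschuhe” pair survives: the “6s” rewrite is dropped as a single-character split and model number, and the unclicked rewrite is ignored.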
Since we now had a large number of compound words and their decompounded forms, we were able to determine whether a user’s query matched items listed with any of its compounded or decompounded forms. Since the compound words are precomputed, the lookups are blazingly fast. Further, the list of words is based on real user behavior data, which allows us to use the most “sensible” compounds/decompounds.
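A precomputed synonym table of this kind turns query expansion into a few dictionary lookups. The sketch below assumes a hypothetical in-memory layout (`build_index`, `expand_query`, and the `max_span` limit are illustrative names and parameters, not eBay's actual implementation) and replaces one matching phrase at a time.

```python
def build_index(synonym_groups):
    """Map every known form to the full set of forms in its synonym group."""
    index = {}
    for group in synonym_groups:
        forms = frozenset(group)
        for form in forms:
            index[form] = forms
    return index


def expand_query(query, index, max_span=3):
    """Return the query plus variants where one phrase is swapped for a synonymous form."""
    tokens = query.lower().split()
    variants = {" ".join(tokens)}
    for start in range(len(tokens)):
        # Try phrases of up to `max_span` adjacent tokens starting here.
        for end in range(start + 1, min(len(tokens), start + max_span) + 1):
            phrase = " ".join(tokens[start:end])
            for form in index.get(phrase, ()):
                variants.add(" ".join(tokens[:start] + form.split() + tokens[end:]))
    return variants


# Hypothetical mined synonym group, using the example forms from the text.
index = build_index([
    {"damenlederhose", "damen lederhose", "damen leder hose"},
])

print(sorted(expand_query("damen lederhose schwarz", index)))
print(sorted(expand_query("Damenlederhose", index)))
```

Because the table is built offline, the per-query cost is a handful of hash lookups over short token spans; no splitting has to happen while the user waits.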
The bottom line
Being able to find items listed with all common compounded/decompounded word forms is valuable, and that value is easy to measure. We simply split our user population in two, providing the existing search experience to some users and the new experience to the others. The new compound word synonym experience, although largely unnoticed by most users, was indeed an improvement: it enabled buyers to find – and subsequently buy – more of the items they were looking for. This was measurable in terms of direct revenue. In this case, buyer, seller, and eBay interests are well aligned: buyers bought more, sellers sold more, and eBay revenue increased.
Brian Johnson, Prathyusha Senthil Kumar, Ashok Mallya