You’ve heard before on this blog about the difference between products and items on eBay: products use a well-defined structure to describe product information, whereas items let a seller enter free-form text to describe what’s for sale. To help buyers find what they’re looking for, how can we extract relevant information from these unstructured item titles and make items comparable to products?
Natural language processing (NLP) offers one way to do this. In a paper titled “Bootstrapped Named Entity Recognition for Product Attribute Extraction”, we present a named entity recognition (NER) system for extracting product attributes and values from listing titles.
These titles pose some unique challenges for NLP:
- They’re relatively short
- Often they’re just a list of nouns without any grammatical structure
- They contain abbreviations and acronyms, and even typographical errors
- There is no contextual information that could help in identifying product attributes
We combine supervised NER with bootstrapping to expand a seed list of known attribute values such as brands, and we output normalized results. Focusing on listings from eBay’s fashion categories, our bootstrapped NER system identifies new brands that are spelling variants or typographical errors of known brands, and it also identifies novel brands. Among the top 300 new brands predicted, our system achieves 90.33% precision. To output normalized attribute values, we explore several string comparison algorithms and find that n-gram substring matching works well in practice.
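To give a feel for the normalization step, here is a minimal sketch of n-gram matching between an extracted value and a list of known brands. It is an illustration under our own assumptions, not the exact method from the paper: the character trigrams, the Dice overlap score, the 0.5 threshold, and the names `char_ngrams`, `dice_similarity`, and `normalize` are all hypothetical choices made for this example.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, padded string."""
    pad = "#" * (n - 1)  # padding so word boundaries contribute n-grams too
    padded = f"{pad}{text.lower().strip()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a, b, n=3):
    """Dice coefficient over the two strings' character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def normalize(value, known_values, threshold=0.5, n=3):
    """Map an extracted value to its closest known value, or None.

    The threshold is an assumed cutoff; a value that matches nothing
    well enough is left unnormalized (it may be a novel brand).
    """
    best = max(known_values,
               key=lambda known: dice_similarity(value, known, n),
               default=None)
    if best is not None and dice_similarity(value, best, n) >= threshold:
        return best
    return None


known_brands = ["Ralph Lauren", "Calvin Klein", "Michael Kors"]
print(normalize("ralph laurn", known_brands))  # a misspelled variant
print(normalize("zzzz", known_brands))         # no close match
```

In this sketch the misspelled “ralph laurn” shares most of its trigrams with “Ralph Lauren” and is mapped to it, while a string with no overlapping trigrams is left alone so that it can be treated as a candidate novel brand.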
We presented our work at the Conference on Empirical Methods in Natural Language Processing (EMNLP) this July.
-Junling Hu
Principal Data Mining Lead