Building a Product Catalog: eBay's University Machine Learning Competition

Trade has played a critical role in the history of humanity, and yet data from ecommerce, the modern form of trading, has received limited attention from academia. We at eBay want to change that.

At eBay, we use state-of-the-art machine learning (ML), statistical modeling and inference, knowledge graphs, and other advanced technologies to solve business problems associated with massive amounts of data, much of which enters our system unstructured, incomplete, and sometimes incorrect. The use cases include query expansion and ranking, image recognition, recommendations, price guidance, fraud detection, machine translation, and more.

Although most of these use cases are common to other technology companies, one challenge is distinctive to eBay: making sense of more than 1.3 billion listings, many of which are unstructured. Currently, we approach this problem with our in-house machine learning solutions, but we also want to grow our community and reach future technologists who have not had access to this type of data. By working with universities, we hope to pique academic curiosity within ML, spur more research in the ecommerce domain powered by a real-world ecommerce dataset, and improve our platform.

To support this idea, eBay is hosting a machine learning competition to structure listing data, in other words, to produce a product catalog. We are very excited to partner with students at the universities listed below, who can now start using a subset of our public listing data to help solve a real-world ecommerce challenge. More than 40 students from these universities are participating, either as part of a team or individually. Teams are competing from:

  • NYU
  • Stanford
  • University at Buffalo
  • The University of Texas at Dallas

Plenty of datasets are available, but their primary focus has been recommender systems, price estimation, computer vision, natural language processing (NLP), and similar tasks. None has addressed, at scale, the mapping of unstructured items to well-cataloged products. We are using the open source EvalAI platform to host the challenge. Our main challenge page has all the relevant details.

The challenge

The question we want to address is how to identify two or more listings as being for the same product by putting them into the same group. We call this Product Level Equivalency (PLE). That is, if a buyer purchased two items from two different listings in a single group, and assuming the items were in the same condition, they would assess that they had obtained two instances of the same product. The measurable objective, evaluation, submission format, and other details are available on EvalAI.
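To make the grouping idea concrete, here is a minimal Python sketch of one naive way to put listings into groups. It is an illustration only: the listing IDs, the titles, and the helper names (normalize_title, group_listings, same_product) are hypothetical, matching on normalized titles is an assumed baseline rather than eBay's method, and the actual submission format is the one specified on EvalAI.

```python
import re
from typing import Dict


def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(sorted(tokens))


def group_listings(listings: Dict[str, str]) -> Dict[str, int]:
    """Map each listing ID to a group label.

    Listings whose normalized titles are identical share a label.
    This is an assumed toy baseline, not the competition's method.
    """
    label_for_key: Dict[str, int] = {}
    assignment: Dict[str, int] = {}
    for listing_id, title in listings.items():
        key = normalize_title(title)
        if key not in label_for_key:
            label_for_key[key] = len(label_for_key)
        assignment[listing_id] = label_for_key[key]
    return assignment


def same_product(assignment: Dict[str, int], a: str, b: str) -> bool:
    """Two listings are treated as product-level equivalent if they share a group."""
    return assignment[a] == assignment[b]


# Hypothetical listings: IDs and titles are made up for illustration.
listings = {
    "A1": "Acme Phone X 64GB Black",
    "A2": "acme phone x, 64gb - black",
    "A3": "Acme Phone X 128GB Black",
}
assignment = group_listings(listings)
print(same_product(assignment, "A1", "A2"))
print(same_product(assignment, "A1", "A3"))
```

Exact title matching will miss most real equivalences, since sellers describe the same product in many different ways. That gap is the substance of the challenge.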

The dataset

The dataset consists of selected public data from 1 million unlabeled listings.

Approximately 25,000 of those listings will be clustered by eBay using human judgment (“true clustering”). These clustered listings will be split into three groups: a) Validation set (approximately 12,500 listings), b) Quiz set (approximately 6,250 listings), c) Final submission set (approximately 6,250 listings).

The validation set is intended for participants to evaluate their approach. Anonymized identifiers and cluster labels will be provided to the participants. The quiz data is used for leaderboard scoring. The final submission set is used to determine the winner. For the quiz and the final submission dataset, neither the listing identifiers nor the cluster labels will be provided to the participants.
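The split sizes above correspond to roughly one half and two quarters of the labeled listings. The Python sketch below reproduces that arithmetic on placeholder IDs. The function name split_labeled, the seed, and the random shuffle are assumptions for illustration; how eBay actually assigns listings to each set is not stated here.

```python
import random
from typing import Iterable, List, Tuple


def split_labeled(ids: Iterable[int], seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Split labeled listing IDs into validation (1/2), quiz (1/4), and final (1/4) sets.

    A seeded random shuffle is an assumption; the real assignment rule is unknown.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    three_quarters = half + (len(shuffled) - half) // 2
    return shuffled[:half], shuffled[half:three_quarters], shuffled[three_quarters:]


# Placeholder IDs standing in for the approximately 25,000 labeled listings.
validation, quiz, final = split_labeled(range(25_000))
print(len(validation), len(quiz), len(final))  # 12500 6250 6250
```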


The challenge began on October 11, 2019. The partner university teams can post their submissions at any time through EvalAI. The evaluation and leaderboard scoring will commence on or about November 8, 2019. The competition will run for about five months and end on or about March 4, 2020. We expect to announce the winning team on March 25, 2020.


Students of the winning team will be offered an internship for Summer 2020 at eBay (subject to eligibility verification checks). The 12-week internship will take place at eBay’s San Jose, CA, headquarters and will be fully paid, including furnished summer housing. eBay’s internship program is a combination of real work experience plus a robust program that gives interns exposure to various business verticals, executives, and networking. The internship will also be an excellent opportunity for students to put their ML models into real use.

The team behind the challenge

From concept to creation, this challenge was an entirely voluntary effort by people across various disciplines. What started as a hallway conversation eventually grew into a small group of like-minded enthusiasts. We formed an Operating Committee (OC) and met weekly to brainstorm ideas. Gradually the plans were put into motion, and now we are launching the challenge. It has been an incredible journey, and I was fortunate to be part of the team below that made it happen.

  • Engineering and Research — Roman Maslovskis, Uwe Mayer, Jean-David Ruvini, Anneliese Eisentraut, Akrit Mohapatra, Bennet Barouch, Pavan Vutukuru, Sathish Shanmugam, and Jon Degenhardt
  • Program Management — Roya Foroud
  • Legal — Brian Haslam, Brad Sanders, Sonia Valdez, and Kai Weingarten
  • Recruitment — Cindy Loggins
  • Comms — Melissa Ojeda

We would also like to thank the EvalAI team for quickly responding to our numerous queries. Finally, a shout-out to our senior leadership (Mohan Patt and Ron Knapp), who have supported this idea from the get-go.

We sincerely hope that making this real-world dataset available will entice universities and students to explore the ecommerce domain further and come up with novel approaches to solve complex problems that can have a positive impact on customers and sellers alike.

If you are a university student, researcher, or professor and would like to participate in future programs, please feel free to reach out to us.