
From Vendor to In-house: How eBay Reimagined Its Analytics Landscape

By: Medha Samant and Valerie Steinbrugge

Learn how eBay transitioned its analytics data platform from a vendor-based data warehouse to an open-source-based solution built by the team.

To set the path for more tech innovation at eBay, we recently completed a transition of our analytics platform, moving from a vendor-based data warehouse to an open-source-based solution. This transition was no small undertaking and was considered challenging – or even impossible – by many in the industry due to its scale and complexity. With thoughtful strategy, breakthrough technological innovations and tight teamwork, eBay got it done.

Background

eBay’s journey with a commercial data warehousing platform began in the early 2000s. Since then, the platform steadily grew in size, amassing over 20 petabytes of data that includes web analytics and transactional data, such as bids, checkouts, listings, users and accounts. The platform was supporting thousands of eBay analysts, engineers, product managers and leaders across the globe. It not only served as the system of record for eBay’s financial reporting, but was also the preferred platform for all advanced analytics and business intelligence. Why did eBay decide to transition out of this platform, and how was this complex shift achieved?

Key Drivers

Key factors that influenced this move were cost, innovation and more control over eBay’s tech journey. The vendor system posed constraints on eBay’s scope of expansion and became increasingly expensive. Simultaneously, eBay’s technology stack was undergoing a transformation. With a growing focus on data governance, platform reliability and availability coupled with the rapidly evolving data security landscape, it was imperative for eBay to have full control of its technological innovation.

How eBay Achieved It

eBay began exploring alternatives, with open source as the top contender. Here’s how we did it.

Approach

At the foundational level there were two main systems: one supporting large data and batch processing, and the other supporting fast interactive querying and analytics. Both systems had thousands of extract, transform, load (ETL) jobs running on a daily basis. These datasets were being used by thousands of users at all levels of the organization. eBay teammates in search, marketing, shipping, payments, risk and several other domains directly consumed and interacted with these datasets every second of the day. Whether a team wanted to execute a simple “select * from” SQL command or build a complex machine learning model, they had to touch the data residing in one of these two systems. The use cases were seemingly endless, and they all had to move to the new platform without any disruptions.

The new ecosystem needed to provide the same capabilities as the existing solution. Users’ tolerance for any degradation in their experience was very low. Detailed analyses yielded the following factors as crucial to providing a seamless transition to the new platform:

Executing interactive queries: The new solution had to match industry standards for SQL execution speed at scale. Irrespective of the number of users accessing the system simultaneously, every user would expect their queries to be executed in a matter of seconds.
Feature parity: The eBay analytics community had a massive inventory of customized SQL scripts, reports, applications and complex processes that leveraged features and functions not provided by the new Hadoop platform. These had to be migrated with minimal changes.
Connectivity patterns: The Hadoop environment was expected to support established connectivity patterns that had evolved over the years while adhering to new, more stringent standards.
Tools integration: The new solution needed to connect to software in which users could write and execute code such as SQL and Python, as well as to vendor business intelligence and data science applications.

The migration was therefore staged into the following steps:

Build the Hadoop infrastructure and clusters;
Enable ETL batch processing on Hadoop;
Replicate jobs running on the vendor platform on Hadoop;
Build a dedicated computing cluster for interactive queries; and
Migrate users from the vendor platform to Hadoop.

Technology Innovations

SQL-on-Hadoop Engine

eBay’s new SQL-on-Hadoop solution offers high availability, security and reliability. The primary goal of the engine was to replace proprietary software that specialized in speed, stability and scalability. The engine was built on Apache Spark 2.3.1 with rich security features, but open-source SQL on Spark left a performance gap. To close this gap, the following optimization strategies were incorporated:

Custom Spark drivers: By introducing a custom Spark driver that functions as a long-running service, the engine was able to support a high volume of concurrent Spark sessions, thus increasing the elasticity of the system and providing isolated session management capabilities. This reduced connection and initialization time from 20 seconds to one second.
Transparent data cache layer: Scanning the vast Hadoop Distributed File System (HDFS) cluster would introduce instability and degrade the performance of the cluster. To tackle this, a transparent data cache layer with well-defined cache life cycle management was introduced. This enabled automatic caching of the most-accessed datasets in the SQL-on-Hadoop cluster. The Spark runtime would automatically expire cached data as soon as it detected that the upstream data had been refreshed, prompting the cache to be rebuilt. This quadrupled the scan speed.
Re-bucketing: Most of eBay’s data tables have a bucket layout and are more suitable for “sort-merge joins,” since they eliminate the need for additional shuffle-and-sort operations. But what happens if tables have different bucket sizes or the join key is different from the bucket key? The new SQL-on-Hadoop engine can handle this scenario with the “MergeSort” or “Re-bucketing” optimization feature.
Bloom filter indexes: This feature allows data pruning on columns not involved in buckets or partitions for faster scanning. Bloom filter indexes are independent from the data files so they can be applied and removed as needed.
Original design manufacturer (ODM) hardware: The full effect of software optimizations can be realized only if the hardware has the capacity to support them. At eBay we design our own ODM hardware and were able to leverage a custom-designed SKU with a high-performance CPU and memory specs tailored for the SQL-on-Hadoop Spark engine, providing maximum computing capability.
ACID (Atomicity, Consistency, Isolation, Durability) support: Traditional commercial databases with ACID properties provide CRUD (Create, Read, Update, Delete) operations. The current Hadoop open-source framework lacks ACID properties, supporting only Create and Read operations. Not providing Update and Delete operations would have required thousands of analysts and engineers to learn and adopt heavy Hadoop ETL technology to perform their day-to-day functions. This was a deal-breaker for eBay, so Apache Spark in the SQL-on-Hadoop tool was enhanced, using Delta Lake, to fully support Update and Delete operations, including in complex statements that involve joins.
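To make the Bloom filter index idea above more concrete, here is a minimal sketch in plain Python. It is not eBay's implementation, which the article does not describe. The file names, the seller IDs, the filter size and the helper names (BloomFilter, build_index, files_to_scan) are all assumptions made for illustration. The sketch shows the two properties the article mentions: the index lets a scan skip files that cannot contain a value in a non-bucketed column, and it lives apart from the data files, so it can be added or dropped without touching them.

```python
import hashlib
from typing import Dict, Iterable, List


class BloomFilter:
    """Fixed-size Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str) -> Iterable[int]:
        # Derive num_hashes bit positions by salting a SHA-256 hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_index(files: Dict[str, List[str]]) -> Dict[str, BloomFilter]:
    """Build one filter per data file over the values of a single column."""
    index: Dict[str, BloomFilter] = {}
    for file_name, column_values in files.items():
        bloom = BloomFilter()
        for value in column_values:
            bloom.add(value)
        index[file_name] = bloom
    return index


def files_to_scan(index: Dict[str, BloomFilter], value: str) -> List[str]:
    """Return only the files whose filter says the value might be present."""
    return [name for name, bloom in index.items() if bloom.might_contain(value)]


# Hypothetical data files, each holding the seller IDs found in that file.
data_files = {
    "part-0000": ["seller_17", "seller_42"],
    "part-0001": ["seller_99"],
    "part-0002": ["seller_03", "seller_58"],
}

index = build_index(data_files)  # kept separately from the data files themselves
candidates = files_to_scan(index, "seller_99")
print(candidates)
```

A file that really contains the value is always among the candidates. A file that does not contain it is usually skipped, but may occasionally be scanned because of a false positive, which costs time and never affects correctness.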

SQL Authoring Tool

The other key component of migrating from a vendor warehousing solution to an on-premises one was building a SQL authoring capability. This was particularly crucial since it is the interface between users and the underlying warehouse. Analytics users at eBay can be categorized into “data analysts” and “data engineers.” Analysts primarily deal with SQL code, while engineers work with Python and Java in addition to SQL. With these two roles in mind, a new SQL authoring solution was introduced, which would eventually go on to offer more capabilities like metadata management, advanced analytics and toolkits for efficient data operations.

The tool was designed to provide SQL development capability, and it leverages Apache Livy for connectivity to the underlying Hadoop data platform and two-way transfer of data. It also provides a centralized toolkit to support the development life cycle for engineers.
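For readers unfamiliar with Apache Livy, the sketch below shows roughly what connectivity through it involves: Livy exposes Spark over a REST interface, where a client opens a session and then posts statements to it. This is an illustration only, since the article does not describe how eBay's authoring tool calls Livy. The host name, the table name and the helper function names are hypothetical. The code only builds the HTTP requests and does not send them.

```python
import json
import urllib.request

# Hypothetical Livy endpoint; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.internal:8998"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to the Livy server."""
    return urllib.request.Request(
        url=f"{LIVY_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request() -> urllib.request.Request:
    # POST /sessions starts an interactive session; "sql" is one of Livy's session kinds.
    return build_request("/sessions", {"kind": "sql"})


def run_statement_request(session_id: int, sql: str) -> urllib.request.Request:
    # POST /sessions/{id}/statements submits code to run inside that session.
    return build_request(f"/sessions/{session_id}/statements", {"code": sql})


session_req = create_session_request()
statement_req = run_statement_request(0, "SELECT COUNT(*) FROM listings")  # hypothetical table

print(session_req.get_method(), session_req.full_url)
print(statement_req.get_method(), statement_req.full_url)
```

Passing either request to urllib.request.urlopen would send it to a running Livy server. A real client would also poll the statement until its result is available.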

The platform has become a powerful SQL authoring solution that also provides the following capabilities:

Data exploration;
Interactive query analytics;
Data visualization;
Collaboration and sharing;
Schedule as a service; and
Operation as a service.

Making the Move

Batch Workloads

With over 30,000 production tables on the vendor system, the first task at hand was to determine the critical tables and establish a clear scope for migration. This provided the opportunity to clean up several legacy and unused tables, resulting in a final set of 982 production tables to be migrated to Hadoop. All other tables were retired.

Personal Databases and Interactive Queries

The vendor solution provided a custom feature that allowed users to create personal databases which they could use as sandbox environments for testing temporary use cases. However, several users leveraged this feature for their day-to-day analyses and reporting, so it was a critical component of what needed to be migrated to Hadoop without any loss of data or functionality. Transferring these databases posed a few challenges:

From a platform perspective, there were thousands of such databases. Each of them served a unique use case into which the platform team had no visibility, so the team had to rely on each user to judge how critical a database was and whether to migrate it to the new environment.
Several databases were created for one-time use. Some of the users owning these databases had left the company, posing a communication and outreach complication.

To tackle these challenges, the platform team built a self-service tool that allowed users to migrate their personal database tables from the vendor system to the new Hadoop platform. To address the problem of completion, the platform and migration teams analyzed the full list of databases on the system and eliminated all tables that had gone unused for at least 365 days. The remaining databases were further analyzed for active usage, leading to a smaller set of databases to be migrated.
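The 365-day rule can be sketched in a few lines of Python. This is a hedged illustration, not the team's actual tooling: the PersonalTable record, the split_by_staleness function and the sample tables are hypothetical, and treating a table with no recorded access as stale is an assumption the article does not state.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass
class PersonalTable:
    database: str
    name: str
    last_accessed: Optional[date]  # None means no access was ever recorded


def split_by_staleness(
    tables: List[PersonalTable], today: date, max_idle_days: int = 365
) -> Tuple[List[PersonalTable], List[PersonalTable]]:
    """Split tables into (keep, retire); idle for max_idle_days or more means retire."""
    cutoff = today - timedelta(days=max_idle_days)
    keep: List[PersonalTable] = []
    retire: List[PersonalTable] = []
    for table in tables:
        # Assumption: a table with no access record is treated as unused.
        if table.last_accessed is None or table.last_accessed <= cutoff:
            retire.append(table)
        else:
            keep.append(table)
    return keep, retire


# Hypothetical personal tables.
tables = [
    PersonalTable("p_alice", "weekly_gmv", date(2020, 12, 1)),
    PersonalTable("p_bob", "one_off_test", date(2019, 6, 1)),
    PersonalTable("p_carol", "scratch", None),
]

keep, retire = split_by_staleness(tables, today=date(2021, 1, 1))
print([t.name for t in keep])
print([t.name for t in retire])
```

The tables in the keep list would then go through the further active-usage analysis described above before any migration.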

Customer Training and Support

This effort involved changing systems for users from a wide range of roles and responsibilities with diverse skills. Most users were accustomed to the vendor-provided ecosystem and found it challenging to reimagine their day-to-day tasks in a new environment. To facilitate the transition, the project team worked to address gaps in skills and to help users develop familiarity with the new system before encouraging them to make the move.

The migration team established a dedicated track to develop training materials for various levels of user experience and technical complexity, delivered not only through wiki pages and training videos but also through live Zoom classes and training drives with full-fledged course offerings tailored for users across the globe.

Several other learning and development avenues were established through custom office hours (for each topic of concern for users at large); dedicated Slack channels; and 24x7 level 1 and level 2 support with clearly defined SLAs for ticket acknowledgement and resolution. By the end of the project, close to 2,000 migration-related tickets had been resolved.

In some cases where a team needed dedicated support for the migration, temporary working groups (dubbed “tiger teams”) – including engineers from all levels of the stack – worked closely with the user team. Together they navigated the deep end of their processes and dependencies on the vendor platform, rebuilding them in Hadoop to offer a similar – if not better – performance and experience.

Velocity and Agility

Another key aspect of making the move was the speed at which eBay’s engineering teams could roll out features and upgrades as users were making the transition. eBay is a highly agile organization, and this migration was achievable within the planned timelines largely because of how quickly the product matured.

For example, the SQL-on-Hadoop engine served as the system that several users directly interacted with on a daily basis. As people were making the transition and exploring the new system, they discovered several features that were essential to their activities. The platform team was able to gather these requirements periodically; design and develop the change; and roll it out in production within one or two sprints on average. This not only gave users the confidence that their requirements were being addressed, but also helped mature the product quickly.

What Success Meant for eBay

By eliminating vendor dependency, this migration puts eBay in full control of its innovation, preparing users for the future of analytics. Some of the biggest wins include:

A highly resilient analytics data platform modernized across the tech stack;
An open-source solution with full flexibility over features, supporting tools and operations; and
No vendor dependency or constraints to the platform’s growth and innovation.

This monumental migration not only resulted in significant cost savings, but also helps drive eBay’s renewed tech-led reimagination strategy. Most of all, it exemplifies the strength of collaboration and reaffirms the technical expertise at eBay for significant undertakings of this size and scale in the future.

[Chart: Vendor-to-in-house tool migration stats]
[Chart: Vendor-based vs. in-house ecosystems at eBay]
