When eBay moved its data warehouse technology to Hadoop, an open-source platform, we had to build a key application for data creation, management and interaction: a tool that would provide the functionality necessary to further our innovations and tech-led reimagination.
Moving to Hadoop meant that standard vendor tools for tasks like data modeling, ETL (extract, transform, load) and SQL editing were no longer an option. What started as a project to streamline data engineering processes resulted in an application that gave more flexibility and power to data warehouse users — while also increasing eBay’s ability to govern data.
eBay has created a new web-based data warehouse tool that serves as a SQL editor, data catalog and data engineering toolkit. It expands the features available to data platform users, such as pivots, charts and scheduled queries, keeping more data activities on the platform while simultaneously increasing transparency for data governance. eBay’s development of the application shows that the challenges created by a move to open-source technology can result in a reengineering of workflows with cross-functional benefits.
The Engineering Tool Gap
With the move to Hadoop, 100+ data engineers had to transition from robust vendor applications with GUIs to text-based terminal workflows and scattered “toolkits.” The terminal windows produced unstructured query results that were hard to read, and made it difficult to monitor the status and history of thousands of jobs. The distributed toolkits were inefficient and made it hard to onboard new engineers.
The team tasked with solving this issue started by interviewing engineers about their pain points and doing industry research. The alpha version of the tool focused on SQL development, with its first generation leveraging Apache Livy to handle connections to the platform and the exchange of queries and results. Next, the scattered toolkits were gathered and given a UI. At this point, the new tool became the web-based alternative for Hadoop data development and quickly gained its first 100 users.
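As a rough illustration of what a Livy-based integration involves (a sketch, not eBay's actual code; the host and session handling here are placeholders), a client can create a session through Livy's REST API, submit a statement and poll for the result:

```python
import time
import requests

LIVY = "http://livy.example.com:8998"  # placeholder host, not eBay's endpoint

# Create an interactive session; Livy 0.5+ supports kind="sql".
session = requests.post(f"{LIVY}/sessions", json={"kind": "sql"}).json()
session_url = f"{LIVY}/sessions/{session['id']}"

# Wait for the session to spin up before submitting work.
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(1)

# Submit a statement, then poll until its result is available.
stmt = requests.post(f"{session_url}/statements",
                     json={"code": "SELECT 1 AS ok"}).json()
stmt_url = f"{session_url}/statements/{stmt['id']}"
while (result := requests.get(stmt_url).json())["state"] != "available":
    time.sleep(1)

print(result["output"]["data"])  # query results returned by Livy
```

A production tool would of course add error handling and session reuse; the point is that everything goes over plain HTTP, which is what makes a web-based editor on top of Hadoop practical.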
The New Challenge
With the tool offering web-based SQL editing and execution as well as simple code repository management, the team was asked whether it could support all 2,000+ data warehouse users at eBay, the internal customers of the 100 engineers already using the platform.
To transform the new SQL authoring tool into an enterprise-level analytics tool for analysts, data scientists and anyone with a need to create a custom report, the feature set needed to expand. The new target features included:
- Data exploration,
- Data visualization, and
- Query scheduling.
Adding this new functionality and combining these audiences on the same platform ended up being a win-win.
Data Exploration
With a data warehouse at eBay’s scale, entity relationship diagrams (ERDs) are virtually unusable. eBay once covered three walls of an office with a data warehouse ERD that was out of date as soon as the pages came off the printer.
Instead, eBay relies on well-documented metadata and sample queries. The new solution integrated metadata from handwritten definitions; SQL logs (sample SQL and sample join queries); the Hive metastore; profiled data; and outputs from data modeling and ETL processes into a single data catalog.
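To make that integration concrete, a unified entry in such a catalog might look roughly like the following sketch; the field names are illustrative, not eBay's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One table's view in a unified catalog, merging several metadata sources."""
    table_name: str
    subject_area: str                       # from handwritten definitions
    column_definitions: dict[str, str]      # from handwritten definitions
    hive_schema: dict[str, str]             # from the Hive metastore
    partition_columns: list[str]            # from the Hive metastore
    sample_queries: list[str] = field(default_factory=list)        # mined from SQL logs
    sample_joins: list[str] = field(default_factory=list)          # mined from SQL logs
    profile_stats: dict[str, float] = field(default_factory=dict)  # from data profiling
    upstream_tables: list[str] = field(default_factory=list)       # from ETL/modeling lineage
```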
This web-based data catalog can answer questions like:
- Business questions: What is in the table? What is the subject area? What are the columns and their meaning?
- Usage questions: Is this table used? By whom/what? Which organizations use this data?
- SQL writing questions: Which columns have partitions? Which tables can be joined to this table? What example queries include this table? (One such lookup is sketched just after this list.)
- Operational questions: When was the table last loaded? How long does it normally take? When should it be loaded by? What other tables have to finish before this one can start being loaded? Where is the code that corresponds to this table?
- Data quality questions: What data quality checks are performed on the table? Did it pass these checks recently?
- Ownership: Who manages the data?
- Change history: Have any significant changes impacted this table recently? When will they be resolved?
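As one small example of where answers like these come from, the partition question maps onto metadata that Hive itself exposes. A minimal sketch using the open-source PyHive client, with a placeholder host and a hypothetical orders table:

```python
from pyhive import hive  # pip install pyhive

# Placeholder connection details; not eBay's actual cluster.
conn = hive.Connection(host="hive.example.com", port=10000, database="default")
cursor = conn.cursor()

# "Which columns have partitions?" -- the table's partition spec appears
# in the "# Partition Information" section of DESCRIBE FORMATTED.
cursor.execute("DESCRIBE FORMATTED orders")
for row in cursor.fetchall():
    print(row)
```

The catalog pre-computes and indexes this kind of information so users never have to run such lookups by hand.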
This solution brought the data operations and business metadata together. Everyone is now using the same reference. This provides more transparency to data warehouse users and gives engineers the extra drive to make sure that the metadata is accurate and understandable.
With a data catalog so vast, the traditional folder-browse style of metadata exploration is impractical, so the tool focuses on metadata search. A natural language search bot was added as an assistant that can answer user questions on any page of the tool. For example, in a SQL writing pane, a pop-up assistant can tell you if the Hadoop cluster you’re using is going to be down for maintenance, highlight which tables or queries you can leverage for your business questions, help with SQL syntax or provide tips for using the tool.
Visualization: Keep Users in the Tool for Increased Transparency
A common data workflow involves writing a bit of code and then exporting data into Excel to create a graph or pivot for a quick understanding or quality check on the code’s results. To better support users, a quick pivot and chart option is now available right in the results panel of the SQL editor. This is easier for users and also ensures that data stays on-platform — a win for data governance.
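Conceptually, the in-tool pivot does to a result set what the following pandas sketch does; the column names are made up for illustration:

```python
import pandas as pd

# A query result set as it might land in the editor's results panel.
results = pd.DataFrame({
    "category": ["Shoes", "Shoes", "Watches", "Watches"],
    "region":   ["US", "EU", "US", "EU"],
    "sales":    [120.0, 95.5, 300.25, 210.0],
})

# Quick pivot: categories as rows, regions as columns, summed sales as values.
pivot = results.pivot_table(index="category", columns="region",
                            values="sales", aggfunc="sum")
print(pivot)
```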
If users need more than a quick graphical view, they can create a query with variables (static or dynamic) that display as an entry form. They can leverage Spark, PySpark, Python, RSpark and more in a single notebook. The platform team is continuously inspired by how users leverage these notebooks. For example, users have created how-to pages with images and sample SQL; dashboards with entry forms; and detailed analysis with plots and descriptions of findings.
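A parameterized query of this kind can be thought of as a template whose variables become form fields. Here is a minimal sketch using Python's string.Template, with hypothetical table and variable names (the actual tool's rendering is not shown in this article):

```python
from string import Template

# Template whose placeholders would render as an entry form in the notebook.
query = Template("""
    SELECT item_id, SUM(gmv) AS gmv
    FROM   sales_daily                 -- hypothetical table
    WHERE  site = '$site'
      AND  dt BETWEEN '$start_date' AND '$end_date'
    GROUP  BY item_id
""")

# Values a user would type into the form; a dynamic variable could instead
# default to something computed, such as yesterday's date.
print(query.substitute(site="US", start_date="2024-01-01", end_date="2024-01-31"))
```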
Scheduling: More Features for Users, More Data Governance
Prior to the new SQL authoring tool, data warehouse power users who needed complex schedules had to create their own virtual machines and install software to manage dependencies. Because the new tool is tied into data warehouse operations, users can now schedule their own code to start only after a data warehouse table has loaded. They can even set up schedules with dependencies on their personally owned tables, creating a seamless chain of dependencies from the moment data is first copied from its source through to a user-created table used in a report.
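A dependency-aware schedule might be expressed roughly like the following sketch; this is a hypothetical spec for illustration, not the tool's actual configuration format:

```python
# Hypothetical schedule spec: each job starts only after its dependencies load.
schedule = [
    {
        "job": "build_my_summary_table",          # user-owned ETL step
        "runs": "insert_summary.sql",
        "after": ["dw.orders", "dw.customers"],   # wait for warehouse loads
    },
    {
        "job": "refresh_weekly_report",
        "runs": "report_query.sql",
        "after": ["my_db.my_summary_table"],      # depends on the user's own table
    },
]
```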
A traditional data warehouse management team may shudder at giving users the power to be virtual ETL developers, but providing this ability increases transparency. Based on how the new SQL authoring tool is used, the data governance team can observe which user-created tables likely carry heavy business dependencies. In addition, data governance tasks have been tied into these features: for example, users can only set up dependencies on a personally owned table if that table's metadata is registered in the same data catalog used by every other table in the data warehouse.
Results and Next Steps
This web-based tool has already served approximately 5,000 users, averaging more than 2,400 active users per month. Moving forward, the platform team will continue to refine the features available to power users and increase automation for ETL engineers. In addition, 10-20% of the tool’s users have no SQL skills and either run code written by others or use the platform as a reference. The team looks forward to engaging this audience to uncover opportunities to simplify the analytics experience for everyone.