Service Oriented or Performance Oriented

We have seen a big push toward a Service Oriented Architecture (SOA) across the IT industry over the last decade. Many companies, including eBay, have made big investments in developing SOA frameworks, evolving their applications, introducing governance processes, and breaking up monolithic systems to take on a more service-oriented nature. The benefits are many and well understood, but adopting a SOA also has its potential pitfalls.

For web-facing applications, the decoupling and isolation along functional boundaries may lead to an increase in latency, a decrease in throughput, and consequently a negative impact on site-speed (check out Hugh’s blog entries on how we define, measure and improve site-speed at eBay). In other words, a service-oriented architecture may not be the same as a performance-oriented architecture (POA).

If your application must fulfill stringent site-speed requirements, then a loosely coupled SOA, while beneficial in terms of system maintenance, management and evolution, may turn out to be unsatisfactory nonetheless. One might argue that performance is lacking because the system wasn’t decomposed in the best possible way, i.e. that the lines were drawn inappropriately; but that just reinforces the point. Where you draw the lines between a more service-oriented and a more performance-oriented architecture can impact your bottom line.

How can service orientation limit site-speed? In my experience, it comes down to the data path. Just as data flow optimizations matter at the micro level in computer architecture, they are becoming increasingly important at the macro level when we optimize for site-speed across complex distributed applications. The latencies encountered down the memory and storage hierarchies can quickly add up, and consequently many threads end up sitting idle, waiting for data. Adding disaster recovery (DR) capabilities and ensuring loads are balanced across data centers only exacerbate the problem.

What can we do? Again, a look at the evolution of computer architecture can provide some guidance. Caching, pipelining, prediction, prefetching and out-of-order execution are some of the key milestone achievements, and these techniques can be applied at the macro level as well. When you decompose a system into services, rather than focusing on a pure functional partitioning, you should take the data flow into consideration and examine any data dependencies across functions. At the same time, think of caching, pipelining, prediction, prefetching, memoization, asynchronous I/O, event-driven processing and fork/join frameworks as tools in your optimization toolbox. You will find that services become more coarse-grained due to optimizations along the data path. In other words, two functions that logically represent separate services may end up being coupled by data dependencies such that you end up not just colocating them, but encapsulating them in a single service. You may also find that the invocation of one service can, as a side effect, produce valuable results more efficiently than if you had to invoke a second service to produce the same results.
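To make this a bit more concrete, here is a minimal Java sketch of two of these tools, memoization and prefetching, applied at the service level. The names (DataPathSketch, UserProfile, fetchUserProfile) are hypothetical placeholders for a remote lookup, not part of our platform’s actual API:

```java
import java.util.Map;
import java.util.concurrent.*;

// Minimal sketch: memoization plus asynchronous prefetching along the data path.
// UserProfile and fetchUserProfile are hypothetical stand-ins for a remote lookup
// sitting behind a separate service.
public class DataPathSketch {

    private final ExecutorService ioPool = Executors.newFixedThreadPool(16);

    // Memoize per-request lookups so two functions sharing a data dependency
    // do not trigger the same remote call twice.
    private final Map<String, CompletableFuture<UserProfile>> profileCache =
            new ConcurrentHashMap<>();

    public CompletableFuture<UserProfile> profile(String userId) {
        return profileCache.computeIfAbsent(userId,
                id -> CompletableFuture.supplyAsync(() -> fetchUserProfile(id), ioPool));
    }

    // Prefetch data we predict will be needed, overlapping the I/O with other work.
    public void prefetch(String userId) {
        profile(userId); // fire-and-forget; the result is joined later only if actually used
    }

    private UserProfile fetchUserProfile(String id) {
        // placeholder for a remote call to a user data service
        return new UserProfile(id);
    }

    record UserProfile(String userId) {}
}
```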

Another way to look at it is to approach the big picture with a service oriented mindset while tuning critical subsystems with a performance oriented mindset. We don’t suggest you replace a SOA with a POA across the board, but rather pay attention to islands of functionality that might benefit from a POA in an archipelago where SOA dominates.

Let me illustrate with a concrete example. We have spent the last two years optimizing a content delivery platform for real-time user messaging. It is used for advertising, recommendations and loyalty programs across the eBay site. In fact, much of the content you see on the new home page is delivered by this platform. The system is one of the largest at eBay and handles over 2.5 billion calls on a busy day. It relies on numerous data sources and provides two primary capabilities: user segmentation and dynamic, personalized content generation.
[Figure: User Messaging]
The data consumed by the platform can be categorized into the following:

  • Contextual: request-specific (placements, geo, user agent)
  • Demographic: user-specific (age, gender)
  • Behavioral: user-specific (items purchased)
  • Configuration: campaigns and segments
  • Services: on-demand, usually content-specific (ads, recommendations, coupons)

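For illustration, these categories could be modeled as a simple tag, for example when labeling lookups or cache entries by their source. This is just a hypothetical sketch, not the platform’s actual data model:

```java
// Hypothetical sketch of the data categories above as a simple enum.
public enum DataCategory {
    CONTEXTUAL,    // request-specific: placements, geo, user agent
    DEMOGRAPHIC,   // user-specific: age, gender
    BEHAVIORAL,    // user-specific: items purchased, recent site activity
    CONFIGURATION, // campaigns and segment definitions
    SERVICE        // on-demand providers: ads, recommendations, coupons
}
```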
Contextual information is passed to the system as part of the incoming request (typically from a front-end tier) and describes the context for a set of placements. A placement is a reserved area on a web page where a creative is displayed. The demographic and behavioral data is user-specific and includes buyer and seller properties. Behavioral attributes also include real-time data based on recent site activity, e.g. a user’s recently viewed items. Campaigns are deployed to configure the system. They contain segment definitions and may be prioritized or scheduled to run for a specific period of time. All configuration data is cached in memory and refreshed periodically.
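The last point, configuration that is cached in memory and refreshed periodically, can be sketched in a few lines of Java. Campaign, loadCampaigns() and the five-minute refresh interval are assumptions for illustration only:

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of periodically refreshed configuration, assuming a
// hypothetical Campaign type and a loadCampaigns() call to the config store.
public class CampaignConfigCache {

    private final AtomicReference<List<Campaign>> campaigns =
            new AtomicReference<>(List.of());
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public CampaignConfigCache() {
        refresh(); // load once at startup
        scheduler.scheduleAtFixedRate(this::refresh, 5, 5, TimeUnit.MINUTES);
    }

    // Readers never block on a refresh; they always see the last good snapshot.
    public List<Campaign> current() {
        return campaigns.get();
    }

    private void refresh() {
        campaigns.set(loadCampaigns());
    }

    private List<Campaign> loadCampaigns() {
        // placeholder for a call to the campaign configuration store
        return List.of();
    }

    record Campaign(String id, int priority) {}
}
```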

The platform also relies on a set of on-demand services that act as data providers. The campaign configuration dictates which service is required for a given type of message. While colocation of some of these would further improve system performance, most services currently run as separate systems.

In selecting a message for a placement on a page, the request handler may evaluate numerous segment rules to determine which message is the most appropriate for the current user in the given context. This evaluation depends on data lookups that may incur a significant cost in terms of latency. After a message has been selected, a content generator creates the actual message creative which may require similar data lookups. For example, a JSP template may rely on a recommender system to deliver item data related to a user’s recent purchase.
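One way to picture this, purely as a sketch, is a set of segment rules evaluated against a shared, lazily populated data context, so that any lookups triggered during evaluation remain available to the content generator later in the request. SegmentRule, DataContext and MessageSelector are hypothetical names:

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of selecting a message for a placement: each segment
// rule is evaluated against a shared DataContext so the data lookups it
// triggers can be reused by the content generator afterwards.
interface SegmentRule {
    boolean matches(DataContext ctx);   // may trigger lookups via ctx
    Message message();
}

interface DataContext {
    // lazily fetches and caches user or contextual attributes for this request
    Object lookup(String key);
}

record Message(String campaignId, String templateId) {}

class MessageSelector {
    // Rules are assumed to be ordered by campaign priority.
    Optional<Message> select(List<SegmentRule> rules, DataContext ctx) {
        return rules.stream()
                .filter(rule -> rule.matches(ctx))
                .map(SegmentRule::message)
                .findFirst();
    }
}
```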

The platform typically receives a batch request for as many as 15 placements, so the request handler runs a number of concurrent tasks and data is shared across tasks as much as possible. Content generator tasks are prioritized so the ones that depend on remote data providers are dispatched as early as possible (out-of-order execution). Contextual data is used to predict when to prefetch correlated user data based on common usage patterns. Certain asynchronous tasks may continue to run even after the response has been delivered. For example, a task that writes back campaign usage metrics to a persistent store doesn’t unnecessarily hold up the main request handler.
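A rough sketch of that request-handling pattern might look like the following, assuming hypothetical Placement and Creative types; the point is the early dispatch of remote-dependent tasks and the fire-and-forget metrics write-back, not the specific API:

```java
import java.util.List;
import java.util.concurrent.*;

// Rough sketch of the pattern described above: content generator tasks that
// depend on remote providers are dispatched first, and metrics write-back
// runs asynchronously after the response has been assembled.
// Placement, Creative, generate(...) and writeUsageMetrics(...) are
// hypothetical names, not the platform's actual API.
public class BatchRequestHandler {

    private final ExecutorService workers = Executors.newWorkStealingPool();

    public List<Creative> handle(List<Placement> placements) {
        // Dispatch remote-dependent placements as early as possible so their
        // I/O overlaps with the cheaper, locally served placements.
        List<Placement> ordered = placements.stream()
                .sorted((a, b) -> Boolean.compare(b.needsRemoteData(), a.needsRemoteData()))
                .toList();

        List<CompletableFuture<Creative>> tasks = ordered.stream()
                .map(p -> CompletableFuture.supplyAsync(() -> generate(p), workers))
                .toList();

        List<Creative> response = tasks.stream().map(CompletableFuture::join).toList();

        // Fire-and-forget: writing usage metrics must not hold up the response.
        CompletableFuture.runAsync(() -> writeUsageMetrics(response), workers);
        return response;
    }

    private Creative generate(Placement p) { return new Creative(p.id()); }

    private void writeUsageMetrics(List<Creative> creatives) { /* persist counters */ }

    record Placement(String id, boolean needsRemoteData) {}
    record Creative(String placementId) {}
}
```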

Numerous platform optimizations are aimed at reducing latency by preventing multiple lookups for similar data in the same request. The data path from segmentation to content generation is highly optimized and even customized for certain use-cases. For example, if the rule engine determines the user falls into a segment for which a coupon is to be issued, it fetches any coupon data used for content generation as a side-effect of segment evaluation. And the task that generates the coupon’s HTML creative reads the data from memory.
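The coupon example can be sketched as a per-request scratchpad shared between segment evaluation and creative generation, so the coupon service is called at most once per request. CouponData and fetchCoupon are hypothetical placeholders:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the coupon flow: segment evaluation fetches the coupon
// data it already needs and stashes it in a per-request scratchpad, so the
// creative-generation task reads it from memory instead of calling the
// coupon service a second time.
public class CouponFlowSketch {

    // Per-request scratchpad shared between segmentation and content generation.
    private final Map<String, Object> requestScope = new ConcurrentHashMap<>();

    boolean userQualifiesForCoupon(String userId) {
        CouponData coupon = fetchCoupon(userId);          // remote lookup, done once
        if (coupon != null) {
            requestScope.put("coupon:" + userId, coupon); // side effect: keep it in memory
            return true;
        }
        return false;
    }

    String renderCouponCreative(String userId) {
        // Reads the data fetched during segment evaluation; no second remote call.
        CouponData coupon = (CouponData) requestScope.get("coupon:" + userId);
        return "<div class=\"coupon\">" + coupon.code() + "</div>";
    }

    private CouponData fetchCoupon(String userId) {
        return new CouponData("SAVE10"); // placeholder for the coupon service call
    }

    record CouponData(String code) {}
}
```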

While a SOA perspective may encourage us to decouple segmentation from content generation so that each component is potentially reusable in other applications, the combined design based on a POA, with an optimized data path, clearly delivers better results in terms of site-speed. The improvements in real-time content delivery are reflected in the reduced response times measured for the eBay home page. The two measurements below are from late 2009 (dotted curve) and late 2010 (black curve). The two curves represent the cumulative distribution of the duration of calls handled by the platform, based on a large sample taken at the same time of day on the same day of the week (loads vary substantially depending on the time of day, and some days are busier than others).
[Figure: Response Times]
The chart above indicates that, at the 90th percentile, response times have improved by 30% over the course of the year. The speed-up is all the more significant given that the response times for critical data providers haven’t changed much and the system remains I/O bound. You may be wondering why response times at the 20th percentile have doubled. This is due to the campaign configuration: the home page configuration in late 2009 wasn’t as dependent on data services as the one in late 2010, and thus some of the calls weren’t as I/O bound. The samples for the black curve above were collected during a period when almost all requests from the home page contained at least one placement that depended on a remote data source.

I’ll leave you with two things. First, a SOA is good, but in certain cases a POA is better. And second, if time is of the essence, an awareness of the data flow and data dependencies in your system can help you decide where to begin adopting more of the latter.