Unicorn—Rheos Remediation Center

By: Lubin Liu

Rheos is eBay's near-line data platform, and it owns thousands of stateful machines in the cloud. The Rheos team has been building and enhancing the automation system over the past two years. However, it’s time to unify the past work and build a modern, automatic remediation system, Unicorn.

Unicorn includes a centralized place to manage operation jobs, the ability to handle alerts more efficiently, and the ability to store alert and remediation history for further analysis.

Managing thousands of stateful machines in the cloud is challenging. Operation tasks come from two categories:

Hardware failures in the cloud happen everyday
Stateful application-specific issues when scale grows

A centralized place to manage operation jobs

Previously, Rheos had tools to remediate clusters. Many tools are scripts and run in separate places. This kind of dispersive tool-based remediation has several limitations:

Hard to develop and maintain
Conflicts between tools
Learning curve for new support candidates
Not truly automated, costing human efforts

Handling alerts more efficiently

Rheos already delivers a bunch of metrics and alerts. If we can collect these inputs in a centralized place and associate them with other external information, a rule-based remediation service could heal the clusters automatically.

Handling the alert with automation flow will:

Reduce human efforts
Reduce the service-level agreement (SLA) response time to handle alerts

Storing alerts and remediation history for further analysis

Automation can’t cover all the issues at the very beginning and may also introduce new issues if the algorithm is not good enough. Storing the history in a centralized place could help the team improve the system.

On the other hand, by analyzing the alert history, the team can try to find new automation scope.

Challenge

Building such a centralized remediation center is not easy, for at least two reasons.

Building a generic model to cover known experiences that is easy to extend—Building a specific tool to solve a specific problem is a very straightforward solution. On the contrary, identifying a common pattern and model to apply to all the existing experiences is much harder. To avoid building another group of tools in the future, the pattern proposed also needs to be extensible.
Avoiding over-automation with good efficiency—Automation sometimes can be dangerous:

- Doing something not allowed: the actions done by an automation system must be limited and under control. Doing anything that is unexpected may make things worse.
- Doing something too quickly: the concurrence of an automation system must be limited, especially for a stateful cluster. It needs time to sync the state back after an operation and must be easy to roll back when bugs are detected in the automation system.

Serializing all the tasks may avoid part of the over-automation issues. But it’s better to define a model to guarantee efficiency. In a big complicated system, handling tasks in a more efficient way offers a better SLA.

Design

The illustration above shows the overview design of Unicorn.

Unicorn defines a key resource called “Event” to abstract all the issues it needs to solve. Event has three sources:

Alerts sent out from upstream: this is the most common source
Manually dropped by an admin: a way to manually operate in an urgent case
Periodic remediation job: some routine check tasks

The event controller reads the events periodically and triggers workflow to handle events. This module needs to consider how to avoid over-automation and guarantee efficiency. The workflow engine handles the workflow life cycle and sends back the event status to the event controller. Each workflow may invoke some external dependency services, such as the configuration management system and underlying provisioning, deployment, monitoring, and remediation (PDMR) systems to do the remediation logics.

The reporter module sends out a summary/detail report to subscribed targets.

Event

Model

Name	Type	Description
ID	Long	Unique ID to index each event
TYPE	String	The issue type that this event need to handle
GROUP_ID	String	The event in the same group need to be handled sequentially
STATUS	Enum	Status life cycle of the event
LABELS	Map	Information to describe the issue
PAYLOAD	String	Information output during the processing
PRIORITY	Int	How urgent this event is; a bigger value means more urgent
FLOW_ID	Long	The ID of the workflow to handle this event
TIMESTAMP	Long	The timestamp of this event to become valid. Maybe a future time for a delay event.
TIME_TO_LIVE_MS	Long	The lifetime of this event after become valid
OWNER	String	The source of this event
RETRY_COUNT	Int	How many times has this event been attempted to handle
PROCESS_TIMESTAMP	Long	The timestamp that the workflow begins to handle the event
REFERENCE_ID	String	The ID that is defined by user for querying
LOG	String	The processing log

State Machine

The state of each event defines the file cycle of it.

Emit: Event is first dropped in the state store
Processing: Event is picked by the event controller and handling by the workflow engine
Locked: Another event in the same group is processing and marked the remaining as locked
Finished: Event has been processed successfully
Skipped: Event has been processed, but for some reason, it has reached the final state. Usually this means that no further action needs to be taken
Failed: Event has been processed, but failed. Need to take a look ASAP
Ignored: Event is not considered in the current scope

Event Controller

GroupId

Unicorn introduces the concept of “GroupID” to achieve isolation between events. The semantic of GroupId is as follows:

The events in the same group must be handled sequentially
The events in different group could be handled concurrently

Even this field is totally open to self-definition; the common case is to use the physical cluster ID. From the past operation experience, we found that it is safer to avoid handling multiple nodes in one stateful cluster.

Based on the GroupId field, the event controller works as follows to pick events in each round:

List all the group
Iterate all the groups, check whether there is already one event in this group in “Processing” state
If yes, skip the current group. If no, pick one event and kick off processing
Wait for the next round

Priority

The next question to answer is how to determine which group/event should be picked first. That’s the reason Unicorn introduces the concept of “Priority.” The semantic of priority is as follows:

The group with the higher priority event will always be scanned first
The event with the higher priority in one group will always be picked first

The priority is related to event type, i.e., the same type of issue has the same priority. The priority of each event type is not defined in Unicorn. When a new event type is introduced, the user must define this field.

Rate Control

To avoid over-automation, Unicorn introduces several concepts to control the rate:

Max processors: The max concurrent events in the“Processing” state. When there are equal or larger than the max processors events in the processing state, the scan process will stop, even if there are still groups that need to be scanned. “Vip priority threshold” could break this rule.
Rate control window: The max event count could be handled within some sliding window for this specific event type.
VIP priority threshold: The group whose highest priority event exceeds this threshold could break the “Max Processors” limitation, i.e., even if the concurrent processing events already reach the max processors, the high priority event could also be processed. Note that this rule never breaks the “GroupId” limitation.

Workflow

The detail workflow defers between event types. However, most of the workflow in Unicorn follows the above common pattern:

Issue exists stage: check whether this issue still exists. This stage could help to avoid do some action duplicated.
Query metadata stage: query metadata based on input. In most of the cases, the input only contains the index information, like cluster id, node id, etc. For detail information, it needs to be queried in the workflow.
Do action stage: this stage is event-type specific and when it needs to retry, it should return to “Issue exists stage.”
Verify action stage: the implementation of this stage may also be event-type specific, but it is a good pattern to verify what has been done in this flow.
Finish stage: update the status and sync state to the event.

Implementation

NAP (NuData automation platform)

NAP is the RESTful framework contributed by the NuData team. Unicorn highly relies on the features provided by NAP.

Workflow in Unicorn

Unicorn workflows leverage the NAP workflow engine. Basically, Unicorn implements separate tasks and build workflows through configuration files.

Each task may fail with many reasons, which will create multiple branches in a workflow diagram. All the tasks in Unicorn have an output variable called “flowStatus.” “flowStatus” describes the result of the current task, and Unicorn leverages the “SwitchBy” semantic in the NAP workflow engine to determine what should be the next task. For the detailed information of the current task, Unicorn outputs to the “log” field of the event.

Besides basic workflow, Unicorn has a lot of scheduled workflows. The scheduler in NAP is much more powerful than the ScheduledExecutorService in Java and even supports a crontab style. This feature is very useful for Unicorn to control daily and weekly jobs.

RESTful services in Unicorn

Unicorn RESTful services all leverage the NAP micro-service RESTful framework. The services in Unicorn could be divided into two categories.

Resource CRUD

There are two main resources in Unicorn: event and alert. Event is the key resource in Unicorn, and Unicorn manages the whole life cycle of it, including create, get, update, and delete. The alert interface is opened for the alerting system to inject through a webhook, and the main logic in AlertHandler transfers alerts to a standard event.

Note that in current implementation, the GroupId of the event, which is transformed from an alert, is also injected in the AlertHandler. The GroupId that Unicorn chooses is the clusterId.

As an automation system, Unicorn also exposes an API to query the clusters that were recently handled. This is useful for a support person as well as for generating reports.

Workflow resources also fall into this type, but the NAP framework already covered this part.

Admin APIs

This group of APIs help admin to debug issues. NAP has two amazing interfaces:

Thread dump: returns the thread dump info of the current service. Useful for deadlock analysis and thread hang issue debugging.
Logs: streams the log file content. Useful for detailed debugging.

Portal in Unicorn

The Unicorn portal is totally a leveraged NAP config-driven UI framework. The portal of Unicorn is a standard operation that focuses on resource query and status update.

To show the overall status of the system, the main page of Unicorn shows some statistical information. For example:

The total event type supported currently
The count of clusters handled recently
How many events are handled successfully recently
How many events are handled failed recently

Sample workflows

Disk sector error flow

Kafka depends on the stability of the disk. An HDD with high media sector error will impact the Kafka application and cause the following two main issues:

Traffic drop dramatically
High end-to-end latency

When Unicorn receives a media sector error alert, it follows the flow shown on image above. Two steps should be highlighted:

Alert firing: check with the monitoring system to see whether this issue still exists.
Ready for replace: removing the old node and bringing back a new node with the same broker ID is the best solution to solve this issue. However, when the in-sync replica (ISR) is not in a healthy state, it is dangerous to replace the node directly. In this case, Unicorn will only restart the Kafka process to temporary slow down the impact.

This flow is a standard issue caused by the underlying hardware failure.

Kafka partition reassign flow

Each Kafka broker holds some number of topic-partitions. When the traffic in one cluster increases dramatically, Rheos will increase the virtual machine count in that cluster. However, Kafka will not move the existing topic-partition to the new brokers automatically. Guaranteeing the balance of traffic keeps the cluster stable and gets better throughput.

The PartitionReassignEvent is dropped periodically. Unicorn remediate the traffic imbalance issue with a greedy algorithm. The main process of this event workflow is as follows:

Pick one broker with the most partitions
Pick one broker with the least partitions
Generate a partition reassign plan and kick off the process with partition reassign tool provided by Kafka
Wait and check until the reassign finishes

This flow is solves a standard issue caused by application-specific design.

Reporter

It’s important to know what operations have been done by an automation system. Rheos periodically sends out an operation report.

In Unicorn, the reporter is implemented with the scheduled workflow provided by NAP. Most of the reports it uses today are related to event statistical information. To serve a different target receiver, Unicorn makes the reporter pluggable with the following configurations:

Handler: the class that implements the reporter logic
SchedulerPattern: the pattern to send out the report, in a crontab style
ReportWindowHour: the time window of this report
EventTypeSubscribed: the event type list that the report will include
Receiver: the receiver list that will receive this report
Type: the protocol of this report to send out; currently, only email is supported

A developer and a manager may need different granularities and different dimension reports. Making this module pluggable provides such kind of capability.

On production

Unicorn has run on production for several months. This section introduces the statistical information based on production data.

Event types

Currently, 10 event types are supported on production. A brief description and the percentage of each event type in the total event count are as follows:

Event Type	Percentage	Description
MediaErrorDisk	0.06%	Remediate the node with high media sector error
RollingRestart	0.02%	Restart the process one by one and guarantee state synced
NodeDown	66.89%	Reboot node or replace to bring down virtual machine back
Replace	0.37%	Delete the legacy node and set up a new one with same ID and configurations
BehaviorMmLagHighFor5Min	20.00%	Remediate the high mirrormaker lag caused by network shake
PartitionReassign	4.79%	Reassign the partition distribution within one cluster
LeaderReelection	4.78%	Reelect a proper leader for some specific topic partition
ReadonlyDisk	0.31%	Remediate the node when the disk is in read-only state
HighDiskUsageFor5Min	0.09%	Clean up the disk usage when the occupation is high
StraasAgentHeal	1.76%	Auto heal the PRMR system agent

The following observations are based on the table above:

Virtual machine down and network shake contribute the most alerts
Application-specific flows take an important part in Unicorn
Disk failure causes a lot of issues in Rheos

Event status

The distribution of the three final event status (Finished, Failed, Skipped) for each event type is as follows:

Event Type	Finished	Failed	Skipped
MediaErrorDisk	50%	9%	41%
RollingRestart	36%	1%	63%
NodeDown	2%	3%	95%
Replace	20%	79%	1%
BehaviorMmLagHighFor5Min	0.04%	0.06%	99.9%
PartitionReassign	11%	1%	78%
LeaderReelection	70%	1%	29%
ReadonlyDisk	81%	6%	13%
HighDiskUsageFor5Min	18%	3%	79%
StraasAgentHeal	76%	8%	16%
Total	6%	2%	92%

Several observations based on the above table:

Most of the events are marked as “Skipped.” When the problem is solved once, the other related events waiting in the queue will not need to take any action. Handling false alerts is an important advantage of Unicorn.
Disk-related issues could be remediated efficiently.
Replace flow has the highest failed rate. As the final step of many flows, replace flow suffers a lot of underlying system failures.

Event Count

A simple count of handled events after deploying to production is shown in below image:

Several observations from this diagram:

Each week, Unicorn will scan thousands of events.
After enabling new features, the handled events count increased, obviously.
When some issues were solved, the events that need to be handled decreased.

Event timestamp

Two concepts to show the efficiency of Unicorn are as below:

Time to response (Blue color on the chart): processing_timestamp - creation_timestamp, the time between Unicorn received the event and start processing timestamp
Time to resolve (Red color on the chart): last_modified_timestamp - processing_timestamp, the time that Unicorn to mark an event as “Finished” or “Failed.”

Several observations from this chart:

Event with rate control (NodeDown and BehaviorLagHighFor5Min) and low priority (LeaderReelection and PartitionReassign) have a bigger response time
Urgent events, including disk related and manually dropped events, have a small response time: under 5 minutes
Most of the resolve time is around 10 minutes
BehaviorMMLagHighFor5Min needs some sampling time; we may need to enhance the algorithm

Summary

Automation is the key to saving effort and providing a better SLA response. Unicorn saves the support efforts, shortens the SLA to remediate urgent issues, and guarantees the stability of stateful clusters in Rheos. From a technical perspective, the contributions of Unicorn are as follows: provides an event-driven solution to modeling the past Rheos support experiences and easy to extend for new issues; defines the concept of GroupId and Priority to isolate conflicting operations and make sure there is acceptable efficiency; and some best practices based on NAP may provide an example to other tools to manage clusters running on C3.

Unicorn needs to be continuously enhanced in Rheos daily work. New issues and new flows continue to be found from going through the alert list and from daily support work. It is also very important to add intelligence to Unicorn in the future to handle brand new issues automatically.

Tags: Cloud, Data Infrastructure and Services