Performance Tuning on Quartz Scheduler

By: Sheldon Shao

Quartz is a popular open source job scheduling library for Java applications. eBay uses it in many projects to schedule jobs.

It performs well under light load. Under heavy load, however, Quartz runs into trouble: many triggers misfire, the executor threads can’t get tasks, and hundreds of jobs get stuck in the triggers table.

So we had to tune its performance. This article describes how we narrowed down the issue and optimized Quartz.

What’s the problem?

Quartz jobs can’t be scheduled and executed.
Many jobs in the simple_triggers table are waiting for execution, but only a few are in fired_triggers. The simple triggers have REPEAT_INTERVAL set, which means they are repeating jobs.
TIMES_TRIGGERED indicates how many times a trigger has fired.
The log file contains a large number of “Handling the first 20 triggers that missed their scheduled fire-time …” entries.
The number of database sessions increased, and many sessions are waiting on “SELECT * FROM qrtz_LOCKS WHERE SCHED_NAME = '{SCHED_NAME}' AND LOCK_NAME = 'TRIGGER_ACCESS' FOR UPDATE”.

What is a misfire?

Before we can understand why this happens, let’s look at what a misfire is. Here is the explanation from Quartz’s official website:

A misfire occurs if a persistent trigger “misses” its firing time because of the scheduler being shutdown, or because there are no available threads in Quartz’s thread pool for executing the job. The different trigger types have different misfire instructions available to them. By default they use a ‘smart policy’ instruction – which has dynamic behavior based on trigger type and configuration. When the scheduler starts, it searches for any persistent triggers that have misfired, and it then updates each of them based on their individually configured misfire instructions. When you start using Quartz in your own projects, you should make yourself familiar with the misfire instructions that are defined on the given trigger types, and explained in their JavaDoc. More specific information about misfire instructions will be given within the tutorial lessons specific to each trigger type.

For example, consider a job that should be triggered at a 10-second interval, and treat “0s” as the starting timestamp. At “0s,” the trigger was acquired by QuartzSchedulerThread and passed to ExecuteThread to execute, and NEXT_FIRE_TIME was set to “10s.” Unfortunately, the job took more than 60 seconds instead of finishing within 10 seconds, so it missed the fire times at “10s,” “20s,” .. “60s.”

After “70s,” the MisfireHandler finds that the trigger has misfired and recovers NEXT_FIRE_TIME to “80s.” That’s the “smart policy” instruction for repeating simple triggers.
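The timeline above can be checked with a little arithmetic. The sketch below is not Quartz code: the class and method names are made up for illustration, and it assumes a 60-second misfire threshold (which, to the best of our knowledge, is the default for org.quartz.jobStore.misfireThreshold) and a trigger that repeats indefinitely.

```java
public class MisfireTimeline {

    /** A trigger counts as misfired once its fire time lies more than the threshold in the past. */
    public static boolean isMisfired(long nextFireTimeSec, long nowSec, long thresholdSec) {
        return nowSec - thresholdSec > nextFireTimeSec;
    }

    /** First fire time on the trigger's schedule strictly after "now" (assumes now >= start). */
    public static long nextFireTimeAfter(long startSec, long intervalSec, long nowSec) {
        long elapsedIntervals = (nowSec - startSec) / intervalSec;
        return startSec + (elapsedIntervals + 1) * intervalSec;
    }

    public static void main(String[] args) {
        long interval = 10;      // seconds between fire times, from the example above
        long threshold = 60;     // assumed misfire threshold in seconds
        long nextFireTime = 10;  // NEXT_FIRE_TIME stored when the job started at 0s

        for (long now = 69; now <= 71; now++) {
            System.out.println("now=" + now + "s misfired="
                    + isMisfired(nextFireTime, now, threshold));
        }
        System.out.println("recovered NEXT_FIRE_TIME="
                + nextFireTimeAfter(0, interval, 71) + "s");
    }
}
```

Under these assumptions the “10s” fire time is first treated as a misfire just after “70s,” and the next slot on the 10-second schedule is “80s,” which matches the timeline described above.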

What’s the “TRIGGER_ACCESS” LOCK for?

Quartz supports clustering, so we can configure many instances within one cluster. It uses a database LOCK to coordinate UPDATEs on the triggers and fired_triggers tables. For MySQL, Quartz uses a standard row lock: “SELECT * FROM … FOR UPDATE”.

A graph helps to understand the TRIGGER_ACCESS LOCK.

If a new job stores something in the “triggers” table, it must obtain TRIGGER_ACCESS when lockOnInsert is true (the default).
QuartzSchedulerThread also needs to obtain the LOCK when it acquires a trigger and when it fires the trigger (triggersFired).
MisfireHandler obtains TRIGGER_ACCESS to recover the misfired triggers and update their NEXT_FIRE_TIME.

When lots of misfires occur, the system runs into a bad situation

We saw this problem in production many times. Here are the details.

One instance has only a few jobs executing.
Once misfires start occurring, reducing the number of instances will help the system to recover.

Based on the logs and database information, we tried to reproduce the problem locally using the following steps.

We set up a MySQL database locally.
We copied the MisfireExample from Quartz’s existing examples.
We changed the configuration to point Quartz at the MySQL database.
We modified the MisfireExample to support multiple instances, so that we could run several of them locally.
Every 500 ms, we generated a new trigger that repeats 5 times at a 3-second interval.

After making these changes and running five MisfireExample instances, we could easily reproduce the problem. The behavior was the same as in production:

A lot of triggers accumulated in the “simple_triggers” table.
Only a few jobs appeared in “fired_triggers”.
Lots of misfire information was printed in the console.
Many MySQL sessions were waiting on “SELECT * FROM qrtz_LOCKS WHERE SCHED_NAME = 'SCHED_NAME' AND LOCK_NAME = 'TRIGGER_ACCESS' FOR UPDATE”.
Stopping the storage of new triggers did not help the triggers recover.
Stopping 3 or 4 instances increased the number of fired triggers. The system went back to normal once more jobs were executed.

As described in the fifth setup step, the job generator produces only 2 triggers per second in one instance (one every 500 ms). Even though the generating frequency is very low, the system didn’t recover. That means StoreJobAndTriggers doesn’t play a key role in this scenario.

The problem is that the MisfireHandler and QuartzSchedulerThread compete for the “TRIGGER_ACCESS” LOCK. Each instance has one MisfireHandler and one QuartzSchedulerThread.

Also, the misfire messages in the printout appeared about one second apart. That means each batch of 20 rows took around 1 second to update.

Another fact is that QuartzSchedulerThread acquires only ONE trigger each time it obtains the “TRIGGER_ACCESS” LOCK. That is a fast operation compared with the slow MisfireHandler.

Here is a graph that shows why fewer instances did better than more instances once the system ran into the misfire problem.

Fewer instances mean that QuartzSchedulerThread has more chances to obtain the LOCK.
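A rough back-of-the-envelope model makes this concrete. It is only a sketch under strong assumptions: the lock is handed out round-robin to every MisfireHandler and QuartzSchedulerThread, each MisfireHandler holds it for about 1 second (the figure observed above), and each QuartzSchedulerThread holds it for a hypothetical 50 ms while taking one trigger. The class and method names are invented for this illustration.

```java
public class LockTurnModel {

    /** Time between two lock turns of the same QuartzSchedulerThread, in milliseconds. */
    public static long gapBetweenTurnsMillis(int instances, long misfireHoldMillis,
                                             long acquireHoldMillis) {
        // In one round, every instance's MisfireHandler and scheduler thread get one turn each.
        return instances * (misfireHoldMillis + acquireHoldMillis);
    }

    /** Triggers one instance can acquire per minute when it takes one trigger per turn. */
    public static long triggersPerMinutePerInstance(int instances, long misfireHoldMillis,
                                                    long acquireHoldMillis) {
        return 60_000L / gapBetweenTurnsMillis(instances, misfireHoldMillis, acquireHoldMillis);
    }

    public static void main(String[] args) {
        long misfireHold = 1_000; // about 1 second per batch of 20 rows, as observed
        long acquireHold = 50;    // hypothetical cost of acquiring a single trigger
        for (int instances : new int[] {1, 2, 5}) {
            System.out.println(instances + " instance(s): one turn every "
                    + gapBetweenTurnsMillis(instances, misfireHold, acquireHold) + " ms, "
                    + triggersPerMinutePerInstance(instances, misfireHold, acquireHold)
                    + " triggers/min per instance");
        }
    }
}
```

In this simplified model each scheduler thread waits longer for its turn as instances are added, so the executor threads of every instance sit idle for longer between triggers. Real lock scheduling in the database is not strictly round-robin, so the numbers are only indicative.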

How to optimize?

The above chart shows the test result of each optimization. We generated 500 enable/disable traffic jobs at once and started two instances of Quartz to process them. It took around 270 minutes to finish all the jobs when using the original code. But it took only 36 minutes with Quartz batch mode.

Using batch mode

Quartz supports a batch mode. With batch mode, the QuartzSchedulerThread can acquire several triggers at once, based on how many executor threads are available. In this mode, triggers are executed faster, and the number of fired triggers is the same as the total thread count of all instances.
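To see why batching matters for the lock, the sketch below counts how many acquire passes are needed to hand out a backlog of triggers. It assumes each pass requests the smaller of the available thread count and the maximum batch size, which is how we read the Quartz scheduler loop; the class name, method names, and the thread count of 10 are hypothetical. It ignores the separate lock acquisitions for firing and completing triggers.

```java
public class BatchAcquisition {

    /** Number of triggers requested in one pass of the scheduler thread. */
    public static int batchSize(int availableThreads, int maxBatchSize) {
        return Math.min(availableThreads, maxBatchSize);
    }

    /** Acquire passes needed to hand out all pending triggers (ceiling division). */
    public static int acquirePasses(int pendingTriggers, int batchSize) {
        return (pendingTriggers + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        int pending = 500; // same backlog size as the test described above
        int threads = 10;  // hypothetical executor thread count

        int defaultBatch = batchSize(threads, 1);     // one trigger per pass
        int tunedBatch = batchSize(threads, threads); // maxBatchSize = thread count

        System.out.println("batch of " + defaultBatch + ": "
                + acquirePasses(pending, defaultBatch) + " passes");
        System.out.println("batch of " + tunedBatch + ": "
                + acquirePasses(pending, tunedBatch) + " passes");
    }
}
```

Fewer passes mean fewer rounds of competition with the MisfireHandler for the same amount of work.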

The following is the signature of the method that creates the Quartz scheduler. We can set maxBatchSize and batchTimeWindow to enable batch mode.

public void createScheduler(String schedulerName,
        String schedulerInstanceId,
        ThreadPool threadPool,
        ThreadExecutor threadExecutor,
        JobStore jobStore,
        Map<String, SchedulerPlugin> schedulerPluginMap,
        String rmiRegistryHost, int rmiRegistryPort,
        long idleWaitTime, long dbFailureRetryInterval,
        boolean jmxExport, String jmxObjectName,
        int maxBatchSize, long batchTimeWindow)
        throws SchedulerException

We set maxBatchSize to the number of executor threads. The batchTimeWindow should be based on how many tasks are triggered in a given period; we set it to 1 second in our code.
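For schedulers that are configured through properties rather than through createScheduler, the same two settings can be expressed as shown in the sketch below. This is not the configuration used in this article: the helper class and the thread count of 10 are hypothetical, and only the batch-related keys are shown, so a real setup would still need the job store and data source properties. The Quartz configuration reference also says acquireTriggersWithinLock should be true when the batch size is greater than 1 with a JDBC job store, so the sketch sets it.

```java
import java.util.Properties;

public class BatchModeConfig {

    /** Builds only the batch-related Quartz properties; everything else is left out. */
    public static Properties batchModeProperties(int executorThreads, long batchTimeWindowMillis) {
        Properties props = new Properties();
        props.setProperty("org.quartz.threadPool.threadCount",
                Integer.toString(executorThreads));
        // Counterpart of maxBatchSize: at most one batch slot per executor thread.
        props.setProperty("org.quartz.scheduler.batchTriggerAcquisitionMaxCount",
                Integer.toString(executorThreads));
        // Counterpart of batchTimeWindow, in milliseconds.
        props.setProperty("org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow",
                Long.toString(batchTimeWindowMillis));
        // Recommended by the Quartz docs when acquiring more than one trigger per pass.
        props.setProperty("org.quartz.jobStore.acquireTriggersWithinLock", "true");
        return props;
    }

    public static void main(String[] args) {
        // 10 executor threads is a hypothetical value; 1000 ms is the 1-second window above.
        batchModeProperties(10, 1_000L)
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The resulting Properties object could then be merged with the rest of the scheduler configuration and passed to StdSchedulerFactory.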

Change the order of job completion

Let the job data update run before obtaining the lock. A Quartz executor thread needs to obtain the TRIGGER_ACCESS LOCK once a stage is completed, and it updates the job data and the state in the trigger table after it has obtained the lock. Updating job data takes a lot of time because the job data needs to be serialized and stored in the job detail table. Usually only one executor thread updates a given job’s data, so it isn’t necessary to do it within the LOCK.

When we moved the “updating job data” step into our own code, it reduced the time spent holding the lock. After this change, it took only 27 minutes to finish all 500 jobs. The following chart shows the change.
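The idea of this reordering can be sketched in a few lines. This is an in-process illustration, not the actual eBay or Quartz code: a ReentrantLock stands in for the database TRIGGER_ACCESS lock, a byte array stands in for the job detail table, and all names and the “EXECUTING” and “WAITING” state values are chosen for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.concurrent.locks.ReentrantLock;

public class StageCompletion {
    // In-process stand-in for the cluster-wide TRIGGER_ACCESS lock (hypothetical).
    private final ReentrantLock triggerAccess = new ReentrantLock();
    private byte[] storedJobData = new byte[0];
    private String triggerState = "EXECUTING";
    private boolean serializedUnderLock = false;

    /** The slow step: serialize the job data. Records whether the lock was held meanwhile. */
    private byte[] serialize(HashMap<String, String> jobData) {
        serializedUnderLock = triggerAccess.isHeldByCurrentThread();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(jobData);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public void completeStage(HashMap<String, String> jobData) {
        // Slow part first, without the lock: only this thread touches this job's data.
        storedJobData = serialize(jobData);
        // Fast part under the lock: only the trigger state changes here.
        triggerAccess.lock();
        try {
            triggerState = "WAITING";
        } finally {
            triggerAccess.unlock();
        }
    }

    public String triggerState() { return triggerState; }
    public int storedJobDataSize() { return storedJobData.length; }
    public boolean serializedUnderLock() { return serializedUnderLock; }

    public static void main(String[] args) {
        HashMap<String, String> jobData = new HashMap<>();
        jobData.put("stage", "2");
        StageCompletion completion = new StageCompletion();
        completion.completeStage(jobData);
        System.out.println("state=" + completion.triggerState()
                + ", serialized under lock=" + completion.serializedUnderLock());
    }
}
```

The lock is held only for the short state update, so other threads waiting on TRIGGER_ACCESS are no longer blocked by serialization.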

Reduce the context switch; execute the stages as much as possible

Our job has multiple stages, and each stage can run on any instance independently. The job data has to be stored in the database permanently, and the trigger state has to be updated after each stage completes. Executing all the stages in one executor thread, and so reducing lock usage, would be a good improvement.
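The effect on lock usage can be illustrated by counting lock acquisitions in a small sketch. It is a simplification with invented names: a ReentrantLock stands in for TRIGGER_ACCESS, each stage is a plain Runnable, and the sketch assumes the trigger state only needs to be updated once when all stages run in a single firing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class StageRunner {
    // In-process stand-in for the TRIGGER_ACCESS lock (hypothetical).
    private final ReentrantLock triggerAccess = new ReentrantLock();
    private int lockAcquisitions = 0;

    /** Updates the trigger state under the lock and counts the acquisition. */
    private void updateTriggerState() {
        triggerAccess.lock();
        try {
            lockAcquisitions++;
        } finally {
            triggerAccess.unlock();
        }
    }

    /** One firing per stage: the trigger state is updated after every stage. */
    public int runStagePerFiring(List<Runnable> stages) {
        for (Runnable stage : stages) {
            stage.run();
            updateTriggerState();
        }
        return lockAcquisitions;
    }

    /** All stages in one firing: the trigger state is updated once at the end. */
    public int runAllStagesInOneFiring(List<Runnable> stages) {
        for (Runnable stage : stages) {
            stage.run();
        }
        updateTriggerState();
        return lockAcquisitions;
    }

    public static void main(String[] args) {
        List<Runnable> stages = Arrays.<Runnable>asList(() -> { }, () -> { }, () -> { });
        System.out.println("per-stage firings: "
                + new StageRunner().runStagePerFiring(stages) + " lock acquisitions");
        System.out.println("single firing: "
                + new StageRunner().runAllStagesInOneFiring(stages) + " lock acquisitions");
    }
}
```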

Summary

Quartz uses a database lock in a cluster environment. With the regular configuration, jobs pile up under heavy load. Using batch mode can improve performance quite a bit, and reducing the time spent holding the lock helps as well.
