Scalable and Nimble Continuous Integration for Hadoop Projects

Experimentation

The Experimentation Platform at eBay runs around 1500 experiments that are responsible for processing over hundreds of terabytes of reporting data contained in millions of files using a 2500+ node Hadoop infrastructure and consuming thousands of computing resources. The entire report generation process contains well over 200 metrics. It enables millions of customers to experience small and large innovations that enable them to buy and sell products in various countries in diverse currencies and using diverse payment mechanisms in a better way everyday.

The Experimentation Reporting Platform at eBay is developed using Scala, Scoobi, Apache Hive, Teradata, MicroStrategy, InfluxDB, and Grafana, submitting hundreds of Map/Reduce (M/R) jobs to a Hadoop infrastructure. The platform contains well over 35,000 statements and over 25,000 lines of code in around 300 classes.

Problem

We use Jenkins to set up continuous integration (CI), and one of the challenges for humongous projects involving Hadoop technologies is slow-running unit and integration tests. These test cases run one or several M/R jobs in a local JVM, and that effort involves a considerable amount of set-up and destruction time. As additional automated test cases are added to increase code coverage, they have adverse impacts on overall build completion time. One solution that can improve CI run time is to run automated test cases in a distributed and concurrent manner.

This technique helped improve CI running time of the Experimentation Reporting Platform at eBay from 90 minutes to ~10 minutes, thus paving the way for a truly scalable CI solution. This CI involves more than 1,800 unit test cases written in Scala using ScalaTest and Mockito.

Solution

Jenkins provides support for multi-configuration build jobs. A multi-configuration build job can be thought of as a parameterized build job that can be automatically run on multiple Jenkins nodes with all the possible permutations of parameters that it can accept. They are particularly useful for tests where you can test your application using a single build job but under a wide variety of conditions (different browsers, databases, and so forth). A multi-configuration job allows you to configure a standard Jenkins job and specify a set of slave servers for this job to be executed on. Jenkins is capable of running an instance of the job on each of the specified slaves in parallel, passing each slave ID as a build parameter and aggregating JUnit test results into a single report.

The problem boils down to this. There are a number of Jenkins slave nodes, and we have to split all JUnit tests into batches, run all batches in parallel using the available slaves, and aggregate all the test results into a single report. The last two tasks (parallel execution and aggregation) can be solved using built-in Jenkins functionality, namely, multi-configuration jobs (also known as matrix builds).

Setting up a multi-configuration project on Jenkins

There are a number of different ways that you can set up a distributed build farm using Jenkins, depending on your operating systems and network architecture. In all cases, the fact that a build job is being run on a slave (and how that slave is managed) is transparent for the end-user: the build results and artifacts will always end up on the master server. It is assumed that the Jenkins master server has multiple slave nodes configured and ready for use. A new multi-configuration build is created as shown below.

Project configuration

This set-up results in the creation of a multi-configuration project on Jenkins that requires additional configuration before it can be functional.

Source code management

Assuming the project is set up on Git, you can provide the Git SSH URL and build trigger settings as shown here.

Configuration matrix

Now comes the important part that allows you to choose the list of slave machines on which an individual batch of test cases can be executed. In this example, five machines are selected (slave4 is not visible) on which the build will be triggered.

Build

In this set-up, a master machine (the part of the distributed CI job that runs on the Master node) dictates the entire run. A multi-configuration build runs as-is on every slave (including the master) machine. Every build receives a $slaveId as a build parameter that allows the script to be written appropriately. The build configuration part of CI involves invocation of a shell script. This shell script performs the following activities.

Determine a list of test cases classes that need to be executed. Once the list is obtained, it is shuffled. This process occurs only on the master.
Send the complete list to all the slaves.
Split the complete list of test cases into batches equal to number of slave machines.
Execute each batch on a node (slaves or master)
The master node waits for each slave node to complete execution.
Each part of distributed CI job runs on the slaves and the master, but the console log of each is available only on the master. As a result, the computation of the number of total number of tests occurs on the master.

The following shell script performs the above listed tasks.

#!/bin/bash
function determine_list_of_test_cases() {
find . -name *Test*.scala | rev | cut -d '/' -f1 | rev | cut -d '.' -f1 | sort -R > alltests.txt
}
function copy_list_to_slaves() {
scp -i /home/username/.ssh/id_rsa alltests.txt username@epci-slave1-ebay.com:/usr/local/jenkins-ci/workspace/epr-staging-distributed-tests-only/slaveId/slave1/experimentation-reporting-platform
scp -i /home/username/.ssh/id_rsa alltests.txt username@epci-slave2-ebay.com:/usr/local/jenkins-ci/workspace/epr-staging-distributed-tests-only/slaveId/slave2/experimentation-reporting-platform
scp -i /home/username/.ssh/id_rsa alltests.txt username@epci-slave3-ebay.com:/usr/local/jenkins-ci/workspace/epr-staging-distributed-tests-only/slaveId/slave3/experimentation-reporting-platform
scp -i /home/username/.ssh/id_rsa alltests.txt username@epci-slave4-ebay.com:/usr/local/jenkins-ci/workspace/epr-staging-distributed-tests-only/slaveId/slave4/experimentation-reporting-platform/
echo "Copied list to slaves"
}
function split_tests_into_batches() {
counts=`wc -l alltests.txt | cut -d ' ' -f1`
total_ci_nodes=5
batch=$((($counts+$total_ci_nodes-1)/$total_ci_nodes))
split -l $batch alltests.txt split.
counter=0
for f in split.*; do
awk '{print $0","}' $f | perl -ne 'chomp and print' > $f.$counter
counter=$((counter+1))
done
}
function wait_until_test_list_comes_from_master() {
while [ ! -f alltests.txt ]
do
sleep 2
done
}
function wait_until_build_completes_on_slaves() {
while [[ ! -f slave1.complete ]] || [[ ! -f slave2.complete ]] || [[ ! -f slave3.complete ]] || [[ ! -f slave4.complete ]]
do
sleep 2
done
}
function cleanup() {
if [ -f alltests.txt ];
then
rm alltests.txt
fi
if ls *.complete 1> /dev/null 2>&1;
then
rm *.complete
fi
if ls split.a* 1> /dev/null 2>&1;
then
rm split.a*
fi
}
function count_test_cases_on_master() {
totalPass=`cat ../../../../configurations/axis-slaveId/$1/builds/$BUILD_NUMBER/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f3 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
totalFailures=`cat ../../../../configurations/axis-slaveId/$1/builds/$BUILD_NUMBER/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f5 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
totalErrors=`cat ../../../../configurations/axis-slaveId/$1/builds/$BUILD_NUMBER/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f7 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
totalSkipped=`cat ../../../../configurations/axis-slaveId/$1/builds/$BUILD_NUMBER/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f9 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
echo "**************************** $1 ********************************"
echo "Number of unit tests executed successfully: $totalPass"
echo "Number of unit tests with failures: $totalFailures"
echo "Number of unit tests with errors: $totalErrors"
echo "Number of unit tests skipped: $totalSkipped"
case $1 in
master)
tests_master=$totalPass
;;
slave1)
tests_slave1=$totalPass
;;
slave2)
tests_slave2=$totalPass
;;
slave3)
tests_slave3=$totalPass
;;
slave4)
tests_slave4=$totalPass
;;
esac
}

function count_test_cases_on_slave() {
totalPass=`cat ../../../../configurations/axis-slaveId/$1/lastStable/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f3 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
totalFailures=`cat ../../../../configurations/axis-slaveId/$1/lastStable/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f5 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
totalErrors=`cat ../../../../configurations/axis-slaveId/$1/lastStable/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f7 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
totalSkipped=`cat ../../../../configurations/axis-slaveId/$1/lastStable/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f9 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
echo "**************************** $1 ********************************"
echo "Number of unit tests executed successfully: $totalPass"
echo "Number of unit tests with failures: $totalFailures"
echo "Number of unit tests with errors: $totalErrors"
echo "Number of unit tests skipped: $totalSkipped"
case $1 in
master)
tests_master=$totalPass
;;
slave1)
tests_slave1=$totalPass
;;
slave2)
tests_slave2=$totalPass
;;
slave3)
tests_slave3=$totalPass
;;
slave4)
tests_slave4=$totalPass
;;
esac
}

function execute_tests() {
echo "Executing a batch of $batch test classes, each with multiple test cases"
buildCommand="mvn clean -U -DfailIfNoTests=false -Dtest=`cat $my_batch` test"
echo $buildCommand
eval $buildCommand
}
function report_build_complete_to_master() {
touch $1.complete
scp -i /home/username/.ssh/id_rsa $1.complete username@epci-master-ebay.com:/usr/local/jenkins-ci/.jenkins/jobs/epr-staging-distributed-tests-only/workspace/slaveId/master/experimentation-reporting-platform
}
cleanup
export MAVEN_OPTS="-Xms700m -Xmx4g -XX:MaxPermSize=2g"
cd experimentation-reporting-platform
ls -l
my_batch=split.aa.0
case $slaveId in
master)
determine_list_of_test_cases
copy_list_to_slaves
split_tests_into_batches
execute_tests
wait_until_build_completes_on_slaves
count_test_cases_on_master "master"
count_test_cases_on_slave "slave1"
count_test_cases_on_slave "slave2"
count_test_cases_on_slave "slave3"
count_test_cases_on_slave "slave4"
totalTests=$(($tests_master+$tests_slave1+$tests_slave2+$tests_slave3+$tests_slave4))
echo "*****************************************************************************"
echo " Total number of unit tests executed successfully across: $totalTests"
echo "*****************************************************************************"
;;
slave1)
wait_until_test_list_comes_from_master
split_tests_into_batches
my_batch=split.ab.1
execute_tests
report_build_complete_to_master $slaveId
;;
slave2)
wait_until_test_list_comes_from_master
split_tests_into_batches
my_batch=split.ac.2
execute_tests
report_build_complete_to_master $slaveId
;;
slave3)
wait_until_test_list_comes_from_master
split_tests_into_batches
my_batch=split.ad.3
execute_tests
report_build_complete_to_master $slaveId
;;
slave4)
wait_until_test_list_comes_from_master
split_tests_into_batches
my_batch=split.ae.4
execute_tests
report_build_complete_to_master $slaveId
;;
esac
cleanup

Run

Once the multi-configuration project is created, it can be run as follows.

Each configuration runs a subset of automated test cases on a separate CI node (machine). This allows the entire CI job to complete execution in a distributed manner. This solution is scalable, and as additional automated test cases are added, the execution speed can be maintained by simply adding additional CI slave nodes.

Conclusion

Currently, the batch of unit test case classes are distributed randomly across CI machines. There might be a batch that has slower running test cases, thereby slowing down the execution time of the CI build. Better task allocation times can be achieved by recording test class execution times in any relational database so as to analyze them and construct robust batches of unit test classes with uniform execution times.

Distributed and concurrent execution of unit test cases has allowed the Experimentation Reporting Platform at eBay to build a CI solution with 1,800+ unit test cases with more than 70% statement and branch coverage. The test cases are a mix of Hadoop tests and non-Hadoop unit tests. Each commit triggers a distributed build that finishes in a timely manner (approximately 10 minutes), allowing the committer to quickly verify each commit.

References

“Speeding Up Hadoop Builds Using Distributed Unit Tests,” by Ilya Katsov
Jenkins: The Definitive Guide Book by John Ferguson Smart

Tags: Coding Practices, Developer Tools, Hadoop, OSS, Testing