How to Gain Up to 9X Speed on Apache Spark Jobs

Oct 27

Are you looking to speed up your Apache Spark jobs? How does a performance gain of up to 9X sound? Today I’m excited to tell you how engineers at Microsoft achieved that on HDInsight Apache Spark clusters.

If you’re unfamiliar with HDInsight, it’s Microsoft’s premium managed offering for running open source workloads on Azure. You can run things like Spark, Hadoop, Hive, and LLAP, among others. You create clusters, spin them up when you need them, and spin them down when you’re not using them.

The big news here is the recently released preview of HDInsight IO Cache, a new transparent data caching feature that provides customers with up to 9X performance improvement for Spark jobs without an increase in costs.

There are many open source caching products in the ecosystem: Alluxio, Ignite, and RubiX, to name a few big ones. IO Cache is based on RubiX. What differentiates RubiX from comparable caching products is that it uses SSDs and eliminates the need for explicit memory management, whereas the others reserve operating memory for caching the data.

Because SSDs typically provide more than 1 gigabyte per second of bandwidth, and because the cache also leverages the operating system’s in-memory file cache, there is enough bandwidth to feed big data compute engines like Spark. With no memory set aside for caching, Spark can run optimally, handle bigger memory workloads, and perform better overall, since jobs that read data from remote cloud storage (the dominant architecture pattern in the cloud) get faster.
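
To make the idea of transparent caching a bit more concrete, here is a minimal Python sketch of a read-through cache. It is only an illustration of the general pattern, not how RubiX or IO Cache is actually implemented. The names `ReadThroughCache` and `fake_remote` are hypothetical, and a plain dictionary stands in for blocks kept on local SSD.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy read-through cache: a local store in front of a slow remote fetch."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote    # stand-in for a remote cloud storage read
        self._local: Dict[str, bytes] = {}   # stand-in for data kept on local SSD
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        # Serve from the local copy when we already have one.
        if path in self._local:
            self.hits += 1
            return self._local[path]
        # Otherwise fetch from the remote store and keep a copy for next time.
        self.misses += 1
        data = self._fetch_remote(path)
        self._local[path] = data
        return data


def fake_remote(path: str) -> bytes:
    # Hypothetical remote store; a real one would make a network call.
    return f"contents of {path}".encode()


cache = ReadThroughCache(fake_remote)
cache.read("data/part-0000")   # first read goes to the remote store
cache.read("data/part-0000")   # repeat read is served locally
print(cache.hits, cache.misses)  # prints: 1 1
```

The caller only ever calls `read`, so nothing in the job has to change to benefit from the cache. That is the sense in which the caching is transparent.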

In benchmark tests comparing a Spark cluster with and without IO Cache, the team ran 99 SQL queries against a 1 terabyte dataset and saw as much as a 9X performance improvement with IO Cache turned on.
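
If you want to run a similar comparison on your own cluster, a rough Python sketch like the one below could help. It times each query through a runner function you supply, then computes the per-query speedup between two runs. The helper names `time_queries` and `speedups` are my own, and the timings in the example are invented placeholders, not measurements from the Microsoft benchmark.

```python
import time
from typing import Callable, Dict, Iterable


def time_queries(run_query: Callable[[str], object],
                 queries: Iterable[str]) -> Dict[str, float]:
    """Run each query once and record its wall-clock time in seconds."""
    timings: Dict[str, float] = {}
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        timings[query] = time.perf_counter() - start
    return timings


def speedups(baseline: Dict[str, float],
             cached: Dict[str, float]) -> Dict[str, float]:
    """Per-query speedup: baseline time divided by time with the cache on."""
    return {
        query: baseline[query] / cached[query]
        for query in baseline
        if query in cached and cached[query] > 0
    }


# Placeholder timings in seconds, invented purely for illustration.
baseline = {"q1": 90.0, "q2": 40.0}
with_cache = {"q1": 10.0, "q2": 20.0}

result = speedups(baseline, with_cache)
print(result)                # {'q1': 9.0, 'q2': 2.0}
print(max(result.values()))  # 9.0
```

On a real cluster you would pass something like `lambda q: spark.sql(q).collect()` as `run_query`, assuming a PySpark session named `spark`, and run the same query list once with the cache off and once with it on. Keep in mind that a figure like "as much as 9X" describes the best case, so expect a spread of speedups across your queries.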

Let’s face it, data is growing everywhere, and the need to process that data increases every day. We also want to get faster and closer to real-time results. To do this, we need to think more creatively about how to improve performance by tuning or trying a new approach, rather than falling back on the age-old recipe of throwing hardware at the problem.

This is a great way to get more out of existing hardware and help it run more efficiently. So, if you’re running HDInsight, try this out in a test environment. It’s as simple as a check box (that’s off by default); spin up your cluster, check the box to enable IO Cache, and see what performance gains you can achieve with your HDInsight Spark clusters.
