Getting started with Spark Pools in Azure Synapse
In my latest video blog, I discuss getting started with the newly Generally Available Spark Pools in Azure Synapse, another great option for data engineering and preparation, data exploration, and machine learning workloads.
Without going too deep into the history of Apache Spark, I'll start with the basics. In the early days of Big Data workloads, which laid the groundwork for the machine learning and deep learning behind advanced analytics and AI, we would use a Hadoop cluster and move all these datasets across disks, and the disks were always the bottleneck in the process. So the creators of Spark said, hey, why don't we do this in memory and remove that bottleneck? They developed Apache Spark as an in-memory data processing engine, a faster way to process these massive datasets.
When the Azure Synapse team set out to offer the best possible data solution for all different kinds of workloads, Spark gave them an option for customers who were already familiar with the Spark environment, so they included it as part of the complete Azure Synapse Analytics offering.
Behind the scenes, the Synapse team manages many of the components you'd find in open-source Spark, such as:
- Apache Hadoop YARN - for managing the resources of the clusters where the data is being processed
- Apache Livy - for job orchestration
- Anaconda - a package manager, environment manager, Python/R data science distribution and a collection of over 7500 open-source packages for increasing the capabilities of the Spark clusters
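To give a feel for what "job orchestration" means for Livy, here is a minimal sketch of the JSON body that open-source Apache Livy accepts on its `POST /batches` endpoint to submit a Spark job. Synapse manages Livy for you, so you would not normally build this by hand, and the way Synapse exposes job submission may differ. The helper name `build_livy_batch_request`, the script path, and the job name are all hypothetical, and nothing is sent over the network.

```python
import json


def build_livy_batch_request(file, name, args=None,
                             num_executors=2, executor_memory="4g"):
    """Build the JSON body for open-source Livy's POST /batches endpoint.

    The field names (file, name, args, numExecutors, executorMemory) come
    from the Apache Livy REST API; the default values are illustrative only.
    """
    return {
        "file": file,                      # path to the Spark application to run
        "name": name,                      # display name for the batch session
        "args": list(args or []),          # command-line arguments for the app
        "numExecutors": num_executors,     # number of executors to request
        "executorMemory": executor_memory, # memory per executor
    }


# Hypothetical script path and job name, for illustration only.
payload = build_livy_batch_request(
    file="abfss://container@account.dfs.core.windows.net/jobs/prepare_data.py",
    name="prepare-data",
    args=["--date", "2021-01-01"],
)

# This is the body a client would send to Livy; here we only print it.
print(json.dumps(payload, indent=2))
```

Livy hands a request like this to YARN, which allocates the executors on the cluster, which is how the first two components in the list above fit together.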
I hope you enjoy the post. Let me know your thoughts or questions!