Intro to Azure Databricks Delta

If you know about or are already using Databricks, I’m excited to tell you about Databricks Delta. As most of you know, Apache Spark is the underlying technology for Databricks, so about 75-80% of all the code in Databricks is still Apache Spark. You get that super-fast, in-memory processing of both streaming and batch data types, as some of the founders of Spark built Databricks.

The ability to offer Databricks Delta is one big difference between Spark and Databricks, aside from the workspaces and the collaboration options that come native to Databricks. Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Spark and Databricks DBFS.

The core abstraction of Databricks Delta is an optimized Spark table that stores data as Parquet files in DBFS and maintains a transaction log that efficiently tracks changes to the table. So, you can read and write data stored in the Delta format using the same Spark SQL batch and streaming APIs that you use to work with Hive tables and DBFS directories.
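To make that concrete, here is a minimal sketch of the batch and streaming APIs against a Delta table, assuming it runs in a Databricks notebook with an attached cluster; the `/delta/events` path and the `event_id` column are made-up names for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative DataFrame; any Spark DataFrame works the same way.
events = spark.range(0, 1000).withColumnRenamed("id", "event_id")

# Batch write: the familiar DataFrameWriter API, with format("delta").
events.write.format("delta").mode("append").save("/delta/events")

# Batch read: same DataFrameReader API you would use for Parquet.
df = spark.read.format("delta").load("/delta/events")

# Streaming read: the Delta table also serves as a streaming source,
# so the transaction log feeds new files into Structured Streaming.
query = (spark.readStream.format("delta")
         .load("/delta/events")
         .writeStream.format("console")
         .start())
```

The point is that nothing about the API changes: swapping `parquet` for `delta` in the format string is all it takes, and the transaction log handles consistency underneath.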

With the addition of the transaction log, as well as other enhancements, Databricks Delta offers some significant benefits:

ACID Transactions – a big one for consistency. Multiple writers can simultaneously modify a dataset and see consistent views. Also, writers can modify a dataset without interfering with jobs reading the dataset.

Faster Read Access – automatic file management organizes data into large files that can be read efficiently. Plus, statistics enable speeding up reads by 10-100x, and data skipping avoids reading irrelevant information. This is not available in Apache Spark, only in Databricks.
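As a sketch of the file management side, Databricks Delta exposes an `OPTIMIZE` command that compacts small files and can co-locate related values so that file-level statistics let reads skip irrelevant files; the table name `events` and column `event_id` below are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files into larger ones, and co-locate rows with similar
# event_id values so data skipping can prune files on filtered reads.
spark.sql("OPTIMIZE events ZORDER BY (event_id)")
```

After this, a query such as `SELECT * FROM events WHERE event_id = 42` can use the min/max statistics kept per file to avoid opening files that cannot contain the value.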

Databricks Delta is another great feature of Azure Databricks that is not available in traditional Spark, further separating the capabilities of the products and providing a great platform for your big data, data science, and data engineering needs.
