Azure Data Factory - Data Flow

I’m excited to announce that Azure Data Factory Data Flow is now in public preview, and I’ll give you a look at it here. Data Flow is a new feature of Azure Data Factory (ADF) that allows you to develop graphical data transformation logic that can be executed as activities within ADF pipelines.

The intent of ADF Data Flows is to provide a fully visual experience with no coding required. Your Data Flow will execute on your own Azure Databricks cluster for scaled-out data processing using Spark. ADF handles all the code translation, Spark optimization and execution of the transformations in Data Flows, so it can process massive amounts of data very quickly.

In the current public preview, the Data Flow activities available are:

  • Joins – where you can join data from 2 streams based on a condition
  • Conditional Splits – allow you to route data to different streams based on conditions
  • Union – collecting data from multiple data streams
  • Lookups – looking up data from another stream
  • Derived Columns – create new columns based on existing ones
  • Aggregates – calculating aggregations on the stream
  • Surrogate Keys – adds a surrogate key column to the output stream, starting from a specific value
  • Exists – check to see if data exists in another stream
  • Select – choose columns to flow into the next stream that you’re running
  • Filter – you can filter streams based on a condition
  • Sort – order data in the stream based on columns
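
If you think better in code than in diagrams, here is a rough sketch of what a few of these transformations do to a stream of rows. To be clear, this is not ADF code and not what ADF generates for Spark; it is plain Python over lists of dictionaries, and the sample data, column names and helper function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample streams (not from ADF); each row is a dict.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 250.0},
    {"order_id": 2, "customer_id": 11, "amount": 40.0},
    {"order_id": 3, "customer_id": 10, "amount": 75.0},
    {"order_id": 4, "customer_id": 99, "amount": 10.0},
]
customers = [
    {"customer_id": 10, "name": "Contoso"},
    {"customer_id": 11, "name": "Fabrikam"},
]


def join(left, right, key):
    """Join: inner-join two streams on an equality condition."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def derived_column(stream, name, fn):
    """Derived Column: add a new column computed from existing ones."""
    return [{**row, name: fn(row)} for row in stream]


def conditional_split(stream, predicate):
    """Conditional Split: route rows to one of two streams."""
    matched = [row for row in stream if predicate(row)]
    rest = [row for row in stream if not predicate(row)]
    return matched, rest


def aggregate_sum(stream, group_key, value_key):
    """Aggregate: sum one column, grouped by another."""
    totals = defaultdict(float)
    for row in stream:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def surrogate_key(stream, name, start=1):
    """Surrogate Key: add an incrementing key, beginning at `start`."""
    return [{**row, name: i} for i, row in enumerate(stream, start=start)]


# Chain the steps the way you would wire boxes together on the design surface.
joined = join(orders, customers, "customer_id")
sized = derived_column(
    joined, "size", lambda r: "large" if r["amount"] >= 100 else "small"
)
large, small = conditional_split(sized, lambda r: r["size"] == "large")
totals = aggregate_sum(joined, "name", "amount")
keyed = surrogate_key(sorted(joined, key=lambda r: r["amount"]), "sk")

print(totals)
```

Order 4 drops out at the join because customer 99 has no match, which is the same behavior you would expect from an inner join between two streams. Union, Lookup, Exists, Select, Filter and Sort follow the same pattern of taking one or more streams in and handing a stream back.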

Getting Started:

To get started with Data Flow, you’ll need to sign up for the preview by emailing adfdataflowext@microsoft.com with the ID of the subscription you want to do your development in. You’ll receive a reply once the subscription has been added, and then you’ll be able to go in and add new Data Flow activities.

At this point, when you go in and create a Data Factory, you’ll now have 3 options: Version 1, Version 2 and Version 2 with Data Flow.

Next, go to aka.ms/adfdataflowdocs, which gives you all the documentation you need for building your first Data Flows, as well as some pre-built samples to work and play around with. You can then create your own Data Flows and add a Data Flow activity to your pipeline to execute and test your Data Flow in debug mode. Or you can use Trigger Now in the pipeline to test your Data Flow from a pipeline activity.

Ultimately, you can operationalize your Data Flow by scheduling and monitoring the Data Factory pipeline that executes the Data Flow activity.

With Data Flow, we have the data orchestration and transformation piece we’ve been missing. It gives us a complete picture for the ETL/ELT scenarios that we want to do in cloud or hybrid environments, whether on-prem to cloud or cloud to cloud.

With Data Flow, Azure Data Factory has become the true cloud replacement for SSIS, and it should be in GA by year’s end. It is well designed and has some neat features, especially the way you build expressions, which in my opinion works better than in SSIS.

When you get a chance, check out Azure Data Factory and its Data Flow features and let me know if you have any questions!
