What is Azure Data Factory?
About Alif : Alif empowers Microsoft MSP-CSP partners to provide exceptional IT services to their clients to ensure that the partners reduce their costs and focus on their business. We provide white-labelled managed services for technologies like Microsoft Azure, Microsoft 365, Microsoft Dynamics 365, Microsoft Security, SharePoint, Power Platform, SQL, Azure DevOps and a lot more. Our headquarter is in Pune, India whereas we work with over 50 partners across the globe that trust us with their client delivery.
Azure Data Factory (ADF) is a cloud integration system, which allows moving data between on-premises and cloud systems as well as scheduling and orchestrating complex data flows. ADF is more of an Extract-and-Load and Transform-and-Load platform rather than a traditional Extract-Transform-and-Load (ETL) platform. To achieve Extract-and-Load goals, you can use the following approaches:
ADF has built-in features to configure relatively simple data flows to transfer data between file systems and database systems located both on-premises and on the cloud. ADF is capable of connecting to many different database systems, like SQL Server, Azure SQL Database, Oracle, MySQL, DB2, as well as file storage and Big Data systems like the local file system, blob storage, Azure Data Lake, and HDFS.
Alternatively, you could also use ADF to initiate SSIS packages. This could be handy if you need to implement more sophisticated data movement and transformation tasks.
As for transformation tasks, ADF can provision different Azure services on the fly and trigger scripts hosted by different databases and other data processing systems, like Azure Data Lake Analytics, Pig, Hive, Hadoop, and Azure Machine Learning API services.
How does it work?
This visual guide provides a detailed overview of the complete Data Factory architecture:
Why Azure Data Factory
While you can use SSIS to achieve most of the data integration goals for on-premises data, moving data to/from the cloud presents a few challenges:
Job scheduling and orchestration. SQL Server Agent services, which is the most popular to trigger data integration tasks is not available on the cloud. Although, there are a few other alternatives, like SQL Agent services on SQL VM, Azure scheduler, and Azure Automation, for data movement tasks, job scheduling features included in ADF seems to be best. Furthermore, ADF allows build event based data flows and dependencies. For example, data flows can be configured to start when files are deposited into a certain folder.
Security. ADF automatically encrypts data in-transit between on-premises and cloud sources.
Scalability. ADF is designed to handle big data volumes, thanks to its built-in parallelism and time-slicing features and can help you move many gigabytes of data into the cloud in a matter of a few hours.
Continuous integration and delivery. ADF integration with GitHub allows you to develop, build and automatically deploy into Azure. Furthermore, the entire ADF configuration could be downloaded as an Azure ARM template and used to deploy ADF in other environments (Test, QA and Production). For those who are skilled with PowerShell, ADF allows you to create and deploy all of its components, using PowerShell.
Minimal coding required. ADF configuration is based on JSON files and a new interface coming with ADF v2 allows creating components from the Azure Portal interactively, without much coding (which is one reason why I love Microsoft technologies!).
Azure Data Factory - Main Concepts
To understand how ADF works, we need to get familiar with the following ADF components:
Connectors or Linked Services. Linked services contain configuration settings to certain data sources. This may include server/database name, file folder, credentials, etc. Depending on the nature of the job, each data flow may have one or more linked services.
Datasets. Datasets also contain data source configuration settings, but on a more granular level. Datasets can contain a table name or file name, structure, etc. Each dataset refers to a certain linked service and that linked service determines the list of possible dataset properties. Linked services and datasets are similar to SSIS's data source/destination components, like OLE DB Source, OLE DB Destination, except SSIS source/destination components contain all the connection specific information in a single entity.
Activities. Activities represent actions, these could be data movement, transformations or control flow actions. Activity configurations contain settings like database query, stored procedure name, parameters, script location, etc. An activity can take zero or more input datasets and produce one or more output datasets. Although, ADF activities could be compared to SSIS Data Flow Task components (like Aggregate, Script component, etc.), SSIS has many components for which ADF has no match yet.
Pipelines. Pipelines are logical groupings of activities. A data factory can have one or more pipelines and each pipeline would contain one or more activities. Using pipelines makes it much easier to schedule and monitor multiple logically related activities.
Triggers. Triggers represent scheduling configuration for pipelines and they contain configuration settings, like start/end date, execution frequency, etc. Triggers are not mandatory parts of ADF implementation, they are required only if you need pipelines to be executed automatically, on a pre-defined schedule.
Integration runtime. The Integration Runtime (IR) is the compute infrastructure used by ADF to provide data movement, compute capabilities across different network environments. The main runtime types are:
Azure IR. Azure integration runtime provides a fully managed, serverless compute in Azure and this is the service behind data movement activities in cloud.
Self-hosted IR. This service manages copy activities between a cloud data stores and a data store in a private network, as well as transformation activities, like HDInsght Pig, Hive, and Spark.
Azure-SSIS IR. SSIS IR is required to natively execute SSIS packages.
The most straightforward approach to begin transferring data is to use Data Copy Wizard. It lets you easily build a data pipeline that transfers data from a supported source data store to a supported destination data store.
In addition to using the DataCopy Wizard, you may customize your activities by manually constructing each of the major components. Data Factory entities are in JSON format, so you may build these files in your favorite editor and then copy them to the Azure portal. So the input and output datasets and the pipelines can be created in JSON for migrating data.
Data Factory is a great tool for the rapid transition of data onto the cloud. Data Factory provides easy data integration on-premises and on the cloud.