What is Azure Data Factory?

ALIF Consulting
Oct 27, 2022

Updated: Aug 9, 2024

Azure Data Factory (ADF) is a cloud data integration service that moves data between on-premises and cloud systems and schedules and orchestrates complex data flows. ADF is more of an Extract-and-Load (EL) and Transform-and-Load (TL) platform than a traditional Extract-Transform-and-Load (ETL) platform. To achieve Extract-and-Load goals, you can use the following approaches:

Azure Data Factory (ADF) has built-in features for configuring relatively simple data flows that transfer data between file systems and database systems, whether they are located on-premises or in the cloud. Azure Data Factory can connect to many different database systems, like SQL Server, Azure SQL Database, Oracle, MySQL, and DB2, as well as file storage and Big Data systems like the local file system, blob storage, Azure Data Lake, and HDFS.
Alternatively, you can use Azure Data Factory to run SSIS packages. This is handy if you need to implement more sophisticated data movement and transformation tasks.

For transformation tasks, Azure Data Factory can provision different Azure services on the fly and trigger scripts hosted by different databases and other data processing systems, such as Azure Data Lake Analytics, Pig, Hive, Hadoop, and Azure Machine Learning API services.

How does Azure Data Factory work?

This visual guide provides a detailed overview of the complete Data Factory architecture:

Why Azure Data Factory?

While you can use SSIS to achieve most of the data integration goals for on-premises data, moving data to/from the cloud presents a few challenges:

Job scheduling and orchestration

SQL Server Agent, the most popular way to trigger data integration tasks, is not available in the cloud. Although there are a few alternatives, like SQL Server Agent on a SQL Server VM, Azure Scheduler, and Azure Automation, the job scheduling features included in ADF seem to be the best fit for data movement tasks. Furthermore, ADF lets you build event-based data flows and dependencies. For example, a data flow can be configured to start when files are deposited into a certain folder.
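To make the event-based case a little more concrete, here is a minimal sketch that builds a storage event trigger definition as a Python dictionary and prints it as JSON. The trigger name, pipeline name, folder prefix, and storage account scope are hypothetical placeholders. The property names follow the `BlobEventsTrigger` shape as I understand it from the ADF documentation, so verify them against the current schema before relying on them.

```python
import json


def build_blob_event_trigger(name, pipeline_name, folder_prefix, storage_account_id):
    """Return a storage event trigger definition as a plain dict."""
    return {
        "name": name,
        "properties": {
            "type": "BlobEventsTrigger",
            "typeProperties": {
                # Fire only for blobs created under this path prefix.
                "blobPathBeginsWith": folder_prefix,
                "ignoreEmptyBlobs": True,
                "events": ["Microsoft.Storage.BlobCreated"],
                # Resource ID of the storage account being watched.
                "scope": storage_account_id,
            },
            "pipelines": [
                {
                    "pipelineReference": {
                        "referenceName": pipeline_name,
                        "type": "PipelineReference",
                    }
                }
            ],
        },
    }


# All values below are placeholders for illustration only.
trigger = build_blob_event_trigger(
    name="OnNewSalesFile",
    pipeline_name="LoadSalesPipeline",
    folder_prefix="/landing/blobs/incoming/",
    storage_account_id=(
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Storage/storageAccounts/<account-name>"
    ),
)
print(json.dumps(trigger, indent=2))
```

The pipeline named in `pipelineReference` would start each time a new, non-empty blob lands under the watched folder.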

Security

Azure Data Factory automatically encrypts data in transit between on-premises and cloud sources.

Scalability

Azure Data Factory is designed to handle big data volumes thanks to its built-in parallelism and time-slicing features, and it can help you move many gigabytes of data into the cloud in a few hours.
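The idea behind time-slicing can be illustrated in a few lines of Python: a large time range is cut into smaller windows, and each window can then be loaded as an independent (and potentially parallel) unit of work. This is only a sketch of the concept, not ADF's internal implementation, and the function name and window size are my own assumptions.

```python
from datetime import datetime, timedelta


def time_slices(start, end, step):
    """Yield (slice_start, slice_end) windows that cover [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    current = start
    while current < end:
        # The final slice is shortened so it never runs past `end`.
        slice_end = min(current + step, end)
        yield current, slice_end
        current = slice_end


# Split one day into six-hour windows (illustrative values).
slices = list(
    time_slices(datetime(2024, 1, 1), datetime(2024, 1, 2), timedelta(hours=6))
)
for slice_start, slice_end in slices:
    print(slice_start.isoformat(), "->", slice_end.isoformat())
```

Each window could drive a separate copy run, for example by passing the two boundaries as parameters to a query that filters on a timestamp column.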

Continuous integration and delivery

Azure Data Factory's integration with GitHub allows you to develop, build, and automatically deploy into Azure. Furthermore, the entire Azure Data Factory configuration can be downloaded as an Azure Resource Manager (ARM) template and used to deploy Azure Data Factory to other environments (Test, QA, and Production). For those who are skilled in PowerShell, ADF also allows you to create and deploy all of its components using PowerShell.
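When the same ARM template is deployed to several environments, the per-environment differences usually live in a parameter file. The sketch below generates one parameter file body per environment. The factory names are hypothetical, and the `factoryName` parameter is an assumption based on what exported ADF templates typically expose, so check the parameters section of your own template.

```python
import json

# Schema URL for ARM deployment parameter files.
PARAMETERS_SCHEMA = (
    "https://schema.management.azure.com/schemas/2019-04-01/"
    "deploymentParameters.json#"
)


def build_parameter_file(factory_name):
    """Return the body of an ARM parameter file for one environment."""
    return {
        "$schema": PARAMETERS_SCHEMA,
        "contentVersion": "1.0.0.0",
        "parameters": {
            # Assumed parameter name; confirm it in your exported template.
            "factoryName": {"value": factory_name},
        },
    }


# Hypothetical factory names, one per target environment.
environments = {
    "test": "adf-demo-test",
    "qa": "adf-demo-qa",
    "prod": "adf-demo-prod",
}
parameter_files = {
    env: build_parameter_file(name) for env, name in environments.items()
}
print(json.dumps(parameter_files["qa"], indent=2))
```

A file like this would be passed alongside the exported template to whichever ARM deployment tool you use.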

Minimal coding is required

Azure Data Factory configuration is based on JSON files, and the interface introduced with ADF v2 allows components to be created interactively from the Azure Portal without much coding (which is one reason why I love Microsoft technologies!).

Azure Data Factory - Main Concepts

To understand how ADF works, we need to get familiar with the following Azure Data Factory components:

Connectors or Linked Services

Linked services contain configuration settings for certain data sources. This may include server/database name, file folder, credentials, etc. Depending on the nature of the job, each data flow may have one or more linked services.
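As a rough example, a linked service for Azure Blob Storage might look like the definition below, built here as a Python dictionary and printed as JSON. The linked service name and the connection string placeholders are hypothetical, and in a real factory the secret would normally come from a secure store rather than being written inline.

```python
import json


def build_blob_linked_service(name, connection_string):
    """Return an Azure Blob Storage linked service definition."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlobStorage",
            "typeProperties": {
                # Placeholder only; do not commit real keys to source control.
                "connectionString": connection_string,
            },
        },
    }


linked_service = build_blob_linked_service(
    name="BlobStorageLinkedService",
    connection_string=(
        "DefaultEndpointsProtocol=https;"
        "AccountName=<account-name>;AccountKey=<account-key>"
    ),
)
print(json.dumps(linked_service, indent=2))
```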

Datasets

Datasets also contain data source configuration settings, but on a more granular level. Datasets can contain a table name or file name, structure, etc. Each dataset refers to a certain linked service, and that linked service determines the list of possible dataset properties. Linked services and datasets are similar to SSIS's data source/destination components, like OLE DB Source and OLE DB Destination, except that SSIS source/destination components contain all the connection-specific information in a single entity.
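The relationship between a dataset and its linked service can be sketched as follows. The dataset describes one CSV file and points at a linked service by name only, which is what keeps connection details out of the dataset. The dataset, linked service, container, and file names are hypothetical.

```python
import json


def build_csv_dataset(name, linked_service_name, container, folder, file_name):
    """Return a delimited-text dataset that refers to a blob linked service."""
    return {
        "name": name,
        "properties": {
            "type": "DelimitedText",
            # The dataset carries no credentials; it only names the linked service.
            "linkedServiceName": {
                "referenceName": linked_service_name,
                "type": "LinkedServiceReference",
            },
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": container,
                    "folderPath": folder,
                    "fileName": file_name,
                },
                "columnDelimiter": ",",
                "firstRowAsHeader": True,
            },
        },
    }


dataset = build_csv_dataset(
    name="SalesCsv",
    linked_service_name="BlobStorageLinkedService",
    container="landing",
    folder="incoming",
    file_name="sales.csv",
)
print(json.dumps(dataset, indent=2))
```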

Activities

Activities represent actions, such as data movement, transformations, or control flow actions. Activity configurations contain settings like a database query, stored procedure name, parameters, script location, etc. An activity can take zero or more input datasets and produce one or more output datasets. Although ADF activities could be compared to SSIS Data Flow Task components (like Aggregate, Script component, etc.), SSIS has many components for which ADF has no match yet.
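A data movement activity can be sketched as a Copy activity that reads one dataset and writes another. The activity and dataset names below are hypothetical, and the source and sink types assume a CSV file being loaded into Azure SQL Database; other stores would use different type names.

```python
import json


def build_copy_activity(name, input_dataset, output_dataset):
    """Return a Copy activity that moves data between two named datasets."""
    return {
        "name": name,
        "type": "Copy",
        "inputs": [{"referenceName": input_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": output_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            # Assumed source/sink pair: delimited text file into Azure SQL.
            "source": {"type": "DelimitedTextSource"},
            "sink": {"type": "AzureSqlSink"},
        },
    }


copy_activity = build_copy_activity(
    name="CopySalesToSql",
    input_dataset="SalesCsv",
    output_dataset="SalesTable",
)
print(json.dumps(copy_activity, indent=2))
```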

Pipelines

Pipelines are logical groupings of activities. A data factory can have one or more pipelines, and each pipeline contains one or more activities. Using pipelines makes it much easier to schedule and monitor multiple logically related activities.
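Here is a small sketch of a pipeline that groups two related activities: a copy step followed by a stored procedure that runs only if the copy succeeds. All names are hypothetical, and the dependency check in `build_pipeline` is my own convenience helper, not something ADF requires you to write.

```python
import json


def build_pipeline(name, activities):
    """Group activities into a pipeline after checking their dependencies."""
    known = {activity["name"] for activity in activities}
    for activity in activities:
        for dependency in activity.get("dependsOn", []):
            if dependency["activity"] not in known:
                raise ValueError(
                    f"{activity['name']} depends on unknown activity "
                    f"{dependency['activity']}"
                )
    return {"name": name, "properties": {"activities": activities}}


copy_step = {
    "name": "CopySalesToSql",
    "type": "Copy",
    "inputs": [{"referenceName": "SalesCsv", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "SalesTable", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},
        "sink": {"type": "AzureSqlSink"},
    },
}
post_load_step = {
    "name": "RefreshSalesSummary",
    "type": "SqlServerStoredProcedure",
    # Runs only after the copy step has finished successfully.
    "dependsOn": [
        {"activity": "CopySalesToSql", "dependencyConditions": ["Succeeded"]}
    ],
    "linkedServiceName": {
        "referenceName": "AzureSqlLinkedService",
        "type": "LinkedServiceReference",
    },
    "typeProperties": {"storedProcedureName": "dbo.RefreshSalesSummary"},
}

pipeline = build_pipeline("LoadSalesPipeline", [copy_step, post_load_step])
print(json.dumps(pipeline, indent=2))
```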

Triggers

Triggers represent the scheduling configuration for pipelines and contain settings like start/end date, execution frequency, etc. Triggers are not a mandatory part of an ADF implementation; they are required only if you need pipelines to be executed automatically on a pre-defined schedule.
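The settings mentioned above (start/end date and execution frequency) map onto a schedule trigger roughly as shown in this sketch. The trigger name, pipeline name, and dates are hypothetical, and the list of allowed frequencies reflects my reading of the ADF documentation.

```python
import json

ALLOWED_FREQUENCIES = {"Minute", "Hour", "Day", "Week", "Month"}


def build_schedule_trigger(name, pipeline_name, start, end, frequency="Day", interval=1):
    """Return a schedule trigger that runs one pipeline on a recurrence."""
    if frequency not in ALLOWED_FREQUENCIES:
        raise ValueError(f"unsupported frequency: {frequency}")
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return {
        "name": name,
        "properties": {
            "type": "ScheduleTrigger",
            "typeProperties": {
                "recurrence": {
                    "frequency": frequency,
                    "interval": interval,
                    "startTime": start,
                    "endTime": end,
                    "timeZone": "UTC",
                }
            },
            "pipelines": [
                {
                    "pipelineReference": {
                        "referenceName": pipeline_name,
                        "type": "PipelineReference",
                    }
                }
            ],
        },
    }


# Run the (hypothetical) pipeline once a day throughout 2024.
schedule = build_schedule_trigger(
    name="DailySalesLoad",
    pipeline_name="LoadSalesPipeline",
    start="2024-01-01T02:00:00Z",
    end="2024-12-31T02:00:00Z",
)
print(json.dumps(schedule, indent=2))
```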

Integration Runtime

The Integration Runtime (IR) is the computing infrastructure ADF uses to provide data movement and compute capabilities across different network environments. The main runtime types are:

Azure IR. The Azure integration runtime provides fully managed, serverless computing in Azure, which is the service behind data movement activities in the cloud.
Self-hosted IR. This runtime manages copy activities between a cloud data store and a data store in a private network, and it dispatches transformation activities, like HDInsight Pig, Hive, and Spark.
Azure-SSIS IR. The Azure-SSIS IR is required to natively execute SSIS packages.
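In practice, the choice of runtime usually shows up on the linked service: a data store in a private network names a self-hosted IR, while a cloud store can fall back to the default Azure IR. The sketch below shows this with a `connectVia` reference. The linked service name, runtime name, and connection string are hypothetical placeholders.

```python
import json


def build_sql_server_linked_service(name, connection_string, integration_runtime=None):
    """Return a SQL Server linked service, optionally pinned to a named IR."""
    definition = {
        "name": name,
        "properties": {
            "type": "SqlServer",
            "typeProperties": {"connectionString": connection_string},
        },
    }
    if integration_runtime is not None:
        # Route traffic through the named runtime, e.g. a self-hosted IR
        # installed inside the private network.
        definition["properties"]["connectVia"] = {
            "referenceName": integration_runtime,
            "type": "IntegrationRuntimeReference",
        }
    return definition


on_prem_sql = build_sql_server_linked_service(
    name="OnPremSqlServer",
    connection_string="Server=<server>;Database=<database>;Integrated Security=True",
    integration_runtime="SelfHostedIR-Office",
)
print(json.dumps(on_prem_sql, indent=2))
```

Leaving `integration_runtime` unset omits `connectVia`, in which case the default Azure IR would be used.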

Data Migration

The most straightforward approach to transferring data is to use the Data Copy Wizard. It lets you easily build a data pipeline that transfers data from a supported source data store to a supported destination data store.

In addition to using the Data Copy Wizard, you may customize your activities by manually constructing each major component. Data Factory entities are in JSON format, so you may build these files in your favourite editor and then copy them to the Azure portal. The input and output datasets and pipelines can also be created in JSON to migrate data.
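If you prefer to generate those JSON files rather than type them by hand, a short script can write one file per entity. The sketch below does this into a temporary folder. The folder layout (one folder per entity kind) and all entity names are assumptions chosen for illustration, and the entity bodies are trimmed to the bare minimum.

```python
import json
import tempfile
from pathlib import Path


def save_entities(root, entities):
    """Write each entity to <root>/<kind>/<name>.json and return the paths."""
    written = []
    for kind, definitions in entities.items():
        folder = Path(root) / kind
        folder.mkdir(parents=True, exist_ok=True)
        for definition in definitions:
            path = folder / f"{definition['name']}.json"
            path.write_text(json.dumps(definition, indent=2), encoding="utf-8")
            written.append(path)
    return written


# Minimal, hypothetical entities; real definitions carry more properties.
entities = {
    "linkedService": [
        {"name": "BlobStorageLinkedService", "properties": {"type": "AzureBlobStorage"}}
    ],
    "dataset": [
        {"name": "SalesCsv", "properties": {"type": "DelimitedText"}}
    ],
    "pipeline": [
        {"name": "LoadSalesPipeline", "properties": {"activities": []}}
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    paths = save_entities(tmp, entities)
    names = sorted(path.relative_to(tmp).as_posix() for path in paths)
    # Read one file back to confirm it round-trips as valid JSON.
    reloaded = json.loads((Path(tmp) / "dataset" / "SalesCsv.json").read_text(encoding="utf-8"))

print(names)
```

Each generated file can then be opened in an editor, reviewed, and copied into the Azure portal as described above.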

Conclusion

Azure Data Factory (ADF) is an essential tool for organizations looking to integrate and manage data across both on-premises and cloud environments. Its ability to handle large volumes of data, combined with features like built-in connectors, flexible scheduling, and secure data movement, makes it a powerful platform for data integration and orchestration. The support for various data sources, minimal coding requirements, and integration with other Azure services further enhance its versatility and scalability.

Whether migrating data to the cloud, automating data pipelines, or executing complex data transformations, Azure Data Factory offers a comprehensive solution that adapts to your specific needs. As businesses increasingly move towards cloud-based operations, mastering ADF can be a significant advantage in achieving efficient, secure, and scalable data management.

Frequently Asked Questions

What is Azure Data Factory (ADF)?

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows for orchestrating and automating data movement and data transformation. It supports a variety of data sources, including on-premises and cloud-based systems.

How does Azure Data Factory differ from traditional ETL platforms?

Unlike traditional Extract-Transform-Load (ETL) platforms, Azure Data Factory is more focused on Extract-and-Load (EL) and Transform-and-Load (TL) processes. This allows for more flexibility, especially when dealing with large data volumes and diverse data sources.

How does Azure Data Factory work?

Azure Data Factory operates by defining and managing data pipelines, which consist of various activities to move and transform data. It uses Integration Runtime (IR) to execute these activities across different network environments. Data can be copied from on-premises or cloud sources to destinations with minimal coding, using built-in connectors and linked services. ADF can also integrate with other Azure services to perform complex transformations and orchestrate workflows, ensuring secure, scalable, and automated data movement.