top of page

What is Azure Data Factory?

Updated: Dec 27, 2023

Azure Data Factory (ADF) is a cloud integration system, which allows moving data between on-premises and cloud systems as well as scheduling and orchestrating complex data flows. ADF is more of an Extract-and-Load and Transform-and-Load platform rather than a traditional Extract-Transform-and-Load (ETL) platform. To achieve Extract-and-Load goals, you can use the following approaches:

  • ADF has built-in features to configure relatively simple data flows to transfer data between file systems and database systems located both on-premises and on the cloud. ADF is capable of connecting to many different database systems, like SQL Server, Azure SQL Database, Oracle, MySQL, DB2, as well as file storage and Big Data systems like the local file system, blob storage, Azure Data Lake, and HDFS.

  • Alternatively, you could also use ADF to initiate SSIS packages. This could be handy if you need to implement more sophisticated data movement and transformation tasks.

As for transformation tasks, ADF can provision different Azure services on the fly and trigger scripts hosted by different databases and other data processing systems, like Azure Data Lake Analytics, Pig, Hive, Hadoop, and Azure Machine Learning API services.


How does it work?

How does azure data factory works

This visual guide provides a detailed overview of the complete Data Factory architecture:



Why Azure Data Factory

While you can use SSIS to achieve most of the data integration goals for on-premises data, moving data to/from the cloud presents a few challenges:

Job scheduling and orchestration

SQL Server Agent services, which are the most popular to trigger data integration tasks are not available on the cloud. Although there are a few other alternatives, like SQL Agent services on SQL VM, Azure scheduler, and Azure Automation, for data movement tasks, job scheduling features included in ADF seem to be the best. Furthermore, ADF allows the building of event-based data flows and dependencies. For example, data flows can be configured to start when files are deposited into a certain folder.

Security

ADF automatically encrypts data in transit between on-premises and cloud sources.

Scalability

ADF is designed to handle big data volumes, thanks to its built-in parallelism and time-slicing features and can help you move many gigabytes of data into the cloud in a matter of a few hours.

Continuous integration and delivery

ADF integration with GitHub allows you to develop, build, and automatically deploy into Azure. Furthermore, the entire ADF configuration could be downloaded as an Azure ARM template and used to deploy ADF in other environments (Test, QA, and Production). For those who are skilled with PowerShell, ADF allows you to create and deploy all of its components, using PowerShell.

Minimal coding is required

ADF configuration is based on JSON files and a new interface coming with ADF v2 allows creating components from the Azure Portal interactively, without much coding (which is one reason why I love Microsoft technologies!).

Azure Data Factory - Main Concepts

To understand how ADF works, we need to get familiar with the following ADF components:

Connectors or Linked Services

Linked services contain configuration settings for certain data sources. This may include server/database name, file folder, credentials, etc. Depending on the nature of the job, each data flow may have one or more linked services.

Datasets

Datasets also contain data source configuration settings but on a more granular level. Datasets can contain a table name or file name, structure, etc. Each dataset refers to a certain linked service and that linked service determines the list of possible dataset properties. Linked services and datasets are similar to SSIS's data source/destination components, like OLE DB Source, and OLE DB Destination, except SSIS source/destination components contain all the connection-specific information in a single entity.

Activities

Activities represent actions, these could be data movement, transformations, or control flow actions. Activity configurations contain settings like a database query, stored procedure name, parameters, script location, etc. An activity can take zero or more input datasets and produce one or more output datasets. Although ADF activities could be compared to SSIS Data Flow Task components (like Aggregate, Script component, etc.), SSIS has many components for which ADF has no match yet.

Pipelines

Pipelines are logical groupings of activities. A data factory can have one or more pipelines and each pipeline would contain one or more activities. Using pipelines makes it much easier to schedule and monitor multiple logically related activities.

Triggers

Triggers represent scheduling configuration for pipelines and they contain configuration settings, like start/end date, execution frequency, etc. Triggers are not mandatory parts of ADF implementation, they are required only if you need pipelines to be executed automatically, on a pre-defined schedule.

Integration Runtime

The Integration Runtime (IR) is the compute infrastructure used by ADF to provide data movement and compute capabilities across different network environments. The main runtime types are:

  • Azure IR. Azure integration runtime provides a fully managed, serverless compute in Azure and this is the service behind data movement activities in the cloud.

  • Self-hosted IR. This service manages copy activities between a cloud data store and a data store in a private network, as well as transformation activities, like HDInsght Pig, Hive, and Spark.

  • Azure-SSIS IR. SSIS IR is required to natively execute SSIS packages.


Data Migration

The most straightforward approach to begin transferring data is to use Data Copy Wizard. It lets you easily build a data pipeline that transfers data from a supported source data store to a supported destination data store.

In addition to using the DataCopy Wizard, you may customize your activities by manually constructing each of the major components. Data Factory entities are in JSON format, so you may build these files in your favorite editor and then copy them to the Azure portal. So the input and output datasets and the pipelines can be created in JSON for migrating data.


Conclusion

Data Factory is a great tool for the rapid transition of data onto the cloud. Data Factory provides easy data integration on-premises and on the cloud.

82 views0 comments

Recent Posts

See All

Comments


bottom of page