Design and build Modern data warehouse for small and medium business

ALIF Consulting
Nov 8, 2022
5 min read

Updated: Jun 6, 2024

Solution Overview

This example workload shows several ways small businesses (SMBs) can modernize legacy data stores and explore big data tools and capabilities without overextending current budgets and skill sets. These end-to-end Azure data warehousing solutions integrate easily with Azure and Microsoft services and tools like Azure Machine Learning, Microsoft Power Platform, and Microsoft Dynamics.

Data Warehouse

A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and analysis and often contain large amounts of historical data. The data within a data warehouse is usually derived from a wide range of sources, such as application log files and transaction applications.

A typical data warehouse often includes the following elements:

A relational database to store and manage data
An extraction, loading, and transformation (ELT) solution for preparing the data for analysis
Statistical analysis, reporting, and data mining capabilities
Client analysis tools for visualizing and presenting data to business users
Other, more sophisticated analytical applications that generate actionable information by applying data science and artificial intelligence (AI) algorithms or graph and spatial features that enable more kinds of analysis of data at scale

Benefits of a Data Warehouse

Data warehouses offer the overarching and unique benefit of allowing organizations to analyze large amounts of variant data and extract significant value from it, as well as to keep a historical record.

Four unique characteristics (described by computer scientist William Inmon, who is considered the father of the data warehouse) allow data warehouses to deliver this overarching benefit. According to this definition, data warehouses are

Subject-oriented. They can analyze data about a particular subject or functional area (such as sales).
Integrated. Data warehouses create consistency among different data types from disparate sources.
Nonvolatile. Once data is in a data warehouse, it’s stable and doesn’t change.
Time-variant. Data warehouse analysis looks at change over time.

Architecture

Legacy SMB data warehouses might contain several types of data:

Unstructured data, like documents and graphics
Semi-structured data, such as logs, CSVs, JSON, and XML files
Structured relational data, including databases that use stored procedures for extract-transform-load/extract-load-transform (ETL/ELT) activities

Components

Azure Synapse Analytics is an analytics service that combines data integration, enterprise data warehousing, and big data analytics. In this solution:

o An Azure Synapse Workspace promotes collaboration between data engineers, data scientists, data analysts, and business intelligence (BI) professionals.

o Azure Synapse pipelines orchestrate and ingest data into SQL Database and Data Lake Storage.

o Azure Synapse serverless SQL pools analyze unstructured and semi-structured data in Data Lake Storage on demand.

o Azure Synapse serverless Apache Spark pools do code-first explorations in Data Lake Storage with Spark languages like Spark SQL, pySpark, and Scala.

Azure SQL Database is an intelligent, scalable, relational database service built for the cloud. In this solution, the SQL Database holds the enterprise data warehouse and performs ETL/ELT activities that use stored procedures.
Azure Event Hubs is a real-time data streaming platform and event ingestion service. Event Hubs can ingest data from anywhere and seamlessly integrate with Azure data services.
Azure Stream Analytics is a real-time, serverless analytics service for streaming data. Stream Analytics offers rapid, elastic scalability, enterprise-grade reliability and recovery, and built-in machine learning capabilities.
Azure Machine Learning is a toolset for data science model development and lifecycle management. Machine Learning is one example of the Azure and Microsoft services that can consume fused, processed data from Data Lake Storage.

Alternatives

Azure IoT Hub could replace or complement Event Hubs. The solution you choose depends on the source of your streaming data and whether you need cloning and bidirectional communication with the reporting devices.
You can use Azure Data Factory for data integration instead of Azure Synapse pipelines. The choice depends on several factors:

o Azure Synapse pipelines keep the solution design simpler and allow collaboration inside a single Azure Synapse workspace.

o Azure Synapse pipelines don't support SSIS packages rehosting, which is available in Azure Data Factory.

o Synapse Monitor Hub monitors Azure Synapse pipelines, while Azure Monitor can monitor Data Factory.

You can use Synapse Analytics dedicated SQL pools for storing enterprise data, instead of using SQL Database. Review the use cases and considerations in this article and related resources to make a decision.

Potential use cases

Several scenarios can benefit from this workload:

Migrating a traditional, on-premises relational data warehouse that's smaller than 1 TB and extensively uses SQL Server Integration Services (SSIS) packages to orchestrate stored procedures.
Meshing existing Dynamics or Power Platform Dataverse data with batched and real-time Azure Data Lake sources.
Using innovative techniques to interact with centralized Data Lake Storage data. Techniques include serverless analysis, knowledge mining, data fusion between domains, and end-user data exploration.
This solution isn't recommended for:
Greenfield deployment of data warehouses that are estimated to be > 1 TB within one year.
Migrating on-premises data warehouses that are > 1 TB or projected to grow to that size within a year.

Considerations

The following considerations apply to this scenario:

Availability

SQL Database is a PaaS service that can meet your high availability (HA) and disaster recovery (DR) requirements. Be sure to pick the SKU that meets your requirements.

Operations

SQL Database uses SQL Server Management Studio (SSMS) to develop and maintain legacy artifacts like stored procedures.

Pricing

See a pricing sample for an SMB data warehousing scenario in the Azure pricing calculator. Adjust the values to see how your requirements affect the costs.

SQL Database

SQL Database bases costs on the selected Compute and Service tiers, and the number of vCores and Database Transaction Units (DTUs). The example shows a single database with provisioned Compute and eight vCores, based on the assumption that you must run stored procedures in an SQL Database.

Data Lake Storage

Data Lake Storage pricing depends on the amount of data you store and how often you use the data. The sample pricing includes 1 TB of data stored, with further transactional assumptions. The 1 TB refers to the size of the data lake, not the original legacy database size.

Azure Synapse pipelines

Azure Synapse pipelines base costs on the number of data pipeline activities, integration runtime hours, data flow cluster size, and execution and operation charges. Pipeline costs increase with additional data sources and amounts of data processed. The example assumes one data source batched every hour for 15 minutes on an Azure-hosted integration runtime.

Azure Synapse Spark pool

Azure Synapse Spark pool bases pricing on node size, number of instances, and uptime. The example assumes one small compute node with five hours a week to 40 hours a month utilization.

Azure Synapse serverless SQL pool

Azure Synapse serverless SQL pool bases pricing on TBs of data processed. The sample assumes 50 TBs processed a month. This figure refers to the size of the data lake, not the original legacy database size.

Event Hubs

Event Hubs bills based on tier, throughput units provisioned, and ingress traffic received. The example assumes one throughput unit in the Standard tier over one million events for a month.

Stream Analytics

Stream Analytics bases costs on the number of provisioned streaming units. The sample assumes one streaming unit used over the month.