What is Azure Databricks?
About Alif: Alif empowers Microsoft MSP-CSP partners to provide exceptional IT services to their clients, so that partners can reduce their costs and focus on their business. We provide white-labelled managed services for technologies such as Microsoft Azure, Microsoft 365, Microsoft Dynamics 365, Microsoft Security, SharePoint, Power Platform, SQL, Azure DevOps, and more. Our headquarters are in Pune, India, and we work with over 50 partners across the globe who trust us with their client delivery.
Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform. Azure Databricks offers three environments for developing data-intensive applications: Databricks SQL, Databricks Data Science & Engineering, and Databricks Machine Learning.
Databricks SQL provides an easy-to-use platform for analysts who want to run SQL queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards.
Databricks Data Science & Engineering provides an interactive workspace that enables collaboration between data engineers, data scientists, and machine learning engineers. For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real time using Apache Kafka, Event Hubs, or IoT Hub. This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from multiple data sources and turn it into breakthrough insights using Spark.
Databricks Machine Learning is an integrated end-to-end machine learning environment incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving.
For example, in the Microsoft reference architecture below, Databricks is used for ETL and machine learning, while Azure Synapse and Azure Analysis Services serve the line-of-business and ad hoc BI workloads. That said, you can still connect Power BI to Databricks to run ad hoc queries on your data lake.
This example is still consistent with the vision of making your data lake a centralized and democratized asset that can serve all downstream data processes.
Databricks Core Components
Developers mainly interact with Databricks through its collaborative, interactive workspace. This notebook-based environment has the following key features:
Code collaboratively, in real time, in notebooks that support SQL, Python, Scala, and R
Built-in version control and integration with Git, GitHub, and other source control systems
Enterprise-level security
Visualize queries, build algorithms and create dashboards
Create and schedule ETL / Data Science workloads from various data sources to be run as jobs
Track and manage the machine learning lifecycle from development to production
Here is a screenshot of a Databricks Notebook and the Databricks Workspace.
One of the original key value propositions of Databricks is its managed infrastructure, which takes the form of managed clusters. A cluster is a group of virtual machines that divide up the work of a query in order to return results faster. By filling out 5-10 fields and clicking a button, you can spin up a Spark cluster that is optimized beyond open-source Spark, includes many common data science and data analytics libraries, and can autoscale to meet the needs of a given workload. You pay for Databricks only while a cluster is running, and there is a lot of built-in functionality to reduce this cost. For example, a jobs cluster spins up to complete a specific job or task and then shuts down as soon as it finishes.
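To make the "5-10 fields" concrete, here is a minimal sketch of what a cluster definition might look like when written out as a JSON payload. The field names are modeled on the Databricks Clusters API and should be checked against the current API reference; the cluster name, runtime version, VM size, worker counts, and the `validate_cluster_spec` helper are all placeholder assumptions for illustration.

```python
import json

# Hypothetical cluster definition. Field names are modeled on the Databricks
# Clusters API; every value below is a placeholder chosen for illustration.
cluster_spec = {
    "cluster_name": "etl-demo",             # assumed name
    "spark_version": "13.3.x-scala2.12",    # assumed runtime version string
    "node_type_id": "Standard_DS3_v2",      # assumed Azure VM size
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,          # stop paying after 30 idle minutes
}


def validate_cluster_spec(spec):
    """Return a list of problems found in the spec (empty list means OK)."""
    problems = []
    for field in ("cluster_name", "spark_version", "node_type_id"):
        if not spec.get(field):
            problems.append(f"missing field: {field}")
    scale = spec.get("autoscale", {})
    if scale.get("min_workers", 0) > scale.get("max_workers", 0):
        problems.append("autoscale.min_workers exceeds max_workers")
    return problems


payload = json.dumps(cluster_spec, indent=2)
print(payload)
print(validate_cluster_spec(cluster_spec))
```

The autoscale range and the auto-termination setting are the two fields most directly tied to cost: the first bounds how many workers can be running, and the second limits how long an idle cluster stays live.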
Spark is an open-source distributed processing engine that processes data in memory – making it extremely popular for big data processing and machine learning. Spark is the core engine that executes workloads and queries on the Databricks platform. Databricks was founded by the original creators of Spark and continues to be the largest contributor to open-source Spark today.
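The core idea behind a distributed engine, splitting the work across workers and then combining the partial results, can be illustrated without Spark at all. The sketch below is plain Python, not Spark code: it splits a list into partitions, computes a partial sum and count on each partition (as each worker would), and combines the partials into a global average. The function names, the use of threads as stand-ins for workers, and the choice of three partitions are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into num_partitions lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_aggregate(partition):
    """Work done independently on one partition: (sum, count)."""
    return sum(partition), len(partition)


def distributed_average(rows, num_partitions=3):
    """Average the rows by combining per-partition partial aggregates."""
    partitions = split_into_partitions(rows, num_partitions)
    # Threads stand in for cluster workers here.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


print(distributed_average([10, 20, 30, 40, 50, 60]))  # 35.0
```

Only the small (sum, count) pairs travel back to the coordinator, which is why this pattern scales to data that would not fit on one machine.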
Delta is an open-source storage format that was built specifically to address the limitations of traditional data lake file formats. Under the hood, Delta stores data as Parquet, a columnar format optimized for big data workloads, and adds metadata and a transaction log. Delta offers the following key features, which file formats such as Parquet and ORC lack on their own:
Ability to perform upserts
Indexing for faster queries
Unifies streaming and batch workloads without a complex Lambda architecture
Schema validation and expectations
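To illustrate the first item in the list above: an upsert updates rows whose key already exists in the target and inserts the rows whose key does not. The sketch below mimics that behavior on plain Python dictionaries. It is not Delta code, and the `upsert` helper, the `id` key, and the sample rows are all hypothetical.

```python
def upsert(target, updates, key="id"):
    """Return a new list of rows: rows in `updates` overwrite the fields of
    target rows with the same key, and rows with unseen keys are appended."""
    merged = {row[key]: row for row in target}
    for row in updates:
        # Matched keys are updated field by field; unmatched keys are inserted.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())


customers = [
    {"id": 1, "name": "Ada", "city": "Pune"},
    {"id": 2, "name": "Grace", "city": "Seattle"},
]
changes = [
    {"id": 2, "city": "Austin"},                  # existing key: update
    {"id": 3, "name": "Linus", "city": "Oslo"},   # new key: insert
]

for row in upsert(customers, changes):
    print(row)
```

In Delta itself, the same operation is typically expressed as a single SQL `MERGE INTO` statement with `WHEN MATCHED` and `WHEN NOT MATCHED` clauses, with the transaction log making the change atomic.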
A common misconception is that if you choose to build a 'Delta Lake', all of your data needs to be in the Delta format. This is not true: your raw data can stay in its original format, and if you have other specific file format requirements, you can store whatever file types you like in the data lake. Delta is a tool to be used in the data lake where it makes sense.
MLflow is an open-source machine learning framework that was built to manage the ML lifecycle. A common challenge within data science is that it is hard to get machine learning into production. MLflow addresses this challenge with four main components: Tracking, Projects, Models, and the Model Registry.
All of the above components are part of open-source MLflow. On the Databricks platform, you get the following additional benefits:
Workspaces – Collaboratively track and organize experiments from the Databricks Workspace
Jobs – Execute runs as a Databricks job remotely or directly from Databricks notebooks
Big Data Snapshots – Track large-scale data sets that feed models with Databricks Delta snapshots
Security – Take advantage of one common security model for the entire ML lifecycle
Serving – Quickly deploy an ML model to a REST endpoint for testing during the development process
Essentially, on Databricks, there is no need to manage MLflow as a separate tool. Everything is built right into the UI to create a seamless experience.
SQL Analytics, the offering now known as Databricks SQL, gives the SQL analyst a home within Databricks. After switching views in the traditional Databricks workspace, users get an experience similar to that of a traditional SQL workbench. In the SQL Analytics workspace, users can:
Write SQL queries against the data lake
Visualize query results inline
Build dashboards and share them with the business
Create alerts based on SQL queries
The backend of SQL Analytics is powered by SQL Endpoints, which are Spark clusters optimized for SQL workloads. These endpoints are not limited to the SQL Analytics UI within Databricks: you can also connect to them from BI tools such as Tableau and Power BI and query all of the data in your lake from there.
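The query-then-alert workflow described above boils down to running SQL on a schedule and acting on the result. As a rough sketch, the snippet below uses Python's built-in `sqlite3` module as a local stand-in for a SQL endpoint; it does not connect to Databricks. The `orders` table, the sample rows, the `check_alert` helper, and the threshold are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for tables in the data lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("EU", "ok"), ("EU", "failed"), ("US", "failed"), ("US", "failed"), ("US", "ok")],
)

FAILED_ORDERS_QUERY = """
    SELECT region, COUNT(*) AS failed
    FROM orders
    WHERE status = 'failed'
    GROUP BY region
    ORDER BY region
"""


def check_alert(connection, threshold):
    """Run the query and return regions whose failed count exceeds threshold."""
    rows = connection.execute(FAILED_ORDERS_QUERY).fetchall()
    return [region for region, failed in rows if failed > threshold]


print(check_alert(conn, threshold=1))  # ['US']
```

In a real deployment the query would run against the lake through a SQL endpoint, and the alert condition would be configured in the UI rather than in application code.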
When to use Databricks
Modernize your Data Lake – if you are facing challenges around performance and reliability in your data lake, or your data lake has become a data swamp, consider Delta as an option to modernize your Data Lake.
Production Machine Learning – if your organization is doing data science work but is having trouble getting that work into the hands of business users, the Databricks platform was built to help data scientists move their work from development to production.
Big Data ETL – from a cost/performance perspective, Databricks is best in its class.
Opening your Data Lake to BI users – if your analyst / BI group is consistently slowed down because the engineering team has to build a pipeline every time they want to access new data, it might make sense to open the Data Lake to these users through a tool like SQL Analytics within Databricks.
When not to use Databricks
There are a few scenarios when using Databricks is probably not the best fit for your use case:
Sub-second queries – Spark, being a distributed engine, has processing overhead that makes sub-second queries nearly impossible. Your data can still live in the data lake, but for sub-second queries you will likely want to use a highly tuned speed layer.
Small data – Similar to the first point, you won't get the majority of the benefits of Databricks if you are dealing with very small data (think GBs).
Pure BI without a supporting data engineering team – Databricks and SQL Analytics do not erase the need for a data engineering team; in fact, data engineers are more critical than ever in unlocking the potential of the Data Lake. That said, Databricks offers tools to enable the data engineering team itself.
Teams requiring drag-and-drop ETL – Databricks has many UI components, but drag-and-drop ETL is not currently one of them.