What Is Microsoft OneLake?
OneLake serves as a single, unified, logical data lake for your entire organization. Like OneDrive, it can handle substantial amounts of data from many different sources. OneLake is included by default with every Microsoft Fabric tenant and is intended to be the single location for all your analytical data. OneLake offers customers:
OneLake: Your Single Data Lake
One Data, Many Engines
OneLake: Your Single Data Lake
Before OneLake, it was often easier for customers to create multiple lakes for different business units than to collaborate on a single lake, despite the extra overhead of managing multiple resources. OneLake aims to remove these obstacles and improve collaboration. Every customer tenant has exactly one OneLake: there can never be more than one, and if you have Fabric, there can never be zero. Every Fabric tenant provisions OneLake automatically, with no extra resources to set up or manage.
One copy of the data
OneLake's goal is to maximize the value obtained from a single copy of data, without data movement or duplication. You no longer need to copy data just to use it with another engine, or to break down silos so you can analyze the data alongside data from other sources.
Shortcuts connect
Shortcuts enable your organization to easily share data between users and applications without moving and duplicating information unnecessarily. When teams work independently in separate workspaces, shortcuts let you combine data across business groups and domains into a virtual data product tailored to a user's specific needs.
A shortcut serves as a reference to data stored in other file locations, which can be within the same workspace or across different workspaces, within OneLake or externally in ADLS, S3, or Dataverse — with more target locations on the way. Regardless of the location, shortcuts present files and folders as if they are stored locally.
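The idea of a shortcut as a prefix-level reference can be sketched in a few lines of Python. This is an illustrative model only, not a real Fabric API; the shortcut names and target URIs below are made up for the example.

```python
# Hypothetical model of shortcut resolution: a shortcut lives at a folder
# path inside a lakehouse and points at a target location (OneLake, ADLS,
# S3, ...). Reads under that folder are transparently served from the target.
shortcuts = {
    "Sales.Lakehouse/Files/external_logs": "https://mybucket.s3.amazonaws.com/logs",
    "Sales.Lakehouse/Files/finance": "abfss://finance@contoso.dfs.core.windows.net/data",
}

def resolve(path: str) -> str:
    """Rewrite a virtual OneLake path to the physical target it points to.

    Paths not covered by any shortcut are assumed to be stored in OneLake.
    """
    for prefix, target in shortcuts.items():
        if path == prefix or path.startswith(prefix + "/"):
            return target + path[len(prefix):]
    return "onelake://" + path  # stored natively in OneLake

print(resolve("Sales.Lakehouse/Files/external_logs/2024/01.json"))
# -> https://mybucket.s3.amazonaws.com/logs/2024/01.json
```

The key point the sketch captures is that the consumer always sees the same local-looking path, regardless of where the bytes actually live.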
One Data, Many Engines
While applications may have separate storage and computing capabilities, the data is usually optimized for a single engine, making it challenging to utilize the same data for multiple applications. With Microsoft Fabric, various analytical engines (such as T-SQL, Apache Spark, and Analysis Services) store data in the open Delta Parquet format, enabling the use of the same data across multiple engines.
There is no longer a necessity to duplicate data in order to use it with a different engine. You always have the flexibility to select the most suitable engine for the task at hand. For instance, consider a scenario where a team of SQL engineers is constructing a fully transactional data warehouse. They can leverage the T-SQL engine and its capabilities to create tables, transform data, and load the data into tables. If a data scientist needs to utilize this data, they are no longer required to go through a specialized Spark/SQL driver. OneLake stores all data in Delta Parquet format, allowing data scientists to directly utilize the powerful capabilities of the Spark engine and its open-source libraries.
Business users can develop Power BI reports directly on OneLake using the new Direct Lake mode in the Analysis Services engine. The Analysis Services engine powers Power BI semantic models and has always offered two modes of accessing data: import and direct query. The Direct Lake mode provides users with the speed of import without the need to duplicate the data, combining the best aspects of import and direct query.
Get Started with Microsoft Lakehouse
Sign in to Microsoft Fabric.
Switch to Data Engineering using the workload switcher icon at the lower left corner of your homepage.
Select Workspaces from the left-hand menu.
To open your workspace, enter its name in the search textbox located at the top and select it from the search results.
In the upper left corner of the workspace home page, select New and then choose Lakehouse.
Give your lakehouse a name and select Create.
A new lakehouse is created, and if this is your first OneLake item, OneLake is provisioned behind the scenes.
At this point, you have a lakehouse running on top of OneLake. Next, add some data and start organizing your lake.
Adding Data to Your OneLake Lakehouse
In the file browser on the left, select Files and then select New subfolder. Name your subfolder and select Create.
You can repeat this step to add more subfolders as needed.
Select a folder and then select Upload files from the list.
Choose the file you want from your local machine and then select Upload.
You’ve now added data to OneLake. To add data in bulk or schedule data loads into OneLake, use the Get Data button to create pipelines. Find more details about options for getting data in the Microsoft Fabric decision guide: copy activity, dataflow, or Spark.
Select the More icon (…) for the file you uploaded and select Properties from the menu.
The Properties screen shows various details for the file, including the URL and the Azure Blob File System (ABFS) path for use with notebooks. You can copy the ABFS path into a Fabric notebook to query the data using Apache Spark.
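If you prefer to construct the ABFS path yourself rather than copy it from the Properties screen, it follows a predictable pattern. The helper below assembles that URI; the workspace and lakehouse names are placeholders.

```python
# Build the OneLake ABFS URI that Fabric notebooks use to address files:
# abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse/<path>

def onelake_abfs_path(workspace: str, lakehouse: str, relative_path: str) -> str:
    """Assemble an ABFS URI for a file or folder inside a lakehouse."""
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/{relative_path.lstrip('/')}"
    )

path = onelake_abfs_path("MyWorkspace", "MyLakehouse", "Files/sales/2024.csv")
print(path)

# In a Fabric notebook, the URI could then be read with Spark, for example:
# df = spark.read.csv(path, header=True)
```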
OneLake and Lakehouse in Microsoft Fabric
The foundation of a modern data warehouse is a data lake. Microsoft OneLake is a unified, logical data lake for your entire organization, automatically provisioned with every Fabric tenant and intended to be the single location for all your analytics data. You can use OneLake to:
Remove silos and reduce management effort
All organization data is stored, managed, and protected within a single data lake resource. Because OneLake is provisioned with your Fabric tenant, there are no additional resources to provision or manage.
Reduce data movement and duplication
The goal of OneLake is to keep just one copy of data. Fewer copies of data mean fewer data movement processes, leading to greater efficiency and reduced complexity. If needed, you can create a shortcut to data stored in another location, rather than copying it into OneLake.
Use with multiple analytical engines
The information within OneLake is saved in a format accessible to different analytical engines such as Analysis Services (utilized by Power BI), T-SQL, and Apache Spark. Other applications outside of Fabric can also utilize APIs and SDKs to retrieve data from OneLake.
For more information, see OneLake, the OneDrive for data.
Storing data in OneLake involves creating a lakehouse in Fabric. A lakehouse is a data architecture platform designed for storing, managing, and analyzing structured and unstructured data in a unified location. It can effortlessly handle large data volumes of all file types and sizes, and because it's centralized, it can be easily shared and reused within the organization.
The lakehouse includes a built-in SQL analytics endpoint, offering data warehouse capabilities without any data movement. This lets you run SQL queries on your lakehouse data without additional setup.
Delta Lake storage
Delta Lake is an optimized storage layer that forms the basis for data and table storage. It is designed to handle ACID transactions for large-scale data workloads and is the default storage format in a Fabric lakehouse.
Delta Lake ensures reliability, security, and performance in the lakehouse for both streaming and batch operations. While it internally stores data in the Parquet file format, it also maintains transaction logs and statistics that provide additional features and performance improvements over the standard Parquet format.
Choosing Delta Lake format over generic file formats provides several key benefits. These include support for ACID properties, particularly durability to prevent data corruption, faster read queries, increased data freshness, and support for batch and streaming workloads.
Furthermore, Delta Lake enables data rollback through Delta Lake time travel and enhances regulatory compliance and audit through Delta Lake table history. Fabric standardizes the storage file format with Delta Lake, and every workload engine in Fabric creates Delta tables by default when data is written to a new table.
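The transaction log mentioned above is what makes features like time travel and table history possible. The snippet below is a deliberately simplified mock of how a Delta commit in `_delta_log` records "add" and "remove" actions as JSON lines; real log entries carry many more fields (statistics, partition values, schema metadata), so treat this as a conceptual sketch, not the full protocol.

```python
import json

# Minimal mock of one Delta Lake commit file: a JSON-lines sequence of
# actions. Real _delta_log entries are richer; this keeps only the shape
# needed to illustrate how the table's current file set is derived.
commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"add": {"path": "part-0001.parquet", "size": 1024, "dataChange": True}}),
    json.dumps({"add": {"path": "part-0002.parquet", "size": 2048, "dataChange": True}}),
    json.dumps({"remove": {"path": "part-0000.parquet", "dataChange": True}}),
])

def active_files(log_text: str) -> set[str]:
    """Replay add/remove actions to find the files that make up the table."""
    files: set[str] = set()
    for line in log_text.splitlines():
        action = json.loads(line)
        if "add" in action:
            files.add(action["add"]["path"])
        elif "remove" in action:
            files.discard(action["remove"]["path"])
    return files

print(sorted(active_files(commit)))
# -> ['part-0001.parquet', 'part-0002.parquet']
```

Because removed files stay on disk and in older log entries, an engine can reconstruct any earlier version of the table, which is exactly what time travel and table history rely on.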
Data Security With OneLake & Fabric
Microsoft Fabric provides a multi-layer security model for controlling data access. Security settings can be applied to an entire workspace, to specific items, or through granular permissions within each Fabric engine. OneLake has some distinct security considerations, which are detailed in this section.
OneLake data access roles (Preview)
OneLake data access roles (Preview) let users create custom roles within a lakehouse and grant read access only to the folders they specify when data is accessed through OneLake.
For each OneLake role, users can assign members and security groups explicitly, or have them assigned automatically based on workspace roles.
Shortcut security
Shortcuts in Microsoft Fabric simplify data management. OneLake folder security is enforced for OneLake shortcuts according to the roles defined in the lakehouse where the data actually resides. To learn more about the security implications of shortcuts, see the OneLake access control model.
Authentication
OneLake employs Microsoft Entra ID for authentication. It allows you to grant permissions to user identities and service principals. The user identity is automatically extracted by OneLake from tools that utilize Microsoft Entra authentication, and it is then aligned with the permissions configured in the Fabric portal.
Audit Logs
To view the OneLake audit logs, follow the instructions outlined in the "Monitoring user actions in Microsoft Fabric" documentation. The actions performed in OneLake correspond to ADLS APIs such as CreateFile or DeleteFile. Please be aware that the OneLake audit logs do not include log entries for read requests or requests made from Fabric workloads.
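Since OneLake audit entries use ADLS-style operation names such as CreateFile and DeleteFile, a typical first step is filtering exported log records by user or operation. The record shape below is mocked for illustration; real audit entries come from the Fabric/Purview audit pipeline and carry many more fields.

```python
# Mocked audit records: note there are no read operations, matching the
# documented behavior that OneLake audit logs exclude read requests.
audit_log = [
    {"operation": "CreateFile", "path": "/Sales/Files/jan.csv", "user": "alice@contoso.com"},
    {"operation": "DeleteFile", "path": "/Sales/Files/old.csv", "user": "bob@contoso.com"},
    {"operation": "CreateFile", "path": "/Sales/Files/feb.csv", "user": "alice@contoso.com"},
]

def operations_by_user(log, user):
    """Collect the ADLS-style operations a given user performed."""
    return [entry["operation"] for entry in log if entry["user"] == user]

print(operations_by_user(audit_log, "alice@contoso.com"))
# -> ['CreateFile', 'CreateFile']
```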
Encryption and networking
Data at Rest
Data stored in OneLake is automatically encrypted at rest using a Microsoft-managed key, and Microsoft rotates the encryption keys regularly. Data in OneLake is encrypted and decrypted transparently, and the encryption is FIPS 140-2 compliant. Currently, encryption at rest with a customer-managed key is not supported. If you want this feature, you can request it on Microsoft Fabric Ideas.
Data in transit
The data travelling through the public internet between Microsoft services is consistently encrypted using at least TLS 1.2. When possible, Fabric switches to TLS 1.3. Traffic between Microsoft services is always directed through the Microsoft global network. Inbound OneLake communication also ensures the use of TLS 1.2 and switches to TLS 1.3 when possible. When communicating outbound from Fabric to customer-owned infrastructure, preference is given to secure protocols, but older, insecure protocols (including TLS 1.0) may be used if newer protocols are not supported.
OneLake: Expanding Data Access Beyond Fabric
OneLake provides a setting to restrict access to data from applications running outside of Fabric environments. Administrators can find this setting in the OneLake section of the tenant admin portal. When the setting is enabled, users can access data from all sources; for example, through applications using Azure Data Lake Storage (ADLS) APIs or the OneLake file explorer. When it is disabled, users can't access data from applications running outside of Fabric environments.
Understanding OneLake Usage and Billing
OneLake usage is defined by data stored and the number of transactions. This page contains information on how all of OneLake's usage is billed and reported.
Storage
OneLake storage is billed per GB of data stored and doesn't consume Fabric Capacity Units (CUs). Fabric items such as lakehouses and warehouses use OneLake storage. The cost of storing data in OneLake for Power BI import semantic models is included in the Power BI licensing fee. For mirroring, a certain amount of free storage is included based on the compute capacity SKU you purchase. You can find more details in the Fabric pricing section.
You can view your OneLake storage usage in the Storage tab of the Fabric Capacity Metrics app. It's important to note that soft-deleted data is billed at the same rate as active data. For information on monitoring usage, refer to the Metrics app Storage page. To gain a better understanding of OneLake consumption, visit the OneLake Capacity Consumption page.
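Because soft-deleted data is billed at the same rate as active data, both figures matter when estimating cost. The sketch below makes that explicit; the per-GB rate is a parameter, and the $0.023/GB figure in the example is purely hypothetical, so check the Fabric pricing page for your region's actual rate.

```python
# Hedged sketch of OneLake storage billing: soft-deleted data counts fully,
# at the same per-GB rate as active data. The rate itself is an input here.

def monthly_storage_cost(active_gb: float, soft_deleted_gb: float, rate_per_gb: float) -> float:
    """Estimate monthly OneLake storage cost; soft-deleted data is billed too."""
    return (active_gb + soft_deleted_gb) * rate_per_gb

# Example: 500 GB active + 20 GB soft-deleted at a hypothetical $0.023/GB.
print(round(monthly_storage_cost(500, 20, 0.023), 2))
# -> 11.96
```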
Transactions
Requests to OneLake, such as reading or writing data, consume Fabric Capacity Units. The rates on this page define how many capacity units are consumed for a given type of operation.
Operation types
OneLake uses the same mappings as Azure Data Lake Storage (ADLS) to classify each operation into a category.
The following table defines CU consumption when OneLake data is accessed by applications that redirect certain requests. Redirection is an access method that reduces OneLake compute consumption.
| Operation in Metrics App | Description | Operation Unit of Measure | Consumption rate |
| --- | --- | --- | --- |
| OneLake Read via Redirect | OneLake Read via Redirect | Every 4 MB, per 10,000* | 104 CU seconds |
| OneLake Write via Redirect | OneLake Write via Redirect | Every 4 MB, per 10,000* | 1626 CU seconds |
| OneLake Iterative Read via Redirect | OneLake Iterative Read via Redirect | Per 10,000 | 1626 CU seconds |
| OneLake Iterative Write via Redirect | OneLake Iterative Write via Redirect | Per 100 | 1300 CU seconds |
| OneLake Other Operations via Redirect | OneLake Other Operations via Redirect | Per 10,000 | 104 CU seconds |
This table defines CU consumption when OneLake data is accessed using applications that proxy requests.
| Operation in Metrics App | Description | Operation Unit of Measure | Consumption rate |
| --- | --- | --- | --- |
| OneLake Read via Proxy | OneLake Read via Proxy | Every 4 MB, per 10,000* | 306 CU seconds |
| OneLake Write via Proxy | OneLake Write via Proxy | Every 4 MB, per 10,000* | 2650 CU seconds |
| OneLake Iterative Read via Proxy | OneLake Iterative Read via Proxy | Per 10,000 | 4798 CU seconds |
| OneLake Iterative Write via Proxy | OneLake Iterative Write via Proxy | Per 100 | 2117.95 CU seconds |
| OneLake Other Operations | OneLake Other Operations | Per 10,000 | 306 CU seconds |
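One way to read the units in these tables is that a read or write request counts as one operation per 4 MB of data (rounded up), and the listed rate is CU seconds per 10,000 such operations. The calculator below applies that interpretation; treat it as an illustrative model of the billing units, not an official calculator.

```python
import math

# CU seconds per 10,000 operations, taken from the tables above.
RATES_PER_10K = {
    ("read", "redirect"): 104,
    ("write", "redirect"): 1626,
    ("read", "proxy"): 306,
    ("write", "proxy"): 2650,
}

FOUR_MB = 4 * 1024 * 1024  # one operation is counted per 4 MB moved

def cu_seconds(total_bytes: int, op: str, route: str) -> float:
    """Estimate CU seconds consumed moving total_bytes in 4 MB operations."""
    operations = math.ceil(total_bytes / FOUR_MB)
    return operations * RATES_PER_10K[(op, route)] / 10_000

# Reading 1 GiB (256 four-MB operations) via redirect vs. proxy:
print(cu_seconds(1024**3, "read", "redirect"))  # 256 * 104 / 10,000 = 2.6624
print(cu_seconds(1024**3, "read", "proxy"))     # 256 * 306 / 10,000 = 7.8336
```

The comparison makes the design incentive visible: the same 1 GiB read costs roughly three times as many CU seconds through the proxy path as through redirection.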
Shortcuts
When you access data via OneLake shortcuts, the transaction usage counts against the capacity tied to the workspace where the shortcut is created. The capacity where the data is ultimately stored (that the shortcut points to) is billed for the data stored.
When you access data via a shortcut to a source external to OneLake, such as to ADLS Gen2, OneLake does not count the CU usage for that external request. The transactions would be charged directly to you by the external service such as ADLS Gen2.
Paused Capacity
When a capacity is paused, the stored data continues to be billed at the pay-as-you-go rate per GB. All transactions to a paused capacity are rejected, so no Fabric CUs are consumed by OneLake transactions. To access your data or delete a Fabric item, the capacity must be resumed. You can, however, delete the workspace while a capacity is paused.
Consumption of data via shortcuts is always counted against the consumer's capacity, so the capacity where the data is stored can be paused without disrupting downstream consumers in other capacities. See an example on the OneLake Capacity Consumption page.
Disaster recovery
OneLake usage when disaster recovery is enabled is also defined by the amount of data stored and the number of transactions.
Disaster recovery storage
When disaster recovery is enabled, data in OneLake is geo-replicated, and the storage is billed as Business Continuity and Disaster Recovery (BCDR) storage. For more information about pricing, see Fabric pricing.
Disaster recovery operation types
This table defines CU consumption when disaster recovery is enabled, and OneLake data is accessed using applications that redirect requests.
| Operation | Description | Operation Unit of Measure | Capacity Units |
| --- | --- | --- | --- |
| OneLake BCDR Read via Redirect | OneLake BCDR Read via Redirect | Every 4 MB per 10,000 | 104 CU seconds |
| OneLake BCDR Write via Redirect | OneLake BCDR Write via Redirect | Every 4 MB per 10,000 | 3056 CU seconds |
| OneLake BCDR Iterative Read via Redirect | OneLake BCDR Iterative Read via Redirect | Per 10,000 | 1626 CU seconds |
| OneLake BCDR Iterative Write via Redirect | OneLake BCDR Iterative Write via Redirect | Per 100 | 2730 CU seconds |
| OneLake BCDR Other Operations Via Redirect | OneLake BCDR Other Operations Via Redirect | Per 10,000 | 104 CU seconds |
This table defines CU consumption when disaster recovery is enabled, and OneLake data is accessed using applications that proxy requests.
| Operation | Description | Operation Unit of Measure | Capacity Units |
| --- | --- | --- | --- |
| OneLake BCDR Read via Proxy | OneLake BCDR Read via Proxy | Every 4 MB per 10,000 | 306 CU seconds |
| OneLake BCDR Write via Proxy | OneLake BCDR Write via Proxy | Every 4 MB per 10,000 | 3870 CU seconds |
| OneLake BCDR Iterative Read via Proxy | OneLake BCDR Iterative Read via Proxy | Per 10,000 | 4798 CU seconds |
| OneLake BCDR Iterative Write via Proxy | OneLake BCDR Iterative Write via Proxy | Per 100 | 3415.5 CU seconds |
| OneLake BCDR Other Operations | OneLake BCDR Other Operations | Per 10,000 | 306 CU seconds |
Fabric Workload Rate Changes
Consumption rates can change at any time. Microsoft will make reasonable efforts to notify customers of changes through email or in-product notifications. Changes take effect on the date stated in Microsoft's Release Notes or the Microsoft Fabric Blog. If a change to a Microsoft Fabric Workload Consumption Rate materially increases the Capacity Units (CU) required to use a particular workload, customers may cancel under the terms available for their chosen payment method.
Conclusion
Microsoft Fabric OneLake is a comprehensive data lake platform that revolutionizes data management and analysis. By consolidating data from various sources into a unified repository, OneLake eliminates the complexities of data silos and enables organizations to extract maximum value from their data assets. With its flexible data storage, seamless integration with analytical engines, and robust security features, OneLake empowers teams to collaborate effectively, accelerate insights, and drive data-driven decision-making.