Microsoft Fabric and Databricks Unity Catalog — unraveling the integration scenarios
The first Microsoft Fabric community conference occurred this year, featuring two sneak peeks showcasing Fabric and Databricks Unity Catalog (UC) integration during the day 1 keynote.
In a prior blog post, I explored options to write from Databricks to OneLake and from Fabric Spark to ADLS Gen2 (without Unity Catalog-enabled clusters). This blog post aims to uncover various scenarios related to Fabric + Databricks Unity Catalog integration (with Unity Catalog-enabled clusters). For lakehouse scenarios, see this post from Piethein.
- Can I sync UC tables into OneLake catalog? How?
- Can I write from UC-enabled clusters to OneLake?
- Can I integrate UC with OneLake? Can I run federated queries againts SQL endpoint / Fabric Data Warehouse?
Note: This article reflects my personal experiences and viewpoints, not the official stance of Microsoft or Databricks. Additionally, while this blog post outlines potential scenarios, it does not necessarily reflect the Fabric roadmap or intentions. Not all options mentioned may become operational in the future.
Databricks Unity Catalog and Microsoft Fabric
The integration scenarios can essentially be viewed based on the entry point, Unity Catalog and Fabric:
- Accessing Unity Catalog from Fabric (Fabric → Unity Catalog): This functionality could enable users to seamlessly access Unity Catalog catalog, schemas, and tables from within Fabric.
- Utilizing Fabric from Unity Catalog (DBX/Unity Catalog → Fabric): This feature could offer users the ability to access and use OneLake directly from within Unity Catalog and run federated queries over a SQL endpoint or Fabric Data Warehouse.
Let’s explore these scenarios further.
Fabric → Unity Catalog
Using Unity Catalog from Fabric
If you’re in Fabric, here are some of the options for accessing Unity Catalog tables from Fabric. You can also r/w directly from Fabric Spark to ADLS Gen2.
Current options
Users currently have two options for creating shortcuts to UC tables: manual (tedious) or semi-automatic, the latter achievable through a notebook. With the semi-automatic method, users can integrate UC Delta external tables into OneLake by creating shortcuts. They specify the catalog and schema names for synchronization, which generates shortcuts to tables in those schemas within the Fabric lakehouse. See additional instructions on executing the utility notebook.
# configuration
dbx_workspace = "<databricks_workspace_url>"
dbx_token = "<pat_token>"
dbx_uc_catalog = "catalog1"
dbx_uc_schemas = '["schema1", "schema2"]'
fab_workspace_id = "<workspace_id>"
fab_lakehouse_id = "<lakehouse_id>"
fab_shortcut_connection_id = "<connection_id>"
fab_consider_dbx_uc_table_changes = True
# sync UC tables to lakehouse
sc.addPyFile('https://raw.githubusercontent.com/microsoft/fabric-samples/main/docs-samples/onelake/unity-catalog/util.py')
from util import *
databricks_config = {
'dbx_workspace': dbx_workspace,
'dbx_token': dbx_token,
'dbx_uc_catalog': dbx_uc_catalog,
'dbx_uc_schemas': json.loads(dbx_uc_schemas)
}
fabric_config = {
'workspace_id': fab_workspace_id,
'lakehouse_id': fab_lakehouse_id,
'shortcut_connection_id': fab_shortcut_connection_id,
"consider_dbx_uc_table_changes": fab_consider_dbx_uc_table_changes
}
sync_dbx_uc_tables_to_onelake(databricks_config, fabric_config)
Potential future options
- Unity Catalog native item on Fabric: similar to Hive Metastore metadata migration to Fabric lakehouse, Unity Catalog metadata can be synced to Fabric lakehouse, enabling access to Unity Catalog tables. A sneak peak of this scenario was demonstrated at FabCon, showing how users could directly access and query Unity Catalog tables using the Fabric UI.
- Unity Catalog Shortcut in Fabric: Similar to Dataverse shortcuts, the OneLake shortcut UX could potentially support creating shortcuts to Unity Catalog tables.
Note: Options such as Delta sharing, JDBC/ODBC to Databricks, Fabric data pipeline Databricks activity, and others are not mentioned here.
Databricks/Unity Catalog → Fabric
Using Fabric and OneLake from Databricks/Unity Catalog
Unity Catalog offers different ways to connect and leverage Cloud object storage connections (e.g. ADLS Gen2) as well as connect to external data systems to run federated queries (e.g. Azure Synapse).
Current options
Currently, users can use OneLake from UC-enabled clusters as follows: (i) r/w to OneLake using Service Principal (SPN) based authentication, and (ii) r/w to Onelake using mount points with SPN auth.
# r/w using spn
workspace_name = "<workspace_name>"
lakehouse_name = "<lakehouse_name>"
tenant_id = "<tenant_id>"
service_principal_id = "<service_principal_id>"
service_principal_password = "<service_principal_password>"
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", service_principal_id)
spark.conf.set("fs.azure.account.oauth2.client.secret", service_principal_password)
spark.conf.set("fs.azure.account.oauth2.client.endpoint", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
# read
df = spark.read.format("parquet").load(f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/{lakehouse_name}.Lakehouse/Files/data")
df.show(10)
# write
df.write.format("delta").mode("overwrite").save(f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/{lakehouse_name}.Lakehouse/Tables/dbx_delta_spn")
# mount with spn
workspace_id = "<workspace_id>"
lakehouse_id = "<lakehouse_id>"
tenant_id = "<tenant_id>"
service_principal_id = "<service_principal_id>"
service_principal_password = "<service_principal_password>"
configs = {
"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": service_principal_id,
"fs.azure.account.oauth2.client.secret": service_principal_password,
"fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
}
mount_point = "/mnt/onelake-fabric"
dbutils.fs.mount(
source = f"abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com",
mount_point = mount_point,
extra_configs = configs
)
# read
df = spark.read.format("parquet").load(f"/mnt/onelake-fabric/{lakehouse_id}/Files/data")
df.show(10)
# write
df.write.format("delta").mode("overwrite").save(f"/mnt/onelake-fabric/{lakehouse_id}/Tables/dbx_delta_mount_spn")
Note: Creating an external table using the OneLake abfss
path or mount path will now result in an exception in UC. Currently, you cannot register an external table in UC with OneLake as the underlying storage. This may lead to potential future scenarios.
- INVALID_PARAMETER_VALUE: Missing cloud file system scheme
- Failed to acquire a SAS token for list. Invalid Azure Path
Additionally, read-only access to data in an SQL endpoint or Fabric Data Warehouse using UC foreign catalog is an option (federated lakehouse). Currently, Azure Synapse authentication is based on username/password, but the SQL Server connector supports SPN authentication, so the latter can be used to connect.
Potential future options
Similar to ADLS Gen2 and Azure Synapse, different options could exist in the future:
- OneLake as default managed storage: Databricks started to roll out automatic enablement of Unity Catalog, i.e. automatically-provisioned Unity Catalog metastore with Databricks-managed storage (e.g. ADLS Gen2). However, user can also create user-managed metastore-level storage while creating Unity Catalog metastore pointing to OneLake in this case. Note: this is not possible yet.
- OneLake as external location: External locations are used to define managed storage locations for catalogs and schemas, and to define locations for external tables and external volumes. For instance, if users are using external tables in Spark, OneLake could be leveraged as an external location. Note: this is not possible yet.
- OneLake for Volumes: Volumes represent a logical volume of storage in a cloud object storage location, adding governance over non tabular datasets. External and managed volumes could exist using OneLake, e.g., as for ADLS Gen2. Note: this is not possible yet.
Note: Other options like ODBC to SQL endpoint from Databricks, Partner Connect (Power BI + Databricks), and streaming options are not mentioned here.
Wait, what about access policies?
It’s still open to explore how Unity Catalog access policies could align with OneLake RBAC and OneSecurity, as well as whether it would be possible to carry over security and access policies from Unity Catalog to Fabric, and vice versa.
References:
- Non-UC scenarios: Databricks and Fabric — writing to OneLake and ADLS Gen2 | by Aitor Murguzur | Medium
- Lakehouse scenarios: Integrating Azure Databricks and Microsoft Fabric | by Piethein Strengholt | Jun, 2024 | Medium
- Integrate Databricks Unity Catalog with OneLake — Microsoft Fabric | Microsoft Learn