Migrating Spark Catalog to Fabric Lakehouse

How to make an existing Spark metastore catalog available in Microsoft Fabric Spark?

Aitor Murguzur
6 min read · Jul 24, 2023

[Updated Nov. 2023]

> Check the official documentation: Migrate Hive Metastore metadata from Azure Synapse to Fabric

When building a lakehouse architecture in Microsoft Fabric, users can make existing data (e.g. Delta tables on ADLS Gen2) available in OneLake (e.g. using shortcuts, or by copying data with dataflows, Data Factory pipelines or a Spark notebook) and read that data from different compute engines such as Spark and SQL. While the data can be made accessible with a few clicks in a Fabric lakehouse, sharing the metadata can still be tricky.

One way of doing that is by leveraging the auto discovery over OneLake shortcuts within the “Tables” section of the lakehouse. That registers existing Delta tables in the lakehouse’s internal managed metastore, allowing users to query and consume Spark catalog objects produced by other compute engines (e.g. Azure Synapse, Databricks or HDInsight).
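
For example, once a shortcut-backed Delta table has been auto-discovered under “Tables”, it can be queried from a Fabric notebook like any other lakehouse table (the lakehouse and table names below are hypothetical):

// Querying an auto-discovered table from a Fabric notebook (hypothetical names)
spark.sql("SELECT * FROM my_lakehouse.my_delta_table LIMIT 10").show()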

Fabric currently lacks a Catalog API and support for an external Hive Metastore (HMS) for metadata interoperability, so the metadata either has to be migrated or the metastore rebuilt from external tables. To overcome this, users need to make the existing Spark catalog available in Fabric to start querying existing databases, schemas and tables.

In this article, I will focus on a one-time Spark catalog migration to Fabric. Let’s get started!

> Jump directly to the GitHub repo if you want to stop reading.

Enabling Spark Metastore in Fabric Lakehouse

The proposed migration approach requires three steps: (i) export the Spark catalog to an intermediate location in OneLake using the source compute engine (e.g. Azure Synapse), (ii) import the catalog from the intermediate storage using a notebook in Fabric, and (iii) run validation queries. Let’s see those in detail.

> Note: the scripts only copy Spark catalog objects (i.e. metadata in HMS) to the Fabric lakehouse. The assumption is that the data is already copied (e.g. from the warehouse location to ADLS Gen2) or made available for managed and external tables (e.g. via shortcuts, preferred) in the Fabric lakehouse. The provided notebooks will not work with Unity Catalog (UC).

Prerequisites

The pre-migration steps involve creating a Fabric workspace and a lakehouse, which will serve both as the target for the migrated catalog objects and as the location of the intermediate output.

Step 1: Export Metadata from source HMS

The step 1 workflow is as follows:

  • 1.1) Import the migration notebook to Azure Synapse, HDInsight or Databricks. This notebook queries and exports HMS metadata of databases, tables, and partitions to an intermediate directory in OneLake (functions not included yet). The Spark internal catalog API is used in this script to read the catalog objects (see the sketch after this list for the basic idea).
  • 1.2) Configure the parameters in the first command to export the metadata to the intermediate storage (OneLake).
// Fabric config
var WorkspaceId = "<workspace_id>"
var LakehouseId = "<lakehouse_id>"
var IntermediateFolderPath = f"abfss://${WorkspaceId}@onelake.dfs.fabric.microsoft.com/${LakehouseId}/Files/hms_output/syn/"

// Databases to export (semicolon-separated) and export behavior
var DatabaseNames = "<db1_name>;<db2_name>"
var SkipExportTablesWithUnrecognizedType: Boolean = false
  • 1.3) Run all notebook commands to export the catalog objects to the intermediate path. Once all cells have completed, you will see the exported folder structure under the intermediate output directory.
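
As a rough illustration of what the export notebook does, the following sketch lists the tables of each database through Spark’s public catalog API and writes the metadata as JSON under the intermediate path. This is a minimal sketch, not the actual notebook code; the selected columns and output layout are assumptions (the real notebook also exports database and partition metadata).

// Minimal export sketch (not the actual migration notebook): list catalog objects
// with Spark's catalog API and persist them as JSON under the intermediate path.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val IntermediateFolderPath = "abfss://<workspace_id>@onelake.dfs.fabric.microsoft.com/<lakehouse_id>/Files/hms_output/syn/"
val DatabaseNames = "<db1_name>;<db2_name>"

DatabaseNames.split(";").foreach { dbName =>
  // Persist table-level metadata (name, type, temporary flag) per database
  spark.catalog.listTables(dbName)
    .select("name", "tableType", "isTemporary")
    .write.mode("overwrite")
    .json(s"${IntermediateFolderPath}${dbName}/tables")
}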

Step 2: Import Metadata to Fabric Lakehouse

The step 2 workflow is as follows:

  • 2.1) Create a shortcut within the “Files” section of the lakehouse pointing to the source warehouse directory. The warehouse directory is used to store both the data (for managed tables) and the metadata of tables in Spark. The abfss path of this shortcut will be used later for the path replacement of managed tables. In the example below, you can see three shortcuts pointing to the Databricks, Azure Synapse and HDInsight warehouse directories.

> Note: for Databricks managed tables, if you’re using the DBFS root directory, you will need to copy the data from the DBFS root warehouse location to an external storage (e.g. ADLS Gen2) and then create a shortcut to it in the Fabric lakehouse.

  • 2.2) Import the metadata notebook to the Fabric lakehouse. This notebook imports database, table, and partition objects from the intermediate storage. The Spark internal catalog API is used in this script to create the catalog objects in Fabric (see the sketch after this list for the basic idea).
  • 2.3) Configure the parameters in the first command. For managed tables, WarehouseMappings is used to do the path replacement using the shortcut created in step 2.1. Similarly, for external tables, you can change the path using WarehouseMappings or keep the original data location, e.g. ADLS Gen2. All source managed tables are converted to external tables in the Fabric Spark metastore by this script. WorkspaceId and LakehouseId refer to the Fabric workspace and lakehouse created in the prerequisites, which contain the shortcut from step 2.1.
// Fabric config
var WorkspaceId = "<workspace_id>"
var LakehouseId = "<lakehouse_id>"
var IntermediateFolderPath = f"abfss://${WorkspaceId}@onelake.dfs.fabric.microsoft.com/${LakehouseId}/Files/hms_output/syn/"

// Warehouse directory mappings: source warehouse paths -> shortcut paths in OneLake
var ContainerName = "<container_name>"
var StorageName = "<storage_account_name>"
var SynapseWorkspaceName = "<synapse_workspace_name>"
var WarehouseMappings: Map[String, String] = Map(
  f"abfss://${ContainerName}@${StorageName}.dfs.core.windows.net/synapse/workspaces/${SynapseWorkspaceName}/warehouse" -> f"abfss://${WorkspaceId}@onelake.dfs.fabric.microsoft.com/${LakehouseId}/Files/warehouse_dir_syn",
  f"dbfs:/mnt/${StorageName}/databricks/warehouse" -> f"abfss://${WorkspaceId}@onelake.dfs.fabric.microsoft.com/${LakehouseId}/Files/warehouse_dir_dbx",
  f"abfss://${ContainerName}@${StorageName}.dfs.core.windows.net/apps/spark/warehouse" -> f"abfss://${WorkspaceId}@onelake.dfs.fabric.microsoft.com/${LakehouseId}/Files/warehouse_dir_hdi"
)

// Metastore config: optional prefixes for migrated objects and conflict behavior
var DatabasePrefix = ""
var TablePrefix = ""
var IgnoreIfExists = true
  • 2.4) Run all notebook commands to import catalog objects from intermediate path.
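
For intuition, here is a minimal sketch of the import logic: read the exported metadata, remap the warehouse locations via WarehouseMappings, and re-create each table as an external table. It is not the actual notebook code; it assumes the configuration variables from the command above, an exported JSON with illustrative name and location columns, and Delta tables, whereas the real notebook also handles databases, partitions and other data formats.

// Minimal import sketch (not the actual notebook): re-create exported tables as
// external tables in the Fabric lakehouse, remapping their warehouse locations.
// Assumes the config variables above, an illustrative exported schema with
// "name" and "location" columns, and Delta as the table format.
def remapLocation(location: String): String =
  WarehouseMappings.foldLeft(location) { case (loc, (src, dst)) => loc.replace(src, dst) }

val dbName = "<db1_name>"
val exportedTables = spark.read.json(s"${IntermediateFolderPath}${dbName}/tables")

spark.sql(s"CREATE DATABASE IF NOT EXISTS `${DatabasePrefix}${dbName}`")
exportedTables.collect().foreach { row =>
  val table = TablePrefix + row.getAs[String]("name")
  val location = remapLocation(row.getAs[String]("location"))
  val ifNotExists = if (IgnoreIfExists) "IF NOT EXISTS" else ""
  // Source managed tables end up as external tables pointing at the remapped path
  spark.sql(s"CREATE TABLE $ifNotExists `${DatabasePrefix}${dbName}`.`$table` USING DELTA LOCATION '$location'")
}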

Step 3: Validate Metadata Migration

Verify that the tables have been created successfully in the auto-created lakehouse(s). Run Spark queries from a Fabric notebook that retrieve and display the data from each table, and check the number of tables and the row counts of a few tables.
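
For example, a few such validation queries from a Fabric notebook could look like this (database and table names are placeholders):

// Validation examples (database and table names are placeholders)
spark.sql("SHOW TABLES IN <db1_name>").show()             // list migrated tables
println(spark.catalog.listTables("<db1_name>").count())   // number of tables in the database
println(spark.table("<db1_name>.<table_name>").count())   // row count for a sample table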

Additional Notes

In this article, I showed you how to make an existing Spark catalog available in a Fabric lakehouse (as external tables). However, it is important to be mindful of a few considerations:

  • Tables and data formats: The approach has been tested on Azure Synapse Spark (Spark 3.3), Databricks (12.2 LTS, Spark 3.3) and HDInsight Spark (HDI 5.1 in preview, Spark 3.3), using both HMS managed and external tables. Managed and unmanaged tables (both partitioned and non-partitioned) have been tested with csv, json, parquet and Delta data formats (see #996 and #1051 for SHOW PARTITIONS support).
  • Source databases to Fabric lakehouse mapping: when importing multiple databases, you can (i) create one lakehouse per database (the approach used here), or (ii) move all tables from different databases into a single lakehouse. For the latter, since 3-part naming is not supported yet, all migrated tables could be named <lakehouse>.<db_name>_<table_name>.
  • Scalability: The solution uses the internal Spark catalog API to do the import/export rather than connecting directly to HMS to read the catalog objects, so it may not scale well if the catalog is large. In that case, you would need to change the export logic to query the HMS database directly (see the sketch after this list).
  • Data accuracy: There is no isolation guarantee, which means that if the source Spark compute engine makes concurrent modifications to the metastore while the migration notebook is running, inconsistent data can be introduced into the Fabric lakehouse.
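
If the export logic does need to go directly against the HMS backing database, a JDBC read over the standard HMS schema (DBS, TBLS, SDS) could replace the catalog API calls. The sketch below is only an illustration and is not part of the provided notebooks; the JDBC URL, credentials and database type are placeholders.

// Sketch: reading table metadata straight from the HMS backing database over JDBC.
// Connection details are placeholders; DBS/TBLS/SDS are standard HMS schema tables.
val hmsTables = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://<hms_db_server>;database=<hms_db_name>")
  .option("user", "<user>")
  .option("password", "<password>")
  .option("query",
    """SELECT d.NAME AS database_name, t.TBL_NAME AS table_name, t.TBL_TYPE, s.LOCATION
      |FROM TBLS t
      |JOIN DBS d ON t.DB_ID = d.DB_ID
      |JOIN SDS s ON t.SD_ID = s.SD_ID""".stripMargin)
  .load()
hmsTables.show()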

For any suggestions or questions, feel free to reach out :)


Aitor Murguzur

All things data. Principal PM @Microsoft Fabric CAT Spark. PhD in Comp Sci. All views are my own. https://www.linkedin.com/in/murggu/