Using a Shared Hive Metastore Across Azure Synapse, HDInsight, and Databricks

Aitor Murguzur
5 min read · Mar 2, 2023


[Kudos to Umesh Pawar and Lee Hilton-Smith for their contributions]

When building a data lake or lakehouse on Azure, users can keep their data in ADLS Gen2 as Delta tables and read it from different compute engines. This makes it possible to switch from one platform to another while keeping the data interoperable.

While Delta tables stored in ADLS Gen2 enable unified data access across compute engines, the metadata can also be kept in a separate metastore. However, sharing that metastore can be tricky when switching platforms, often requiring a metadata migration or rebuilding the metastore from external tables.

In this article, I will demonstrate how an external Hive metastore (HMS) can be shared across three different Spark-based compute engines on Azure — Azure Synapse, HDInsight, and Databricks. Let’s get started!

Architecture

  • Compute layer: HDInsight 5.0 (Spark 3.1.3), Databricks (Runtime 9.1 LTS, Spark 3.1.2), and Azure Synapse Spark (Spark 3.1.3).
  • Storage layer: ADLS Gen2 as a data store, Azure SQL Database as an external Hive metastore (3.1.0).

Versions & Compatibility

The following table summarizes the versions supported by Azure Synapse, HDInsight and Databricks. Since all three support Spark 3.1.x (the latest Spark version supported by HDInsight 5.0), we will go with that option here: Spark 3.1.x, HMS 3.1.0 and Delta 1.0.x.

+---------------+-------+---------------+-----------+------------+
| Spark Version | HMS   | Synapse Spark | HDInsight | Databricks |
+---------------+-------+---------------+-----------+------------+
| 3.1           | 3.1.x | Yes           | Yes       | Yes        |
| 3.3           | 3.1.x | Yes           | No        | Yes        |
+---------------+-------+---------------+-----------+------------+

> HDInsight plans to release version 5.1, which will include Spark 3.3.

Prerequisites

Create the required infrastructure on Azure as described in the architecture above: an ADLS Gen2 account, an Azure SQL Database for the external HMS, an HDInsight 5.0 cluster linked to that metastore at creation time, a Synapse workspace with a Spark pool, and a Databricks workspace. If you already have the required infrastructure (with the right versions!), you can skip this part.

Once the HMS schema has been initialized in the Azure SQL Database, you can sanity-check it by querying the metastore's DBS table, which lists the databases registered in the HMS:

SELECT * FROM dbo.DBS;
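
You can also double-check that the schema version matches HMS 3.1.0 by querying the metastore's VERSION table (assuming the schema objects live under the default dbo schema):

SELECT SCHEMA_VERSION, VERSION_COMMENT FROM dbo.VERSION;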

Once the infrastructure is ready, you need to configure Azure Synapse, HDInsight and Databricks to connect to the external HMS.

Connecting HDInsight to External HMS

Since the external HMS was already linked during HDInsight cluster creation, you only need to set the metastore.catalog.default property to hive; HDInsight sets it to spark by default.

To change it, log in to the Ambari UI and update the property in the spark3-hive-site-override configuration.
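
For reference, the resulting key-value pair looks like the snippet below (the Ambari navigation path is an assumption and may differ slightly across cluster versions):

# Ambari UI > Spark3 > Configs > Advanced spark3-hive-site-override
metastore.catalog.default=hive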

After the change is saved and the Spark components have been restarted, try listing the HMS databases by running the query below in a Jupyter notebook. You should see the same three databases: sys, default and information_schema.

%%sql
SHOW DATABASES

Connecting Synapse to External HMS

Connecting Synapse Spark pool to an external HMS (Azure SQL Database) involves two steps:

  • Set up a linked service to the HMS (Azure SQL Database): Create a linked service in Synapse Studio that points to the Azure SQL Database hosting the HMS, and verify that the connection works.
  • Configure a Spark pool to use the external HMS: You can do this at the Spark session level in a Spark notebook (see the example below) or at the Spark pool level using Spark configurations (a pool-level sketch follows the session example). See more on Synapse Spark configurations here.
%%configure -f
{
    "conf": {
        "spark.sql.hive.metastore.version": "3.1",
        "spark.hadoop.hive.synapse.externalmetastore.linkedservice.name": "<your linked service name>",
        "spark.sql.hive.metastore.jars": "/opt/hive-metastore/lib-3.1/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-client/*"
    }
}
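
If you prefer configuring this once at the Spark pool level, a minimal sketch of the equivalent pool configuration follows; it reuses the same three properties from the session-level example above, written as one key-value pair per line:

spark.sql.hive.metastore.version 3.1
spark.hadoop.hive.synapse.externalmetastore.linkedservice.name <your linked service name>
spark.sql.hive.metastore.jars /opt/hive-metastore/lib-3.1/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-client/*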

With these settings in place, check connectivity to the external HMS by listing catalog objects with the query below in a Spark notebook.

%%sql
SHOW DATABASES

Connecting Databricks to External HMS

Connecting Databricks to an external HMS is also pretty simple.

  • Download the metastore jars and point to them: Synapse provides the jars by default, but to make the external HMS work with Databricks, you need to download the required dependencies and make them available to the cluster (a jar-copy sketch follows the cluster configuration below). You can use this notebook.
  • Set up the external metastore configuration: You can connect to an existing Hive metastore by setting the required Spark configurations via the cluster UI or an init script. You could also set the same properties at the Spark session level in a notebook. The code below shows the required cluster configuration.
spark.sql.hive.metastore.version 3.1
spark.hadoop.javax.jdo.option.ConnectionUserName <username>
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<sql_server_name>.database.windows.net:1433;database=<sql_db_name>
spark.hadoop.javax.jdo.option.ConnectionPassword <password>
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.sql.hive.metastore.jars /dbfs/metastore_jars/hive-v3_1/*
spark.databricks.delta.preview.enabled true
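
For the jar download step mentioned above, here is a minimal sketch of one common approach, assuming the pattern from the Databricks external-metastore documentation (the <run_id> suffix is an illustrative placeholder; check the driver logs for the actual path):

# Run once on a temporary cluster started with:
#   spark.sql.hive.metastore.version 3.1
#   spark.sql.hive.metastore.jars maven
# Maven resolves the Hive 3.1 client jars onto the driver's local disk;
# copy them to DBFS so permanent clusters can point at a stable path.
dbutils.fs.mkdirs("dbfs:/metastore_jars/hive-v3_1/")
dbutils.fs.cp("file:/local_disk0/tmp/hive-v3_1-<run_id>/",
              "dbfs:/metastore_jars/hive-v3_1/", recurse=True)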

After the cluster is created (make sure credential passthrough is checked), try listing catalog objects by running the query below in a Notebook.

%sql
SHOW DATABASES

Create Sample Table

HDInsight, Azure Synapse and Databricks are now connected to the external HMS. Let’s create a table and query it using different compute engines!

  • Import Notebook in Synapse: Jump to Synapse Studio and import this notebook. Attach it to a Spark pool and run the first two cells.
  • Create a Database: Create a sharedhmsdb database.
%%sql
CREATE DATABASE IF NOT EXISTS sharedhmsdb
COMMENT "Test database for shared HMS"
LOCATION "abfss://<container>@<adls_gen2_name>.dfs.core.windows.net/hive/warehouse";
  • Create a Table: Create a dummy table from the HDInsight CSV samples.
spark.read.format("csv") \
    .load("abfss://<container>@<adls_gen2_name>.dfs.core.windows.net/HdiSamples/HdiSamples/SensorSampleData/building/*csv") \
    .write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("sharedhmsdb.building_summary")
  • Validate table creation: Query the table just created in sharedhmsdb.
%%sql
SELECT COUNT(*) FROM sharedhmsdb.building_summary
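
As an optional sanity check (not part of the original notebook), you can confirm the table was registered as a Delta table and inspect its storage location through the shared HMS:

%%sql
DESCRIBE TABLE EXTENDED sharedhmsdb.building_summary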

Swapping Compute Engines

The external HMS is now working in Synapse (note that it is only accessible from the Spark pool). Let's move on to HDInsight and Databricks and see if the same database and table are visible there.

  • Querying from HDInsight: Open Jupyter and import this notebook. Run all cells and you will see that the sharedhmsdb database and the building_summary table are accessible from HDInsight.
  • Querying from Databricks: Import this notebook and run all cells. You will see that the same sharedhmsdb database and building_summary table are also accessible (using credential passthrough to access ADLS Gen2). A sketch of the validation cells is shown below.
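
For reference, a minimal sketch of the kind of validation cells those notebooks contain (an assumption based on the steps above):

# List the databases registered in the shared HMS, then query the Delta table.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT COUNT(*) AS row_count FROM sharedhmsdb.building_summary").show()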

For any suggestions or questions, feel free to reach out :)
