Creating Managed and External Spark Tables in Fabric Lakehouse

Lakehouse Tables with Microsoft Fabric Spark

Aitor Murguzur
6 min read · Jul 10, 2023

Apache Spark supports two main types of tables: managed and unmanaged tables. In Microsoft Fabric, you can create these tables in your Lakehouse using the Spark compute engine.

This article serves as a quickstart guide for Spark tables in Fabric Lakehouse. Let’s dive in and get started!

> Jump directly to the GitHub repo if you want to stop reading.

Introduction

Before we start, let’s quickly revisit the difference between managed and unmanaged tables in Spark:

  • Managed/internal table: Spark manages both the data and the metadata. Data is stored in a warehouse directory and metadata in a metastore (databases, tables, views, functions and partition information). When you drop a managed table, both the data in the warehouse directory and the metadata in the metastore are removed; neither the table in the catalog nor the data exists anymore.
  • Unmanaged/external table: Spark manages only the metadata. You specify the location where the table data should be stored, or the directory whose data is used to build the table. When you drop an external table, only the metadata is removed: the table can no longer be queried through Spark SQL, since it is no longer in the catalog, but the actual table data remains in the external location (see the sketch right after this list).
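
To make the difference concrete, here is a minimal sketch (table and path names are illustrative, and df stands for any Spark DataFrame): dropping the managed table removes both metadata and data, while dropping the external one removes only the metadata.

# Illustrative sketch: managed vs. external (unmanaged) table behavior
df.write.format("delta").saveAsTable("t_managed")  # Spark owns data + metadata
df.write.format("delta").option("path", "Files/t_external").saveAsTable("t_external")  # Spark owns metadata only

spark.sql("DROP TABLE t_managed")   # removes the metadata and deletes the data
spark.sql("DROP TABLE t_external")  # removes the metadata only; files under Files/t_external remain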

Fabric Lakehouse is the file system under a workspace. It can store a variety of data and file types for analytics, and those can be consumed by different engines (Spark, SQL, KQL and Power BI). The lakehouse is split into two sections:

  • Tables section: the managed area of the lake. All tables, both Spark managed and unmanaged, are registered here.
  • Files section: the unmanaged area of the lake. You can store any type of data or file here and organize it into folders and subfolders.

> Note: when a lakehouse is created, it auto-generates a serving layer consisting of a SQL endpoint and a default dataset.

Hence, a Fabric Lakehouse can be defined as a collection of folders, files and tables that acts as a database within OneLake, with ACID transaction capabilities when using Delta tables. All compute engines in Fabric share a unified storage layer (OneLake) and format (Delta).

Spark Tables and Fabric Lakehouse Quickstart

Prerequisite: Create a Fabric Lakehouse

If you don’t have one already, create a lakehouse (e.g. name it samplelk) within your Fabric workspace, and import this Notebook.

Creating Managed Tables in Fabric

As mentioned, when you create a managed table, Spark will manage both the data and the metadata. In Fabric Lakehouse, the data is written into the “Tables” section within the lakehouse by default. There are different ways to create managed tables in Spark:

Using saveAsTable() method

# Load data
df = spark.read.format("csv").option("header", "true").load("wasbs://sampledata@azuresynapsestorage.blob.core.windows.net/WideWorldImportersDW/csv/full/dimension_customer")

# Managed Delta table using saveAsTable()
df.write.format("delta").mode("overwrite").saveAsTable("dimCustomer_manag1")

Using CREATE TABLE AS SELECT (CTAS)

# Managed Delta table using CTAS
df.createOrReplaceTempView("df_customer_view")
spark.sql("CREATE TABLE IF NOT EXISTS dimCustomer_manag2 USING DELTA AS SELECT * FROM df_customer_view")

If you have a SQL background, you can also use the %%sql magic in a Spark notebook and run the SQL command directly:

%%sql
CREATE TABLE IF NOT EXISTS dimCustomer_manag2_sql
USING DELTA AS
SELECT * FROM df_customer_view

Using INSERT INTO

%%sql
-- Managed Delta table using empty table + INSERT INTO
CREATE TABLE dimCustomer_manag3 (
CustomerKey STRING,
WWICustomerID STRING,
Customer STRING,
BillToCustomer STRING,
Category STRING,
BuyingGroup STRING,
PrimaryContact STRING,
PostalCode STRING,
ValidFrom STRING,
ValidTo STRING,
LineageKey STRING
) USING DELTA;

INSERT INTO dimCustomer_manag3
SELECT * FROM df_customer_view;

Using save() method

df.write.format("delta").mode("overwrite").save("Tables/dimCustomer_manag4")

This only works for Delta tables, since auto-discovery in the Tables section of the lakehouse currently only works with the Delta Lake format. If you try to create a table in a non-Delta format with .save("Tables/<table_name>"), it will end up in the unidentified area.

Alternatively, if you want to create managed Delta tables manually from parquet and csv files, you can use the Load to Tables functionality or leverage auto-discovery when creating tables over OneLake shortcuts. Apart from Delta tables, you can also create non-Delta managed tables in other data formats such as csv, json and parquet, and those will be visible in the “Tables” section of the Lakehouse explorer UI.

# Managed tables using other data formats
df.write.format("csv").mode("overwrite").saveAsTable("dimCustomer_manag1_csv")
df.write.format("json").mode("overwrite").saveAsTable("dimCustomer_manag1_json")
df.write.format("parquet").mode("overwrite").saveAsTable("dimCustomer_manag1_parquet")

You can check that all these commands successfully created managed tables named dimcustomer_manag# with tableType = 'MANAGED' by running:

spark.catalog.listTables()
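
If you only want the managed tables, a quick filter on the catalog listing works (a small sketch using the tableType attribute returned by listTables()):

# Filter the catalog listing down to managed tables only
[t.name for t in spark.catalog.listTables() if t.tableType == "MANAGED"]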

You can see the details of a particular table by running:

%%sql
DESCRIBE EXTENDED <lh_name>.dimcustomer_manag1

Creating External Tables in Fabric

Unmanaged tables provide much more flexibility, as the table data can be stored in a location of your choice, or the table can be built directly on top of data that already exists in an external directory.

In the example below, I am going to use the “Files” section in the lakehouse as an external location, but you could also create unmanaged tables from other locations such as ADLS Gen2 or use the “Tables” section in the lakehouse.
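
For example, an unmanaged table over an ADLS Gen2 location could look like the sketch below; the storage account, container and folder names are placeholders, and access to the storage account must already be configured.

# Sketch: unmanaged Delta table over an external ADLS Gen2 path (placeholder URI)
df.write.format("delta").mode("overwrite") \
    .option("path", "abfss://<container>@<storageaccount>.dfs.core.windows.net/external/dimCustomer") \
    .saveAsTable("dimCustomer_adls")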

Using saveAsTable() method with a path

# Using saveAsTable() method with a path
df.repartition(2).write.format("delta").mode("overwrite") \
    .option("path", "Files/dimCustomer_unmanag1") \
    .saveAsTable("dimCustomer_unmanag1")

Using CTAS

%%sql
-- Unmanaged tables using CTAS
CREATE EXTERNAL TABLE IF NOT EXISTS dimCustomer_unmanag2
USING DELTA
LOCATION 'Files/dimCustomer_unmanag2' AS
SELECT * FROM df_customer_view

Using <format> OPTIONS path syntax + INSERT INTO

%%sql
-- Unmanaged Delta table using empty table + INSERT INTO
CREATE TABLE dimCustomer_unmanag3 (
CustomerKey STRING,
WWICustomerID STRING,
Customer STRING,
BillToCustomer STRING,
Category STRING,
BuyingGroup STRING,
PrimaryContact STRING,
PostalCode STRING,
ValidFrom STRING,
ValidTo STRING,
LineageKey STRING
) USING DELTA OPTIONS (path 'Files/dimCustomer_unmanag3');

INSERT INTO dimCustomer_unmanag3
SELECT * FROM df_customer_view;

If you list the tables in the catalog again, you will see three additional unmanaged tables, together with the managed tables and the temp view.

If you try to drop an unmanaged table, metadata will be removed but the data will remain in the “Files” section of the lakehouse:

%%sql
DROP TABLE <lh_name>.dimcustomer_unmanag1
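
As a quick check (a sketch, assuming the unmanaged table above was created under Files/dimCustomer_unmanag1): the table is no longer in the catalog, but the Delta files can still be read directly by path.

# The dropped table is gone from the catalog, but its data is still readable by path
df_remaining = spark.read.format("delta").load("Files/dimCustomer_unmanag1")
df_remaining.count()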

For any suggestions or questions, feel free to reach out :)

Additional Notes

Fabric compute engines (e.g. Spark and SQL) are optimized for the Delta format, and if you want a table to appear in the SQL endpoint and Power BI, make sure you create it as a managed Delta table (for now). However, as we have seen above, Fabric Lakehouse and the Spark engine support all table types, both managed and unmanaged, as well as non-Delta formats such as csv, json or parquet.

In terms of the default data format in Fabric Spark (tested on starter pools), Hive is currently the default for Spark SQL and parquet for PySpark. This is due to the spark.sql.legacy.createHiveTableByDefault property (default=true) in starter pools, which controls the default table type when Spark SQL creates a table in the metastore. When set to true, the legacy behavior applies and the default table type is a Hive-managed table. When set to false, the default table type is a data source table, determined by the spark.sql.sources.default property (default=parquet). The default should be Delta; this is a known bug.

--By default, table format is hive
CREATE TABLE IF NOT EXISTS t_1 (id INT, name STRING)

--By setting spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false"), table format is parquet. This is because 'spark.sql.sources.default' property (default=parquet)
CREATE TABLE IF NOT EXISTS t_2 (id INT, name STRING)

--By setting spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false") and spark.conf.set("spark.sql.sources.default", "delta"), table format is delta
CREATE TABLE IF NOT EXISTS t_3 (id INT, name STRING)
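
The same behavior can be reproduced from PySpark (a session-level sketch mirroring the comments above; the table name t_delta_default is new and only used for illustration): after setting both properties, a plain CREATE TABLE produces a Delta table, which you can confirm from the Provider row of DESCRIBE EXTENDED.

# Make Delta the default provider for plain CREATE TABLE statements in this session
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
spark.conf.set("spark.sql.sources.default", "delta")

spark.sql("CREATE TABLE IF NOT EXISTS t_delta_default (id INT, name STRING)")
spark.sql("DESCRIBE EXTENDED t_delta_default").where("col_name = 'Provider'").show()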

Both managed and unmanaged Delta tables can be V-Order optimized. Sandeep Pawar has a good blog on this here. Learn more about V-Order.

Lastly, in terms of the metastore, Fabric Spark currently only supports 2-part naming, i.e. <database|schema>.<object> where database = lakehouse name. In terms of metastore interoperability, Fabric Spark does not provide a catalog API yet, and connecting to an existing external metastore is not possible at the moment.
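
In practice this means tables are referenced as <lakehouse_name>.<table_name>; for example, with the samplelk lakehouse from the prerequisite:

# 2-part naming: <lakehouse>.<table>
spark.sql("SELECT * FROM samplelk.dimcustomer_manag1 LIMIT 10").show()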

If you’ve any suggestion or idea, feel free to submit it here.


Aitor Murguzur

All things data. Principal PM @Microsoft Fabric CAT Spark. PhD in Comp Sci. All views are my own. https://www.linkedin.com/in/murggu/