The Periodic Table of Fabric Data Engineering (Spark)

Aitor Murguzur
10 min read · Sep 9, 2024


The Periodic Table of Fabric Data Engineering (Spark) outlines core concepts for data engineering and Spark in Microsoft Fabric.

! READ THIS: This blog is outdated after the FabCon Europe 2024 announcements. I will update it soon.

Lakehouse
Spark Compute
Spark Development
Spark Collaterals

> Download the pdf version here.

Lakehouse (Data)

Lakehouse
Lakehouse Ingestion
Lakehouse Access
Table Operations

Lh — Lakehouse: a lakehouse is an item within a workspace that stores various data/file types for analytics and can be used by different engines.

  • Le — Lakehouse Explorer: built-in UI (/Tables and /Files sections) for managing tables/files.
  • Tb — Tables Section: the managed area of the lake. All tables, both Spark managed and unmanaged (external), are registered here.
  • Fl — Files Section: the unmanaged area of the lake. You can store any type of data/file here and organize it in folders and subfolders (see the notebook sketch after this list).
  • Se — SQL Endpoint: a read-only T-SQL endpoint over lakehouse managed Delta tables. You can run T-SQL queries, create views, or apply security on them.
  • Ls — Lakehouse Schemas: supports the lakehouse.schema.table naming construct. A default schema named dbo is created under /Tables.
  • Tf — Table Formats: Fabric engines adopt Delta Lake as the default table format, but /Tables can hold other data formats (e.g. Parquet). Iceberg/Hudi interoperability is planned through Apache XTable.
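
To make the /Tables vs. /Files distinction concrete, here is a minimal notebook sketch, assuming a default lakehouse is attached to the notebook; the orders.csv path and table name are illustrative, not part of any real setup.

```python
# Sketch: working with the managed (/Tables) and unmanaged (/Files) areas from a notebook.
# Assumes a default lakehouse is attached; "Files/raw/orders.csv" is a placeholder path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# /Files is the unmanaged area: read a raw CSV dropped into a folder.
raw_df = spark.read.option("header", "true").csv("Files/raw/orders.csv")

# /Tables is the managed area: saving as a table registers a Delta table there.
raw_df.write.format("delta").mode("overwrite").saveAsTable("orders")

# Managed tables can then be queried by name (or lakehouse.schema.table with schemas enabled).
spark.sql("SELECT COUNT(*) FROM orders").show()
```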

Li — Lakehouse Ingestion: there are different ways to load data into a lakehouse in Fabric.

  • Du — Direct Upload: allows uploading data stored on your local machine using the Lakehouse explorer UI.
  • Pi — Data Pipelines: the Copy activity in Fabric data pipelines allows ingesting data into a lakehouse.
  • Df — Dataflows Gen2: similarly, you can select a lakehouse as the destination in your dataflows to push data into it.
  • Sh — Shortcuts: a pointer or symbolic link to data in an internal location (e.g. another lakehouse) or an external object store.
  • Cn — Spark Connectors: different Spark connectors (e.g. Azure SQL DB) can be used, for example, via a notebook to load data into a lakehouse.
  • Ss — Spark Structured Streaming: stream data from different sources using readStream and write it to a lakehouse (see the streaming sketch after this list).
  • Ev — Eventstream: you can add a lakehouse destination to an eventstream and ingest data there.
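
As a rough illustration of the Spark Structured Streaming option above, here is a sketch that streams into a lakehouse Delta table. It uses the built-in rate source to stay self-contained, and the table and checkpoint names are placeholders, assuming a default lakehouse is attached.

```python
# Sketch: Spark Structured Streaming into a lakehouse Delta table.
# The "rate" source keeps the example self-contained; swap it for Kafka/Event Hubs/etc. in a real job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a synthetic stream of rows.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Write the stream to a managed Delta table, checkpointing under /Files.
query = (
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "Files/checkpoints/rate_demo")
    .outputMode("append")
    .toTable("rate_demo")
)

# Let it run briefly for this sketch, then stop the query.
query.awaitTermination(60)
query.stop()
```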

La — Lakehouse Access: offers various methods for configuring permissions and ensuring data accessibility within the lakehouse.

  • Ro — Workspace Roles: workspace roles (admin, member, contributor, viewer) define what users can do with Fabric items.
  • Lx — Lakehouse Sharing: allows granting users or groups access to a lakehouse without giving access to a workspace.
  • Or — OneLake RBAC: manages RBAC permissions under /Files (e.g. folders) and /Tables sections of a lakehouse.
  • Gq — API for GraphQL: enables creating an API on top of lakehouse SQL analytics endpoint tables.

Lt — Table Operations: provides multiple methods to manage Delta tables for continuous analytics readiness.

  • Ad — Auto Discovery: a built-in capability of the /Tables section that discovers Delta Lake tables (metadata) over OneLake shortcuts.
  • Lo — Load to Table: a UI option to convert CSV or Parquet files in /Files into a Delta Lake table under /Tables.
  • Op — Optimize: consolidates multiple small Parquet files into larger ones to mitigate the small-file problem. Enabled by default (see the maintenance sketch after this list).
  • Vo — V-Order: applies optimized sorting, encoding, and compression to Delta parquet files for faster reading. Enabled by default.
  • Vc — Vacuum: removes old Parquet files no longer referenced by the Delta table log. It does not clean up the log itself. Also available as a built-in UI option.
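
A minimal maintenance sketch, assuming a notebook with an attached lakehouse and a placeholder table named sales; the same operations can also be triggered from the Lakehouse explorer UI.

```python
# Sketch: Delta table maintenance from a notebook; "sales" is a placeholder table name.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Consolidate small Parquet files into larger ones.
spark.sql("OPTIMIZE sales")

# Remove files no longer referenced by the Delta log and older than the retention period.
spark.sql("VACUUM sales RETAIN 168 HOURS")
```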

Spark Compute

Spark Runtime
Spark Compute
Spark Admin
Environments

Ru — Spark Runtime: a Fabric Spark runtime is a bundle of internal and OSS components, including the OS, Apache Spark, and Delta Lake, default packages for Java/Scala, Python, and R, and default Spark configs.

Sc — Spark Compute: managed Apache Spark compute platform within Fabric data engineering and data science experiences.

  • Sp — Starter Pools: pre-warmed nodes with default Spark settings and libraries for fast session startup (5–10 seconds); fixed Medium node size.
  • Cp — Custom Pools: allow users to create tailored Spark compute configurations. Spark session startup takes 2–4 minutes. Single-node pools are supported.
  • Hc — High Concurrency: allows sharing the same Spark session across different notebooks, currently limited to a maximum of 5.
  • At — Autotune: tunes Apache Spark configurations to accelerate workload execution and optimize performance. Disabled by default.
  • Cy — Spark Concurrency: explains how Fabric capacity units map to Spark vCores, and how concurrency bursting and throttling are handled.
  • An — Job Admission: explains how Spark job admission works, based on the minimum core requirement (optimistic job admission).
  • Qu — Job Queueing: explains how Fabric Spark queueing works for background jobs when capacity-based Spark compute limits are reached.
  • Ic — Intelligent Cache: caches data to help speed up the execution of Spark jobs in Fabric Spark runtimes. Enabled by default.

Sa — Spark Admin: Spark administration options to manage and set up pools at different levels, including Spark billing and utilization.

  • Cs — Capacity Settings: an option to create custom pools at the capacity level and enable them in workspaces. It also allows restricting workspace-level pool creation.
  • Ws — Workspace Settings: workspace-level data engineering settings to configure the Spark pool, Spark runtime version, default environment, and high concurrency.
  • Bu — Billing and Utilization: compute utilization and reporting for Apache Spark operations, all classified as background.

En — Environments: a consolidated item encompassing all Spark compute settings, libraries, and properties.

  • Ec — Env. Spark Compute: provides options to select and configure Spark compute properties such as driver and executor memory.
  • Ep — Env. Spark Properties: provides an option to configure Spark properties. Some Spark properties in the Fabric Spark runtime are immutable; if you get the message AnalysisException: Can't modify the value of a Spark config: <config_name>, the property in question is immutable (see the session sketch after this list).
  • El — Env. Spark Libraries: provides options to install custom (.whl, .jar, .tar.gz) and public (PyPi, Conda, .yml) libraries.
  • Er — Env. Resources: provides a file system to manage files and folders, and make them available across notebooks.
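
As a rough sketch of how session-level Spark properties behave, assuming a Fabric notebook session; the property names below are standard Spark configs used only for illustration, not a statement about which properties Fabric marks as immutable.

```python
# Sketch: reading and setting Spark properties in a notebook session.
# spark.sql.shuffle.partitions and spark.executor.memory are generic Spark configs,
# used here only to illustrate mutable vs. non-modifiable behavior.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A property that can be changed at the session level.
print(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Attempting to change a non-runtime property raises an AnalysisException like
# "Can't modify the value of a Spark config: <config_name>".
try:
    spark.conf.set("spark.executor.memory", "8g")
except Exception as e:
    print(type(e).__name__, e)
```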

Spark Development

Notebooks
Spark Job Definition
Copilot
Data Wrangler
VS Code Integration

Nb — Notebooks: a primary code item for developing Spark jobs and ML experiments in Fabric. See some limitations.

Sj — Spark Job Definition: a primary code item for developing batch/streaming Spark jobs.
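
For context, a Spark job definition takes a main definition file. The following is a minimal sketch of such a batch script, assuming a default lakehouse is attached to the job and using placeholder table and column names.

```python
# Sketch: a main definition file for a Spark job definition.
# Assumes a default lakehouse is attached to the job; table/column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

if __name__ == "__main__":
    spark = SparkSession.builder.appName("daily_orders_aggregation").getOrCreate()

    # Read a managed Delta table from the attached lakehouse.
    orders = spark.read.table("orders")

    # Aggregate and write the result back as another managed Delta table.
    daily = orders.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))
    daily.write.format("delta").mode("overwrite").saveAsTable("orders_daily")

    spark.stop()
```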

Co — Copilot: an AI assistant to help co-develop, generate insights and create visualizations in Fabric.

  • Cd — DS & DE Copilot: an AI assistant that helps analyze and visualize data in notebooks.
  • As — AI Skill: generative AI for Q&A over your lakehouse data. The underlying LLMs can generate queries and respond to contextualized questions.

Wr — Data Wrangler: a built-in UI tool in Fabric notebooks for exploratory data analysis (EDA) and data preparation. Also in VS Code.
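
Data Wrangler emits reusable code for the cleaning steps you apply in its UI. The snippet below is only a sketch of that style of generated pandas code, with made-up column names; it is not actual Data Wrangler output.

```python
# Sketch: the style of pandas code Data Wrangler generates after applying steps in its UI.
# Column names are placeholders for illustration.
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop duplicate rows and rows missing a customer_id.
    df = df.drop_duplicates().dropna(subset=["customer_id"])
    # Normalize the country column to lowercase.
    df["country"] = df["country"].str.lower()
    return df

# Example usage on a small in-memory frame.
df = pd.DataFrame({"customer_id": [1, 1, None], "country": ["ES", "ES", "US"]})
print(clean_data(df.copy()))
```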

Vs — VS Code Integration: Synapse VS Code extension for exploring lakehouses and environments, and authoring notebooks and Spark jobs.

  • Dc — Dev Container: a Docker image containing all the required dependencies to run the Synapse VS Code extension.
  • Vl — VS Code Local: the Synapse VS Code extension running in your local VS Code for authoring notebooks and Spark jobs, including environment and lakehouse items.
  • Vw — VS Code Web: Synapse VS Code extension on https://vscode.dev.

Spark Collaterals

Fabric REST API
Logging & Monitoring
Migrations
Network Security
DevOps
Integrations

Ap — Fabric REST API: Microsoft Fabric REST APIs to automate different Fabric workflows.

  • Aa — Admin APIs: provide different API endpoints for Fabric administrative tasks.
  • Ac — Core APIs: provide information and functionality for all items within Fabric, across workloads.
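
A minimal sketch of calling the Core APIs from Python, assuming the azure-identity and requests packages are installed and the caller has the required permissions; check the official API reference for the exact endpoint and response shapes.

```python
# Sketch: listing workspaces via the Fabric Core REST APIs.
import requests
from azure.identity import InteractiveBrowserCredential

# Acquire a token for the Fabric API scope.
credential = InteractiveBrowserCredential()
token = credential.get_token("https://api.fabric.microsoft.com/.default").token

# Call the Core "List Workspaces" endpoint.
resp = requests.get(
    "https://api.fabric.microsoft.com/v1/workspaces",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
for ws in resp.json().get("value", []):
    print(ws["id"], ws["displayName"])
```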

Mo — Logging & Monitoring: to track Spark job execution, resource utilization, and performance metrics.

Mi — Migrations: guidance and tools for transferring Spark workloads, data pipelines, and configurations to Fabric Spark.

  • Mn — Migrate Notebooks: move existing notebooks to Fabric. Example: Azure Synapse notebooks.
  • Mj — Migrate Spark Jobs: move existing Spark jobs (.jar, .py, etc.) to Spark job definitions in Fabric. Example: Azure Synapse Spark job definitions.
  • Mc — Migrate Pools: move existing Spark pools and clusters to Fabric custom pools, including your custom libraries and Spark properties to an environment item. Example: Azure Synapse pools.
  • Mm — Migrate Metastore: move your metastore catalog, schemas, tables, partitions and functions to Fabric. Example: Azure Synapse Hive Metastore metadata to Fabric.
  • Mp — Migrate Pipelines: move pipelines, DAGs and workflows to Fabric data pipelines. Example: Azure Synapse pipelines.
  • Md — Migrate Data: move or make existing data available on Fabric. Example: ADLS Gen2.
  • Mx — Migrate Others: adjust additional settings, such as runtime versions, syntax differences, or SDK/CLI usage, to make your Fabric target state operational before/after moving items and data.

Se — Network Security: ensures the protection of data and Spark workloads by implementing secure communication channels, encryption, and access control policies.

  • Pl — Private Link (inbound): allows users to access Fabric privately from your VNet. Applies at the tenant level.
  • St — Service Tags (inbound): help define network access controls on NSGs, Azure Firewall, and UDRs.
  • Ca — Conditional Access (inbound): restricts Fabric access by Microsoft Entra Conditional Access policies.
  • Au — Allowlist URLs (inbound): a list of the Microsoft Fabric URLs required for interfacing with Fabric workloads.
  • Mv — Managed VNet (inbound/outbound): provides network isolation for Fabric Spark workloads. Used when creating managed private endpoints and Private Link.
  • Me — Managed Private Endpoints (outbound): enables accessing Azure PaaS services and Azure hosted customer-owned/partner services over a private endpoint in your VNet.
  • Wi — Workspace Identity (outbound): a managed identity that can be associated with a Fabric workspace.
  • Ta — Trusted Access (outbound): enables accessing firewall-enabled Azure PaaS services through managed identity whitelisting. Works with ADLS Gen2.

Do — DevOps: integrates continuous integration and continuous delivery (CI/CD) practices to streamline the development, testing, and deployment of Spark applications.

In — Integrations: integrations that connect Fabric data engineering with other data pipeline and transformation capabilities. This is not a complete list.

The Periodic Table of Fabric Data Engineering (Spark) summarizes key concepts within Microsoft Fabric. As Fabric Spark continues to evolve, this periodic table will be updated to reflect new features and advancements.


Aitor Murguzur

All things data. Principal PM @Microsoft Fabric CAT Spark. PhD in Comp Sci. All views are my own. https://www.linkedin.com/in/murggu/