The Periodic Table of Fabric Data Engineering (Spark)
The Periodic Table of Fabric Data Engineering (Spark) outlines core concepts for data engineering and Spark in Microsoft Fabric.
! READ THIS: This blog is outdated after the FabCon Europe 2024 announcements. I will update it soon.
∟ Lakehouse
∟ Spark Compute
∟ Spark Development
∟ Spark Collaterals
> Download the pdf version here.
Lakehouse (Data)
∟ Lakehouse
∟ Lakehouse Ingestion
∟ Lakehouse Access
∟ Table Operations
Lh — Lakehouse: a lakehouse is the filesystem within a workspace that stores various data/file types for analytics, which can be used by different engines.
- Le — Lakehouse Explorer: built-in UI (/Tables and /Files sections) for managing tables and files.
- Tb — Tables Section: the managed area of the lake. All tables, both Spark managed and unmanaged, are registered here.
- Fl — Files Section: the unmanaged area of the lake. You can store any type of data/file here and organize it in folders and subfolders.
- Se — SQL Endpoint: a read-only T-SQL endpoint over the lakehouse's managed Delta tables. You can query them with T-SQL, create views, or apply security on those tables.
- Ls — Lakehouse Schemas: supports the `lakehouse.schema.table` construct. You can find a default schema named `dbo` under /Tables (see the sketch after this list).
- Tf — Table Formats: Fabric engines adopt Delta Lake as the default table format, but /Tables supports other data formats (e.g. parquet). Iceberg/Hudi interoperability will be supported by Apache XTable.
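For illustration, a minimal sketch of the schema construct from a notebook, assuming a schema-enabled lakehouse is attached as the default lakehouse; the lakehouse, schema, and table names below are made up:

```python
# Assumes a Fabric notebook where `spark` is the pre-created session and a
# schema-enabled lakehouse (illustratively named "my_lakehouse") is the default.
df = spark.createDataFrame(
    [(1, "2024-06-01", 42.0)], ["order_id", "order_date", "amount"]
)

spark.sql("CREATE SCHEMA IF NOT EXISTS sales")            # appears under /Tables
df.write.mode("overwrite").saveAsTable("sales.orders")    # managed Delta table in the sales schema

# Reading back with the full lakehouse.schema.table construct
spark.sql("SELECT COUNT(*) AS cnt FROM my_lakehouse.sales.orders").show()
```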
Li — Lakehouse Ingestion: there are different ways to load data into a lakehouse in Fabric.
- Du — Direct Upload: allows uploading data stored on your local machine using the Lakehouse explorer UI.
- Pi — Data Pipelines: the Copy activity in Fabric data pipelines allows ingesting data into a lakehouse.
- Df — Dataflows Gen2: similarly, you can use lakehouse as a destination option in your dataflows to push data to a lakehouse.
- Sh — Shortcuts: a pointer (symbolic link) to data in an internal location (another lakehouse) or an external object store.
- Cn — Spark Connectors: different Spark connectors (e.g. Azure SQL DB) can be used, for example, via a notebook to load data into a lakehouse.
- Ss — Spark Structured Streaming: stream data from different sources using `readStream` and write to a lakehouse (see the sketch after this list).
- Ev — Eventstream: you can add a lakehouse destination to an eventstream and ingest data there.
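As a hedged illustration of the structured streaming route, a minimal sketch that streams JSON files landing under /Files into a managed Delta table; the paths, schema, and table name are illustrative:

```python
# Assumes a Fabric notebook with a default lakehouse attached; `spark` is the
# pre-created session. Paths, schema, and table name are illustrative.
stream_df = (
    spark.readStream
        .format("json")
        .schema("device_id STRING, reading DOUBLE, ts TIMESTAMP")
        .load("Files/landing/telemetry")
)

query = (
    stream_df.writeStream
        .format("delta")
        .option("checkpointLocation", "Files/checkpoints/telemetry")
        .outputMode("append")
        .toTable("telemetry_bronze")   # registered as a managed table under /Tables
)
```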
La — Lakehouse Access: offers various methods for configuring permissions and ensuring data accessibility within the lakehouse.
- Ro — Workspace Roles: workspace roles (admin, member, contributor, viewer) define what users can do with Fabric items.
- Lx — Lakehouse Sharing: allows granting users or groups access to a lakehouse without giving access to a workspace.
- Or — OneLake RBAC: manages RBAC permissions under /Files (e.g. folders) and /Tables sections of a lakehouse.
- Gq — API for GraphQL: enables creating an API on top of lakehouse SQL analytics endpoint tables.
Lt — Table Operations: provides multiple methods to manage Delta tables for continuous analytics readiness.
- Ad — Auto Discovery: a built-in capability of the /Tables section that discovers Delta Lake tables (metadata) over OneLake shortcuts.
- Lo — Load to Table: a UI option to convert csv or parquet files in /Files into a Delta Lake table under /Tables.
- Op — Optimize: consolidates multiple small parquet files into larger files to mitigate the small-file problem. Enabled by default (see the maintenance sketch after this list).
- Vo — V-Order: applies optimized sorting, encoding, and compression to Delta parquet files for faster reading. Enabled by default.
- Vc — Vacuum: removes old parquet files no longer referenced by a Delta table log. It does not clean up the Delta log. Built-in UI option.
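For reference, a minimal maintenance sketch from a notebook against an illustrative managed table named sales_orders; the same Optimize and Vacuum operations are also exposed as UI options in the Lakehouse explorer:

```python
# Routine Delta table maintenance; `spark` is the pre-created session and
# "sales_orders" is an illustrative managed table.
from delta.tables import DeltaTable

spark.sql("OPTIMIZE sales_orders")                  # compact small parquet files
spark.sql("VACUUM sales_orders RETAIN 168 HOURS")   # drop unreferenced files older than 7 days

# Equivalent calls through the Delta Lake Python API
tbl = DeltaTable.forName(spark, "sales_orders")
tbl.optimize().executeCompaction()
tbl.vacuum(168)
```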
Spark Compute
∟ Spark Runtime
∟ Spark Compute
∟ Spark Admin
∟ Environments
Ru — Spark Runtime: a Fabric Spark runtime is a bundle of internal and OSS components, including Apache Spark, Delta Lake, the operating system, default packages for Java/Scala, Python and R, and default Spark configs.
- Rv — Runtime Version: semantic version numbering aligned with the release cadence and lifecycle. Example: Fabric Spark runtime 1.3 (see the inspection sketch after this list).
- Rl — Default Libraries: a list of default packages for Java/Scala, Python and R included in each runtime version. Example: default libraries in Fabric Spark runtime 1.3.
- Rp — Default Properties: default Spark properties that come with each runtime. Some of the properties are immutable.
- Ne — Native Engine: a Velox and Apache Gluten-based native execution engine for Fabric Spark, designed to offload processing from the JVM.
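A quick way to check what the attached runtime provides from a notebook, where `spark` is the session pre-created by Fabric; the printed versions depend on the runtime selected in the workspace settings:

```python
# Inspect the runtime bundle from a running notebook session.
import sys

print(spark.version)                      # Apache Spark version shipped in the runtime
print(sys.version)                        # Python version shipped in the runtime
print(spark.conf.get("spark.app.name"))   # example of reading a default session property
```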
Sc — Spark Compute: managed Apache Spark compute platform within Fabric data engineering and data science experiences.
- Sp — Starter Pools: pre-warmed nodes with default Spark settings and libraries for fast session startup (5–10 seconds); node size is fixed at Medium.
- Cp — Custom Pools: allow users to create tailored Spark compute environments. Spark sessions start in 2–4 minutes. Single-node pools are supported.
- Hc — High Concurrency: allows sharing the same Spark session across different notebooks; currently a maximum of five notebooks can share one session.
- At — Autotune: tunes Apache Spark configurations to accelerate workload execution and optimize performance. Disabled by default (see the sketch after this list).
- Cy — Spark Concurrency: explains how Fabric capacity units map to Spark VCores, and how concurrency, bursting, and throttling are handled.
- An — Job Admission: explains how Spark job admission works, based on the minimum core requirement (optimistic admission).
- Qu — Job Queueing: explains how Fabric Spark queueing works for background jobs when capacity-based Spark compute limits are reached.
- Ic — Intelligent Cache: caches data to help speed up the execution of Spark jobs in Fabric Spark runtimes. Enabled by default.
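As a hedged illustration for Autotune, a minimal sketch that enables it for the current session; the property name used below is an assumption, so confirm the exact name in the Autotune documentation for your runtime:

```python
# Enable Autotune for the current session (it is disabled by default).
# NOTE: the property name is an assumption; verify it against the Fabric
# Autotune documentation for your runtime version.
spark.conf.set("spark.ms.autotune.enabled", "true")
print(spark.conf.get("spark.ms.autotune.enabled"))
```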
Sa — Spark Admin: Spark administration options to manage and set up pools at different levels, including Spark billing and utilization.
- Cs — Capacity Settings: option to create custom pools at the capacity level and enable them in workspaces. It also allows restricting workspace-level pool creation.
- Ws — Workspace Settings: workspace-level data engineering settings to configure the Spark pool, Spark runtime version, default environment, and high concurrency.
- Bu — Billing and Utilization: compute utilization and reporting for Apache Spark operations, all of which are classified as background operations.
En — Environments: a consolidated item encompassing all Spark compute settings, libraries, and properties.
- Ec — Env. Spark Compute: provides options to select and configure Spark compute properties such as driver and executor memory.
- Ep — Env. Spark Properties: provides an option to configure Spark properties. Some Spark properties in the Fabric Spark runtime are immutable; if you get the message `AnalysisException: Can't modify the value of a Spark config: <config_name>`, the property in question is immutable (see the sketch after this list).
- El — Env. Spark Libraries: provides options to install custom (.whl, .jar, .tar.gz) and public (PyPI, Conda, .yml) libraries.
- Er — Env. Resources: provides a file system to manage files and folders, and make them available across notebooks.
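A minimal sketch of how the immutable-property error surfaces at the session level; spark.executor.memory is used here only as an example of a property that a running session typically rejects:

```python
# Probe property mutability in a running session; `spark` is the pre-created session.
from pyspark.sql.utils import AnalysisException

spark.conf.set("spark.sql.shuffle.partitions", "64")   # a runtime-mutable SQL property

try:
    # Executor memory cannot be changed on a live session, so Spark rejects it.
    spark.conf.set("spark.executor.memory", "8g")
except AnalysisException as err:
    print(f"Immutable in this session: {err}")
```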
Spark Development
∟ Notebooks
∟ Spark Job Definition
∟ Copilot
∟ Data Wrangler
∟ VS Code Integration
Nb — Notebooks: a primary code item for developing Spark jobs and ML experiments in Fabric. See some limitations.
- Ns — Notebook Scheduling: allows scheduling a notebook to run every X minutes/hours/days/weeks.
- Nu — Notebook Utils: aka mssparkutils, a built-in package to perform common tasks in a notebook. Run `notebookutils.help()` to list what is available (see the sketch after this list).
- Nr — Notebook Resources: notebook-context file system to manage folders and files. Environment resources appear here.
- No — Notebook Operations: you can import, export, save, collaborate and share, use magic commands, parameter cells, ipython widgets, visualize data, and perform various operations to develop notebooks in different languages.
- Na — Notebook Activity: run notebooks from Fabric data pipelines.
- Sl — Semantic Link: semantic-link and semantic-link-labs provide built-in functions to interact with semantic models and other Fabric items from notebooks. Complementary to notebookutils.
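A minimal sketch of notebookutils in a Fabric notebook with a default lakehouse attached; the folder and notebook names, parameters, and timeout are illustrative:

```python
# notebookutils (aka mssparkutils) is available by default in Fabric notebooks.
notebookutils.help()                            # list the available utility modules

# Browse the lakehouse /Files area (path is illustrative)
for f in notebookutils.fs.ls("Files/landing"):
    print(f.name, f.size)

# Run another notebook with parameters and a 90-second timeout (names are illustrative)
exit_value = notebookutils.notebook.run("nb_transform", 90, {"run_date": "2024-06-01"})
print(exit_value)
```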
Sj — Spark Job Definition: a primary code item for developing batch/streaming Spark jobs (see the sketch after this list).
- Js — Spark Job Definition Scheduling: allows scheduling a Spark job definition to run every X minutes/hours/days/weeks.
- Jr — Spark Job Definition Retries: set up a retry policy to automatically restart the job when issues occur.
- Ja — Spark Job Definition Activity: orchestrate and execute Spark job definitions from Fabric data pipelines.
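For illustration, a minimal main definition file for a Spark job definition; unlike notebooks, the job creates its own SparkSession. It assumes a lakehouse is referenced by the job so the relative Files/ path resolves, and all names are made up:

```python
# Main definition file for a Spark job definition; names and paths are illustrative.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("nightly_orders_load").getOrCreate()

    df = (
        spark.read
            .option("header", "true")
            .csv("Files/landing/orders")     # resolves against the referenced lakehouse
    )
    df.write.mode("append").saveAsTable("orders_bronze")

    spark.stop()
```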
Co — Copilot: an AI assistant to help co-develop, generate insights and create visualizations in Fabric.
- Cd — DS & DE Copilot: an AI assistant that helps analyze and visualize data in notebooks.
- As — AI Skill: generative AI for Q&A over your lakehouse data. The underlying LLMs generate queries and answer questions in context.
Wr — Data Wrangler: a built-in UI tool in Fabric notebooks for exploratory data analysis (EDA) and data preparation. Also in VS Code.
- Wd — Wrangler Dataframes: data preparation with Data Wrangler on Pandas and Spark dataframes.
- Wo — Wrangler Operations: help with EDA and data preparation by applying different operations. You can export the results as reusable code (see the sketch after this list).
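A hedged sketch of the kind of reusable function Data Wrangler can export after interactive cleaning; the column names, file path, and operations are illustrative, and the generated code will differ:

```python
# Pandas-style cleanup of the kind Data Wrangler exports. The default lakehouse
# is mounted at /lakehouse/default in Fabric notebooks; path and columns are illustrative.
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()
    df = df.dropna(subset=["order_id"])
    df["order_date"] = pd.to_datetime(df["order_date"])
    return df

raw = pd.read_csv("/lakehouse/default/Files/landing/orders.csv")
df_clean = clean_data(raw)
```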
Vs — VS Code Integration: Synapse VS Code extension for exploring lakehouses and environments, and authoring notebooks and Spark jobs.
- Dc — Dev Container: a Docker image containing all the required dependencies to run the Synapse VS Code extension.
- Vl — VS Code Local: Synapse VS Code extension on your local VS Code for authoring notebooks and Spark jobs, including environment and lakehouse support.
- Vw — VS Code Web: Synapse VS Code extension on https://vscode.dev.
Spark Collaterals
∟ Fabric REST API
∟ Logging & Monitoring
∟ Migrations
∟ Network Security
∟ DevOps
∟ Integrations
Ap — Fabric REST API: Microsoft Fabric REST APIs to automate different Fabric workflows.
- Aa — Admin APIs: provide different API endpoints for Fabric administrative tasks.
- Ac — Core APIs: provide information and functionality for all items within Fabric, across workloads (see the sketch after this list).
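A minimal sketch of calling a Core API from a notebook, assuming the calling identity has the required permissions; the token audience shorthand passed to notebookutils is an assumption, so verify it against the Fabric REST API authentication docs:

```python
# List the workspaces visible to the caller via the Fabric Core API.
# The "pbi" audience shorthand for getToken is an assumption; check the docs.
import requests

token = notebookutils.credentials.getToken("pbi")
resp = requests.get(
    "https://api.fabric.microsoft.com/v1/workspaces",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

for ws in resp.json().get("value", []):
    print(ws["id"], ws["displayName"])
```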
Mo — Logging & Monitoring: to track Spark job execution, resource utilization, and performance metrics.
- Mh — Monitoring Hub: a Fabric-scoped, built-in UI that serves as a centralized portal for browsing Spark activities across items.
- Rr — Spark Applications: item-scoped built-in UI to see recent runs and Spark application details such as jobs, resources, logs and diagnostics.
- Im — Inline Monitoring: both notebooks and Spark job definitions provide some inline monitoring while running the job.
- Ui — Spark UI: application-scoped Spark UI and Spark History Server to debug and diagnose each Spark application.
- Rs — Run Series: categorizes your Spark applications based on recurring pipeline activities, manual notebook runs, or Spark job runs from the same notebook or Spark job definition. Includes anomaly detection.
- Av — Spark Advisor: shows errors, warnings, and performance suggestions in notebooks. Enabled by default.
- Co — Spark Consumption: Spark usage reporting as part of the Fabric Capacity Metrics app.
Mi — Migrations: guidance and tools for transferring Spark workloads, data pipelines, and configurations to Fabric Spark.
- Mn — Migrate Notebooks: move existing notebooks to Fabric. Example: Azure Synapse notebooks.
- Mj — Migrate Spark Jobs: move existing Spark jobs (.jar, .py, etc.) to Spark job definitions in Fabric. Example: Azure Synapse Spark job definitions.
- Mc — Migrate Pools: move existing Spark pools and clusters to Fabric custom pools, including your custom libraries and Spark properties to an environment item. Example: Azure Synapse pools.
- Mm — Migrate Metastore: move your metastore catalog, schemas, tables, partitions and functions to Fabric. Example: Azure Synapse Hive Metastore metadata to Fabric.
- Mp — Migrate Pipelines: move pipelines, DAGs and workflows to Fabric data pipelines. Example: Azure Synapse pipelines.
- Md — Migrate Data: move or make existing data available on Fabric. Example: ADLS Gen2.
- Mx — Migrate Others: adjust additional settings, such as runtime versions, syntax differences, or SDK/CLI usage, to make your Fabric target state operational before and after moving items and data.
Se — Network Security: ensures the protection of data and Spark workloads by implementing secure communication channels, encryption, and access control policies.
- Pl — Private Link (inbound): allows access to Fabric privately from your VNet. Tenant-level setting.
- St — Service Tags (inbound): help define network access controls on NSGs, Azure Firewall, and UDRs.
- Ca — Conditional Access (inbound): restricts Fabric access by Microsoft Entra Conditional Access policies.
- Au — Allowlist URLs (inbound): a list of the Microsoft Fabric URLs required for interfacing with Fabric workloads.
- Mv — Managed VNet (inbound/outbound): provides network isolation for Fabric Spark workloads. Used when creating managed private endpoints and Private Link.
- Me — Managed Private Endpoints (outbound): enables accessing Azure PaaS services and Azure hosted customer-owned/partner services over a private endpoint in your VNet.
- Wi — Workspace Identity (outbound): a managed identity that can be associated with a Fabric workspace.
- Ta — Trusted Access (outbound): enables accessing firewall-enabled Azure PaaS services through managed identity whitelisting. Works with ADLS Gen2.
Do — DevOps: integrates continuous integration and continuous delivery (CI/CD) practices to streamline the development, testing, and deployment of Spark applications.
- Gi — Git Integration: built-in Git integration on Fabric, supporting Spark Job definition, notebook, lakehouse and environment.
- Dp — Deployment Pipelines: built-in option to deploy items across different environments, supporting notebook, lakehouse and environment.
In — Integrations: data pipeline and transformation tools that integrate with Fabric data engineering. This is not a complete list.
- Is — Databricks: Fabric and Databricks can be integrated in different ways such as UC scenarios, non-UC scenarios and lakehouse scenarios.
- It — dbt: `dbt-fabricspark` enables dbt to work with Fabric Spark.
- If — Azure Data Factory: ADF offers a lakehouse activity to copy data into a lakehouse.
- If — Fivetran: similar to ADF, Fivetran offers a lakehouse destination.
- Iw — Warehouse Connector: enables accessing warehouse data from Spark (read-only for now). It honors object-level security and RLS/CLS of a warehouse (see the sketch after this list).
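A hedged sketch of reading warehouse data from Spark, assuming the built-in connector exposes synapsesql on the DataFrame reader as documented for recent Fabric runtimes; the warehouse, schema, and table names are illustrative:

```python
# Read a warehouse table from a Fabric notebook with the built-in warehouse connector.
# The synapsesql reader and the three-part naming follow the documented pattern for
# recent Fabric runtimes; verify against your runtime. Names are illustrative.
df = spark.read.synapsesql("my_warehouse.dbo.dim_customer")
df.show(5)
```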
The Periodic Table of Fabric Data Engineering (Spark) summarizes key concepts within Microsoft Fabric. As Fabric Spark continues to evolve, this periodic table will be updated to reflect new features and advancements.