Microsoft Idea

Synapse

Needs Votes

Enable full access to Warehouse tables from Spark


Matthias Wong on 10 Jun 2023 03:03:04

Fabric is a game changer because it brings the traditional warehouse approach and Spark approach together under OneLake. This means our data scientists can get access to all the warehouse tables we built over the years.


While this is a major step, the interface for the Data science persona is still a bit awkward: it does not mirror the experience of the Data warehouse persona. In particular, using the SQL endpoint we can seamlessly run cross-database T-SQL queries across Lakehouse and Warehouse tables, e.g. "select * from Lakehouse.dbo.Table union select * from Warehouse.dbo.Table". The same cannot be said for a Spark notebook, where this query does not make sense. In other words, Warehouse tables are invisible to Spark and must be brought in with a shortcut or by manually specifying the underlying ABFSS path.
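As a sketch of the manual workaround mentioned above, a Spark notebook can read a Warehouse table by pointing at its OneLake storage path directly. All names here (workspace, warehouse, schema, table) are illustrative, and the exact item-folder naming may vary:

```python
# Sketch of the manual ABFSS workaround. The helper only builds the path
# string; the commented-out spark.read call shows how it would be used
# inside a Fabric Spark notebook. All names are illustrative.

def warehouse_table_path(workspace: str, warehouse: str,
                         schema: str, table: str) -> str:
    """Construct the ABFSS path under which a Fabric Warehouse table
    is stored in OneLake (assumed layout: <item>.Datawarehouse/Tables/...)."""
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{warehouse}.Datawarehouse/Tables/{schema}/{table}"
    )

path = warehouse_table_path("MyWorkspace", "MyWarehouse", "dbo", "Sales")
print(path)

# In a Fabric Spark notebook you would then load the Delta table with:
# df = spark.read.format("delta").load(path)
```

This is exactly the friction the idea is about: the path must be assembled by hand instead of the table simply being visible to Spark SQL.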


To enable seamless cross-database queries from a Spark session, the namespaces may have to be aligned. Currently, in the SQL endpoint, a query against a lakehouse table is "select * from Lakehouse.dbo.Table", while the Spark SQL equivalent is "select * from Lakehouse.Table". It would even be worth moving Spark SQL from "select * from Lakehouse.Table" to "select * from Lakehouse.dbo.Table". This would mean that, whether I am in the SQL endpoint or a Spark notebook, I could still write "select * from Lakehouse.dbo.Table union select * from Warehouse.dbo.Table".
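To illustrate the proposed alignment, a small helper (purely hypothetical, not part of any product API) could rewrite today's two-part Spark SQL name into the three-part form the SQL endpoint already uses:

```python
# Hypothetical illustration of the namespace alignment: insert a default
# schema ("dbo") into a two-part Spark SQL table name, leaving names that
# already carry a schema untouched.

def to_three_part(name: str, default_schema: str = "dbo") -> str:
    """'Lakehouse.Table' -> 'Lakehouse.dbo.Table';
    'Warehouse.dbo.Table' is returned unchanged."""
    parts = name.split(".")
    if len(parts) == 2:
        database, table = parts
        return f"{database}.{default_schema}.{table}"
    return name

print(to_three_part("Lakehouse.Table"))      # Lakehouse.dbo.Table
print(to_three_part("Warehouse.dbo.Table"))  # Warehouse.dbo.Table
```

With both engines resolving the same three-part names, the union query in the example would parse identically in T-SQL and Spark SQL.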


However we architect it, the result would be amazing if data scientists could immediately query all the Warehouse tables in the workspace without setting up shortcuts. Shortcuts could then be reserved for the more complicated scenarios, such as external sources or other workspaces.

Administrator on 30 Aug 2023 17:49:21

There are two distinct requests:

  1. Include schemas in the lakehouse namespace
  2. Query Warehouse tables from Notebooks using Spark

The first one is something that we're working on and are committed to delivering.

The second would require Spark to access and use the Warehouse metastore instead of its own metastore. That is not planned. The community is encouraged to continue voting on the second request, and the product team will regularly review these ideas during planning.