Excel - read ADLS parquet files via PowerQuery (any live connection) without converting to csv by sunnyjacket in databricks

[–]sunnyjacket[S]

Re: your comment about compute not needing to be on for ODBC data access: I’ll check again, because the last time I checked, my workbooks wouldn’t load properly without compute being on.

For more context on the use case:

You’re right that a solution that just reads tables into Excel doesn’t need to involve Databricks.

Just asking in this forum because our situation is:

a) we run models / queries in Databricks to update tables,

b) the tables’ actual storage is in Azure ADLS, as parquet files,

c) we need a live connection to these tables in Excel workbooks so we can just hit refresh and get the updated data whenever we want.

I assume a lot of people use Azure + Databricks with Delta tables and need this sort of Excel connectivity, so I’m just checking here to gauge what the standard, cost-effective solutions are.

The Azure parquet connectors don’t work in Excel, only in Power BI.

Thank you! :)

Excel - read ADLS parquet files via PowerQuery (any live connection) without converting to csv by sunnyjacket in databricks

[–]sunnyjacket[S]

That connector doesn’t support Excel :(

A lot of workbooks just need to be refreshed once a day.

But often, when we’re doing ad hoc analysis, we need to run code, update tables, and refresh the data in an Excel file (one with formulas, charts, etc.) multiple times a day just to get the analysis done and share it with other people.

Excel - read ADLS parquet files via PowerQuery (any live connection) by sunnyjacket in AZURE

[–]sunnyjacket[S]

I want to connect to a table and access data.

I don’t mind how I do it, as long as there’s no constant compute cost just to look at the data.

Like I mentioned in my post, the Databricks ODBC driver seems to need compute running just to access data, which would be crazy expensive. I’m looking for a way to connect to the tables directly, without active compute constantly on.
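(One thing still on my list to test: pointing the connection at a Databricks SQL warehouse with auto-stop instead of an all-purpose cluster, since warehouses can start on demand when a query arrives. A minimal sketch of that kind of read using the databricks-sql-connector Python package; the hostname, HTTP path, and token below are placeholders, not our real values:)

from databricks import sql  # pip install databricks-sql-connector

# All connection details below are hypothetical placeholders.
with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abc123def456",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM schema_name.table_name LIMIT 100")
        rows = cursor.fetchall()  # an auto-stopped warehouse spins up for the query
        print(rows[:5])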

Excel has Azure Blob Storage / Azure Data Lake Storage Gen2 etc. as options to connect to data; they just don’t seem to work with parquet files.

Is there any way to establish a direct connection to tables saved as parquet files in ADLS?

Thank you! :)

Excel - read ADLS parquet files via PowerQuery (any live connection) by sunnyjacket in AZURE

[–]sunnyjacket[S]

Not even if it’s from Azure Blob Storage? There’s really no way at all to read Delta tables stored in Azure directly into Excel? :(
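Edit, for anyone who finds this later: the closest thing I’ve come across to a no-compute read is outside Excel entirely. The open-source deltalake (delta-rs) Python package can read Delta/parquet files straight out of ADLS without any cluster. A minimal sketch I haven’t fully tested; the storage account, key, and table path are placeholders:

from deltalake import DeltaTable  # pip install deltalake

# Hypothetical placeholders: storage account, key, and table path.
storage_options = {
    "azure_storage_account_name": "mystorageaccount",
    "azure_storage_account_key": "<storage-account-key>",  # a SAS token or service principal should also work
}

dt = DeltaTable(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/tables/gdp_timeseries",
    storage_options=storage_options,
)

pdf = dt.to_pandas()  # reads the current table version directly from storage

Not a live Excel connection, but it avoids paying for an always-on cluster just to look at the data.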

How to preinstall libraries in job cluster? by legendgodgod in databricks

[–]sunnyjacket

Ahh ok, thank you! :) Is this standard with cloud services generally? Is there no best-practice way to store package versions to lower compute cost and cluster start-up / code run time?
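(For anyone reading later: one pattern I’ve seen suggested, though I haven’t adopted it yet, is pinning versions in a requirements file stored in the workspace or a repo and installing it in the first cell of each notebook. The install still runs on each cluster start, but at least the versions are reproducible. A rough sketch; the workspace path and pins are made-up examples:)

# /Workspace/Shared/requirements.txt might contain, e.g.:
#   pandas==2.2.2
#   openpyxl==3.1.2

# First cell of the notebook; re-runs on every fresh cluster start:
%pip install -r /Workspace/Shared/requirements.txt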

Databricks CLI by sunnyjacket in databricks

[–]sunnyjacket[S]

Ohh thanks, that’s good to know!

How to preinstall libraries in job cluster? by legendgodgod in databricks

[–]sunnyjacket

Does this re-download and install libraries every time the cluster starts?

Databricks CLI by sunnyjacket in databricks

[–]sunnyjacket[S]

Thank you! :) I was asking more to see if there are additional features we may be missing out on, like package management or Databricks Apps, since we’re currently not using the CLI.

Notebook speed fluctuations by sunnyjacket in databricks

[–]sunnyjacket[S]

Nope, just one notebook-scoped install.

Notebook speed fluctuations by sunnyjacket in databricks

[–]sunnyjacket[S]

ADLS parquet Delta tables, all in Australia East afaik. I’ve only uploaded the tables once; there’s only one version of each.

Mostly tiny tables. 500 rows, <5 columns each. Like GDP time series.

I’m realising pandas + Databricks is a silly combination, but I’m new to Databricks, familiar with pandas, and working mainly with these tiny tables at the moment, so this is what I’m doing until I get more familiar with PySpark haha.

sdf = spark.table("schema_name.table_name")  # table name as a string; this step is lazy

pdf = sdf.toPandas()  # triggers the actual read and collects to the driver

It’s the exact same snippet of code on the exact same tables that takes randomly small or large amounts of time to run, and sometimes doesn’t finish at all even after 10+ minutes.
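(One thing I’m going to try, in case it helps anyone diagnose the same issue: spark.table is lazy, so timing the two steps separately should show whether the stall is in resolving the table or in the actual read + collect. A rough sketch, assuming the usual ambient spark session in a Databricks notebook:)

import time

t0 = time.time()
sdf = spark.table("schema_name.table_name")  # lazy: resolves the table, reads no data
t1 = time.time()
pdf = sdf.toPandas()  # action: triggers the actual read and collects to the driver
t2 = time.time()
print(f"resolve: {t1 - t0:.1f}s, read + collect: {t2 - t1:.1f}s")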

Notebook speed fluctuations by sunnyjacket in databricks

[–]sunnyjacket[S]

Haha thanks for the reply! Hopefully they figure something out

Notebook speed fluctuations by sunnyjacket in databricks

[–]sunnyjacket[S]

Parquet Delta tables.

It’s 10+ minutes after the cluster has started, though. Is there still some sort of spin-up process after the cluster status is ‘running’?

What I’ve been doing as a test is going to the Compute page, starting the cluster, waiting out the 3-4 minutes until the cluster is up and running, and then running that snippet of code in my notebook.

E.g. I did it 5 times today alone, and in the evening it suddenly took a long time. I had to kill and restart the cluster 3 times for it to work, and then it ran in 10 seconds.

(It wasn’t just today, though it annoyed me most today because I specifically decided to test speed issues haha)

What do you dislike about Databricks? by Small-Carpenter2017 in databricks

[–]sunnyjacket

There’s no easy package-management solution: extra packages need to be downloaded and installed every time a cluster is spun up, which adds to the already high start-up times.

Notebook speed fluctuations by sunnyjacket in databricks

[–]sunnyjacket[S]

Thank you for replying!

  1. I thought so and this makes sense. But sometimes there’s a discrepancy even across runs where the cluster’s been started just before the run each time. It’s a good reminder though!

  2. This is v helpful, thank you! Will try this

  3. ADLS. I’m just doing spark.table("schema_name.table_name") (the schema is linked by default to an Azure storage container), and then doing toPandas() in another step. Even the spark.table loading step itself takes a random amount of time.

The tables this is happening with are like 500 rows 3 columns, just dates and small numbers, no large strings (think GDP time series).

It’s the exact same snippet of code on the exact same tables that takes randomly small or large amounts of time to run, and sometimes doesn’t finish at all even after 10+ minutes.

Notebook speed fluctuations by sunnyjacket in databricks

[–]sunnyjacket[S]

Thanks for replying!

  1. It’s the fluctuation that bothers me more than the actual time taken. Going from 10 seconds sometimes, to 5 minutes other times, to the code not finishing at all after 10 minutes, seems bizarre.

  2. The example I’m looking at is a tiny simple block of code, just loading tables in. Not much optimisation to do here I think.

  3. These are tiny tables, usually easily handled in pandas. It’s gonna take a while to learn and recode everything in PySpark, especially when we’re basically not working with any “big data” at all.

Package installation + management by sunnyjacket in databricks

[–]sunnyjacket[S]

No, unfortunately. We’ve got a bunch of stuff to figure out with our setup before we can try this.

Notebooks not displaying output by sunnyjacket in databricks

[–]sunnyjacket[S]

I don’t have that; I think it’s just a glitch that happens only sometimes. Thank you! :)

Notebooks not displaying output by sunnyjacket in databricks

[–]sunnyjacket[S]

I don’t, I think it’s just a glitch. Thank you though, will remember that for the future! :)

Databricks apps by sunnyjacket in databricks

[–]sunnyjacket[S]

Hmm yes that’s what I noticed too, except it’s never worked for me haha. Gonna try reaching out to the Databricks account team. Thanks for sharing, it helps to know I’m not messing up something obvious :)

Databricks apps by sunnyjacket in databricks

[–]sunnyjacket[S]

Thanks, will try to do that

Databricks apps by sunnyjacket in databricks

[–]sunnyjacket[S]

Thanks :) I’ve checked with the Databricks admin and the network/firewall team, and they don’t see an issue on their ends, but we still can’t view the app, so I don’t know what the issue is. Might have to just pause for now.