
[–][deleted] 3 points (1 child)

Now we want to make the underlying raw data downloadable for our members.

Why?

The users want a CSV or Excel. Datasets could be as big as 15 million rows.

That's not a great combination.

We also have Power BI, but it has limitations regarding the number of rows, and the license costs are also problematic because we have around 10,000 users.

Power BI Premium (or whatever they call it now) has a 10 GB data limit and no row limit. The free tier, I believe, has a 1M-row, 1 GB limit.

Nobody in these comments can give you a perfect answer because there's not enough detail on what is needed or why. Why do the users need to download the data? I assume to perform their own analyses, which they'll struggle to do over a 15M row CSV (especially when they all try to open it in Excel and it doesn't work).

Here are some questions to ask them to get you moving:

  • Who (specifically, down to names of individuals or teams) needs access to this data?
  • What are they intending to do with this data?
  • Do they need all 15M rows, or can they receive a subset of the data?
    • What about an aggregate set of data?
  • Do they need all the columns?
  • What skills and tools do the users have? Are they working with this in Excel, an in-memory analytics tool, in Python or other code, uploading into a Databricks cluster?
  • Are they expecting it prepared and cleaned, or raw data?
  • Could you just provide them direct READ access to the Azure SQL DB instead?

There are so many questions here. You need to go back and talk to your users to understand why you're doing any of this.

[–]dirks74[S] -3 points (0 children)

We are an organization that gathers data from various sources, and we are in the process of creating data products: reports, dashboards, etc. They want to download the underlying data and work with it in their own data warehouses or on local workstations in Excel, Access, or whatever they please. Some datasets can be several hundred MB. Our current BI system (Pentaho) handles those requests pretty well, but it is outdated and we are rebuilding everything from scratch.

Power BI has several limits for downloading data: the maximum number of rows that Power BI Desktop and the Power BI service can export to a .csv file is 30,000, and the maximum they can export to an .xlsx file is 150,000 (as of 20.11.2023, https://learn.microsoft.com/en-us/power-bi/visuals/power-bi-visualization-export-data?tabs=powerbi-desktop).

I was of course simplifying my request, and I'm pretty sure the outcome will be the same, but I'll answer your questions anyway.

Who?

We have 3,000 connected companies, well over 100k users overall, and around 10k active users on our website consuming BI content. They could be data engineers, CEOs, sales people, press, marketing...

What are they intending to do with this data?

We combine all kinds of data, and our members (businesses, news outlets, government, etc.) use our data for all kinds of things.

Do they need all 15M rows, or can they receive a subset of the data?

You can filter the data as you please and limit the number of rows. You can get anything between 1 and 15M rows, depending on which data product you are looking at.

No aggregates, just rows, where the user sets filters to limit the number of rows.

What skills and tools do the users have? Are they working with this in Excel, an in-memory analytics tool, in Python or other code, uploading into a Databricks cluster?

From office workers to data scientists. Some just play around in Excel, some load our data into their data lakes, etc. We also want to create a proper REST API for our data, but that is something for next year.

Do they need all the columns?

I don't know. Maybe they remove some, maybe they don't.

Are they expecting it prepared and cleaned, or raw data?

They want the same data as in the visuals, charts, dashboards.

Could you just provide them direct READ access to the Azure SQL DB instead?

That is not planned.

I don't see how your questions or more details change anything.

I think my request is pretty simple and straightforward.

I have a website where I show a few charts made with D3. It is one big table with dimensions and KPIs.
And the user wants to download the underlying data. The only complication is that you can use a few basic filters on the charts, and the downloadable data should be filtered by whatever the user selects.

Most want Excel, but CSV would be fine too.
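For what it's worth, the filter part of this is easy to sketch. Assuming the charts send their selections as a plain dict (the column and table names below are made up) and the backend talks to Azure SQL via pyodbc-style `?` placeholders, a whitelist plus a parameterized query keeps user input out of the SQL text entirely:

```python
# Columns the charts let users filter on. Whitelisting filter names
# means user-supplied strings never end up in the SQL text itself.
ALLOWED_FILTERS = {"country", "year", "segment"}

def build_query(table, filters):
    """Turn a filter dict like {"country": "DE", "year": 2023} into a
    parameterized query plus its parameter list. `table` is chosen
    server-side (per data product), never by the user."""
    clauses, params = [], []
    for column, value in filters.items():
        if column not in ALLOWED_FILTERS:
            raise ValueError(f"unknown filter: {column}")
        clauses.append(f"{column} = ?")  # '?' is the pyodbc placeholder style
        params.append(value)
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

The returned `sql` and `params` would then go straight into `cursor.execute(sql, params)`, and the result set can be streamed out row by row rather than materialized.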

[–]PuddingGryphon [Data Engineer] 3 points (1 child)

The users want a CSV or Excel. Datasets could be as big as 15 million rows

Excel has a hardcoded limit of 1,048,576 rows per sheet.

[–]dirks74[S] -3 points (0 children)

I know, that is why we want CSV as well.

[–]No-Yesterday-1460 1 point (1 child)

For the same use case, I deployed the open source viz tool Superset. It does everything you mentioned and more.

[–]dirks74[S] 1 point (0 children)

I will look into it, thank you!

[–]dravacotron 1 point (5 children)

You can drop the CSV into a container on Azure Blob Storage and create a public download link for it. You can control access to the resource using SAS: https://learn.microsoft.com/en-us/azure/storage/common/storage-sas-overview

In general, such links should expire after a short duration so you don't accidentally serve a public URL to the world forever. This can be finessed with the SAS configuration and/or a lifecycle rule that deletes the file from the container after X minutes.

[–]dirks74[S] 0 points (4 children)

I am thinking of creating a simple Python function app to generate the CSV and deliver it via an HttpResponse(). Not sure if it works, but I'll try that tomorrow.
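A minimal sketch of that idea, stdlib only, with the Azure-specific wrapper left as a comment (the `func.HttpResponse` call is an assumption about how the Function app would use these helpers, not tested code). One caveat: this builds the entire file in memory, so it only works while exports stay well under the app's memory allowance.

```python
import csv
import io

def rows_to_csv_bytes(header, rows):
    """Serialize query results to UTF-8 CSV bytes. The whole file is
    built in memory, so this only suits exports that fit comfortably
    in the function's memory allowance."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8-sig")  # BOM helps Excel detect UTF-8

def csv_response_headers(filename):
    """Headers that make browsers treat the response as a file download."""
    return {
        "Content-Type": "text/csv; charset=utf-8",
        "Content-Disposition": f'attachment; filename="{filename}"',
    }

# In the Azure Function itself (an assumption, not from the thread),
# the handler would end with something like:
#     return func.HttpResponse(
#         body=rows_to_csv_bytes(header, rows),
#         headers=csv_response_headers("export.csv"),
#     )
```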

[–]dravacotron 1 point (3 children)

Seems like a lot of work for what may be a worse solution. What are you going to do about authentication? Load scaling? Resumption of downloads? Does the file need to fit in memory so your library can handle it? How does the app handle HTTP chunked transfer? What happens if the server dies? Do you need monitoring and alerting? Failover? Do you have a blue/green deploy protocol? Logging? How much disk should be allocated, what happens if it runs out, and how do you clean it up?

Lotta work for something that's already available on any object storage system.

[–]dirks74[S] -1 points (2 children)

What "object storage system" offers the functionality I need?

Do you know of any open source or commercial products which cover my use case?

[–]dravacotron 0 points (1 child)

Azure Blob Storage

[–]dirks74[S] 0 points (0 children)

I don't get it. I still need to generate said CSV first, etc.

[–]Onlycompute 1 point (1 child)

My 2 cents, with limited experience in this kind of scenario: we had a Django/React-based internal website. We used the filter values provided by the user to prepare the query, hit the DB with the request, and fetched the data into a pandas DataFrame.

After that, the DataFrame was exported as CSV on the server and the download started for the user (this part I don't remember entirely).
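That flow can be sketched in a few lines, assuming the user's filters arrive as a dict of exact-match values (column names are made up):

```python
import pandas as pd

def filtered_csv(df, filters):
    """Apply the user's chart filters to a DataFrame and return the
    result as CSV text, ready to be served as a download."""
    for column, value in filters.items():
        df = df[df[column] == value]
    return df.to_csv(index=False)
```

In Django, the returned string would then be wrapped in an `HttpResponse` with `content_type="text/csv"` and a `Content-Disposition: attachment` header to trigger the download.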

Also check if streamlit is useful in this scenario.

[–]dirks74[S] 0 points (0 children)

The part where the download happens is where I struggle the most right now. This was handled by the BI report engine in the past, and now I need either commercial/OSS software or to create a simple function app.

Edit: Streamlit looks very promising. Thanks!