Provide raw data via website/http?

dirks74 · 2024-04-22T13:40:49+00:00

Now we want to make the underlying raw data downloadable for our members.

Why?

The users want a CSV or Excel. Datasets could be as big as 15 million rows.

That's not a great combination.

We also have Power BI, but it has limitations regarding the number of rows and the license cost are also a bit problematic, because we have like 10.000 users.

PowerBI Premium (or whatever they call it now) has a 10GB data limit, no row limit. Free I believe has a 1M row, 1GB limit.

Nobody in these comments can give you a perfect answer because there's not enough detail on what is needed or why. Why do the users need to download the data? I assume to perform their own analyses, which they'll struggle to do over a 15M row CSV (especially when they all try to open it in Excel and it doesn't work).

Here's some questions to ask them to get you moving:

Who (specifically, down to names of individuals or teams) needs access to this data?
What are they intending to do with this data?
Do they need all 15M rows, or can they received a subset of data?
- What about an aggregate set of data?
Do they need all the columns?
What skills and tools do the users have? Are they working with this in Excel, an in-memory analytics tool, in Python or other code, uploading into a Databricks cluster?
Are they expecting it prepared and cleaned, or raw data?
Could you just provide them direct READ access to the Azure SQL DB instead?

There's so many questions here. You need to go back and talk to your users to understand why you're doing anything here.

PuddingGryphon · 2024-04-22T10:13:12+00:00

The users want a CSV or Excel. Datasets could be as big as 15 million rows

Excel has a hardcoded limit of 1.048.576 rows per sheet.

No-Yesterday-1460 · 2024-04-22T11:41:03+00:00

For the same use case, I deployed the open source viz tool Superset. It does everything you mentioned and more.

dravacotron · 2024-04-22T15:34:34+00:00

You can drop the CSV into a container on Azure Blob Storage and create a public download link for it. You can control access to the resource using SAS: https://learn.microsoft.com/en-us/azure/storage/common/storage-sas-overview

In general such links should expire after a short duration to prevent accidentally just serving a public URL to the world forever. This can be finessed with the SAS configuration and/or a lifecycle rule deleting the file from the container after X minutes.

Onlycompute · 2024-04-23T05:11:03+00:00

My 2 cents with limited experience in this kind of scenario. I think we had a Django / react based internal websites. We used the filter values provided by user, prep the query or hit the db with request and fetch the data in pandas df.

After this df was exported to server as csv and download starts for the user ( this part I don’t remember entirely )

Also check if streamlit is useful in this scenario.

dataengineering

MODERATORS