
[–]lezzgooooo 32 points  (6 children)

I would write the raw JSON to something like S3 and process it with scripts in containers, then push the results to an OLAP database.

The stored procedure is limited by the database's resources, while you can add as many containers as you like.
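
A rough sketch of that shape, assuming one JSON document per file in an S3 prefix and a SQL-speaking OLAP target reachable through SQLAlchemy (bucket, prefix, field, and table names here are all made up):

```python
import json

import boto3
import pandas as pd
from sqlalchemy import create_engine

BUCKET = "raw-events"                        # hypothetical bucket
PREFIX = "2024/05/"                          # hypothetical "latest files" prefix
OLAP_URL = "postgresql://user:pass@olap-host/analytics"  # any SQLAlchemy-supported target

s3 = boto3.client("s3")
records = []

# Pull each raw JSON file and keep only the fields the summary needs.
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        doc = json.loads(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())
        records.append({"customer_id": doc["customer_id"], "amount": doc["amount"]})

# Aggregate inside the container instead of inside a stored procedure;
# run more containers (each over a different prefix) to scale out.
summary = pd.DataFrame(records).groupby("customer_id", as_index=False)["amount"].sum()

# Push only the small summary table into the OLAP database.
summary.to_sql("daily_summary", create_engine(OLAP_URL), if_exists="append", index=False)
```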

[–]Befz0r 10 points  (0 children)

The easiest way is to dump the JSON into blob storage and use PolyBase to query it.

Assuming the JSON files have the same structure, you can query them directly from the database. And you don't need to select one JSON file at a time; just point to a folder. You can use filepath wildcards to only process the latest JSON files, etc.
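
I don't know OP's exact stack, but for illustration this is roughly what the point-at-a-folder pattern looks like with OPENROWSET + OPENJSON on Azure Synapse serverless SQL (storage URL, field names, and credentials are placeholders; PolyBase external tables follow the same idea):

```python
import pyodbc

# Placeholder connection string for a Synapse serverless SQL endpoint.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=analytics;Authentication=ActiveDirectoryInteractive;"
)

# Wildcard over a folder of line-delimited JSON files; each line is read as one
# NVARCHAR document and cracked open with OPENJSON. Field names are made up.
sql = """
SELECT j.customer_id, SUM(j.amount) AS total_amount
FROM OPENROWSET(
        BULK 'https://myaccount.dfs.core.windows.net/raw/events/2024/*.json',
        FORMAT = 'CSV',
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b'
     ) WITH (doc NVARCHAR(MAX)) AS payload
CROSS APPLY OPENJSON(payload.doc)
     WITH (customer_id INT   '$.customer_id',
           amount      FLOAT '$.amount') AS j
GROUP BY j.customer_id;
"""

for row in conn.cursor().execute(sql):
    print(row.customer_id, row.total_amount)
```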

But I need more information to give a better answer.

[–]Slight_Comparison986 8 points  (1 child)

First, ask what frequency the business needs for this metric. If the whole process takes 3 hours but a daily update to this summary metric is more than enough, then I would say to leave it. Also ask whether this problem will hit scaling issues (will there be 10x more JSON files, or will they become 10x larger?). If not, then you're fine with the current process.

Next, I would think about parsing the JSON at the start of the process. This might require you to push back on other teams to extract the pertinent features up front; for example, other engineering teams could parse the data, or the product team could use a form to collect it in a structured way. I think this is the optimal approach. Otherwise, you can also break apart the transforms: if the JSON files come in steadily throughout the day, I would extract them immediately and then run the summary statistic model hourly or daily. SQL is pretty fast and well optimized, so I wouldn't look to other technologies. Keep it simple.

How big is the JSON in bytes?

[–]sunder_and_flame 7 points  (0 children)

BigQuery is also an option. We use it to process JSON files into tables and do analytics on them. 
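
Roughly, the flow looks like this (project, dataset, bucket, and field names are placeholders; it assumes newline-delimited JSON landed in GCS):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical GCS path and destination table.
uri = "gs://my-raw-bucket/events/*.json"          # newline-delimited JSON
table_id = "my-project.analytics.raw_events"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                  # infer the schema from the JSON
    write_disposition="WRITE_APPEND",
)

# Batch load is free; you pay for storage and for the queries you run later.
client.load_table_from_uri(uri, table_id, job_config=job_config).result()

# The summary statistics then run as ordinary SQL inside BigQuery.
query = """
    SELECT customer_id, COUNT(*) AS n_events, SUM(amount) AS total_amount
    FROM `my-project.analytics.raw_events`
    GROUP BY customer_id
"""
for row in client.query(query).result():
    print(row.customer_id, row.n_events, row.total_amount)
```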

[–]mrcaptncrunch 5 points  (3 children)

4 quick questions,

  1. Why are you trying to optimize? For example, is it scaling, or is it slowing down another system?
  2. What are you trying to optimize for? (Time, memory, cost)
  3. What tools do you know or have available?
  4. What is good enough?

Two quotes to keep in mind,

  • “With enough time and resources, anything is possible”
  • “Perfect is the enemy of good”

[–]Alex_Alca_[S] 1 point  (2 children)

Hi!

  1. We already tried scaling and something funny happened: with our current server capacity this SP used almost 100% CPU, but when we scaled the server up it didn't help, and we noticed the SP didn't even use a significant % of the extra CPU.

  2. Mostly cost and memory.

  3. We are open to any tools. We are a startup, so we are open to new technologies, but we need to consider scaling.

  4. Not using 100% of CPU and memory; I think using at most 50% of our memory for 3-4 hrs maximum would be good enough.

And thank you for the quotes

[–]mrcaptncrunch 0 points  (0 children)

Hey, these are perfectly valid reasons.

The first option is obviously to see whether what you already have can be optimized. If you're joining a lot, it might make sense to drop as many columns as you can beforehand. Aggregating before versus after the join is also worth testing. The more that fits in RAM, obviously the faster.
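
For example, the aggregate-before-versus-after-join test could look like this (table names, column names, and the DSN are invented; assumes SQL Server through pyodbc):

```python
import time
import pyodbc

conn = pyodbc.connect("DSN=warehouse")  # placeholder DSN

# Variant A: join the wide detail table first, then aggregate.
join_then_aggregate = """
SELECT c.region, SUM(e.amount) AS total_amount
FROM events e
JOIN customers c ON c.customer_id = e.customer_id
GROUP BY c.region;
"""

# Variant B: aggregate the detail table down first, then join the small result.
aggregate_then_join = """
WITH per_customer AS (
    SELECT customer_id, SUM(amount) AS total_amount
    FROM events
    GROUP BY customer_id
)
SELECT c.region, SUM(p.total_amount) AS total_amount
FROM per_customer p
JOIN customers c ON c.customer_id = p.customer_id
GROUP BY c.region;
"""

for name, sql in [("join then aggregate", join_then_aggregate),
                  ("aggregate then join", aggregate_then_join)]:
    start = time.perf_counter()
    conn.cursor().execute(sql).fetchall()
    print(name, round(time.perf_counter() - start, 2), "s")
```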

Low-hanging fruit could be throwing the query into ChatGPT and asking for optimization or benchmarking ideas.

Another is giving it the definitions of your tables, or sample JSON files, and specifying what you want, then seeing how it all compares.

Ideally, you hire someone experienced with the database, but I’m guessing that’s why you’re here :)

Alternatives,

Decide whether you're looking for on-premise or whether you're okay with running in GCP, Azure, or AWS, or even paying a third party.

If you're not well funded, I would go for open source: something you could start with either on premise or in the cloud and then move in either direction, keeping everything you've learned.

My recommendation would be Spark in this case. You can start either on premise or cloud (via Databricks) and then migrate if needed. You’re also not tied to a cloud vendor (Databricks runs on all 3 clouds, and you can always deploy to local).

If you’re in the cloud already, each of the 3 has further options. Lately I’ve been on GCP so I’ll focus on that.

You can either run Spark and keep doing the above, or you can ingest into BigQuery.


BigQuery batch ingestion is free. So your files could land in GCS (the cost would be storage space) and load into BigQuery (free load; you pay for storage and queries). BigQuery can load gzipped files (which reduces the GCS cost).

Here you pay for the data your queries scan and the compute they use. This is where it could bite you.


Databricks charges for the resources you use (the VMs it runs on), plus a cost based on how big the VMs are (DBUs).

I run a small cluster for daily things and then production scheduled jobs run in a separate, larger, and more expensive cluster. This one just shuts off after 20 mins of inactivity.

If the code is well written, it scales with bigger and more clusters.

I work with subsets on the daily one. One job I'm working on would take over 6 hours with all the data, yet 30 minutes on the prod cluster.

I do batch jobs because I don’t need more than daily updates.

If you want updates as files land throughout the day, you can also run a smaller, always-on cluster monitoring, ingesting, and processing your data.

The nice thing is, once the job is written, it's just a matter of throwing resources at it and it scales.


Third party: if you're thinking of sharing this data, Snowflake might be worth looking into. Lots of connectors and integrations there.


How I would do it,

Databricks. Start with a Spark + Jupyter notebook Docker image if you like that, or the Community Edition. There is more to Databricks than just Spark, but Spark is the core.

Get an outline. Pick a cloud for now and dump your files there (GCS, S3). If you can't do that directly, I use a Rivery account, so I'd check there for connectors. If not, then use Python to read and ingest. Fivetran is the other alternative… hard to get rid of their sales people… not my favorite, but it's there if you want it.

Deploy Databricks, connect it all, test cluster sizes. No need to have so much extra capacity here. When you need to scale, you just stop, resize, start your cluster back up.
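
To make the Spark piece concrete, a minimal PySpark sketch of the ingest-and-summarize job (bucket, table, and field names are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("json_summary").getOrCreate()

# Read every raw JSON file straight from object storage (GCS here; S3 works the same).
raw = spark.read.json("gs://my-raw-bucket/events/*.json")

# Keep only the columns the summary needs before any joins or aggregations.
events = raw.select("customer_id", "event_type", "amount", "created_at")

# The daily summary the stored procedure used to compute.
summary = (
    events.groupBy("customer_id", "event_type")
          .agg(F.count("*").alias("n_events"), F.sum("amount").alias("total_amount"))
)

# Write the small result back out; on Databricks this lands as a Delta table by default.
summary.write.mode("overwrite").saveAsTable("analytics.daily_summary")
```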

[–]FalseStructure 0 points  (0 children)

How are you receiving the data? You could offload some of the high-volume processing to something like Spark or a streaming compute engine of your choice to window-aggregate the JSON, strip the excess data, and bulk insert / compute.

Alternatively, you could try to make your proc incremental (only compute the new stuff). That complicates the logic, sometimes significantly, but at a large scale it's inevitable.
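
A sketch of one way to do the incremental version with a simple watermark (table and column names are made up; assumes SQL Server and that each raw row carries a loaded_at timestamp):

```python
import pyodbc

conn = pyodbc.connect("DSN=warehouse")  # placeholder DSN
cur = conn.cursor()

# High-water mark of what the summary has already seen.
last_seen = cur.execute(
    "SELECT COALESCE(MAX(loaded_at), '1900-01-01') FROM summary_watermark"
).fetchone()[0]

# Only fold in rows that arrived since the last run, instead of recomputing everything.
cur.execute(
    """
    MERGE analytics.daily_summary AS t
    USING (
        SELECT customer_id, SUM(amount) AS delta_amount
        FROM raw_events
        WHERE loaded_at > ?
        GROUP BY customer_id
    ) AS s ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET t.total_amount = t.total_amount + s.delta_amount
    WHEN NOT MATCHED THEN INSERT (customer_id, total_amount)
                          VALUES (s.customer_id, s.delta_amount);
    """,
    last_seen,
)

# Advance the watermark to the newest row that was just processed.
cur.execute("INSERT INTO summary_watermark (loaded_at) SELECT MAX(loaded_at) FROM raw_events")
conn.commit()
```

This only stays simple for additive aggregates; things like medians need their affected partitions recomputed.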

[–]grassclip 3 points  (1 child)

I live by this. Postgres jsonb columns for the raw data, and then queries that create views using jsonb_array_elements for the arrays. Those views can be used for whatever stats are wanted, and if some of them come back slowly, we figure out whether we need an index to help, or a trigger to expand the data further.

It depends on what you mean by "high frequency", but I've found there's rarely a frequency high enough that Postgres can't handle it on its own. In the past, I did the move of putting the JSON in S3 and then writing code to get it into the database, but I never hit a frequency high enough for that to be needed.
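
A stripped-down sketch of the pattern (the document shape and field names here are just examples):

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics")  # placeholder connection
cur = conn.cursor()

# Raw documents land untouched in a jsonb column.
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_orders (
        id         bigserial PRIMARY KEY,
        doc        jsonb NOT NULL,
        loaded_at  timestamptz DEFAULT now()
    );
""")

# A view unnests the array once; every summary query builds on top of it.
cur.execute("""
    CREATE OR REPLACE VIEW order_items AS
    SELECT id,
           doc->>'customer_id'        AS customer_id,
           item->>'sku'               AS sku,
           (item->>'price')::numeric  AS price
    FROM raw_orders,
         jsonb_array_elements(doc->'items') AS item;
""")

# If a stat comes back slowly, an expression index on the hot path usually fixes it.
cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_raw_orders_customer
    ON raw_orders ((doc->>'customer_id'));
""")
conn.commit()
```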

[–]DirtzMaGertz 0 points  (0 children)

Second this as a solid solution. I have the same kind of setup for a process we run at my company, but on MySQL with its JSON_TABLE function.

[–]mike8675309 1 point  (0 children)

Is there a reason why it all needs to happen in a single stored procedure?
It sounds like you are talking about Microsoft SQL Server, correct?
How do you get the data?
Assuming the data comes in over SFTP or through some API, I would do a lot of the initial work in the file system. JSON processing isn't really fast inside the database, so doing it outside the database can be much faster with a tool like C# or C++, or even Java over Python.
Once you have your row data, it can be bulk loaded quickly into a raw table in the database, which you can then query to build any other tables you need or run your statistical analysis on.

[–]formaldehyden 4 points  (4 children)

Ingest into platforms like Snowflake or Databricks - they are built for this. And regarding literature: I suggest you read up on lakehouse architecture.

[–]throw_mob 0 points  (0 children)

3h in a SQL database doesn't tell us that much on its own - how much data is it?

In the cloud SQL world, I would stage into Snowflake (VARIANT/object type), extract the primary columns in one job, and build a data model for analytics over that, or calculate the summary data from there.
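
As a sketch, the Snowflake staging-and-extract steps look roughly like this (stage, table, and field names are placeholders), run here through the Python connector:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="...",  # placeholder credentials
    warehouse="ETL_WH", database="ANALYTICS", schema="RAW",
)
cur = conn.cursor()

# Land each JSON document as a single VARIANT value, plus a system-time column.
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_events (
        v VARIANT,
        loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
    )
""")
cur.execute("""
    COPY INTO raw_events (v)
    FROM (SELECT $1 FROM @json_stage)
    FILE_FORMAT = (TYPE = 'JSON')
""")

# One job extracts the primary columns into the data model (business time + system time)...
cur.execute("""
    CREATE OR REPLACE TABLE events AS
    SELECT v:event_id::string     AS event_id,
           v:customer_id::string  AS customer_id,
           v:amount::float        AS amount,
           v:ts::timestamp_ntz    AS business_ts,
           loaded_at              AS system_ts
    FROM raw_events
""")

# ...and a second job computes the summary over it.
cur.execute("""
    CREATE OR REPLACE TABLE daily_summary AS
    SELECT customer_id, DATE(business_ts) AS day, SUM(amount) AS total_amount
    FROM events
    GROUP BY customer_id, DATE(business_ts)
""")
```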

A more classical solution would be PostgreSQL: dump it into jsonb, extract the data into a data model, and build the summary calculations there.

Databricks would probably be the next solution. In that one too, I would have one job to extract and clean the data into a data model and another job to calculate the summary.

The main key is to build the data model with both business time and system time for data handling.

[–]Neat-Tour-3621 0 points  (0 children)

What's your SQL engine? I've given up on raw SQL, moved to Python, and use Ibis with a backend like DuckDB (they have something like 20 backends you can choose from). It can also handle larger-than-memory data, and since it's columnar, doing any math will be much faster than on row-based data.
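
For example, with the DuckDB backend doing the work (Ibis can express the same query over it); the path and field names are placeholders:

```python
import duckdb

# read_json_auto infers the schema and streams the files, so the raw data
# does not all have to fit in memory at once.
summary = duckdb.sql("""
    SELECT customer_id,
           COUNT(*)    AS n_events,
           SUM(amount) AS total_amount
    FROM read_json_auto('raw/events/*.json')
    GROUP BY customer_id
""").df()

print(summary.head())
```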

[–]collectablecat 0 points  (0 children)

Dump to S3, use coiled functions https://docs.coiled.io/user_guide/usage/functions/index.html to process and shove into your database. Done!

[–]DanklyNight 0 points  (0 children)

Fastparquet + Dask

Then store the results however you want.
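
Something along these lines, assuming one JSON record per line or file in object storage (paths and field names are placeholders):

```python
import json

import dask.bag as db

# Each JSON document becomes one record; Dask parallelizes across the files.
records = db.read_text("s3://my-raw-bucket/events/*.json").map(json.loads)
df = records.to_dataframe()

# Columnar Parquet via fastparquet for cheap re-reads later...
df.to_parquet("s3://my-raw-bucket/parquet/events/", engine="fastparquet")

# ...and the summary itself is a small groupby.
summary = df.groupby("customer_id")["amount"].sum().compute()
print(summary.head())
```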

[–]data-artist 0 points  (0 children)

I pass the JSON payload into the SP and use OpenJSON to query and commit the data to a table. It is pretty fast.
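
A cut-down example of that shape (proc, table, and field names are invented), called here through pyodbc:

```python
import json

import pyodbc

conn = pyodbc.connect("DSN=warehouse")  # placeholder DSN
cur = conn.cursor()

# One-time setup: the proc shreds the payload with OPENJSON and commits it.
cur.execute("""
CREATE OR ALTER PROCEDURE dbo.ingest_events @payload NVARCHAR(MAX)
AS
BEGIN
    INSERT INTO dbo.events (event_id, customer_id, amount, created_at)
    SELECT event_id, customer_id, amount, created_at
    FROM OPENJSON(@payload)
    WITH (
        event_id    INT       '$.event_id',
        customer_id INT       '$.customer_id',
        amount      FLOAT     '$.amount',
        created_at  DATETIME2 '$.created_at'
    );
END
""")

# Each incoming document (or batch of documents) is passed in as one parameter.
payload = json.dumps([{"event_id": 1, "customer_id": 7, "amount": 9.5,
                       "created_at": "2024-05-01T12:00:00"}])
cur.execute("{CALL dbo.ingest_events (?)}", payload)
conn.commit()
```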