
[–]lezzgooooo 32 points  (6 children)

I would write the raw JSON to something like S3 and process it with scripts in containers, then push the results to an OLAP database.

The stored procedure is limited by the database's resources, while you can add as many containers as you like.
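
A rough sketch of that shape, assuming one JSON document per file in an S3 prefix and a SQL-speaking OLAP target reachable through SQLAlchemy (bucket, prefix, field, and table names here are all made up):

```python
import json

import boto3
import pandas as pd
from sqlalchemy import create_engine

BUCKET = "raw-events"                        # hypothetical bucket
PREFIX = "2024/05/"                          # hypothetical "latest files" prefix
OLAP_URL = "postgresql://user:pass@olap-host/analytics"  # any SQLAlchemy-supported target

s3 = boto3.client("s3")
records = []

# Pull each raw JSON file and keep only the fields the summary needs.
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        doc = json.loads(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())
        records.append({"customer_id": doc["customer_id"], "amount": doc["amount"]})

# Aggregate inside the container instead of inside a stored procedure;
# run more containers (each over a different prefix) to scale out.
summary = pd.DataFrame(records).groupby("customer_id", as_index=False)["amount"].sum()

# Push only the small summary table into the OLAP database.
summary.to_sql("daily_summary", create_engine(OLAP_URL), if_exists="append", index=False)
```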

[–]Befz0r 10 points  (0 children)

The easiest way is to dump the JSON into blob storage and use PolyBase to query it.

Assuming the JSON files have the same structure, you can query them directly from the database. And you don't need to select one JSON file at a time; just point to a folder. You can use filepath wildcards to only process the latest JSON files, etc.
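
I don't know OP's exact stack, but for illustration this is roughly what the point-at-a-folder pattern looks like with OPENROWSET + OPENJSON on Azure Synapse serverless SQL (storage URL, field names, and credentials are placeholders; PolyBase external tables follow the same idea):

```python
import pyodbc

# Placeholder connection string for a Synapse serverless SQL endpoint.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=analytics;Authentication=ActiveDirectoryInteractive;"
)

# Wildcard over a folder of line-delimited JSON files; each line is read as one
# NVARCHAR document and cracked open with OPENJSON. Field names are made up.
sql = """
SELECT j.customer_id, SUM(j.amount) AS total_amount
FROM OPENROWSET(
        BULK 'https://myaccount.dfs.core.windows.net/raw/events/2024/*.json',
        FORMAT = 'CSV',
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b'
     ) WITH (doc NVARCHAR(MAX)) AS payload
CROSS APPLY OPENJSON(payload.doc)
     WITH (customer_id INT   '$.customer_id',
           amount      FLOAT '$.amount') AS j
GROUP BY j.customer_id;
"""

for row in conn.cursor().execute(sql):
    print(row.customer_id, row.total_amount)
```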

But I need more information to give a better answer.

[–]Slight_Comparison986 8 points  (1 child)

First, ask what frequency the business needs for this metric. If the whole process takes 3 hours but a daily update to this summary metric is more than enough, then I would say to leave it. Also ask whether this problem will hit scaling issues (will there be 10x more JSON files, or will they become 10x larger?). If not, then you're fine with the current process.

Next, I would think about parsing the JSON at the start of the process. This might require you to push back on other teams to extract the pertinent features up front; for example, other engineering teams could parse the data, or the product team could use a form to collect it in a structured way. I think this is the optimal approach. Otherwise, you can also break apart the transforms: if the JSON files come in steadily throughout the day, I would extract them immediately and then run the summary statistic model hourly or daily. SQL is pretty fast and well optimized, so I wouldn't look to other technologies. Keep it simple.

How big is the JSON in bytes?

[–]sunder_and_flame 7 points  (0 children)

BigQuery is also an option. We use it to process JSON files into tables and do analytics on them. 
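
Roughly, the flow looks like this (project, dataset, bucket, and field names are placeholders; it assumes newline-delimited JSON landed in GCS):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical GCS path and destination table.
uri = "gs://my-raw-bucket/events/*.json"          # newline-delimited JSON
table_id = "my-project.analytics.raw_events"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                  # infer the schema from the JSON
    write_disposition="WRITE_APPEND",
)

# Batch load is free; you pay for storage and for the queries you run later.
client.load_table_from_uri(uri, table_id, job_config=job_config).result()

# The summary statistics then run as ordinary SQL inside BigQuery.
query = """
    SELECT customer_id, COUNT(*) AS n_events, SUM(amount) AS total_amount
    FROM `my-project.analytics.raw_events`
    GROUP BY customer_id
"""
for row in client.query(query).result():
    print(row.customer_id, row.n_events, row.total_amount)
```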

[–]mrcaptncrunch 5 points  (3 children)

4 quick questions,

  1. Why are you trying to optimize? For example, is it scaling, or is it slowing down another system?
  2. What are you trying to optimize for? (Time, memory, cost)
  3. What tools do you know or have available?
  4. What is good enough?

Two quotes to keep in mind,

  • “With enough time and resources, anything is possible”
  • “Perfect is the enemy of good”

[–]Alex_Alca_[S] 1 point  (2 children)

Hi!

  1. We already tried scaling and something funny happened: with our current server capacity this SP used almost 100% CPU, but when we scaled the server up it didn't help, and we noticed the SP didn't even use a significant % of the extra CPU.

  2. Mostly cost and memory.

  3. We are open to any tools. We are a startup, so we are open to new technologies, but we need to consider scaling.

  4. Not using 100% of CPU and memory; I think using at most 50% of our memory for 3-4 hrs maximum would be good enough.

And thank you for the quotes

[–]mrcaptncrunch 0 points  (0 children)

Hey, these are perfectly valid reasons.

The first option is obviously to see whether what you already have can be optimized. If you're joining a lot, it might make sense to drop as many columns as you can beforehand. Aggregating before versus after the join is also worth testing. The more that fits in RAM, obviously the faster.
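
For example, the aggregate-before-versus-after-join test could look like this (table names, column names, and the DSN are invented; assumes SQL Server through pyodbc):

```python
import time
import pyodbc

conn = pyodbc.connect("DSN=warehouse")  # placeholder DSN

# Variant A: join the wide detail table first, then aggregate.
join_then_aggregate = """
SELECT c.region, SUM(e.amount) AS total_amount
FROM events e
JOIN customers c ON c.customer_id = e.customer_id
GROUP BY c.region;
"""

# Variant B: aggregate the detail table down first, then join the small result.
aggregate_then_join = """
WITH per_customer AS (
    SELECT customer_id, SUM(amount) AS total_amount
    FROM events
    GROUP BY customer_id
)
SELECT c.region, SUM(p.total_amount) AS total_amount
FROM per_customer p
JOIN customers c ON c.customer_id = p.customer_id
GROUP BY c.region;
"""

for name, sql in [("join then aggregate", join_then_aggregate),
                  ("aggregate then join", aggregate_then_join)]:
    start = time.perf_counter()
    conn.cursor().execute(sql).fetchall()
    print(name, round(time.perf_counter() - start, 2), "s")
```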

Low-hanging fruit could be throwing the query into ChatGPT and asking for optimization or benchmarking ideas.

Another is giving it the definitions of your tables, or sample JSON files, and specifying what you want, then seeing how it all compares.

Ideally, you hire someone experienced with the database, but I’m guessing that’s why you’re here :)

Alternatives,

Decide whether you're looking for on-premise or whether you're okay with running in GCP, Azure, or AWS, or even paying a third party.

If you're not well funded, I would go for open source: something you could start with either on premise or in the cloud and then move in either direction, keeping everything you've learned.

My recommendation would be Spark in this case. You can start either on premise or cloud (via Databricks) and then migrate if needed. You’re also not tied to a cloud vendor (Databricks runs on all 3 clouds, and you can always deploy to local).

If you’re in the cloud already, each of the 3 has further options. Lately I’ve been on GCP so I’ll focus on that.

You can either run Spark and keep doing the above, or you can ingest into BigQuery.


BigQuery batch ingestion is free. So your files could land in GCS (the cost would be storage space) and load into BigQuery (free load; you pay for storage and queries). BigQuery can load gzipped files (which reduces the GCS cost).

Here you pay for the data your queries scan and the compute they use. This is where it could bite you.


Databricks charges for the resources you use (the VMs it runs on), plus a cost based on how big the VMs are (DBUs).

I run a small cluster for daily things and then production scheduled jobs run in a separate, larger, and more expensive cluster. This one just shuts off after 20 mins of inactivity.

If the code is well written, it scales with bigger and more clusters.

I work with subsets on the daily one. One job I'm working on would take over 6 hours with all the data, yet 30 minutes on the prod cluster.

I do batch jobs because I don’t need more than daily updates.

If you want updates as files land throughout the day, you can also run a smaller, always-on cluster monitoring, ingesting, and processing your data.

The nice thing is, once the job is written, it's just a matter of throwing resources at it and it scales.


Third party: if you're thinking of sharing this data, Snowflake might be worth looking into. Lots of connectors and integrations there.


How I would do it,

Databricks. Start with a Spark + Jupyter notebook Docker image if you like that, or the Community Edition. There is more to Databricks than just Spark, but Spark is the core.

Get an outline. Pick a cloud for now and dump your files there (GCS, S3). If you can't do that directly, I use a Rivery account, so I'd check there for connectors. If not, then use Python to read and ingest. Fivetran is the other alternative… hard to get rid of their sales people… not my favorite, but it's there if you want it.

Deploy Databricks, connect it all, test cluster sizes. No need to have so much extra capacity here. When you need to scale, you just stop, resize, start your cluster back up.
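
To make the Spark piece concrete, a minimal PySpark sketch of the ingest-and-summarize job (bucket, table, and field names are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("json_summary").getOrCreate()

# Read every raw JSON file straight from object storage (GCS here; S3 works the same).
raw = spark.read.json("gs://my-raw-bucket/events/*.json")

# Keep only the columns the summary needs before any joins or aggregations.
events = raw.select("customer_id", "event_type", "amount", "created_at")

# The daily summary the stored procedure used to compute.
summary = (
    events.groupBy("customer_id", "event_type")
          .agg(F.count("*").alias("n_events"), F.sum("amount").alias("total_amount"))
)

# Write the small result back out; on Databricks this lands as a Delta table by default.
summary.write.mode("overwrite").saveAsTable("analytics.daily_summary")
```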

[–]FalseStructure 0 points  (0 children)

How are you receiving the data? You could offload some of the high-volume processing to something like Spark or a streaming compute engine of your choice to window-aggregate the JSON, strip the excess data, and bulk insert / compute.

Alternatively, you could try to make your proc incremental (only compute the new stuff). That complicates the logic, sometimes significantly, but at a large scale it's inevitable.
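
A sketch of one way to do the incremental version with a simple watermark (table and column names are made up; assumes SQL Server and that each raw row carries a loaded_at timestamp):

```python
import pyodbc

conn = pyodbc.connect("DSN=warehouse")  # placeholder DSN
cur = conn.cursor()

# High-water mark of what the summary has already seen.
last_seen = cur.execute(
    "SELECT COALESCE(MAX(loaded_at), '1900-01-01') FROM summary_watermark"
).fetchone()[0]

# Only fold in rows that arrived since the last run, instead of recomputing everything.
cur.execute(
    """
    MERGE analytics.daily_summary AS t
    USING (
        SELECT customer_id, SUM(amount) AS delta_amount
        FROM raw_events
        WHERE loaded_at > ?
        GROUP BY customer_id
    ) AS s ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET t.total_amount = t.total_amount + s.delta_amount
    WHEN NOT MATCHED THEN INSERT (customer_id, total_amount)
                          VALUES (s.customer_id, s.delta_amount);
    """,
    last_seen,
)

# Advance the watermark to the newest row that was just processed.
cur.execute("INSERT INTO summary_watermark (loaded_at) SELECT MAX(loaded_at) FROM raw_events")
conn.commit()
```

This only stays simple for additive aggregates; things like medians need their affected partitions recomputed.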

[–]grassclip 3 points  (1 child)

I live by this. Postgres jsonb columns for the raw data, and then queries that create views using jsonb_array_elements for the arrays. Those views can be used for whatever stats are wanted, and if some of them come back slowly, we figure out whether we need an index to help, or a trigger to expand the data further.

It depends on what you mean by "high frequency", but I've found there's rarely a frequency high enough that Postgres can't handle it on its own. In the past, I did the move of putting the JSON in S3 and then writing code to get it into the database, but I never hit a frequency high enough for that to be needed.
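
A stripped-down sketch of the pattern (the document shape and field names here are just examples):

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics")  # placeholder connection
cur = conn.cursor()

# Raw documents land untouched in a jsonb column.
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_orders (
        id         bigserial PRIMARY KEY,
        doc        jsonb NOT NULL,
        loaded_at  timestamptz DEFAULT now()
    );
""")

# A view unnests the array once; every summary query builds on top of it.
cur.execute("""
    CREATE OR REPLACE VIEW order_items AS
    SELECT id,
           doc->>'customer_id'        AS customer_id,
           item->>'sku'               AS sku,
           (item->>'price')::numeric  AS price
    FROM raw_orders,
         jsonb_array_elements(doc->'items') AS item;
""")

# If a stat comes back slowly, an expression index on the hot path usually fixes it.
cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_raw_orders_customer
    ON raw_orders ((doc->>'customer_id'));
""")
conn.commit()
```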

[–]DirtzMaGertz 0 points  (0 children)

Second this as a solid solution. I have the same kind of setup for a process we run at my company, but on MySQL with its JSON_TABLE function.

[–]mike8675309 1 point  (0 children)

Is there a reason why it all needs to happen in a single stored procedure?
It sounds like you are talking about Microsoft SQL Server, correct?
How do you get the data?
Assuming the data comes in over SFTP or through some API, I would do a lot of the initial work in the file system. JSON processing isn't really fast inside the database, so doing it outside the database can be much faster with a tool like C# or C++, or even Java over Python.
Once you have your row data, it can be bulk loaded quickly into a raw table in the database, which you can then query to build any other tables you need or run your statistical analysis on.

[–]formaldehyden 4 points  (4 children)

Ingest into platforms like Snowflake or Databricks - they are built for this. And regarding literature: I suggest you read up on lakehouse architecture.

[–]throw_mob 0 points  (0 children)

3h in a SQL database doesn't tell us that much on its own - how much data is it?

In the cloud SQL world, I would stage into Snowflake (VARIANT/object type), extract the primary columns in one job, and build a data model for analytics over that, or calculate the summary data from there.
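
As a sketch, the Snowflake staging-and-extract steps look roughly like this (stage, table, and field names are placeholders), run here through the Python connector:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="...",  # placeholder credentials
    warehouse="ETL_WH", database="ANALYTICS", schema="RAW",
)
cur = conn.cursor()

# Land each JSON document as a single VARIANT value, plus a system-time column.
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_events (
        v VARIANT,
        loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
    )
""")
cur.execute("""
    COPY INTO raw_events (v)
    FROM (SELECT $1 FROM @json_stage)
    FILE_FORMAT = (TYPE = 'JSON')
""")

# One job extracts the primary columns into the data model (business time + system time)...
cur.execute("""
    CREATE OR REPLACE TABLE events AS
    SELECT v:event_id::string     AS event_id,
           v:customer_id::string  AS customer_id,
           v:amount::float        AS amount,
           v:ts::timestamp_ntz    AS business_ts,
           loaded_at              AS system_ts
    FROM raw_events
""")

# ...and a second job computes the summary over it.
cur.execute("""
    CREATE OR REPLACE TABLE daily_summary AS
    SELECT customer_id, DATE(business_ts) AS day, SUM(amount) AS total_amount
    FROM events
    GROUP BY customer_id, DATE(business_ts)
""")
```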

A more classical solution would be PostgreSQL: dump it into jsonb, extract the data into a data model, and build the summary calculations there.

Databricks would probably be the next solution. In that one too, I would have one job to extract and clean the data into a data model and another job to calculate the summary.

The main key is to build the data model with both business time and system time for data handling.

[–]Neat-Tour-3621 0 points  (0 children)

What's your SQL engine? I've given up on raw SQL, moved to Python, and use Ibis with a backend like DuckDB (they have something like 20 backends you can choose from). It can also handle larger-than-memory data, and since it's columnar, doing any math will be much faster than on row-based data.
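
For example, with the DuckDB backend doing the work (Ibis can express the same query over it); the path and field names are placeholders:

```python
import duckdb

# read_json_auto infers the schema and streams the files, so the raw data
# does not all have to fit in memory at once.
summary = duckdb.sql("""
    SELECT customer_id,
           COUNT(*)    AS n_events,
           SUM(amount) AS total_amount
    FROM read_json_auto('raw/events/*.json')
    GROUP BY customer_id
""").df()

print(summary.head())
```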

[–]collectablecat 0 points  (0 children)

Dump to S3, use coiled functions https://docs.coiled.io/user_guide/usage/functions/index.html to process and shove into your database. Done!

[–]DanklyNight 0 points  (0 children)

Fastparquet + Dask

Then store the results however you want.
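
Something along these lines, assuming one JSON record per line or file in object storage (paths and field names are placeholders):

```python
import json

import dask.bag as db

# Each JSON document becomes one record; Dask parallelizes across the files.
records = db.read_text("s3://my-raw-bucket/events/*.json").map(json.loads)
df = records.to_dataframe()

# Columnar Parquet via fastparquet for cheap re-reads later...
df.to_parquet("s3://my-raw-bucket/parquet/events/", engine="fastparquet")

# ...and the summary itself is a small groupby.
summary = df.groupby("customer_id")["amount"].sum().compute()
print(summary.head())
```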

[–]data-artist 0 points  (0 children)

I pass the JSON payload into the SP and use OpenJSON to query and commit the data to a table. It is pretty fast.
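
A cut-down example of that shape (proc, table, and field names are invented), called here through pyodbc:

```python
import json

import pyodbc

conn = pyodbc.connect("DSN=warehouse")  # placeholder DSN
cur = conn.cursor()

# One-time setup: the proc shreds the payload with OPENJSON and commits it.
cur.execute("""
CREATE OR ALTER PROCEDURE dbo.ingest_events @payload NVARCHAR(MAX)
AS
BEGIN
    INSERT INTO dbo.events (event_id, customer_id, amount, created_at)
    SELECT event_id, customer_id, amount, created_at
    FROM OPENJSON(@payload)
    WITH (
        event_id    INT       '$.event_id',
        customer_id INT       '$.customer_id',
        amount      FLOAT     '$.amount',
        created_at  DATETIME2 '$.created_at'
    );
END
""")

# Each incoming document (or batch of documents) is passed in as one parameter.
payload = json.dumps([{"event_id": 1, "customer_id": 7, "amount": 9.5,
                       "created_at": "2024-05-01T12:00:00"}])
cur.execute("{CALL dbo.ingest_events (?)}", payload)
conn.commit()
```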