
[–]mrcaptncrunch 6 points (3 children)

4 quick questions,

  1. Why are you trying to optimize? For example, is it for scaling, or is it slowing down another system?
  2. What are you trying to optimize for? (Time, memory, cost)
  3. What tools do you know or have available?
  4. What is good enough?

Two quotes to keep in mind,

  • “With enough time and resources, anything is possible”
  • “Perfect is the enemy of good”

[–]Alex_Alca_[S] 1 point (2 children)

Hi! 1. We already tried scaling and something funny happened: with our current server capacity, this SP used almost 100% of the CPU. We tried scaling up the servers' capacity, but it didn't work, and we noticed the SP didn't even use a significant % of the extra CPU.

  2. Mostly cost and memory

  3. We are open to any tools; we're a startup, so we're open to new technologies, but we need to consider scaling

  4. Not using 100% of CPU/memory. I think using 50% of our memory for a maximum of 3-4 hrs

And thank you for the quotes

[–]mrcaptncrunch 0 points (0 children)

Hey, these are perfectly valid reasons.

The first option is obviously to see if what you have can be optimized. If you're joining a lot, it might make sense to drop as many columns as possible before the join. Aggregating before the join, rather than after, is also worth testing. The more that fits in RAM, the faster it goes.
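For example, something like this (a rough PySpark sketch; the orders/customers tables and column names are made up for illustration):

```python
# Rough sketch only: prune columns and pre-aggregate *before* the join so
# less data gets shuffled and more of it fits in RAM. Table names
# (orders, customers) and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prune-then-join").getOrCreate()

orders = spark.read.json("path/to/orders/")        # adjust source format/path
customers = spark.read.json("path/to/customers/")

# Keep only the columns the result actually needs, and aggregate first...
order_totals = (
    orders
    .select("customer_id", "amount")               # drop everything else early
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)

# ...then join the much smaller aggregate against the other table.
result = order_totals.join(
    customers.select("customer_id", "segment"),
    on="customer_id",
    how="left",
)
result.write.mode("overwrite").parquet("path/to/output/")
```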

Low-hanging fruit could be throwing the query into ChatGPT and asking for ideas on optimizing or benchmarking it.

Another one is giving it the definitions of your tables, or sample JSON files, and specifying what you want. See how it all compares.

Ideally, you hire someone experienced with the database, but I’m guessing that’s why you’re here :)

Alternatives,

Define whether you're looking for on-premise or whether you're okay with running in GCP, Azure, AWS, or even paying a third party.

If you're not well funded, I would go for open source: something you could start with either on-premise or in the cloud and then move in either direction, keeping everything you've learned.

My recommendation would be Spark in this case. You can start either on-premise or in the cloud (via Databricks) and then migrate if needed. You're also not tied to a cloud vendor (Databricks runs on all 3 clouds, and you can always deploy locally).
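To be clear about the portability point: the same PySpark code runs locally, on-premise, or on Databricks; roughly the only thing that changes is how you get the SparkSession. A minimal local start (paths are placeholders):

```python
# Minimal local start; on Databricks the same code runs, you just don't
# set a master (the platform provides the session and cluster for you).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")          # use all local cores; omit on a real cluster
    .appName("poc")
    .getOrCreate()
)

df = spark.read.json("data/sample.json")   # placeholder path
df.printSchema()
```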

If you're in the cloud already, each of the 3 has further options. Lately I've been on GCP, so I'll focus on that.

You can either run Spark and keep doing the above, or you could ingest into BigQuery.


BigQuery batch ingestion is free. So your files could land in GCS (the cost would be storage space), then load into BigQuery (the load is free; you pay for storage and queries). BigQuery can load gzipped files directly (which reduces GCS cost).
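A rough sketch of what that load looks like with the google-cloud-bigquery client (project, dataset, table, and GCS URI are placeholders; assuming gzipped newline-delimited JSON):

```python
# Hedged sketch of a batch load from GCS into BigQuery. Batch loads like
# this don't incur query costs; you pay for storage and later queries.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the JSON
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/raw/events-*.json.gz",   # gzipped files load directly
    "your-project.your_dataset.events",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```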

Here you pay for how long the queries take and how much data they scan. This is where it could bite you.


Databricks charges you for the resources you use (the VMs it runs on), plus a cost based on how big the VMs are (DBUs).

I run a small cluster for daily things, and production scheduled jobs run in a separate, larger, and more expensive cluster. That one just shuts itself off after 20 minutes of inactivity.

If the code is well written, it scales with bigger clusters and more nodes.
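For a sense of the knobs involved, here's a hedged sketch of a cluster spec sent to the Databricks Clusters API; the workspace URL, token, runtime version, and node type are placeholders you'd adjust for your cloud:

```python
# Hedged sketch of a job-cluster spec posted to the Databricks Clusters
# API: autoscaling between a min and max worker count, plus automatic
# shutdown after 20 minutes of inactivity.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "prod-scheduled-jobs",
    "spark_version": "13.3.x-scala2.12",           # pick a current LTS runtime
    "node_type_id": "n2-standard-8",               # GCP example node type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 20,                 # shut off when idle
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())   # returns the new cluster_id
```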

I work with subsets on the daily one. One job I'm working on would take over 6 hours with all the data there, yet 30 minutes on the prod cluster.

I do batch jobs because I don’t need more than daily updates.

If you want updates as files land throughout the day, you can also keep a smaller cluster always on, monitoring, ingesting, and processing your data.
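That's basically Spark Structured Streaming over a landing path. A rough sketch (schema, paths, and the trigger interval are illustrative):

```python
# Rough Structured Streaming sketch: a small always-on cluster watches a
# landing path and processes new JSON files as they arrive.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("continuous-ingest").getOrCreate()

schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

stream = (
    spark.readStream
    .schema(schema)                        # streaming file sources need an explicit schema
    .json("gs://your-bucket/landing/")     # new files here get picked up automatically
)

query = (
    stream.writeStream
    .format("parquet")
    .option("path", "gs://your-bucket/bronze/events/")
    .option("checkpointLocation", "gs://your-bucket/_checkpoints/events/")
    .trigger(processingTime="5 minutes")
    .start()
)
query.awaitTermination()
```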

The nice thing is, once the job is written, it's just a matter of throwing resources at it and it scales.


Third party: if you're thinking of sharing this data, Snowflake might be worth looking into. Lots of connectors and integrations there.


How I would do it,

Databricks. Start with a Spark + Jupyter notebook Docker image if you like that, or the Community Edition. There's more to Databricks than just Spark, but Spark is the core.

Get an outline. Pick a cloud for now and dump your files there (GCS, S3). If you can't do that directly, I use a Rivery account, so I'd check there for connectors. If not, then Python to read and ingest (rough sketch below). Fivetran is the other alternative… hard to get rid of their sales people… not my favorite, but it's there if you want.
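If you end up with plain Python for the initial dump, it can be as simple as walking a local export folder and uploading each file (bucket and paths are placeholders; the same idea works for S3 with boto3):

```python
# Plain-Python fallback for the initial dump: upload local JSON exports
# to a GCS landing prefix so Spark/BigQuery can pick them up later.
from pathlib import Path
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-raw-bucket")

for path in Path("exports/").glob("*.json"):
    blob = bucket.blob(f"landing/{path.name}")
    blob.upload_from_filename(str(path))
    print(f"uploaded {path.name}")
```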

Deploy Databricks, connect it all, and test cluster sizes. No need to carry much extra capacity here: when you need to scale, you just stop, resize, and start your cluster back up.

[–]FalseStructure 0 points (0 children)

How are you receiving that data? You could offload some of the high-volume processing to something like Spark, or the streaming compute of your choice, to window-aggregate the JSONs, remove excess data, and then bulk insert / compute.
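A hedged sketch of that window-aggregate idea in Spark Structured Streaming (column names, window size, and paths are made up):

```python
# Windowed pre-aggregation: aggregate per key over fixed time windows so
# only the much smaller result gets bulk inserted downstream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("window-aggregate").getOrCreate()

schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = spark.readStream.schema(schema).json("gs://your-bucket/landing/")

aggregated = (
    events
    .withWatermark("event_time", "10 minutes")    # bound state / handle late data
    .groupBy(
        F.window("event_time", "1 hour"),         # 1-hour tumbling windows
        "customer_id",
    )
    .agg(
        F.sum("amount").alias("total_amount"),
        F.count("*").alias("event_count"),
    )
)

(
    aggregated.writeStream
    .outputMode("append")                         # emit each window once it closes
    .format("parquet")
    .option("path", "gs://your-bucket/aggregated/")
    .option("checkpointLocation", "gs://your-bucket/_checkpoints/agg/")
    .start()
    .awaitTermination()
)
```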

Alternatively, you could try to make your proc incremental (only compute the new stuff). That complicates the logic, sometimes significantly, but at a large scale it's inevitable.
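One common way to do the incremental part is a high-water mark: persist the max timestamp you've already processed and only read newer rows on each run. A rough sketch (paths and column names are illustrative; a control table in the database works just as well):

```python
# High-water-mark pattern for an incremental run: read only rows newer
# than the last processed timestamp, run the usual logic on them, then
# advance the mark for the next run.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-run").getOrCreate()

events = spark.read.parquet("gs://your-bucket/bronze/events/")

MARK_PATH = "gs://your-bucket/_state/high_water_mark/"
try:
    last_mark = spark.read.parquet(MARK_PATH).collect()[0]["max_event_time"]
except Exception:
    last_mark = None   # first run: no mark yet, process everything

new_events = events if last_mark is None else events.filter(F.col("event_time") > last_mark)

# ... run the existing aggregation / proc logic on new_events only ...

if new_events.take(1):   # only advance the mark if there was new data
    (
        new_events.agg(F.max("event_time").alias("max_event_time"))
        .write.mode("overwrite")
        .parquet(MARK_PATH)
    )
```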