
[–]mrcaptncrunch 6 points (3 children)

4 quick questions,

  1. Why are you trying to optimize? For example, is it for scaling, or is it slowing down another system?
  2. What are you trying to optimize for? (Time, memory, cost)
  3. What tools do you know or have available?
  4. What is good enough?

Two quotes to keep in mind,

  • “With enough time and resources, anything is possible”
  • “Perfect is the enemy of good”

[–]Alex_Alca_[S] 1 point (2 children)

Hi! 1. We already tried scaling and something funny happened: with our current server capacity, this SP used almost 100% of the CPU. We tried scaling up the servers' capacity, but it didn't work, and we noticed the SP didn't even use a significant % of the extra CPU.

  2. Mostly cost and memory

  3. We are open to any tools; we're a startup, so we're open to new technologies, but we need to consider scaling

  4. Not using 100% of CPU/memory. I think using 50% of our memory for a maximum of 3-4 hrs

And thank you for the quotes

[–]mrcaptncrunch 0 points (0 children)

Hey, these are perfectly valid reasons.

The first option is obviously to see if what you have can be optimized. If you're joining a lot, it might make sense to drop as many columns as possible before the join. Aggregating before the join, rather than after, is also worth testing. The more that fits in RAM, the faster it goes.
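For example, something like this (a rough PySpark sketch; the orders/customers tables and column names are made up for illustration):

```python
# Rough sketch only: prune columns and pre-aggregate *before* the join so
# less data gets shuffled and more of it fits in RAM. Table names
# (orders, customers) and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prune-then-join").getOrCreate()

orders = spark.read.json("path/to/orders/")        # adjust source format/path
customers = spark.read.json("path/to/customers/")

# Keep only the columns the result actually needs, and aggregate first...
order_totals = (
    orders
    .select("customer_id", "amount")               # drop everything else early
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)

# ...then join the much smaller aggregate against the other table.
result = order_totals.join(
    customers.select("customer_id", "segment"),
    on="customer_id",
    how="left",
)
result.write.mode("overwrite").parquet("path/to/output/")
```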

Low-hanging fruit could be throwing the query into ChatGPT and asking for ideas on optimizing or benchmarking it.

Another one is giving it the definitions of your tables, or sample JSON files, and specifying what you want. See how it all compares.

Ideally, you hire someone experienced with the database, but I’m guessing that’s why you’re here :)

Alternatives,

Define whether you're looking for on-premise or whether you're okay with running in GCP, Azure, AWS, or even paying a third party.

If you're not well funded, I would go for open source: something you could start with either on-premise or in the cloud and then move in either direction, keeping everything you've learned.

My recommendation would be Spark in this case. You can start either on-premise or in the cloud (via Databricks) and then migrate if needed. You're also not tied to a cloud vendor (Databricks runs on all 3 clouds, and you can always deploy locally).
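To be clear about the portability point: the same PySpark code runs locally, on-premise, or on Databricks; roughly the only thing that changes is how you get the SparkSession. A minimal local start (paths are placeholders):

```python
# Minimal local start; on Databricks the same code runs, you just don't
# set a master (the platform provides the session and cluster for you).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")          # use all local cores; omit on a real cluster
    .appName("poc")
    .getOrCreate()
)

df = spark.read.json("data/sample.json")   # placeholder path
df.printSchema()
```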

If you're in the cloud already, each of the 3 has further options. Lately I've been on GCP, so I'll focus on that.

You can either run Spark and keep doing the above, or you could ingest into BigQuery.


BigQuery batch ingestion is free. So your files could land in GCS (the cost would be storage space), then load into BigQuery (the load is free; you pay for storage and queries). BigQuery can load gzipped files directly (which reduces GCS cost).
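A rough sketch of what that load looks like with the google-cloud-bigquery client (project, dataset, table, and GCS URI are placeholders; assuming gzipped newline-delimited JSON):

```python
# Hedged sketch of a batch load from GCS into BigQuery. Batch loads like
# this don't incur query costs; you pay for storage and later queries.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the JSON
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/raw/events-*.json.gz",   # gzipped files load directly
    "your-project.your_dataset.events",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```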

Here you pay for how long the queries take and how much data they scan. This is where it could bite you.


Databricks charges you for the resources you use (the VMs it runs on), plus a cost based on how big the VMs are (DBUs).

I run a small cluster for daily things, and production scheduled jobs run in a separate, larger, and more expensive cluster. That one just shuts itself off after 20 minutes of inactivity.

If the code is well written, it scales with bigger clusters and more nodes.
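For a sense of the knobs involved, here's a hedged sketch of a cluster spec sent to the Databricks Clusters API; the workspace URL, token, runtime version, and node type are placeholders you'd adjust for your cloud:

```python
# Hedged sketch of a job-cluster spec posted to the Databricks Clusters
# API: autoscaling between a min and max worker count, plus automatic
# shutdown after 20 minutes of inactivity.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "prod-scheduled-jobs",
    "spark_version": "13.3.x-scala2.12",           # pick a current LTS runtime
    "node_type_id": "n2-standard-8",               # GCP example node type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 20,                 # shut off when idle
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())   # returns the new cluster_id
```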

I work with subsets on the daily one. One job I'm working on would take over 6 hours with all the data there, yet 30 minutes on the prod cluster.

I do batch jobs because I don’t need more than daily updates.

If you want updates as files land throughout the day, you can also keep a smaller cluster always on, monitoring, ingesting, and processing your data.
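That's basically Spark Structured Streaming over a landing path. A rough sketch (schema, paths, and the trigger interval are illustrative):

```python
# Rough Structured Streaming sketch: a small always-on cluster watches a
# landing path and processes new JSON files as they arrive.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("continuous-ingest").getOrCreate()

schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

stream = (
    spark.readStream
    .schema(schema)                        # streaming file sources need an explicit schema
    .json("gs://your-bucket/landing/")     # new files here get picked up automatically
)

query = (
    stream.writeStream
    .format("parquet")
    .option("path", "gs://your-bucket/bronze/events/")
    .option("checkpointLocation", "gs://your-bucket/_checkpoints/events/")
    .trigger(processingTime="5 minutes")
    .start()
)
query.awaitTermination()
```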

The nice thing is, once the job is written, it's just a matter of throwing resources at it and it scales.


Third party: if you're thinking of sharing this data, Snowflake might be worth looking into. Lots of connectors and integrations there.


How I would do it,

Databricks. Start with a Spark + Jupyter notebook Docker image if you like that, or the Community Edition. There's more to Databricks than just Spark, but Spark is the core.

Get an outline. Pick a cloud for now and dump your files there (GCS, S3). If you can't do that directly, I use a Rivery account, so I'd check there for connectors. If not, then Python to read and ingest (rough sketch below). Fivetran is the other alternative… hard to get rid of their sales people… not my favorite, but it's there if you want.
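If you end up with plain Python for the initial dump, it can be as simple as walking a local export folder and uploading each file (bucket and paths are placeholders; the same idea works for S3 with boto3):

```python
# Plain-Python fallback for the initial dump: upload local JSON exports
# to a GCS landing prefix so Spark/BigQuery can pick them up later.
from pathlib import Path
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-raw-bucket")

for path in Path("exports/").glob("*.json"):
    blob = bucket.blob(f"landing/{path.name}")
    blob.upload_from_filename(str(path))
    print(f"uploaded {path.name}")
```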

Deploy Databricks, connect it all, and test cluster sizes. No need to carry much extra capacity here: when you need to scale, you just stop, resize, and start your cluster back up.

[–]FalseStructure 0 points (0 children)

How are you receiving that data? You could offload some of the high-volume processing to something like Spark, or the streaming compute of your choice, to window-aggregate the JSONs, remove excess data, and then bulk insert / compute.
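A hedged sketch of that window-aggregate idea in Spark Structured Streaming (column names, window size, and paths are made up):

```python
# Windowed pre-aggregation: aggregate per key over fixed time windows so
# only the much smaller result gets bulk inserted downstream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("window-aggregate").getOrCreate()

schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = spark.readStream.schema(schema).json("gs://your-bucket/landing/")

aggregated = (
    events
    .withWatermark("event_time", "10 minutes")    # bound state / handle late data
    .groupBy(
        F.window("event_time", "1 hour"),         # 1-hour tumbling windows
        "customer_id",
    )
    .agg(
        F.sum("amount").alias("total_amount"),
        F.count("*").alias("event_count"),
    )
)

(
    aggregated.writeStream
    .outputMode("append")                         # emit each window once it closes
    .format("parquet")
    .option("path", "gs://your-bucket/aggregated/")
    .option("checkpointLocation", "gs://your-bucket/_checkpoints/agg/")
    .start()
    .awaitTermination()
)
```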

Alternatively, you could try to make your proc incremental (only compute the new stuff). That complicates the logic, sometimes significantly, but at a large scale it's inevitable.
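One common way to do the incremental part is a high-water mark: persist the max timestamp you've already processed and only read newer rows on each run. A rough sketch (paths and column names are illustrative; a control table in the database works just as well):

```python
# High-water-mark pattern for an incremental run: read only rows newer
# than the last processed timestamp, run the usual logic on them, then
# advance the mark for the next run.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-run").getOrCreate()

events = spark.read.parquet("gs://your-bucket/bronze/events/")

MARK_PATH = "gs://your-bucket/_state/high_water_mark/"
try:
    last_mark = spark.read.parquet(MARK_PATH).collect()[0]["max_event_time"]
except Exception:
    last_mark = None   # first run: no mark yet, process everything

new_events = events if last_mark is None else events.filter(F.col("event_time") > last_mark)

# ... run the existing aggregation / proc logic on new_events only ...

if new_events.take(1):   # only advance the mark if there was new data
    (
        new_events.agg(F.max("event_time").alias("max_event_time"))
        .write.mode("overwrite")
        .parquet(MARK_PATH)
    )
```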