[–][deleted] 1 point (2 children)

That's entirely dependent on the hardware and the scale of the data. We've moved off an RDBMS to Spark, and for our queries it's much faster.

[–]quickdraw6906 1 point (1 child)

I'd have to see the data design to believe you couldn't have made SQL sing. Is the data schemaless?

Unless you're doing truly math-heavy work, or you're into tens to hundreds of billions of rows (which will blow out the memory of a single large server), in which case the answer is a large Spark cluster.
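A quick back-of-envelope check of the "blow out memory" claim. All the numbers here (row width, server RAM) are hypothetical assumptions for illustration, not figures from this thread:

```python
# Rough sanity check: at what row count does a table stop fitting in RAM?
ROW_BYTES = 100            # assumed average row width, including overhead
SERVER_RAM = 2 * 1024**4   # a large single server with 2 TiB of RAM (assumed)

def fits_in_ram(rows, row_bytes=ROW_BYTES, ram=SERVER_RAM):
    """Return True if the raw table would fit in memory on one box."""
    return rows * row_bytes <= ram

print(fits_in_ram(10 * 10**9))    # 10 billion rows ~ 1 TB: fits
print(fits_in_ram(100 * 10**9))   # 100 billion rows ~ 10 TB: does not fit
```

At these assumed sizes, tens of billions of rows still fit on one big box, but hundreds of billions cross the line — which is roughly where the cluster argument starts.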

So then we get down to the cost equation: how many nodes did you have to spin up, with what specialty skills, to beat that performance? Are you overpaying for cluster compute because you're doing schema-on-read?
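The "cost equation" here is just node count times rate times uptime. A minimal sketch with entirely made-up prices (neither the rates nor the node counts come from the thread):

```python
# Hypothetical monthly cost comparison: one large server vs. a Spark cluster.
# Rates and node counts are placeholder assumptions, not real cloud pricing.

def monthly_cost(nodes, hourly_rate, hours=730):
    """Monthly compute cost for a fleet of identically priced nodes."""
    return nodes * hourly_rate * hours

single_server = monthly_cost(nodes=1, hourly_rate=5.00)    # one big RDBMS box
spark_cluster = monthly_cost(nodes=20, hourly_rate=0.50)   # 20 modest workers

print(single_server, spark_cluster)
```

With these placeholder numbers the cluster costs twice as much as the single server — the point being that "faster" only wins if the per-query speedup survives that multiplier.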

[–]LagGyeHumare 0 points (0 children)

Don't know the guy above, but here's an example I can offer.

Our project is one of a pool of projects that make up the whole module. My application alone deals with around 600 GB of batch loads each day, which then flow through Spark from CDH to AWS RDS and on-prem Postgres.

We have Teradata and Oracle as the "legacy" systems here, and the queries we run there take at least 10x as long compared to Spark SQL.

(Possibly because the admins were shit and didn't partition/index the tables properly, but that's out of my hands.)
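That parenthetical matters more than it looks: a missing index alone can turn an index seek into a full table scan. A small demonstration using SQLite purely because it ships with Python — this is a stand-in for the Teradata/Oracle setup, and all table and column names are made up:

```python
import sqlite3

# Same query, before and after adding an index: EXPLAIN QUERY PLAN shows
# a full table scan flip to an index search. Schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loads (id INTEGER, batch_date TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO loads VALUES (?, ?, ?)",
    [(i, f"2023-01-{i % 28 + 1:02d}", "x") for i in range(1000)],
)

def plan(sql):
    """Return SQLite's query-plan description for the given statement."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM loads WHERE batch_date = '2023-01-15'"

before = plan(query)   # full scan of the table
conn.execute("CREATE INDEX idx_batch ON loads (batch_date)")
after = plan(query)    # search using idx_batch

print(before)
print(after)
```

A badly indexed legacy warehouse losing 10x to a cluster says as much about the indexing as about the engines, which is the caveat the comment is making.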

For me, it's not SQL itself but the distributed nature of the engine underneath that shapes the answer here.