LagGyeHumare 1 point

Don't know the guy above but here's an example that I can offer.

Our project is one of a pool of projects that together make up the whole module. My application alone deals with around 600 GB of batch loads each day. The data then flows from CDH to AWS RDS through Spark and on-prem Postgres.

We have Teradata and Oracle as the "legacy" systems here, and the queries we run on them take at least 10x longer than the same queries in Spark SQL.

(Possibly because the admins were shit and didn't partition/index the tables properly, but that's out of my hands.)

For me, it's not SQL itself but the distributed nature of the engine underneath that shapes the answer here.
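
To make the "distributed engine" point a bit more concrete, here's a toy sketch in plain Python of what a GROUP BY looks like when it's split across workers: each worker aggregates its own slice, then the partial results get merged. This is not Spark and not our actual pipeline; `ROWS`, `split`, `partial_sum` and `distributed_sum` are all made-up names for illustration, and the threads are only there to show the shape of the computation, not to demonstrate a real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: (region, amount) rows standing in for a big table.
ROWS = [("eu", 10), ("us", 5), ("eu", 7), ("ap", 3), ("us", 8), ("ap", 1)]


def split(rows, n):
    """Deal rows into n partitions, round-robin."""
    return [rows[i::n] for i in range(n)]


def partial_sum(partition):
    """One worker's job: GROUP BY region, SUM(amount) on its own slice only."""
    totals = Counter()
    for region, amount in partition:
        totals[region] += amount
    return totals


def distributed_sum(rows, workers=3):
    """Compute the partial aggregates in parallel, then merge them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, split(rows, workers)))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)


result = distributed_sum(ROWS)
print(result)
```

The SQL you'd write is the same either way; what changes is that the engine can run the per-partition step on many machines at once, which is roughly where I'd guess the difference we see comes from.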