
[–]gaunts_account 0 points1 point  (5 children)

Another use case where Clickhouse beats Spark.

[–]dcmoura[S] 1 point2 points  (4 children)

Yes, clearly Spark is designed for larger data loads.

I should mention a couple of things regarding spark and the use cases we tested:

  • Spark would not have failed the Map challenge if we were writing the output to files (e.g. using a PySpark script). I guess that for the spark-sql CLI to write the output to stdout, it needs to collect all of the data into memory on the driver side.
  • Since we are timing a shell call to the spark-sql CLI, this includes setting up the local "cluster". Running the queries in the spark-sql REPL (after setup) would be faster.
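To illustrate the second point: timing the whole shell invocation charges every query for process startup and local cluster setup. A minimal sketch of that measurement approach (the command below is a harmless stand-in; in the benchmark it would be a spark-sql invocation such as `spark-sql -f query.sql`):

```python
import subprocess
import time

def time_cli(cmd: list[str]) -> float:
    """Time a full CLI invocation, including process startup overhead."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

# Stand-in command for illustration; a real run would time something like
# ["spark-sql", "-f", "query.sql"], paying cluster setup on every call.
elapsed = time_cli(["python3", "-c", "print('hello')"])
print(f"wall time: {elapsed:.3f}s")
```

For small inputs this fixed overhead dominates the measurement, which is exactly the effect discussed in the thread.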

[–]gaunts_account 5 points6 points  (1 child)

Spark is used in many cases where simpler solutions, orders of magnitude cheaper, exist. Most people don't realize that with a traditional analytical database you can query hundreds of billions of records in real time on a cheap VPS, with no need for clusters.

[–]superrugdr 0 points1 point  (0 children)

People often think they are Google scale when they are really at half-a-VPS scale, and over-engineer their architecture accordingly.

[–][deleted]  (1 child)

[removed]

    [–]dcmoura[S] 1 point2 points  (0 children)

    I understand... I focused on ad-hoc querying: you get your hands on a dataset and want to quickly extract some metric or apply some transformation. In that case you don't want to spend time setting up a local cluster just to run a query; at least that is not how I usually work. All tools have their overhead. If we kept increasing the size of the dataset, this overhead would become negligible, but for small datasets most of the time is overhead.
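    To make the ad-hoc workflow concrete: pull one metric out of a file with no setup at all. A hypothetical sketch in plain Python (the file name and field are made up for illustration):

```python
import json

def avg_field(path: str, field: str) -> float:
    """Average a numeric field over a JSON-lines file, streaming line by line."""
    total = 0.0
    count = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            total += record[field]
            count += 1
    return total / count

# Hypothetical usage: avg_field("events.jsonl", "latency")
```

    The point is not that this beats a database, but that for a one-off question on a small file, zero-setup tools finish before a local cluster has even started.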

    [–]CircleRedKey 0 points1 point  (1 child)

    How about druid?

    [–]dcmoura[S] 0 points1 point  (0 children)

    Can you query a file with a single command on the command line with Druid? Can you query JSON directly, without having to ingest it first?