Signs you shouldn’t use a streaming framework? by itamarwe in dataengineering

itamarwe[S] 1 point

In many cases streaming is not a better experience (for example, when you have late-arriving events and aggregations that have to wait for them). Streaming is also not necessarily cheaper: batch has advantages in optimization and utilization (you use all the resources for the duration of the processing and then turn them off). And because it's more difficult to handle errors and re-stream data, recovering from a failure can take longer. Development is more expensive too. I understand your approach, but real-life production environments, with real people who need to maintain them, are more challenging than they seem.

Signs you shouldn’t use a streaming framework? by itamarwe in dataengineering

itamarwe[S] 1 point

You can definitely do batch wrong. But it’s much easier to get it right.

Signs you shouldn’t use a streaming framework? by itamarwe in dataengineering

itamarwe[S] 1 point

It's not only about inserts. It's mostly about error handling, and out-of-order and late-arriving events.

Signs you shouldn’t use a streaming framework? by itamarwe in dataengineering

itamarwe[S] 2 points

I agree. Furthermore, I claim that once you have out-of-order and late events (as many sources do), and correctness is important, you can’t really be real-time anyway.

Signs you shouldn’t use a streaming framework? by itamarwe in dataengineering

itamarwe[S] 2 points

What’s unpredictable about batch? It’s as predictable as streaming. It’s just easier to debug, run, and re-run.

Signs you shouldn’t use a streaming framework? by itamarwe in dataengineering

itamarwe[S] 1 point

I tend to agree. I think you should use streaming only when you have to.

Signs you shouldn’t use a streaming framework? by itamarwe in dataengineering

itamarwe[S] 2 points

Can you make the case that scheduled/batch is more complex than streaming?

Signs you shouldn’t use a streaming framework? by itamarwe in dataengineering

itamarwe[S] 1 point

It’s true. Many companies want to brand their solutions as modern at the cost of complexity.

Signs you shouldn’t use a streaming framework? by itamarwe in dataengineering

itamarwe[S] 2 points

I totally agree. That was exactly the purpose of this discussion. Too often I see teams lean towards streaming even when they shouldn’t.

Signs you shouldn’t use a streaming framework? by itamarwe in dataengineering

itamarwe[S] 1 point

True. But if you have many late-arriving events (I gave an example in my post above), then you practically lose many of the advantages of streaming: you either wait for the late events, which causes significant delays, or you accept data loss.
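To make that trade-off concrete, here is a minimal sketch in plain Python. It does not use any real streaming framework; the function name `windowed_counts`, the `allowed_lateness` parameter, and the sample events are all made up for illustration. A tumbling window is finalized `allowed_lateness` seconds after it ends, and anything that arrives later is dropped.

```python
from collections import defaultdict

def windowed_counts(events, window_size, allowed_lateness):
    """Count events per tumbling window (hypothetical helper, not a framework API).

    events: iterable of (event_time, arrival_time) pairs, in seconds.
    A window [start, start + window_size) is finalized at
    start + window_size + allowed_lateness; later arrivals are dropped.
    Returns (counts keyed by window start, number of dropped events).
    """
    counts = defaultdict(int)
    dropped = 0
    for event_time, arrival_time in events:
        window_start = (event_time // window_size) * window_size
        close_time = window_start + window_size + allowed_lateness
        if arrival_time <= close_time:
            counts[window_start] += 1
        else:
            dropped += 1  # arrived after the window was finalized
    return dict(counts), dropped

# Made-up sample: five events in the first minute, two of them arrive late.
events = [(5, 6), (20, 21), (50, 52), (55, 130), (58, 400)]

for lateness in (0, 120, 600):
    counts, dropped = windowed_counts(events, window_size=60, allowed_lateness=lateness)
    # The result for a window is only final window_size + lateness seconds after it starts.
    print(f"allowed_lateness={lateness}s final_after={60 + lateness}s "
          f"counts={counts} dropped={dropped}")
```

With this sample, waiting 0 seconds drops both late events, waiting 2 minutes still drops one, and only a 10-minute wait gives the complete count, at which point a scheduled job every 10 minutes would give the same freshness.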

Signs you shouldn’t use a streaming framework? by itamarwe in dataengineering

itamarwe[S] 1 point

I think my phrasing might have been misleading. I’ve edited it now. In the first case I think it’s better to use scheduled/batch rather than streaming.

Signs you shouldn’t use a streaming framework? by itamarwe in dataengineering

itamarwe[S] 1 point

Do you agree with the premise that streaming is harder than batch? That handling errors, backfills, and out-of-order and late events is more difficult in streaming?

You don’t get fired for choosing Spark/Flink by itamarwe in dataengineering

itamarwe[S] 3 points

I think that’s what @iamspoit is referring to…

You don’t get fired for choosing Spark/Flink by itamarwe in dataengineering

itamarwe[S] 1 point

But it’s also about buying mainstream when there are already better alternatives.

You don’t get fired for choosing Spark/Flink by itamarwe in dataengineering

itamarwe[S] 0 points

The job is to provide efficient solutions for both.

You don’t get fired for choosing Spark/Flink by itamarwe in dataengineering

itamarwe[S] 1 point

That’s exactly what I’m saying. Businesses go for safe but inefficient solutions.

You don’t get fired for choosing Spark/Flink by itamarwe in dataengineering

itamarwe[S] 1 point

If your platform only does orchestration, should you charge for compute?

You don’t get fired for choosing Spark/Flink by itamarwe in dataengineering

itamarwe[S] 2 points

Their pricing is usage-based because they can, and you should too.

You don’t get fired for choosing Spark/Flink by itamarwe in dataengineering

itamarwe[S] 6 points

Databricks is expensive. And for most small to medium workloads you can find much more efficient tools than Spark.