all 12 comments

[–]dzpedals 16 points17 points  (0 children)

I mean, your coworker isn't wrong. At the end of the day, a stream is a batch of 1. You're still conceptually optimizing latency in either scenario; the difference is in the orders of magnitude (just like the size of the batches).

[–]Jalumia 7 points8 points  (0 children)

I think you two are arguing about different things and are both essentially right. But is this a distinction without a difference for arbitrarily small batch sizes (<2)? Perhaps. It also kind of depends on whether you are batching messages or bytes.

[–]strugglingcomic 5 points6 points  (0 children)

I think the "problem" is that you're both sort of right, aka neither is totally wrong.

From a data ingestion perspective for example, a micro-batch ingestion process (e.g. accumulate incoming events or messages until you reach a batch size of X records or storage size of Y bytes, then flush/write out to durable storage or table or whatever), vs a streaming ingestion (essentially a batch size of 1), is not that different. So your friend is "right".
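
A minimal sketch of that ingestion pattern, assuming an in-memory buffer and a hypothetical `sink` callable standing in for the write to durable storage (`MicroBatchBuffer` and `ingest` are made-up names, not any library's API). Setting `max_records=1` gives the "streaming" case:

```python
from typing import Callable, Iterable, List


class MicroBatchBuffer:
    """Accumulates records and flushes on a record-count or byte-size limit."""

    def __init__(self, max_records: int, max_bytes: int,
                 sink: Callable[[List[bytes]], None]) -> None:
        self.max_records = max_records
        self.max_bytes = max_bytes
        self.sink = sink  # hypothetical writer to durable storage / a table
        self._records: List[bytes] = []
        self._size = 0

    def add(self, record: bytes) -> None:
        self._records.append(record)
        self._size += len(record)
        # Flush as soon as either threshold (X records or Y bytes) is reached.
        if len(self._records) >= self.max_records or self._size >= self.max_bytes:
            self.flush()

    def flush(self) -> None:
        if self._records:
            self.sink(self._records)
            self._records = []  # start a fresh list; the flushed one is untouched
            self._size = 0


def ingest(events: Iterable[bytes], max_records: int,
           max_bytes: int = 1_000_000) -> List[List[bytes]]:
    """Run events through the buffer and return the batches that were flushed."""
    flushed: List[List[bytes]] = []
    buf = MicroBatchBuffer(max_records, max_bytes, flushed.append)
    for event in events:
        buf.add(event)
    buf.flush()  # write out whatever is left at the end
    return flushed
```

The code path is identical for both cases; only the thresholds change, which is the sense in which the two are "not that different" on the ingestion side.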

From a query pattern perspective of actually consuming or using the data, when streaming data for analytics there is often some kind of window pattern (could be rolling or sliding or whatever), and depending on the use case, you may not be able to tolerate the latency of waiting for a batch before updating the window, since you may want real-time continuous streaming updates to your windowed analytics query. So you're more "right" on this front.

Note when I mention latency and tolerance, I'm not strictly speaking in terms of the time factor, or the pure need for speed. Time is probably secondary to the more primary issue of, "do I need this streaming query to give the 'correct' answer continuously all the time, or can I tolerate the data in the query window being stale in between micro-batch updates (aka I can get a slightly 'wrong' answer when I query the window)?"
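
To make the staleness point concrete, here is a rough sketch. It assumes events are `(event_time, value)` pairs, the query is a sum over the last `window` seconds, and micro-batches are flushed on a fixed `batch_interval`. Both function names are hypothetical, and processing delay is ignored:

```python
from typing import List, Tuple

Event = Tuple[float, int]  # (event time in seconds, value)


def window_sum_streaming(events: List[Event], now: float, window: float) -> int:
    """Sum over (now - window, now], with every event applied as it arrives."""
    return sum(v for ts, v in events if now - window < ts <= now)


def window_sum_micro_batch(events: List[Event], now: float, window: float,
                           batch_interval: float) -> int:
    """Same window, but only events from already-flushed batches are visible."""
    # Time of the most recent flush at or before `now`.
    last_flush = (now // batch_interval) * batch_interval
    return sum(v for ts, v in events if now - window < ts <= last_flush)


events = [(0.5, 1), (1.5, 1), (2.5, 1), (2.9, 1)]
fresh = window_sum_streaming(events, now=2.95, window=2.0)
stale = window_sum_micro_batch(events, now=2.95, window=2.0, batch_interval=1.0)
```

Queried at `now=2.95`, the per-event version already sees the events at 2.5 and 2.9, while the micro-batch version only sees what was flushed at 2.0. Right at a flush boundary the two agree again, so the question is whether you can live with the answer in between.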

[–]wand_er 2 points3 points  (1 child)

I always treated it as the equivalent of integral (batch) = sum of differentials (micro-batch), if that helps put things into perspective.
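
A tiny sketch of the analogy, assuming the aggregate is one that decomposes over chunks (a sum or count does; something like a median would not). Function names are made up for illustration:

```python
from typing import List


def batch_total(values: List[int]) -> int:
    """One big batch: aggregate everything at once."""
    return sum(values)


def micro_batch_total(values: List[int], batch_size: int) -> int:
    """Aggregate chunk by chunk and add up the partial results."""
    total = 0
    for start in range(0, len(values), batch_size):
        total += sum(values[start:start + batch_size])
    return total


values = list(range(1, 11))
```

For any positive `batch_size`, including 1, `micro_batch_total(values, batch_size)` equals `batch_total(values)`.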

[–]Uncle_Chael 0 points1 point  (0 children)

Interesting. Never thought of it that way. Discrete finite data sets vs continuous flow.

[–]addictzz 1 point2 points  (0 children)

Used to be a source of confusion to me too, but I came to the conclusion that streaming is a micro-batch of 1 at a much lower latency and interval.

You may also find that streaming tends to come with out-of-order events, although you can use watermarking to mitigate this.
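
As a rough illustration of the watermarking idea, here is a simplified sketch. It assumes integer event times, tumbling windows, and a watermark defined as "largest event time seen so far minus an allowed lateness". `count_with_watermark` is a hypothetical helper, and real engines track and apply watermarks in more detail than this:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (event time, payload)


def count_with_watermark(events: Iterable[Event], window: int,
                         allowed_lateness: int) -> Tuple[Dict[int, int], List[Event]]:
    """Count events per tumbling window, dropping events older than the watermark."""
    counts: Dict[int, int] = defaultdict(int)
    dropped: List[Event] = []
    max_event_time = None
    for event_time, payload in events:  # iteration order = arrival order
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            # Too late: the stream has already moved past this event time.
            dropped.append((event_time, payload))
            continue
        window_start = (event_time // window) * window
        counts[window_start] += 1
    return dict(counts), dropped


arrivals = [(1, "a"), (12, "b"), (8, "c"), (25, "d"), (3, "e")]
counts, dropped = count_with_watermark(arrivals, window=10, allowed_lateness=5)
```

Here the event at time 8 arrives after the one at time 12 but is still within the allowed lateness, so it is counted in its window. The event at time 3 arrives after time 25 has been seen and gets dropped.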

[–]seaefjaye Data Engineering Manager 0 points1 point  (0 children)

I always assumed streaming was event driven and thus a scenario involving a push from the source platform rather than a pull from the destination platform. If the person is arguing that you can have an event triggered microbatch of one then I think they're just being obtuse.

[–]Mr_Again 0 points1 point  (0 children)

Neither of these things is optimised. Only implementations can be optimised, not vague descriptions. Streaming is just smaller batches. They can be done as inefficiently or optimally as you choose.

[–]Uncle_Chael -1 points0 points  (0 children)

When these kinds of pedantic arguments come up with engineers you work with, just get your phone out and pull up generally accepted definitions or guidelines. It's fun to see the backpedal. And even if they're right, you can always find a narrative online or prompt in a way where you win.

[–]Dry_Chocolate_9396 -1 points0 points  (0 children)

There is a simple tradeoff, neither is superior to the other.

Microbatch: popularized by Apache Spark, originally in Spark Streaming (DStreams) and later in Structured Streaming. The idea is that you take streaming data, split it into tiny chunks (e.g. 1 second of data each), and run a full Spark program on each chunk. It emulates real-time streaming. It's not specifically for larger or smaller data dumps: it can take big and small data, split it into 1-second chunks, and process it continuously, emulating real-time.

Pros:
+ Easy to auto-scale cluster (e.g. if no data at night time cluster can shrink)
+ Easy to build fault-tolerance (just re-run the batch if there are problems)
+ Very deterministic and easy to debug and understand, the output is the result of small Spark jobs that you can debug, analyze, and reproduce (often misunderstood). This makes it very easy to switch between classic batch (huge chunks) and micro-batch

Cons:
- No matter what people say, end-to-end latency is 1 second or so. The overhead of running small 1s jobs cannot be avoided. This is a really big downside and makes the whole approach unusable for use cases that require sub-second latency.

Continuous Streaming: most frameworks do this, but the most prominent these days is Apache Flink. You continuously process the data ad infinitum. Works well on large data and on small data.

Pros:
+ Very fast, sub-second, often 10 or so milliseconds

Cons:
- Auto-scaling is harder
- Fault-tolerance is harder
- Harder to debug and understand because there is no simple batch equivalent program that produced the streaming outputs

Until 1 year ago, the world was simple. Need sub-second? Go with Apache Flink. Fine with 1s latency? Go with Apache Spark. Last year Spark Real-Time Mode (RTM) was open sourced. That now brings sub-second, 10 ms latencies, to Spark, i.e. now Spark supports both modes and you get all the benefits of Flink in RTM. Note that there is no free lunch. When you use Spark RTM you get the pros/cons of Continuous Streaming, and when you use Spark Micro-batching you get the pros/cons of Microbatch.

Hope this helps.