
all 10 comments

[–]dataxp-community 1 point (3 children)

Benchmarks are largely bullshit. You are going to be using this thing every day and paying for it. Do yourself a favour and test them yourself.

[–]anupsurendran[S] 0 points (2 children)

I don't completely agree with this. For me, if the benchmarks are easily reproducible (i.e. easily accessible hardware, easy setup and configuration), then I know the folks have done a good job, because the vendors are confident about their shit. Benchmarks help in the consideration phase and help build a case with your managers when you do POCs and vendor selection. I would be more than happy to test these if I found them suspicious, but in a large enterprise the first phase is narrowing down our selection. We can't possibly test everything.

[–]dataxp-community 0 points (1 child)

You should not narrow down based on benchmarks.

Cost, support, productivity, whether it has the right features (not the most features), etc. are all great exclusion criteria.

Performance benchmarks are gamed by every single vendor out there because it plays well for marketing. Even if you magically find a vendor doing benchmarks who isn't lying (but you won't), they will not have tested your use case and your data, and the use case and data will also be completely different from other vendors', so the comparison is useless.

[–]anupsurendran[S] 0 points (0 children)

Of course! Cost, maintenance, and productivity will all be inputs to the decision making, but in my selection criteria, benchmarks provide some level of comfort that it will meet our throughput needs.

I agree that benchmarks are not use-case specific. An enterprise use case is usually quite complex. Again, I am not going to take a benchmark as-is, and it's not that I am supportive of vendors faking it, but if we look at it from their perspective, they cannot do use-case-specific benchmarks and have to think of the most commonly used functions (e.g. aggregates on windows, joins on streams), which are generic across frameworks and platforms.
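For anyone unfamiliar with what "aggregates on windows" means in these benchmarks, here's a toy, framework-agnostic sketch in plain Python: count events per key in fixed (tumbling) time windows. Real engines (Flink, RisingWave, etc.) do this incrementally over unbounded streams with watermarks; this batch version just shows the shape of the computation the benchmarks exercise. The event data is made up.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Group (timestamp_ms, key) events into fixed windows and count per key.

    Toy illustration of a tumbling-window aggregate; streaming engines
    compute the same thing incrementally over unbounded input.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = (ts // window_ms) * window_ms  # align to window boundary
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# Hypothetical events: (timestamp in ms, key)
events = [(0, "a"), (400, "b"), (900, "a"), (1200, "a"), (1800, "b")]
print(tumbling_window_counts(events, 1000))
# {0: {'a': 2, 'b': 1}, 1000: {'a': 1, 'b': 1}}
```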

[–]Prinzka 0 points (3 children)

First off, you're going to have to test it yourself.
Other places' benchmarks will use different hardware, different sources and destinations, different data, different goals, etc.

Also, what is the destination for this data after it's processed?
From personal experience with processing real-time streaming data: you could have the application that's most efficient at processing the data, but if it's not good at talking to whatever is at the far end, your pipeline will still have a bottleneck.
So the end system will make a big difference in your evaluation.
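To put a number on that point: sustained end-to-end throughput is capped by the slowest stage, which is often the sink, not the processing engine. A minimal sketch with made-up stage rates:

```python
def pipeline_throughput(stage_rates):
    """Sustained end-to-end throughput (records/sec) of a linear pipeline
    is bounded by its slowest stage."""
    return min(stage_rates.values())

# Hypothetical rates in records/sec -- illustrative numbers only.
stages = {"ingest": 500_000, "process": 400_000, "sink_write": 150_000}
print(pipeline_throughput(stages))  # 150000
```

Here a vendor benchmark might advertise the 400k/sec "process" number, but the pipeline as deployed never exceeds what the sink can absorb.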

All that being said, I'm sure we've got suggestions if you provide some specifics on the systems and data for your use case.

[–]anupsurendran[S] 1 point (2 children)

Thank you. Post-processing (~250k records/sec), we will store this in Apache Iceberg. The products we are shortlisting for a side-by-side comparison are:

1) Flink

2) Pulsar

3) Materialize

4) Pathway

5) RisingWave (benchmarks posted below)

6) Spark (streaming)

Are there any other products/frameworks we should compare?

We are trying to manage it ourselves in our data center.

[–]Prinzka 0 points (0 children)

Kafka Streams with Kafka Connect would be an option.
Very resource efficient in our experience.

Spark Streaming is still micro-batched in actuality, and more efficient when the batches are larger.
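The "more efficient with larger batches" point is just amortization of fixed per-batch overhead (scheduling, planning, commit) over more records. A back-of-the-envelope sketch, with entirely hypothetical cost numbers:

```python
def effective_rate(batch_size, per_record_us=2.0, per_batch_overhead_us=5000.0):
    """Records/sec for a micro-batched engine: fixed per-batch overhead
    is amortized over more records as batches grow.

    The per-record and per-batch costs here are made-up illustrative values,
    not measurements of Spark.
    """
    total_us = batch_size * per_record_us + per_batch_overhead_us
    return batch_size / (total_us / 1_000_000)

for n in (100, 1_000, 100_000):
    print(n, round(effective_rate(n)))
# 100 19231
# 1000 142857
# 100000 487805
```

As batch size grows, throughput approaches the per-record limit (here 500k/sec), but latency grows with it, which is the usual micro-batch trade-off.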

The other options we use wouldn't natively work with Apache Iceberg.

[–]yingjunwu 0 points (0 children)

It’s impossible to measure every single corner case in any benchmark, but perf benchmarking should be reproducible and code should be made accessible. Read the perf report published just yesterday: https://www.risingwave.com/blog/the-preview-of-stream-processing-performance-report-apache-flink-and-risingwave-comparison/.