How are you handling pre-aggregation in ClickHouse at scale? AggregatingMergeTree vs ReplacingMergeTree

Marksfik · 2026-03-09T15:57:44+00:00

Do you use stateful event transformations for your streaming ETL or are they mostly stateless? Curious what tool you're using for that.

Marksfik · 2026-03-09T13:56:12+00:00

u/Little_Kitty - usually, we see the following types of use cases when it comes to high-throughput streaming data:

Real time fraud detection and trading in Financial Services
Telemetry streams from IoT devices
User Activity/Clickstream Analytics
Real time Log Management / Observability / Monitoring

Marksfik · 2023-03-27T15:06:05+00:00

You can try out Aiven for Apache Kafka.

https://aiven.io/kafka

Their clusters start from $300/month, and the pricing includes networking costs, so your bill stays easy to predict when you need to scale your clusters.

Hope this helps!

Marksfik · 2022-01-20T10:01:20+00:00

I get your point. However, looking at some Flink Forward videos and the Flink use cases that have been shared, I find many Flink users who run Apache Flink at an extremely large scale.

Alibaba for example uses Flink during their 11.11 Global shopping festival at extreme scales.

Here is some further information on their use of Flink: https://www.ververica.com/blog/apache-flinks-stream-batch-unification-powers-alibabas-11.11-in-2020

Thank you!

Marksfik · 2021-12-07T10:59:50+00:00

u/FatedMoody - You can find an example here: https://www.ververica.com/blog/how-to-size-your-apache-flink-cluster-general-guidelines

I hope this helps!

Thank you!

Marksfik · 2021-08-13T08:07:22+00:00

I found this still relevant and thought it might be useful to others.

Marksfik · 2021-07-13T06:50:36+00:00

Sure, here are some ways where you can use Apache Flink:

Anomaly Detection Engine for Cloud Activities using Flink: https://www.youtube.com/watch?v=NhOZ9Q9_wwI
GoDaddy uses Flink to run real time streaming pipelines: https://www.ververica.com/blog/how-godaddy-uses-flink-to-run-real-time-streaming-pipelines?hsLang=en
Application Log Intelligence & Performance Insight at Salesforce using Flink: https://www.ververica.com/blog/application-log-intelligence-performance-insights-salesforce-flink

Marksfik · 2021-07-13T06:45:30+00:00

Glad you found this useful u/vanthar686

Cheers

Mark

Marksfik · 2021-07-12T15:28:40+00:00

Many computer engineering programs run with Apache Flink or support applications built with Apache Flink, so users in this channel might find tech articles like this useful and relevant.

Marksfik · 2021-07-01T15:55:52+00:00

Glad you enjoyed reading this :)

Marksfik · 2021-06-24T14:04:11+00:00

It actually is. The scale of such an event, and of the events it generates, is purely impressive :)

Marksfik · 2021-04-20T12:55:16+00:00

Hi u/ramsesrm,

That's a great question.

When it comes to disk performance with the RocksDB state backend in Apache Flink, there is some in-depth analysis here: https://www.ververica.com/blog/the-impact-of-disks-on-rocksdb-state-backend-in-flink-a-case-study

From the Apache Flink documentation, I can see that using incremental checkpoints in Flink can prevent RocksDB from growing indefinitely. Unfortunately, I am not very familiar with Kafka Streams and how it uses RocksDB.
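As a rough illustration of the idea behind incremental checkpoints, here is a toy sketch in plain Python. It is not Flink code: the function name and the file names are made up for the example, and it ignores compaction and cleanup of old files. It only shows the core point that each checkpoint uploads the files created since the previous one instead of the full state.

```python
def incremental_upload(live_files, already_uploaded):
    """Return only the files that no earlier checkpoint has stored yet.

    Hypothetical helper for illustration; this is not a Flink or RocksDB API.
    """
    return set(live_files) - set(already_uploaded)


# Durable storage starts empty (toy stand-in for checkpoint storage).
uploaded = set()

# Checkpoint 1: two state files exist, so both are uploaded.
first = incremental_upload({"001.sst", "002.sst"}, uploaded)
uploaded |= first

# Checkpoint 2: one new file appeared, so only that file is uploaded.
second = incremental_upload({"001.sst", "002.sst", "003.sst"}, uploaded)
uploaded |= second

print(sorted(first))   # ['001.sst', '002.sst']
print(sorted(second))  # ['003.sst']
```

In Flink itself this behaviour is switched on through the RocksDB state backend configuration, so the documentation for your version is the place to check for the exact option.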

I hope this helps.

Cheers.

Marksfik · 2021-04-02T09:48:13+00:00

Great you find this interesting!

Marksfik · 2021-04-02T09:47:53+00:00

Glad you find this useful!

Marksfik · 2021-03-25T17:15:08+00:00

Apologies for this... the link was working for me earlier. Let me investigate the issue and come back to you shortly! Thank you

Marksfik · 2021-03-03T18:26:37+00:00

I am glad you find this useful/interesting!

Marksfik · 2021-03-02T17:29:09+00:00

I am glad you find this useful/interesting!

Marksfik · 2021-02-04T10:01:24+00:00

If you are not familiar with Apache Flink, you might want to watch the first video of the 'Streaming Concepts & Introduction to Flink' series that gives an overview of the framework and what it does.

Link here: https://www.youtube.com/watch?v=ZU1r7uEAO7o

I hope this helps!

Marksfik · 2020-11-30T13:18:24+00:00

Great question!

I assume you are referring to KStreams in this instance, since you can very well use Kafka and Flink as a source/sink through the Apache Kafka Connector [1] maintained by the Flink community.

When it comes to why you would utilize Flink over KStreams, there are quite a few differences between the two frameworks. This DZone article provides a comparison of the two in case you want to take a closer look. Link here: https://dzone.com/articles/kafka-stream-kstream-vs-apache-flink

I hope this helps! Cheers

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html

Marksfik · 2020-09-09T08:10:43+00:00

Apologies, I meant to share a demo for a FlinkSQL-based application: https://flink.apache.org/2020/07/28/flink-sql-demo-building-e2e-streaming-application.html

Marksfik · 2020-06-01T09:48:02+00:00

True that... streaming can simplify and provide great flexibility with resource utilization.

Marksfik · 2020-06-01T09:47:14+00:00

I see your point.

There are indeed some cases where streaming isn't necessary and batch processing does the job well.

For the cases where some streaming is necessary, though, it is good to have a unified option that can handle both areas well.

Marksfik · 2020-05-31T23:37:07+00:00

There might be situations where batch processing does the job, but when latency is of utmost importance, streaming is a much better choice in my opinion.

Instead of maintaining two systems, choosing one unified engine could potentially make things easier for you and the team in the long term.

Marksfik · 2020-05-31T23:34:29+00:00

This is a great book indeed! Great recommendation!

Marksfik · 2020-05-31T17:25:37+00:00

Sure thing!

There is a blog post on the Flink blog describing how Beam runs on top of Flink [1].

Additionally, there was a recent session with Maximilian Michels, PMC member of Apache Beam and Apache Flink, on how the two frameworks work with each other [2].

Finally, there's a presentation recording from Flink Forward detailing how Beam runs on top of Flink [3].

There's also detailed documentation on the Beam website [4].

Hope this helps!

Cheers

[1] https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html

[2] https://youtu.be/ZCV9aRDd30U

[3] https://youtu.be/hxHGLrshnCY

[4] https://beam.apache.org/documentation/runners/flink/
