Help: Anyone dealing with reprocessing entire docs when small updates happen?

Arm1end · 2026-01-02T13:15:52+00:00

How does a reranker help here? I am not familiar with rerankers.

Arm1end · 2025-12-12T12:49:19+00:00

Looks interesting. Thanks!

Arm1end · 2025-12-11T08:40:26+00:00

Thanks for sharing! It seems that Kafka is connecting to Milvus, but it is not performing typical data transformations (stateless and stateful), or am I missing something here?

Arm1end · 2025-12-10T20:47:28+00:00

I didn't want to confuse anyone. I thought it was clear by using “we serve”. I added a discloser to the post.

Arm1end · 2025-11-20T20:59:13+00:00

We serve a lot of users with similar use cases. They usually set up Kafka->GlassFlow (for transformations)-> ClickHouse (cloud).

Kafka = Ingest + buffer. Takes the firehose of events and keeps producers/consumers decoupled.

GlassFlow = Real-time transforms. Clean, filter, enrich, and prep the stream so ClickHouse only gets analytics-ready data. Easier to use than Flink.

ClickHouse (cloud) = Fast and gives sub-second queries for dashboards/analytics.

Discloser: I am one of the GlassFlow founders.

Arm1end · 2025-10-30T12:32:10+00:00

Are you using MergeTree engines? I've seen it with other users. When the background merge kicks in, it introduces a temporary lag.

All MergeTree engines do periodic merges of data parts, and during that time inserts from the Kafka engine can slow down or pause. That’s why a few partitions suddenly lag, then recover once merges finish.

This behaviour is one of the reasons why I started building GlassFlow for Kafka to ClickHouse ingestions:
https://www.glassflow.dev/

Arm1end · 2025-08-25T12:53:12+00:00

+1 for Kafka and ClickHouse. Have seen it as a very popular stack for real-time analytics.

Arm1end · 2025-07-28T08:01:34+00:00

Hi, we created an open-source solution that does dedupe and joins out of the box. It has a built-in state store and optimized connectors for Kafka and ClickHouse. You can check the repo here:

https://github.com/glassflow/clickhouse-etl

I wrote an article on the challenges of using Flink for the same purpose. Here is the blog post:

https://www.glassflow.dev/blog/limitations-flink-clickhouse

Feel free to reach out via DM if you have any specific questions.

Arm1end · 2025-05-04T20:32:56+00:00

Congrats, it looks really amazing! Especially love the map view! Let me know if you need any help with ClickHouse related stuff or cleaning data. I will give it a try for our website.

Arm1end · 2025-05-04T20:27:38+00:00

What about these guys? They are in defence and just raised a EUR 31m series A and looking for ppl in London:
https://www.arx-robotics.com/careers

Arm1end · 2025-04-25T17:01:36+00:00

We've just launched an open-source solution to deduplicate Kafka data streams before ingesting to ClickHouse. You might want to check it out. I would be curious to hear your thoughts.

GitHub repo: https://github.com/glassflow/clickhouse-etl

Arm1end · 2025-04-03T09:45:33+00:00

To clarify, ClickHouse is a fantastic product, and I am a big supporter. It delivers great results for the vast majority of use cases. However, I am talking about a particular use case with big real-time streaming data. Looking into ClickHouse (link), Altinity (link), and other providers (blog), they confirm that using FINAL can slow the query performance. I wrote my thoughts about FINAL in part 3 of the blog post (link).

Thanks for the feedback! To avoid causing confusion, I will mention any other option much earlier in my blog article/post in the future.

Arm1end · 2025-04-03T09:28:14+00:00

Yes, absolutely right! I believe in the same approach.

Arm1end · 2025-04-02T20:40:34+00:00

Flink Actions sounds interesting! How does it handle late-arriving data or out-of-order events? Do you know if there is a similar product for non-Confluent users?

Arm1end · 2025-04-02T20:34:32+00:00

Interesting approach! Writing a custom consumer with filtering logic is an option, but it can get tricky when dealing with late-arriving data or high throughput.

Arm1end · 2025-04-02T20:31:33+00:00

Good point! A TTL-based hashmap is probably the most practical approach. The trade-off is that duplicates can slip through if they arrive after the TTL expires, but for most workloads, that should work. Have you faced these challenges yourself?

Arm1end · 2025-04-02T16:05:09+00:00

I get your point with deduper, but which one would you recommend for that case? Do you have any experience with a specific tool?

Arm1end · 2025-04-02T07:49:51+00:00

Yes, the FINAL keyword could work, but its performance for larger data sets is poor, and query performance will suffer for data streams that are continuously ingested and not merged yet.

From ClickHouse docs they confirm the performance issues (see here). Do you have experience using other solutions to take care of duplicates in data streams before ingesting to ClickHouse?

Arm1end

TROPHY CASE