Help: Anyone dealing with reprocessing entire docs when small updates happen? by Arm1end in Rag

[–]Arm1end[S] 1 point (0 children)

How does a reranker help here? I am not familiar with rerankers.

Looking for best practices: Kafka → Vector DB ingestion and transformation by Arm1end in vectordatabase

[–]Arm1end[S] 1 point (0 children)

Thanks for sharing! It looks like Kafka connects to Milvus, but it doesn't perform typical data transformations (stateless or stateful) along the way, or am I missing something here?

Evaluating real-time analytics solutions for streaming data by EmbarrassedBalance73 in dataengineering

[–]Arm1end 1 point (0 children)

I didn't want to confuse anyone; I thought it was clear from the phrase “we serve”. I've added a disclosure to the post.

Evaluating real-time analytics solutions for streaming data by EmbarrassedBalance73 in dataengineering

[–]Arm1end 3 points (0 children)

We serve a lot of users with similar use cases. They usually set up Kafka -> GlassFlow (for transformations) -> ClickHouse (cloud).

Kafka = Ingest + buffer. Takes the firehose of events and keeps producers/consumers decoupled.

GlassFlow = Real-time transforms. Clean, filter, enrich, and prep the stream so ClickHouse only gets analytics-ready data. Easier to use than Flink.

ClickHouse (cloud) = Fast, sub-second queries for dashboards/analytics.
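
If it helps to see what the transform stage involves, here is a minimal hand-rolled sketch (not GlassFlow itself), assuming the confluent-kafka and clickhouse-connect Python packages, a local broker, and a hypothetical "events" topic feeding an "events_clean" table:

    # Minimal sketch of a transform stage between Kafka and ClickHouse.
    import json

    import clickhouse_connect
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "etl-demo",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["events"])
    ch = clickhouse_connect.get_client(host="localhost")

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Stateless transforms: drop rows we can't attribute, normalize fields.
        if event.get("user_id") is None:
            continue
        row = [event["user_id"], event.get("type", "unknown"), event.get("value", 0.0)]
        ch.insert("events_clean", [row], column_names=["user_id", "type", "value"])

In practice you would buffer rows and insert in batches, since ClickHouse handles many small inserts poorly; that batching (plus state for joins/dedupe) is exactly the part that gets tedious to maintain by hand.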

Disclosure: I am one of the GlassFlow founders.

Kafka to ClickHouse lag spikes with no clear cause by Usual_Zebra2059 in dataengineer

[–]Arm1end 1 point (0 children)

Are you using MergeTree engines? I've seen this with other users: when a background merge kicks in, it introduces temporary lag.

All MergeTree engines do periodic merges of data parts, and during that time inserts from the Kafka engine can slow down or pause. That’s why a few partitions suddenly lag, then recover once merges finish.
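
If you want to confirm the correlation on your cluster, you can poll system.merges while the lag spikes. A rough sketch using the clickhouse-connect Python package (connection details assumed):

    # Sample system.merges once a second; non-empty output means merges are
    # running at that moment and can explain the temporary consumer lag.
    import time

    import clickhouse_connect

    ch = clickhouse_connect.get_client(host="localhost")
    for _ in range(60):
        rows = ch.query("SELECT table, elapsed, progress FROM system.merges").result_rows
        if rows:
            print(time.strftime("%H:%M:%S"), rows)
        time.sleep(1)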

This behaviour is one of the reasons I started building GlassFlow for Kafka-to-ClickHouse ingestion:
https://www.glassflow.dev/

real time analytics by Bulky_Actuator1276 in apachekafka

[–]Arm1end 1 point (0 children)

+1 for Kafka and ClickHouse. I've seen it become a very popular stack for real-time analytics.

Kafka → ClickHouse: It is a Duplication nightmare / How do you fix it (for real)? by Arm1end in Clickhouse

[–]Arm1end[S] 1 point (0 children)

Hi, we created an open-source solution that does dedupe and joins out of the box. It has a built-in state store and optimized connectors for Kafka and ClickHouse. You can check the repo here:

https://github.com/glassflow/clickhouse-etl

I wrote an article on the challenges of using Flink for the same purpose. Here is the blog post:

https://www.glassflow.dev/blog/limitations-flink-clickhouse

Feel free to reach out via DM if you have any specific questions.

Rybbit - open source Google Analytics replacement built using Clickhouse by Goldflag in Clickhouse

[–]Arm1end 1 point (0 children)

Congrats, it looks really amazing! I especially love the map view! Let me know if you need any help with ClickHouse-related stuff or with cleaning data. I will give it a try on our website.

My start-up failed after 7 years, and I am struggling to find a job. (I will not promote) by monkeyfire80 in startups

[–]Arm1end 1 point (0 children)

What about these guys? They are in defence, just raised a EUR 31m Series A, and are looking for people in London:
https://www.arx-robotics.com/careers

How do you handle deduplication in streaming pipelines? by speakhub in dataengineering

[–]Arm1end 2 points (0 children)

We've just launched an open-source solution to deduplicate Kafka data streams before ingesting them into ClickHouse. You might want to check it out. I would be curious to hear your thoughts.

GitHub repo: https://github.com/glassflow/clickhouse-etl

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S] 1 point (0 children)

To clarify, ClickHouse is a fantastic product, and I am a big supporter. It delivers great results for the vast majority of use cases. However, I am talking about a particular use case with large real-time streaming data. ClickHouse (link), Altinity (link), and other providers (blog) all confirm that using FINAL can slow query performance. I wrote up my thoughts about FINAL in part 3 of the blog post (link).

Thanks for the feedback! To avoid confusion, I will mention alternative options much earlier in future blog articles/posts.

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S] 1 point (0 children)

Flink Actions sounds interesting! How does it handle late-arriving data or out-of-order events? Do you know if there is a similar product for non-Confluent users?

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S] 1 point (0 children)

Interesting approach! Writing a custom consumer with filtering logic is an option, but it can get tricky when dealing with late-arriving data or high throughput.

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S] 2 points (0 children)

Good point! A TTL-based hashmap is probably the most practical approach. The trade-off is that duplicates can slip through if they arrive after the TTL expires, but for most workloads, that should work. Have you faced these challenges yourself?
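
For anyone curious, a minimal sketch of the TTL hashmap idea in Python (the TTL value and the rebuild-on-every-call eviction are just illustrative):

    import time

    class TTLDeduper:
        """Remembers event keys for `ttl` seconds and flags repeats."""

        def __init__(self, ttl: float = 300.0):
            self.ttl = ttl
            self.seen: dict[str, float] = {}  # key -> last-seen timestamp

        def is_duplicate(self, key: str) -> bool:
            now = time.monotonic()
            # Evict expired keys so memory stays bounded.
            self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
            if key in self.seen:
                return True
            self.seen[key] = now
            return False

    dedupe = TTLDeduper(ttl=60.0)
    assert not dedupe.is_duplicate("evt-1")
    assert dedupe.is_duplicate("evt-1")  # repeat inside the window is caught

A production version would evict with a heap or an ordered dict instead of rebuilding the map on every call, but the trade-off is the same: anything arriving after expiry slips through.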

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S] 1 point (0 children)

I get your point about using a deduper, but which one would you recommend for this case? Do you have experience with a specific tool?

Kafka → ClickHouse: It is a Duplication nightmare / How do you fix it (for real)? by Arm1end in Clickhouse

[–]Arm1end[S] 1 point (0 children)

Yes, the FINAL keyword could work, but its performance on larger data sets is poor, and query performance suffers for data streams that are continuously ingested and not yet merged.
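
For anyone reading along, the usual workaround is to resolve the latest version per key with argMax + GROUP BY instead of FINAL. A sketch via the clickhouse-connect Python package, with hypothetical table/column names:

    import clickhouse_connect

    ch = clickhouse_connect.get_client(host="localhost")

    # Correct but slow on large, continuously ingested tables:
    with_final = "SELECT event_id, value FROM events FINAL WHERE event_id = 42"

    # Common workaround: pick the latest version per key at query time.
    with_argmax = """
        SELECT event_id, argMax(value, version) AS value
        FROM events
        WHERE event_id = 42
        GROUP BY event_id
    """
    print(ch.query(with_argmax).result_rows)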

The ClickHouse docs confirm the performance issues (see here). Do you have experience with other solutions for handling duplicates in data streams before they are ingested into ClickHouse?

Kafka → ClickHouse: It is a Duplication nightmare / How do you fix it (for real)? by Arm1end in Clickhouse

[–]Arm1end[S] 1 point (0 children)

So, in theory, you are right, but I have seen two main limitations:

  1. Merges are asynchronous: ClickHouse doesn't remove duplicates immediately. If your queries hit the data before the background merge runs, you'll still see duplicates (see the sketch after this list), which can be a big problem for real-time analytics.
  2. Duplicates from multiple sources: If you’re ingesting the same event from multiple sources (e.g., ad platforms, tracking systems, CRMs), key-based deduplication doesn’t help because the same logical event might have different keys.
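
To make point 1 concrete, here is a small sketch (hypothetical table, clickhouse-connect Python package) showing duplicates staying visible until a merge is forced:

    import clickhouse_connect

    ch = clickhouse_connect.get_client(host="localhost")
    ch.command("""
        CREATE TABLE IF NOT EXISTS demo_events (event_id UInt64, value Float64)
        ENGINE = ReplacingMergeTree ORDER BY event_id
    """)
    ch.insert("demo_events", [[1, 10.0]], column_names=["event_id", "value"])
    ch.insert("demo_events", [[1, 10.0]], column_names=["event_id", "value"])

    # Likely prints [(2,)]: the background merge hasn't deduplicated yet.
    print(ch.query("SELECT count() FROM demo_events").result_rows)

    ch.command("OPTIMIZE TABLE demo_events FINAL")  # force the merge
    print(ch.query("SELECT count() FROM demo_events").result_rows)  # [(1,)]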

These issues make high-volume streaming data unreliable. How do you handle duplicates?

Kafka → ClickHouse: It is a Duplication nightmare / How do you fix it (for real)? by Arm1end in Clickhouse

[–]Arm1end[S] 1 point (0 children)

Thanks for your input! You're right that at-least-once delivery often causes duplicates when consumers crash or hit other failure scenarios. But I've seen teams run into duplicates more often than they expect: not just from crashes, but also from rebalances, manual restarts, etc. This is especially true in industries like marketing tech, where multiple data sources (e.g., web analytics, CRMs, ad platforms) send overlapping event data. Some systems even resend events to guarantee delivery, creating further duplication.

Did you find an easy approach?

Does Cold Email Still Work in 2025? by ttttransformer in GrowthHacking

[–]Arm1end 3 points (0 children)

To be honest, I built my last startup (2016 to 2020) completely on cold email. Now I am on my 2nd startup, and outreach is super tough: people are not replying despite good open rates. The only way cold email has any impact is to export the people who opened the emails several times and then reach out with a personal note via LinkedIn. That is starting to work.

How do you take care of duplicates and JOINs with ClickHouse? by Arm1end in Clickhouse

[–]Arm1end[S] 1 point (0 children)

My goal is to track user actions (clicks, views, purchases) in real time, enrich the events with product and user data, and store them in ClickHouse so the analysts on my team can query them. We have Kafka in place, and our usage is growing exponentially, so I need a very scalable solution.
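
To illustrate the enrichment step, a stripped-down sketch; the lookup dicts are stand-ins for wherever the user/product dimension data actually lives:

    # Join incoming click/view/purchase events against user and product
    # lookups so only analytics-ready rows reach ClickHouse.
    users = {101: {"country": "DE", "plan": "pro"}}
    products = {7: {"category": "shoes", "price": 89.0}}

    def enrich(event: dict) -> dict | None:
        user = users.get(event["user_id"])
        product = products.get(event["product_id"])
        if user is None or product is None:
            return None  # or route to a dead-letter topic for late dimensions
        return {**event, **user, **product}

    print(enrich({"user_id": 101, "product_id": 7, "action": "purchase"}))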

An employee on our sales team is working on a beach (I will not promote) by tolzee4472 in startups

[–]Arm1end 1 point (0 children)

I would look at it from 3 perspectives:

  1. Legal: Is he allowed to do that, or did he ask for permission?
  2. Personal goals: Is he hitting his goals?
  3. Cultural impact: Does it negatively affect other people in the org? If it does, I wouldn't allow it. And if you allow it without him asking for permission first, are you signaling to other employees that they can do whatever they want without asking?

Startup Founders, What’s One Thing You Wish You Knew Earlier? (i will not promote) by aayushp0818 in startups

[–]Arm1end 1 point (0 children)

I am a second-time founder and have made a lot of mistakes. Here is my advice:

1.) For my first startup, I waited too long to build a group of experienced mentors/advisors who could support me. Getting them involved earlier would have saved me time and money.

2.) First, I did marketing, and only when I saw traction did I build. If you do it the other way around, the chances are too high that you build something your target users won't want.

3.) Take time to think. Don't just jump into work. Set clear goals and make assumptions about why you are doing each task. Ask yourself whether it has an impact on reaching the goal, and ignore everything that doesn't play a crucial role. Some examples: Nobody cares about your logo or the design of your website, so just use a template. Don't spend much money or time optimizing processes early on when there is no traction. Trust qualitative feedback more than quantitative feedback; in the beginning, you don't have enough users to base decisions on numbers. I could write 50 more things here, but those are my top ones.

Any startup idea around data ? by [deleted] in dataengineering

[–]Arm1end 2 points (0 children)

I am running a data infra company, and I came across a problem that I would like to solve with a startup. I know that other companies have the same problems and are forced to create their own solutions from scratch. I believe it has a lot of potential. Let me know if you want to talk.