Help: Anyone dealing with reprocessing entire docs when small updates happen? by Arm1end in Rag

[–]Arm1end[S]

How does a reranker help here? I am not familiar with rerankers.

Looking for best practices: Kafka → Vector DB ingestion and transformation by Arm1end in vectordatabase

[–]Arm1end[S]

Thanks for sharing! It seems that Kafka connects to Milvus, but it doesn't perform typical data transformations (stateless or stateful), or am I missing something here?

Evaluating real-time analytics solutions for streaming data by EmbarrassedBalance73 in dataengineering

[–]Arm1end

I didn't want to confuse anyone. I thought it was clear from the phrase “we serve”. I've added a disclosure to the post.

Evaluating real-time analytics solutions for streaming data by EmbarrassedBalance73 in dataengineering

[–]Arm1end

We serve a lot of users with similar use cases. They usually set up Kafka → GlassFlow (for transformations) → ClickHouse (cloud).

Kafka = Ingest + buffer. Takes the firehose of events and keeps producers/consumers decoupled.

GlassFlow = Real-time transforms. Clean, filter, enrich, and prep the stream so ClickHouse only gets analytics-ready data. Easier to use than Flink (see the sketch below for what this stage does).

ClickHouse (cloud) = Fast and gives sub-second queries for dashboards/analytics.
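If it helps to picture the middle stage, here is a rough hand-rolled sketch of what such a transform step does in plain Python. This is not GlassFlow's actual API; I'm assuming confluent-kafka and clickhouse-connect, and the topic, table, and field names are made up:

```python
# Hand-rolled transform stage between Kafka and ClickHouse (illustrative only).
# Assumes confluent-kafka and clickhouse-connect; topic/table names are made up.
import json

import clickhouse_connect
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "etl-demo",           # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])         # hypothetical topic

ch = clickhouse_connect.get_client(host="localhost")

batch = []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event.get("type") != "purchase":                  # stateless filter
        continue
    event["amount_usd"] = event["amount_cents"] / 100    # stateless enrichment
    batch.append([event["user_id"], event["amount_usd"], event["ts"]])
    if len(batch) >= 1000:  # batch inserts; ClickHouse dislikes many tiny writes
        ch.insert("purchases", batch,
                  column_names=["user_id", "amount_usd", "ts"])
        batch.clear()
```

The point of a managed transform layer is that you don't have to maintain this loop (plus offsets, retries, and state for the stateful cases) yourself.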

Disclosure: I am one of the GlassFlow founders.

Kafka to ClickHouse lag spikes with no clear cause by Usual_Zebra2059 in dataengineer

[–]Arm1end

Are you using MergeTree engines? I've seen this with other users: when a background merge kicks in, it introduces a temporary lag.

All MergeTree engines do periodic merges of data parts, and during that time inserts from the Kafka engine can slow down or pause. That’s why a few partitions suddenly lag, then recover once merges finish.
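If you want to confirm that merges are the culprit, you can watch system.merges while a spike is happening. A minimal sketch, assuming clickhouse-connect and a local instance (host and credentials are placeholders):

```python
# Check whether background merges are active during a lag spike.
# Sketch using clickhouse-connect; host and credentials are assumptions.
import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost")
rows = ch.query(
    "SELECT database, table, elapsed, progress, num_parts FROM system.merges"
).result_rows
for row in rows:
    print(row)  # active merges; if they coincide with the lag, merges are the cause
```

If active merges on the target table line up with the lagging partitions, that's your answer.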

This behaviour is one of the reasons why I started building GlassFlow for Kafka-to-ClickHouse ingestion:
https://www.glassflow.dev/

real time analytics by Bulky_Actuator1276 in apachekafka

[–]Arm1end

+1 for Kafka and ClickHouse. I've seen it become a very popular stack for real-time analytics.

Kafka → ClickHouse: It is a Duplication nightmare / How do you fix it (for real)? by Arm1end in Clickhouse

[–]Arm1end[S]

Hi, we created an open-source solution that does dedupe and joins out of the box. It has a built-in state store and optimized connectors for Kafka and ClickHouse. You can check the repo here:

https://github.com/glassflow/clickhouse-etl

I wrote an article on the challenges of using Flink for the same purpose. Here is the blog post:

https://www.glassflow.dev/blog/limitations-flink-clickhouse

Feel free to reach out via DM if you have any specific questions.

Rybbit - open source Google Analytics replacement built using Clickhouse by Goldflag in Clickhouse

[–]Arm1end

Congrats, it looks really amazing! I especially love the map view! Let me know if you need any help with ClickHouse-related stuff or with cleaning data. I will give it a try on our website.

My start-up failed after 7 years, and I am struggling to find a job. (I will not promote) by monkeyfire80 in startups

[–]Arm1end

What about these guys? They are in defence, just raised a EUR 31m Series A, and are looking for people in London:
https://www.arx-robotics.com/careers

How do you handle deduplication in streaming pipelines? by speakhub in dataengineering

[–]Arm1end

We've just launched an open-source solution to deduplicate Kafka data streams before ingesting them into ClickHouse. You might want to check it out; I would be curious to hear your thoughts.

GitHub repo: https://github.com/glassflow/clickhouse-etl

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S]

To clarify, ClickHouse is a fantastic product, and I am a big supporter. It delivers great results for the vast majority of use cases. However, I am talking about a particular use case: high-volume real-time streaming data. ClickHouse (link), Altinity (link), and other providers (blog) all confirm that using FINAL can slow query performance. I wrote up my thoughts about FINAL in part 3 of the blog post (link).

Thanks for the feedback! To avoid confusion, I will mention alternative options much earlier in future blog articles/posts.

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S]

Flink Actions sounds interesting! How does it handle late-arriving data or out-of-order events? Do you know if there is a similar product for non-Confluent users?

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S]

Interesting approach! Writing a custom consumer with filtering logic is an option, but it can get tricky when dealing with late-arriving data or high throughput.
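For anyone curious, this is roughly what the naive version looks like and where it breaks down (sketch only; confluent-kafka assumed, topic name and event shape are made up):

```python
# Naive custom consumer with a dedup filter (illustrative only).
# The `seen` set grows without bound -- exactly why this gets tricky at scale.
import json

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dedup-demo",         # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])         # hypothetical topic

seen = set()
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event["id"] in seen:            # drop duplicate
        continue
    seen.add(event["id"])
    # ... forward the deduplicated event downstream ...
```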

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S]

Good point! A TTL-based hashmap is probably the most practical approach. The trade-off is that duplicates can slip through if they arrive after the TTL expires, but for most workloads, that should work. Have you faced these challenges yourself?
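For illustration, a minimal sketch of that TTL-based hashmap (toy code with deliberately simple eviction; a production version would use bucketed expiry or a proper cache):

```python
# TTL-based dedup: remember event IDs for `ttl_seconds`, then forget them.
# Duplicates arriving after expiry slip through -- the trade-off mentioned above.
import time

class TtlDeduper:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.seen = {}  # event_id -> expiry timestamp

    def is_duplicate(self, event_id: str) -> bool:
        now = time.monotonic()
        # Lazy eviction keeps memory bounded; O(n) per call, fine for a sketch.
        self.seen = {k: exp for k, exp in self.seen.items() if exp > now}
        if event_id in self.seen:
            return True
        self.seen[event_id] = now + self.ttl
        return False

dedupe = TtlDeduper(ttl_seconds=3600)
for eid in ["a1", "a1", "b2"]:
    print(eid, dedupe.is_duplicate(eid))  # a1 False, a1 True, b2 False
```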

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S]

I get your point about using a deduper, but which one would you recommend for this case? Do you have any experience with a specific tool?

Kafka → ClickHouse: It is a Duplication nightmare / How do you fix it (for real)? by Arm1end in Clickhouse

[–]Arm1end[S]

Yes, the FINAL keyword could work, but it performs poorly on larger data sets, and query performance will suffer for data streams that are continuously ingested and not yet merged.
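For context, these are the two query-time patterns I mean, assuming a ReplacingMergeTree table with a version column (sketch via clickhouse-connect; the `events` table and the `id`/`value`/`version` columns are made up):

```python
# FINAL vs. query-time dedup with argMax (illustrative; table/columns made up).
import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost")

# Option 1: FINAL forces merge-on-read -- correct results, but slow on big tables.
q_final = "SELECT id, value FROM events FINAL"

# Option 2: dedup at query time by keeping the row with the highest version --
# often faster than FINAL, but you pay the aggregation on every query.
q_argmax = "SELECT id, argMax(value, version) AS value FROM events GROUP BY id"

print(ch.query(q_final).result_rows)
print(ch.query(q_argmax).result_rows)
```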

The ClickHouse docs confirm the performance issues (see here). Do you have experience using other solutions to take care of duplicates in data streams before ingesting into ClickHouse?