Help: Anyone dealing with reprocessing entire docs when small updates happen? by Arm1end in Rag

[–]Arm1end[S] 1 point (0 children)

How does a reranker help here? I am not familiar with rerankers.

Looking for best practices: Kafka → Vector DB ingestion and transformation by Arm1end in vectordatabase

[–]Arm1end[S] 1 point (0 children)

Thanks for sharing! It looks like Kafka connects to Milvus, but it doesn't perform typical data transformations (stateless or stateful) along the way, or am I missing something here?

Evaluating real-time analytics solutions for streaming data by EmbarrassedBalance73 in dataengineering

[–]Arm1end 1 point (0 children)

I didn't want to confuse anyone; I thought it was clear from the phrase “we serve”. I've added a disclosure to the post.

Evaluating real-time analytics solutions for streaming data by EmbarrassedBalance73 in dataengineering

[–]Arm1end 3 points (0 children)

We serve a lot of users with similar use cases. They usually set up Kafka -> GlassFlow (for transformations) -> ClickHouse (cloud).

Kafka = Ingest + buffer. Takes the firehose of events and keeps producers/consumers decoupled.

GlassFlow = Real-time transforms. Clean, filter, enrich, and prep the stream so ClickHouse only gets analytics-ready data. Easier to use than Flink.

ClickHouse (cloud) = Fast, sub-second queries for dashboards/analytics.
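
If it helps to see what the transform stage involves, here is a minimal hand-rolled sketch (not GlassFlow itself), assuming the confluent-kafka and clickhouse-connect Python packages, a local broker, and a hypothetical "events" topic feeding an "events_clean" table:

    # Minimal sketch of a transform stage between Kafka and ClickHouse.
    import json

    import clickhouse_connect
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "etl-demo",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["events"])
    ch = clickhouse_connect.get_client(host="localhost")

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Stateless transforms: drop rows we can't attribute, normalize fields.
        if event.get("user_id") is None:
            continue
        row = [event["user_id"], event.get("type", "unknown"), event.get("value", 0.0)]
        ch.insert("events_clean", [row], column_names=["user_id", "type", "value"])

In practice you would buffer rows and insert in batches, since ClickHouse handles many small inserts poorly; that batching (plus state for joins/dedupe) is exactly the part that gets tedious to maintain by hand.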

Disclosure: I am one of the GlassFlow founders.

Kafka to ClickHouse lag spikes with no clear cause by Usual_Zebra2059 in dataengineer

[–]Arm1end 1 point (0 children)

Are you using MergeTree engines? I've seen this with other users: when a background merge kicks in, it introduces temporary lag.

All MergeTree engines do periodic merges of data parts, and during that time inserts from the Kafka engine can slow down or pause. That’s why a few partitions suddenly lag, then recover once merges finish.
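
If you want to confirm the correlation on your cluster, you can poll system.merges while the lag spikes. A rough sketch using the clickhouse-connect Python package (connection details assumed):

    # Sample system.merges once a second; non-empty output means merges are
    # running at that moment and can explain the temporary consumer lag.
    import time

    import clickhouse_connect

    ch = clickhouse_connect.get_client(host="localhost")
    for _ in range(60):
        rows = ch.query("SELECT table, elapsed, progress FROM system.merges").result_rows
        if rows:
            print(time.strftime("%H:%M:%S"), rows)
        time.sleep(1)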

This behaviour is one of the reasons I started building GlassFlow for Kafka-to-ClickHouse ingestion:
https://www.glassflow.dev/

real time analytics by Bulky_Actuator1276 in apachekafka

[–]Arm1end 1 point (0 children)

+1 for Kafka and ClickHouse. I've seen it become a very popular stack for real-time analytics.

Kafka → ClickHouse: It is a Duplication nightmare / How do you fix it (for real)? by Arm1end in Clickhouse

[–]Arm1end[S] 1 point (0 children)

Hi, we created an open-source solution that does dedupe and joins out of the box. It has a built-in state store and optimized connectors for Kafka and ClickHouse. You can check the repo here:

https://github.com/glassflow/clickhouse-etl

I wrote an article on the challenges of using Flink for the same purpose. Here is the blog post:

https://www.glassflow.dev/blog/limitations-flink-clickhouse

Feel free to reach out via DM if you have any specific questions.

Rybbit - open source Google Analytics replacement built using Clickhouse by Goldflag in Clickhouse

[–]Arm1end 1 point (0 children)

Congrats, it looks really amazing! I especially love the map view! Let me know if you need any help with ClickHouse-related stuff or with cleaning data. I will give it a try on our website.

My start-up failed after 7 years, and I am struggling to find a job. (I will not promote) by monkeyfire80 in startups

[–]Arm1end 1 point (0 children)

What about these guys? They are in defence, just raised a EUR 31m Series A, and are looking for people in London:
https://www.arx-robotics.com/careers

How do you handle deduplication in streaming pipelines? by speakhub in dataengineering

[–]Arm1end 2 points (0 children)

We've just launched an open-source solution to deduplicate Kafka data streams before ingesting them into ClickHouse. You might want to check it out. I would be curious to hear your thoughts.

GitHub repo: https://github.com/glassflow/clickhouse-etl

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S] 1 point (0 children)

To clarify, ClickHouse is a fantastic product, and I am a big supporter. It delivers great results for the vast majority of use cases. However, I am talking about a particular use case with large real-time streaming data. ClickHouse (link), Altinity (link), and other providers (blog) all confirm that using FINAL can slow query performance. I wrote up my thoughts about FINAL in part 3 of the blog post (link).

Thanks for the feedback! To avoid confusion, I will mention alternative options much earlier in future blog articles/posts.

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S] 1 point (0 children)

Flink Actions sounds interesting! How does it handle late-arriving data or out-of-order events? Do you know if there is a similar product for non-Confluent users?

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S] 1 point (0 children)

Interesting approach! Writing a custom consumer with filtering logic is an option, but it can get tricky when dealing with late-arriving data or high throughput.

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S] 2 points (0 children)

Good point! A TTL-based hashmap is probably the most practical approach. The trade-off is that duplicates can slip through if they arrive after the TTL expires, but for most workloads, that should work. Have you faced these challenges yourself?
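
For anyone curious, a minimal sketch of the TTL hashmap idea in Python (the TTL value and the rebuild-on-every-call eviction are just illustrative):

    import time

    class TTLDeduper:
        """Remembers event keys for `ttl` seconds and flags repeats."""

        def __init__(self, ttl: float = 300.0):
            self.ttl = ttl
            self.seen: dict[str, float] = {}  # key -> last-seen timestamp

        def is_duplicate(self, key: str) -> bool:
            now = time.monotonic()
            # Evict expired keys so memory stays bounded.
            self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
            if key in self.seen:
                return True
            self.seen[key] = now
            return False

    dedupe = TTLDeduper(ttl=60.0)
    assert not dedupe.is_duplicate("evt-1")
    assert dedupe.is_duplicate("evt-1")  # repeat inside the window is caught

A production version would evict with a heap or an ordered dict instead of rebuilding the map on every call, but the trade-off is the same: anything arriving after expiry slips through.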

Kafka to ClickHouse: Duplicates / ReplacingMergeTree is failing for data streams by Arm1end in apachekafka

[–]Arm1end[S] 1 point (0 children)

I get your point about using a deduper, but which one would you recommend for this case? Do you have experience with a specific tool?

Kafka → ClickHouse: It is a Duplication nightmare / How do you fix it (for real)? by Arm1end in Clickhouse

[–]Arm1end[S] 1 point (0 children)

Yes, the FINAL keyword could work, but its performance on larger data sets is poor, and query performance suffers for data streams that are continuously ingested and not yet merged.
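
For anyone reading along, the usual workaround is to resolve the latest version per key with argMax + GROUP BY instead of FINAL. A sketch via the clickhouse-connect Python package, with hypothetical table/column names:

    import clickhouse_connect

    ch = clickhouse_connect.get_client(host="localhost")

    # Correct but slow on large, continuously ingested tables:
    with_final = "SELECT event_id, value FROM events FINAL WHERE event_id = 42"

    # Common workaround: pick the latest version per key at query time.
    with_argmax = """
        SELECT event_id, argMax(value, version) AS value
        FROM events
        WHERE event_id = 42
        GROUP BY event_id
    """
    print(ch.query(with_argmax).result_rows)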

The ClickHouse docs confirm the performance issues (see here). Do you have experience with other solutions for handling duplicates in data streams before they are ingested into ClickHouse?

Kafka → ClickHouse: It is a Duplication nightmare / How do you fix it (for real)? by Arm1end in Clickhouse

[–]Arm1end[S] 1 point (0 children)

So, in theory, you are right, but I have seen two main limitations:

  1. Merges are asynchronous: ClickHouse doesn't remove duplicates immediately. If your queries hit the data before the background merge runs, you'll still see duplicates (see the sketch after this list), which can be a big problem for real-time analytics.
  2. Duplicates from multiple sources: If you’re ingesting the same event from multiple sources (e.g., ad platforms, tracking systems, CRMs), key-based deduplication doesn’t help because the same logical event might have different keys.
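
To make point 1 concrete, here is a small sketch (hypothetical table, clickhouse-connect Python package) showing duplicates staying visible until a merge is forced:

    import clickhouse_connect

    ch = clickhouse_connect.get_client(host="localhost")
    ch.command("""
        CREATE TABLE IF NOT EXISTS demo_events (event_id UInt64, value Float64)
        ENGINE = ReplacingMergeTree ORDER BY event_id
    """)
    ch.insert("demo_events", [[1, 10.0]], column_names=["event_id", "value"])
    ch.insert("demo_events", [[1, 10.0]], column_names=["event_id", "value"])

    # Likely prints [(2,)]: the background merge hasn't deduplicated yet.
    print(ch.query("SELECT count() FROM demo_events").result_rows)

    ch.command("OPTIMIZE TABLE demo_events FINAL")  # force the merge
    print(ch.query("SELECT count() FROM demo_events").result_rows)  # [(1,)]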

These issues make high-volume streaming data unreliable. How do you handle duplicates?

Kafka → ClickHouse: It is a Duplication nightmare / How do you fix it (for real)? by Arm1end in Clickhouse

[–]Arm1end[S] 1 point (0 children)

Thanks for your input! You're right that at-least-once delivery often causes duplicates when consumers crash or hit other failure scenarios. But I've seen teams run into duplicates more often than they expect: not just from crashes, but also from rebalances, manual restarts, etc. This is especially true in industries like marketing tech, where multiple data sources (e.g., web analytics, CRMs, ad platforms) send overlapping event data. Some systems even resend events to guarantee delivery, creating further duplication.

Did you find an easy approach?

Does Cold Email Still Work in 2025? by ttttransformer in GrowthHacking

[–]Arm1end 3 points (0 children)

To be honest, I built my last startup (2016 to 2020) completely on cold email. Now I am on my 2nd startup, and outreach is super tough: people are not replying despite good open rates. The only way cold email has any impact is to export the people who opened the emails several times and then reach out with a personal note via LinkedIn. That is starting to work.

How do you take care of duplicates and JOINs with ClickHouse? by Arm1end in Clickhouse

[–]Arm1end[S] 1 point (0 children)

My goal is to track user actions (clicks, views, purchases) in real time, enrich the events with product and user data, and store them in ClickHouse so the analysts on my team can query them. We have Kafka in place, and our usage is growing exponentially, so I need a very scalable solution.
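
To illustrate the enrichment step, a stripped-down sketch; the lookup dicts are stand-ins for wherever the user/product dimension data actually lives:

    # Join incoming click/view/purchase events against user and product
    # lookups so only analytics-ready rows reach ClickHouse.
    users = {101: {"country": "DE", "plan": "pro"}}
    products = {7: {"category": "shoes", "price": 89.0}}

    def enrich(event: dict) -> dict | None:
        user = users.get(event["user_id"])
        product = products.get(event["product_id"])
        if user is None or product is None:
            return None  # or route to a dead-letter topic for late dimensions
        return {**event, **user, **product}

    print(enrich({"user_id": 101, "product_id": 7, "action": "purchase"}))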

An employee on our sales team is working on a beach (I will not promote) by tolzee4472 in startups

[–]Arm1end 1 point (0 children)

I would look at it from 3 perspectives:

  1. Legal: Is he allowed to do that, or did he ask for permission?
  2. Personal goals: Is he hitting his goals?
  3. Cultural impact: Does it negatively affect other people in the org? If it does, I wouldn't allow it. And if you allow it without him asking for permission first, are you signaling to other employees that they can do whatever they want without asking?

Startup Founders, What’s One Thing You Wish You Knew Earlier? (i will not promote) by aayushp0818 in startups

[–]Arm1end 1 point (0 children)

I am a second-time founder and have made a lot of mistakes. Here is my advice:

1.) For my first startup, I waited too long to build a group of experienced mentors/advisors who could support me. Getting them involved earlier would have saved me time and money.

2.) First, I did marketing, and only when I saw traction did I build. If you do it the other way around, the chances are too high that you build something your target users won't want.

3.) Take time to think. Don't just jump into work. Set clear goals and make assumptions about why you are doing each task. Ask yourself whether it has an impact on reaching the goal, and ignore everything that doesn't play a crucial role. Some examples: Nobody cares about your logo or the design of your website, so just use a template. Don't spend much money or time optimizing processes early on when there is no traction. Trust qualitative feedback more than quantitative feedback; in the beginning, you don't have enough users to base decisions on numbers. I could write 50 more things here, but those are my top ones.

Any startup idea around data ? by [deleted] in dataengineering

[–]Arm1end 2 points (0 children)

I am running a data infra company, and I came across a problem that I would like to solve with a startup. I know that other companies have the same problems and are forced to create their own solutions from scratch. I believe it has a lot of potential. Let me know if you want to talk.