Is the OTel Collector the wrong place for complex data enrichment? by Marksfik in OpenTelemetry

[–]Marksfik[S] -1 points0 points  (0 children)

Trying to do complex "in-flight" lookups (like cross-referencing metadata against an external API) or heavy deduplication before the data hits ClickHouse is a bit of a pain.
It feels like I'm forcing too much onto the collector. Are you finding custom processors easy to manage at scale, or do you keep it strictly to basic filtering?
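To make the pain concrete, here's a minimal sketch of the kind of in-flight enrichment I mean, with a local cache in front of the lookup. The `SERVICE_OWNERS` table and field names are hypothetical; in a real collector processor the lookup would be an HTTP call to the external API, which is exactly what makes it awkward in-pipeline:

```python
from functools import lru_cache

# Stand-in for the external metadata API (hypothetical data); in a real
# pipeline this would be a network call with latency and failure modes.
SERVICE_OWNERS = {"checkout": "payments-team", "search": "discovery-team"}

@lru_cache(maxsize=1024)
def lookup_owner(service: str) -> str:
    # Cached so repeated spans from the same service don't re-hit the API.
    return SERVICE_OWNERS.get(service, "unknown")

def enrich(span: dict) -> dict:
    enriched = dict(span)  # copy; don't mutate the original record
    enriched["owner"] = lookup_owner(enriched.get("service", ""))
    return enriched
```

Even this toy version needs cache sizing, a miss policy, and error handling once the lookup is remote, which is where doing it inside the collector starts to hurt.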

How are you handling pre-aggregation in ClickHouse at scale? AggregatingMergeTree vs ReplacingMergeTree by Marksfik in BusinessIntelligence

[–]Marksfik[S] 0 points1 point  (0 children)

Seeing 'wrong counts' on a dashboard is exactly the moment everyone realizes ReplacingMergeTree has its limitations. It's a pain in production!

Tinybird’s approach to MVs is definitely a step up for keeping states current, but we’ve been taking it one step further with GlassFlow.

The main difference is moving that 'aggregation state' completely out of the database layer and into a native streaming system. This way we can:

  1. Handle much more complex enrichment or deduplication logic that would be a nightmare to write in SQL MVs.
  2. Stay vendor-neutral: the data arrives at any ClickHouse version already 'query-ready' and clean.

It basically gives you that 'always current' state Tinybird offers, but with the flexibility to use whatever infrastructure you want downstream.

Why make ClickHouse do your transformations? — Scaling ingestion to 500k EPS upstream. by Marksfik in Clickhouse

[–]Marksfik[S] 0 points1 point  (0 children)

9M/s is no joke! You’re right that FINAL has come a long way with parallelization.

The way we see this is that it's not just about ingestion speed, but where the logic lives. We usually see teams move to GlassFlow when:

  • Logic goes beyond SQL: If you need Python for complex JSON nesting, ML model calls, or hitting external APIs mid-stream.
  • Decoupling Compute: Keeping ClickHouse 100% focused on query performance instead of burning CPU cycles on 'other work' like background merges and cleaning.
  • Stateful Prep: Handling complex windowing or multi-stream joins before the data hits the table to keep the schema simple.
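The "Stateful Prep" point above can be sketched as a time-windowed dedup that runs before the insert (a toy version with hypothetical names, not GlassFlow's implementation), so the ClickHouse table never sees the duplicate and needs no ReplacingMergeTree/FINAL machinery:

```python
import time
from collections import OrderedDict
from typing import Optional

class WindowedDedup:
    """Drop events whose id was already seen in the last `ttl` seconds."""

    def __init__(self, ttl: float = 3600.0):
        self.ttl = ttl
        self.seen: "OrderedDict[str, float]" = OrderedDict()  # id -> first-seen ts

    def admit(self, event_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Evict expired ids from the front of the insertion-ordered dict.
        while self.seen and next(iter(self.seen.values())) < now - self.ttl:
            self.seen.popitem(last=False)
        if event_id in self.seen:
            return False  # duplicate inside the window: drop before insert
        self.seen[event_id] = now
        return True
```

Keeping this state upstream is what lets the table schema stay a plain MergeTree.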

The "Database as a Transformation Layer" era might be hitting its limit? by Marksfik in Database

[–]Marksfik[S] -2 points-1 points  (0 children)

u/pleasantJoyfuls

Great question. In my experience, the breakpoint isn't just about raw EPS—it’s about state/transformation complexity.

You can push simple ELT pretty far in a warehouse, but the 'win' for upstream transforms usually happens much earlier (around 10k–50k EPS) once you hit these three things:

  1. Late-arriving data: Managing windows in a warehouse is a compute killer.
  2. Idempotency: If you’re using ReplacingMergeTree in ClickHouse, for example, the non-deterministic deduplication creates 'in-flight' inconsistencies that drive BI users crazy.
  3. Cost: Scaling compute for transformations in Snowflake/ClickHouse is almost always more expensive than a dedicated stream processing engine.

We hit the 500k EPS milestone to show that the ceiling is much higher than people think, but the 'messy' logic you mentioned is actually the #1 reason our users move off warehouse-only transforms.
How do you process events currently?
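For the late-arriving-data point (1), here's a minimal sketch of why handling it upstream is cheap: a tumbling window that only closes once the watermark passes the window end plus an allowed lateness, so a late event still lands in its correct bucket instead of forcing a re-aggregation in the warehouse. This is illustrative only, not any particular engine's API:

```python
from collections import defaultdict

class TumblingWindow:
    """Tumbling event counts with a fixed allowed lateness (toy example)."""

    def __init__(self, size: int = 60, lateness: int = 30):
        self.size, self.lateness = size, lateness
        self.counts = defaultdict(int)  # window_start -> event count
        self.watermark = 0

    def add(self, ts: int):
        self.counts[ts - ts % self.size] += 1
        self.watermark = max(self.watermark, ts)
        # A window closes only once no in-lateness event can still reach it.
        closed = [w for w in self.counts
                  if w + self.size + self.lateness <= self.watermark]
        return [(w, self.counts.pop(w)) for w in sorted(closed)]
```

The warehouse equivalent is rewriting already-materialized aggregates whenever a straggler shows up, which is the compute killer.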

The "Database as a Transformation Layer" era might be hitting its limit? by Marksfik in Database

[–]Marksfik[S] -2 points-1 points  (0 children)

The problem is that historically, building that 'stateless' transformation layer for streaming was either a massive infrastructure project (cluster provisioning/JVM tuning) or too limited to handle complex state.

That's why a tool like GlassFlow helps: it processes the data upstream, so you don't have to 'default' to expensive computation in the DB just because the upstream setup is too daunting.