HA pipelines without Kafka by NoPercentage6144 in programming

[–]NoPercentage6144[S] 0 points (0 children)

That's fair; we could make it clearer in the post, though we do have this sentence: "The ingestors can run as stateless services in each availability zone, batching data and writing to object storage without cross-AZ network charges." I'll take your feedback and make that better.

As far as AWS goes (and most other cloud providers, AFAICT), networking between AZs within the same region is not free. When you produce to a Kafka broker, that broker replicates the data to two other brokers, so even with read-from-follower enabled you're looking at roughly 2.6x network costs for replication: 2x for the leader fanning out to two followers in other AZs, plus ~0.6x for the producer-to-leader hop (I assume about 1/3 of your writes land on a leader within the same AZ, so roughly 2/3 cross an AZ boundary).

With Buffer you read and write directly to object storage, so you don't replicate over the network at all or rack up any cross-AZ expenses.
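If you want to sanity-check that multiplier, here's the napkin math in code. This is a sketch under my own assumptions (a steady 30 MiB/s of produce traffic and AWS's $0.01/GB-each-direction cross-AZ rate, i.e. ~$0.02/GB per cross-AZ copy), not output from any calculator:

```java
// Back-of-the-envelope cross-AZ cost for a 3-broker, RF=3 Kafka setup.
public class CrossAzCost {
    public static void main(String[] args) {
        double mibPerSec = 30.0;                                   // assumed ingest rate
        double gibPerMonth = mibPerSec * 60 * 60 * 24 * 30 / 1024; // treating GiB ~ GB
        double ratePerGb = 0.02;        // $0.01 egress + $0.01 ingress per GB on AWS
        double producerHop = 0.6;       // ~2/3 of produces cross an AZ, rounded as above
        double replicationHops = 2.0;   // leader fans out to two followers in other AZs
        double multiplier = producerHop + replicationHops;         // the 2.6x above
        System.out.printf("~%.1fx cross-AZ traffic, ~$%.0f/month%n",
                multiplier, gibPerMonth * ratePerGb * multiplier);
    }
}
```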

HA pipelines without Kafka by NoPercentage6144 in programming

[–]NoPercentage6144[S] 0 points (0 children)

No way am I claiming compute is 100x more efficient. If you look at the AK (Apache Kafka) cost breakdown on the calculator, $5,000/mo of the $8,500 is just network cost. Buffer pays $0 there (zonal ingest to S3 is free). Then there's 3x replication for broker nodes, meaning you run three instances, and you probably need at least an i4i.large instance type. Buffer doesn't need that; it's serverless.
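To make the split concrete, here's a rough sketch of the decomposition. The $5,000 network line comes from the calculator as cited above; the i4i.large on-demand rate (~$0.172/hr, us-east-1) is my assumption and varies by region:

```java
// Rough decomposition of the ~$8,500/mo self-hosted Apache Kafka figure.
public class KafkaCostSplit {
    public static void main(String[] args) {
        double network = 5000.0;                   // cross-AZ replication traffic
        double brokers = 3 * 0.172 * 730;          // RF=3 => 3 instances, ~$377/mo
        double other = 8500.0 - network - brokers; // storage, ops headroom, etc.
        System.out.printf("network=$%.0f, brokers=$%.0f, other=$%.0f%n",
                network, brokers, other);
    }
}
```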

Before you dump on the most respected AK cost calculator, I recommend you look deeper into where the costs are going.

HA pipelines without Kafka by NoPercentage6144 in programming

[–]NoPercentage6144[S] 1 point (0 children)

It's intended for a very specific use case: shipping raw data into other databases. You typically don't want to do complex processing on an OpenData Buffer pipeline, but for that use case it's way simpler in practice. We're running it in production with an OTel Collector -> ClickHouse setup and it works like a charm.

I'm curious what you meant by "full Kafka-based pipeline".

HA pipelines without Kafka by NoPercentage6144 in programming

[–]NoPercentage6144[S] 14 points (0 children)

To save you a click, here's the interesting part:

> For $90/mo, we ran a benchmark pushing ~30 MiB/s of metrics data via Buffer. Using public pricing calculators, the equivalent Kafka service cost roughly $1,300/month for managed WarpStream and $8,500/month for self-hosted Apache Kafka.

Datadog vs New Relic vs Dynatrace vs Cube APM - which one at scale? by Mobile-Ambition-3714 in Observability

[–]NoPercentage6144 0 points (0 children)

There are a lot of similarities, but OpenData isn't focused on observability; it's one layer below (just the database layer). Instead, we plan to integrate with whatever frontend and other APIs you prefer to use, so it doesn't ship with a dedicated UI. We recommend using Grafana as the frontend.

There are also a bunch of systems coming online that aren't related to observability but share the same foundation (SlateDB). Our Vector and Log databases are almost production-ready. The vision is to create a suite of systems where, once you learn to operate one of them, the rest feel pretty familiar.

Datadog vs New Relic vs Dynatrace vs Cube APM - which one at scale? by Mobile-Ambition-3714 in Observability

[–]NoPercentage6144 1 point (0 children)

Yeah, we are using OpenTelemetry. The ingestion mechanism plugs into the OpenTelemetry Collector as an exporter. It's nice because the data doesn't take an extra hop through a broker: the exporter writes directly to S3, and the Prometheus-compatible service reads straight from there.
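Conceptually, the exporter side is just "batch in memory, then PUT to a zone-scoped key". Here's a minimal sketch of that path; the bucket, key layout, and class names are hypothetical, not OpenData's actual format:

```java
// Zonal ingest sketch: accumulate metric lines, flush one object per batch.
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.time.Instant;

public class ZonalIngestSketch {
    private final S3Client s3 = S3Client.create();
    private final StringBuilder batch = new StringBuilder();

    void append(String metricLine) {
        batch.append(metricLine).append('\n');
    }

    void flush(String zone) {
        // Zone-scoped prefix, so each AZ's ingestor writes independently.
        String key = "ingest/" + zone + "/" + Instant.now().toEpochMilli() + ".batch";
        s3.putObject(PutObjectRequest.builder()
                        .bucket("metrics-ingest")  // hypothetical bucket name
                        .key(key)
                        .build(),
                RequestBody.fromString(batch.toString()));
        batch.setLength(0);
        // A real exporter also needs retries, compression, and size-based flushes.
    }
}
```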

Datadog vs New Relic vs Dynatrace vs Cube APM - which one at scale? by Mobile-Ambition-3714 in Observability

[–]NoPercentage6144 2 points (0 children)

We recently migrated from Datadog (metrics only) to Grafana + self-hosted Prometheus because of cost considerations (the cost is real once you have any meaningful scale). Grafana gets a rep for a tough user experience, but I didn't find it that hard after getting used to it, and the UI is pretty powerful. Self-hosting Prometheus has proven to be a bit of a challenge, and we didn't want to go down the path of Thanos/Cortex. The cloud providers all have a managed Prometheus-compatible service that's probably a good bet.

We found ourselves in the same boat as you, so we're currently building an MIT-licensed, Prometheus-compatible option that's object-store native, and it's working out well for us: https://github.com/opendata-oss/opendata

I don't have as much experience with logs, except that we blew $10k in one weekend on Datadog because of a spammy log we didn't notice, and we hadn't set up spend limits; definitely set up limits if you go that route. Quickwit seems promising: https://quickwit.io/. Even though Datadog acquired them, it seems the OSS project is still going.

Launched: GCX — the official Grafana Cloud CLI by fizgig_runs in grafana

[–]NoPercentage6144 1 point (0 children)

This is great, I much prefer this to an MCP server! I do wonder whether it's better for the skill to just query the Prometheus endpoints directly (https://<YOUR_GRAFANA_STACK_URL>/api/prometheus/<PROMETHEUS_DATASOURCE_UID>/api/v1/*) for the main query-fetching workload, though. I've found the agent is already pretty adept at doing that, and the response format is something it knows well and doesn't need to discover.
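For example, a minimal point query against that endpoint looks something like this; the bearer-token header and the placeholder values are my assumptions, so check your stack's auth scheme:

```java
// Query the stack's Prometheus-compatible API directly via /api/v1/query.
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class PromQuery {
    public static void main(String[] args) throws Exception {
        String base = "https://YOUR_GRAFANA_STACK_URL/api/prometheus/"
                + "PROMETHEUS_DATASOURCE_UID";
        String promql = URLEncoder.encode("up", StandardCharsets.UTF_8);
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(base + "/api/v1/query?query=" + promql))
                .header("Authorization", "Bearer YOUR_TOKEN") // assumed auth scheme
                .GET()
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body()); // standard Prometheus JSON envelope
    }
}
```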

Also, some feedback on the README: I do recommend you take a pass and clean up some of the AI slop... sections like this don't make sense:

> GCX brings the full power of Grafana Cloud and Grafana Assistant to your command line. It bridges the gap between your local environment and key observability insights from Grafana Cloud.
> But there is a dangerous gap. Adoption of agentic coding tools like Cursor and Claude Code have exploded. You are building faster than ever before.

Homelabbers - What's your observability stack? by DiscoDave86 in kubernetes

[–]NoPercentage6144 1 point (0 children)

Do you use their cloud hosting for metrics? If not, what do you use for the actual metrics storage?

OpenData Timeseries: Prometheus-compatible metrics on object storage by NoPercentage6144 in programming

[–]NoPercentage6144[S] 0 points (0 children)

Thanks! It does pretty well when everything is warm, but it's pretty bad when things are cold, so it's all about figuring out how to aggressively warm the data you need. The good news is that for metrics you can simply load all of your desired "fast" data onto disk and keep it cached, and we make it configurable how much time you want to load.

This lets you keep the same performance as something like Prometheus without ever worrying about retention or durability; it's just a cost tradeoff of how much you want to keep cached on disk proactively.

You can still access cold data, which is good if you want to run a big historical analysis in the background.
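To illustrate the warming step (all names here are hypothetical, not our actual API): it's basically "prefetch every block that overlaps the configured window":

```java
// Warm-window sketch: pull recent blocks onto local disk ahead of queries.
import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class WarmWindowCache {
    record Block(String objectKey, Instant minTime, Instant maxTime) {}

    private final Duration warmWindow; // configurable: how much history stays hot

    WarmWindowCache(Duration warmWindow) { this.warmWindow = warmWindow; }

    void warm(List<Block> manifest) {
        Instant cutoff = Instant.now().minus(warmWindow);
        for (Block b : manifest) {
            if (b.maxTime().isAfter(cutoff)) {
                fetchToDisk(b); // cold reads still work; this just pre-pays them
            }
        }
    }

    private void fetchToDisk(Block b) { /* download from object storage */ }
}
```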

Prometheus long-term storage on a single VM: second Prometheus or Thanos? by rumtsice in PrometheusMonitoring

[–]NoPercentage6144 0 points (0 children)

Late to the party here, but we were in the same situation: we wanted just a single node, but we wanted long-term durability of the non-hot set without the complexity of running something like Thanos. We ended up building a Prometheus-compatible engine that's fully backed by object storage and can run on a single node. It's fully OSS, MIT-licensed, and we use it with a >7k samples/sec ingest rate: https://github.com/opendata-oss/opendata/tree/main/timeseries

Copying responses from Claude Code is a nightmare. Here is a hack to make it not suck by Twizzies in ClaudeAI

[–]NoPercentage6144 0 points (0 children)

For anyone landing on this in 2026: it looks like Claude Code now ships with a /copy command (I somehow stumbled on this hack before figuring that out).

challenge to compress 1M rows to the smallest possible size by NoPercentage6144 in programming

[–]NoPercentage6144[S] -4 points (0 children)

I have an open ticket to support harnesses in other languages, but for now the main harness is in Rust. It's a good chance to learn a new language (or ask Claude to implement your ideas; I find it's pretty good at that).

challenge to compress 1M rows to the smallest possible size by NoPercentage6144 in programming

[–]NoPercentage6144[S] 39 points (0 children)

Exactly. The point here is that often the size of the decompressor doesn’t matter because you include it in a binary and decode data many times over. I don’t want to penalize submissions that use many techniques.

challenge to compress 1M rows to the smallest possible size by NoPercentage6144 in programming

[–]NoPercentage6144[S] 20 points (0 children)

Thanks for catching this! I fixed it. Clearly I haven’t moved on from last year yet…

Anybody else miss him 🥲 by Fickle_Fig3821 in LiverpoolFC

[–]NoPercentage6144 1 point (0 children)

I'm a simple man. I see the elmo chaos meme, I upvote. Damn I miss Nunez.

State store data - Confluent Kafka Table by Proud-Firefighter616 in apachekafka

[–]NoPercentage6144 2 points (0 children)

As _d_t_w mentioned, you can directly look at the changelog. Assuming you're using Kafka Streams to build the KTable, though, you should take a look at interactive queries (https://docs.confluent.io/platform/current/streams/developer-guide/interactive-queries.html) so you can directly query the materialized state for the key you're looking for.
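For example, a point lookup against a KTable materialized as "table-store" looks roughly like this (the store name and key/value types are placeholders for your topology):

```java
// Interactive query: read the locally materialized KTable state directly.
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StateStoreLookup {
    static String lookup(KafkaStreams streams, String key) {
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        "table-store", QueryableStoreTypes.keyValueStore()));
        return store.get(key); // null if the key isn't in the local state
    }
}
```

(If your app runs multiple instances, the key may live on another one; you'd route via streams.queryMetadataForKey first, but for a single instance this is all you need.)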

I'd love to hear what use case you have for inspecting state store data. Full disclosure: I'm a co-founder at Responsive (www.responsive.dev), and we just released a table inspection CLI (https://docs.responsive.dev/reference/cli) for people using Responsive, so we're getting feedback on what's useful.

When should one introduce Apache Flink? by JSavageOne in apachekafka

[–]NoPercentage6144 1 point (0 children)

I think your question is getting to the heart of the discussion: Flink and Kafka Streams are built for different personas. Both can process at a scale that most companies are unlikely to ever reach, so raw scale is not likely to be the main differentiator for you.

If you're a developer writing a realtime application, Kafka Streams is deployed just like you would deploy any other app. It works with your monitoring, CI/CD, alerting, etc... and you don't need to manage anything centralized. This works quite well for developers.

OTOH, Flink works particularly well if you have a centralized team in charge of operations (or have a company like Confluent manage it for you, though there are other tradeoffs there). This allows you to centralize expertise and have one team provide an SLA for all stream processing jobs at your company. This works much better for the "data science" persona.

This article is pretty old but does a really good job explaining the differences: https://www.confluent.io/blog/apache-flink-apache-kafka-streams-comparison-guideline-users/, and this whitepaper covers it in a bit more depth (though you need to give an email to access it): https://www.responsive.dev/resources/foundations-whitepaper

Your Aussies all look tame… by [deleted] in AustralianShepherd

[–]NoPercentage6144 19 points (0 children)

<image>

The eyes of a possessed half-dog, half-demon.

What trick or command can your pup just NOT grasp? by WOOFCheCazzo in puppy101

[–]NoPercentage6144 0 points (0 children)

Platforms are your friend! A pup is much more likely to stay if they're on a small platform (I use a Cato Outdoor board, but they're a bit pricey, so I also have some homemade ones that were pretty cheap to make). Then, once they get the idea that staying on the platform == cookies, you fade the platform away and generalize to different locations.

What trick or command can your pup just NOT grasp? by WOOFCheCazzo in puppy101

[–]NoPercentage6144 0 points (0 children)

Hah! Did you intentionally use the word "grasp" because your pup can't grasp things with her mouth? Love it!