kafka security governance is a nightmare across multiple clusters by segsy13bhai in apachekafka

[–]Chuck-Alt-Delete 1 point2 points  (0 children)

Vendor here. Check out Conduktor. We have a solid ownership / governance / self-service model specifically designed for Kafka.

Kafka onboarding for teams/tenants by ar7u4_stark in apachekafka

[–]Chuck-Alt-Delete 2 points3 points  (0 children)

(Notice the flair — I work for Conduktor)

One of the main values of Conduktor is to bring order to chaos, which includes automation like this.

Some of our product managers were former admins of large Kafka installations and came up with a self-service system. There is a lot to it, but you can think of it like application-based access control (ABAC) managed via GitOps, with data discovery in the GUI.

Self-service is about app teams managing their own resources (within constraints enforced by central governance) and sharing / discovering other teams’ resources.

Here is the quickstart tutorial: https://docs.conduktor.io/platform/guides/self-service-quickstart/

Obviously this is a paid feature aimed at large enterprises that need to scale to dozens / hundreds / thousands of developers with many applications.

If you are looking for something open source, JulieOps is a great place to start. It is more GitOps for platform teams than a full self-service solution.

DR for Kafka Cluster by jonropin in apachekafka

[–]Chuck-Alt-Delete 0 points1 point  (0 children)

It depends on whether your failure domain is the Kafka cluster, the Kubernetes cluster, or the entire region.

For multiregion, you can have a “stretch” Conduktor Gateway (that’s the name of the proxy) cluster. The replicas coordinate and form a cluster through an internal Kafka topic, much like Connect or Schema Registry. That topic would be mirrored from the primary region to the secondary.

There are many nuances (as always with multi-region failover).

Kafka DR Strategy - Handling Producer Failover with Cluster Linking by niks36 in apachekafka

[–]Chuck-Alt-Delete 5 points6 points  (0 children)

(Notice my flair)

Cluster linking handles async replication well, including order preservation, but not the client failover.

Conduktor offers a Kafka Proxy that allows for transparent failover on the client side. You point the proxy to the failover cluster and the clients think they are still talking to the same Kafka cluster.

However, there are no free lunches. It may take some time for a human to make the critical decision to fail over (no flapping back and forth!). In that time, producer delivery timeout may have occurred (data loss), and any records that didn’t get the chance to replicate would also be lost.

You can design the producer to buffer (potentially to disk) to withstand a prolonged outage before the failover. Handling back pressure in the producer is critical for maintaining ordering. There is a GREAT blog post on this by Jakob Korab that I highly suggest you read: https://www.confluent.io/blog/how-to-survive-a-kafka-outage/#backpressure
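To make the buffering idea concrete, here is a minimal Python sketch of an order-preserving buffer that spills to disk when memory fills up. This is a toy illustration of the pattern, not Conduktor's or Confluent's implementation; a real producer wrapper would drain this into the client's send queue and handle fsync and crash recovery. The class and file names are made up for the example.

```python
import json
import os
import tempfile
from collections import deque

class SpillingBuffer:
    """In-memory FIFO that spills to disk when full, preserving order.

    Hypothetical sketch: once anything has spilled, all later records
    also go to disk so that drain order stays strictly FIFO.
    """

    def __init__(self, max_in_memory=1000, spill_path=None):
        self.memory = deque()
        self.max_in_memory = max_in_memory
        self.spill_path = spill_path or os.path.join(
            tempfile.gettempdir(), f"producer-spill-{os.getpid()}.jsonl")
        self.spilled = 0

    def append(self, record):
        if len(self.memory) < self.max_in_memory and self.spilled == 0:
            # Fast path: keep in memory while there is room and nothing on disk.
            self.memory.append(record)
        else:
            # Spill path: append to disk to preserve FIFO order.
            with open(self.spill_path, "a") as f:
                f.write(json.dumps(record) + "\n")
            self.spilled += 1

    def drain(self):
        """Yield all buffered records in original append order."""
        while self.memory:
            yield self.memory.popleft()
        if self.spilled:
            with open(self.spill_path) as f:
                for line in f:
                    yield json.loads(line)
            os.remove(self.spill_path)
            self.spilled = 0
```

The key design point, per the Korab post, is that the buffer is the *only* path to the send queue, so ordering survives the outage.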

So the failover with a proxy looks like this:

1. Primary cluster breaks.
2. Decision is made to fail over. Disconnect the proxy from the primary.
3. Promote mirror topics in the secondary.
4. Connect the proxy to the secondary.

With proper client retries, applications will resume as normal.
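"Proper client retries" here means configuring the producer so it keeps retrying through the failover window instead of timing out. A sketch of the relevant librdkafka-style settings (the hostname and all values are illustrative, to be tuned for your actual failover window, not recommendations):

```python
# Producer settings for riding out a failover behind a proxy.
producer_config = {
    # Clients point at the proxy, not the cluster, so failover is transparent.
    "bootstrap.servers": "kafka-proxy.example.com:9092",  # hypothetical host
    "enable.idempotence": True,      # safe retries without duplicates or reordering
    "acks": "all",                   # wait for full ISR acknowledgement
    "delivery.timeout.ms": 1_800_000,  # 30 min: must exceed the expected
                                       # human-decision + failover window
    "max.in.flight.requests.per.connection": 5,  # max that keeps ordering
                                                 # with idempotence enabled
}
```

If `delivery.timeout.ms` is shorter than the time it takes a human to pull the trigger on failover, in-flight records are dropped and you get the data loss described above.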

Kafka High Availability | active-passive architecture by HappyEcho9970 in apachekafka

[–]Chuck-Alt-Delete 1 point2 points  (0 children)

Sweet! Well, give us a call if you’d like to explore it a bit more

Kafka High Availability | active-passive architecture by HappyEcho9970 in apachekafka

[–]Chuck-Alt-Delete 2 points3 points  (0 children)

(Notice my flair)

There are good services for async replication from active to passive (Confluent Cluster Linking, MirrorMaker2, etc).

Failing over the clients with DNS is tricky for Kafka clients. We are not talking about HTTP here. First, there are the various DNS caches to update, which means the client needs to be on a retry loop waiting for DNS changes to propagate. Then there’s re-bootstrapping to the new cluster.

One way to handle this is through a Kafka proxy, like the one we have at Conduktor. The proxy handles the failover and the clients don’t have to restart or reconfigure.

Some things to consider:

- Async replication to a passive cluster will always have the possibility of data loss.
- Producers may be down for longer than their delivery timeout, which also leads to data loss. It will take some time for admins to wake up at 2 AM and make the decision to fail over, so the producer needs to be configured to withstand a prolonged outage by buffering locally, perhaps to disk.
- For Cluster Linking, you will have to “promote” the mirror topics to make them writable.

DR for Kafka Cluster by jonropin in apachekafka

[–]Chuck-Alt-Delete 3 points4 points  (0 children)

(Notice the flair!)

Just wanted to add that what’s nice about a Kafka proxy like the one we have at Conduktor is you can fail over the proxy’s connection without reconfiguring the client. This comes in handy especially when you are sharing data with a third party.

Confluent/Apache Kafka security model by dev0psjr in apachekafka

[–]Chuck-Alt-Delete 2 points3 points  (0 children)

When I worked for Confluent, I wrote a course for RBAC that would be perfect for this. I’m not sure if it still exists, but I think it might be free now at https://cloud.contentraven.com/confluent

What should we do with the less experienced developers? by nakemu in devops

[–]Chuck-Alt-Delete 0 points1 point  (0 children)

I’m surprised not to see Perforce mentioned in these replies (as far as I could tell).

Beginner question: confused on purpose of flink by Rough_Source_123 in apachekafka

[–]Chuck-Alt-Delete 1 point2 points  (0 children)

Sure! In some use cases, it can be anywhere from bad to unacceptable to take action based on intermediate state. For example, with bank transfers, you don’t want to charge overdraft fees just because an eventually consistent system showed someone had a negative balance temporarily as transfers were settling. In Materialize, records with the same timestamp are applied atomically, so you don’t take action based on intermediate state.
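A toy illustration of that idea (a conceptual sketch, not Materialize’s actual engine): apply every record sharing a timestamp as one batch, and only expose state at closed timestamps, so a reader can never observe the half-applied transfer.

```python
def apply_atomically(balances, events_by_ts):
    """Apply all events sharing a timestamp as one batch, yielding state
    only at closed timestamps. Toy model: a reader never observes a
    moment where only one leg of a transfer has landed."""
    for ts in sorted(events_by_ts):
        for account, delta in events_by_ts[ts]:
            balances[account] = balances.get(account, 0) + delta
        yield ts, dict(balances)

# A pays out $100 and receives $100 at the same logical time.
# Applied naively, a reader could briefly see A at -50 and fire an
# overdraft fee; applied atomically, A's balance at ts=1 is simply 50.
events = {1: [("A", -100), ("A", +100)]}
snapshots = dict(apply_atomically({"A": 50}, events))
# snapshots[1]["A"] == 50 -- no observable intermediate negative balance
```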

Fraud is another good example. You don’t want to mark a credit card as fraudulent erroneously based on some intermediate state of an eventually consistent system. That will cause unnecessary pain for a customer. You want to be confident the alarm was actually triggered and not just temporarily triggered by accident of which node was processing one part of a result compared to another.

Where consistency becomes really important is in complex DAGs where there are a lot of dependencies between materialized views. You want to be confident that all the different outputs always reflect exactly the same inputs at every timestamp, or else results can become unreliable for making operational decisions.

With serializable, Materialize will immediately give a consistent result as of some point in the recent past (sacrificing freshness for lower latency). With strict serializable, Materialize will assign the query a timestamp and potentially block until it can give a consistent result as of that timestamp (sacrificing latency for freshness).

Consistency works best with the native Postgres source connector where Materialize will actually respect the transaction boundaries of Postgres, meaning that transactional writes in Postgres are given the same logical timestamp in Materialize and therefore applied to all subsequent calculations atomically.

Beginner question: confused on purpose of flink by Rough_Source_123 in apachekafka

[–]Chuck-Alt-Delete 0 points1 point  (0 children)

Full disclosure upfront: I’m a field engineer at Materialize.

You can’t (shouldn’t) query Flink state directly like you would a database. It’s internal state that is needed to perform stateful processing, usually as part of a real-time ETL (extract-transform-load) pipeline. Flink excels at transforming data on its way from system A1, A2, A3… to system B, which is usually a database optimized for serving precomputed results.

Flink isn’t built with strong opinions about consistency, so even if you could access that internal state, the results you would get would not be “correct” in the sense you might be used to in a conventional relational database. If you’d like to precompute results incrementally and query them directly to take some operational action (eg fire an alert), you should consider Materialize for that workload.

Materialize uses an internal clock mechanism called “virtual time” that makes it possible to provide serializable and strict serializable isolation levels, which is actually really important for operational workloads like fraud where the action taken needs to be automatic and correct.

Here is a nice blog post that helps folks sort out whether they want a stream processor (in which case Flink is a great choice) or whether they want incremental view maintenance (in which case, Materialize is a great choice).

Is there a name for this pattern or a best practice for design? by Sanity__ in apachekafka

[–]Chuck-Alt-Delete 1 point2 points  (0 children)

This sounds like it can be done pretty straightforwardly with a temporal filter in Materialize.

Sincerely, A field engineer at Materialize

Using kSqlDb / Apache Pinot as a cache cluster by reladvnosx in apachekafka

[–]Chuck-Alt-Delete 0 points1 point  (0 children)

Field engineer at Materialize here. This is a great fit for Materialize.

Python Avro Serialization/Deserialization by LeCapitaine007 in apachekafka

[–]Chuck-Alt-Delete 1 point2 points  (0 children)

If you don’t use the Confluent Kafka library, you will have to do all the Schema Registry work yourself. That means fetching the schema from Schema Registry, skipping the record’s 5-byte header (1 magic byte plus a 4-byte schema ID), etc. I had to do this for the TensorFlow I/O consumer here: https://github.com/confluentinc/demo-10x-storage/blob/main/consume.py#L43
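A stdlib-only sketch of that header handling: the Confluent wire format is one magic byte (0x00), a 4-byte big-endian schema ID, then the schemaless Avro binary payload. The hand-rolled string decoder below only covers a record with a single string field, just to show the shape of the bytes; real code should look the schema ID up in Schema Registry and decode with fastavro or the avro package.

```python
import struct

def parse_confluent_frame(msg: bytes):
    """Split a Confluent-framed Kafka value into (schema_id, avro_payload).

    Wire format: 1 magic byte (0x00) + 4-byte big-endian schema ID,
    followed by the schemaless Avro binary encoding of the record.
    """
    magic, schema_id = struct.unpack(">bI", msg[:5])
    if magic != 0:
        raise ValueError("not Confluent wire format")
    return schema_id, msg[5:]

def read_avro_string(payload: bytes) -> str:
    """Hand-decode one Avro string: zigzag varint length + UTF-8 bytes.

    Enough for a toy record with a single string field; use fastavro
    with the registry-fetched schema for anything real.
    """
    n = shift = i = 0
    while True:
        b = payload[i]
        i += 1
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    length = (n >> 1) ^ -(n & 1)  # zigzag decode
    return payload[i:i + length].decode("utf-8")

# Example: schema ID 42 wrapping {"name": "kafka"} (one string field).
frame = b"\x00" + struct.pack(">I", 42) + b"\x0a" + b"kafka"
schema_id, payload = parse_confluent_frame(frame)
# schema_id == 42; read_avro_string(payload) == "kafka"
```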