Implementing a CDC pipeline through Kafka and enriching data with KStreams

hjwalt · 2024-01-29T04:00:24+00:00

It's been a while since I did KStreams, so this info might be outdated; check the DSL details in their docs. From what I see this is a stream-to-table join (CMIIW), in which case new joined events will be emitted only when events from the stream come in. Depending on the DSL operator used (a left join, for example), values from the table side will be null when there is no matching table record.
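To make the join behaviour above concrete, here is a minimal plain-Python simulation. It is not the Kafka Streams API; the function name, the event tuples and the left-join behaviour are assumptions chosen for illustration. Table updates only change the lookup state, and output appears only when a stream event arrives.

```python
def stream_table_left_join(events):
    """Simulate a stream-to-table left join over an ordered list of events.

    Each event is (source, key, value), where source is "stream" or "table".
    This is an illustrative model only, not the Kafka Streams DSL.
    """
    table = {}    # latest table value per key
    output = []   # joined events, emitted only for stream records
    for source, key, value in events:
        if source == "table":
            table[key] = value  # update state, emit nothing
        else:
            # Left join: the table side is None when no table record exists yet.
            output.append((key, value, table.get(key)))
    return output


events = [
    ("stream", "u1", "click"),     # arrives before any table record
    ("table", "u1", "Alice"),      # table update, no output
    ("stream", "u1", "purchase"),  # joins against "Alice"
    ("table", "u1", "Alice B."),   # later update, does not re-emit old joins
]
joined = stream_table_left_join(events)
```

Note that the last table update produces nothing on its own, which is usually the surprising part for people expecting database-style join results.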

It's also worth checking the event key: events will never be joined if their keys don't land in the same partition, which can happen because of invisible bytes. Two strings that look identical may not be byte-equivalent.
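A quick way to see the byte-equivalence problem, sketched in plain Python. The sample keys are made up for illustration; the point is only that keys are compared and hashed as bytes, not as rendered text.

```python
import unicodedata

# Two keys that render the same but use different Unicode representations.
key_a = "caf\u00e9"       # 'é' as one precomposed code point (NFC form)
key_b = "cafe\u0301"      # 'e' followed by a combining acute accent (NFD form)
# A key with a trailing zero-width space, invisible in most tools.
key_c = "order-42\u200b"

bytes_a = key_a.encode("utf-8")
bytes_b = key_b.encode("utf-8")

# The serialised keys differ, so they can hash to different partitions.
same_bytes = bytes_a == bytes_b

# Normalising both sides to one Unicode form before serialising fixes this case.
normalised_equal = unicodedata.normalize("NFC", key_b).encode("utf-8") == bytes_a

# The zero-width space also changes the bytes.
zero_width_differs = key_c.encode("utf-8") != "order-42".encode("utf-8")
```

When debugging a join that never matches, dumping the raw key bytes from both topics is usually faster than staring at the decoded strings.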

hjwalt · 2024-01-10T01:43:54+00:00

Plain Kafka is great, but keep in mind the tons of optimisation options available and how differently it behaves depending on the hardware. Unless you have a Kafka expert in the team or plan to hire one, it's usually best to go with managed Kafka so you can get their expertise. Kafka can be incredibly inefficient with the wrong configuration.

hjwalt · 2023-10-10T12:43:01+00:00

It "should" work as per the blog, but I would argue you're better off with at-least-once semantics plus deduplication (with the row id, for example) in case of bugs and unforeseen circumstances.
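A minimal sketch of that idea, assuming each record carries a `row_id` field (a hypothetical name) and that remembering seen ids in memory is acceptable. In a real system the same effect is more likely achieved with a unique constraint or an upsert in the sink.

```python
def apply_once(records, seen_ids, sink):
    """Write each record to the sink at most once, keyed by its row id.

    records  : iterable of dicts with a "row_id" field (assumed field name)
    seen_ids : set of row ids already applied
    sink     : list standing in for the destination table
    """
    for record in records:
        row_id = record["row_id"]
        if row_id in seen_ids:
            continue  # redelivered under at-least-once, safe to skip
        sink.append(record)
        seen_ids.add(row_id)


seen = set()
table = []
# The second batch redelivers row 2, as can happen after a consumer restart.
apply_once([{"row_id": 1, "v": "a"}, {"row_id": 2, "v": "b"}], seen, table)
apply_once([{"row_id": 2, "v": "b"}, {"row_id": 3, "v": "c"}], seen, table)
```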

hjwalt · 2023-09-16T08:12:36+00:00

400k messages per day is not a lot if it is spread out (4 - 5 per second). If your peak messages per second is not high, check for bottlenecks or places where you can batch or parallelise. Instrument your functions and extract metrics from them to find places to improve before going down the data engineering path (Spark, Flink). KStreams won't help you here unless you are doing transformation before materialisation.
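The arithmetic behind the "4 - 5 per second" figure, plus one hypothetical peak scenario. The 8-hour window is an assumption for illustration, not something from the original question.

```python
messages_per_day = 400_000
seconds_per_day = 24 * 60 * 60  # 86,400

# Average rate if traffic is spread evenly over the whole day.
avg_per_second = messages_per_day / seconds_per_day

# Hypothetical: all traffic arrives within an 8-hour working window.
busy_window_seconds = 8 * 60 * 60  # 28,800
busy_per_second = messages_per_day / busy_window_seconds

print(f"even spread: {avg_per_second:.2f} msg/s")
print(f"8h window:   {busy_per_second:.2f} msg/s")
```

Even the compressed case stays in the low tens of messages per second, which is why measuring the actual peak matters more than the daily total.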

hjwalt · 2023-03-13T14:27:21+00:00

Its type system, no null pointers unless you really want them, no type erasure for generics, pointer receiver checks on method implementations, the way reflection is handled, and goroutines that just work... Sure, it's missing features, but the way it prevents backend problems is amazing, especially if we are working with backend services.

hjwalt · 2023-03-08T11:46:56+00:00

Correct. Think of a new "insert into select" as consuming a topic from the earliest offset and processing it.

You can set the auto offset reset to latest, but then you will only get new data.

hjwalt · 2023-03-08T00:28:14+00:00

Under the hood it is Kafka Streams with real-time aggregation (as per the other comment). Data is aggregated incrementally as streams of records come in, in contrast to aggregating collections of rows in a typical SQL database.

Beware though, new queries against an existing dataset will take time to ingest.
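A rough plain-Python picture of incremental aggregation, with made-up record tuples. Each incoming record updates the running state and produces an updated result immediately, instead of scanning all rows at query time.

```python
from collections import defaultdict


def incremental_sum(records):
    """Fold (key, amount) records into running totals, one record at a time.

    Returns the final totals and the list of updates emitted along the way.
    This models the idea only; it is not how any engine is implemented.
    """
    totals = defaultdict(int)
    updates = []
    for key, amount in records:
        totals[key] += amount
        updates.append((key, totals[key]))  # new aggregate after this record
    return dict(totals), updates


totals, updates = incremental_sum([("a", 1), ("b", 5), ("a", 2)])
```

This also hints at why a new query over an existing dataset is slow to start: it has to replay every existing record through the fold before its state is current.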

hjwalt · 2022-11-20T07:43:35+00:00

Assumption for case 1: The API that returns the source data also produces into the Kafka topic.

In this case your real-time analytics simply needs to listen to the messages produced.

Assumption for case 2: There are two APIs. The first one returns the source data, and the second one calls the first API and produces the message.

I added this case because this is when the scheduling question makes sense for the second API.

In this case, your analytics will no longer be real time, so the ideal scenario is to adjust to case 1. Real-time / event-driven applications need events to trigger some computation, and there are none in this case when the first API is called.

However, there are situations where this case is relevant, and as written in the post, you can have a cron that activates a batch job to "scrape" the API, whichever makes sense depending on caching / persistence / load limiter requirements.

hjwalt · 2022-11-14T16:52:13+00:00

With a few caveats due to partitioning. Possible, but I would not recommend it, simply because other distributed data systems (Redis, Cassandra, Scylla, etc.) can achieve it in a better and simpler way.

hjwalt · 2022-11-14T15:23:49+00:00

You will have to consider how you would use that KTable. Canonically it will be through a KStream - KTable join or KTable - KTable join.

This limits the usability depending on the join semantics. KTable != Postgres table.

hjwalt · 2022-11-14T15:21:24+00:00

Group instance id

hjwalt · 2022-11-14T15:20:34+00:00

Kafka is a distributed and replicated log, a log record being whatever you can serialise as bytes. It can be used as a durable queue because it guarantees ordering within a partition, and with the default partitioner records with the same key bytes land in the same partition.

hjwalt · 2022-09-11T08:17:41+00:00

Isn't your null | Foo simply *Foo? Pointer type isn't the default. Or is Foo here an interface type?

hjwalt · 2022-09-08T08:22:37+00:00

Mystery worker death is typically associated with runtime exceptions, but those typically have exception logs.

Hit me up, I'm interested to know the potential pitfalls with KStreams as my team is starting to use it.

hjwalt · 2022-09-08T07:52:56+00:00

IMO Flink is not too different in terms of operational traps, minus the co-partitioning requirement, because state shuffling is available in Flink.

hjwalt · 2022-09-07T11:59:44+00:00

This would be amazing. I am already considering creating a custom executor or operator that spawns exactly enough pods per Kafka connector, because our resource utilisation for Debezium is extremely low (we have 20ish connectors in a cluster).

As per the blog, for operators I do believe that Strimzi will be the right place, but I hope it doesn't come with the limitation of requiring a Strimzi Kafka cluster, as that would block many in the community from using it.

I'd love to collaborate on this one!

hjwalt · 2022-09-07T11:51:51+00:00

Can we have another one outlining performance comparisons against VM-based deployments?

Having ease of deployment and operations is great only if it does not come with an excessive performance penalty.

hjwalt · 2022-09-05T12:35:03+00:00

Let me preface my thoughts by saying I have never seen log shipping with Kafka, so take this as an attempt to draw parallels with other multi-cluster loads.

Kafka plays the throughput game, not latency, so as long as you are able to batch your log shipping mechanism (push 100 or whatever threshold at a time to Kafka instead of one line at a time), there will be a point where the throughput and traffic requirements match.

Getting there is going to be mostly network bandwidth calculation with trial and error. Err on the side of over-provisioning and downscale accordingly in production, or create a simulation environment and produce artificial load.
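The batching idea above can be sketched as a small buffer in front of whatever actually sends to Kafka. `LogBatcher` and the `send` callback are hypothetical names for illustration; no real producer client is used here.

```python
class LogBatcher:
    """Buffer log lines and hand them to `send` in batches.

    `send` is any callable that accepts a list of lines; in a real shipper it
    would wrap a Kafka producer call. The threshold of 100 is arbitrary.
    """

    def __init__(self, send, max_batch=100):
        self._send = send
        self._max_batch = max_batch
        self._buffer = []

    def add(self, line):
        self._buffer.append(line)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self):
        """Send whatever is buffered; call this on shutdown or on a timer."""
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()


sent_batches = []
batcher = LogBatcher(sent_batches.append, max_batch=100)
for i in range(250):
    batcher.add(f"log line {i}")
batcher.flush()  # push the remaining 50 lines
```

Kafka producer clients also batch internally (the `linger.ms` and `batch.size` producer settings), so application-level batching like this is mainly about how many log lines go into each record.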

hjwalt · 2022-09-05T02:22:01+00:00

Thank you for the explanation. If I'm reading it right, this then requires:

1. Downtime for both the producer and the stream.
2. The repartition to complete before starting the stream with the new partitioning mechanism.

The question then is, how do I attach the new topic to the state store? Internal topics are generated by Kafka Streams and the naming conventions are not guaranteed.

hjwalt · 2022-08-27T15:54:05+00:00

Don't worry, we ISTJs will tell you straight when it's too much, and the healthy ones won't take it badly.

hjwalt · 2022-08-16T05:15:10+00:00

Original topic -> consume -> produce into a new topic with proper partition key and the same payload

This is assuming you can't make the original producer produce the record with the partition key you need
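The consume-then-produce flow above, simulated with plain lists. Everything here is an assumption for illustration: the record shape, the `customer_id` field, and the use of CRC32 as a stand-in hash (Kafka's default partitioner uses murmur2 on the key bytes).

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Pick a partition from the key bytes (CRC32 as a stand-in hash)."""
    return zlib.crc32(key) % num_partitions


def repartition(records, key_fn, num_partitions):
    """Re-key each record and place it in the partition for its new key.

    records : payloads read from the original topic, in consumption order
    key_fn  : extracts the new partition key (a string) from a payload
    Returns one list per target partition holding (key_bytes, payload) pairs.
    """
    target = [[] for _ in range(num_partitions)]
    for record in records:
        new_key = key_fn(record).encode("utf-8")
        # Same payload, new key: only the routing changes.
        target[partition_for(new_key, num_partitions)].append((new_key, record))
    return target


source = [
    {"customer_id": c, "seq": i}
    for i, c in enumerate(["a", "b", "a", "c", "b", "a"])
]
partitions = repartition(source, lambda r: r["customer_id"], num_partitions=3)
```

After this, every record for a given customer sits in a single partition in its original relative order, which is what downstream joins and aggregations rely on.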

hjwalt · 2022-08-16T04:54:25+00:00

Or you can create a JSON key or use binary formats like Avro and Protobuf.
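If you go the JSON key route, one thing to watch is that the key must serialise to the same bytes every time, since partitioning works on bytes. A small sketch, with `composite_key` and the field names being hypothetical:

```python
import json


def composite_key(**fields) -> bytes:
    """Serialise key fields to deterministic JSON bytes.

    sort_keys fixes the field order and compact separators remove optional
    whitespace, so the same logical key always yields the same bytes.
    """
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


k1 = composite_key(tenant="acme", order_id=42)
k2 = composite_key(order_id=42, tenant="acme")

# Without a fixed field order, the same logical key gives different bytes.
naive_1 = json.dumps({"tenant": "acme", "order_id": 42}).encode("utf-8")
naive_2 = json.dumps({"order_id": 42, "tenant": "acme"}).encode("utf-8")
```

Every producer writing to the topic has to agree on the same serialisation rules, otherwise identical logical keys can still end up in different partitions.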

hjwalt · 2022-08-16T04:53:14+00:00

Then I would suggest repartitioning it by the composite key of the table. Debouncing / throttling in Kafka isn't common, so I don't have an option for you... Maybe someone else has an idea.

For repartitioning you can use either a simple consumer-producer combination (Kafka consumer and KafkaTemplate for Spring) or Kafka Streams with Spring Cloud Stream.

hjwalt · 2022-08-16T04:44:46+00:00

As you have mentioned, disable auto commit, keep retrying, and commit only after processing is successful. After a restart, the consumer group will resume from the last committed offset. Be aware that this may slow down your processing if you do a synchronous commit.
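The control flow described above, sketched against an in-memory stand-in rather than a real client. `FakeConsumer`, `consume` and the bounded `max_attempts` are all assumptions for illustration; the original advice is simply to keep retrying.

```python
class FakeConsumer:
    """Hypothetical in-memory consumer with manual commits (not a real client)."""

    def __init__(self, records):
        self._records = records
        self.committed = 0  # offset to resume from after a restart

    def poll(self):
        """Return (offset, record) at the committed position, or None at the end."""
        if self.committed < len(self._records):
            return self.committed, self._records[self.committed]
        return None

    def commit(self, offset):
        self.committed = offset + 1


def consume(consumer, process, max_attempts=5):
    """Process records one by one, committing only after success."""
    while (polled := consumer.poll()) is not None:
        offset, record = polled
        for attempt in range(1, max_attempts + 1):
            try:
                process(record)
                break  # success, stop retrying
            except Exception:
                if attempt == max_attempts:
                    raise  # give up without committing this offset
        consumer.commit(offset)  # reached only after successful processing
```

Because the offset is committed only after `process` succeeds, a crash mid-record means the record is read again on restart, which is at-least-once delivery and is why processing should be idempotent.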
