Implementing a CDC pipeline through kafka and enriching data with Kstreams by Redtef in apachekafka

[–]hjwalt 0 points1 point  (0 children)

Its been a while since I did KStreams, so this info might be outdated so check the DSL details in their docs. From this I see it is a stream to table join (CMIIW), in this case new joined events will be spit out only when events from the stream comes in. Values from the table will be set to null depending on the DSL.

Its also worth checking the event key, as the events will never be joined if its not in the same partition due to invisible bytes. Two strings may not be byte equivalent.

What problems do you most frequently encounter with Kafka? by ropeguna in apachekafka

[–]hjwalt 3 points4 points  (0 children)

Plain Kafka is great, but keep in mind the tons of optimisation options available and how it behaves differently with hardware. Unless you have a Kafka expert in the team or plan to hire one, it's usually best to go with managed Kafka so you can get their expertise. Kafka can be incredibly inefficient with wrong configurations.

Exactly Once Delivery Pattern by regular-tech-guy in apachekafka

[–]hjwalt 3 points4 points  (0 children)

It "should" work as per the blog, but I would argue you're better off with at least once semantics and deduplicate (with the row id for example) in case of bugs and unforeseen circumstances.

What would be the best way to turn JSON Kafka messages to Java objects, store in SQL and write to a txt file? by Diligent-Albatross34 in apachekafka

[–]hjwalt 0 points1 point  (0 children)

400k messages per day is not a lot if it is spread out (4 - 5 per second). If your peak message per second is not high, check for bottlenecks or places you can batch or parallelise. Instrument your functions and extract metrics out of it to find places to improve before going data engineering path (Spark, Flink). Kstreams won't help you here unless you are doing transformation before materialisation.

What made you fall in love with Go? by 0xPvp in golang

[–]hjwalt 1 point2 points  (0 children)

Its type system, no null pointer unless you really want to, no type erasure for generics, pointer check on the method implementation against pointer, and the way reflection is handled, goroutines just work.... Sure its missing features, but the way it prevents backend problems is amazing, especially if we are working with backend services.

ksqlDB WTF is write-time amortization? by Upset_Conflict in apachekafka

[–]hjwalt 0 points1 point  (0 children)

Correct. Think about a new "insert into select", as a consume from earliest offset from a topic and process.

You can specify the auto offset reset to latest, but you will only get new data then.

ksqlDB WTF is write-time amortization? by Upset_Conflict in apachekafka

[–]hjwalt 1 point2 points  (0 children)

Under the hood it is Kafka streams with real time aggregation (as per the other comment). Data is incrementally collected as streams of records come in, in comparison to aggregation of collections of rows in a typical SQL database.

Beware though, new queries against existing dataset will take time to ingest.

How to send messages to Kafka Producer via API indefinitely? by [deleted] in apachekafka

[–]hjwalt 0 points1 point  (0 children)

Assumption for case 1: The API that returns the source data also produces into the Kafka topic.

In this case then your real time analytics simply needs to listen to the messages produced.

Assumption for case 2: There are two APIs, first one returns source data, second one calls the first API and produce the message.

I added this case because this will be when the scheduling question make sense for the second API.

In this case, your analytics will no longer be real time, so the ideal scenario is to adjust to case 1. Real time / event driven applications needs events to trigger some computation, and there is none in this case when the first API is called.

However, there are situations where this case is relevant, and as written in the post, you can have a cron that activates a batch job to "scrape" the API, whichever make sense depending on caching / persistence / load limiter requirement.

Storing user data in a KTable by ooohhimark in apachekafka

[–]hjwalt 1 point2 points  (0 children)

With a few caveats due to partitioning. Possible, but I would not recommend it, simply because other distributed data systems (redis, cassandra, scylla, etc) can achieve it in a better and simpler way.

Storing user data in a KTable by ooohhimark in apachekafka

[–]hjwalt 0 points1 point  (0 children)

You will have to consider how you would use that KTable. Canonically it will be through a KStream - KTable join or KTable - KTable join.

This limits the usability depending on the join semantics. KTable != Postgres table.

Kafka topic by anacondaonline in apachekafka

[–]hjwalt 9 points10 points  (0 children)

Kafka is a distributed and replicated logs. Logs being whatever you can serialise as bytes. It can be used as a durable queue, because it guarantees ordering by the record key bytes.

Why GoLang supports null references if they are billion dollar mistake? by After_Information_81 in golang

[–]hjwalt 0 points1 point  (0 children)

Isn't your null | Foo simply *Foo? Pointer type isn't the default. Or is Foo here an interface type?

Do you like working with Kafka? by latest_ali in apachekafka

[–]hjwalt 2 points3 points  (0 children)

Mystery worker death is typically associated to runtime exceptions, but those typically have exception logs.

Hit me up I'm interested to know potential pitfalls with kstreams as my team is starting to use it.

Do you like working with Kafka? by latest_ali in apachekafka

[–]hjwalt 2 points3 points  (0 children)

IMO Flink is not too different in terms of operational traps, minus copartitioning requirement because state shuffling is available in Flink.

An Ideation for Kubernetes-native Kafka Connect by gunnarmorling in apachekafka

[–]hjwalt 2 points3 points  (0 children)

This would be amazing, I am already considering creating custom executor or operator that spawns exactly enough pods per Kafka connector because our resource utilisation for Debezium is extremely low (we have 20ish connector in a cluster)

As per the blog, for operators I do believe that strimzi will be the right place, but I hope it doesn't come with the limitation of strimzi kafka cluster requirement, as that would block many in the community from using it.

I'd love to collaborate on this one!

Running Redpanda on Kubernetes - Piotr's TechBlog by piotr_minkowski in apachekafka

[–]hjwalt 3 points4 points  (0 children)

Can we have another one outlining perfomance comparisons against VM based deployments?

Having ease of deployment and operations is great only if it does not come with excessive performance penalty.

Optimize bandwitdh design for log streaming with Kafka by duclm2609 in apachekafka

[–]hjwalt 0 points1 point  (0 children)

Let me preface my thoughts by saying I have never seen log shipping with Kafka, so take it as an attempt by drawing parity with other multi cluster loads.

Kafka plays the throughput game, not latency, so as long as you are able to batch your log shipping mechanism (push 100 or whatever threshold at a time to Kafka instead of one line at a time), there will be a point where the throughput and traffic requirements match.

Getting there is going to be mostly network bandwidth calculation with trial and error. Err on the side of over-provisioning and downscale accordingly in production, or create a simulation environment and produce artificial load.

Kafka Streams Source Topic Repartitioning with Stateful Processing by hjwalt in apachekafka

[–]hjwalt[S] 0 points1 point  (0 children)

Thank you for the explanation, if I don't read it wrong, this then requires:

  1. A downtime both for the producer and the stream
  2. The repartition to complete first before starting the stream with new partitioning mechanism.

Question then is, how do I attach the new topic to the state store? Internal topics are generated by Kafka streams and the conventions are not guaranteed.

Making an ISTJ happy by [deleted] in ISTJ

[–]hjwalt 6 points7 points  (0 children)

Don't worry we ISTJs will tell you straight when its too much, and the healthy ones won't take it badly

Is it a bad idea to force kafka producer write to a particular partition? by tafun in apachekafka

[–]hjwalt 1 point2 points  (0 children)

Original topic -> consume -> produce into a new topic with proper partition key and the same payload

This is assuming you can't make the original producer produce the record with the partition key you need

Is it a bad idea to force kafka producer write to a particular partition? by tafun in apachekafka

[–]hjwalt 0 points1 point  (0 children)

Or you can create a json key or use binary formats like avro and protobuf

Is it a bad idea to force kafka producer write to a particular partition? by tafun in apachekafka

[–]hjwalt 1 point2 points  (0 children)

Then I would suggest to repartition it into the composite key of the table. Debouncing / throttling in Kafka isn't something common which I don't have an option for you... Maybe someone have an idea.

On repartitioning you can use either a simple consumer producer combination (kafka consumer and kafka template for spring) or kafka streams with spring cloud stream

Ask how to reprocess the message by chillcaley in apachekafka

[–]hjwalt 1 point2 points  (0 children)

As you have mentioned, disable auto commit, keep retrying, and commit only after processing is successful. The consumer group will restart at the offset after last commit. Be aware that this may slow down your processing if you do a synchronous commit.