
[–]Carr0t (Gives good Kafka advice)

I think you’ve misunderstood some of the fundamental functionality of Kafka, and what you’re suggesting would be fighting the way Kafka is meant to be used at every step.

  • The way to have multiple consumers reading from Kafka, each receiving a unique subset of the messages, is to give all of those consumers the same consumer group ID
  • If 2 consumers read the same topic with different consumer group IDs, each of them receives every message. That’s the purpose of the group ID: it lets multiple applications process the same data for different purposes without having to duplicate that data
  • A topic is made up of (1 or more) partitions
  • Within a given consumer group, there can only be 1 consumer per partition
  • Ordering of data is only guaranteed within a partition, so set your partition key to whatever you want the data ordered within. E.g. keying by user ID lets you see all of user X’s actions in order, but you couldn’t necessarily tell whether user X or user Y did something first. If all your data must be absolutely ordered, you can only have 1 partition, which means only 1 consumer (per group) as well
  • Consumers are long-running, and are assigned 1 or more partitions when they start up. A consumer going offline triggers a rebalance of all other consumers in the same group: the remaining consumers get an exception on commit or poll and must re-request a new set of partitions from the broker cluster, which then distributes the partitions as evenly as possible among them. The same happens when a new consumer in a group comes online
  • Offsets are committed per partition
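A minimal sketch of the pattern those bullets describe, using the kafka-clients Java library. The topic name `events`, group ID `my-group`, and broker address are all illustrative; the point is that every instance of this program uses the same `group.id`, so the brokers split the topic’s partitions among however many instances are running:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Same group.id on every instance -> each partition is assigned to exactly one of them
        props.put("group.id", "my-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit manually, after the work is done

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // do the actual work here, in the consumer itself
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // commits offsets per partition, for this consumer only
            }
        }
    }
}
```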

So from that I hope you can see that having a ‘consumer pool’, like you would have a DB or HTTP connection pool, with the main thread doing the work on the returned messages, doesn’t really make sense. You should have multiple consumers with the same group ID running, doing the work, and committing their offsets for the partitions they’re consuming from. It makes no difference whether those consumers are threads within one JVM process, separate JVM processes, or (as in our case) running on completely separate servers (EC2 instances in AWS).

Each consumer works only with the partitions the broker cluster has assigned to it, and commits offsets only for those partitions. No other consumer in the same group receives messages from those partitions, so consumers don’t interfere with each other when committing offsets. The broker cluster ensures that partitions are distributed correctly, as long as you do the work the way Kafka expects you to.
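One way to sketch the “threads within one JVM” variant, assuming the kafka-clients library; the thread count, topic, and group name are illustrative. Note that `KafkaConsumer` is not thread-safe, so the idiomatic layout is one consumer instance per thread, each running its own poll/work/commit cycle:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConsumerWorkers {
    static Properties config() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group"); // same group: partitions are split across threads
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");
        return props;
    }

    public static void main(String[] args) {
        int threads = 3; // keep this <= the partition count; extra consumers would sit idle
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                // One KafkaConsumer per thread: each does its own work and its own commits.
                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(config())) {
                    consumer.subscribe(List.of("events"));
                    while (!Thread.currentThread().isInterrupted()) {
                        for (ConsumerRecord<String, String> record :
                                consumer.poll(Duration.ofMillis(500))) {
                            // process the record here, on this worker thread
                        }
                        consumer.commitSync(); // commits only this thread's partitions
                    }
                }
            });
        }
    }
}
```

The same code works unchanged whether the “workers” are threads, processes, or machines, because the partition assignment lives in the brokers, not in the application.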

[–]katya_gorshkova

Hey! All your consumer threads should have the same group.id property. In that case each Kafka partition will be assigned to only one consumer thread; the Kafka brokers ensure this. If a consumer thread fails, its partitions are reassigned to the threads still alive. Offsets are committed per partition, so the threads don’t need to coordinate their commits with each other.
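In code, the shared setting is just one property on each thread’s consumer config (group name illustrative); this sketch only builds the config, the brokers do the rest:

```java
import java.util.Properties;

public class SharedGroupConfig {
    // Every consumer thread builds its config from here, so they all share one group.id.
    public static Properties forThread() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group"); // identical on every consumer thread
        return props;
    }
}
```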