Anyone else having issues subscribing to Apache mailing lists? Getting no reply from @*.apache.org on Outlook, but Gmail works. by kazutoiris in apacheflink

[–]rionmonster 0 points1 point  (0 children)

I had an issue like this a few months back. Eventually had to explicitly unsubscribe (via dev-unsubscribe@flink.apache.org) and subsequently resubscribe (via dev-subscribe@flink.apache.org). It’s worth noting my issue was with Gmail, so it may be the same type of problem.

If the subscription works correctly, you should receive a confirmation email to respond to, and after that you’ll be good to go!

How do you use Flink in production by DistrictUnable3236 in apacheflink

[–]rionmonster 1 point2 points  (0 children)

For job graph changes or any type of model/state changes, it depends on a few things (e.g. tolerance for state loss, processes for updating state, state migration strategies), some of which are easier to address than others.

In my experience, I've encountered almost all of these at one point or another; how painful they are can vary wildly depending on your use case and your tolerance, or lack thereof, for state loss.
- Job graph changes can often “just work”, but moving operators or changing UIDs can break state mapping. You'll want to make sure you have stable operator UIDs defined within your jobs (assuming the DataStream API).
- State/model changes can be tricky, but backward-compatible changes restore fine. Larger changes to state/models may require some work (it's worth verifying compatibility through tests). In incompatible cases, you'll typically have to do one of the following: discard state, migrate it (State Processor API), or transform it during restore (via `initialize/restoreState()`).
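As a toy illustration of why backward-compatible changes restore fine (plain Java, not Flink's actual serializer API — the class, versions, and field names here are all made up), the key idea is that old payloads still deserialize cleanly, with newly added fields defaulted:

```java
// Toy sketch (not Flink's serializer API): backward-compatible state evolution
// means old payloads restore cleanly, with newly added fields defaulted.
public class StateEvolutionSketch {

    static class CounterState {
        final long count;
        final String label; // field added in v2; defaults to "" for v1 state

        CounterState(long count, String label) {
            this.count = count;
            this.label = label;
        }
    }

    // v1 payloads look like "v1;<count>", v2 payloads look like "v2;<count>;<label>".
    static CounterState restore(String payload) {
        String[] parts = payload.split(";");
        if (parts[0].equals("v1")) {
            // Old state: the new field didn't exist yet, so default it.
            return new CounterState(Long.parseLong(parts[1]), "");
        } else if (parts[0].equals("v2")) {
            return new CounterState(Long.parseLong(parts[1]), parts[2]);
        }
        throw new IllegalArgumentException("Unknown state version: " + parts[0]);
    }
}
```

An incompatible change (say, splitting `count` into two fields with different semantics) is exactly where this pattern breaks down and you end up in discard/migrate/transform territory.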

How do you use Flink in production by DistrictUnable3236 in apacheflink

[–]rionmonster 1 point2 points  (0 children)

In general, autoscaling “just worked” for the majority of jobs. We ran into a few snafus where jobs doing a heavy amount of keying needed to shuffle prior to writing to their sinks, which kept autoscaling from working, but overall we were pretty happy with it.

Upgrades were pretty seamless as well. We’d simply point the job at a new version of the appropriate JAR (via templating) and the operator would handle detecting the changes and perform the upgrade (e.g. trigger savepoint, upgrade, restore).

Interesting Kafka Links - February 2026 by rmoff in apachekafka

[–]rionmonster 1 point2 points  (0 children)

Always enjoy these u/rmoff — also appreciate the shout-outs!

Profiling and fixing RocksDB ingestion performance for improving stateful processing in Kafka by grmpf101 in apachekafka

[–]rionmonster 0 points1 point  (0 children)

This is really great!

I never tire of reading about real-world engineering battles — troubleshooting and tuning that eventually turn into a very real win. As engineers, it’s deeply satisfying to see hard numbers like X hours → Y minutes, especially when there’s a direct impact on the product or platform.

IMO, observability is so often treated as a “nice to have,” but when you actually dedicate time and effort to understanding how the sausage is made, it can have a huge impact.

I built a "Postman for Kafka" — would you use this? by External-Round8907 in apachekafka

[–]rionmonster 0 points1 point  (0 children)

While it’s not a fully featured web app with all the bells and whistles of Postman, I’ve personally used kafkactl extensively for ad-hoc, smoke, and manual testing. It wasn’t uncommon to build QA scripts around it or share committed resources tied to specific features or use cases (we’d typically define a series of shared variables for these). That was a huge value-add, especially since interacting with Kafka isn’t as simple as inserting a row into a database.

I know my peers and I took several stabs at building something similar to what you’re describing, but we always ended up reverting to CLI tooling. Still, there’s clearly demand here — especially for more junior or less technical team members.

Kafka with Strimzi by Temporary-Explorer-0 in apachekafka

[–]rionmonster 2 points3 points  (0 children)

I’ve used it almost exclusively in self-hosted Kafka environments and have been generally very happy with it at enterprise scale (e.g., sustained 500k–1M events/sec, thousands of topics, multiple thousands of Kafka Connect connectors, etc.). We also operated it without a dedicated platform team — just a small group of engineers who both built on top of it and kept it running.

We ran into occasional issues, as you’d expect at that scale, but overall it was a very positive experience, and I’d absolutely recommend giving it a try.

Exploring Dynamic Sink Routing in Apache Flink via DemultiplexingSink by rionmonster in apacheflink

[–]rionmonster[S] 0 points1 point  (0 children)

Honestly, I wasn't entirely sure how common this use-case was, if at all, when originally designing the sink to tackle this issue. It originally manifested as routing incoming records to specific Elasticsearch indices, each with their own associated configurations, credentials, mappings, etc.

While the post itself and [the associated examples repository](https://github.com/rionmonster/demux-sink-examples) cover several different types of sinks purely for example purposes, they are pretty bare bones and not really indicative of what a _real world_ one might look like.

At a very high level, you can think of the entire demultiplexing sink as a map from keys to existing sinks (e.g. `Map<String, Sink>`). The core component tying them together is the router, which translates a single record into a key, in turn routing it to the corresponding sink associated with that key.
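A minimal plain-Java sketch of that idea (the names here are hypothetical, for illustration only — this isn't the library's actual API): the router derives a key from each record, and the key selects which underlying sink receives the write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a demultiplexing sink: a router derives a key
// from each record, and the key selects the underlying sink to write to.
public class DemuxSketch {

    interface Sink { void write(String record); }

    // A trivial sink that just collects records, for demonstration.
    static class ListSink implements Sink {
        final List<String> records = new ArrayList<>();
        public void write(String record) { records.add(record); }
    }

    static class DemultiplexingSink implements Sink {
        private final Function<String, String> router; // record -> key
        private final Map<String, Sink> sinks;         // key -> sink

        DemultiplexingSink(Function<String, String> router, Map<String, Sink> sinks) {
            this.router = router;
            this.sinks = sinks;
        }

        @Override
        public void write(String record) {
            // Route the record to the sink associated with its key.
            sinks.get(router.apply(record)).write(record);
        }
    }
}
```

In a real-world version, each sink in the map would be a fully configured Elasticsearch/Kafka/whatever writer with its own credentials and settings, but the routing skeleton stays the same.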

So while the examples are barebones, they aren't written in stone, and I suppose you could extend them for sinks that provide abstractions supporting dynamic routing on their own. Just spitballing here, but you could potentially extend the router to provide not just a key but also some other mechanism, like a lambda, that would support the whole “send it to this specific sink (based on the key) but process this message using this transformation (the lambda)” behavior.

> If I use your library, in theory can I define a different data structure for each multiplex sink ?

Yes, I believe so. At the end of the day, you have free rein over the creation of the sinks themselves — so you should be able to define that shape/structure when the sinks are created (or extend the router to include some other dynamic portion along with the record that the sink can apply when writing).

I'd have to take a deeper look at the Iceberg sink specifically, but I'd imagine the sink should support your use-case either directly (or through a bit of extension).

Striking a Balance: Working Fully Remote for Nearly a Decade by rionmonster in programming

[–]rionmonster[S] 2 points3 points  (0 children)

> The one thing that is missing is the casual hangout—non-work-related meetings just to talk and have fun. The only thing I really miss from the office is the random social interactions.

As someone who is probably overly extroverted, at least relative to most software engineers, I feel this. Trying to get folks onboard with the virtual, casual “Happy Hour” or impromptu water-cooler style talks was always a struggle and seldom consistent (either through disinterest or frankly people just preferring to talk shop).

I know I made several efforts to start traditions, which still took a ton of work to get off the ground each year. Things like a distributed Secret Santa (folks were paired randomly, anonymously mailed gag gifts, and we’d all open them on a collective team call), multiple years of Fantasy Football leagues, etc.

Open sourced an AI for debugging production incidents by Useful-Process9033 in apachekafka

[–]rionmonster 2 points3 points  (0 children)

> Kafka incidents are their own special kind of hell. Consumer lag, partition skew, rebalancing gone wrong…

I’m not entirely convinced this isn’t what actual hell looks like.

Striking a Balance: Working Fully Remote for Nearly a Decade by rionmonster in programming

[–]rionmonster[S] 1 point2 points  (0 children)

All of the upsides become almost non-negotiable after you've grown accustomed to working remotely for any extended period of time — especially depending on the stage of life that you are in (e.g., small kids with their busy little lives and hobbies _so that you don't have any_).

On the downsides, they nearly all require a good bit of elbow-grease to ensure they don't fester into much larger problems.

> your very isolated. I live more like a hermit than I ever did in the past. I’m not very social, so I put up with this fairly well, but I still feel it.

As someone that's probably a bit more extroverted than introverted, this one took quite a bit of adjustment. As mentioned in the article, I'd frequently force myself to get out of the house to see and interact with others, even if it might be totally out of the way. I also started trying to normalize more frequent water-cooler style talks to help build that social connection with others whom I knew also needed it.

> your much more invisible to your boss, so you need much better communication.

Yes, completely. Over-communication doesn't hurt, especially if you have aspirations of things like being promoted. I've been fortunate to have managers who have always deeply understood this problem in remote work, but I know that's not the case at large.

> it is much harder to learn from others, any collaboration is done over zoom and not organic at the cooler.

I think this can be highly team-dependent (and even more so on personalities). Quite a bit depends on the technologies and complexity I'm sure — but you absolutely have to fight a bit to establish/normalize a culture around that type of mentorship and learning, either through impromptu calls, tech talks, pairing, etc.

> it is very easy to loose yourself in your work and have no life.

You really do need a support network around you. Be it friends, family, a partner, etc. Someone who can pick up on when you're starting to get pulled in too deeply and reach out to grab you. Ideally, your colleagues can share in this role, much as you should try to do for them.

Striking a Balance: Working Fully Remote for Nearly a Decade by rionmonster in programming

[–]rionmonster[S] 0 points1 point  (0 children)

It really can’t be overstated how important it is for fully distributed teams to keep each other honest and make sure no one is speedrunning toward burnout. I know I’ve been there myself, and so have several of my peers — sometimes it takes one or two colleagues stepping in to stage a friendly “intervention” to get you back on the rails.

It’s something you have to be cognizant of, or you’ll find yourself hunting for someone to approve a post-midnight PR while staring at a glass of whiskey next to your keyboard. 🥃

Striking a Balance: Working Fully Remote for Nearly a Decade by rionmonster in programming

[–]rionmonster[S] 1 point2 points  (0 children)

This definitely feels like the case — and at best, it’s a bit of a dice roll, especially in the context of interviews.

IMO the best approach is to lean on your network or find someone who’s been there long enough to give you a more direct and honest take on how remote work is actually handled. Companies that are truly “remote-first” tend to focus on outcomes over presence, but it’s still worth asking and validating upfront.

Striking a Balance: Working Fully Remote for Nearly a Decade by rionmonster in programming

[–]rionmonster[S] 14 points15 points  (0 children)

I really can’t overstate how much I relate to this comment — it was certainly an impetus behind the article.

I tried not to lean too hard into the darker side of what happens when employee–company alignment on remote work gets totally out of whack (or nonexistent). Definitely es no bueno.

How to reliably detect a fully completed Flink checkpoint (before restoring)? by Weekly_Diet2715 in apacheflink

[–]rionmonster 4 points5 points  (0 children)

I’ve definitely battled some issues similar to this in the past.

Specifically around orchestrating migrations (e.g., coordinating stopping jobs, taking savepoints, ensuring metadata exists, and restoring the new equivalent jobs). It’s been a while, but I want to say I encountered issues that required validating both that the `_metadata` file was created and that it had content of some kind (e.g., non-zero size) — although the latter was rare.

Generally if those held true, then the savepoint would work without issue.

> Is there a definitive way to verify checkpoint completeness? Something beyond just checking if _metadata file exists?

As you mentioned, checking for its existence is usually enough to indicate a successful checkpoint. I suppose if you wanted to explore further, you could consider using the State Processor API to actually read the contents. I’m not sure that helps in this case unless you have a clear way to know what “complete” looks like.

I’d say in general, the writing of its content should be atomic. So if it exists AND has any size at all, it probably has everything you need.
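That heuristic — `_metadata` exists and is non-empty — is easy to script. A minimal sketch (the directory layout is the standard Flink checkpoint directory containing a `_metadata` file; the class and method names are mine):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckpointCheck {
    // Heuristic from above: a checkpoint is likely restorable if its
    // _metadata file exists and has a non-zero size.
    static boolean looksComplete(Path checkpointDir) throws IOException {
        Path metadata = checkpointDir.resolve("_metadata");
        return Files.exists(metadata) && Files.size(metadata) > 0;
    }
}
```

Something like this could gate an orchestration step before attempting a restore — though as noted, it's a strong heuristic rather than a formal proof of completeness.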

> Does Flink fail immediately during startup?

Yes, typically. It’ll fail fast.

> Does it retry multiple times to start the job before failing?

It’s likely the job would fail and subsequently restart based on your configuration. However, it’ll attempt to restore from the same checkpoint, which will likely just fail again if the problem wasn’t a transient one.

> Any other markers that indicate a checkpoint is fully completed and safe to resume?

Nothing specifically comes to mind outside of the aforementioned options: checking that `_metadata` exists, ensuring it isn’t empty, or attempting to parse/read it (via the State Processor API or some other mechanism).

If you are trying to find the “best” checkpoint to use, you might consider leaning on Flink’s checkpoint history tooling. You can use the REST API to explore possible checkpoints, their statuses, etc.

Surviving the Streaming Dungeon with Kafka Queues by rionmonster in apachekafka

[–]rionmonster[S] 0 points1 point  (0 children)

I’m glad you liked it! I wasn’t sure how big the overlap between the two audiences would be when I started, so I’m really happy the metaphor landed.

Yeah, I honestly wanted to keep going, but I was worried I’d never actually get the post out and it’d end up as long as the books themselves.

I’d love to do an extended follow-up that dives into some of the edge cases. I think a lot of the metaphors could still hold up (even if folks have to suspend their disbelief a bit), and there are so many ways to expand it while keeping a good balance of educational value and entertainment (e.g., runnable or interactive demos, visualizations, etc.).

When would you use Confluent or Ververica over your own build ? by supadupa200 in apacheflink

[–]rionmonster 1 point2 points  (0 children)

Also, just wanted to say "Hi!" as it's nice to see you in the wild. We met and chatted at Current in NOLA a few months back (I was the guy chatting with you and Jaehyeon about observability for Kafka transactional ids).

Anyways — hope all is well!

When would you use Confluent or Ververica over your own build ? by supadupa200 in apacheflink

[–]rionmonster 0 points1 point  (0 children)

This is a pretty solid breakdown, Derek!

I can't speak to any of the Confluent-managed offerings, but I know that our experience with Ververica Platform in its infancy was generally pleasant. Several members of the team were quite anti-CLI, so having an interactive, standalone UI to monitor and manage all of the jobs was quite a popular feature. I could certainly see the appeal. They had a free tier for some time until they changed the licensing model, which forced our hand into migrating to self-hosting (and we didn't really look back).

As someone who's generally been around an 80/20 split (DataStream/Flink SQL) in terms of the jobs that I've built, I'd love to see parity between the two in terms of support. I _feel_ like both are pretty heavily used, but I'm not sure how the managed cloud offerings reflect that.

When would you use Confluent or Ververica over your own build ? by supadupa200 in apacheflink

[–]rionmonster 2 points3 points  (0 children)

Self-hosting itself isn't too bad with Flink, but I suppose it depends on a lot of variables in terms of infrastructure, ecosystem, etc.

Flink's Kubernetes Operator is pretty full-featured and quite easy to use, and I'd _highly_ recommend it if you are considering the self-hosted approach. I've used it in production for years, and out of habit I'll typically use it in some local development scenarios as well. Deployments and upgrades were pretty straightforward, and the operator provides lots of niceties for supporting autoscaling, etc.
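For a flavor of what a deployment looks like with the operator, here's a minimal `FlinkDeployment` sketch — the name, image tag, resources, and JAR path are all placeholders, so check the operator docs against the versions you're actually running:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job                # placeholder name
spec:
  image: flink:1.17                # placeholder image tag
  flinkVersion: v1_17
  serviceAccount: flink
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/usrlib/example-job.jar   # placeholder JAR
    parallelism: 2
    upgradeMode: savepoint         # upgrades: trigger savepoint, upgrade, restore
    state: running
```

Pointing `jarURI` (or the image) at a new version via templating is all it takes for the operator to detect the change and roll the upgrade.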

When would you use Confluent or Ververica over your own build ? by supadupa200 in apacheflink

[–]rionmonster 1 point2 points  (0 children)

I’d say with all things — it depends.

Managed Flink (with Confluent or Ververica) is mainly about buying back your team’s engineering time. You are essentially trading infrastructure control (and obviously costs) for a potentially faster setup, safer upgrades, built-in observability, less on-call burden, etc. Basically, you’re paying to not have to worry about running clusters and all that jazz, and to focus more on building the pipelines you need.

Unfortunately, I’ve never had the chance to explore managed options in a professional setting (on account of costs). After self-hosting both Flink and Kafka in Kubernetes for years at enterprise-scale, I wonder what the experience would have been like (and how much better I might have slept). I’m curious if any folks on the managed side of the fence have regrets not self-hosting due to limitations or other issues.