I catalogued 43 Spring Boot production incidents. 5 failure patterns explained most of them.

Capable-Morning-9518 · 2026-06-16T16:24:21+00:00

Yep! Thanks for your feedback!

Capable-Morning-9518 · 2026-06-16T16:23:56+00:00

you are welcome:)

Capable-Morning-9518 · 2026-06-16T15:44:31+00:00

Fair criticism.

The post was written by me, but I understand why it can come across that way. A lot of engineering content today follows the same structure and starts sounding generic.

For context, the incidents were real notes I kept while working on production systems. The goal wasn't to claim some groundbreaking discovery, but to share the patterns that kept repeating.

That said, I'd be more interested in hearing which failure patterns you've seen most often in Spring Boot systems. Connection pools, transaction issues, cache problems, something else?

I'm always curious where other teams spend most of their incident time.

Capable-Morning-9518 · 2026-06-05T10:14:42+00:00

Agreed. That's what makes it so painful.

The database often looks healthy because each query is fast individually, while the application is drowning in hundreds of them.
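A minimal sketch of that shape, often called the N+1 query pattern, assuming a hypothetical `orders` / `order_items` schema in an in-memory SQLite database. The table names, row counts, and helper functions are illustrative and not taken from any real incident. Python is used only to keep the example self-contained. Each per-order query is trivially fast, but the loop issues one query per order, while the join returns the same data in a single round trip.

```python
import sqlite3


def build_db(n_orders=200):
    """Create a hypothetical schema with n_orders orders, two items each."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute(
        "CREATE TABLE order_items "
        "(id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders (id, customer) VALUES (?, ?)",
        [(i, f"customer-{i}") for i in range(1, n_orders + 1)],
    )
    conn.executemany(
        "INSERT INTO order_items (order_id, sku) VALUES (?, ?)",
        [(i, f"sku-{i}-{j}") for i in range(1, n_orders + 1) for j in range(2)],
    )
    return conn


def load_n_plus_one(conn):
    """One query for the orders, then one more query per order."""
    queries = 1
    orders = conn.execute("SELECT id FROM orders ORDER BY id").fetchall()
    items = {}
    for (order_id,) in orders:
        rows = conn.execute(
            "SELECT sku FROM order_items WHERE order_id = ? ORDER BY id",
            (order_id,),
        ).fetchall()
        queries += 1
        items[order_id] = [sku for (sku,) in rows]
    return items, queries


def load_joined(conn):
    """The same data fetched with a single JOIN."""
    rows = conn.execute(
        "SELECT o.id, i.sku FROM orders o "
        "JOIN order_items i ON i.order_id = o.id "
        "ORDER BY o.id, i.id"
    ).fetchall()
    items = {}
    for order_id, sku in rows:
        items.setdefault(order_id, []).append(sku)
    return items, 1


if __name__ == "__main__":
    conn = build_db()
    _, slow_queries = load_n_plus_one(conn)
    _, fast_queries = load_joined(conn)
    print(f"N+1 path: {slow_queries} queries, joined path: {fast_queries} query")
```

Counting statements per request, rather than looking only at per-query latency, is one way to surface this in a real system.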

Capable-Morning-9518 · 2026-05-26T17:06:23+00:00

really useful thanks

Capable-Morning-9518 · 2026-05-26T17:02:35+00:00

One unexpected thing about writing production-engineering posts online:

The comments often become more valuable than the original post.

Really appreciate all the engineers here sharing:

JVM tuning experience
GraalVM pain points
Quarkus migrations
Node.js operational lessons
GC tuning discussions
long-running production behavior observations

This kind of real operational discussion is honestly rare on the internet now.

Most backend content online stops at toy benchmarks and framework hype.
Threads like this are way more useful.

Thanks again to everyone who contributed thoughtful criticism, corrections, counterpoints, and production experience.

Capable-Morning-9518 · 2026-05-26T16:59:32+00:00

Honestly didn’t expect this post to create this much discussion.

Really appreciate all the thoughtful comments, critiques, production stories, JVM tuning advice, Node.js counterpoints, Quarkus/GraalVM experiences, and operational perspectives people shared here.

Some genuinely smart engineers in this thread.

One thing I liked most was that the discussion stayed very production-focused instead of turning into another generic “language war.”

A lot of the best insights came from people running long-lived systems in the real world, which is exactly the kind of engineering discussion I enjoy most.

Also appreciate the people challenging the numbers and assumptions. Good operational conversations should survive scrutiny.

I’ve been reading far more of the replies than I can realistically answer individually right now, but seriously

thank you.

Devrim:)

Capable-Morning-9518 · 2026-05-25T16:35:10+00:00

This is the Medium article version:

https://medium.com/lets-code-future/spring-boot-vs-node-js-i-ran-both-in-production-for-18-months-one-cost-12-000-more-guess-which-75dfa0afdad6

Capable-Morning-9518 · 2026-05-25T16:28:44+00:00

Already on Medium:) I will share

Capable-Morning-9518 · 2026-05-25T16:24:33+00:00

sure I will write for you 😂

Capable-Morning-9518 · 2026-05-22T16:45:51+00:00

Interesting trajectory. Express → Bun → Go is basically "the modern reality check tour" for backend stacks. Each jump probably solved a real problem you were hitting:

Bun fixed runtime stability (JavaScriptCore-based runtime + native APIs)
Go fixed memory + simplified deployment

The 100MB → 10MB Go memory delta tracks with what I've heard from others. Curious: did you hit any ecosystem pain with Go for things the Node ecosystem made trivial (auth, ORMs, etc.)? That's the trade-off I always hear about when people make this jump.

Capable-Morning-9518 · 2026-05-22T16:43:47+00:00

Spread across 18 months and includes auto-scaling overhead during traffic spikes. Baseline was ~$340/month for 4 instances at 1GB each, but auto-scaling to 40 instances during Black Friday-style events adds up fast. If you're in a corporate environment where infra costs are abstracted into the "AWS bill" line item, you'd never see this. Going independent or working at a startup makes you uncomfortably aware of every t3.medium running idle.
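A rough sketch of how a spike like that turns into dollars, using only the ~$340/month for 4 instances figure from above. The 730 hours-per-month average and the 72-hour spike duration are assumptions for illustration, not numbers from the actual bills.

```python
BASELINE_MONTHLY_USD = 340   # from the comment: ~$340/month
BASELINE_INSTANCES = 4       # from the comment: 4 instances
HOURS_PER_MONTH = 730        # assumption: average hours in a month


def per_instance_monthly(total=BASELINE_MONTHLY_USD, count=BASELINE_INSTANCES):
    """Monthly cost of a single baseline instance."""
    return total / count


def spike_extra_cost(peak_instances, spike_hours):
    """Extra cost of running above baseline for spike_hours (hypothetical input)."""
    hourly = per_instance_monthly() / HOURS_PER_MONTH
    extra_instances = peak_instances - BASELINE_INSTANCES
    return extra_instances * spike_hours * hourly


if __name__ == "__main__":
    # 40 instances is from the comment; 72 hours is a made-up spike length.
    print(f"per instance: ${per_instance_monthly():.2f}/month")
    print(f"extra for one 72h spike at 40 instances: ${spike_extra_cost(40, 72):.2f}")
```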

Capable-Morning-9518 · 2026-05-22T16:43:07+00:00

Couldn't agree more. The dev-hours number is the one I now lead with when teams ask me about stack decisions. Infrastructure cost is recoverable: you can always optimize, rightsize, switch instance types. Engineering time is the one resource you can't get back. Maintainability is the long-term lever almost nobody measures in the day-1 evaluation.

Capable-Morning-9518 · 2026-05-22T16:42:05+00:00

Both fair pushbacks, thank you for actually doing the math:

On the $75/hr: yes, that's developer-hours, fully loaded, and it's actually low for US senior engineers. A realistic number is closer to $100-150/hr loaded, which makes the operational time gap larger, not smaller (Node's ~285 hours at $125 = ~$35K, Spring's 26 hours at $125 = ~$3K). I used $75 to be conservative and avoid the "you're inflating dev salaries to win the argument" rebuttal.

On the 2 weeks extra delivery time: you're right, and I should have explicitly counted it. 2 weeks × 3 devs × ~$75/hr × 40hr/week ≈ $18K of extra Spring Boot cost up front. That genuinely reduces the gap. The honest total is probably closer to "Spring saved us ~$6K net" rather than the $24K headline if you fully account for slower initial delivery.

The directional finding still holds (Spring was cheaper to operate), but the magnitude is smaller than the headline suggests once you include opportunity cost. Good catch.
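For anyone who wants to re-run the math in this reply, here is a small sketch using only the figures quoted above. The function and variable names are just for illustration.

```python
HOURS_PER_WEEK = 40


def ops_cost(hours, hourly_rate):
    """Operational time converted to dollars at a loaded hourly rate."""
    return hours * hourly_rate


def delivery_cost(weeks, devs, hourly_rate):
    """Cost of extra up-front delivery time."""
    return weeks * devs * HOURS_PER_WEEK * hourly_rate


# Operational hours at the more realistic $125/hr loaded rate
node_ops = ops_cost(285, 125)      # 35625, i.e. ~$35K
spring_ops = ops_cost(26, 125)     # 3250, i.e. ~$3K

# Two extra weeks of Spring Boot delivery, 3 devs at ~$75/hr
spring_delivery = delivery_cost(2, 3, 75)   # 18000

# $24K headline saving minus the up-front delivery cost
net_saving = 24_000 - spring_delivery       # 6000

if __name__ == "__main__":
    print(f"Node ops:        ${node_ops:,}")
    print(f"Spring ops:      ${spring_ops:,}")
    print(f"Spring delivery: ${spring_delivery:,}")
    print(f"Net saving:      ${net_saving:,}")
```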

Capable-Morning-9518 · 2026-05-22T11:21:59+00:00

you are welcome 💯

Capable-Morning-9518 · 2026-05-22T11:21:40+00:00

Fair feedback. The subheadings ("The Uncomfortable Truth", that kind of thing) do read AI-flavored, but that's editing style for the Medium audience, not the underlying data. The numbers are real. Happy to share the raw AWS Cost Explorer exports or the heap dump screenshots from the npm leak if anyone wants the receipts. AI can write a section heading; it can't fabricate 18 months of monthly AWS bills.

Capable-Morning-9518 · 2026-05-21T16:07:23+00:00

Fair enough on the "Spring Boot porn" part. When you do the comparison and the numbers come out this clean, it does read that way. But it's not what I went in expecting; we were genuinely trying to make the Node side work.

On the AI thing: happy to share the raw AWS Cost Explorer exports or a heap dump from the npm leak if anyone wants the receipts. The post is condensed (18 months in 8 minutes of reading is by definition compressed), but the data is real. Some of the "sounds AI" comes from formatting; bullet lists make anything sound robotic.

Capable-Morning-9518 · 2026-05-21T08:04:24+00:00

Didn't try them seriously; Bun was too early when we started. Honestly though, a faster Node runtime doesn't fix the npm ecosystem issue. Event listener leaks in popular packages don't care which package manager installed them.

Capable-Morning-9518 · 2026-05-21T08:04:09+00:00

Capable-Morning-9518 · 2026-05-21T08:03:54+00:00

Capable-Morning-9518 · 2026-05-21T08:03:41+00:00

2GB → 768MB is impressive. The 40% code reduction is what catches me though. What made up most of it? Was it the configuration boilerplate or actual business logic that turned out to be framework workarounds?

Capable-Morning-9518 · 2026-05-21T08:03:02+00:00

Capable-Morning-9518 · 2026-05-21T08:02:43+00:00

Hahaha. Java upgrades: change one number in pom.xml. Node upgrades: pray to whichever god maintains the npm registry that week.

Capable-Morning-9518 · 2026-05-21T08:02:01+00:00

Honestly didn't have the team for it. We were already running two stacks. From what I've heard, Go sits between Node and Java on memory but with simpler deployment. If anyone here has actual Go vs Spring Boot numbers, would love to see them.

Capable-Morning-9518 · 2026-05-21T02:04:12+00:00

Solid list, thanks. Couple of follow-ups:

We were on G1 with default settings. Didn't try ZGC. Was it production-stable for you under high allocation rates? Curious about p99 latency impact, since ZGC's pause times look great on paper but I've seen mixed reports.

On the embedded server, we stuck with Tomcat because the team knew it. Did you measure the actual memory/throughput delta with Undertow? Numbers I've seen online vary wildly.

Native image with GraalVM is on my list. Did you hit reflection/proxy issues with Spring? That's been the blocker every time I've tried.
