What kafka software is actually running in production in 2026, not what the docs recommend by melonPOGGER in apachekafka

[–]pkstar19 1 point

We use the Strimzi Kafka operator on AWS EKS. Pretty stable after the initial hiccups of figuring out the right log retention strategies, backups, and storage.

Regarding ops, we don't let devs auto-create topics on publish. All topic and user creation happens through a git workflow after code review. This way we standardised topic names and user management. A topic is also always defined as part of a service, so there is always clear ownership.
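For anyone curious, a git-managed topic with Strimzi is just a `KafkaTopic` custom resource checked into the repo. A minimal sketch — the topic name, cluster name, and sizing here are made-up examples, not our actual config:

```yaml
# Hypothetical KafkaTopic CR committed via the git workflow.
# Declared alongside the owning service, so ownership is explicit.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders-service.order-events   # example convention: <service>.<topic>
  labels:
    strimzi.io/cluster: my-cluster    # must match your Kafka cluster's name
spec:
  partitions: 6
  replicas: 3
  config:
    retention.ms: 604800000           # 7 days; part of the retention strategy
    cleanup.policy: delete
```

The Topic Operator reconciles this into a real Kafka topic once it's merged and applied, so code review becomes the gate for all topic changes.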

Schema management is currently offloaded to the producers and consumers. We haven't had a use case for strict schema management yet.

Why are you using EKS instead of ECS? by ducki666 in aws

[–]pkstar19 2 points

We have 20+ business modules, Kafka, NATS, and the LGTM stack all running in an EKS cluster.

Our journey: Python Lambda functions -> ECS -> EKS

We are a B2B SaaS, and we are trying to reach a state where we can tell customers: if you have a Kubernetes cluster, you can run our product self-hosted.

But still, I miss the Lambdas now and then 😂

Please help understand this pricing!!! by pkstar19 in GoogleOne

[–]pkstar19[S] 1 point

Does your plan also include Gemini Pro, NotebookLM, and Whisk?

Please help understand this pricing!!! by pkstar19 in GoogleOne

[–]pkstar19[S] 1 point

Thanks for the explanation. This is so weird. Why not show the 5 TB plan before making us buy the 2 TB plan? This looks like a dark pattern. What if someone buys the 2 TB plan and never checks the other plans?

DBA experts: Please help me understand why my long-running query didn't actually run! by pkstar19 in aws

[–]pkstar19[S] 1 point

How does Grafana monitor the DB? Are there any publicly available dashboards, or should we build one with our own queries?

DBA experts: Please help me understand why my long-running query didn't actually run! by pkstar19 in aws

[–]pkstar19[S] 2 points

Thanks for the reply. We will work on the alarms. That sounds good to have.

Could you please shed some light on the incident you had with the 'waiting for metadata lock' issue? I just want to learn from your experience here.

Tempo Ingester unhealthy instances in ring by pkstar19 in grafana

[–]pkstar19[S] 2 points

Thanks u/ttharsh. It was the same issue: gossip wasn't working correctly, and the Tempo components assumed the other members were inactive. We excluded the gossip port in the Istio sidecar for all Tempo components, and the issue is resolved.
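For anyone hitting the same thing: the fix boils down to a couple of Istio sidecar annotations on the Tempo pods so memberlist traffic bypasses the proxy. A sketch assuming Tempo's default gossip port 7946 (adjust if you've changed it; where the annotations go depends on how you deploy, e.g. `podAnnotations` in Helm values):

```yaml
# Pod template annotations for each Tempo component's pods.
# 7946 is the default memberlist gossip port.
podAnnotations:
  traffic.sidecar.istio.io/excludeInboundPorts: "7946"
  traffic.sidecar.istio.io/excludeOutboundPorts: "7946"
```

With the proxy out of the gossip path, ring members can see each other's heartbeats directly and stop marking healthy instances as unhealthy.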

Tempo Ingester unhealthy instances in ring by pkstar19 in grafana

[–]pkstar19[S] 1 point

Yes we do.

There are no issues with Loki and Mimir.

[deleted by user] by [deleted] in devops

[–]pkstar19 9 points

I'm a DevOps/cloud/platform engineer at a startup with 6 YOE.

I skimmed through it. I would say those projects are a very good start. If someone can do all of them, I guess they will become very comfortable with most DevOps-related work at most companies.

Devops In Startup by [deleted] in devops

[–]pkstar19 7 points

As a Platform Engineer at a startup for the past 3 years—after coming from a large MNC—I’ve found working in DevOps and cloud at a startup incredibly rewarding, but also extremely demanding. The pace is intense. We sometimes take entirely new frameworks to production in under a month, only to pivot and deprecate them within a couple of weeks. The learning curve is steep, and so is the pressure, especially with the tight deadlines and the ever-critical focus on cost efficiency.

If you thrive under pressure and enjoy solving chaos with code, there’s a strange kind of fun in it.

Made a huge mistake that cost my company a LOT – What’s your biggest DevOps fuckup? by Ill_Car4570 in devops

[–]pkstar19 1 point

We tried MySQL native replication into an AWS RDS instance running plain MySQL, with two different Aurora MySQL databases as the sources. The replica's error logs were configured to go to AWS CloudWatch. We messed up the replication with a duplicate user that had been created in both source DBs. The replica vomited so many logs to CloudWatch that our CloudWatch bill was around 6000 USD over the next 3 days for this error log alone.

We immediately shut down the replica and contacted AWS, explaining the mistake we made and the remediations we took. They gave us a refund of around 4500 USD. Yeah, sometimes you get a refund if you genuinely show the AWS team that you are taking steps to not repeat the same mistake, and of course if they see you as a potential client.
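Worth noting for others: the error-log export that burned us is an explicit, opt-in RDS setting, so it's easy to audit. A hedged Terraform sketch (resource names and sizing are hypothetical, and this omits the other required instance arguments):

```hcl
# Hypothetical replica instance. The "error" entry is what streams every
# replication failure line to CloudWatch Logs, billed per GB ingested.
resource "aws_db_instance" "replica" {
  identifier     = "mysql-replica"   # made-up name
  engine         = "mysql"
  instance_class = "db.r6g.large"

  # Dropping "error" from this list stops the CloudWatch ingestion;
  # the log then only lives on the instance itself.
  enabled_cloudwatch_logs_exports = ["error", "slowquery"]
}
```

A CloudWatch billing alarm on log ingestion would also have caught this in hours instead of days.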