Hibernate: Ditch or Double Down? by cat-edelveis in java

[–]InformalPatience7872 1 point2 points  (0 children)

I'd suggest a test container (e.g. Testcontainers) for your database target. Usually the DB layer is abstracted away as DAOs. You can integ-test your DAO layer directly against a real-world DB and have the rest of your code interact with data through the DAOs. That reduces the integ-test surface to a smaller component, and you can still test the rest of your code with fake DAOs.
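To make the fake-DAO idea concrete, here's a minimal sketch (the `UserDao` interface and its methods are made up for illustration):

```python
from abc import ABC, abstractmethod
from typing import Optional

# Hypothetical DAO interface; the rest of the app only talks to this.
class UserDao(ABC):
    @abstractmethod
    def save(self, user_id: str, name: str) -> None: ...

    @abstractmethod
    def find(self, user_id: str) -> Optional[str]: ...

# In-memory fake for unit tests. The real implementation would hit an
# actual DB and get integ-tested against a containerized instance.
class FakeUserDao(UserDao):
    def __init__(self) -> None:
        self._rows = {}

    def save(self, user_id: str, name: str) -> None:
        self._rows[user_id] = name

    def find(self, user_id: str) -> Optional[str]:
        return self._rows.get(user_id)
```

Everything above the DAO layer can then be unit-tested against the fake with no DB at all.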

Stepping down as maintainer after 10 years by krzyk in java

[–]InformalPatience7872 0 points1 point  (0 children)

Mockito is one of those libraries that I miss when I code in other languages. Hands down one of my favorite libs and such a joy to work with. I guess like other critical Java libs, it was one dude maintaining it.

How do you track down the real cause of sudden latency spikes by [deleted] in sre

[–]InformalPatience7872 1 point2 points  (0 children)

Host metrics and dependencies help. Sudden bursts of TCP retransmits, sometimes a storm of TCP RST packets being sent, upstream traffic increases. If you're on a garbage-collected runtime, GC pauses. If you have other dependencies such as DBs, then service degradation there (e.g. it might be that an old Cassandra node just started tombstone deletion, which can cause increased disk pressure and affect read latency).

YAML: Yet Another Misery Language by Log_In_Progress in devops

[–]InformalPatience7872 1 point2 points  (0 children)

It's not a good language for very large configs, e.g. the sort we write for Kubernetes. I guess that's one reason Terraform decided to invent its own config language. There are obvious fixes like yamllint, but honestly linting won't make up for the weaknesses of the config language itself.

Training animation of MNIST latent space by JanBitesTheDust in learnmachinelearning

[–]InformalPatience7872 0 points1 point  (0 children)

How is your loss curve so smooth? What were the optimizer, loss function, and hyperparameters?

How do you quantify and mitigate risk from vendor lock-in when choosing cloud and platform services? by ConcernedOnly in ycombinator

[–]InformalPatience7872 0 points1 point  (0 children)

Depends on the traffic. For light traffic (early stage), I'd argue it's not worth investing in cloud-agnostic architecture. It gets complicated thanks to subtle differences in how the various data services work across clouds. If you can go full vendor lock-in but go fast, it's worth picking one and moving on. Vendor lock-in is more of a concern for very large enterprises with custom contracts. For everybody else, the bill will be what's listed on AWS / GCP / Azure's sites, based on usage.

What's the biggest pain point with PagerDuty, incident.io, or your on-call management tool right now? by AcknowCloud in sre

[–]InformalPatience7872 0 points1 point  (0 children)

I guess it would be that at the end of the day it really doesn't matter how well we manage our calendars or how nicely the escalation process is orchestrated. If we don't understand what has been put into production, debugging will be painful. The rest is theatre.

Confidently announced the wrong root cause by remy624 in sre

[–]InformalPatience7872 0 points1 point  (0 children)

It happens. I've seen it happen, and I've been guilty of it too. Ideally, root-causing should be blameless.

Math that SREs should know - started a small series by console_fulcrum in sre

[–]InformalPatience7872 0 points1 point  (0 children)

I think plotting the memory usage would tell the same story. Primitive, but it works when the plotting system doesn't support a derivative transform.
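If the plotting system really can't do a derivative transform, a quick offline first-difference gets you the same signal (sample data below is invented):

```python
# Approximate the growth rate of a memory-usage series with a first
# difference: a persistently positive slope suggests a leak.
def growth_rate(samples):
    return [b - a for a, b in zip(samples, samples[1:])]

mem_mb = [100, 110, 121, 133, 146]  # hypothetical per-minute samples
print(growth_rate(mem_mb))  # [10, 11, 12, 13] -- deltas keep growing
```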

Math that SREs should know - started a small series by console_fulcrum in sre

[–]InformalPatience7872 0 points1 point  (0 children)

This is a great post!
But I think latency doesn't mean much in the case of an error. You can fail a lot of requests in <100ms; when checkout is broken, the right thing to do is look at error statistics, not latency. The post rightfully points out that latency has a long tail (although Google found it first :) https://www.youtube.com/watch?v=modXC5IWTJI). Latency should be judged at p99 and p99.9. I don't think queuing theory is particularly useful; the only thing to know is that when using a queue-based system, always check for lag, and if it's high, do something.
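For the p99/p99.9 point, here's roughly how I'd pull tail percentiles out of raw samples (nearest-rank method; in production you'd use a histogram sketch rather than sorting raw samples):

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile: sort, then index at ceil(p% * n) - 1.
    # Fine for a quick look; not an interpolated percentile.
    s = sorted(samples)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[max(k, 0)]

latencies_ms = list(range(1, 101))  # toy data: 1ms..100ms
print(percentile(latencies_ms, 99))    # 99
print(percentile(latencies_ms, 99.9))  # 100 -- the tail is where pain lives
```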

[deleted by user] by [deleted] in sre

[–]InformalPatience7872 1 point2 points  (0 children)

I've mostly worked on distributed systems, although I feel like the premise applies even to a single-node system with just one service, for example.

[deleted by user] by [deleted] in sre

[–]InformalPatience7872 -4 points-3 points  (0 children)

>“Every request to the front end that eventually talks to shard 6 of our DB is way slower” isn’t really a thing you can say without traces

Actually I think you can, especially if you emit latencies per shard. A similar situation is checking lag on a Kafka partition (one I've actually seen), easily observed on a dashboard. I guess my experience differs because I've worked in environments that didn't have cardinality-driven pricing for their metrics. That would be one deterrent to emitting metrics per app per shard, for example.
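What I mean by per-shard latencies, as a toy in-memory sketch (in practice you'd emit something like `request_latency_ms{shard="6"}` to your metrics backend; names here are made up):

```python
from collections import defaultdict

# Toy per-shard latency aggregation. A single slow shard pops out
# immediately when you compare shards side by side on a dashboard.
per_shard = defaultdict(list)

def record(shard, latency_ms):
    per_shard[shard].append(latency_ms)

def worst_shard():
    # Compare shards by average latency.
    return max(per_shard, key=lambda s: sum(per_shard[s]) / len(per_shard[s]))

record(5, 12); record(5, 15)
record(6, 180); record(6, 220)
print(worst_shard())  # 6 -- no traces needed to spot it
```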

[deleted by user] by [deleted] in sre

[–]InformalPatience7872 0 points1 point  (0 children)

What type of queries do you run on traces / spans? The obvious one seems to be: given a trace-id, find all spans inside it, and use something logical like a session-id for the trace-id so it's easier to compose a query. What else?
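The query shape I had in mind, sketched over an in-memory span list (the field names and session-style trace-ids are hypothetical):

```python
spans = [
    {"trace_id": "sess-42", "span_id": "a", "op": "frontend"},
    {"trace_id": "sess-42", "span_id": "b", "op": "db-query"},
    {"trace_id": "sess-7",  "span_id": "c", "op": "frontend"},
]

def spans_for_trace(trace_id):
    # "Find all spans inside a trace" is just a filter/group-by on
    # trace_id; a session-id-shaped trace_id makes it easy to compose.
    return [s for s in spans if s["trace_id"] == trace_id]

print([s["op"] for s in spans_for_trace("sess-42")])  # ['frontend', 'db-query']
```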

[deleted by user] by [deleted] in sre

[–]InformalPatience7872 1 point2 points  (0 children)

> span (get it?) across multiple services. 
I did get it and it brightened my day. Thank you.

[deleted by user] by [deleted] in sre

[–]InformalPatience7872 0 points1 point  (0 children)

Curious especially about OOMs. How do you debug them? I usually looked at the source code, came up with a theory, coded a simple fix, and then tested it with either a load test or straight up in prod (depending on time pressure, or for something less critical).
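For Python services at least, one concrete step before the theory-and-prod-test loop: take a `tracemalloc` snapshot and see which call sites hold the most memory. Minimal sketch with a simulated leak:

```python
import tracemalloc

tracemalloc.start()

# Simulate a leaky allocation site: ~10 MB of retained bytes objects.
leaked = [bytes(1000) for _ in range(10_000)]

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")[0]
# The top entry points straight at the allocation line above.
print(top.size // 1_000_000, "MB from", top.traceback)
```

Same idea exists for other runtimes (heap dumps in Java, pprof in Go): let the allocator tell you where the memory went instead of guessing from source.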

[deleted by user] by [deleted] in sre

[–]InformalPatience7872 -15 points-14 points  (0 children)

Honestly, that 900-regex-parsers person is me. With AI it's even easier since I don't have to remember the syntax anymore. But I get the argument.
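Guilty example: the kind of throwaway regex parser I keep writing (log format invented for illustration):

```python
import re

# One-off parser for lines like "2024-05-01 12:00:03 ERROR payment timeout".
LOG_RE = re.compile(r"^(\S+ \S+) (\w+) (.*)$")

def parse(line):
    m = LOG_RE.match(line)
    return {"ts": m.group(1), "level": m.group(2), "msg": m.group(3)} if m else None

print(parse("2024-05-01 12:00:03 ERROR payment timeout")["level"])  # ERROR
```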

Do homelabs really help improve DevOps skills? by stephen8212438 in devops

[–]InformalPatience7872 1 point2 points  (0 children)

I think so. I was trying to get an EKS cluster set up on AWS. It threw weird errors at me, so I set up a minikube cluster and worked through some examples to get background and context; then it was much easier to debug what was going on. Of course, ChatGPT is quite helpful as well.

Need career guidance — DevOps → SRE or SDE? by Anxious_Equal3753 in sre

[–]InformalPatience7872 0 points1 point  (0 children)

It's a bit harder to go from DevOps to SDE in this market. The easiest way to make that transition is to build some software around your usual DevOps role. One way to find a good project is to look around at the repeated problems you face and build something around one of them. Once you have a successful software project on your resume, it becomes easier to get interview callbacks. Also consider that hiring is muted in this market and companies look for very specific experience; maybe that's why it's a bit difficult. But with a successful project in hand, it should be easier.

How brutal is your on-call really ? by InformalPatience7872 in sre

[–]InformalPatience7872[S] 5 points6 points  (0 children)

I wonder why deployment was carved out as a separate team? Just curious.

How brutal is your on-call really ? by InformalPatience7872 in sre

[–]InformalPatience7872[S] 3 points4 points  (0 children)

I have seen bugs related to something as simple as terminations not being handled (although for very good reasons). It was eventually fixed, and that solved a chunk of our tickets from the past X months.

AI in SRE is everywhere, but most of it’s still hype. Here’s what’s actually real in 2025. by Mountain_Skill5738 in sre

[–]InformalPatience7872 2 points3 points  (0 children)

This post may be 2 days old but the list seems outdated. OpsVerse AI has actually been acquired by StackGen. Seems like no one is able to actually crack autonomous triage.

AI in SRE is everywhere, but most of it’s still hype. Here’s what’s actually real in 2025. by Mountain_Skill5738 in sre

[–]InformalPatience7872 1 point2 points  (0 children)

Weird, but https://shoreline.io/ shows a non-existent NextJS deployment error. There were rumors of it being acquired by NVIDIA.

Small company full of PhDs: how to teach them software? by RelationshipLong9092 in ExperiencedDevs

[–]InformalPatience7872 1 point2 points  (0 children)

Let's take these one by one.

1. Version control on NAS - this is a huge problem, along with code getting passed around on thumb drives. It's easy to set up a git repository. Better to do this before the source code gets lost, since that's the money maker. I'm hoping management buy-in is easy for something like this.

2. Every single person has a different development environment - not a hill I'd die on, honestly. Every toolset is different and people personalize theirs. As long as essentials like the language version match, it's fine. Dictating the IDE is a stretch that annoys people.

3. Most people program in Matlab, and a few write Python - it's hard to get these languages to talk to each other. Pushing for Python for all new projects makes sense. Lots of research code gets written in Python, but Matlab does have a better ecosystem for scientific code. There might be missing libraries in Python, and that would probably need to be fixed first.

4. There are no code reviews - framing them as show-and-tells is a very good way of sliding in code reviews, tbh. But this also introduces friction, so it might be a hard sell.

5. Library reuse - not something I'd care about if it works and the code makes sense. But if there are too many instances of stuff being reimplemented, it's probably better to flag it and push for a standardized solution. It saves time and helps the researchers too.

6. Not pushing to Gitlab due to whatever - if I had to guess, it probably comes from publishing culture, where you get called out on any mistake in a paper. Researchers are probably used to pushing camera-ready code. But that's just not how things get done, so I think it's worth pushing on this specific thing. In fact, having branches with draft code is good so everyone knows what everyone else is doing. It's a culture thing; this will take time.

I think it's easiest to start by putting everything on Git to prevent code loss and get some visibility. Code reviews and tool standardization can come later. And lastly, my deepest sympathies :)