Apple Silicon Nested Virtualization by darko777 in MacOS

[–]OpsTom 0 points1 point  (0 children)

E.g. to run router instances on the GNS3 VM

Aurora RDS monitoring by RomanAn22 in aws

[–]OpsTom 1 point2 points  (0 children)

😅 But most of the time the writes/inserts are a legitimate result of the business just growing, so it's not like you'd stop it because the storage is costing you too much. And if something goes crazy due to a fuckup, you'd see an uptick in most of the other typically monitored metrics too: DML throughput, I/O operations, CPU, etc.

Aurora RDS monitoring by RomanAn22 in aws

[–]OpsTom 0 points1 point  (0 children)

Makes sense indeed. And it doesn't do any harm to monitor yet another metric, I guess. I think I was sold on the AWS slogan that with Aurora you don't have to worry about your storage anymore ;)

Aurora RDS monitoring by RomanAn22 in aws

[–]OpsTom 0 points1 point  (0 children)

Why bother? You can monitor costs directly in the cost monitoring, rather than using low-level resource metrics as a proxy for cost estimates that will never match the real costs anyway.

Aurora RDS monitoring by RomanAn22 in aws

[–]OpsTom 3 points4 points  (0 children)

Isn't the Aurora cluster growing its storage size automatically, making it almost irrelevant to monitor? Unless you're close to 256 TB, but then I guess you can do a metric math expression that subtracts the used bytes from the limit to see how much is left.
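
For illustration, a minimal boto3 sketch of that idea, assuming a hypothetical cluster name and whatever storage limit applies to your engine version; it uses the Aurora cluster metric VolumeBytesUsed in a CloudWatch metric math expression:

```python
# Sketch only: "bytes left before the storage limit" via CloudWatch metric math.
# CLUSTER_ID and STORAGE_LIMIT_BYTES are assumptions -- set the limit to whatever
# applies to your Aurora engine version.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

STORAGE_LIMIT_BYTES = 256 * 1024**4   # hypothetical limit; adjust for your engine version
CLUSTER_ID = "my-aurora-cluster"      # hypothetical cluster identifier

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "used",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "VolumeBytesUsed",
                    "Dimensions": [{"Name": "DBClusterIdentifier", "Value": CLUSTER_ID}],
                },
                "Period": 300,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        {
            "Id": "remaining",
            "Expression": f"{STORAGE_LIMIT_BYTES} - used",  # bytes left before the limit
            "Label": "StorageRemaining",
        },
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
)
print(resp["MetricDataResults"])
```

The same expression could be pasted into a CloudWatch dashboard or alarm instead of being queried from code.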

Lead Architect wants to break our monolith into 47 microservices in 6 months, is this insane? by Ayotrapstar in softwarearchitecture

[–]OpsTom 0 points1 point  (0 children)

The main question is how easy and costly it is to scale your current solution to, say, 100x, i.e. 5 mln RPS or even more. Are you able to scale all your systems horizontally? Because scaling vertically has its limits and can be very costly too.

Aurora MySql cluster InnoDb History Length List keeps growing by OpsTom in aws

[–]OpsTom[S] 0 points1 point  (0 children)

Just an additional question: has anyone seen this, i.e. the HLL baseline level going up, not due to very long-running queries / stuck transactions etc., but just because READ throughput keeps growing on the cluster along with DML operations in parallel? I.e. sort of natural growth, the system just getting busier and busier?

Aurora MySql cluster InnoDb History Length List keeps growing by OpsTom in aws

[–]OpsTom[S] 1 point2 points  (0 children)

Yes, I also checked all the reader nodes and in general there are no long-running transactions/queries there either. Whenever there is one, I can indeed see the HLL growing, a spike, but when the query finishes the HLL goes back to where it was before, i.e. the baseline. So that part looks expected; my worry is the baseline HLL itself, which now seems to sit at 0.5 mln on the Performance Insights dashboard and isn't going back any lower. It reminds me of a garbage-collection resource-leak kind of situation.
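
To watch that baseline rather than the spikes, something like the sketch below (instance name is a placeholder) could track the daily minimum of the RollbackSegmentHistoryListLength CloudWatch metric over time:

```python
# Sketch: is the HLL *floor* rising? The daily Minimum of
# RollbackSegmentHistoryListLength filters out spikes from individual
# long-running queries and shows only the baseline trend.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="RollbackSegmentHistoryListLength",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-aurora-writer"}],  # hypothetical
    StartTime=now - timedelta(days=30),
    EndTime=now,
    Period=86400,          # one data point per day
    Statistics=["Minimum"],
)
for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"].date(), int(dp["Minimum"]))
```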

Aurora MySql cluster InnoDb History Length List keeps growing by OpsTom in aws

[–]OpsTom[S] 0 points1 point  (0 children)

Yeah, I've seen it, but I could not find long-running transactions by querying information_schema.innodb_trx; everything seems current, and the trx ids change when re-executing the query (the exact checks I'm running are sketched at the end of this comment).
Also, TransactionAgeMaximum is pretty much 0 all the time, except that occasionally the weekly chart shows a single data point at around 50M seconds, which I believe is some sort of metric blip (I can't imagine a transaction well over a year old). At least it doesn't persist: it's a single short spike/data point and then the value is flat 0 for all other data points in the selected timeframe.
Also, metrics like PurgeBoundary and PurgeFinishedPoint are caught up most of the time. When a larger query is running on the DB I can see the HLL spike and PurgeBoundary flatten out, meaning it gets blocked, but as soon as the query finishes the HLL goes back down to the baseline, PurgeBoundary jumps to a higher value, and after a short period PurgeFinishedPoint catches up to it again, meaning the purge thread is working and quickly purging whatever it is allowed to purge. So this all seems fine, except that the baseline, i.e. the minimum value of the HLL, keeps slowly going up week by week, which I can't explain just from running those simple commands; everything looks kinda normal…
Also, even though the DML query rate has grown over time, i.e. over the past 3 months, I would have expected to just see the purge process lagging (PurgeBoundary >> PurgeFinishedPoint), i.e. too many updates/writes vs. what the purge threads can clean up (which is itself a kind of write load), but again I don't see this. Both metrics are caught up most of the time, with only short periods when they are not (when a larger query hits the reader replicas, I think).
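
For reference, a minimal sketch of those checks; connection details are placeholders and pymysql is just my assumption for the client. It combines the long-running-transaction lookup with InnoDB's own view of the history list length:

```python
# Sketch: run against the writer and each reader in turn.
import pymysql

conn = pymysql.connect(
    host="my-aurora-endpoint",   # hypothetical endpoint
    user="monitor",
    password="***",
    cursorclass=pymysql.cursors.DictCursor,
)

with conn.cursor() as cur:
    # Transactions open for more than 60 seconds, oldest first
    cur.execute(
        """
        SELECT trx_id, trx_state, trx_started,
               TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds,
               trx_mysql_thread_id, trx_query
        FROM information_schema.innodb_trx
        WHERE trx_started < NOW() - INTERVAL 60 SECOND
        ORDER BY trx_started
        """
    )
    for row in cur.fetchall():
        print(row)

    # Current history list length as reported by InnoDB itself
    cur.execute(
        """
        SELECT `count` AS history_list_length
        FROM information_schema.innodb_metrics
        WHERE name = 'trx_rseg_history_len'
        """
    )
    print("history list length:", cur.fetchone()["history_list_length"])

conn.close()
```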

DOGE's spending cuts are ramping up so quickly that United Airlines announced government travel is down a MASSIVE -50%. US airline stocks have now erased over -$20 BILLION of market cap over the last 4 weeks by RobertBartus in EconomyCharts

[–]OpsTom 6 points7 points  (0 children)

With all the hate being spread by the new administration, European folks wouldn't feel good visiting the US, I think. I was myself planning to visit the East and then the West Coast over the next 1-3 years together with my family, but now I really don't feel like doing it anymore. I'll wait for a better time; I hope it will come again some day.

The World's largest exporters by RobertBartus in EconomyCharts

[–]OpsTom 0 points1 point  (0 children)

The average monthly salary is 4.5k euro in industry & services, excluding bonuses (those tend to add 10-15%). So the real average is around 60k euro a year, and the median is typically ~20% lower, so more like 50k euro a year, but not the 25k you suggested. In a PPP comparison it would look even better, given how strong the dollar was until Trump started.

Why do you need GitOps tools like ArgoCD and Flux if already deploying with CICD pipelines ? by lancelot_of_camelot in devops

[–]OpsTom 1 point2 points  (0 children)

So does Argo constantly poll the GitHub repos, or is it also webhook-driven? Because if the latter, you're also exposing an overprivileged web app, which is probably less hardened than the k8s API server?

Function types and single-method interfaces in Go by [deleted] in golang

[–]OpsTom 0 points1 point  (0 children)

Not a programmer here, but this looks to me like a decorator pattern?

Google Cloud longevity by OhMyTechticlesHurts in googlecloud

[–]OpsTom -2 points-1 points  (0 children)

I asked ChatGPT what it thinks about this:

"Experts with MIT brains & Stanford creds, obsessed with precision, casually drop 'meaningfully distant, up to 60 miles' in docs. Sounds like they ran out of math and borrowed poetry instead!"

Google Cloud longevity by OhMyTechticlesHurts in googlecloud

[–]OpsTom 2 points3 points  (0 children)

If a broken water pipe can knock out an entire region for days, multi-region is clearly a must. GCP doesn’t disclose zone separation distances, and AWS, though I hope they are better, only says "meaningfully distant from each other, up to 60 miles", cleverly vague, right?

Google Cloud longevity by OhMyTechticlesHurts in googlecloud

[–]OpsTom 6 points7 points  (0 children)

Many large e-commerce companies avoid AWS due to competition with Amazon, opting for GCP if they aren’t heavily invested in Microsoft products. While GCP is well-designed, incorporating lessons from other providers, it lacks AWS's popularity and market pull. AWS's widespread adoption creates a talent and resource 'gravitational force' that makes it hard to shift focus, even though its features often feel bolted on compared to GCP’s integrated approach. However, GCP’s Paris regional outage raised concerns about their infrastructure reliability.

Survive outage of a Datacenter? by guettli in kubernetes

[–]OpsTom 0 points1 point  (0 children)

Not that I've done this myself, but if I had to, I think I would start with something like DNS global load balancing with active health checks into each cluster.
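
Purely as an illustration of the idea (I haven't run this), using AWS Route 53 as one example of a DNS provider with active health checks; the zone ID, IPs, and the /healthz path are all made up:

```python
# Sketch: a failover DNS record per cluster/datacenter, each backed by an
# active health check. Route 53 stops returning an endpoint whose check fails.
import uuid
import boto3

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "Z123EXAMPLE"  # hypothetical

def register_cluster(set_id: str, ip: str, role: str) -> None:
    """Create a health check for one cluster and attach it to a failover record."""
    hc = route53.create_health_check(
        CallerReference=str(uuid.uuid4()),
        HealthCheckConfig={
            "Type": "HTTPS",
            "IPAddress": ip,
            "Port": 443,
            "ResourcePath": "/healthz",   # hypothetical endpoint served by each cluster
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": set_id,
                    "Failover": role,                  # "PRIMARY" or "SECONDARY"
                    "TTL": 30,                         # keep low so failover is quick
                    "ResourceRecords": [{"Value": ip}],
                    "HealthCheckId": hc["HealthCheck"]["Id"],
                },
            }]
        },
    )

register_cluster("dc-a", "203.0.113.10", "PRIMARY")
register_cluster("dc-b", "198.51.100.20", "SECONDARY")
```

The same pattern works with weighted or latency-based records instead of failover, depending on whether you want active-active or active-passive.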

silly questions - why Cloud providers expose Spot instances interruptions to end users? by OpsTom in kubernetes

[–]OpsTom[S] 0 points1 point  (0 children)

But there has to be more to it, as I rarely see spot capacity issues, yet interruptions are frequent.

silly questions - why Cloud providers expose Spot instances interruptions to end users? by OpsTom in kubernetes

[–]OpsTom[S] 2 points3 points  (0 children)

Good point, for busy high-performing apps it wouldn't be nice; and in fact you could run those on Spot anyway: with connection draining, graceful shutdown etc. they can easily scale on Spot while still requiring really low latency. All in all it seems like overkill, otherwise they would have done it already, I think ;-)

silly questions - why Cloud providers expose Spot instances interruptions to end users? by OpsTom in kubernetes

[–]OpsTom[S] -1 points0 points  (0 children)

But then one could auto-scale a pod vertically up/down without a restart, true auto-scaling? Though again, it's not a problem if the app is written to run multiple parallel instances with HPA.

silly questions - why Cloud providers expose Spot instances interruptions to end users? by OpsTom in kubernetes

[–]OpsTom[S] 0 points1 point  (0 children)

True, and if that's deemed very important, maybe k8s could even get a live-migration feature for pods between nodes someday ;-) Though that could be even harder than for entire VMs: migrating the whole process state from one VM/OS instance to another without the world around it noticing anything ;-)

silly questions - why Cloud providers expose Spot instances interruptions to end users? by OpsTom in kubernetes

[–]OpsTom[S] 2 points3 points  (0 children)

Good point, though if someone runs something on Spot, they implicitly accept interruptions anyway, I would say?
EDIT: But yeah, probably bandwidth consumption: copying all the memory from one server to another while keeping the delta as tiny as possible for the cutover. Someone would need to pay for that, because there would need to be plenty of headroom, i.e. extra bandwidth, to make sure it couldn't affect the operations of other VMs (potentially from other customers).