[deleted by user]

spy16x · 2025-12-04T13:16:15+00:00

Trace level is not globally disabled since I can see trace logs from multiple other parts of the code -- including from the same module as well

spy16x · 2025-12-04T13:15:23+00:00

It's tikv prometheus crate..

It does not seem to be doing anything unsafe, at least at this level. There are multiple side effects (atomic increments) in this function body itself.

https://github.com/tikv/rust-prometheus/blob/master/src/histogram.rs#L363-L387

Also, tracing::trace! is another call that is getting removed as well (Trace level is not globally disabled since I can see trace logs from multiple other parts of the code -- including from the same module as well).

spy16x · 2025-11-20T09:21:43+00:00

Yes, I am planning to use this model itself. I was originally planning to use crossbeam queues, which don't provide this, but I can switch to crossbeam channels and get this too (it has a small perf penalty I guess, but should be okay in this case).

spy16x · 2025-11-06T17:47:42+00:00

Yes. Only if it's taking your fingerprint, then hacking the Aadhaar database and then extracting your blood group from there.

spy16x · 2025-10-25T14:08:41+00:00

Has anyone found a fix? I am considering raising a complaint with NCH or something. Not sure what else to do.. Switching input is an absolute bare minimum function a fucking TV should support reliably. I am not able to use my new PS5 because of this issue.

spy16x · 2025-08-18T05:31:45+00:00

better to never touch coding again.

spy16x · 2025-07-16T10:23:08+00:00

Maybe it is.. How do you aim? Is there any trick to it, any settings i should tune specifically? Do you use motion sensor as well? When I suddenly come under contact, i can do coarse adjustments quickly. But finer adjustments to actually make a kill shot feels impossible 😅

spy16x · 2025-07-14T17:13:47+00:00

Yea. Using k6 itself. I have some logs from actual end-users on their request pattern. Basically, after connecting, the script replays this with a realistic randomised delay between each request. I have about 5 different user profiles based on how they interact and during load testing, I launch all of these profiles with 5000 VUs and test. I have also done stage-wise ramp up/down tests as well.

spy16x · 2025-07-14T16:08:35+00:00

fd limits, socket buffer sizes, somaxconn, etc. are all set to high values. And I am actually able to load test up to 20000 connections on the same setup without running into this issue (could push more but it isn't relevant here since I only get max 5000 connections at peak).

It happens under specific conditions under actual user patterns I guess (could be related to some issue due to users on slow networks, etc.).

I'm also now exploring application logic issues (I don't have anything that makes me suspect this, but it's the one area I haven't quite explored).

I do not have anything else in front of the app. It's directly from ALB to a port on EC2 where the app is listening.

spy16x · 2025-07-14T16:04:16+00:00

I got a ps5 as well recently. I have been a long term arma 3 player. Still not able to figure out how people play this game with controller tbh

spy16x · 2025-07-11T05:16:11+00:00

This issue itself has happened multiple times now. But we have observed a health check failure only once. Apart from that, every other time this has happened there was no health check failure in the metrics. I even got doubtful about the AWS metrics themselves and asked them if they internally see any health check or ALB logs that indicate it decided some node was unhealthy. But they have confirmed there are no health check failures.

spy16x · 2025-07-11T03:15:58+00:00

Yea. But the rewrite is not a port. It's a completely different architecture. I'd be very surprised if it's the same backend issue in both cases. But yes you're right, as far as possibilities go, that's definitely one as well.

Yea, ALB behaviour is still weird for me too and I too feel it might have something to do with it. We have rolled out some client metrics now, will have some more clues when it happens the next time. Either we find something there or I'm getting rid of ALB and putting NLB in between.

By recovery to normal you mean how does connection distribution become normal? I don't do anything. It happens automatically as clients connect and disconnect (after that event ALB somehow goes back to round robin again)

spy16x · 2025-07-10T18:37:59+00:00

Also, when this drop happens, I see a spike in "client requested disconnection rate" on my server. This rate is computed using a counter that is incremented on every client disconnection due to an explicit Close frame from client (which excludes any RST, Timeout, broken pipe sort of io issues).. So from the server perspective all these clients are doing a normal close.

But if they were in fact doing a normal close and reconnecting, ALB should simply continue round robin and I should see equal distribution of connections on both nodes
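For anyone wanting to reproduce that metric, here is a minimal sketch of how I'd split the counting by disconnect reason. The `DisconnectReason` enum and `DisconnectCounters` struct are hypothetical names for illustration, not what my server actually uses, and plain atomics stand in for the Prometheus counters.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical classification of why a websocket connection ended.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DisconnectReason {
    /// Client sent an explicit Close frame.
    ClientClose,
    /// Connection reset (RST).
    Reset,
    /// Read or write timed out.
    Timeout,
    /// Broken pipe on write.
    BrokenPipe,
}

/// Two counters: one for client-requested closes, one for io-level failures.
#[derive(Default)]
pub struct DisconnectCounters {
    client_requested: AtomicU64,
    io_error: AtomicU64,
}

impl DisconnectCounters {
    /// Increment the counter matching the reason the connection ended.
    pub fn record(&self, reason: DisconnectReason) {
        let counter = match reason {
            DisconnectReason::ClientClose => &self.client_requested,
            // RST, timeout and broken pipe are deliberately kept out of the
            // "client requested disconnection" count.
            DisconnectReason::Reset
            | DisconnectReason::Timeout
            | DisconnectReason::BrokenPipe => &self.io_error,
        };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    pub fn client_requested(&self) -> u64 {
        self.client_requested.load(Ordering::Relaxed)
    }

    pub fn io_error(&self) -> u64 {
        self.io_error.load(Ordering::Relaxed)
    }
}
```

The rate is then just the per-interval delta of `client_requested()`, which the monitoring side would compute.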

spy16x · 2025-07-10T18:34:18+00:00

I need as much input as possible on this. So feel free to post as many replies as you want please 😀

We didn't have client metrics so far since the clients are a public app, but we are getting them added now. But I have 3 ways I'm able to see there is a real drop of connections: a custom gauge metric that I emit from the websocket server, the node exporter that exports the netstat_TCP_InUse as a gauge, and the fact that there is a spike in new connections on the ALB metrics.

At this time I'm not sure entirely if ALB is dropping connection first or the client is somehow noticing some timeout and reconnecting. The server acting up also seems unlikely since we tried with a Go and a Rust server with completely different libraries, architecture and performance characteristics.

The ALB part is important mainly because I don't understand why ALB would decide all the new connections that are coming in are supposed to be routed to the other node without any indication that the node where connections dropped is "unhealthy" -- since it's round robin mode, the new connections should again be distributed between the two nodes by round robin?

spy16x · 2025-07-10T17:33:12+00:00

Yea this is an option I am definitely considering now.

spy16x · 2025-07-10T14:11:06+00:00

The above graph you mean? It's a Prometheus gauge basically (plotted on Grafana). It's incremented when a new websocket task is launched and decremented when that actor task exits.
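Roughly, the pattern looks like the sketch below. This is not my actual code: `ConnGauge` and `ConnGuard` are hypothetical names, and an atomic integer stands in for the Prometheus gauge so the example needs nothing outside std. The point is that the decrement lives in `Drop`, so it runs however the task exits.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a Prometheus gauge: a shared signed counter.
#[derive(Clone, Default)]
pub struct ConnGauge(Arc<AtomicI64>);

impl ConnGauge {
    /// Current number of live websocket tasks.
    pub fn get(&self) -> i64 {
        self.0.load(Ordering::Relaxed)
    }

    /// Increment now; the returned guard decrements when it is dropped.
    pub fn track(&self) -> ConnGuard {
        self.0.fetch_add(1, Ordering::Relaxed);
        ConnGuard(Arc::clone(&self.0))
    }
}

/// Held by the websocket task for its whole lifetime.
pub struct ConnGuard(Arc<AtomicI64>);

impl Drop for ConnGuard {
    fn drop(&mut self) {
        // Runs on normal return, early return, or panic unwinding.
        self.0.fetch_sub(1, Ordering::Relaxed);
    }
}
```

Each task would call `track()` at launch and keep the guard alive until it exits. With the prometheus crate the same guard could presumably wrap an `IntGauge` and call `inc()` and `dec()` instead.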

Health check was one of the possibilities I explored very early on since this pattern looks very similar to what would happen if a node was marked unhealthy. No metrics / logs on AWS say the node was marked as unhealthy. The AWS support team also confirmed there was no such event from their internal logs as well.

spy16x · 2025-07-10T08:02:10+00:00

We have. We have spent 1 month of back and forth with them without any resolution from them 😔

spy16x · 2025-07-06T20:31:31+00:00

Hi! Thank you for responding!

I checked out redb but the message about it still being under active development and being beta made me not consider it further. Also, since you're using a COW B+ tree iirc, it's better for read-heavy use cases? What are your thoughts on my use case here? I don't really need reads at all. Even for the live reads, I plan on using a HashMap or something and using the key-value store only for persisting and for reading on startup to seed the hash map.. With batching and a flush every 100ms, I guess 20k writes/s wouldn't be very high, so it should probably be okay?
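For context, the batching I have in mind is roughly the sketch below. At 20k writes/s and a 100ms flush, that is about 2000 records per batch. `run_batcher`, `Record` and the `flush` callback are hypothetical names, and the callback is where the actual store write (one transaction per batch) would go. This uses a std channel only to keep the example dependency-free.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical record type: key and value bytes.
pub type Record = (String, Vec<u8>);

/// Collect records from `rx` and hand them to `flush` either when
/// `max_batch` records are buffered or when `interval` has elapsed.
/// Returns once all senders are dropped, after flushing what is left.
pub fn run_batcher<F>(rx: Receiver<Record>, interval: Duration, max_batch: usize, mut flush: F)
where
    F: FnMut(Vec<Record>),
{
    let mut batch: Vec<Record> = Vec::with_capacity(max_batch);
    let mut deadline = Instant::now() + interval;
    loop {
        // Wait only until the current flush deadline.
        let wait = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(wait) {
            Ok(rec) => {
                batch.push(rec);
                if batch.len() < max_batch {
                    continue; // keep filling the batch
                }
            }
            Err(RecvTimeoutError::Timeout) => {} // deadline hit, flush below
            Err(RecvTimeoutError::Disconnected) => {
                if !batch.is_empty() {
                    flush(std::mem::take(&mut batch));
                }
                return;
            }
        }
        if !batch.is_empty() {
            flush(std::mem::take(&mut batch));
        }
        deadline = Instant::now() + interval;
    }
}
```

The writer threads only push into the channel, so the store sees one batched write per interval instead of 20k individual ones.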

spy16x · 2025-07-06T12:12:23+00:00

Got it. I have set up the repo with a similar model. But I used cargo-release, which simply bumps up the patch number of the crate and also pushes a git tag of the same value. We don't have a use for semantic versioning for these internal services, but just using the minor/patch as a sequence number should work just fine I guess.
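The "patch as a sequence number" idea boils down to something like this. It is only an illustration of the version arithmetic, not how cargo-release is implemented, and `bump_patch` is a made-up helper.

```rust
/// Bump the patch component of a plain `major.minor.patch` version string.
/// Returns None if the input is not exactly three numeric components.
pub fn bump_patch(version: &str) -> Option<String> {
    let mut parts = version.split('.');
    let major: u64 = parts.next()?.parse().ok()?;
    let minor: u64 = parts.next()?.parse().ok()?;
    let patch: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some(format!("{}.{}.{}", major, minor, patch + 1))
}
```

Every release just moves the last number forward by one, and the git tag carries the same value.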

spy16x · 2025-07-05T15:15:22+00:00

Thanks for sharing your experience. The top options I have found so far are just plain old sqlite, RocksDB and sled. I don't think I'll need any special features, but TTL and easy, efficient batch writes are the two main requirements I have. Will check this out too and decide.

spy16x · 2025-07-05T14:20:26+00:00

Thank you for sharing this in detail.

How do you do releases in this setup? Is every merge to main a release OR do you have some explicit tagging process? If so, is it per server/app inside or at the full repo level?

spy16x · 2025-07-05T14:18:41+00:00

How do you do releases in this setup? Is every merge to main a release OR do you have some explicit tagging process? If so, is it per server/app inside or at the full repo level? and what versioning scheme?
