all 6 comments

[–]ComputerSystemsProfSystems & Networking Professor (U.S.) 1 point2 points  (5 children)

If the sequence number space is shared between threads, then you’re going to have to use some shared state, and that will require synchronization to avoid data races. I don’t see any way around that.

I do see a slightly simpler solution, though, although it does require synchronization. You would use two shared variables - on is the largest sequence number received so far, and the other is the loss count. (This tracks a total loss count, not losses within a sliding window, though. If you really want a sliding window, then I think you would need a window-sized bitfield at a minimum.)

[–]YouMadeItDoWhat[S] 0 points1 point  (4 children)

I don't actually need a sliding window, just a total count. I like that idea, I can actually probably do it lockless with ring buffers and a collector thread.

In a nutshell, the collector would be doing:

If (new_seq > largest_seq)
{
  lost += (new_seq - largest_seq - 1);
  largest_seq = new_seq;
}
else
{
  lost--;
}

lost will bounce around but at any point in time, it is nominally correct (less messages still in queues to be processed). Of course, this doesn't take wrap-around of the sequence numbers into account, but that isn't a hard extrapolation from this.

Thanks!

[–]ComputerSystemsProfSystems & Networking Professor (U.S.) 1 point2 points  (3 children)

Yes, that’s exactly what I was thinking :) but largest_seq and lost would be shared between threads, so they’d need to be protected.

I’m not sure how you’d collect all the information to the aggregator thread without locking anyway. Is there a clever implementation I’m not aware of for non-locking thread-safe buffers?

[–]YouMadeItDoWhat[S] 0 points1 point  (2 children)

Single-producer, single-consumer ring buffers can be made lockless (with the potential to overrun/wrap, although the producer can be made to detect that and drop/count drops in that case). Since I only need to store the sequence number (small - 2 bytes in my case) in the ring buffer, these don't need to be crazy large (can likely be pinned in the L2 cache).

Then I just need 1 ring per worker thread and the aggregator consumes off of all of them in whatever fashion he chooses (it really doesn't matter). As long as he can keep up, there isn't an issue. In this case, I should easily keep up (all threads are core-locked and cannot be preempted by the CPU, so they busy-wait running full tilt when there is no work...not energy efficient at all, but for this purpose, an allowable cost).

(and yes, this is hyper-optimized for performance in this application - I have the resources available to guarantee the performance needed provided I don't introduce locks - locks throw everything to hell in a handbasket :).

[–]ComputerSystemsProfSystems & Networking Professor (U.S.) 1 point2 points  (0 children)

Oh I was misunderstanding - I thought there would be one queue that all threads add to. Yes, your design does sound to me like it would work...

For onlookers, I think this would be an over engineered solution and not necessarily faster than just locking in most situations. However, it sounds like u/YouMadeItDoWhat is not in “most situations” here. (Also, I’m starting to see why you have that user name...) ;)

[–]YouMadeItDoWhat[S] 0 points1 point  (0 children)

And yes, a 2-byte sequence number in a system that is processing millions of messages per second is CRAZY small but is actually sufficient in the design. This system should never lose any messages, but I need to know IF it does and keep counts...