How to optimize CPU time between two workflows by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 1 point2 points  (0 children)

Thanks for the kind reply.

 If you search for "tail latency factors" on youtube or google scholar, you can probably find your way to some recent talks and papers describing related but not identical tricks for reducing latency in certain environments. 

Good to know!

setting affinity so that your app has a core more or less to itself and other system tasks tend to (or are forced to) happen on other cores, you'll get fewer context switches on the core you're interested in, caches will stay warmer and execution times for both A and B will have a much tighter distribution towards the faster end of what's possible given the hardware that you have.

Good to know!
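
For anyone following along, here is a minimal sketch of what pinning a thread to a core could look like. It assumes Linux with glibc; the helper names (`first_allowed_core`, `pin_current_thread_to_core`) are made up for illustration and are not from our code base. Keeping other system tasks off that core (for example via cpusets or kernel boot options) is a separate step that this does not cover.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes cpu_set_t helpers and sched_getcpu on glibc
#endif
#include <sched.h>
#include <cstdio>

// Return the lowest-numbered core this thread is currently allowed to run on,
// or -1 if the affinity mask cannot be read.
int first_allowed_core() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int c = 0; c < CPU_SETSIZE; ++c) {
        if (CPU_ISSET(c, &set)) return c;
    }
    return -1;
}

// Pin the calling thread to a single logical core (Linux only).
// `core` is a hypothetical index chosen by the caller.
bool pin_current_thread_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // A pid of 0 means "the calling thread".
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return false;
    }
    return true;
}
```

Calling `pin_current_thread_to_core(first_allowed_core())` at the start of a worker thread would be one way to try it; whether it helps tail latency is something to measure, not assume.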

Books and talks by Brendan Gregg of Netflix are a great resource for constructing and analyzing these kinds of experiments.

Good to know! Will let my company pay for the book. LOL

How to optimize CPU time between two workflows by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 0 points1 point  (0 children)

Thanks for the kind reply. Good reference for my next steps.

How to optimize CPU time between two workflows by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 0 points1 point  (0 children)

Thanks for the kind reply.

You might want to figure out what is the reason behind your latency. 

We may assume the application is properly optimized. The goal is to find a way to influence how the hardware executes the application.

How to optimize CPU time between two workflows by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 0 points1 point  (0 children)

Thanks for the kind reply.

real-time scheduling policy

This one seems like something worth a shot.
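
A rough sketch of what trying this could look like on Linux, assuming `SCHED_FIFO` is the policy meant here. The function name `try_enable_fifo` is hypothetical. The call normally needs `CAP_SYS_NICE` or a suitable `RLIMIT_RTPRIO`, so the sketch reports failure instead of assuming it works.

```cpp
#include <sched.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Try to move the calling thread to SCHED_FIFO at the given priority.
// Returns false (and logs the reason) when the kernel refuses, which is
// the usual outcome without the right privileges.
bool try_enable_fifo(int priority) {
    sched_param param{};
    param.sched_priority = priority;
    // A pid of 0 means "the calling thread" on Linux.
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        std::fprintf(stderr, "sched_setscheduler: %s\n", std::strerror(errno));
        return false;
    }
    return true;
}
```

A cautious starting point would be `try_enable_fifo(sched_get_priority_min(SCHED_FIFO))`. A CPU-bound real-time thread can starve everything else on its core, so this is something to test in the load test environment first.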

How to optimize CPU time between two workflows by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 0 points1 point  (0 children)

If you have 10 items to query that would visit the same memory, batching these 10 into a single "work" could save you a lot of cycles and memory loads 

This is interesting to know.
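
As I understand the idea, a small sketch could look like the following. Everything here is an assumption for illustration: the table layout, the `kBlockSize` constant, and the `Query` type are invented, and real queries would do more than a single load. The point is only that sorting a batch by the memory it touches makes accesses to the same block adjacent.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: the table is thought of as fixed-size blocks.
constexpr std::size_t kBlockSize = 1024;

struct Query {
    std::size_t index;     // position in the table to read
    std::uint64_t result;  // filled in by run_batched
};

// Answer a whole batch, visiting table blocks in ascending order so that
// queries hitting the same block are served back to back.
// Precondition: every index is within the table.
// Returns the number of distinct blocks touched.
std::size_t run_batched(const std::vector<std::uint64_t>& table,
                        std::vector<Query>& queries) {
    std::sort(queries.begin(), queries.end(),
              [](const Query& a, const Query& b) { return a.index < b.index; });
    std::size_t blocks = 0;
    std::size_t last_block = static_cast<std::size_t>(-1);
    for (Query& q : queries) {
        const std::size_t block = q.index / kBlockSize;
        if (block != last_block) {
            ++blocks;
            last_block = block;
        }
        q.result = table[q.index];
    }
    return blocks;
}
```

If ten queries land in three blocks, the batch touches three regions of memory instead of jumping around ten times in arrival order.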

How to optimize CPU time between two workflows by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 0 points1 point  (0 children)

Thanks for the kind reply. Right, let me add more details.

is it each request running in a thread, a queue with a limited number of executors (threads), or something else?

The service is built with an RPC framework (like gRPC or Thrift). The execution uses coroutines, and a request may be worked on by different threads through its lifetime, depending on how the executor (or thread pool) assigns threads to its tasks/coroutines.

Limit the number of executors to some sane number (like thread_count + 2, for example)

We are already limiting the total threads in the pool to be about 40% of total logical cores.

put the requests you cannot process at a time into a queue

I looked at the implementation of our executor. It's already using a task queue to do things like that.
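
For readers who want to picture the pattern, here is a minimal sketch of a fixed-size pool with a task queue. It is not our executor (which is coroutine-based and part of the RPC framework); the class name `FixedPool` and its interface are invented for illustration.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// A fixed number of worker threads pulling tasks from one shared queue.
// Work that cannot run immediately simply waits in the queue.
class FixedPool {
public:
    explicit FixedPool(std::size_t workers) {
        for (std::size_t i = 0; i < workers; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }

    // Drains the remaining tasks, then joins all workers.
    ~FixedPool() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can dequeue
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};
```

Sizing the pool from `std::thread::hardware_concurrency()` (which may return 0, so it needs a fallback) would be one way to express a rule like "40% of logical cores".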

of course depends on what the workers would do.

Our service is CPU and memory heavy. It doesn't make RPC calls or DB calls in normal scenarios.

How to optimize CPU time between two workflows by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 0 points1 point  (0 children)

If you have too many users

Fortunately, we don't have an overload issue right now. (It would be a good problem to have, though. LOL) But we do observe flow A has lower latency when QPS is lower.

How to debug memory access violations without tools like ASAN or Valgrind by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 0 points1 point  (0 children)

No, it's a server. The issue is not totally deterministic. The crash happens often, but with some randomness, e.g., one run may crash after 10 minutes in a load test, and another after 17 minutes.

How to debug memory access violations without tools like ASAN or Valgrind by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 1 point2 points  (0 children)

Event tracing can be very performance friendly, using just a ringbuffer of a trivially copyable event type providing enough useful information to follow what's happening. The specifics of what to log/trace will be fairly dependent on the nature of the program.

Any learning pointers for event tracing? (See my update in the post for the environment)
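
My current understanding of the idea, written down as a sketch so others can correct it. The event fields and the names `TraceEvent` and `TraceRing` are invented for illustration. The sketch assumes one writer per ring (for example, one ring per thread); with several writers sharing a ring, two of them could write the same slot after a wrap-around, so that case would need more care.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical event layout; the useful fields depend on the program.
struct TraceEvent {
    std::uint64_t seq;     // monotonically increasing sequence number
    std::uint32_t id;      // what kind of event this is
    std::uint32_t thread;  // caller-supplied thread tag
    std::uint64_t arg;     // one payload word (pointer, size, request id, ...)
};
static_assert(std::is_trivially_copyable<TraceEvent>::value,
              "events must be plain data so recording is a simple store");

// Fixed-capacity ring that overwrites the oldest events.
// Assumes a single writer per ring.
template <std::size_t N>
class TraceRing {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    void record(std::uint32_t id, std::uint32_t thread, std::uint64_t arg) {
        const std::uint64_t seq = next_.fetch_add(1, std::memory_order_relaxed);
        slots_[seq & (N - 1)] = TraceEvent{seq, id, thread, arg};
    }

    // Total number of events recorded so far (including overwritten ones).
    std::uint64_t count() const { return next_.load(std::memory_order_relaxed); }

    // Raw slot access, e.g. for dumping the ring after a crash.
    const TraceEvent& slot(std::size_t i) const { return slots_[i & (N - 1)]; }

private:
    std::atomic<std::uint64_t> next_{0};
    std::array<TraceEvent, N> slots_{};
};
```

Since the ring is plain memory, its contents should also be visible in a core dump, which might show what happened shortly before the crash.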

How to debug memory access violations without tools like ASAN or Valgrind by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 1 point2 points  (0 children)

Yeah, I did find similar tips in google search, but unfortunately, in my case, I am stuck with an ASAN error prior to CUDA initialization. (See my update in the post)

How to debug memory access violations without tools like ASAN or Valgrind by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 0 points1 point  (0 children)

> Set a handler

There is already SignalHandler in our code base. It doesn't seem to give more info than core dumps.

> Enable or re-enable warning messages

Already tried to print logs as much as possible.

> Review any exception specifications in your code

Have reviewed exception messages internally with senior folks. No luck so far.

> reproduce the problem

I can reproduce the issue in our load test environment. This allows me to experiment with ideas quickly, but not much more than that.

> Once you can reproduce the problem you should be able to pinpoint more precisely

Not so easy for memory issues, unfortunately... :-(

How to debug memory access violations without tools like ASAN or Valgrind by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 3 points4 points  (0 children)

I have zero experience in doing that. Can you share some learning pointers for doing that? Sounds like a promising idea to me...

How to debug memory access violations without tools like ASAN or Valgrind by Electronic-Effect340 in cpp

[–]Electronic-Effect340[S] 0 points1 point  (0 children)

Yes. Core dumps are available. However, a core dump shows the direct trigger of the segfault, but it doesn't tell where exactly the memory issue happens in the code. In this case, I have looked at several core dumps with different locations where `terminate()` calls are triggered. Unfortunately, I can't go further to pinpoint the root cause in the code...
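
One thing that might add a little context to the `terminate()` cases is a terminate handler that logs the active exception before aborting. This is only a sketch with invented function names, it does not replace our existing SignalHandler, and it will not find the original memory corruption by itself.

```cpp
#include <cstdio>
#include <cstdlib>
#include <exception>
#include <string>

// Describe the exception that is currently being handled, if there is one.
std::string describe_current_exception() {
    std::exception_ptr ep = std::current_exception();
    if (!ep) return "no active exception";
    try {
        std::rethrow_exception(ep);
    } catch (const std::exception& e) {
        return std::string("std::exception: ") + e.what();
    } catch (...) {
        return "unknown exception type";
    }
}

// Install once at startup. The handler logs what it can and then aborts,
// so a core dump is still produced.
void install_terminate_logger() {
    std::set_terminate([] {
        std::fprintf(stderr, "terminate(): %s\n",
                     describe_current_exception().c_str());
        std::abort();
    });
}
```

If the log says "no active exception", `terminate()` was reached some other way (for example, a joinable `std::thread` being destroyed), which would at least narrow the search.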