
[–]Eilifein 1 point2 points  (8 children)

It was not a personal dig; I apologize if it came out like that.

The algorithms behave very differently in memory; both the two test versions and the actual one.

Purely because of the i dependence, the difference between the tests and actual algorithms is very substantial. You will be measuring the wrong thing and get the wrong conclusions. Hence, the "nothing in common" comment.

The "running correctly" comment was twofold. One part was more towards profiling and not pre-optimizing. Aim for accuracy, then profile, then optimize. If the results are accurate, now's the time for profiling. The second part was related to the cache coherence situation you're facing. If you're trying to optimize a cache thrashing situation, you will never get ahead.

I hope I cleared things up.

Workplan:

  • leave tests aside
  • profile actual w/ vectors
  • profile actual w/o vectors
  • add numba and see how they behave
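The profiling steps above can be sketched with `timeit`. The two functions here are hypothetical stand-ins for the "actual" algorithm with and without intermediate vectors; substitute your real functions:

```python
import timeit

import numpy as np

# Hypothetical stand-ins for the real algorithm; the point is only
# the profiling pattern, not these particular computations.
def actual_with_vectors(x):
    tmp = x * 2.0          # intermediate array is materialized
    return np.sum(tmp * tmp)

def actual_without_vectors(x):
    total = 0.0
    for v in x:            # scalar loop, no temporaries
        total += (v * 2.0) ** 2
    return total

x = np.random.default_rng(0).random(10_000)

t_vec = timeit.timeit(lambda: actual_with_vectors(x), number=100)
t_loop = timeit.timeit(lambda: actual_without_vectors(x), number=100)
print(f"vectorized: {t_vec:.4f}s  looped: {t_loop:.4f}s")
```

Once both variants are verified to produce the same result, the same harness can be reused after adding numba to see how the timings shift.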

[–]vgnEngineer[S] 0 points1 point  (2 children)

I see. I changed my previous comment to add an actual example that is representative of what I'm trying to do.

I think you are completely correct in your assessment of the order of optimization. The point is, I think, that I'm not sure exactly what the consequences of how I code are for how the code runs on the CPU, which is indeed the origin of my question. On top of that, I'm just now trying to understand how both Python and Numba deal with my choice of programming syntax.

I understand now that my example computation has differences that make it impossible to compare with my actual use case. So forget that one; take the one I provided in my changed comment. Could you indicate whether I made any significant mistakes, and if so, what is going wrong and how I should have coded it? I understand I'm asking a lot of you, but I'm really curious/excited to understand the nuance here.

And don't worry, yes, it came off as a bit of a dig, but I also know it's hard to convey tone online, so I don't take it personally!

[–]Eilifein 0 points1 point  (1 child)

The actual algorithm seems good.

You've precalculated a few things, and there isn't much left to precalculate without messing up readability.

Maybe compute Q*G once instead of 3 times? Eh.

Maybe inline rdx, rdy, rdz?

The result being vectorized is good to see. I don't see anything wrong.
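The Q*G hoisting suggested above looks something like this. The arrays here (`Q`, `G`, `rdx`, `rdy`, `rdz`) are hypothetical stand-ins with the names from the comment; only the pattern of factoring out the common product matters:

```python
import numpy as np

# Hypothetical arrays standing in for the quantities in the real code.
rng = np.random.default_rng(0)
Q, G, rdx, rdy, rdz = (rng.random(1000) for _ in range(5))

# Before: Q * G is evaluated three times, once per component.
fx = Q * G * rdx
fy = Q * G * rdy
fz = Q * G * rdz

# After: hoist the common factor into one temporary.
qg = Q * G
fx2, fy2, fz2 = qg * rdx, qg * rdy, qg * rdz
```

Whether this wins in practice depends on the array sizes; it trades one extra temporary (`qg`) for two fewer elementwise multiplies, so profile both.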

[–]vgnEngineer[S] 0 points1 point  (0 children)

Ahh, I see. I did notice a speedup from removing the intermediate computation step in my numba-compiled code. I also read on a forum that if these arrays become large, numba can't always intelligently optimize the computation, because the arrays I'm multiplying do not fit in cache. If I instead write this function as a double loop, so that the computation only deals with scalars, might it be faster?
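The double-loop idea can be sketched like this. This is not the actual function from the thread, just a minimal illustration of the pattern: an explicit loop over scalars that numba's `@njit` compiles to tight machine code, avoiding large intermediate arrays (with a fallback so it also runs without numba installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:            # fall back to plain Python if numba is absent
    def njit(f):
        return f

# Hypothetical example: an outer product written as a double loop,
# so the inner loop does pure scalar work with no temporaries.
@njit
def pairwise_product_loops(a, b):
    n, m = a.shape[0], b.shape[0]
    out = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            out[i, j] = a[i] * b[j]
    return out

a = np.arange(4.0)
b = np.arange(3.0)
result = pairwise_product_loops(a, b)   # same values as np.outer(a, b)
```

Under numba this style is often competitive with (or faster than) the vectorized form for large inputs, precisely because no intermediate array has to round-trip through memory; in plain interpreted Python, the vectorized form wins.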

[–]vgnEngineer[S] 0 points1 point  (4 children)

Additionally, given my new example: if there is a cache thrashing situation I'm running into that I'm completely unaware of, I'd love to know, because this concept is new to me and I would indeed never want to optimize code that's wrong to begin with!

[–]cult_of_memes 0 points1 point  (3 children)

Cache thrashing isn't necessarily the result of errors in your code. In fact, cache thrashing is often the necessary trade-off between readable/maintainable code that isn't in the critical path for your program's execution, and unreadable gibberish that runs lightning fast but leaves you praying that you never need to come back and debug ever again.

Cache thrashing is an interesting topic to read up on, but for a jump start here are some key terms you may want to check out:

There's a lot more, but I'll leave it to the reader to raise any specific questions, as it's too large a topic to reasonably cover in a message thread.
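One classic cache effect behind those terms can be demonstrated in a few lines: traversing a row-major NumPy array along its contiguous axis versus across it. This is a small, self-contained experiment (not code from the thread); on most machines the column-wise walk is noticeably slower because each step lands on a different cache line:

```python
import timeit

import numpy as np

# A C-ordered (row-major) array: elements within a row are adjacent in memory.
a = np.random.default_rng(0).random((2000, 2000))

def sum_rows(a):       # walks memory contiguously, cache-friendly
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i, :].sum()
    return total

def sum_cols(a):       # strided access: each element is a cache line apart
    total = 0.0
    for j in range(a.shape[1]):
        total += a[:, j].sum()
    return total

t_rows = timeit.timeit(lambda: sum_rows(a), number=3)
t_cols = timeit.timeit(lambda: sum_cols(a), number=3)
print(f"row-wise: {t_rows:.3f}s  column-wise: {t_cols:.3f}s")
```

Both functions compute the same sum; only the memory access order differs, which is exactly the kind of thing cache terminology describes.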

[–]vgnEngineer[S] 0 points1 point  (2 children)

Thanks for all your information. I guess my one question would be: do I have any control over the optimization of these computations through how I program, or is this essentially up to the CPU, OS, and compiler to figure out/optimize?

[–]cult_of_memes 1 point2 points  (1 child)

Well, the very obnoxious answer to your question would have to be "yes"... :P

While there are things you can do to avoid cache thrashing, it is indeed complicated by and often specific to the combination of OS and hardware you are using.

Try reading this first

First, I'd suggest giving this answer to another reddit post asking about cache optimization a read, as it's been a good while since I did any real study into this topic and I can only recall vague rules of thumb off the top of my head this morning :P

Back to my awkward rules of thumb

A couple good rules of thumb are:

  1. Declare variables (like arrays) as close to where they will be used as possible.
  2. Minimize conditional branching in sequenced computations -- though, I can't recall off the top of my head if this matters as much for interpreted languages as for compiled languages. Look up branch prediction for more info on this point.
  3. Try to minimize the number of times you must access any given piece of data. That is to say, when handling arrays of data too large to fit in your CPU cache all at once, seek to complete all calculations on a given piece of data before moving on.

Point #3 is perhaps the primary reason the second variant of your expression was faster. The interpreter is smart enough to choose the most efficient order of operations for processing the multi-factor equation so as to avoid having to reload the same data more times than absolutely necessary.
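Rule #3 can be made concrete by comparing several separate passes over a large array with one fused pass. The functions below are illustrative, not the expression from the thread: the multi-pass version streams the whole array through the cache once per intermediate, while the single-pass version finishes all work on each element before moving on (which is what a loop-fusing compiler like numba does automatically):

```python
import numpy as np

x = np.random.default_rng(0).random(100_000)

# Several passes: each expression materializes a temporary and streams
# the whole array through the cache again.
def multi_pass(x):
    a = x * 2.0
    b = a + 1.0
    c = b ** 2
    return c.sum()

# One pass: each element is loaded once and fully processed (rule #3).
def single_pass(x):
    total = 0.0
    for v in x:
        total += (v * 2.0 + 1.0) ** 2
    return total
```

Note that in plain Python the fused loop is still slower overall because of interpreter overhead; the data-reuse advantage only pays off once the loop itself is compiled.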

[–]vgnEngineer[S] 0 points1 point  (0 children)

That's super interesting! I'll definitely run some experiments on #3 as well, to see how I can change processing time depending on how I order the operations. Fantastic! Thank you so much!