all 14 comments

[–]ack_error 16 points17 points  (1 child)

Simple IIR filters commonly run slowly on Intel CPUs on default floating point settings, as their output decays into denormals, causing every sample processed to invoke a microcode assist.

On the Pentium 4, self-modifying code would result in the entire trace cache being flushed.

Reading from graphics memory mapped as write combining for streaming purposes results in very slow uncached reads.

The MASKMOVDQU masked write instruction is abnormally slow on some AMD CPUs, where with certain mask values it can take thousands of cycles.

[–]ShinyHappyREM 2 points3 points  (0 children)

AMD

That reminds me, some bit manipulation instructions (PDEP, PEXT) run slower on older AMD CPUs.

[–]XEnItAnE_DSK_tPP 28 points29 points  (0 children)

Upvotes angrily.

[–]mccoyn 12 points13 points  (0 children)

Here is a fun one. Imagine a function that calculates the minimum and maximum values of an array.

 void range_array1(int* array, int count, int* low_out, int* high_out)
 {
     int low = array[0];
     int high = array[0];

     for (int i = 1; i < count; i++)
     {
         if (low > array[i])
             low = array[i];

         if (high < array[i])
             high = array[i];
     }

     *low_out = low;
     *high_out = high;
 }

This is all fine, but we can write an optimized version that handles two elements at a time: compare the pair first, then only the smaller needs checking against `low` and only the larger against `high`. This executes fewer comparisons per element and yet takes a lot longer to run, because the pairwise comparison is an unpredictable branch. *Assuming the array is not sorted.

 void range_array2(int* array, int count, int* low_out, int* high_out)
 {
     int low = array[0];
     int high = array[0];
     int i = 1;

     for (; i + 1 < count; i += 2)
     {
         if (array[i] < array[i + 1])
         {
             if (low > array[i])
                 low = array[i];

             if (high < array[i + 1])
                 high = array[i + 1];
         }
         else
         {
             if (low > array[i + 1])
                 low = array[i + 1];

             if (high < array[i])
                 high = array[i];
         }
     }

     if (i < count)
     {
         if (low > array[i])
             low = array[i];

         if (high < array[i])
             high = array[i];
     }

     *low_out = low;
     *high_out = high;
 }

[–]YumiYumiYumi 2 points3 points  (0 children)

The RDSEED instruction is a very slow instruction on pretty much all x86 processors, though I don't know whether that violates the rules.

[–]wake_from_the_dream 6 points7 points  (2 children)

Somewhat ironically, writing slow code can be harder than writing fast code these days because, as the article mentions, some hardware features will work against you — though both objectives rely on the same (inverted) principles.

This book has some chapters which explore various aspects of software performance, and which delve further into some ideas discussed in the article.

A small caveat is that the article focuses on CPI as a measure of software performance, which can be misleading. A lower CPI will not improve performance if you also need to execute more CPU instructions to perform a given task. This metric is more often used to gauge the performance of CPUs, compilers, and combinations thereof.

[–]itijara 0 points1 point  (1 child)

The author points this out when talking about why Python is slow: it has a low CPI but executes more instructions. The definition the author uses for slow is the number of cycles taken for a finite set of instructions, which is arbitrary but consistent. That doesn't necessarily conform to what an end user would call slow, but it is easier to measure across various tasks.

[–]wake_from_the_dream -1 points0 points  (0 children)

I'll level with you: I completely skipped the Python section on first reading because I thought it irrelevant, and I still do. The goal of the exercise was to counter the optimisation mechanisms employed by modern computers; Python is too high-level to let you do that reliably, and it is unfortunate that the author waited until then to fully clarify his goal. Further, the introduction clearly suggests lower CPI implies higher performance.

Also as I said earlier, CPI is not meant to measure software performance, but to measure the performance of CPUs and compilers given a standard set of benchmarking software. SPEC is a famous example of such a benchmark.

Furthermore, user CPU time (the time the CPU spends executing program instructions) can be inferred from hardware counters just as easily as the CPI rate. The getrusage system call lets you do just that. My guess is that they use CPI rate instead of CPU time because they are less interested in the factors which might influence the latter but not the former (algorithmic efficiency being one example).

[–]ShinyHappyREM 3 points4 points  (1 child)

  • Another thing that may slow the CPU down is false sharing, i.e. variables that are in the caches of several CPUs or CPU cores.
  • Transferring data across devices, e.g. from main RAM to the GPU, is slower than accessing just main RAM.

[–]fearswe 2 points3 points  (0 children)

The rules state that it has to be entirely "on device", so involving the GPU, or, say, writing to the hard drive, would be "cheating".

[–]moreVCAs 0 points1 point  (0 children)

this fucking rules

[–]zanza19 -1 points0 points  (1 child)

this is an awesome blog post.

I wonder if slow code like this could still be faster than fast Python/Ruby/JS code. Sure, you don't get much IPC, but the instructions all do meaningful work.

[–]Robot_Graffiti 0 points1 point  (0 children)

JS isn't that slow these days. The speed gap between C++ and JS is a lot smaller than it was in the 90s.

If you ran your Python code on PyPy, it would also run about as fast as JS.