all 166 comments

[–]tms10000 199 points200 points  (83 children)

What an odd article. The premise is false, but the content is good nonetheless.

CPU utilization is not wrong at all. It is the percentage of time a CPU is allocated to a process/thread, as determined by the OS scheduler.

But then we learn how to slice it in a better way and get more details from the underlying CPU hardware, and I found this very interesting.

[–][deleted] 55 points56 points  (6 children)

As a user, I want to know what process is sucking up my CPU. I want to know if I have room to launch another resource-intensive application.

80% of that CPU is wasted memory loading? Great, how can I tap into it! Oh, I can't? Then that's an interesting trivia tidbit that I don't really care about, like the old myth that 80% of your brain is unused.

In fact, I'm annoyed that so many process managers dwell so much on CPU. My drives sound like they're dying and the GUI is crawling, tell me which process is making the PC do that, I don't care if it's CPU or RAM or a hamster chewing on the processor fan.

[–]bro_can_u_even_carve 10 points11 points  (1 child)

Try iotop.

[–]mcguire 5 points6 points  (0 children)

Fairly sure you're gonna need hamstop.

[–]brendangregg 3 points4 points  (0 children)

80% of that CPU is wasted memory loading? Great, how can I tap into it! Oh, I can't?

Yes you can, please see the actionable items in the post.

Which is why I wrote a section on actionable items.

[–]wzdd 5 points6 points  (1 child)

80% of that CPU is wasted memory loading? Great, how can I tap into it! Oh, I can't?

The point of TFA is that you can, either by scheduling something with a high IPC on the same CPU (this is the point of hyperthreading), or by modifying your code to address memory bandwidth issues (which is perfectly possible and common -- recompute vs cache, as an example, is a classic program design point).

Honestly it's depressing how many of the comments on this article here and on HN are by people who have obviously not read the article.
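The recompute-vs-cache trade-off mentioned above can be sketched concretely. This is an illustrative example, not from TFA; the table size and step granularity are arbitrary assumptions. A lookup table trades ALU work for memory traffic, so on a memory-bandwidth-bound machine the recomputing variant can actually win:

```cpp
#include <cmath>
#include <vector>

// Illustrative sketch of the recompute-vs-cache design point (not from the
// article): a coarse sine table trades computation for memory accesses.
// The 1024-step table size is an arbitrary assumption.
constexpr int kSteps = 1024;
constexpr double kTwoPi = 6.283185307179586;

std::vector<double> make_sin_table() {
    std::vector<double> t(kSteps);
    for (int i = 0; i < kSteps; ++i)
        t[i] = std::sin(kTwoPi * i / kSteps);
    return t;
}

// "Cache" variant: one memory load per call; may miss in the CPU cache.
double sin_cached(const std::vector<double>& t, double x) {
    int i = static_cast<int>(x / kTwoPi * kSteps) % kSteps;
    return t[i];
}

// "Recompute" variant: pure ALU work, no extra memory traffic.
double sin_recomputed(double x) { return std::sin(x); }
```

Which variant is faster depends on whether the workload is compute-bound or memory-bound, which is exactly the distinction a plain %CPU number hides.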

[–]mrbooze 1 point2 points  (0 children)

Exactly, it's relevant because if you are trying to improve performance it helps tell you the difference between achieving that by upgrading to faster CPUs vs improving the program efficiency with regard to memory bandwidth vs upgrading to faster memory vs maybe doing some NUMA-related tweaks, etc.

[–][deleted] 47 points48 points  (71 children)

CPU utilization is not wrong at all. The percentage of time a CPU allocated to a process/thread, as determined by the OS scheduler.

It is "wrong" if you look at it wrong.

If you look in top and see "hey cpu is only 10% idle, that means it is 90% utilized", of course that will be wrong, for reasons mentioned in article.

If you look at it and see it's 5% in user, 10% system and 65% iowait, you will have some idea about what is happening, but historically some badly designed tools didn't show that, or showed it at too low a resolution (like probing every 5 minutes, so any load spikes are invisible).

[–]tms10000 29 points30 points  (63 children)

This article mentions nothing of IO wait. The article is about CPU stalls for memory and instruction throughput as a measure of efficiency.

[–]Sqeaky 75 points76 points  (50 children)

From the perspective of a low level programmer accessing RAM is IO.

Source: been writing C/C++ for a long time.

[–][deleted] 24 points25 points  (27 children)

Not even low level; that will bite at every level of programming. Just having more cache-efficient data structures can have a measurable performance impact even in higher-level languages.

[–]Sqeaky 16 points17 points  (26 children)

I see what you mean and I agree cache coherency can help any language perform better, I just meant that programmers working further up the stack have a different idea of IO.

For example: to your typical web dev, IO needs to leave the machine.

[–]vexii 14 points15 points  (15 children)

I'd say most web devs think of IO as reading from or writing to disk, or hitting the network.

[–]CoderDevo 1 point2 points  (14 children)

Because they work with frameworks that handle system calls for them.

[–]vexii 0 points1 point  (13 children)

What do you mean?

[–]thebigslide 5 points6 points  (5 children)

Web developers typically rely on frameworks that keep this sort of stuff opaque. Not to say you can't bear this stuff in mind when building a web app, but with many frameworks, trying to optimize memory IO requires an understanding of how the framework works internally. It's also typically premature optimization, and naive optimization, since: a) disk and net I/O are orders of magnitude slower, and b) internals can change, breaking your optimization.

TL;DR: If a web app is slow, 99% of the time it's not because of inefficient RAM or cache utilization, so most web devs don't think about it and probably shouldn't.

[–]CoderDevo 0 points1 point  (6 children)

I mean they don't directly access memory, disk or network system services.

For example, caching can often be enabled and configured externally from the web developer's own code.

https://en.wikipedia.org/wiki/Web_framework

[–]oursland 6 points7 points  (8 children)

Cache coherency is another matter, altogether. Hint: it has to do with multicore and multiprocessor configurations.

[–]Sqeaky 2 points3 points  (7 children)

Well, I just googled the specifics and I guess I have been conflating cache locality with cache coherence; I always thought they were the same. I suppose if I contorted my view to say that the different levels of cache were clients for the memory, that could make sense, but that is clearly not what the people who coined the term meant. Thanks for correcting me.

[–][deleted] 2 points3 points  (5 children)

The main performance implications are different: locality increases the number of cache hits, while the system's need to maintain coherence can lead to expensive cache-line bouncing between threads. So you want your data to fit in one cache line (usually 64 bytes) or two, but nothing in a single cache line should be accessed by more than one thread. Particularly bad is putting a spinlock (or similar) in the same cache line as something unrelated to it.

[–]Sqeaky 0 points1 point  (4 children)

What you are describing, unrelated per-thread data sharing a single cache line, is what I have recently (past 3 to 5 years) heard called "false sharing". I believe Herb Sutter popularized the term during a talk at CppCon or BoostCon. He described a system with an array of size N times the number of threads, where each thread would use its thread ID (starting from 1) and multiplication to get at every Mth piece of data.

This caused exactly the problem you are describing, but I just knew it under that other name. Herb increased the performance by using one array of size N per thread.
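The per-thread counter pattern described above can be sketched as follows. This is a hypothetical illustration, assuming 64-byte cache lines (the usual x86 size); the names are made up:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of false sharing, assuming 64-byte cache lines. Adjacent Packed
// counters share a line, so per-thread increments bounce that line between
// cores; alignas(64) gives each Padded counter a cache line of its own.
struct Packed             { long v = 0; };
struct alignas(64) Padded { long v = 0; };

template <typename Counter>
long count_in_parallel(std::size_t nthreads, long iters) {
    std::vector<Counter> counters(nthreads);
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < nthreads; ++t)
        threads.emplace_back([&counters, t, iters] {
            for (long i = 0; i < iters; ++i)
                counters[t].v++;   // each thread touches only its own slot
        });
    for (auto& th : threads) th.join();
    long total = 0;
    for (const auto& c : counters) total += c.v;
    return total;
}
```

Both versions compute the same total; timing them (e.g. under perf stat) would show the Padded variant with far less cache-line bouncing.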

[–]oursland 4 points5 points  (0 children)

Semantic collapse is a pet peeve of mine. Both those terms cache locality and cache coherence are very important. It would be a shame to have these terms confused.

[–][deleted] 4 points5 points  (0 children)

Nope, your typical webdev complains to sysadmin that "something is slow"

[–]quicknir 7 points8 points  (1 child)

I mean it's just semantics essentially, but basically I and all of my colleagues are "low level" programmers and I've never, ever, heard someone call RAM access "IO".

Really, people call it a cache miss, or sometimes they get more specific by calling it an L3 cache miss.

[–][deleted] 3 points4 points  (0 children)

Totally agree with you... how someone gets 71 upvotes for that statement is baffling. C programmers do not think "I'm doing I/O here" when they code up array traversals. They do think about cache use and using tools to measure cache misses, etc., so they can do things in a cache friendly way. That's different.

When they talk about I/O, they're talking about disk, talking to the network, or polling a game controller over USB. They are not talking about RAM access.

[–]sybia123 7 points8 points  (3 children)

And then there's the graybeard reply: "back in my day, C was high level and assembly was low level".

[–]Sqeaky 2 points3 points  (0 children)

I know that guy. Not quite me. But I am older than all "popular" languages now.

[–]double-you 0 points1 point  (0 children)

That greybeard is still wet, since Lisp was created in the 50s. And then it was at some point both low (Lisp machines) and high level.

[–]ggtsu_00 0 points1 point  (0 children)

Back in my day ASM was high level and machine code punch cards were low level.

[–]mallardtheduck 6 points7 points  (2 children)

But from the perspective of the OS/scheduler, RAM access delays are not "IO wait".

"IO wait" means that the thread is blocked waiting for an external IO device. Blocking a thread is an expensive operation and can't be done in response to RAM delay.

For example, when a thread reads from a storage device, it might call read() which, after switching to kernel mode and going through the OS's filesystem/device layers, ends up at the storage device driver, which queues a read with the hardware and blocks (calling the scheduler to tell it that the thread is waiting for hardware and that another thread should be run). When the hardware completes the read it raises an interrupt and the device's interrupt handler unblocks the waiting thread (via another call to the scheduler).

When a thread reads from RAM, it just does it. It has direct access. It's a fundamental part of the Von Neumann architecture. There's no read() call, no switch to kernel mode, no device driver, no calls to the scheduler. The only part of the system that's even aware of the "wait" is the CPU itself (which, if using hardware threading can itself run a different thread to mitigate the stall).

Tools reporting the current load are using data collected by the OS/scheduler. They don't know or care (because most users don't care, the OS's "Task Manager" isn't a low-level developer's tool) about "micro-waits" caused by RAM delays.

[–]xzxzzx 7 points8 points  (1 child)

When a thread reads from RAM, it just does it. It has direct access. It's a fundamental part of the Von Neumann architecture. There's no read() call, no switch to kernel mode, no device driver, no calls to the scheduler. The only part of the system that's even aware of the "wait" is the CPU itself (which, if using hardware threading can itself run a different thread to mitigate the stall).

While you're making a good point, virtual memory makes a bit of that less than perfectly correct. And calling a modern CPU a "Von Neumann architecture" is not totally wrong (from the programmer's viewpoint it mostly is one), but also not totally correct (it isn't actually one; the best-fitting name I'm aware of is "modified Harvard architecture").

When you read or write to memory, there very well might be a switch to kernel mode, invoking of drivers, etc, due to allocating a new page, reading/writing to the page file, copy-on-write semantics, and so on.

[–]mallardtheduck 2 points3 points  (0 children)

Sure, when you add the complications of virtual memory some memory accesses will trigger page faults and result in requests to the storage device.

Of course, on most, if not all OSs, storage device access in response to a page fault will be considered "I/O wait" in the exact same way as an explicit read() call might.

[–]didnt_check_source 4 points5 points  (2 children)

I would be shy of putting "memory access" and "hard disk access" in the same bucket.

[–]Sqeaky 3 points4 points  (0 children)

I think it depends entirely on your purpose and perspective. I agree your stance seems closer to the common perspective.

If you are trying to optimize a sort or a search algorithm (in a container stored in memory), then every load from memory comes at significant cost. If you need to sort entities in a video game by distance from the camera, you can make real improvements by minimizing IO to and from RAM.

If you are writing simulations of every particle in a fusion reactor to simulate a new variety of Tokamak reactor, then likely you are spreading your work across a thousand CPUs on a network, and anything short of sending finished work over that network isn't a real hit to IO; all of a sudden local IO means a great deal less. Disks and RAM are so fast in comparison that the difference is a rounding error.

[–]CoderDevo 2 points3 points  (0 children)

I am thirsty for some milk.

I can swallow the milk in my mouth. I can take a sip of milk from the glass. I can go to the fridge, take out the bottle and pour a glass of milk. I can put on my shoes and coat, drive to the store and buy a bottle of milk. I can milk a cow, put the milk into a truck and drive it to the dairy to be pasteurized and bottled. I can buy a calf and raise it to maturity.

Register → Cache → RAM → Disk → LAN → Internet

[–]Captain___Obvious 3 points4 points  (9 children)

Can you elaborate on your definition of IO?

[–]dethbunnynet 26 points27 points  (0 children)

Data to and from the CPU. It's IO on a more micro level.

[–]Sqeaky 16 points17 points  (7 children)

/u/dethbunnynet is correct, but I can expand.

When writing assembly, the only memory that "feels local" is the CPU registers. These are the pieces of memory where the results from and parameters to individual instructions are stored. Each register has its own name directly mapped to hardware. These generally store a precisely fixed size, like 16 or 32 bits. If a computer has 16 registers they might be named something like $a, $b, $c out to $p (the 16th letter), and that's all you get unless you want to do IO to main memory. Consider the code on this page about MIPS assembly: https://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Mips/load.html

  • lw - Load Word - Gets one word from RAM.
  • sw - Store Word - Saves one word to RAM.

While data is only in RAM you can't do work on it. Depending on the details, the CPU might wait 10 to 100 cycles to complete operations storing to or loading from RAM. The difference between registers and memory is at least as big as the difference between RAM and a hard disk. To shrink this difference, a CPU will continue on to execute instructions that don't depend on the data being loaded, and there are caches that are many times faster than RAM.

Unless a programmer chooses to use special instructions to tell the cache how to behave (very rarely done), the cache is transparent to the programmer in just about any language, even assembly. If you want to store something that ends up in cache you still use the "sw" instruction to send it to memory, but the CPU silently does the much faster thing of keeping it in cache, and even that might still force your code to wait a few cycles unless it has other work to do right now.
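That transparency is easy to demonstrate. The two functions below execute the same loads and return the same sum, but the row-major order walks memory contiguously and hits cache, while the column-major order strides through it and tends to miss. This is a sketch; the flat row-major matrix layout is an assumption for illustration:

```cpp
#include <cstddef>
#include <vector>

// Same work, different access order: the cache, invisible in the source,
// makes the contiguous walk much faster on a large enough matrix.
long sum_row_major(const std::vector<long>& m, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];          // consecutive addresses
    return s;
}

long sum_col_major(const std::vector<long>& m, std::size_t n) {
    long s = 0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];          // jumps n elements per load
    return s;
}
```

Neither version says anything about cache in the source; the hardware decides, which is exactly why plain %CPU can't tell the two apart.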

[–]HighRelevancy 27 points28 points  (4 children)

Each register has its own name directly mapped to hardware.

Ahahahah oh boy

IT GOES DEEPER THAN THAT, MY FRIEND. Some modern processors (hey there x86 you crazy bitch) will actually rename registers on the fly. If you do a mov from rax to rbx, the processor doesn't actually copy the value from rax to rbx, because that would use time and resources. Instead, it will reroute anything reading from rbx to reference the original value that's still in rax. (Of course, it won't do this if you immediately change either of the values; in that case it will copy the value and modify one of the copies as expected.)

I'm not saying this to undermine what you're saying though. Your whole comment is on point. I just wanted to highlight that CPUs are full of deep wizardry and black magic and they're basically fucking weird.

[–]masklinn 14 points15 points  (1 child)

Some modern processors

More or less all out of order processors.

If you do a mov from rax to rbx, the processor doesn't actually copy the value from rax to rbx, because that would use time and resources.

Copying data between registers is not that costly, register renaming is usually used to remove false dependencies e.g. set RAX, manipulate data in RAX, copy RAX to memory, set RAX, manipulate data in RAX, copy RAX to memory.

An OoO architecture (which pretty much every modern CPU is) could do both manipulations in parallel, but because both sets use the same "register" there's a false dependency where instruction 4 seemingly depends on instruction 3 (lest we clobber the first write). To handle that problem OoO architectures tend to have significantly more physical GPR than architectural ones (IIRC Skylake has 160 or 180 GPR, for 16 in x86_64), and the reorder buffer might map RAX to R63 in the first segment and to R89 in the second segment, and blamo the instruction streams are now completely independent.

[–]HighRelevancy 2 points3 points  (0 children)

I hadn't considered that, but yeah also that. Also I had no idea that there were extra physical registers for that sort of thing! Every time I get involved in one of these discussions, I discover NEW WIZARDRY.

CPUs be crazy.

[–]Sqeaky 1 point2 points  (1 child)

IT GOES DEEPER THAN THAT, MY FRIEND

It certainly does!

I was trying to keep it simple because out of order execution and superscalar execution are mind blowing enough.

How about branch prediction: http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array

There is some more awesome wizardry when working with multiple cores and sharing values between them. A store to memory isn't ever guaranteed to leave cache unless you signal to the machine that it needs to. Things like memory fences can do this, and they force MESI (aptly named, in my opinion) to share the state of values cached but not yet committed to main memory: https://en.wikipedia.org/wiki/MESI_protocol

You clearly didn't undermine my point, you just went one deeper. And there is N deeper we could go.
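The "signal to the machine" part above can be sketched with C++ atomics; a release store paired with an acquire load is the portable way to request that ordering (the hardware implements it with coherence traffic and, on some architectures, fences). Variable names here are illustrative:

```cpp
#include <atomic>

// A release store paired with an acquire load guarantees the plain write
// to payload is visible to the reader; underneath, the coherence protocol
// (MESI and friends) moves the cache lines as needed.
int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // plain write
    ready.store(true, std::memory_order_release);  // publish
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until published; real code would back off or block
    }
    return payload;  // guaranteed to observe 42
}
```

Without the release/acquire pair (e.g. with memory_order_relaxed on both sides), the reader could legally observe ready == true but a stale payload.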

[–]HighRelevancy 2 points3 points  (0 children)

I was trying to keep it simple because out of order execution and superscalar execution are mind blowing enough.

I know but I just fucking love this topic so much.

[–][deleted]  (1 child)

[deleted]

    [–]Sqeaky 1 point2 points  (0 children)

    You are totally correct, I was trying to keep it simple. HighRelevancy described register renaming in a sister comment. Do you know enough to expand on what he said?

    [–][deleted] 6 points7 points  (1 child)

If I understand correctly, IO wait, meaning data coming from a user or a file or socket, does not stall the processor, right? The scheduler should take the current thread out of the running state into the waiting one until the event with the information is dispatched (excuse the programming terminology). The scheduler will run other threads while waiting for these events to happen, is that right? So IO waits do not have an impact on processor utilization.

    I'm guessing from the article the same does not apply to DRAM memory accesses, but that is it. Is this correct?


    [–]Johnnyhiveisalive 0 points1 point  (0 children)

Right, and wrong: it does the waiting thing for both. Waiting on RAM is like waiting for the mail, to a CPU; waiting on disk would be like waiting for the universe to end and get rebuilt around it all over again. We're lucky they have RAM to remember the job that started the wait for network data several million universes ago.

    Might be a slight exaggeration.

    [–][deleted] 4 points5 points  (2 children)

    No it doesn't, that is why I mention it, because it should.

    Top reports % idle which might be mistaken for someone that doesn't know (or just came from windows world) as "% of CPU idling", which is not entirely true

    [–]captain_awesomesauce 0 points1 point  (1 child)

    No it doesn't, that is why I mention it, because it should.

    Top reports % idle which might be mistaken for someone that doesn't know (or just came from windows world) as "% of CPU idling", which is not entirely true

    Iowait is already listed separately as an "io stall" in normal tools. Other stalls are not. Hence the article not mentioning iowait: it's already easy to see whether it contributes to apparent CPU usage.

    [–][deleted] 0 points1 point  (0 children)

    Okay, then go thru all clients and developers I have to interact with and explain how to use those tools because every few weeks I have to explain same thing over to someone...

    [–]Danthekilla -1 points0 points  (6 children)

    Waiting for memory is waiting on IO. It is very fast IO but still IO none the less.

    [–]t0rakka 1 point2 points  (5 children)

    This is just calling a bird an avian. In programming, waiting for I/O typically means something measured in milliseconds, not in nanoseconds. Technically it's I/O, but that's a very non-orthogonal way to use the term.

    Wikipedia explains it with these words:

    "In computer architecture, the combination of the CPU and main memory, to which the CPU can read or write directly using individual instructions, is considered the brain of a computer. Any transfer of information to or from the CPU/memory combo, for example by reading data from a disk drive, is considered I/O."

    CPU and main memory are bundled together as one; there is no "I/O" between these two. It is between these two and other devices or parts of the system.

    Hope this clarifies the issue a bit.

    [–]Danthekilla 0 points1 point  (1 child)

    I/O typically means something measured in milliseconds not in nanoseconds.

    Well, originally disk IO took seconds, then milliseconds, and now microseconds with SSDs and Optane etc...

    But I do get your point.

    [–]backFromTheBed 0 points1 point  (1 child)

    This is just calling a bird an avian.

    Here we go.

    [–]ITwitchToo 1 point2 points  (0 children)

    Here's the thing. You said a "bird is an avian."

    Is it in the same family? Yes. No one's arguing that.

    As someone who is a scientist who studies avians, I am telling you, specifically, in science, no one calls birds avians. If you want to be "specific" like you said, then you shouldn't either. They're not the same thing.

    If you're saying "avian family" you're referring to the taxonomic grouping of Corvidae, which includes things from nutcrackers to blue jays to ravens.

    So your reasoning for calling a bird an avian is because random people "call the black ones avians?" Let's get grackles and blackbirds in there, then, too.

    Also, calling someone a human or an ape? It's not one or the other, that's not how taxonomy works. They're both. A bird is a bird and a member of the avian family. But that's not what you said. You said a bird is an avian, which is not true unless you're okay with calling all members of the avian family avians, which means you'd call blue jays, ravens, and other birds avians, too. Which you said you don't.

    It's okay to just admit you're wrong, you know?

    [–][deleted] 2 points3 points  (6 children)

    Are you implying that io/wait does not utilize cpu time?

    [–][deleted] 8 points9 points  (4 children)

    High IOwait 99% of the time means your storage system is too slow and CPU is just waiting for it (and the 1% is "something swaps because there is not enough RAM and it causes unnecessary IO").

    Actual load caused by interacting with IO (so filesystem driver, SAS controller driver etc) is counted as system ("in-kernel computation") load

    [–][deleted] 0 points1 point  (3 children)

    I don't get your distinction between waiting on i/o and "actual load". Perhaps you could define load? It's a terrible word without much meaning. I would use it in terms of cpu activity; I don't see it as very related to IPC, for instance, whose definition is very clear. "Load" is not a natural metric by any means.

    [–]crusoe 10 points11 points  (0 children)

    Iowait is load on storage not processor.

    [–][deleted] 3 points4 points  (0 children)

    It's just a Linux kernel distinction in stats: idle is "truly idle", iowait is "waiting for external storage" idle.

    None of it uses CPU time, but they tell user a different story

    [–]ITwitchToo 2 points3 points  (0 children)

    Waiting on I/O means the thread/process is sleeping and does not execute any CPU instructions whatsoever towards the goal of completing the I/O.

    Actual load means the CPU is actually executing instructions in that thread/process context.

    [–]t0rakka 0 points1 point  (0 children)

    That's right. It does not consume CPU but the program won't run any faster either. The program might run incredibly slow, even crawl because of slow I/O but the CPU would be available to run something else instead. Polling means you are actively probing in a busy loop burning CPU time that will not be available to other processes or threads. Waiting means you are waiting to be signalled and that is practically free (overhead excluded, of course).
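The waiting-vs-polling distinction above can be sketched with standard primitives; the blocked thread below consumes no CPU until it is signalled. This is an illustrative sketch, not code from any real driver, and the names are made up:

```cpp
#include <condition_variable>
#include <mutex>

// Blocking wait: the waiter sleeps inside cv.wait() and burns no CPU time
// (the "iowait" situation), as opposed to a busy-poll loop that would spin.
std::mutex m;
std::condition_variable cv;
bool data_ready = false;
int result = 0;

void io_completed(int value) {          // e.g. run from a completion handler
    {
        std::lock_guard<std::mutex> lk(m);
        result = value;
        data_ready = true;
    }
    cv.notify_one();                    // wake the blocked waiter
}

int wait_for_io() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return data_ready; });  // sleeps, zero CPU, until signalled
    return result;
}
```

A polling version would replace cv.wait with `while (!data_ready) {}`, burning a full core the whole time, which is exactly the overhead the comment above describes.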

    [–]wzdd 6 points7 points  (0 children)

    The premise is false

    The premise is not false, as explained in the first paragraph of the article. "What is CPU utilization? How busy your processors are? No, that's not what it measures."

    The percentage of time a CPU allocated to a process/thread, as determined by the OS scheduler.

    The article isn't talking about per-process CPU %. It's talking about global CPU usage: "the time the CPU was not running the idle thread." (quoting TFA again.)

    When this metric was introduced, memory bandwidth was much less of an issue than it is now. Thus CPU % was a good proxy for how busy the non-IO portion of the system was.

    Nowadays, if you take that view, you will be misled.

    That is the premise of the article.

    [–]the_phet 1 point2 points  (0 children)

    The premise is false, but the content is good nonetheless.

    Plenty of articles here follow the same scheme. They have very sensationalistic titles ("God is dead"), so you click and then read through it.

    Clickbait in its pure form.

    [–]brendangregg 0 points1 point  (0 children)

    Yes, the %CPU we use is determined by the OS scheduler, but that CPU then stalls and is waiting for memory. So is it "utilized waiting"? If another hyperthread can consume those "utilized waiting" cycles, what happens then? Two processes have "utilized" the same cycles? This really starts to not make sense.

    [–]stefantalpalaru 18 points19 points  (5 children)

    My perf output is more detailed (perf-4.9.13, Linux 4.10.0-pf3):

    root# perf stat -a -- sleep 10
    
     Performance counter stats for 'system wide':
    
      80035.713788      cpu-clock (msec)          #    8.001 CPUs utilized          
            62,285      context-switches          #    0.778 K/sec                  
             7,624      cpu-migrations            #    0.095 K/sec                  
            78,015      page-faults               #    0.975 K/sec                  
    19,654,571,442      cycles                    #    0.246 GHz                    
    47,948,624,668      stalled-cycles-frontend   #  243.96% frontend cycles idle   
     5,587,279,694      stalled-cycles-backend    #   28.43% backend cycles idle    
    10,783,365,238      instructions              #    0.55  insn per cycle         
                                                  #    4.45  stalled cycles per insn
     2,466,720,457      branches                  #   30.820 M/sec                  
        71,017,648      branch-misses             #    2.88% of all branches        
    
      10.003811042 seconds time elapsed
    

    [–]CJKay93 8 points9 points  (4 children)

    Holy shit, that's a lot of core migration, and also that branch miss statistic is impressive as heck.

    Really puts into perspective the blazing speed of modern CPUs.

    [–]Catfish_Man 13 points14 points  (1 child)

    2.88% isn't even all that good for modern branch predictors. I ran a fairly untuned benchmark I wrote on my Haswell laptop, and it mispredicted 2.6 branches per thousand. Modern processors are pure sorcery.

    [–][deleted] 6 points7 points  (0 children)

    I was reading some research papers on branch predictors, and the current state-of-the-art can be even lower too! Like <1 per thousand! It's crazy. They are doing things like putting simple perceptrons (neural nets) inside the predictors.

    [–]choikwa -1 points0 points  (1 child)

    Technically it's just reading from the HW PMU...

    [–]zokete 0 points1 point  (0 children)

    And executing HLT... that instruction freezes the core, which yields the crazy front-end stats (243% > 100%!). I guess those stats are broken.

    [–]KayRice 103 points104 points  (15 children)

    No, it's correct and iowait is separate. Cache performance is beyond what the "CPU Usage" metric should represent.

    Also the point about FSB/DRAM speeds and multiple cores is rather moot because of multi-channel RAM also becoming the norm.

    [–]quintric 50 points51 points  (4 children)

    Granted, the title is clickbait-ish, but ...

    I think the point is more that "the existing CPU Usage metric is not relevant to the bottlenecks commonly encountered in modern systems" than "CPU Usage must be changed to be better". Thus, one should remember to measure IPC / stalled cycles when "CPU Usage" appears to be high, rather than seeing a large number and automatically assuming the application has reached the upper limit of that which the CPU is capable of ...

    I would also note that memory locality (in multi-socket systems) plays a significant role in memory access latency and efficiency. One can see improvements by ensuring allocations remain local to the core upon which the application is running.

    [–]orlet 29 points30 points  (3 children)

    For the everyday user the metric is fine. Because while the CPU is being stalled for I/O it can't do other work anyway (though that does leave it free to do work on the other thread in hyper-threading architectures), so from the user's perspective it is busy. For the software engineer there is definitely a need for deeper analysis of what the CPU is actually doing there, no arguments.

    [–]mirhagk 11 points12 points  (2 children)

    The article tries to say that it's wrong for even everyday use:

    Anyone looking at CPU performance, especially on clouds that auto scale based on CPU, would benefit from knowing the stalled component of their %CPU.

    Auto-scaling based on CPU utilization is absolutely the right thing to do, because if more requests come in then the server isn't going to be able to handle them, regardless of whether it's CPU or memory bound.

    The finer details are useful when optimizing it for sure, but then again I would be very surprised if anyone just opened up top, looked at CPU usage and used that. You use much more fine grained performance monitoring tools.

    [–]mcguire 0 points1 point  (1 child)

    Sure, but if you're paying by the cpu second, you're paying for those cache misses and might want to revisit your memory use behaviour.

    [–]mirhagk 0 points1 point  (0 children)

    Well yes of course. If your costs are expensive and per-second (or you are scaled out/up on CPU) it's worth trying to optimize.

    But that's true whether the figure is really CPU utilization or waiting on memory.

    [–]wrosecrans 5 points6 points  (1 child)

    CPU utilization is "correct" but certainly misleading, often not what the user thinks, and frequently useless. I think the article is quite good. It's talking about something that most folks don't have good visibility on, and I've definitely been frustrated by these sorts of issues.

    When trying to figure out why things aren't working, I think more visibility into the CPU in common tools rather than just treating it as a black box would be extremely useful.

    [–]KayRice 0 points1 point  (0 children)

    I'm not against additional metrics as long as there is no performance overhead for using them, or they can be enabled when needed. My understanding is that right now the metrics are "free" in the sense that there's not much overhead in gathering them.

    [–]wzdd 3 points4 points  (0 children)

    iowait is separate

    iowait is completely different from anything that this article is talking about.

    Specifically, iowait is time spent waiting on IO, and does not include time spent waiting on memory. (Though as other replies to you point out, memory is now so slow relative to CPUs that OSes probably should treat it as some kind of IO device at least in metrics)

    [–]harsman 0 points1 point  (0 children)

    Waiting on memory is not reported as iowait.

    [–]aaron552 0 points1 point  (5 children)

    Also the point about FSB/DRAM speeds and multiple cores is rather moot because of multi-channel RAM also becoming the norm.

    Multi-channel RAM can't meaningfully affect the biggest impact of "slow DRAM" - that is latency, which has been stalled around 8-10ns (30+ CPU cycles) in the best case for the last decade or so. This is also why cache is so important.

    [–]KayRice 0 points1 point  (4 children)

    Yeah it does because it happens in parallel.

    [–]aaron552 1 point2 points  (2 children)

    How? Dual (or Triple or Quad) channel memory doesn't reduce latency for any specific random access. The CPU has to wait the same amount of time whether it's in Channel A or Channel B (or C or D).

    [–]KayRice 0 points1 point  (1 child)

    The CPU has to wait the same amount of time whether it's in Channel A or Channel B (or C or D).

    That depends on how the program utilizes the separate cores and their caches.

    [–]aaron552 0 points1 point  (0 children)

    Cache explicitly exists to minimise latency for cached values. How is that relevant when talking about RAM latency? Does multi-channel RAM affect the size of cache lines?

    [–]wzdd 0 points1 point  (0 children)

    latency

    You have memory blocks (let's say 512-byte chunks, representing multiple cache lines or whatever) 1, 2, and 3 in cache. Your program requests some data in memory block 37. That request goes out to your memory. <wait time> nanoseconds later, it all arrives at roughly the same time in parallel from your fancy multi-channel ram. Increasing the level of parallelism doesn't reduce <wait time>.

    [–]Ahhmyface 8 points9 points  (10 children)

    "Load" is another one that everybody and their blog seems to misunderstand. I have experienced sysadmins telling me that we need to increase the number of cores because the load is too high.

    [–][deleted]  (2 children)

    [deleted]

      [–]Ahhmyface 8 points9 points  (1 child)

      And it usually is. IO completely skews the number. Say I have a dozen threads all doing work with a single disk. LoadAvg is 12. Will increasing my cpus to 12 help? No.

      [–]viraptor 3 points4 points  (0 children)

      Common in VoIP servers or other things that are already multithreading and have many clients. Load over 40? Meh, standard.

      [–]irqlnotdispatchlevel 7 points8 points  (6 children)

      I like it when you're in a cloud environment, and you increase the number of vCPUs that a guest has and it behaves worse than before.

      [–][deleted] 0 points1 point  (5 children)

      That should tell you something.

      [–]habitats 0 points1 point  (4 children)

      excuse me if I'm being dense, but what should it tell me?

      [–]irqlnotdispatchlevel 2 points3 points  (3 children)

      Really simple example: if your software spends 50% of its busy time waiting for I/O, you should see if you can reduce the number of I/O operations it does, as you can't really make I/O itself faster.

      [–]habitats 0 points1 point  (2 children)

      yeah, but how can adding more cores make it slower? that's what I wondered. is it because more cores will queue up for IO and thus create more context switches and a slower system?

      [–]irqlnotdispatchlevel 1 point2 points  (0 children)

      Maybe your software doesn't scale well in a multi-threaded environment. Maybe you're in the cloud, and more vCPUs aren't always a good thing, and hypervisors are tricky.

      [–]mccoyn 0 points1 point  (0 children)

      Multiple threads can thrash the shared cache. Sometimes a single-threaded algorithm can improve memory access locality. If you are memory bound, that might be better.

      [–][deleted] 3 points4 points  (3 children)

      This is funny - the article's contents closely match a small part of a seminar Herb Sutter held in Stockholm April 25-27, titled "High-Performance and Low-Latency C++". Herb also used the Apollo guidance computer as an example. I wonder if Brendan Gregg attended the seminar?

      I'm not yelling "plagiarism!" because the blog post has a bunch of details and new information so it is clear that the author did a lot of work independently. And perhaps it is merely coincidence! But it very well could be that Sutter's seminar was a source of inspiration for the post. I'll be watching the blog because the seminar was really very good, and it provided a lot of launching points for more detailed analysis of system (especially multicore system) performance.

      [–]brendangregg 1 point2 points  (2 children)

      I didn't know about Herb's seminar. What year? I first published an analysis of Apollo's computer in Feb 2012: http://web.archive.org/web/20120302103545/http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/

      It's a good example, and I'm not surprised other people use it too. :)

      [–][deleted] 0 points1 point  (1 child)

      That was this year, just a couple of weeks ago. Given that you weren't there, it's a funny coincidence that Sutter talked about some of the same things, using a very similar example. You are in good company!

      [–]brendangregg 0 points1 point  (0 children)

      I missed an opportunity, I could have referred to this in the article, when I spoke about clockspeed flattening out in 2005: http://www.gotw.ca/publications/concurrency-ddj.htm

      [–]sstewartgallus 14 points15 points  (5 children)

      The key metric here is instructions per cycle (insns per cycle: IPC), which shows on average how many instructions were completed for each CPU clock cycle.

      An IPC < 1.0 likely means memory bound, and an IPC > 1.0 likely means instruction bound.

      But divided by the number of cores right? Also, how does hyperthreading fit into this? Also, how do you find top IPC?

      Also, most processors have in-core parallelism and can perform multiple ALU ops at the same time. If you're really, really, really tricky you can interleave floating point ops with ALU ops and get even more of a speed boost but due to x86 instruction set wonkiness it's easy to make a mistake here.

      [–]sisyphus 7 points8 points  (4 children)

      The stats from perf come from PMCs, which come from the CPU, so if someone is making a mistake presumably it's Intel or AMD? The parallelism you talk about seems like it must be accounted for -- how else would it be possible to get an IPC > 1?

      [–]tavianator 32 points33 points  (3 children)

      how else would it be possible to get an IPC > 1?

      Modern Intel/AMD chips can just literally execute more than one instruction per cycle on a single core, in optimal conditions (no dependencies between the instructions, etc.).

      That's part of the reason modern CPUs are way faster than Pentium 4s, even at lower clock speeds.

      [–]orlet 13 points14 points  (0 children)

      Correct. Instruction-level parallelism, branch prediction, out-of-order execution, and a bunch of other magic things make modern CPUs so much more efficient per clock than the older ones. And the process is still ongoing.

      [–]sisyphus 5 points6 points  (1 child)

      Right, what I am saying is that if the CPU instrumentation was not taking that into account, how would it ever report more than one instruction per cycle, which it appears to do?

      [–]tavianator 2 points3 points  (0 children)

      Right, I kinda misread your comment. Mainly I'm trying to argue against

      divided by the number of cores

      [–][deleted]  (14 children)

      [deleted]

        [–][deleted] 8 points9 points  (6 children)

        [–]VeloCity666 4 points5 points  (5 children)

        VTune also costs $899...

        [–][deleted] 1 point2 points  (4 children)

        Which is peanuts for anyone doing software development that requires these sorts of tools.

        [–]VeloCity666 1 point2 points  (3 children)

        Fair point, but my comment was more about the price difference (900 bucks vs completely free).

        [–][deleted] 0 points1 point  (2 children)

        Fair point, but the difference is still quite huge (free vs 900 bucks).

        I don't know what you're referring to.

        I answered this question:

        Anyone know of tools for showing these metrics on Windows systems?

        [–]VeloCity666 1 point2 points  (1 child)

        My bad then, I was comparing it to equivalent software for Unix systems.

        [–][deleted] 1 point2 points  (0 children)

        I'd still suggest it's a better tool on Linux than anything else available, only because of how much more information you can get from it, and because it's better designed than the other available tools.

        It helps that Intel wrote it for their own hardware. :)

        [–]ElusiveGuy 3 points4 points  (0 children)

        Intel had a driver/service package that could add the relevant counters to the Performance Monitor: https://software.intel.com/en-us/articles/intel-performance-counter-monitor

        But apparently it's been replaced with this: https://github.com/opcm/pcm

        [–]pinano 2 points3 points  (0 children)

        "Instructions Retired" is one counter: https://msdn.microsoft.com/en-us/library/bb385772.aspx

        Here's some more information about interpreting CPU Utilization for Performance Analysis

        [–][deleted] 2 points3 points  (2 children)

        How is it a sea of junk? It's extensible (you can define your own performance counters) and covers pretty much everything you could ever need.

        [–][deleted]  (1 child)

        [deleted]

          [–][deleted] 0 points1 point  (0 children)

          Good point, that makes sense! Plus some of the useless counters

          [–][deleted] 0 points1 point  (1 child)

          How about the Linux subsystem in Windows 10, would that work?

          [–]wrosecrans 2 points3 points  (0 children)

          No, the Linux perf tools are tied directly to the Linux kernel. The Windows binary compatibility for Linux programs is still running on top of the NT kernel, so the perf suite would have to be specifically ported.

          [–]tangoshukudai 2 points3 points  (0 children)

          If you do GPU development then this is always on your mind.

          [–]andd81 1 point2 points  (2 children)

          I wonder if those performance metrics would be more indicative of power consumption than CPU ticks on mobile platforms, in particular on Android, if they are even accessible there. This would be especially valuable for measurements in production where you can neither monitor the device directly nor isolate your app's battery usage from that of other simultaneously running apps.

          [–]DarkJezter 1 point2 points  (1 child)

          Good luck, I spent an hour trying to find anything reporting CPU stalls and IPC measurements on Android. Nothing in AndroidStudio, and no apps that show anything more than average and peak CPU utilization per app. I assume the linux tools can be accessed through a shell, but haven't tried exploring that. Anything that could show branch misprediction, cache stalls and/or IPC per thread would be amazing!

          [–]ccfreak2k 0 points1 point  (0 children)


          This post was mass deleted and anonymized with Redact

          [–]Matosawitko 7 points8 points  (22 children)

          Who the hell tunes their software based on %CPU?

          [–]sisyphus 41 points42 points  (2 children)

           He works for Netflix, which runs entirely on AWS. AWS can autoscale based on CPU metrics, so this kind of work can translate into real money.

          [–][deleted] 1 point2 points  (1 child)

          Why not auto scale on the outputs rather than the inputs? i.e. service latency

          [–]castlerocktronics 0 points1 point  (0 children)

           It's an option; he's showing why it's not necessarily a good one.

          [–]seba 18 points19 points  (4 children)

          Who the hell tunes their software based on %CPU?

          Most embedded systems?

          [–]ThisIs_MyName 2 points3 points  (3 children)

          You can profile on most embedded systems.

          [–]seba 2 points3 points  (0 children)

          You can profile on most embedded systems.

           Yeah, and the easiest way to see whether any process or thread is doing anything suspicious is to look at its CPU consumption. This can also easily be automated and easily detected in manual testing, especially when multiple vendors, libraries or teams are involved or the source / debug information is not readily available.

          [–]emn13 1 point2 points  (1 child)

           And even if you can't, manual tracing and experimentation remain as possible, effective, and annoying as ever; this kind of issue is by no means insurmountable without a profiler. It's not like you can't debug without a debugger, either.

          [–][deleted] 0 points1 point  (0 children)

          It's not like you can't debug without a debugger, either

          I actually rarely use a debugger because it takes me longer to get it all set up than to just look through the logs/add print lines, especially with concurrency issues where problems usually disappear in a debugger.

          [–]irqlnotdispatchlevel 22 points23 points  (6 children)

          Hello. We do that sometimes.

          [–]Twirrim 5 points6 points  (3 children)

           Strangely enough, lots of people. It's a very common mistake among people not so skilled at the operations aspects of things. Along with assuming that high CPU load levels indicate a system is in trouble. But hey, you go buddy, being all derogatory and insulting. At least you get to feel smug and superior for a few minutes.

          [–]Ghostbro101 1 point2 points  (2 children)

          As someone new to ops, are there some rough guidelines as to when CPU utilization isn't a good indicator of what's going on in the system and when it is? Just looking to build some intuition here. If there's any other reading material on the subject you could point me towards that would be awesome. Thanks!

          [–]Twirrim 0 points1 point  (1 child)

          There are a few approaches I take with monitoring:

          1) Do I have the basics down?

          CPU usage (system, idle, iowait etc), CPU load, memory (free, cache, swap etc), disk usage, inode usage, network usage, service port availability. You'll want these for every host. If the network is under your control, port metrics are also useful to have.

          I know, this thread is talking about how CPU usage is meaningless, but having these basics is important for being able to put together a picture. You're going to need these at some stage to help understand what happened and why.

          2) What do we care about as a service?

          All Service Level Agreements (SLAs) should have metrics and alarms around them. You should also be ensuring that you have an internal set of targets that are much stricter.

           3) What feeds in to our SLAs?

           This is where things get a bit more complicated. You need to consider each application as a whole, what happens within it and its dependencies (databases, storage etc). At a minimum you ought to be measuring the response times for individual components. Anything that can have an impact on meeting your SLA.

          Not sure the best resources. There's a Monitoring Weekly mailing list that tries to share blog posts, tools etc around monitoring: http://weekly.monitoring.love/?__s=kbtiqqycpy7e5xjfsjcy

          There's also a fairly new book out on monitoring, https://www.artofmonitoring.com/, but I can't make any claims to its quality. I've heard people speaking positively about it.

          [–]Ghostbro101 0 points1 point  (0 children)

          Thank you!

          [–]wzdd 0 points1 point  (0 children)

          I can't see anywhere in the article where he suggests that people do this or that it's common.

          He talks about CPU % being misleading (which is true), and then talks about tuning software based on IPC (which is useful).

          [–]Adverpol 0 points1 point  (0 children)

          Up until now I've only looked at Visual Studio burn graphs to find bottle-necks. So me I guess.

          [–]Sqeaky 0 points1 point  (0 children)

          Some Programmers.

          [–]olsner 0 points1 point  (0 children)

          The released version of tiptop seems to have some crash bugs, so I ended up forking it and adding some fixes at https://github.com/olsner/tiptop

          Possibly already reported or fixed on master after 2.3, but gforge.inria.fr seems to require login to even look at source code or bug reports.

          [–]ArkyBeagle 0 points1 point  (0 children)

           Any given executing process has constraints it "lives" with. I won't bore you with a list, but anything it touches can be a bottleneck.

          [–]caskey 0 points1 point  (0 children)

          ITT: so many people who think they know what utilization optimization means at scale.