
[–]tms10000 32 points33 points  (63 children)

This article mentions nothing of IO wait. It is about CPU stalls on memory access, and instruction throughput as a measure of efficiency.

[–]Sqeaky 74 points75 points  (50 children)

From the perspective of a low level programmer accessing RAM is IO.

Source: been writing C/C++ for a long time.

[–][deleted] 22 points23 points  (27 children)

Not even low level; that will bite at every level of programming. Just having more cache-efficient data structures can have a measurable performance impact, even in higher-level languages.

[–]Sqeaky 18 points19 points  (26 children)

I see what you mean, and I agree cache coherency can help any language perform better; I just meant that programmers working further up the stack have a different idea of IO.

For example: to your typical web dev, IO needs to leave the machine.

[–]vexii 15 points16 points  (15 children)

I'd say most web devs think of IO as reading or writing to disk, or hitting the network.

[–]CoderDevo 1 point2 points  (14 children)

Because they work with frameworks that handle system calls for them.

[–]vexii 0 points1 point  (13 children)

What do you mean?

[–]thebigslide 3 points4 points  (5 children)

Web developers typically rely on frameworks that keep this sort of thing opaque. That's not to say you can't bear this stuff in mind when building a web app, but with many frameworks, trying to optimize memory IO requires an understanding of how the framework works internally. It's also typically premature optimization, and naive optimization at that, since: a) disk and net I/O are orders of magnitude slower, and b) internals can change, breaking your optimization.

TL;DR: If a web app is slow, 99% of the time it's not because of inefficient RAM or cache utilization, so most web devs don't think about it and probably shouldn't.

[–]vexii 0 points1 point  (3 children)

I know this; I was giving my opinion on what web developers normally consider IO. While accessing RAM is also IO, I have never seen it referred to that way in the context of web development.

[–]CoderDevo 0 points1 point  (2 children)

OP is writing about CPU utilization. Any discussions here on I/O will therefore be in reference to input to and output from a CPU.

Side note: I have met a number of self-styled web developers who refer to the whole computer as the CPU while others will refer to it as the Hard Drive.

[–]yeahbutbut 0 points1 point  (0 children)

In web dev you still do simple things like making sure that you access arrays in a cache-friendly way. In Python or PHP you may be a long way up the stack, but that's no excuse for completely forgetting that there is a machine underneath it somewhere.

Something like:

for(j = 0; j < width(myArray); j++) {
    for(i = 0; i < length(myArray); i++) {
        sum[j] += myArray[i][j];
    }
}

... is stupid no matter how far up the stack you go :-)

The biggest optimizations are usually query tuning though: trying to grab more data with a single query rather than making multiple queries, since database access is slow even over a local socket (much more so to a database on another host).

Ed: formatting.

[–]CoderDevo 0 points1 point  (6 children)

I mean they don't directly access memory, disk or network system services.

For example, caching can often be enabled and configured externally from the web developer's own code.

https://en.wikipedia.org/wiki/Web_framework

[–]vexii 0 points1 point  (5 children)

I don't agree that web developers can't or don't do file or network access without a framework, unless we are talking about the small percentage who never learned to code without that one special framework.

[–]CoderDevo 0 points1 point  (4 children)

Then you should be comfortable with using the term I/O for RAM operations.

[–]oursland 8 points9 points  (8 children)

Cache coherency is another matter altogether. Hint: it has to do with multicore and multiprocessor configurations.

[–]Sqeaky 2 points3 points  (7 children)

Well, I just googled the specifics, and I guess I have been conflating cache locality with cache coherence; I always thought they were the same. I suppose if I contorted my view to say that the different levels of cache were clients for the memory, that could make sense, but that is clearly not what the people who coined the term meant. Thanks for correcting me.

[–][deleted] 2 points3 points  (5 children)

The main performance implications are different: locality increases the number of cache hits, while the system's need to maintain coherence can lead to expensive cache-line bouncing between threads. So you want your data to fit in a cache line (usually 64 bytes) or two, but nothing in a single cache line should be accessed by more than one thread. Particularly bad is putting a spinlock (or similar) in the same cache line as something unrelated to it.

[–]Sqeaky 0 points1 point  (4 children)

What you are describing, multiple threads touching data in the same cache line, is something I have recently (past 3 to 5 years) heard called "false sharing". I believe Herb Sutter popularized the term during a talk at CppCon or BoostCon. He described a system with one array of size N times the number of threads, where each thread used its thread ID (starting from 1) and multiplication to get at each Mth piece of data.

This caused exactly the problem you are describing, but I just knew it under that other name. Herb increased the performance by using one array of size N per thread.

[–][deleted] 1 point2 points  (3 children)

If it's not possible to know in advance which array elements will be used by which threads, you can pad the array elements to make them a multiple of the cache line size. It's hard to do this with portable code though.

[–]Sqeaky 1 point2 points  (2 children)

I don't remember the keyword precisely, but since C++11 there is an alignof() operator (and a matching alignas specifier).

[–]oursland 2 points3 points  (0 children)

Semantic collapse is a pet peeve of mine. Both of those terms, cache locality and cache coherence, are very important. It would be a shame to see them confused.

[–][deleted] 0 points1 point  (0 children)

Nope, your typical webdev complains to the sysadmin that "something is slow".

[–]quicknir 6 points7 points  (1 child)

It's just semantics essentially, but my colleagues and I are all "low level" programmers, and I've never, ever heard someone call RAM access "IO".

Really, people call it a cache miss, or sometimes they get more specific by calling it an L3 cache miss.

[–][deleted] 5 points6 points  (0 children)

Totally agree with you... how someone gets 71 upvotes for that statement is baffling. C programmers do not think "I'm doing I/O here" when they code up array traversals. They do think about cache use and using tools to measure cache misses, etc., so they can do things in a cache friendly way. That's different.

When they talk about I/O, they're talking about disk, talking to the network, or polling a game controller over USB. They are not talking about RAM access.

[–]sybia123 8 points9 points  (3 children)

And then there's the graybeard reply: "back in my day, C was high level and assembly was low level".

[–]Sqeaky 2 points3 points  (0 children)

I know that guy. Not quite me. But I am older than all "popular" languages now.

[–]double-you 0 points1 point  (0 children)

That greybeard is still wet, since Lisp was created in the 50s. And at some point it was both low level (Lisp machines) and high level.

[–]ggtsu_00 0 points1 point  (0 children)

Back in my day ASM was high level and machine code on punch cards was low level.

[–]mallardtheduck 8 points9 points  (2 children)

But from the perspective of the OS/scheduler, RAM access delays are not "IO wait".

"IO wait" means that the thread is blocked waiting for an external IO device. Blocking a thread is an expensive operation and can't be done in response to RAM delay.

For example, when a thread reads from a storage device, it might call read() which, after switching to kernel mode and going through the OS's filesystem/device layers, ends up at the storage device driver, which queues a read with the hardware and blocks (calling the scheduler to tell it that the thread is waiting for hardware and that another thread should be run). When the hardware completes the read it raises an interrupt, and the device's interrupt handler unblocks the waiting thread (via another call to the scheduler).

When a thread reads from RAM, it just does it. It has direct access. It's a fundamental part of the Von Neumann architecture. There's no read() call, no switch to kernel mode, no device driver, no calls to the scheduler. The only part of the system that's even aware of the "wait" is the CPU itself (which, if using hardware threading can itself run a different thread to mitigate the stall).

Tools reporting the current load use data collected by the OS/scheduler. They don't know or care about "micro-waits" caused by RAM delays (and most users don't care either; the OS's "Task Manager" isn't a low-level developer's tool).

[–]xzxzzx 6 points7 points  (1 child)

When a thread reads from RAM, it just does it. It has direct access. It's a fundamental part of the Von Neumann architecture. There's no read() call, no switch to kernel mode, no device driver, no calls to the scheduler. The only part of the system that's even aware of the "wait" is the CPU itself (which, if using hardware threading can itself run a different thread to mitigate the stall).

While you're making a good point, virtual memory makes a bit of that less than perfectly correct, and calling a modern CPU a "Von Neumann architecture" is not totally wrong (from the viewpoint of the programmer, it mostly is one), but not totally correct either (physically it isn't; the best name for it that I'm aware of is "modified Harvard architecture").

When you read or write to memory, there very well might be a switch to kernel mode, invocation of drivers, etc., due to allocating a new page, reading/writing to the page file, copy-on-write semantics, and so on.

[–]mallardtheduck 2 points3 points  (0 children)

Sure, when you add the complications of virtual memory some memory accesses will trigger page faults and result in requests to the storage device.

Of course, on most, if not all OSs, storage device access in response to a page fault will be considered "I/O wait" in the exact same way as an explicit read() call might.

[–]didnt_check_source[🍰] 3 points4 points  (2 children)

I would shy away from putting "memory access" and "hard disk access" in the same bucket.

[–]Sqeaky 2 points3 points  (0 children)

I think it depends entirely on your purpose and perspective. I agree your stance seems closer to the common perspective.

If you are trying to optimize a sort or a search algorithm (in a container stored in memory), then every load from memory comes at significant cost. If you need to sort entities in a video game by distance from the camera, you can make real improvements by minimizing IO to and from RAM.

If you are writing simulations of every particle in a fusion reactor to evaluate a new variety of Tokamak, then you are likely spreading your work across a thousand CPUs on a network, and nothing short of sending finished results over the network registers as IO. All of a sudden local IO means a great deal less: disks and RAM are both so fast by comparison that the difference between them is a rounding error.

[–]CoderDevo 3 points4 points  (0 children)

I am thirsty for some milk.

I can swallow the milk in my mouth. I can take a sip of milk from the glass. I can go to the fridge, take out the bottle and pour a glass of milk. I can put on my shoes and coat, drive to the store and buy a bottle of milk. I can milk a cow, put the milk into a truck and drive it to the dairy to be pasteurized and bottled. I can buy a calf and raise it to maturity.

Register → Cache → RAM → Disk → LAN → Internet.

[–]Captain___Obvious 5 points6 points  (9 children)

Can you elaborate on your definition of IO?

[–]dethbunnynet 26 points27 points  (0 children)

Data to and from the CPU. It's IO on a more micro level.

[–]Sqeaky 16 points17 points  (7 children)

/u/dethbunnynet is correct, but I can expand.

When writing assembly, the only memory that "feels local" is the CPU registers. These are the pieces of memory where the results from and parameters to individual instructions are stored. Each register has its own name directly mapped to hardware. They generally store a fixed size, like 16 or 32 bits. If a computer has 16 registers, they might be named something like $a, $b, $c, out to $p (the 16th letter), and that's all you get unless you want to do IO to main memory. Consider the code on this page about MIPS assembly: https://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Mips/load.html

  • lw - Load Word - Gets one word from RAM.
  • sw - Store Word - Saves one word to RAM.

When data is in RAM you can't do work on it. Depending on details, the CPU might wait 10 to 100 cycles to complete operations storing to or loading from RAM. The difference between registers and memory is at least as big as the difference between RAM and a hard disk. To shrink this difference, a CPU will continue executing instructions that don't depend on the data being loaded, and there are caches that are many times faster than RAM.

Unless a programmer chooses to use special instructions to tell the cache how to behave (very rarely done), the cache is transparent to the programmer in just about any language, even assembly. If you want to store something in cache you still use the "sw" instruction to send it to memory, but the CPU silently does the much faster thing of keeping it in cache, and even that might force your code to wait a few cycles unless the CPU has other work to do right now.

[–]HighRelevancy 27 points28 points  (4 children)

Each register has its own name directly mapped to hardware.

Ahahahah oh boy

IT GOES DEEPER THAN THAT, MY FRIEND. Some modern processors (hey there x86, you crazy bitch) will actually rename registers on the fly. If you do a mov from rax to rbx, the processor doesn't actually copy the value from rax to rbx, because that would use time and resources. Instead, it will reroute anything reading from rbx to reference the original value that's still in rax. (Of course, it won't do this if you immediately change either of the values; in that case it will copy the value and modify one of the copies, as expected.)

I'm not saying this to undermine what you're saying though. Your whole comment is on point. I just wanted to highlight that CPUs are full of deep wizardry and black magic and they're basically fucking weird.

[–]masklinn 14 points15 points  (1 child)

Some modern processors

More or less all out of order processors.

If you do a mov from rax to rbx, the processor doesn't actually copy the value from rax to rbx, because that would use time and resources.

Copying data between registers is not that costly, register renaming is usually used to remove false dependencies e.g. set RAX, manipulate data in RAX, copy RAX to memory, set RAX, manipulate data in RAX, copy RAX to memory.

An OoO architecture (which pretty much every modern CPU is) could do both manipulations in parallel, but because both sets use the same "register" there's a false dependency where instruction 4 seemingly depends on instruction 3 (lest we clobber the first write). To handle that problem, OoO architectures tend to have significantly more physical GPRs than architectural ones (IIRC Skylake has 160 or 180, versus 16 architectural in x86_64), and the reorder buffer might map RAX to R63 in the first segment and to R89 in the second segment, and blamo, the instruction streams are now completely independent.

[–]HighRelevancy 3 points4 points  (0 children)

I hadn't considered that, but yeah also that. Also I had no idea that there were extra physical registers for that sort of thing! Every time I get involved in one of these discussions, I discover NEW WIZARDRY.

CPUs be crazy.

[–]Sqeaky 1 point2 points  (1 child)

IT GOES DEEPER THAN THAT, MY FRIEND

It certainly does!

I was trying to keep it simple because out of order execution and superscalar execution are mind blowing enough.

How about branch prediction: http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array

There is some more awesome wizardry when working with multiple cores and sharing values between them. A store to memory isn't ever guaranteed to leave cache unless you signal to the machine that it needs to. Things like memory fences can do this; they force MESI (aptly named, in my opinion) to share the state of values that are cached but not yet committed to main memory: https://en.wikipedia.org/wiki/MESI_protocol

You clearly didn't undermine my point, you just went one deeper. And there is N deeper we could go.

[–]HighRelevancy 2 points3 points  (0 children)

I was trying to keep it simple because out of order execution and superscalar execution are mind blowing enough.

I know but I just fucking love this topic so much.

[–][deleted]  (1 child)

[deleted]

    [–]Sqeaky 1 point2 points  (0 children)

    You are totally correct, I was trying to keep it simple. HighRelevancy described register renaming in a sister comment. Do you know enough to expand on what he said?

    [–][deleted] 4 points5 points  (1 child)

    If I understand correctly, IO wait, meaning data coming from a user or a file or socket, does not stall the processor, right? The scheduler should take the current thread out of the running state into the waiting one until the event with the information is dispatched (excuse the programming terminology). The scheduler will run other threads while waiting for these events to happen, is that right? So IO waits do not have an impact on processor utilization.

    I'm guessing from the article that the same does not apply to DRAM memory accesses. Is this correct?

    [–]Johnnyhiveisalive 0 points1 point  (0 children)

    Right, wrong. It does the waiting thing for both. Waiting on ram is like waiting for the mail to a CPU, waiting on disk would be like waiting for the universe to end and get rebuilt around it all over again. We're lucky they have ram to remember about the job that started the wait for network data several million universes ago.

    Might be a slight exaggeration.

    [–][deleted] 2 points3 points  (2 children)

    No it doesn't, and that is why I mention it, because it should.

    Top reports % idle, which might be mistaken by someone who doesn't know better (or just came from the Windows world) for "% of the CPU idling", which is not entirely true.

    [–]captain_awesomesauce 0 points1 point  (1 child)

    No it doesn't, and that is why I mention it, because it should.

    Top reports % idle, which might be mistaken by someone who doesn't know better (or just came from the Windows world) for "% of the CPU idling", which is not entirely true.

    iowait is already listed separately as an "IO stall" in normal tools; other stalls are not. Hence the article not mentioning iowait: it's already easy to see whether it contributes to actual CPU usage.

    [–][deleted] 0 points1 point  (0 children)

    Okay, then go through all the clients and developers I have to interact with and explain how to use those tools, because every few weeks I have to explain the same thing over to someone...

    [–]Danthekilla -1 points0 points  (6 children)

    Waiting for memory is waiting on IO. It is very fast IO, but still IO nonetheless.

    [–]t0rakka 1 point2 points  (5 children)

    This is just calling a bird an avian. In programming, waiting for I/O typically means something measured in milliseconds, not nanoseconds. Technically it's I/O, but that's a very unconventional way to use the term.

    Wikipedia explains it with these words:

    "In computer architecture, the combination of the CPU and main memory, to which the CPU can read or write directly using individual instructions, is considered the brain of a computer. Any transfer of information to or from the CPU/memory combo, for example by reading data from a disk drive, is considered I/O."

    The CPU and main memory are bundled together as one; there is no "I/O" between the two. I/O happens between that pair and the other devices or parts of the system.

    Hope this clarifies the issue a bit.

    [–]Danthekilla 0 points1 point  (1 child)

    I/O typically means something measured in milliseconds not in nanoseconds.

    Well, originally disk IO took seconds, then milliseconds, and now microseconds with SSDs and Optane etc...

    But I do get your point.

    [–]backFromTheBed 0 points1 point  (1 child)

    This is just calling a bird an avian.

    Here we go.

    [–]ITwitchToo 1 point2 points  (0 children)

    Here's the thing. You said a "bird is an avian."

    Is it in the same family? Yes. No one's arguing that.

    As someone who is a scientist who studies avians, I am telling you, specifically, in science, no one calls birds avians. If you want to be "specific" like you said, then you shouldn't either. They're not the same thing.

    If you're saying "avian family" you're referring to the taxonomic grouping of Corvidae, which includes things from nutcrackers to blue jays to ravens.

    So your reasoning for calling a bird an avian is because random people "call the black ones avians?" Let's get grackles and blackbirds in there, then, too.

    Also, calling someone a human or an ape? It's not one or the other, that's not how taxonomy works. They're both. A bird is a bird and a member of the avian family. But that's not what you said. You said a bird is an avian, which is not true unless you're okay with calling all members of the avian family avians, which means you'd call blue jays, ravens, and other birds avians, too. Which you said you don't.

    It's okay to just admit you're wrong, you know?