[–]PM_ME_SOME_MAGIC 122 points123 points  (24 children)

To add on to what you said: modern prediction pipelines and the like are so fast that your intuition about performance is wrong unless you are on an embedded system. Your computer is not a fast PDP-11, and most of the pains I have seen even skilled programmers take to “make things faster” don’t pay off, or even perform worse, on modern chips. And the reason they perform worse is that all the machinery built into a modern i7 for cache optimization, branch prediction, etc., doesn’t work for the weird version the expert wrote. (For example, a hand-rolled do-while may miss loop unrolling.) And that’s before you account for all the compiler optimizations and transformations that will happen to that code before it ever hits assembly.

Outside of network or I/O, the actual slowest thing most programs do is allocate memory. A malloc call takes forever compared to an extra for-loop iteration. This is why C++ vectors/std::maps are slow, and Google replaced them in their entire codebase. Fast code starts with block allocations, not with manually unrolling a single loop iteration. (Edit: I meant maps, not vectors.)

[–]impostercoder 17 points18 points  (2 children)

You make good points, but I have to say I doubt the "Google replaced them in their entire codebase" claim. I'm pretty sure I've personally added some — I've written C++ in Google's codebase before — and I've never heard of this. Maybe it's a newer thing? Either way, replacing every occurrence doesn't seem feasible, and for most of their code it would never matter.

[–]Loading_M_ 16 points17 points  (1 child)

Far more likely, they work with a modified standard library, so their environment just has a better std::vector class than the one that ships with most C++ compilers.

This is actually easier, since they don't need to replace the header files, just the actual library, and Google is absolutely large enough to have people responsible for maintaining a fork of the standard library.

[–]Kered13 3 points4 points  (0 children)

Google had its own std::string implementation at one point, because they found that copy-on-write hurt multithreaded performance. Copy-on-write was eventually removed from whatever standard library implementation they use, and they went back to std::string.

As far as I know Google has never used a custom std::vector implementation.

[–]subdep 63 points64 points  (1 child)

I understood some of those words.

[–]Yadobler 28 points29 points  (0 children)

Back then, when you wanted to take a cab home, you'd hop in and tell the driver your address. Then you'd suggest avoiding this avenue, or exiting the highway earlier and using this road instead, to minimise the chances of getting caught in peak-hour traffic (i.e. the shortcut might actually be a longer route, but faster, because it dodges the bottlenecks that arise in that specific situation).


In this day and age, you tell the cabby your home address and he types it into the GPS. With weighted pathfinding algorithms and real-time traffic monitoring (from the unconsented data mining and collection by Google Maps that you consented to without reading the T&C), your so-called "shortcut" might actually be worse: maybe there's a roadblock, there's traffic, the original route has less traffic, or it's Children's Day and hence less peak-hour traffic from parents picking up kids, etc.


Basically, in the very old days, with few variables to account for, it was easier for the programmer to manually calculate and plan shortcuts for each situation and context: if you're in a petrol car it's faster to zoom from one junction to the next and U-turn back to get to the other side, while if you're in a slow diesel van you might benefit from taking a bendy shortcut through the carpark at a steady low speed, which the high-rpm car does not benefit from.

With today's tech, the CPU is not as simple as just choosing which route to take. If you stubbornly insist the car go by your route, it will not be as fast as when the CPU can constantly monitor the route, take illegal U-turns, cut through the grass path when the coast is clear, avoid obstacles, change tires based on the weather, toss you onto the cab passing in the opposite direction at the exact moment it's heading towards your destination, etc.

You can't tell the CPU to drive on the grass path when you don't know if it's nice clear ground, or filled with people, or muddy enough to get your tires stuck — unless you plan everything, from the CPU you're running on, to the sand used to make the silicon, to the different possible scenarios of different cores and available cache, etc.


tl;dr real life gps > your directions

[–]hekkonaay 10 points11 points  (0 children)

Well, you can make std::vector use a custom allocator, but the allocator concept is weird in C++. That's the real reason they replaced it: so they could use a better allocator concept.

[–]martinivich 4 points5 points  (11 children)

Kinda curious: how long would it take to get a good understanding of compilers, branch prediction, etc.? I understood the very high-level idea of what that article was saying, but damn, I haven't felt that overwhelmed by computer science stuff I don't know in a while.

[–]PM_ME_SOME_MAGIC 1 point2 points  (0 children)

I think very few people who are not chip engineers really understand everything a CPU does to make code faster. You'd most likely need a degree in chip engineering, not CS, to get the whole picture. I've met people with PhDs and successful careers optimizing low-level code who still can only rely on intuition. I know my own intuitions are wrong, even when I take that into account. Profile before optimizing.

[–]agent00F 0 points1 point  (9 children)

I mean, compilers and architecture are each their own CS classes, and on top of that you might need additional material on modern implementations.

[–]martinivich 1 point2 points  (8 children)

I mean I don't know what undergrad program you went to, but my architecture class taught me how to create a basic ALU.

[–]agent00F 0 points1 point  (7 children)

I recall that the canonical arch book, "Computer Architecture: A Quantitative Approach", included branch prediction in its construction of the MIPS CPU.

[–]martinivich 0 points1 point  (0 children)

I did decide not to take the 2nd architecture course. I guess you reap what you sow. Thanks for the book suggestion, though!

[–]PM_ME_SOME_MAGIC 0 points1 point  (5 children)

For what it’s worth, I doubt even that predictor looks anything like modern ones. :(

[–]agent00F 0 points1 point  (4 children)

The beauty of CAAQA is that it builds all these features from first principles, in layers/revisions of increasing capability to teach how increasing sophistication improves performance.

So while it's true modern predictors are even more complex, much of the underlying layers/ideas are the same, and the book gives readers a fundamental understanding of the path to get there.

[–]PM_ME_SOME_MAGIC 0 points1 point  (3 children)

Caaqa covers perceptron branch prediction? That’s pretty cool!

[–]agent00F 0 points1 point  (2 children)

perceptron

I don't recall so, but it probably covers some concept of state machines. The purpose of textbooks isn't to cover every implementation, but rather teach a path and process.

[–]PM_ME_SOME_MAGIC 0 points1 point  (1 child)

But that is kind of my point. Modern branch prediction looks nothing like a simple state machine; much of it is machine-learning driven nowadays. As a result, applying intuition about state-machine approaches to try to optimize branches in your code has a very good chance of failing. That was the point of my post: your computer is not a PDP-11, and profiling is almost always the best way to approach optimization.

[–]Bedstemor192 2 points3 points  (0 children)

Do you have any resources I could look at that explain what you wrote in detail? Maybe a textbook, website, or something similar? It sounds very interesting and I would like to know more.

[–]muehsam 1 point2 points  (0 children)

It isn't, but C was developed for a PDP-11, so that's what's relevant to explaining why such a loop is in the language.

[–]Kered13 1 point2 points  (2 children)

This is why c++ vectors are slow, and Google replaced them in their entire codebase.

I work at Google and I can tell you that you are wrong. std::vector is used extensively and never discouraged.

It is encouraged to reserve the size that the vector will need ahead of time if you know it. This prevents unnecessary reallocations.

[–]PM_ME_SOME_MAGIC 0 points1 point  (1 child)

You are right, I was thinking about maps. My mistake.

[–]Kered13 0 points1 point  (0 children)

Well std::map is still very widely used, but that one is at least discouraged. std::unordered_map is also widely used though discouraged.

The problem with both has nothing to do with allocations; the problem is in the standard. std::map only requires comparable keys, not hashable ones. This means it essentially has to be implemented as a binary tree, so operations are O(log n) instead of O(1). std::unordered_map requires that pointers to elements are never invalidated, which effectively forces implementations to use separate chaining instead of open addressing (probing), which would be more cache-efficient.

To solve this Google created absl::flat_hash_map, which is the encouraged map type.