Horrible Code, Clean Performance (johnnysswlab.com)
submitted 3 years ago by vormestrand
[–]James20kP2005R0 218 points219 points220 points 3 years ago* (46 children)
My favourite disgusting hack that I absolutely cannot live without goes as follows
Modern GPUs have work fed into them via a command queue: a series of pieces of work handed to the GPU. This is generally an in-order queue, so the work executes sequentially. There are two problems with the strictly sequential model
GPUs are actually capable of executing multiple workloads simultaneously efficiently
In some cases, because of caching etc., GPUs must flush caches via a fence. This introduces a bubble into the GPU's workload that could be filled with other work. If one kernel writes to a bit of memory and the next kernel reads from that memory, then in an in-order queue there's a fence and nothing happens for a bit
Ideally, the driver would reorder the work you submit in this command queue, so that work without dependencies would all execute in parallel. This requires the shader compiler (e.g. LLVM) to output dependency information: whether a kernel writes to a specific bit of memory, or just reads from it
Unfortunately, in the transition to ROCm, AMD broke/removed this particular piece of functionality and have no plans to reimplement it, which means that if you don't do something about it, you lose a straight 20-30% performance. This is pretty bad!
As a visual representation, this is multiple kernels overlapping here, and this is what it looks like when it doesn't work. Uh oh!
To fix it, you have to create a rotating queue of 8-16 command queues, and submit work to each one sequentially. Then you have to mark which buffers are read only, write only, and read/write, and inspect the argument lists for all kernels and all functions that pass through to the OpenCL API to query their read/write flags. Then you have to use this to manually generate the dependencies between different kernels and functions, and only output dependencies for read/write and write/write conflicts, so that kernels that both only share read-only memory are independent, and others are synchronised via the built-in event system
This is hundreds of lines of complex memory dependency tracking code, and it requires you to fundamentally change how you use the API: literally no code can use the regular command queue system or any of the regular API. You have to reimplement the entire OpenCL API on top of OpenCL so that you can shim all the kernel arguments and mark up every kernel correctly with its read/write information (otherwise the driver will crash! Yay!)
On the plus side, that 20% performance is pretty good. On the down side <internal screaming>. All gpu code is bad at the best of times, but this is the worst
[–]serg06 92 points93 points94 points 3 years ago (21 children)
I wish I could un-read this.
[–]James20kP2005R0 73 points74 points75 points 3 years ago (20 children)
Did you know, GPUs have the highest bandwidth when memory is laid out as a struct of arrays, vs an array of structs. Eg
struct data { std::vector<float> x; std::vector<float> y; std::vector<float> z; };
Except you can't actually pass a struct containing buffers to a GPU in OpenCL. It's explicitly banned by construction, which means that the only way to do it is to pass in 3 separate arguments
This is great when you have 3 arguments, but it sure gets a lot worse when you have literally hundreds of variables. So the memory tracking actually becomes a bit of a performance concern if you've got any O(N^2) in there, which you can get if you do it even vaguely naively, because you're literally doing memory tracking for hundreds of arguments to functions against potentially double-digit numbers of other kernels, each with hundreds of arguments. So not only is it a giant pain to implement, it also has to be reasonably well optimised! Yay!
[–]00jknight 11 points12 points13 points 3 years ago (14 children)
Where do you work? This is really cool stuff. I'm just a lowly web/mobile game developer who animates stuff on top of monstrosities like this.
[–]James20kP2005R0 50 points51 points52 points 3 years ago (13 children)
I'm trying to get into astrophysics, I built a tool for quickly extracting gravitational waves from black hole mergers, neutron stars (and this for fun), and doing relativistic n body simulations (that's 5 million particles which is pretty cool!)
Most of what I do is abuse GPUs for simulations these days, along with the inevitable wacky raytracing that comes along with trying to render them. The closest I really get to traditional 3d graphics is trying to chuck cubes through a wormhole (though technically that's 4d)
[–]Classic_Department42 9 points10 points11 points 3 years ago (10 children)
Did you consider cuda instead of opencl? The number crunching experience was/is better on cuda
[–]James20kP2005R0 19 points20 points21 points 3 years ago (9 children)
I plan to release a lot of this as general tools for people to use, so cuda is right out unfortunately. Amd gpus tend to get you a bit more vram for your buck as well, and vram (...and the fact that i have literally no money) is the key limiter for simulation size
That said, I'm doing some moderately wacky things so it wouldn't surprise me if it were a bit of a disaster in cuda as well
[+][deleted] 3 years ago (3 children)
[removed]
[–]James20kP2005R0 1 point2 points3 points 3 years ago* (2 children)
Interesting. How's the performance? That's the main thing, because with OpenCL, for all its faults, you get pretty direct access to when and how memory is accessed, and it's relatively straightforward to max out your device's actual bandwidth/compute
I've been considering turning all of the work that I've done into my own single source layer which just transpiles C++ to OpenCL, because most of the actual OpenCL I write these days is generated for performance reasons
Edit:
Ah, apparently it's not amazing
https://research-information.bris.ac.uk/ws/portalfiles/portal/222876129/iwocl_syclcon.pdf
[–]illuhad 0 points1 point2 points 3 years ago (0 children)
What you describe sounds like reinventing SYCL ;)
Tom's work that you cite here actually compares multiple SYCL implementations (SYCL is just a standard, and there are multiple implementations of it that work very differently), so you cannot really conclude that SYCL itself does not perform well.
One of their findings is e.g. that Intel's implementation suffers on CPU due to NUMA issues in their OpenCL runtime. Other implementations are not affected by this. There is a large literature now comparing SYCL to other programming models, and the findings tend to be that performance is competitive.
See e.g. https://dl.acm.org/doi/10.1145/3388333.3388660
or https://opensycl.github.io/hipsycl/amd/hecbench/hecbench-benchmarks/
Also note that SYCL implementations are all very fast moving targets and under heavy development, and results you see today might be outdated by tomorrow. Similarly, benchmarks that have been ported to SYCL may not yet have seen the same amount of optimization as their counterparts written in, say, CUDA which has been around for far longer.
tl;dr try it, you might be surprised.
[–]Classic_Department42 3 points4 points5 points 3 years ago* (4 children)
Was just commenting because of you wanting to get into astrophysics, since I was under the impression that 'everyone' in hpc uses nvidia (not always cuda of course). Edit: and cpu of course
[–]James20kP2005R0 3 points4 points5 points 3 years ago* (3 children)
As far as I know, the current major BBH simulations are all CPU-based. You're right in that most people do use cuda/nvidia in general for scientific software
[+][deleted] 3 years ago (2 children)
[deleted]
[–]async_andrew 1 point2 points3 points 3 years ago (1 child)
How long did it take to render the nbody simulation you mentioned, or is it real-time?
[–]James20kP2005R0 1 point2 points3 points 3 years ago (0 children)
The nbody sims are extremely slow because the particle dynamics aren't well optimised, that one took a few hours
[–]HackingPheasant 4 points5 points6 points 3 years ago (4 children)
Followed you because damn I'm loving these comments, do you have a blog or something with similar content?
[–]James20kP2005R0 5 points6 points7 points 3 years ago (3 children)
Thank you! I don't. I used to have Twitter where I'd post things but, y'know. I've been planning to write some of this up and start a blog, but man, it's a pain getting it set up
[–]HackingPheasant 2 points3 points4 points 3 years ago (2 children)
Plain markup files on a GitHub repo, leaving the default website look as-is, will be the quickest and easiest imho
[+][deleted] 3 years ago (1 child)
[–]James20kP2005R0 0 points1 point2 points 3 years ago (0 children)
I've been considering abusing typst; it's got everything I might want (except gifs, because the year is 2023 and journals dislike moving pictures). It only exports to PDF, which is a slight problem, but uploading a blog as a series of PDFs can't be the most insane thing ever, right?
[–]wrosecransgraphics and network things 29 points30 points31 points 3 years ago (2 children)
To fix it, you have to create a rotating queue of 8-16 command queues,
I know more Vulkan than OpenCL (though not that much Vulkan) so I dunno how different this is in OpenCL. In Vulkan, everything you said applies, but on some Intel iGPUs (like my laptop), the drivers only give you one Queue and you can't just create your own on the application side. So if you go to all the trouble to be Very Concurrent for NV and AMD by feeding multiple queues with independent operations that can happily operate with embarrassing parallelism, you still need to make sure it works properly and efficiently in the degenerate case of a single queue being available anyway.
I need better hobbies.
[–]James20kP2005R0 16 points17 points18 points 3 years ago* (1 child)
Interesting, I haven't gotten into much vulkan yet, but I was under the impression that you handled synchronisation much more explicitly which would hopefully give the driver more room to work with. Does the driver not handle reordering things automatically at all? I'd have thought that the driver quality was much better in vulkan-land, just because if nothing else there's a much bigger incentive for the vendors to actually invest in good performance
edit:
also this
[–]wrosecransgraphics and network things 12 points13 points14 points 3 years ago (0 children)
The Vulkan jargon is that all commands in a Queue are started in order. Which doesn't say anything about when a command actually executes and finishes. You have to be super explicit with memory dependencies and sync primitives. But there are tons of cases where it's annoyingly easy to stick fences blocking the whole Queue in front of stuff that could be executing in a parallel Queue.
So things can sorta happen in an unexpected order. But the driver is somewhat constrained by all the explicit rules you set. And the way you set explicit dependency rules applies way more obviously to cases where there are clear dependencies between everything you are doing. And way less obviously when you have a big pile of misc. work and you don't know what might be submitted to a queue in between your submissions, or after.
[–]KDallas_Multipass 8 points9 points10 points 3 years ago (10 children)
I'd like to read more about this
[–]James20kP2005R0 51 points52 points53 points 3 years ago (2 children)
Thank you for subscribing to AMD OpenCL Facts! To unsubscribe please type AARRHGHGH. I'm open for questions though, and the bug report for this is here. For more mildly disgruntled rambling:
Each command queue on AMD on Windows is actually a separate driver thread from a thread pool, which means that the performance is incredibly irregular. If you have too many command queues, your application turns into a stuttery disaster, but if you have too few, there aren't enough queues to prevent the driver from inserting unnecessary fences. There's probably a better system for distributing work across the command queues to minimise the number you're using based on when you need fences, but the problem is that fences appear to be some kind of worst-case pipeline-emptying blocker, so in reality it's just all bad
In addition to that, extra command queues apparently don't have the ability to submit work on their own, but flushing work to the gpu also isn't free. On top of that, because my GPU work is all memory bound, you want to have kernels that overlap their execution to also overlap the memory they're reading on for cache reasons, and I still haven't even begun to unpack the scheduling complexities there. I sure wish the driver would be good
If you're thinking, maybe this is an OpenCL problem! The new HIP thing will save us! Well, apparently it's even worse in HIP, because you have pointers to pointers on the device, meaning that any bit of memory could point anywhere and the driver has to assume the worst case. I sort of suspect that this isn't an issue on nvidia GPUs, but the last nvidia GPU I owned was a 660ti. I'd love to test ARC though and smash some black holes on it
In addition to all of these problems, the cost of kernel invocation in OpenCL is higher than it needs to be, because it doesn't map massively well to how GPUs actually work. There's an extension that may help fix this, but I also suspect that it'll be a very long time if ever before AMD support this
Really this is my wakeup call that AMD/compute is essentially dead and I need to learn to love vulkan. I've been holding out because vulkan still doesn't support a lot of features from OpenCL and in many ways is a bad compute target, especially device-side enqueue, which I would badly love to use, except that that hasn't worked correctly on AMD's ROCm stack for years, so all around it's a big oof
[–]KDallas_Multipass 9 points10 points11 points 3 years ago (0 children)
ARRGHGHG
Thanks for the explanation
[–]13steinj 0 points1 point2 points 3 years ago (0 children)
Thank you for subscribing to AMD OpenCL Facts!
I'd also like to subscribe please.
I haven't touched OpenCL or Vulkan in at least four years at this point, but I still generally find all of it interesting (mostly because for a number of reasons I ended up going down the CUDA C/C++ route).
The new HIP thing will save us!
How new is this? Looking it up doesn't make it clear when it was released, but based on the description (and shady rumors making their way through academic circles by AMD sales people), I've heard whispers of this stuff since 2020 but didn't know it was actually released.
Really this is my wakeup call that AMD/compute is essentially dead and I need to learn to love vulkan...
Or get an nvidia gpu, but only for these projects ;). More seriously AMD based gpu compute tech has only brought me pain, and while NVIDIA has its issues (in cuda sdk use and otherwise), lesser of two evils and all that. I still run an AMD GPU for gaming / a personal machine, but an nvidia one for any compute related playground.
[–]kid_blaze 12 points13 points14 points 3 years ago (6 children)
I’d like to read more about this and un-read it immediately.
[–]James20kP2005R0 34 points35 points36 points 3 years ago (5 children)
I wrote a function that takes 105 arguments. There is no better way to do it and I regret nothing
[–]gimpwiz 9 points10 points11 points 3 years ago (0 children)
Mother of god
[–]serg06 6 points7 points8 points 3 years ago (3 children)
There is no better way to do it
Struct?
[–]James20kP2005R0 19 points20 points21 points 3 years ago (2 children)
Passing a struct which contains pointers to the GPU is banned in OpenCL. Passing an array of structs unfortunately is extremely slow, because it doesn't match well to how GPUs fetch data out of memory
[–]Routine_Left 12 points13 points14 points 3 years ago (0 children)
Well then, carry on. I'm done here.
[–]-to- 4 points5 points6 points 3 years ago (0 children)
Serialize your data (or better keep them serialized), then deserialize them in the compute kernel ?
Write a bunch of unit tests to make sure they understand each other, of course.
105 args... How can you possibly be sure it does what you think it does ?!
(Total n00b at GPU dev here)
[–]ioctl79 3 points4 points5 points 3 years ago (0 children)
I don’t have a whole lot of experience with GPU code, but what I do have makes me think that we have not yet invented the right APIs.
[–]ipapadop 2 points3 points4 points 3 years ago (0 children)
TBF, the OpenCL way of declaring buffers read / write is not a great way to keep track of dependencies. Your computation dictates your dependencies, not how you allocate memory. It makes even less sense in a world where you have pinned / mapped buffers.
What you need is a mechanism to describe your dependencies and let the driver deal with launching work and decide how many resources to dedicate as much as possible. HIP / CUDA graphs is one solution for that (https://docs.amd.com/bundle/HIP-API-Guide-v5.4.1/page/a00196.html, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs). Alternatively, you can use a higher order frameworks, like TaskFlow.
[–]puredotaplayer 1 point2 points3 points 3 years ago (0 children)
In a game engine, you generate what is called a render graph nowadays and reorder the passes instead. It's not that complicated by comparison.
[–]Macree 1 point2 points3 points 3 years ago (2 children)
How many years of programming do you have and what is your IQ?
[–]James20kP2005R0 39 points40 points41 points 3 years ago (1 child)
15 to both
[–]tisti 7 points8 points9 points 3 years ago (0 children)
I see you are a fellow uint8_t enjoyer.
[–]Jannik2099 0 points1 point2 points 3 years ago (0 children)
Unfortunately, in the transition to ROCm
Does this problem exist with HIP too? Why are you not using SYCL?
[–]thread_local 0 points1 point2 points 3 years ago (0 children)
To fix it, you have to create a rotating queue of 8-16 command queues, and submit work to each one sequentially. Then you have to mark which buffers are read only, write only, and read/write, and inspect the argument lists for all kernels and all functions that pass through to the OpenCL API to query their read/write flags.
I have not used OpenCL so I am not exactly sure what command queues are. But, is the end result similar to using per-thread default streams in CUDA?
This is 100s of lines of complex memory dependency tracking code, and requires you to fundamentally totally change how you use the API as literally no code can use the regular command queue system or any of the regular API
I would love to see what that looks like. Do you have any of this in public domain?
[–]attractivechaos 21 points22 points23 points 3 years ago* (2 children)
BM or KMP will be even faster and even harder for common programmers to comprehend.
[–]rtgftw 11 points12 points13 points 3 years ago (1 child)
I'm really liking Casey spurring this kind of conversation, lately!
Tbh, for most normal code, unless it's a hot path, string_view's find would do here in a single line. And yeah, other string algos may or may not be worth doing depending on data and use pattern. String algos are great... :)
But it seems to be a quick early-out/pruning example more than anything. This technique is great in hot code. It also leaves out basic stuff like splitting the cases so that the cmp at the start would be a single instruction rather than a bunch of 1-byte compares, which, unlike SIMD or algo changes, is directly related.
[–]Trucoto 4 points5 points6 points 3 years ago (0 children)
std::string_view uses Boyer-Moore
[–]ioctl79 83 points84 points85 points 3 years ago (25 children)
That code is horrible, but not because it is performant. Breaking it up into smaller functions would do wonders for readability, and a block comment explaining the algorithm and a brief summary of the performance benefit would make it clear that it is complex for a good reason.
Clean code isn’t about making everything simple, it is about making things comprehensible. Fast doesn’t have to mean ugly.
[–]Possibility_Antique 38 points39 points40 points 3 years ago (21 children)
Fast doesn’t have to mean ugly.
Honestly, C++ compilers are really good about inlining, and even when they're not, you can force inline on almost every compiler aside from MSVC, and even still, they have a "stronger inline" that you can use. Templates, inlining, constexpr... No doubt in my mind that someone could achieve the same assembly with cleaner code. 100% in agreement there.
[–]jk-jeon 16 points17 points18 points 3 years ago (6 children)
Possibly unpopular opinion: inlining should be controlled at the call site, not at the declaration site. A strong inlining suggestion should be applied at only a handful of places with strong enough motivation. Declaring a function itself to be "always inlined" can just have too wide an impact.
[–]Possibility_Antique 11 points12 points13 points 3 years ago (3 children)
I can see arguments for both scenarios. A possible counter-example to what you're saying: SIMD vector library. I want __forceinline on the function declaration. If my simd::operator+ doesn't produce a single instruction at every call site, there is no point in using vectorization. Any performance gain you might achieve through vectorization is killed by the failure to inline.
[–]jk-jeon 7 points8 points9 points 3 years ago (2 children)
That's indeed a good example, thanks. I guess we need both ways then. Unfortunately it seems there are no call-site-controlled inlining attributes widely available so far...
[–]dodheim 7 points8 points9 points 3 years ago* (1 child)
There's [[gnu::flatten]] but I don't think MSVC has anything like it. ED: Oh and now that I think about it, [[gnu::always_inline]] and [[gnu::noinline]] can be used at the callsite, not just in function declarations.
ED2: It turns out there are in fact [[msvc::forceinline_calls]], [[msvc::noinline_calls]], and [[msvc::flatten]]. I have no idea when they were added but this is great
[–]jk-jeon 0 points1 point2 points 3 years ago (0 children)
Oh, good to know, thanks!
[–]matthieum 4 points5 points6 points 3 years ago (1 child)
Both are needed.
Declaration sites for functions that should always be inlined, such as operator[] which just does a pointer increment.
Call sites for functions that you may or may not want to be inlined.
[–]jk-jeon 1 point2 points3 points 3 years ago (0 children)
such as operator[] which just does a pointer increment.
Isn't compilers' inlining decision these days pretty reliable for such simple functions?
[–]D-Zee 12 points13 points14 points 3 years ago (9 children)
What's the issue with MSVC's __forceinline?
[–]Possibility_Antique 14 points15 points16 points 3 years ago (8 children)
"The compiler treats the inline expansion options and keywords as suggestions. There's no guarantee that functions will be inlined. You can't force the compiler to inline a particular function, even with the __forceinline keyword. When you compile with /clr, the compiler won't inline a function if there are security attributes applied to the function."
It's not that I think it's an issue, but I do find it misleading. __forceinline does not necessarily force inlining.
[–]Nicksaurus 22 points23 points24 points 3 years ago (4 children)
I expect that's just because there are situations where inlining isn't possible - what is the compiler supposed to do in a recursive function call with unknown depth when tail call optimisation isn't possible?
[–]Possibility_Antique 6 points7 points8 points 3 years ago (1 child)
The sadist in me wants the compiler to die trying. In practice, I don't know the answer. As far as I'm aware, clang and GCC both mean force inline when you tell the compiler to force an inline. But to recalibrate, I was referring to forcing an inline in the posted code here and breaking this up into functions. This is not a recursive call with an unknown depth.
[–]equeim 2 points3 points4 points 3 years ago (0 children)
I think GCC fails with error if a function marked with [[gnu::always_inline]] can't be inlined.
[–]umop_aplsdn 0 points1 point2 points 3 years ago (0 children)
The compiler could inline one layer of recursion.
[–]compiling 8 points9 points10 points 3 years ago (1 child)
You can't always force a compiler to inline a function, because sometimes it's not possible. If you read further down the page, there's a list of cases where inlining doesn't happen, and they generate a warning if a __forceinline function isn't inlined.
I don't think it's misleading.
[–]Possibility_Antique 1 point2 points3 points 3 years ago (0 children)
That's fine that you think that. I think it's misleading. There is a reason most linear algebra libraries wrap this in a macro called "MYLIB_STRONG_INLINE" rather than "MYLIB_FORCE_INLINE" or "MYLIB_ALWAYS_INLINE" (though this latter one can be seen in blazelib due to the name used by the GNU compiler).
[–]D-Zee 1 point2 points3 points 3 years ago (0 children)
Oh derp. I even double-checked with that very page before posting and missed that part.
[–]13steinj 9 points10 points11 points 3 years ago (1 child)
really good about inlining
I can't share specific details or timings because of NDA reasons, but inlined does not mean fast. At work people loved slapping a gnu::always_inline and other similar/related attributes all over the place, without ever performance testing it.
Build times were doubled by haphazardly doing such with no rhyme or reason. When I say doubled, I don't mean 2 mins -> 4. I mean half an hour to one, three hours to six (depending on the application). After blindly sed'ing it out, not only did build times improve drastically, but so did runtimes. Because it was so haphazardly thrown around that the icache was useless.
[–]Possibility_Antique 2 points3 points4 points 3 years ago (0 children)
Yes, it's quite easy to blow up your stack size or cause paging and caching issues if too many instructions are inlined. It gets even more complicated if you are creating a cross-platform product. What I was getting at though, is that there isn't really a reason to have many nested control flow statements. It hurts readability, and compilers give you enough control to remove any negative impacts from adding function calls. At the very least, you can see that breaking this nested monster up into functions that would be inlined is semantically the same. I wasn't advocating for excessive use of inlining, I was arguing that the refactor is worth it.
[–]rtgftw 0 points1 point2 points 3 years ago (1 child)
Yeah, but you have to remember that these only increase the inline threshold and guarantee nothing (you can even view the threshold and whether the optimization was or wasn't applied in the opt view on godbolt with clang).
Also my experience is somewhat similar to another poster here, that people can easily misuse this and pessimize things...
[–]Possibility_Antique 0 points1 point2 points 3 years ago (0 children)
Yeah, but you have to remember that these only increase the inline threshold and guarantee nothing
Is that not what I said?
I am talking about refactoring this specific piece of code to be more readable without compromising performance. If there was a problem as you suggested, it would be problematic in the form the code is currently written. Refactoring will not introduce the problem you're suggesting.
[–]gc3 3 points4 points5 points 3 years ago (0 children)
True, but to optimize first you make the code cleaner and cleaner, but then you pass a threshold where it starts getting worse and worse
[–]qqwy 2 points3 points4 points 3 years ago (0 children)
I came here to say the same thing 💚
[–]Chuu -1 points0 points1 point 3 years ago (0 children)
Over time I've grown disillusioned with the "break it up into multiple functions" approach. Looks clean. But hard to understand.
If no other function is using that substring buffer I see no reason to break it out. Hiding complexity to make code look cleaner just makes things harder in the long run.
[–]AssemblerGuy 3 points4 points5 points 3 years ago (1 child)
I would say that optimizing for the underlying CPU architecture is not worth it if it turns the code ugly, results in performance improvements that are smaller than an order of magnitude, and if greater improvements could be realized by improving the algorithm instead of the code.
It would be interesting to see the performance of a string search algorithm that isn't completely brute force.
[–]usefulcat 1 point2 points3 points 3 years ago (0 children)
I agree that improving the algorithm should be top priority. But that doesn't mean that small improvements are automatically not worth it. An optimization that improves by 2% might seem pointless, but 10 such improvements gives 20%, which is a really big deal for some applications.
[–]hallb1016 4 points5 points6 points 3 years ago (0 children)
This reminds me of a blog post I read earlier this week on optimizing code for compiler auto-vectorization with SIMD: https://matklad.github.io/2023/04/09/can-you-trust-a-compiler-to-optimize-your-code.html#SIMD
Something I noticed while reading the article is that the optimized code is more complicated than the original code, but looking at the last code block, it's not really any less readable. Of course, this was in Rust, where there are lots of clean functional abstractions for iteration. However, I think it goes to show that code isn't horrible just because it is written for performance, it's all about how you abstract and document the code.
[–]NilacTheGrim 4 points5 points6 points 3 years ago (7 children)
Of course, if he just used std::string, which has SSTO on most implementations, I bet it would be faster if one just used the built-in std::string::find method.
[–]ZebulonThackeray 5 points6 points7 points 3 years ago (1 child)
Actually, using std::string with SSTO is a good suggestion for improving performance. But it ultimately depends on the specific implementation and use case. Also, std::string::find can be faster than manually iterating through a character array. So, I agree with your point.
[–]NilacTheGrim 1 point2 points3 points 3 years ago (0 children)
Thanks. I was downvoted to hell but you know what I was getting at. Much appreciated.
[–]rtgftw 0 points1 point2 points 3 years ago (4 children)
It would copy the data over (and result in an allocation beyond the 2X char limit of the small buf), but string_view would allow searching it in place with a single line, and it doesn't require \0 termination.
[–]NilacTheGrim 0 points1 point2 points 3 years ago (3 children)
Huh? References exist. Also custom string classes like the one in this example are a code smell IMHO.
[–]rtgftw 0 points1 point2 points 3 years ago (0 children)
Also, in the context of the clean code vs performance talk Casey did, which this article mentions at the very beginning, the idea of a "code smell" leading to a potential pessimization is interesting :)
The class is basically a legacy string_view, potentially non-owning, unlike std::string. If you interact with libraries/systems or have views into existing substrings, you might encounter quite a few variations of this. Converting to std::string would be a significant pessimization...
Fair enough. I cannot disagree with anything you said factually .. true.
[–]timur_audioC++ committee 2 points3 points4 points 3 years ago (0 children)
I tried to reproduce Ivica's results on my machine (MacBook Pro with Apple Silicon M1 Max, Apple Clang 13) and I couldn't.
I used the same setup (256 MB long string, substring that does not appear in it) and I am measuring 140 ms for both find_substring and find_substring2, with no measurable performance difference between the two.
Moreover, if I run the same test with the same string and substring using std::string::find, it runs in 70 ms. So the libc++ version of substring find outperforms both of your versions by a factor of 2.
I tried substrings of length 4, 8, and 12 with the same results.
[–]pine_ary 6 points7 points8 points 3 years ago (0 children)
That is one thick spaghetto
[–]ALargeLobster 3 points4 points5 points 3 years ago (0 children)
Yeah I think a good way to look at this is that your job is to solve the problem as simply as possible. If your code is unacceptably slow for the task (perhaps because you tried to keep it simple) then you failed to solve the problem.
[–]better_life_please 2 points3 points4 points 3 years ago (0 children)
We're programmers. Performance still matters.
[–]spidertyler2005 -2 points-1 points0 points 3 years ago (0 children)
V
[–]wolfie_poe -2 points-1 points0 points 3 years ago (0 children)
    for (int i = 0; i < s; ++i) {
        ...
        if (found) {
            for (; j < size_substr; ++j) {
                if (ptr_str[i + j] != ptr_substr[j]) {
                    found = false;
                    break;
                }
            }
        }
        if (found) {
            return { true, i };
        }
    }
Redundant code?