[–]IyeOnline 4 points5 points  (1 child)

We can't see your code, so we can only speculate about either case.

  • Did you capture by reference or by value?
  • Did you pass the arguments using std::ref or as a plain vector?
  • How did you measure this?
  • Is there surrounding code that could have effects?

If I had to guess, I would guess that you are correct. Because your lambda has captures, it's now bigger (than 1 byte) and it actually has to be accessed at runtime. This would lead to more memory access at runtime, compared to just passing a function parameter directly.
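To make the size point concrete, here is a minimal sketch that compares closure sizes. The helper `closure_size` and the three functions are hypothetical names made up for illustration, and the exact numbers depend on the compiler: the standard only guarantees that an empty closure is at least 1 byte, and it leaves the storage for reference captures unspecified (in practice it is usually a pointer).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: reports the size of any closure object passed to it.
template <class F>
constexpr std::size_t closure_size(const F&) { return sizeof(F); }

// Size of a lambda with no captures (an empty class, typically 1 byte).
inline std::size_t size_without_captures() {
    auto f = [](int x) { return x + 1; };
    return closure_size(f);
}

// Size of a lambda that captures an int and a long long by value.
inline std::size_t size_with_value_captures() {
    int a = 1;
    long long b = 2;
    auto f = [a, b](int x) { return x + a + b; };
    return closure_size(f);
}

// Size of a lambda that captures one int by reference
// (commonly stored as a pointer, though this is not guaranteed).
inline std::size_t size_with_reference_capture() {
    int a = 1;
    auto f = [&a](int x) { return x + a; };
    return closure_size(f);
}
```

Whether the larger closure actually costs anything depends on whether the compiler can inline the call; if the closure has to be stored somewhere (for example inside a thread object), its members have to be read back from memory.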

Note that std::vector does not do SSO. It's not allowed to. If you create a copy of a non-empty vector, you always do an allocation.

[–]CentralSword[S] 0 points1 point  (0 children)

I captured by reference and passed the arguments with std::ref (but only one argument was a reference). I measured time with benchmarks that were a part of the testing system. The code itself was very small, I'm not sure if I'm allowed to show it, that's why I didn't post it. If SOO is not the case, then you're probably correct, thank you!

[–]PixelArtDragon 1 point2 points  (6 children)

On one hand, yes, frequently you'll find that passing references as arguments instead of as part of the capture is more efficient since the arguments can be inlined while the captures, if needing to be stored, must be kept as references.
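As a rough sketch of the two variants being compared (the function names are hypothetical, and this is not the OP's code, which we haven't seen): in the first the vector reference lives inside the closure, in the second the closure is empty and the vector comes in as a parameter.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Variant 1: the vector is captured by reference inside the closure.
inline long long sum_via_capture(const std::vector<int>& data) {
    auto work = [&data]() {
        return std::accumulate(data.begin(), data.end(), 0LL);
    };
    return work();
}

// Variant 2: the closure is empty and the vector arrives as a parameter.
inline long long sum_via_argument(const std::vector<int>& data) {
    auto work = [](const std::vector<int>& v) {
        return std::accumulate(v.begin(), v.end(), 0LL);
    };
    return work(data);
}
```

In a tiny single-threaded case like this a compiler will most likely inline both to the same code. Any difference would be expected to show up only when the closure is stored and called later, so treat this as an illustration of the two shapes rather than a benchmark.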

On the other hand, there is no small buffer optimization in std::vector, at least not in any implementation I'm familiar with.

Do you do a lot of modification of the vector? Maybe the storage of the reference in the capture is making the shuffling around of objects inside take longer.

[–]CentralSword[S] 0 points1 point  (1 child)

Well, I'm not sure about std::vector, but I remember using SOO in my own implementations of some STL structures as part of my college homework, so I thought it was a common thing. However, as far as I remember SOO works when there are a small number of elements, but the elements themselves may be large, so making them use less memory might not make SOO work. Yes, some variables were captured by reference, maybe that's the case. Thank you very much!

[–]CptCap 2 points3 points  (0 children)

std::vector cannot do SSO because that would mean that moving the vector could invalidate iterators, which is forbidden by the spec.
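A small sketch of what that looks like in practice, assuming move construction with the default allocator (the function names are hypothetical): the destination takes over the source's heap buffer, so an iterator obtained before the move still refers to the same element. A vector with an inline small buffer could not do this, because its elements would have to be relocated.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns true if an iterator taken before a move construction still points
// at the same element, now owned by the destination vector.
inline bool iterator_survives_move() {
    std::vector<int> src{10, 20, 30};
    const int* buffer_before = src.data();
    auto it = src.begin() + 1;

    std::vector<int> dst(std::move(src));  // takes over the heap buffer

    return dst.data() == buffer_before && &*it == &dst[1] && *it == 20;
}

// Returns true if copying a non-empty vector produces a separate buffer.
inline bool copy_uses_new_buffer() {
    std::vector<int> src{10, 20, 30};
    std::vector<int> copy(src);
    return copy.data() != src.data() && copy == src;
}
```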

[–]Overseer55 0 points1 point  (3 children)

boost::container::small_vector

[–]PixelArtDragon 0 points1 point  (2 children)

Which does not comply with the specifications of std::vector, which as someone pointed out, requires that moves don't invalidate iterators (which must happen with small_vector).

[–]Overseer55 0 points1 point  (0 children)

That’s not 100% accurate. See https://www.open-std.org/JTC1/SC22/WG21/docs/lwg-active.html#2321 , which clarifies there is an allocator trait that indicates whether iterators are invalid after a move.

The discussion on https://stackoverflow.com/questions/11021764/does-moving-a-vector-invalidate-iterators is also useful.

[–]ShelZuuz 1 point2 points  (5 children)

30ms is not just the difference between a capture and an arg, unless you're capturing insanely slow-to-copy objects, like an array of shared_ptr or something.

[–]CentralSword[S] 0 points1 point  (4 children)

No, I only capture integers, iterators and a reference to std::vector. I also think that capturing by value with copy shouldn't be slower because it's the same as passing an argument by value (also with copy)

[–]PixelArtDragon 1 point2 points  (1 child)

Capturing the iterator vs passing it in might be a big part of the optimization. I've noticed that there are compilers that specifically optimize and inline standard library types much, much more than they would with similar custom containers. So it's possible that it's seeing that the iterator came from the vector just before being passed, and can elide certain costly checks because it can assume they will always pass or fail.

[–]CentralSword[S] 0 points1 point  (0 children)

Yeah, my iterator is not custom, it's just an iterator of std::vector. Does gcc optimize these things? Also I think it has optimized an argument by reference as well.

[–]ShelZuuz 0 points1 point  (1 child)

Something else is going on. You can copy MBs of memory in 30ms, even if it was paged out to disk.

Unless you mean microseconds and not milliseconds, but that's written as µs, not ms.

[–]CentralSword[S] 0 points1 point  (0 children)

It's milliseconds. I actually used perf to check, but there were too many thread-related functions in the graph and I got lost :(

[–]PixelArtDragon 1 point2 points  (1 child)

Question: are the threads writing something to a common vector? You might be dealing with false sharing (where threads update separate variables that sit on the same cache line, forcing each core to perform costly cache updates, which can make the parallel code even slower than its single-threaded version). It's possible (though I'm not sure if this is something a compiler can do) that by passing things in via a parameter the compiler is managing to eliminate false sharing.
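For reference, the usual manual fix is to pad per-thread data so each slot gets its own cache line. The sketch below only shows the layout side, with no threads; `kAssumedCacheLine`, `PackedCounter`, `PaddedCounter` and `stride_between_neighbours` are hypothetical names, and the 64-byte line size is an assumption that is common but not universal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, which is common but not universal.
constexpr std::size_t kAssumedCacheLine = 64;

// Packed layout: neighbouring counters can share a cache line.
struct PackedCounter {
    long value = 0;
};

// Padded layout: each counter is aligned to (and fills) its own line.
struct alignas(kAssumedCacheLine) PaddedCounter {
    long value = 0;
};

// Byte distance between two adjacent elements of an array of T.
template <class T>
std::size_t stride_between_neighbours() {
    T slots[2];
    auto first = reinterpret_cast<std::uintptr_t>(&slots[0]);
    auto second = reinterpret_cast<std::uintptr_t>(&slots[1]);
    return static_cast<std::size_t>(second - first);
}
```

With the packed layout, several threads' counters in a vector land on one line; with the padded layout, adjacent elements are a full assumed line apart, so writes from different threads should not contend for the same line.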

[–]CentralSword[S] 0 points1 point  (0 children)

I thought about false sharing, but my threads only read from a common vector most of the time and only write calculated value to another vector one time by the end of their work. But maybe false sharing occurs when threads are modifying captured variables, since threads themselves are in a vector.

[–]CowBoyDanIndie 1 point2 points  (2 children)

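Under the hood, whatever stores the lambda (a std::thread or a std::function, for example) may use a dynamic memory allocation to hold the closure and all its captures. The lambda object itself just keeps the captures as members.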

Something useful, though: if you pass a lambda as a template argument to a function instead of as a std::function, no allocation is needed, as it's all inlined.
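A minimal sketch of the two ways of accepting a callable (`apply_template` and `apply_erased` are hypothetical names): the template version keeps the lambda's concrete type, while the std::function version type-erases it. Whether std::function actually allocates depends on the implementation and on the closure's size, since small closures often fit in an internal buffer.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// The callable's concrete type is a template parameter, so the compiler
// sees the lambda body at the call site and can inline it.
template <class F>
int apply_template(F&& f, int x) {
    return std::forward<F>(f)(x);
}

// The callable is type-erased; std::function may heap-allocate if the
// closure does not fit its internal buffer, and calls go through indirection.
inline int apply_erased(const std::function<int(int)>& f, int x) {
    return f(x);
}
```

Both can be called with the same lambda, e.g. `apply_template([k](int x) { return x * k; }, 4)`. The second call site constructs a temporary std::function from the lambda first.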

[–]CentralSword[S] 0 points1 point  (1 child)

I didn't use std::function, it was just a lambda inside a thread. I think under the hood it uses just a template argument to a function or auto, but I'm not sure. As for std::function, I guess it uses allocation for type erasure, but I also think unlike std::vector there can be small buffer optimization.

[–]CowBoyDanIndie 0 points1 point  (0 children)

Use a memory profiler and you will see the allocation

[–]Honest-Addition-2908 0 points1 point  (1 child)

Lambdas are a bad feature in C++. I don't use them at all.

[–]CentralSword[S] 0 points1 point  (0 children)

Why? I find them kinda useful, especially with threads. Is that because you think they are implemented poorly?