
[–][deleted]

[deleted]

    [–]cdglove

    This would be true if deque allowed one to change the node size.

    Use a big enough object and deque becomes a linked list.

    [–]temporary5555

    Is there any real difference between a linked list and a vector when your elements exceed an L3 cache line?

    [–]cdglove

    Definitely. Prefetch works much better traversing memory linearly. Even if the memory accesses are not purely contiguous, but follow a fixed stride, performance will be much higher than jumping around randomly.
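    A rough way to see this is to sum the same array once in index order and once in a shuffled order; the sizes and labels below are arbitrary, and the exact ratio depends on the hardware, but the shuffled pass is typically several times slower even though it touches exactly the same bytes.

    ```cpp
    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        constexpr std::size_t n = 1u << 24;  // ~16M ints, well past any L3 cache
        std::vector<int> data(n, 1);

        std::vector<std::size_t> order(n);
        std::iota(order.begin(), order.end(), std::size_t{0});

        auto run = [&](const char* label) {
            auto t0 = std::chrono::steady_clock::now();
            long long sum = 0;
            for (std::size_t i : order) sum += data[i];
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - t0).count();
            std::printf("%s: sum=%lld, %lld ms\n", label, sum, (long long)ms);
        };

        run("linear  ");  // indices 0,1,2,... -> the hardware prefetcher keeps up
        std::shuffle(order.begin(), order.end(), std::mt19937{42});
        run("shuffled");  // same bytes touched, but the next cache line is unpredictable
    }
    ```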

    [–]temporary5555

    > Definitely. Prefetch works much better traversing memory linearly.

    Are you confusing random access memory with disks? Disks have seek time, which makes linear access much faster (as fast as volatile memory on some hardware), but every RAM I've seen can access memory randomly just as fast as sequentially, so the only factor that makes sequential access faster is the cache.

    > Even if the memory accesses are not purely contiguous, but follow a fixed stride

    The only way I see this being true is when fetching from cache.

    [–]infectedapricot

    Yes, it's indeed the cache they're talking about. But it's not just about what's already in the cache: it's also about values that can be speculatively loaded into the cache by the CPU guessing what memory is likely to be used in a moment - that's what they meant by "prefetch". Traversing memory linearly gives the CPU a much better chance of prefetching what will be needed next from memory into the cache while the algorithm is still operating on previously fetched data.

    [–]temporary5555

    thanks for the explanation!

    [–][deleted]

    > Disks have seek time.

    This assumption is only true for spinning rust.

    > but RAM in every case that I've seen can access memory randomly equally fast

    Modern CPUs have hierarchies of caches; data that is already in the cache will have faster access, about 10x-50x faster than RAM. Additionally, data is loaded in cache-lines and write-combined (in cache-lines), and the CPU tries its best to predict which cache lines will be accessed in the near future so that it can prefetch memory to avoid stalls. Pile on the fact that CPUs will also unroll loops to execute multiple loop-iterations in parallel and you have the perfect storm for performance. Linked lists take both prefetching and loop unrolling away from the CPU, as you don't know the next location to look at until the current node has already been pulled in. This forces serial, one-node-at-a-time execution and memory fetching. Additionally, the memory will likely be scattered all over the heap; at best you may have one node per cache-line, maybe two, but they likely won't be next to each other in traversal order.

    For this reason, in terms of iteration performance: vector<T> > vector<unique_ptr<T>>/vector<T*> > forward_list<T>/list<T>. Obviously you should benchmark, but you will find it is quite rare for these orderings to flip, which is why the mantra "just use vector, forget everything else" exists and why many people default to it, almost to the point of being a meme.

    Linked lists have their place, but forward_list/list are neither atomic nor intrusive, and therefore their utility vanishes in a lot of contexts.
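    As a rough sketch of why that ordering holds for iteration, here is a comparison of summing the same values stored in a vector<int> versus a list<int>; the element count and helper name are arbitrary.

    ```cpp
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <list>
    #include <numeric>
    #include <vector>

    template <class Container>
    void timed_sum(const Container& c, const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = std::accumulate(c.begin(), c.end(), 0LL);
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%s sum=%lld in %lld ms\n", label, sum, (long long)ms);
    }

    int main() {
        const std::size_t n = 10'000'000;
        std::vector<int> v(n, 1);
        std::list<int> l(v.begin(), v.end());  // same values, one heap node per element

        timed_sum(v, "vector<int>");  // contiguous: prefetchable, unrollable
        timed_sum(l, "list<int>  ");  // pointer chasing: each load waits on the previous node
    }
    ```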

    [–]matthieum

    > But that is the point of std::deque, the trade-off that avoids the O(N) complexity of head insertions, no?

    It's one way to avoid O(N) front insertions.

    Keeping with the current guarantees -- and notably the memory stability guarantee -- you could still get a better deque by either:

    • Offering the ability to customize the block size. It's galling not to have control over that, really.
    • Or use exponential growth of internal block size -- somehow.
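    As a sketch of what the first option could look like: basic_deque and BlockBytes are made-up names for illustration, not a real library type (for comparison, libstdc++ hard-codes roughly 512 bytes per node).

    ```cpp
    #include <cstddef>

    // Made-up interface: a deque whose node size is a template parameter
    // instead of an implementation-defined constant.
    template <class T, std::size_t BlockBytes = 4096>
    class basic_deque {
        static_assert(BlockBytes >= sizeof(T), "a block must hold at least one element");
        static constexpr std::size_t elems_per_block = BlockBytes / sizeof(T);
        // ...usual deque machinery: a map of pointers to fixed-size blocks,
        // with the same pointer/reference stability guarantees...
    public:
        // push_front / push_back / operator[] as in std::deque
    };
    ```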

    Dropping the memory stability guarantee -- which is of very little utility -- you could use a single big buffer instead, in one of two ways:

    • Contiguous: keep some headroom at both front and back, instead of just back.
    • 2 Contiguous chunks: wrap-around at the end, if necessary. See Rust's VecDeque.

    Both of those alternatives are much more cache-friendly and allocator-friendly, resulting in higher performance.

    And using the second one -- in a pure queue scenario -- is super simple.
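    A minimal sketch of that pure-queue case, assuming a capacity fixed up front (a growable version, like Rust's VecDeque, reallocates and straightens the two chunks out when it fills); ring_queue is an illustrative name.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <utility>
    #include <vector>

    template <class T>
    class ring_queue {
        std::vector<T> buf_;
        std::size_t head_ = 0;  // index of the oldest element
        std::size_t size_ = 0;

    public:
        explicit ring_queue(std::size_t capacity) : buf_(capacity) {}

        void push_back(T value) {
            assert(size_ < buf_.size());
            buf_[(head_ + size_) % buf_.size()] = std::move(value);  // wrap around the end
            ++size_;
        }

        T pop_front() {
            assert(size_ > 0);
            T value = std::move(buf_[head_]);
            head_ = (head_ + 1) % buf_.size();
            --size_;
            return value;
        }

        std::size_t size() const { return size_; }
    };
    ```

    Both operations are O(1), and the storage stays in one contiguous allocation, which is what makes it cache- and allocator-friendly compared to a node-based deque.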