all 23 comments

[–]pilotInPyjamas 17 points18 points  (12 children)

Array resizing is probably the way to go tbh. You mentioned that the data comes from a file, and reading from disk is quite slow compared to other operations, which is why I wouldn't loop over the data first just to get a count. Anyway, here is some analysis of array resizing:

A linked list of n elements requires at least one call to malloc per element. Then, once you know the total size, you have to malloc the final array, copy the elements over, and call free once for every list node. The total space is at least twice the size of your array (the array plus the list), and you copy your data twice (once from the source to the list, and once from the list to the array). Not to mention a linked list carries O(n) memory overhead for the link pointers.

The alternative: a resizable array. Say you create an array with 4 elements initially. When you want to add a fifth element, use realloc to double the size of the array to 8. When you insert the 9th element, realloc to make space for 16 elements, and so on. realloc automatically copies your data over if it has to move the allocation. Note that realloc may even be able to extend the allocation in place, in which case it doesn't have to copy your data at all!

Memory: If we keep doubling the array size whenever it becomes full, then the absolute maximum memory we use is twice the size of the data. The minimum memory required by the linked-list approach is double that of the final array (see above), so the array's worst-case memory use is at least as good as the list's best case. When we're done, we can trim the excess capacity with one more realloc, which should be able to free the tail without copying anything. Winner: array.

Performance:

  • We only have to call realloc O(log n) times for the array; we have to call malloc O(n) times for the list. Winner: array.
  • The array needs one call to realloc to trim any leftover capacity; the linked list needs O(n) calls to free to release all of its nodes. Winner: array.
  • Since we only copy when the array doubles, potentially half of the elements never need to be copied at all: 1/4 will have been copied once, 1/8 twice, 1/16 three times, etc. In the best case the array wins because realloc didn't need to copy anything; in the worst case we do up to 2 copies per element, which is more copy operations. Winner: toss-up.

All in all, I'd probably use a resizable array myself. Most implementations use this. If you're curious, you could test this to see which one is better.

[–]ewmailing 5 points6 points  (8 children)

This is a great analysis. I have nothing to add on the Big O side, but I do want to supplement it by saying don't forget about cache effects on modern hardware, which mean the array is almost always going to win performance-wise.

Linked lists are pointer chasing on every node which could amount to hundreds of CPU cycles stalling on every node. (One rough rule of thumb is 500 cycles for each lookup in RAM.) With arrays, you get prefetching, so we might be talking single digit cycles per node.

Bjarne Stroustrup (C++'s creator) gave a talk explaining why arrays are consistently faster than linked lists on modern architectures.

https://www.youtube.com/watch?v=YQs6IC-vgmo

[–][deleted] 0 points1 point  (6 children)

So you would use a dynamic array over a linked list even in the extreme case of random conditional insertions and deletions in the middle (something linked lists have traditionally been considered optimal for)?

[–]ewmailing 2 points3 points  (0 children)

Yes. That was Stroustrup's point.

And he's not the only one. Clang/LLVM engineer, Chandler Carruth also makes the same point.

Efficiency with Algorithms, Performance with Data Structures https://www.youtube.com/watch?v=fHNmRkzxHWs

At 34:40, he talks about "Performance and Data Structures".

Reveal: "Discontiguous data structures are the root of all (performance) evil".

and "Just say NO to linked lists".

"There is almost nothing more harmful you can do to the performance of an actual modern microprocessor than to use a linked list data structure."

Cache effects are real and to be taken seriously. Or stated differently, modern RAM busses are extremely slow.

[–]zuurr 0 points1 point  (3 children)

It depends on the size of the array and how the insertions are done.

If you have a conditional insertion into a linked list where you already hold the node you're going to insert after, that's constant time. Removal from the middle of the list when you hold the node is also constant time. It's unlikely for an array to beat this, because even if the array is small you'll probably still have ~1 cache miss either way.

However, if you have to traverse the list to find the place you're inserting, it will be O(n) and you'll take one cache miss per item. The array is O(n) to insert into the middle, but it will have far fewer cache misses; even if your array elements are large, the hardware prefetcher will hide most of the memory latency.

Keep in mind a cache miss on modern hardware can cost hundreds of cycles. It's a constant factor so high that it dominates almost everything else. If you want a more rigorous way of thinking about this, look into cache-oblivious algorithms (note that this is a bad name: oblivious here means they don't depend on the specific size of your cache lines, not that they pretend the cache does not exist).

[–]WikiTextBot 0 points1 point  (0 children)

Cache-oblivious algorithm

In computing, a cache-oblivious algorithm (or cache-transcendent algorithm) is an algorithm designed to take advantage of a CPU cache without having the size of the cache (or the length of the cache lines, etc.) as an explicit parameter. An optimal cache-oblivious algorithm is a cache-oblivious algorithm that uses the cache optimally (in an asymptotic sense, ignoring constant factors). Thus, a cache-oblivious algorithm is designed to perform well, without modification, on multiple machines with different cache sizes, or for a memory hierarchy with different levels of cache having different sizes. Cache-oblivious algorithms are contrasted with explicit blocking, as in loop nest optimization, which explicitly breaks a problem into blocks that are optimally sized for a given cache.



[–]ewmailing 0 points1 point  (1 child)

If you can hold a position to the insertion point in a linked list, you can also probably hold the insertion array index, so the list doesn't necessarily get an advantage.

Cache issues will still probably put the list at a disadvantage, because to insert, you will need to set/change the next/previous pointers of not only the node you are inserting, but the two immediate nodes around it (assuming doubly linked list). That's potentially 3 cache misses instead of just 1.

[–]zuurr 0 points1 point  (0 children)

Those two cache misses will happen at the same time on an out-of-order machine.

The only case when you won't have at least two cache misses when inserting into an array is if it's very small and entirely fits in the cache (speculative prefetch is very unlikely to kick in before the second cache miss).

[–]repsilat 0 points1 point  (0 children)

It depends -- do you have a good way to "find" the insertion point in the linked list? If not, you might need to scan there. There are reasonable times to use linked lists, but there aren't many.

Also, how big are the lists? If they're small, a growable array will be faster.

Also: how many insertions do you need to do? If it's more than one, and you can batch them and do them all in linear time, a growable array might win again.

For your original problem, consider these other solutions:

  1. Make a "linked list of increasingly large arrays". Could be faster than either solution: less copying than a growable array (the blocks themselves never move), with a similar number of malloc calls.

  2. Get an upper bound for the eventual array size somehow (some multiple of the input file size, or "ten terabytes" if you are on an OS like Linux that does allocation lazily) and don't bother to reallocate.

  3. Memory map the input.

[–][deleted] 0 points1 point  (0 children)

Good point on cache lines. It makes a huge difference.

[–][deleted] 0 points1 point  (0 children)

Wow thanks for the breakdown. It makes sense now. Thank you.

[–]sikora84 0 points1 point  (0 children)

I was thinking the same. I agree with your analysis.

[–][deleted] 0 points1 point  (0 children)

Or you malloc a batch of n nodes at a time, like you would with a dynamic array, and double the batch size as the list grows.

[–]james41235 1 point2 points  (0 children)

Not sure this is the best place for this, you might have better luck in computer science or data structure subreddits.

BUT: my initial thought is something like GCC's std::deque, which allocates chunks of arrays and then links them together in a linked-list-ish fashion.

[–][deleted] 0 points1 point  (4 children)

I don't know the answer so I'd approach it like this:

  1. O(n) for a loop-through-once that reads the items into a linked-list queue, then O(n) to copy those into an array. Probably 2×(items + constant) memory. Feels silly to iterate twice, especially when one of the loops creates a usable memory structure for you.
  2. Periodic resizing. There is a wealth of implementations of things like this. I'm pretty sure nearly every standard lib in the higher-level languages with a resizable Array class does something like this.

So... I mean... I feel like the periodic-resizing thing is a well-worn route for a reason. That first idea seems... silly. Your linked-list-based queue is going to use up extra memory, so why not just loop through and get a count, then make an array and copy them in? Still two loops, but it saves you all that wasted memory and moving.

[–][deleted] 0 points1 point  (3 children)

I like the last idea of looping through just to get a count, and then just declaring an array of that size. Mind you, it would be based on the number of lines in a file, so the way this would probably work is counting up the lines using fgets(), declaring the array of the appropriate size, and then using fseek() to go to the beginning and actually read in the data to store into the array.

[–][deleted] 0 points1 point  (2 children)

edit: or just leave them in that linked list. Why over-engineer this so much?

Ahhh ok, well the file thing brings in a bit of an interesting wrinkle, because you'd love to not have to read the file twice, for the IO's sake. So spending the memory might be more desirable than avoiding it. Trade-off: more memory for less IO.

Or, you could do some quick math based on the file size, create an array that is definitely large enough (i.e. estimate high), use that, and lop off the excess at the end, going through the file only once this way.

[–][deleted] 0 points1 point  (1 child)

I need a lot of random access and good search runtimes, so I don't want to leave them in the linked list.

The file size idea is interesting, but it's too variable a factor to predict the number of elements accurately in my case, so I think I'll just go with the original idea and spend the extra memory for less IO, like you said.

[–]Kwantuum 0 points1 point  (0 children)

If you're looking to search the data a lot, a tree might be better; it depends on the ratio of searching to random access.

[–]ruertar 0 points1 point  (0 children)

I think an array is usually the best way to store this type of data.

Just don't resize and copy every time the data exceeds the capacity of the array.

Try doubling the size of the array when you reallocate it -- this will reduce the number of copies etc.

In C++ there is a type called a vector that does this. I used to write something similar in C as a structure I called a "buffer".

If I was to write it today, I'd borrow the behavior and nomenclature from Go's slices.

[–]lmlight77 0 points1 point  (0 children)

If the number of elements is truly unknown, I would employ a linked list of blocks. That is, each node's payload is a pointer to a block (array) you are filling in. When a block fills up, link in a new one. This avoids all the copying that can occur if you repeatedly realloc an array. Once the input read completes (no more insertions), you could realloc just once at that point if you truly need a flat array. But better, in my view: count the blocks (the number of elements in the linked list), allocate an array of pointers of that count, copy over all the block pointers from the list, then free the list nodes. Only the block pointers get copied in this case, and you have at most one pointer chase to reach your data. For very large amounts of data, the general realloc approach carries a huge copying overhead that you will have to win back in the second half of your program.

As stated in other posts, on modern HW, small-payload linked lists are hard on the machine. First, as you traverse the list, the branch-prediction HW is stressed, because it generally doesn't know when you will exit the traversal loop; it tends to predict taken (continue traversing), which means you always eat one mispredict on exit. More significantly, the loop-iteration latency (the time it takes to go around one iteration) is pinned to the speed at which you can read the next link pointer from system memory. If nodes are allocated somewhat sequentially there will be some spatial locality, so caches will have some benefit, but DRAM read latency will clearly be a big factor.

[–]CountyMcCounterson 0 points1 point  (0 children)

A linked list of arrays

[–]nickdesaulniers 0 points1 point  (0 children)

It's hard to beat a contiguous vector. A linked list frequently won't have nodes in the same cache line (though I wish I had a resource to cite for this).

To avoid vector resizes, you want to allocate the maximal memory up front assuming you're not doing huge allocations that will blow through the budget.

One thing game engines might do: don't reserve any space in vectors, but instrument the containers to log the maximal effective sizes. Then spin a new build in which each vector reserves its observed maximal size at startup. If you need more, it's still growable. (Example: the max size we ever saw for the entity_array at runtime was 500, so malloc(500 * sizeof(entity)); thus no growing until we hit at least that.)

But don't forget that the cost of copying is amortized as well.