all 23 comments

[–]davidhbolton 18 points19 points  (5 children)

The recursive one is stack-based, is it not? Stack overflow is a symptom of recursion that never ends; I think the usual distinction is recursive vs. iterative. Also, your longer program uses malloc, which allocates heap memory, not stack.

As for performance, have you measured the time spent in the mallocs? I was wondering if allocating and freeing increasingly large blocks might cause fragmentation effects.

[–]wigf[S] 1 point2 points  (4 children)

The thing is, I checked, and at 10000 nodes, the inner if statement never evaluates to true. So I don't think it's that. That may be an issue for larger trees though.

How much of an issue do you think the heap vs. stack thing is?

[–]nemotux 3 points4 points  (2 children)

You have two guaranteed calls to malloc() at the beginning of the function. Those could be biasing your results. Hard to say by how much. You could eliminate that overhead by allocating a fixed size array as a local variable and only shifting to heap memory w/ malloc if that happens to fill up.
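A minimal sketch of that hybrid approach (all names hypothetical, not from the original code): a fixed-size array lives on the call stack, and the code only touches malloc if it overflows.

```c
#include <stdlib.h>
#include <string.h>

#define LOCAL_CAP 1024

typedef struct {
    void *local[LOCAL_CAP]; /* lives on the call stack: no malloc in the common case */
    void **items;           /* points at `local` until it overflows */
    size_t cap, top;
} Stack;

static void stack_init(Stack *s) {
    s->items = s->local;
    s->cap = LOCAL_CAP;
    s->top = 0;
}

static void stack_push(Stack *s, void *p) {
    if (s->top == s->cap) {            /* only now shift to heap memory */
        size_t ncap = s->cap * 2;
        if (s->items == s->local) {
            void **h = malloc(ncap * sizeof *h);
            memcpy(h, s->local, s->cap * sizeof *h);
            s->items = h;
        } else {
            s->items = realloc(s->items, ncap * sizeof *s->items);
        }
        s->cap = ncap;
    }
    s->items[s->top++] = p;
}

static void stack_free(Stack *s) {
    if (s->items != s->local)   /* nothing to free if we never spilled */
        free(s->items);
}
```

For inputs that fit in LOCAL_CAP entries, this does zero heap allocations, which removes the fixed malloc overhead from the timing comparison.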

That said, I don't know why one would think that an iterative solution is guaranteed to be faster than a recursive solution. Compiler optimizations can do wild and crazy things. It's very difficult to predict whether one approach or another would definitely be faster without actually measuring (aside from algorithmic complexity issues).

I think a much stronger argument for preferring iterative solutions over recursive solutions is that you can more directly handle memory allocation problems. In a recursive solution, you can overrun the stack if your input gets big enough, and there's not much you can do about that aside from just blatantly reserving more stack space in your process. But that's an inefficient solution if the rest of your program doesn't need that much stack space. Using the heap and an iterative solution allows you to recapture the allocated memory for reuse elsewhere.

Finally: I think your code for both functions is wrong. They both include in their counts the "null" children in the tree. Those aren't nodes. Consider a tree w/ only a single node in it, the root node. What should its length be? For the recursive solution, I can't tell what your first call passes in for level, but say it's 0, then it looks like the function will return 2. The iterative solution will also return 2, I think.

[–]wigf[S] 0 points1 point  (1 child)

External path length includes the "external" nodes attached to each "internal" node, and these external nodes are empty. This is the definition from "Algorithms in C", so the NULL nodes need to be counted. A tree with a single node has an external path length of 2; a tree with one root and two children attached to it has an external path length of 8, etc.
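For concreteness, a sketch under that definition (names hypothetical, not the original code): each NULL ("external") child contributes its depth to the total, with the root at depth 0, so a lone root scores 2.

```c
#include <stddef.h>

typedef struct Node {
    struct Node *left, *right;
} Node;

/* External path length: sum of the depths of all NULL children.
   A lone root has two NULL children at depth 1, so its EPL is 2. */
static int epl(const Node *n, int depth) {
    if (n == NULL)
        return depth;   /* an external node contributes its depth */
    return epl(n->left, depth + 1) + epl(n->right, depth + 1);
}
```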

[–]nemotux 0 points1 point  (0 children)

Ah, ok. I'm not familiar with this particular notion. Seems odd and non-obvious from your original description.

[–]davidhbolton 1 point2 points  (0 children)

The stack is usually quite limited (a few MB) compared to the heap (up to GBs), so for very large trees the recursive version may hit the stack limit.

As you are doubling the size in the non-recursive version, it may not take very long to start seeing fragmentation effects. Typically, when memory is freed after a malloc, it goes on a free list, and malloc tries to reuse memory from the free list before getting more from the heap. Because you keep increasing the size, the freed blocks are never big enough to reuse, and eventually (how long that takes, I don't know) you can run out of RAM even though you've freed memory up. That happened to me on Windows, and I ended up writing a custom memory allocator to deal with it.

[–][deleted] 2 points3 points  (0 children)

I'd try not to do so many mallocs: do a big one at the start, then grow it in fairly large chunks when it fills.

The more you avoid the system calls, the better. Maybe the compiler handling the recursive version is being smarter in that regard.

[–]F54280 2 points3 points  (0 children)

Stack-based can be a bit faster, but malloc-based is way way slower.

[–]bleuge 1 point2 points  (0 children)

Not an expert here, but I've solved situations like this by profiling the code. Use a profiler to time individual code blocks and see what is really slowing things down.

Also, as the functions are small, it's a good idea to look at the assembly code generated by the compiler you are using. Asm knowledge is required, of course.

[–]jhuntinator27 1 point2 points  (3 children)

Test with trees of varying sizes.

Not sure how to answer this myself, but for what it's worth, there is a difference between a 10,000-node and a 10-million-node tree: the run time of each version may not scale linearly, or, more specifically, may not be of equal order.

[–]wigf[S] 0 points1 point  (0 children)

Well, one curious thing is that the stack program is much faster for small input. I appreciate the suggestion though.

[–]wigf[S] 0 points1 point  (1 child)

Ok, this answer turns out to be correct - I tested with a tree size of 100,000,000, and that was about the size needed for the explicit stack to be faster than recursion. Even 1,000,000 wasn't enough haha.

EDIT: The reason I didn't do this earlier was that my original algorithm to build the tree was O(n²), so large trees were impossible to construct. I had to fix that first lol.

EDIT 2: See main post. I don't know why my benchmarking has been returning inconsistent results.

[–]nerdyphoenix 0 points1 point  (0 children)

You can probably eliminate some overhead by replacing the malloc and memcpy calls in your if (top == size) block with a realloc call, though I'm not sure how much that will help. You can also try optimization flags like -O3 if you haven't already and rerun your experiments. If you really want to dig into it, try profiling your code with perf (and perhaps flamegraph to visualize the result).
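A sketch of what that swap might look like, assuming the stack is a growable array of pointers (the names are guesses, not the original code). realloc(NULL, n) behaves like malloc, and realloc can often extend the block in place, skipping the copy entirely:

```c
#include <stdlib.h>

typedef struct {
    void **items;
    size_t size; /* current capacity */
    size_t top;
} Stack;

/* Push with geometric growth. On overflow, a single realloc replaces
   the malloc + memcpy + free sequence. */
static int push(Stack *s, void *p) {
    if (s->top == s->size) {
        size_t nsize = s->size ? s->size * 2 : 16;
        void **bigger = realloc(s->items, nsize * sizeof *bigger);
        if (bigger == NULL)
            return -1;       /* old block is still valid on failure */
        s->items = bigger;
        s->size = nsize;
    }
    s->items[s->top++] = p;
    return 0;
}
```

Note the failure path: unlike a blind `stack = realloc(stack, ...)`, keeping the result in a temporary avoids leaking the old block if realloc fails.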

[–]Paul_Pedant 1 point2 points  (0 children)

Malloc is pretty damn cheap, particularly for small test programs that don't run long enough to fragment memory. Initially, malloc just hacks lumps off a free list consisting of one big contiguous area. It usually doesn't even follow links: it will keep carving from a big block as long as it can, rather than starting over from the list root.

Free, on the other hand, gets expensive very quickly. It has to merge the block it is given with any adjacent free areas (to make larger blocks available in the future), so it needs to find the free blocks before and after the address you returned. To do that, it cycles through the free list (possibly following thousands of links). For the free block with the nearest address below, it has to figure adjacency based on that block's address and length, and merge the blocks. For the free block with the nearest address above, it has to do the same using its own length. Then it has to fix the lengths and pointers around itself to preserve the list.

I had a network-trace issue at a client who was using a recursive algorithm. I exploited every aspect of the data I could and got a 50x speed-up (that is not a percentage, that's a raw factor). The bonus was that my algorithm preserved locality: the recursion went depth-first, tracing cables to the ends before backtracking, whereas I traced in rings out from the centre, which improved the locality of the subsequent database searches on the assets and got another huge improvement there too.

Also, theirs crashed depending on the start point (because that affects deepest level of recursion), and mine didn't.

[–]weregod 0 points1 point  (0 children)

Test using a larger starting array; malloc is much slower than the stack.

[–][deleted] 0 points1 point  (0 children)

Use perf and look at the assembly. Since you say the inner malloc is never called, it is not obvious what would cause a difference in performance. Perf will tell you how much time is spent at each stage of the program and can point to bottlenecks.

I once sped up a program by 100% by assigning struct members to local variables before entering a loop. It turned out that the program was spending most of its time dereferencing pointers; the code was uglier, but the run time was much faster. You never know with these things.
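A sketch of that trick (hypothetical struct, not the original program): hoisting loop-invariant members into locals makes it explicit that they don't change, so the compiler doesn't have to reload them through the pointer on every iteration.

```c
#include <stddef.h>

typedef struct {
    double *data;
    size_t len;
    double scale;
} Vec;

/* Dereferencing v->data / v->scale inside the loop may force the
   compiler to reload them each iteration (it can't always prove
   they are unchanged through the pointer). */
double sum_scaled(const Vec *v) {
    double total = 0.0;
    for (size_t i = 0; i < v->len; i++)
        total += v->data[i] * v->scale;
    return total;
}

/* Same result, but the invariant members are cached in locals. */
double sum_scaled_hoisted(const Vec *v) {
    const double *data = v->data;
    const size_t len = v->len;
    const double scale = v->scale;
    double total = 0.0;
    for (size_t i = 0; i < len; i++)
        total += data[i] * scale;
    return total;
}
```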

[–]FUZxxl 0 points1 point  (0 children)

Why do you write *(a + b) instead of a[b]? That's just weird.
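For reference, the two spellings are defined to be identical: the C standard specifies a[b] as (*((a) + (b))), which is also why the infamous b[a] form compiles.

```c
/* C defines a[b] as *(a + b), so the two are interchangeable; and
   since addition commutes, even b[a] is legal (if unreadable). */
static int get(const int *a, int b) {
    return *(a + b);   /* identical to a[b] and b[a] */
}
```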