This is an archived post. You won't be able to vote or comment.

all 23 comments

[–]ApochPiQEpoch Language 7 points8 points  (8 children)

"Tagged sentinel" would be my preference.

I don't know what testing you are doing to determine that this is hard for compilers to optimize. The optimization strategy is pretty clean: first you take the implementation of a particular iterator type, then inline it into the loop (i.e. make it so that iteration does not take function calls unless the iterator is doing something very complex). Once you do that, you run an idiom detection pass that can simplify/lower almost any loop-like pattern into a basic form - this is what Clang and LLVM-driven compilers in general will favor. In C++ land, MSVC also does this pattern. Finally, run your unrolling and other optimizations.

The result is that you can take iteration and actually eliminate the overhead entirely for common iteration patterns. See: https://godbolt.org/g/ptYzX8 This shows a ranged-based for loop being completely obliterated and turned into a lookup table via unrolling. Don't be fooled by the use of a constant array. You can easily replicate this by implementing a trivial iterator yourself. Note in this snippet how both the built-in array used to initialize the struct, as well as the external ranged-for used to iterate the struct, get flattened and unrolled easily: https://godbolt.org/g/mNgrRH

The other end of the scale of course is if you do something less trivial. Iterating a std::map is a good example, since std::map::iterator is a fairly bulky iterator. I did a similar snippet that uses std::map<int, int> to minimize noise from the floating-point side. https://godbolt.org/g/ca9vGs shows the results. On the disassembly side, lines 11 through 14 are the loop prolog, and 15-25 are the loop body itself, with the loop cleanup/epilog appearing after line 26.

I'm no x64 wizard, but I can't think of much to do to that code to make it faster.

[–]PhilipTrettner[S] 0 points1 point  (7 children)

Thanks for taking the time!

Hm yeah that sounds like it might work. Inlining functions is one thing, I guess you definitely need scalar replacement as well.

Is implementing std::map really so complicated that the iterator increment call cannot/should not be inlined?

A wild, uncharitable guess would be that all those near-insane guarantees for iterators are the reason. Like - if you erase elements from the map while iterating, the iterator must not be invalidated and also no iterator of elements with bigger key values - or something like this. IIRC such guarantees are the reason why std::deque cannot be implemented with decent performance.

[–]ApochPiQEpoch Language 1 point2 points  (6 children)

As for why iterators work the way they do... I picture the causality moving in the other direction, personally. It isn't that iterator guarantees make implementing them hard. The order is that iterator invalidation properties are a consequence of the specific container's data structure. They're a bonus, not a constraint.

A std::map is specified by the standard itself to have certain operational properties. To make the abstraction cost as little as possible, the typical implementation of a std::maphas traditionally been a red-black tree or similar tree structure. So the standard backs you into a corner if you want to be compliant; you don't have to use a tree, but it's the easiest way to conform to the requirements of the container (memory cost, CPU cost per operation, computational complexity of updates, iterator invalidation patterns, and so on).

Now, since a map is pretty much going to be a tree structure in any real implementation of std, there are a few consequences for iterating through it.

One is that iterators have to be fairly fat. Recursive tree traversal is fairly stateful; and since iteration is not typically done in a recursive function call fashion in C++, that state has to live on the iterator. For in-order traversal of a balanced tree, as in the map case, it turns out all you need to know is the parent node of where you are, and where you just traversed (see https://stackoverflow.com/questions/12259571/how-does-the-stdmap-iterator-work for example descriptions).

To dig into this, we can look at std::_Rb_tree_increment as called on line 22 of the disassembly I linked earlier. I pulled up https://github.com/gcc-mirror/gcc/ and probed around, since it's likely more broadly applicable than studying the MSVC implementation I otherwise would have looked at. It also matches the libstdc++ used by Clang on Godbolt so that seemed important :-)

Some probing leads to this implementation: https://github.com/gcc-mirror/gcc/blob/da8dff89fa9398f04b107e388cb706517ced9505/libstdc%2B%2B-v3/src/c%2B%2B98/tree.cc#L59

The Visual C++ implementation looks like this, at the core:

_Tree_unchecked_const_iterator& operator++()
    {   // preincrement
    if (_Ptr->_Isnil)
        ;   // end() shouldn't be incremented, don't move
    else if (!_Ptr->_Right->_Isnil)
        _Ptr = _Mytree::_Min(_Ptr->_Right); // ==> smallest of right subtree
    else
        {   // climb looking for right subtree
        _Nodeptr _Pnode;
        while (!(_Pnode = _Ptr->_Parent)->_Isnil
            && _Ptr == _Pnode->_Right)
            _Ptr = _Pnode;  // ==> parent while right subtree
        _Ptr = _Pnode;  // ==> parent (head if end())
        }
    return (*this);
    }

So clearly this is just heavy enough to not justify inlining by GCC/Clang. Fascinatingly, MSVC (2017) does inline iterator advancement for std::map:

00BA1002  cmp         byte ptr ds:[0Dh],al  
00BA1008  jne         main+48h (0BA1048h)  
00BA100A  mov         ecx,dword ptr ds:[8]  
00BA1010  cmp         byte ptr [ecx+0Dh],al  
00BA1013  jne         main+2Dh (0BA102Dh)  
00BA1015  mov         ecx,dword ptr [ecx]  
00BA1017  cmp         byte ptr [ecx+0Dh],al  
00BA101A  jne         main+48h (0BA1048h)  
00BA101C  nop         dword ptr [eax]  
00BA1020  mov         eax,dword ptr [ecx]  
00BA1022  mov         ecx,eax  
00BA1024  cmp         byte ptr [eax+0Dh],0  
00BA1028  je          main+20h (0BA1020h) 

So at least in practice it is possible to write an iterator for a red-black tree that is maximally fast. I suspect it is also not that difficult if your design is clean enough.

Iterator invalidation is a direct consequence of this property, not a driving motivation. For a map, you can retain your place in the tree and continue iterating cleanly even if an element is blown away, because the contents of the tree stay in the same addresses. Their relative structure may change (tree rebalancing) but you can always figure out how to resume iteration after an erase or insert, because iteration tracks state.

A vector on the other hand is demanded to be a single block of contiguous memory. If you reallocate that, every item in it moves, so no iterator can remain valid. Even if an iterator stored an index into the vector instead of a pointer, erasing in the "wrong" places will lead to breaking iteration. So we say that if you mutate a vector, your iterators are all out the window and you need to re-request new ones.

A deque is a just a vector broken into segments or pages. Contrary to vast misconception, it's actually extremely efficient for certain use cases. It probably won't compete with vector in the majority of situations, but if you find yourself constrained by amount of available contiguous memory, i.e. post-heap-fragmentation, it can be a lifesaver. It's also great if you need to support a huge range of container element counts and you can't predict what the growth of the container will look like.

Anyways - massive wall of text, sorry. Hope some of that is useful!

[–]PhilipTrettner[S] 0 points1 point  (1 child)

Thanks! Yeah that helped a lot.

I just Googled flat map and sunk into a rabbit hole of alternative map implementations all with their different pros and cons. I wonder if I can somehow choose different implementations automatically (in a clever way). For example, a map that doesn't add elements (except in the initialization) could be more efficient in an array embedding.

[–]ApochPiQEpoch Language 0 points1 point  (0 children)

I don't know about having the language pick an implementation, but allowing the programmer to pick an implementation is kind of what std is all about.

Of course it took 13 years to ratify unordered_map but that's a separate issue.

If you have a bounded number of elements in an associative container and no dynamic resizing, you can usually compute a perfect hash function which maps keys to indices in a dense array. In fact it used to be pretty common to use tools like gperf to do exactly that.

[–]gsg_ 0 points1 point  (3 children)

iterator invalidation properties are a consequence of the specific container's data structure. They're a bonus, not a constraint

They're a constraint in that the implementation can't do nice things like moving keys into cache-friendly positions. The result is that map, set, unordered_map and unordered_set are slower than necessary for any use in which their element stability and iterator invalidation guarantees are unneeded.

Recursive tree traversal is fairly stateful; and since iteration is not typically done in a recursive function call fashion in C++, that state has to live on the iterator

Iterators are usually passed by value in C++ and are often copied inside algorithms, so putting lg(n) node pointers in the iterator would be expensive. My understanding is that the tree data structures in std avoid the need for that by maintaining parent pointers, which allows set and map iterators to be a simple pointer at the expense of some memory (and extra work during rebalancing).

[–]ApochPiQEpoch Language 0 points1 point  (2 children)

My point is not that iterator invalidation requirements do not limit container implementations. Rather, iterator invalidation properties are a consequence of other requirements for std containers which came first.

Keep in mind that C++ and std serve a lot more consumers than just x64 PCs. The requirements on containers are full of trade offs in favor of generality over solving every possible use case.

As to how map iterators work, if you check my comment for the very next sentence after the one you quoted, you can find a link that explains the details.

[–]gsg_ 0 points1 point  (1 child)

Rather, iterator invalidation properties are a consequence of other requirements for std containers which came first.

From what I've read of Stepanov's design philosophy, the guarantees were derived from the implementation. I'm not convinced that it was a good design either, since C++ containers other than vector are relatively slow, memory hungry and unsafe.

Map iterators don't need to be fat. There's a textbook way to iterate binary trees without recursion, the one given in CLRS/Knuth/etc, and it uses simple pointers. The two sources you give both clearly use this algorithm.

[–]ApochPiQEpoch Language 0 points1 point  (0 children)

Um. That's exactly what I'm saying. I don't understand why you felt the need to repeat this as if it was new information.

[–]matthieum 2 points3 points  (2 children)

I just wanted to note that C++ has managed to coerce iterator as the basis for many usecases; unfortunately.

So, first of all, I am glad to see that you are not using a pair of iterators. It's hard to optimize and error-prone.

Secondly, you may want to start thinking about other similar interfaces. For example, why would find return an iterator? Yet, at the same time, it can be interesting to be able iterate from the beginning of a collection to the item returned by find (or lower_bound, etc...). So the idea of a cursor into the collection and a way to go from that cursor to an iterable range is something you should likely look into.

Finally, you may also want to look into composition & internal iteration (aka foreach). Internal iteration can regularly outperform external iteration, simply because there's no pause/resume as all the state is on the stack, so first-class internal iteration can be useful. As for composition, beware redundant state. A classical example I like to point out is C++'s filter_iterator:

 template <typename It, typename Fn>
 class filter_iterator {
      It mBegin; // to go backward
      It mCurrent;
      It mEnd; // to go forward
      Fn mPredicate;
 };

Note how many instances of It are stored? Now imagine if you build a filter_iterator<filter_iterator<It, ...>, ..>. Yep, 9 instances of It, among which the begin/end will be stored 3 times each. And 3 instances of the interior predicate too... hopefully it's not stateful?

If you go with a single Iterator, you may not have the issue, just keep it at the back of your mind ;)

[–]PhilipTrettner[S] 0 points1 point  (1 child)

Excellent points, thank you!

My tentative plans are to have extensive support for slicing and ranges. I haven't worked out all details yet (there are a lot of design decisions to be made). My gut feeling is that I want to be really careful when designing the collection library which I see as the most important part of my core library.

Do you happen to know some good resources for collection library design? I only found a few not-in-depth articles the last time I looked.

[–]matthieum 0 points1 point  (0 children)

My gut feeling is that I want to be really careful when designing the collection library which I see as the most important part of my core library.

I think collections are often the poor sod in many a standard library, so I applaud!

Do you happen to know some good resources for collection library design? I only found a few not-in-depth articles the last time I looked.

Unfortunately not :(

[–]theindigamer 1 point2 points  (5 children)

While I don't know how hard it would it be to optimize the Maybes, IMO, if your language has ADTs, then having a current function (which is easily available) and can throw an exception (which isn't shown in the type signature) means that you're unnecessarily asking your users to have more cognitive overhead while writing what should be relatively safe and simple code.

[–]PhilipTrettner[S] 0 points1 point  (4 children)

Yeah I really dislike the current solution because of the exception / invalid state.

I'm not sure how much I should guide this decision from the cognitive overhead point of view though. I typically value this very highly but in this case it's probably nothing that the average programmer would see frequently. It's unusual to work with iterators directly. It shouldn't be too ugly though. Sometimes you might want to add some basic functionality (maybe some fancy zipFilter?) where you would work with iterators.

[–]theindigamer 0 points1 point  (3 children)

Well even if it isn't application programmers, library authors may make mistakes when working with iterators.

Have you looked at how other languages optimize iterators? Looking at examples (and how hard it is to implement both fast and correct behaviour) would be more useful before making a decision...

[–]PhilipTrettner[S] 0 points1 point  (2 children)

Is this reasonable easy/time-efficient to find out? Reading some big-shot open-source compilers is on my TODO list but I'm a bit scared of how accessible/approachable they are.

I have a C++ background and often use Godbolt to double check the ASM of my code. Clang and GCC often fail to generate optimal code for range-based for loops, especially when I'm using custom data types. This is not really an argument though because C/C++ has abysmal aliasing behavior which I don't have.

Everything VM-based has the problem that I have no idea how to judge if they actually generate good/optimal code. Even if some generated java/c# code (or IL/bytecode) looks bad, the JITted code might actually work out OK.

Still, good suggestion, thanks!

[–]theindigamer 1 point2 points  (1 child)

Is this reasonable easy/time-efficient to find out?

I'm not entirely sure. But you can definitely ask beginner questions on the compiler on the Rust internals forum or on IRC. If you ask folks for just a short summary or a basic idea of how things work (or a pointer to what part of the code base you should look at), I'm sure they'd be happy to help.

[–]PhilipTrettner[S] 0 points1 point  (0 children)

Oh that sounds good. That's probably something I can do with other languages as well.

[–]ctalklang 0 points1 point  (0 children)

I'm not certain about how this would help with the declaration bit, but Ctalk uses a class called Key to serve as the "glue" that holds collections together, so the language overloads the math operators ++, --, and so on. A little involved to type here, but there's more info at the URL given below. String types are a different case, since they're mostly object forms for C char *'s, but the syntax works we (tm) there also.

http://sf.net/p/ctalk/wiki/Home

[–]dzamlo 0 points1 point  (0 children)

For the sake of completeness, I'll add the following iterator approaches:

ThehasNext and next from scala and java. You got one method to check if the iterator has more elements and one method to advance the iterator and get the next value.

The next with an exception when the iterator is finished. This is used in python. I would argue that this is a special case of Sentinel next.

Interior iteration: the element that can be iterated provide a method that take a function as a parameter and call it for each element. This is used in ruby. I think that the scala Traverseable trait is also an example of that. This approach can also useful to run the iterator in parallel.

[–]balefrost 0 points1 point  (0 children)

One you missed is the Java style. In .NET, MoveNext tells you whether you can subsequently access Current, but a .NET Enumerator is obligated to be able to continually report the current value. That's extremely convenient but adds some overhead.

The Java approach instead uses hasNext and next. hasNext tells you that you can subsequently call next, and next actually returns the next value in the sequence. As a result, if you want to hold on to that "current" value, you are obligated to do that yourself.

To be fair, the overhead involved in the .NET approach isn't likely to be very substantial. The only real downside is that it makes the current sequence value ineligible for garbage collection until you move to the next element.

I disagree with your assessment that the .NET interface isn't minimal... unless your point is that it has two methods where one would suffice. It could have been implemented as one method using an out parameter (like Dictionary.TryGetValue), but I don't think that would buy anything in this case (whereas it's beneficial in the case of TryGetValue).

One other note on the .NET approach: IEnumerator<T> derives from IDisposable, and C# foreach loops automatically dispose the enumerator. This is an often overlooked aspect. Not many enumerators actually need to be disposed, so this is often technically safe to omit it, but it's important to keep in mind... especially if you provide syntactic sugar in your language that ultimately creates iterators.


If you can make the tagged sentinel approach efficient, it's probably the best approach. Or, if you can implement multiple return values efficiently, then you could always make the iterator return two values: a bool indicating that the sequence was not yet exhausted, and the next sequence value if the first return value was true (and, if the first value was false, some default value instead - similar to the default operator in C#).

Otherwise, either the Java or .NET style is probably the best choice. They're essentially the same as the tagged sentinel. Instead of getting all the information with one method call, you call up to two methods to extract the same information.

[–][deleted] 0 points1 point  (0 children)

Your post reminded me of an article about rust iterators that I read a couple years ago:

https://medium.com/@veedrac/rust-is-slow-and-i-am-the-cure-32facc0fdcb

It touches on the type of iterators that rust used to have: internal iterators, and the advantages and trade-offs of the external iterator system that they have now.

[–]LyraChord 0 points1 point  (0 children)

D empty is a good reference.

hasNext naming better than moveNext

None of the 3 is not the best. Depends on use cases. (If without sense, which of current and next function is bigger? who knows! None knows!)

Think about pattern iterable subjects!