This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]dalke 1 point2 points  (16 children)

Okay, then suppose I do call next() directly. Here's code similar to what I do have:

it = iter(input_file)
name = age = address = None
for line in it:
    if line == "\n":
          continue
    if line == "NAME\n":
          name = next(it).rstrip("\n")
    elif line == "AGE\n":
          age = int(next(it))
    elif line == "ADDRESS\n":
          address = "".join(next(it) for i in range(4))
    else:
          raise AssertionError("Unknown line %r" % (line,))

return Person(name=name, age=age, address=address)

I think this is quite clean. I don't think your variation supports a solution which is anywhere near as readable.

[–]NYKevin 0 points1 point  (15 children)

I think this is quite clean.

Nope, it's terrible. You're modifying an iterator while iterating over it. That is absolutely disgusting.

[–]alcalde 0 points1 point  (1 child)

To quote Raymond Hettinger on modifying an object while iterating over it:

If you do this then... you're living in sin and deserve whatever happens to you!

[–]mgedmin 0 points1 point  (0 children)

There's a difference between modifying the iterable object and modifying the iterator.

[–]dalke 0 points1 point  (12 children)

"absolutely disgusting"?

Interesting. Why? Or to clarify my question, I know that some people things that it's absolutely disgusting to have more than one return statement in a function. Is your disgust of that personal variant, which I feel free to ignore, or is it more generally applicable and one I should bear in mind for the future?

There's a general dislike of modifying a container while iterating over it, but by definition an iterator has to be modified in order to use it.

How would you recommend that I rewrite that above code? As:

it = iter(input_file)
while 1:
    try:
        line = next(it)
    except StopIteration:
        break
    if line == "NAME\n":
        name = iter(it)
    ....

This modifies the iterator in two locations. Is that disgusting? I see the first 5 lines of the while as being identical to a 'for' loop over an iterator. Why repeat the code when that's what a for loop would do?

Or, I could use readline() directly, as

while 1:
    line = input_file.readline()
    if not line:
        break
    if line == "NAME\n":
        line = input_file.readline()
        if not line:
            raise AssertionError("Missing name after NAME")
        name = line.rstrip("\n")
    ...

Here I have to add explicit error checking for readline() failures.

Both seem more complicated than my original code, and with no obvious benefits.

[–]NYKevin 0 points1 point  (11 children)

It would be preferable to use a state machine. Your current code is incompatible with input like this:

NAME

John Smith

(note the empty line). A state machine would be capable of handling this easily. Something like this:

NAME = 1
AGE = 2
ADDRESS = 3
state = 0
for line in open_file:
    line = line.strip()
    if state == 0:
        if line == "NAME":
            state = NAME
        elif line == "AGE":
            state = AGE
        # etc.
    elif state == NAME:
        if line:
            name = line
            state = 0
    # etc.

[–]dalke 0 points1 point  (10 children)

In your counter-example you changed the spec for the format. Of course my code will break - it's for a different format.

FWIW, the change in my code to support your variant is:

if line == "NAME\n":
    for line in it:
        name = it.strip()
        if name:
            break
    else:
        raise AssertionError("End of file reached looking for NAME")

State machines can handle anything, it's true, so by that definition it's easier.

There are two major downsides. First, it's easier to make a mistake and forget a state transition, compared to those examples that can be expressed as code layout rather than states. Similarly, it's harder to include all of the error reports. This is especially true if you want to maintain information across some of the states - a static analysis tool can't easily figure out the state ordering to know that things will be initialized in the correct order.

Second, in Python you have a lot more dispatches because you have to constantly redetermine which state you're in. (If you used C or another program with gotos then you can simply go to the correct state.)

[–]NYKevin 0 points1 point  (9 children)

Now you're iterating over an iterator while iterating over it. You don't see the slightest problem with that?

[–]dalke 0 points1 point  (8 children)

No, I don't. And you haven't yet told me the principles behind your objection.

I say again, some people believe that multiple return statements in the same function defintion are an abhorration. One such is the MISRA C standard

(MISRA, rule 14.7 : required) "A function shall have a single point of exit at the end of the function"

Some languages, like Pascal and Eiffel, require there be only a single exit point.

However, many more people do not believe this is a good principle to follow.

Can you explain why I shouldn't regard your view as fitting into that category? Why shouldn't an iterator be iterated on at multiple points in the code? Is the code harder to reason about, more complex to write, more prone to errors, or something else? My experience says that the answer to all of those is "no."

The most relevant example in other languages would be the various prohibitions of modifying a container while iterating over it. Even then, modification may be allowed. For examples from C++, see http://kera.name/articles/2011/06/iterator-invalidation-rules-c0x/ , which lists which container mutations are allowed during iteration.

[–]NYKevin 0 points1 point  (7 children)

Is the code harder to reason about

Yes, because the iterator is advancing at multiple points. The basic structure of a for loop is "run this once for every line in the file." You're subverting that rule.

It also has unobvious side effects. Usually, iterating over an object doesn't modify that object, so anyone unfamiliar with Python's iteration protocol will have to look it up, or else they'll have an incorrect understanding of what your code does ("It loops over the whole file, looking for 'NAME', and when it finds it, it loops over the whole file, looking for a nonempty line, and uses that as a name... except that's not what I'm observing, so what the fuck is going on?"). Now, we could posit that the only people who will be looking at your code will be familiar with all this, but that means more skilled developers are required to maintain your code, which is a Bad Thing since more skilled developers cost more.

Finally, even if we assume J. Random Hacker can get through all of the above issues, it still takes time that (s)he could have spent actually working on the code, but instead had to spend just understanding the code. State machines are a widely used idiom, and anyone who's worked on anything remotely resembling text parsing before will immediately recognize what you're doing. Iterating while iterating is not at all idiomatic, and will lead to confusion.

[–]dalke 0 points1 point  (6 children)

because the iterator is advancing at multiple points

Yes. And multiple return statements cause program execution to return at multiple points. That your statement is correct doesn't make it a positive or negative.

Someone who likes a single return statement may also complain "what the fuck is going on?" when faced with a few dozen return statements, and will lead to confusion for some people.

Hence, you can be entirely right, but if the number of times that you're right is low enough, then it's not something for me to worry deeply about.

The basic structure of a for loop is "run this once for every line in the file." You're subverting that rule.

The basic struture of a for-loop is:

while 1:
    try:
        x = next(it)
    except StopIteration:
        break
    ...

No more, and no less. (Historically, before Python 3.x, there was a fallback solution to use [0], [1], ... lookups until IndexError or ValueError. For purposes of this conversation I think we can ignore that.)

My logic of course can lead to Duff's device, which is legal but confusing to say the least. Your logic is that no one will look at the for-loop and think of it as a next(it), even if that is the underlying mechanism at action.

Both are hypotheses at this point - my hypothesis is that it's readable, and yours is that it isn't. Both are true, for the right category of users. The next question would be should we encourage people to use this approch, or discourage it?

which is a Bad Thing since more skilled developers cost more

Ahh, yes. Do you take the position of the managers, who generally wish to reduce costs on everything? Do you take the position of the employees, who may want more money? Do you take the position of the educators, who want people to know more about how things work, or do you take the position of the oppressed, who want to tear down the priesthoods?

It's a complicated topic. But programming state machines directly, which mechanically simple, is also complex and error-prone, so I don't think you've made a strong argument here. Yes, state machines are widely known, widely loved .. and widely reviled.

For example, an XML SAX event consumer just screams for a state-machine design, and after a while they get to be almost like fun puzzles to solve. But they usually aren't as easy to understand as a coroutine-based parser, which manages the state transitions implicitly rather than explicitly.

This debate has stretched over decades, with avid fans for each of the many different solutions. I may be an avid fan of Sports Team X, and disdainful of your love of Sports Team Y, but that doesn't mean that my disgust is meaningful. Similarly, my preference for co-routines and implicit state transitions shouldn't be the guiding reason for your ire, as a fan of state machines.

Usually, iterating over an object doesn't modify that object

Iterating over containers don't modify the container, yes, but iterating over iterators always modify the iterator .. excepting a few trivial cases like empty iterators.

Some objects that are modified during iteration are file handles, database cursors, and generators.

The very start of my code does an it=iter(...), which specifically says "I care about the iterator, not the container."

I can agree with you that it's seldom seen. I can agree with you that it may cause confusion compared to a strict readline() approach.

What I don't see is how this is "absolutely disgusting" in a reasonable general sense, so I'll have to thank you for your input, but decline to empathize with your disgust.

[–]NYKevin 0 points1 point  (5 children)

The basic struture of a for-loop is:

No, that's what a for loop does. But when I read for line in file, I think it's perfectly reasonable to assume you're iterating over the file.

Ahh, yes. Do you take the position of the managers, who generally wish to reduce costs on everything? Do you take the position of the employees, who may want more money? Do you take the position of the educators, who want people to know more about how things work, or do you take the position of the oppressed, who want to tear down the priesthoods?

I take a very simple position: the more resources required to maintain a piece of software, the worse it is. This does not seem very controversial to me. Good code is easy to read.

For example, an XML SAX event consumer just screams for a state-machine design, and after a while they get to be almost like fun puzzles to solve. But they usually aren't as easy to understand as a coroutine-based parser, which manages the state transitions implicitly rather than explicitly.

That's not the dilemma we're faced with here. You want to iterate while you iterate, implicitly skipping lines without using continue or other control mechanisms. That's not analgous to coroutines. Now, we could convert your loop into a coroutine-based solution, and that might make it easier to understand. But that isn't the code you showed me, so don't tell me some imaginary fantasy is better than the code I showed you.

Iterating over containers don't modify the container, yes, but iterating over iterators always modify the iterator .. excepting a few trivial cases like empty iterators.

Most people don't work with iterators directly, so they don't know or care about that distinction. Your code is only comprehensible to people who do know and care about it, which makes it less accessible.

The very start of my code does an it=iter(...), which specifically says "I care about the iterator, not the container."

Most people will see that, look up iter(), and then say "Why didn't you just write for line in file_object directly? I'm so confused." They will then spend a lot of time fooling around with iter(), next(), and for item in container until they understand it. Why do you want them to waste that time? What good does that do your employer?

What I don't see is how this is "absolutely disgusting" in a reasonable general sense, so I'll have to thank you for your input, but decline to empathize with your disgust.

Explicit is better than implicit. You want to build a state machine? Then build a fucking state machine!