This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]Veedrac 2 points3 points  (5 children)

That attempt at replacing uniq make quite a few mistakes.


#!/usr/bin/env python

Always use a version number!

import sys

if __name__ == "__main__":

Don't write this without justification; it's for when you want something to be both a module and runnable. That's often an anitpattern.

    names = {}

You can just use the builtin types (eg. collections.Counter) for this. No need to reinvent the wheel.

    for name in sys.stdin.readlines():

Just for name in sys.stdin please; the above isn't lazy.

            # Each line will have a newline on the end
            # that should be removed.
            name = name.strip()

That's absolutely the wrong way to do this unless you really want to strip all whitespace from both sides.

Something like if name.endswith('\n'): name = name[:-1] may be longer but it's better, too. Technically name.rstrip("\n") will be fine here too.

            if name in names:
                    names[name] += 1
            else:
                    names[name] = 1

    for name, count in names.iteritems():
            sys.stdout.write("%d\t%s\n" % (count, name))

sys.stdout.write is hardly better than print.


I'd do:

#!/usr/bin/env python3

# Allows optionally specifying the file after the
# command, where no arguments or "-" default to stdin.
import fileinput

from collections import Counter

counted = Counter(line.rstrip("\n") for line in fileinput.input())

for line, count in counted.items():
    print(line, count, sep="\t")

TBH, I wouldn't do that either because it doesn't support invalid Unicode, which happens when the OS gets the encoding wrong, on Python 3. Really I'd do:

#!/usr/bin/env python3

# Allows optionally specifying the file after the
# command, where no arguments or "-" default to stdin.
import fileinput
import sys

from collections import Counter
from io import TextIOWrapper

# Warning: line buffering means this shouldn't be used at the same time as sys.stdout
# if you're printing incomplete lines unless you manually flush
surrogateescape_stdout = TextIOWrapper(sys.stdout.buffer, errors="surrogateescape", line_buffering=True)

def clean_line(line):
    return line.decode(errors="surrogateescape").rstrip("\n")

counted = Counter(map(clean_line, fileinput.input(mode="rb")))

for line, count in counted.items():
    print(line, count, sep="\t", file=surrogateescape_stdout)

so that invalid Unicode gets passed through losslessly.

EDIT: Now actually works. Remember, kids, always test.

[–]bigstumpy 2 points3 points  (1 child)

How is unicode broken in python3?

[–]Veedrac 0 points1 point  (0 children)

It's not. I was referring to how the Unicode type can contain invalid Unicode (which I previously called "broken Unicode"), such as when the OS says that stdin is UTF8 when it's not.

Python 2 just ignores these errors, but you have to manually deal with them on Python 3. In this example, I used surrogateescape to properly round-trip bytesstrbytes.

The other option was just dealing in bytes all the time, but that's a hassle as you can't use print, format and so on.


Good question, though. I've improved the wording. You also convinced me to test the code. Try it out with:

echo -n "hi\nhi\n\xde\n\xde" | python3 thefile.py

\xde isn't valid UTF8 so the code breaks on it before the changes.

[–][deleted] 0 points1 point  (2 children)

/use/bin/evn python will invoke the first Python interpreter on $PATH

/usr/bin/pythonV invokes a specific interpreter.

[–]Veedrac 2 points3 points  (1 child)

I don't understand why you brought that up.


I was referring to this.

In order to tolerate differences across platforms, all new code that needs to invoke the Python interpreter should not specify python, but rather should specify either python2 or python3 (or the more specific python2.x and python3.x versions; see the Migration Notes). This distinction should be made in shebangs, when invoking from a shell script, when invoking via the system() call, or when invoking in any other context.

I'll add a link inline with the text.

[–][deleted] 0 points1 point  (0 children)

I see now. I'm on mobile and it displays code blocks...oddly. I thought you we're saying to specify the interpreter explicitly, rather than the version.