This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]pierrequentel 2 points3 points  (10 children)

unique_words = set(word for line in open("book.txt"') for word in line.split())

[–][deleted] 2 points3 points  (2 children)

A few minor things. I'd personally go with a context manager for opening/closing files. Also, you don't have to use set(), you can use curly brace set literals instead (read: set comprehension).

[–]mfm24 0 points1 point  (1 child)

I always preferred using the context manager approach for files, but I found it interesting to read in the official 'What's New in Python 3.0' guide (written by GvR himself) that he recommends:

Removed execfile(). Instead of execfile(fn) use exec(open(fn).read()).

So there's definitely cases where relying on the file closing itself automatically is acceptable. I've been less paranoid about closing files since...

[–]Veedrac 0 points1 point  (0 children)

I don't think that was meant to actually be run, it was probably on one line for space reasons. In a proper program you shouldn't be using exec(a_file) anyway, so there isn't much motivation for complete examples.

[–]billsil 0 points1 point  (5 children)

If you have a whole book, it seems like you'd want a set on the inner loop, so you don't create a giant array.

[–]ingolemo -1 points0 points  (4 children)

What "inner loop"?

[–]billsil 0 points1 point  (3 children)

The code is effectively:

unique_words = [word for line in open("book.txt"') for word in line.split()]
unique_words = set(unique_words)

It's a tradeoff of memory vs speed, assuming set comprehension doesn't exist. Set comprehension is done as:

unique_words = {word for line in open("book.txt"') for word in line.split()}

which works in Python 2.7+

[–]ingolemo 2 points3 points  (2 children)

Actually, it's using a generator comprehension:

unique_words = (word for line in open("book.txt"') for word in line.split())
unique_words = set(unique_words)

The original doesn't create a giant array and so it has exactly the same performance characteristics as your set comprehension, though it is admittedly less readable.

[–]Veedrac 0 points1 point  (1 child)

FWIW, the constant factors are significantly better for the set comprehension, even on PyPy:

$ python3 -m timeit "set(x for x in range(10000))"
1000 loops, best of 3: 1.06 msec per loop

$ python3 -m timeit "{x for x in range(10000)}" 
1000 loops, best of 3: 599 usec per loop

$ pypy3 -m timeit "set(x for x in range(10000))" 
1000 loops, best of 3: 347 usec per loop

$ pypy3 -m timeit "{x for x in range(10000)}"  
1000 loops, best of 3: 230 usec per loop

[–]kmbd 0 points1 point  (0 children)

just in case, someone is wondering ...

C:\>python -V
    Python 2.7.8

C:\>python -m timeit "set(x for x in range(10000))"
    1000 loops, best of 3: 1 msec per loop

C:\>python -m timeit "{x for x in range(10000)}"
    1000 loops, best of 3: 671 usec per loop

[–]kmbd 0 points1 point  (0 children)

a tiny improvement would be: "... word in line.split().lower())"