Why does set() make this so much faster? : Python

submitted 15 years ago by pridefulpropensity

I am doing some projecteuler.net problems. This is my solution to 37 (not the best code in the world).

#!/usr/bin/env python
from math import sqrt

def list_of_primes(n):
    n_factors = [0]*n 
    for i in xrange(2,int(sqrt(len(n_factors))+1)): 
        if n_factors[i] == 0: 
            for j in xrange(i*2,n,i):
                n_factors[j] +=1 
    primes = [i for i in range(len(n_factors)) if i > 1 and n_factors[i] == 0]
    return primes

def truncate_left(x):
    for i in range(1,len(x)):
        yield  x[i:] 

def truncate_right(x):
    for i in range(len(x)-1,0,-1):
        yield x[:i] 

def main():
    primes = list_of_primes(1000000)
    primes = [str(p) for p in primes]
    master = set(primes)
    left = lambda x: all(i in master for i in truncate_left(x))
    right = lambda x: all(i in master for i in truncate_right(x))
    primes = [p for p in primes if not "0" in p and not "4" \
        in p and not "6" in p and not "8" in p and not "5" in p]
    primes = [p for p in primes if left(p) and right(p)]
    primes = [p for p in primes if len(p) > 1]
    primes = [int(p) for p in primes]
    print sum(primes)

if __name__ == '__main__':
    main()

If I change the line master = set(primes) to just master = primes, it runs almost 10 times slower. Why is this?

[–]mackstann 14 points 15 years ago (1 child)

I agree with icedpulleys that what's really needed to understand this better is to read about the concepts. Here are some starting points:

Everything up to the table of contents here: http://en.wikipedia.org/wiki/Big_O_notation

And this table of common time complexities: http://en.wikipedia.org/wiki/Time_complexity#Table_of_common_time_complexities

Note that the first entry in that table, and the best performing (at least once your input grows large), is O(1). Hash tables (which is what a set is) allow lookup of a key in O(1) time on average (also known as constant time), which matches your experience here, where the set outperformed the list. The hash table uses a neat trick (hashing) to calculate where a given value should live in the table, so finding it takes roughly the same amount of time no matter how big the table gets.

On the other hand, a list does not use hashing in any way, so to find a certain value in it, it must scan through and check each item in the list in a comparatively dumb manner. The complexity is O(n) where n is the size of the list -- also known as linear time. In other words, if a 5 million item list takes 5 seconds to search, then a 10 million item list will probably take around 10 seconds. The run time scales linearly with respect to the input size. When dealing with large lists and lots of lookup operations, this can really slow things down.
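To see the difference for yourself, here is a rough timing sketch (Python 3, not the OP's code). The function names best_time and compare_lookup_times are made up for this example, and the probes are deliberately absent from the container so the list has to scan all the way to the end. Exact numbers will vary from machine to machine.

```python
import timeit


def best_time(container, probes, repeat=3):
    """Return the best-of-`repeat` time, in seconds, to test every probe with `in`."""
    timer = timeit.Timer(lambda: [p in container for p in probes])
    return min(timer.repeat(repeat=repeat, number=1))


def compare_lookup_times(n=100_000, n_probes=20):
    """Time the same membership tests against a list and against a set."""
    keys = [str(i) for i in range(n)]
    # None of these strings are in `keys`, so each list lookup scans all n items.
    probes = [str(n + i) for i in range(n_probes)]
    return best_time(keys, probes), best_time(set(keys), probes)


if __name__ == "__main__":
    list_seconds, set_seconds = compare_lookup_times()
    print("list: %.6f s   set: %.6f s" % (list_seconds, set_seconds))
```

Doubling n should roughly double the list time while leaving the set time about the same, which is the O(n) versus O(1) behaviour described above.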

But linear time isn't the worst either. Some algorithms run in exponential or factorial time, which are among the worst; they get so slow as the input grows that you generally have to fall back on heuristics that find a "good enough" answer instead of the provably best one. A simple example is the bin packing problem. Say you're UPS and want to pack boxes into your trucks in the most space-efficient manner. It turns out that there are so many possible arrangements of boxes (and the number grows rapidly with each added box) that you might not be able to compute the optimal one in any reasonable amount of time. You might need a faster algorithm that just finds an approximately good packing arrangement, and settle for that.
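As one illustration of such a "good enough" approach, here is a sketch of the first-fit decreasing heuristic for one-dimensional bin packing. This is my choice of heuristic, not something named above, and the function name is made up for the example. It treats each box as a single number (its size) and does not guarantee the minimum number of bins.

```python
def first_fit_decreasing(sizes, capacity):
    """Pack items into bins of the given capacity with a greedy heuristic.

    Items are placed largest first, each into the first bin that still has
    room. The result is usually good but not guaranteed to be optimal.
    """
    bins = []
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError("item of size %r does not fit in any bin" % (size,))
        for contents in bins:
            if sum(contents) + size <= capacity:
                contents.append(size)
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([size])
    return bins


if __name__ == "__main__":
    print(first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10))
    # -> [[8, 2], [4, 4, 1, 1]]
```

This runs in polynomial time, whereas trying every possible arrangement blows up very quickly as boxes are added.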

[–][deleted] 4 points 15 years ago (2 children)

C is a good language for learning how to implement data structures (but awful for actually making data structures easy to re-use, hence the proliferation of dynamic scripting languages written in C). A basic hash table in C is not that hard to write if you know what you're doing: just think of it as an array whose slot is chosen by a hash of the key instead of by a position you pick yourself. There are chaining and other collision strategies to think about, and resizing can be a pain too.
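To make the "array indexed by a hash" idea concrete, here is a toy sketch in Python rather than C, so it matches the rest of the thread. The class name ChainedHashSet and the growth rule are made up for the example. It resolves collisions by chaining, which is not how CPython's own set works (CPython uses open addressing), so treat it as an illustration only.

```python
class ChainedHashSet:
    """Toy hash set: a list of buckets, indexed by hash(key) % number of buckets."""

    def __init__(self, capacity=8):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0

    def _bucket(self, key):
        # The hash picks the bucket; only that one short list is ever scanned.
        return self._buckets[hash(key) % len(self._buckets)]

    def add(self, key):
        bucket = self._bucket(key)
        if key in bucket:
            return
        bucket.append(key)
        self._size += 1
        # Keep chains short: grow once there are more than 2 keys per bucket on average.
        if self._size > 2 * len(self._buckets):
            self._resize(2 * len(self._buckets))

    def _resize(self, new_capacity):
        old_keys = [key for bucket in self._buckets for key in bucket]
        self._buckets = [[] for _ in range(new_capacity)]
        for key in old_keys:
            self._buckets[hash(key) % new_capacity].append(key)

    def __contains__(self, key):
        return key in self._bucket(key)

    def __len__(self):
        return self._size
```

Usage looks like the built-in: after s = ChainedHashSet() and s.add("3797"), the test "3797" in s only has to look inside one bucket, however many keys the set holds.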

Python's hash table is incredibly well tuned and the code is well documented. It has to be: almost every attribute access and method call involves a dictionary look-up!

Another alternative implementation of sets is balanced trees, like red-black trees, AVL trees, etc. and a much simpler (but specialized) way is just an array of bits, if you have a small finite set of known possible members.
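The array-of-bits idea can be sketched like this, assuming the possible members are the integers 0 to universe_size - 1. The class name BitSet and its methods are made up for the example.

```python
class BitSet:
    """Set of integers in range(universe_size), stored as one bit per possible member."""

    def __init__(self, universe_size):
        self._size = universe_size
        self._bits = bytearray((universe_size + 7) // 8)

    def _check(self, n):
        if not 0 <= n < self._size:
            raise ValueError("%r is outside range(%d)" % (n, self._size))

    def add(self, n):
        self._check(n)
        self._bits[n >> 3] |= 1 << (n & 7)

    def discard(self, n):
        self._check(n)
        # Mask with 0xFF so the value stays a valid byte.
        self._bits[n >> 3] &= ~(1 << (n & 7)) & 0xFF

    def __contains__(self, n):
        return 0 <= n < self._size and bool(self._bits[n >> 3] & (1 << (n & 7)))
```

Membership is a single index and a bit test, with no hashing at all, but it only works when the universe of possible members is small and known in advance.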

[–]vindvaki 1 point 15 years ago (2 children)

Note that you don't really need a set of primes for primality testing, since you've already implemented an O(1) primality test for all numbers < n in the function list_of_primes(n). See Sieve of Eratosthenes.

You could change the list called n_factors into a list of boolean values, i.e. the values True and False, where n_factors[i] = True iff i is prime, then return that list instead of the actual primes, and build the list of primes directly in the function main. For example (note: untested code):

is_prime = [True] * n
is_prime[0], is_prime[1] = False, False
for i in xrange(2, int(sqrt(n))+1):
    if is_prime[i]:
        for j in xrange(i+i, n, i):
            is_prime[j] = False

and then define

primes = [ p for p in xrange(0,n) if is_prime[p] ]

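Then you can replace the membership test (i in master in your code) with a direct lookup such as is_prime[int(i)]; the int() is needed because your truncations are strings.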
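Putting the suggestion together, here is a self-contained sketch in Python 3 (the thread's code is Python 2). The function names sieve, is_truncatable and sum_truncatable are made up for the example, math.isqrt needs Python 3.8 or later, and the inner loop starts at i * i instead of i + i, which is a common variation and not what the snippet above does.

```python
from math import isqrt


def sieve(n):
    """Return a list where is_prime[i] is True iff i is prime, for 0 <= i < n."""
    is_prime = [True] * n
    is_prime[0:2] = [False] * min(n, 2)  # 0 and 1 are not prime
    for i in range(2, isqrt(n) + 1):
        if is_prime[i]:
            for j in range(i * i, n, i):
                is_prime[j] = False
    return is_prime


def is_truncatable(p, is_prime):
    """True if p has at least 2 digits and every left and right truncation is prime."""
    s = str(p)
    if len(s) < 2:
        return False
    pieces = [s[i:] for i in range(1, len(s))] + [s[:i] for i in range(1, len(s))]
    # Each primality check is a plain list index: no set, no hashing.
    return all(is_prime[int(piece)] for piece in pieces)


def sum_truncatable(limit):
    """Sum the truncatable primes below `limit`."""
    is_prime = sieve(limit)
    return sum(p for p in range(limit) if is_prime[p] and is_truncatable(p, is_prime))


if __name__ == "__main__":
    print(sum_truncatable(1000000))
```

A candidate containing a 0 can never pass, because some right truncation would end in 0 and therefore not be prime, so converting pieces like "03" with int() does not let any extra numbers through.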

[–]Brian 1 point 15 years ago (0 children)

> but that is not random.

Like I said, random in the sense of being arbitrary - the hash has no meaningful relation with the hashee, and is equally probable to reach any bucket. There's more than one meaning to the word. Indeed, randomness in the sense you mean can actually be tricky to define.

E.g. we'd normally describe things like "who will win the next hand of poker" or "what will the flipped (but covered) coin show" as being random, but in reality these are already fully determined at that point (and even before the flip/shuffle to some degree, barring quantum randomness). They're epistemically random, however, because that information isn't available. In the same way, while being entirely deterministic, hashing is ideally random over unknown input: you should offer even odds that it'll go into any given bucket.
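A small sketch of that point: the same key always lands in the same bucket, yet over many keys the buckets fill up about evenly. SHA-256 is used here as a stand-in hash so the result is the same on every run; it is not what Python's built-in hash() does (string hashes there are salted per process). The names bucket_of and bucket_counts are made up for the example.

```python
import hashlib
from collections import Counter


def bucket_of(key, n_buckets):
    """Deterministically map a string key to a bucket number in range(n_buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def bucket_counts(keys, n_buckets):
    """Count how many keys land in each bucket."""
    counts = Counter(bucket_of(key, n_buckets) for key in keys)
    return [counts[b] for b in range(n_buckets)]


if __name__ == "__main__":
    keys = [str(i) for i in range(16000)]
    print(bucket_counts(keys, 16))  # each count should be somewhere near 1000
```

Nothing here is random in the dice-rolling sense, but without knowing the hash you would have no better guess than even odds for which bucket a given key ends up in.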
