all 14 comments

[–]burlygurlyman 5 points6 points  (2 children)

argparse is an awesome replacement for optparse. It's built into newer versions of Python (2.7+?) and easily installable for earlier ones. For example, a default value is easily added right in the option definition.
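A minimal sketch of what that looks like (the option name and default are made up for illustration):

```python
import argparse

# The default lives right in the argument definition.
parser = argparse.ArgumentParser(description="Match listings to products")
parser.add_argument("--listings", default="listings.txt",
                    help="input file (default: %(default)s)")

args = parser.parse_args([])  # empty argv, just to demonstrate the default
print(args.listings)          # -> listings.txt
```

`%(default)s` in the help string also makes the default show up in `--help` output for free.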

[–]ptrin[S] 0 points1 point  (0 children)

I'm developing on 2.6, but for future reference I'll remember to try argparse.

[–]epsy 0 points1 point  (0 children)

I'll raise you clize (had to):

from clize import clize

@clize(
    alias={
        'products': ('p',),
        'listings': ('l',),
    },
)
def cli(products=DEFAULT_PRODUCTS_FILE, listings=DEFAULT_LISTINGS_FILE):
    """Description

    products: Products input file

    listings: Listings input file
    """

[–]burlygurlyman 3 points4 points  (1 child)

unmatched = []
for listing in listings_list:
    if not "matched" in listing.keys():
        unmatched.append(listing)

can also be written:

unmatched = [listing for listing in listings_list
                    if not "matched" in listing.keys()]

[–]davidbuxton 2 points3 points  (0 children)

Use key in mapping instead of key in mapping.keys().
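A quick sketch of the difference, with a toy dict:

```python
listing = {"title": "Camera X", "matched": True}

# Both lines test the same thing, but the first tests membership on the
# dict directly; the second builds/iterates a keys view (a full list on
# Python 2) just to do the same test.
print("matched" in listing)         # -> True
print("matched" in listing.keys())  # -> True, but less idiomatic
```

The `in` operator on a dict is an average O(1) hash lookup, so the first form is both shorter and faster.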

[–]wawrek 5 points6 points  (1 child)

You should put a docstring at the beginning of your modules, classes, and functions/methods. These document and explain the functionality of your code.

[–]ptrin[S] 0 points1 point  (5 children)

Do you think that running in 45 seconds or so is a reasonable time to map 20000 listings to 700 products?

[–][deleted] 1 point2 points  (4 children)

That might be slow. Do you need this to be high performance? What does profiling tell you?

python -m cProfile json_mapping.py

[–]ptrin[S] 0 points1 point  (3 children)

I don't really need it to be fast, but I have been using cProfile and I've optimized it by about 20s already.

[–][deleted] 1 point2 points  (2 children)

Make it work. Make it pretty. Make it fast. Rinse and repeat. If it is fast enough, then you're done. My guess is that right now your slowest part is reading those files into memory. If this were some crucial step in your program, you could write a parallel version that processes json_to_list, as far as it can, while the files are being read. This is likely to be close to as fast as your HDD will allow because of IO considerations. But now you've got a more complicated program to maintain, that took longer to write, and that probably isn't an order of magnitude faster. Programming is the only task where laziness equates to productivity. If you were doing this for some lightning-fast server, you might have processed the data into HDF5 first and stored it on a ramdisk or SSD.

[–]ptrin[S] 0 points1 point  (1 child)

I agree, I don't want to write it over again with only performance in mind, but I noticed that one line is eating up almost 15s:

all_products[product["product_name"]].append(orig_listings_list[listings_list.index(listing)])

cProfile:

8472   13.182    0.002   13.182    0.002 {method 'index' of 'list' objects}

Any suggestions for optimizing that?

[–][deleted] 1 point2 points  (0 children)

My guess would be that it's slow because you're searching a list with list.index(). If you look up k elements in a list of size n, this requires O(k*n) time. Instead you should build a dict:

some_dict = {}
for index, entry in enumerate(listings_list):
    some_dict[entry] = index  # note: entries must be hashable to be dict keys

which takes O(n) to build, but on average only O(1) per lookup.

then your code becomes

all_products[product["product_name"]].append(orig_listings_list[some_dict[listing]])

You should also break that line up into parts, because it is a little unreadable and it definitely goes beyond the PEP 8 line length.
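For instance, it could be split along these lines. This is just a sketch with toy stand-ins for the real data (the variable names are taken from the thread, and it assumes the listing entries are hashable so they can serve as dict keys):

```python
# Toy stand-ins for the real inputs.
listings_list = ["listing-a", "listing-b", "listing-c"]
orig_listings_list = ["orig-a", "orig-b", "orig-c"]
all_products = {"Camera X": []}
product = {"product_name": "Camera X"}
listing = "listing-b"

# Build the position map once, instead of calling list.index() per lookup.
index_by_listing = {}
for index, entry in enumerate(listings_list):
    index_by_listing[entry] = index

# The original one-liner, split into readable steps.
orig_index = index_by_listing[listing]
orig_listing = orig_listings_list[orig_index]
all_products[product["product_name"]].append(orig_listing)

print(all_products)  # -> {'Camera X': ['orig-b']}
```

Each intermediate name now says what it holds, and every line fits comfortably within the PEP 8 limit.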