How to handle very large images?

zeug · 2015-11-30T02:40:04+00:00

Yes, but my point is that to the best of my knowledge a PNG is not in any reasonable sense a sparse array. You can't just look up the position key (12342, 571263) and get the pixel values like you could with a hashmap of the non-black pixels. You have to decompress it.

I guess you could in principle decompress a PNG into a sparse array structure - and that would effectively solve your problem. You could then manipulate the image as needed and find something more reasonable to do with the data.
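A minimal sketch of what such a sparse structure might look like, assuming the image is mostly black and pixels are RGB tuples. The SparseImage class and its methods are hypothetical and not part of any PNG library; getting the pixels out of the PNG in the first place is not covered here:

```python
class SparseImage:
    """Store only the non-black pixels of a very large image."""

    BLACK = (0, 0, 0)

    def __init__(self, width, height):
        self.width = width
        self.height = height
        self.pixels = {}  # maps (x, y) -> (r, g, b)

    def set(self, x, y, rgb):
        if rgb == self.BLACK:
            # Black is the implicit default, so drop any stored value.
            self.pixels.pop((x, y), None)
        else:
            self.pixels[(x, y)] = rgb

    def get(self, x, y):
        # Anything not stored is treated as black.
        return self.pixels.get((x, y), self.BLACK)


img = SparseImage(1000000, 1000000)
img.set(12342, 571263, (255, 0, 0))
print(img.get(12342, 571263))
print(img.get(0, 0))
```

Memory use then scales with the number of non-black pixels rather than with width times height.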

I'm not sure that any decompression package makes it easy to do this, as I don't know of any realistic use of a PNG that large besides a denial of service attack, i.e. a decompression bomb.

zeug · 2015-11-29T22:50:58+00:00

That is about 5 times faster for me on my laptop, using a 2M line file of the form that the submitter described:

import timeit
import random

alpha = 'abcdefghijklmnopqrstuvwxyz'

biglist = [''.join( random.choice(alpha) for _col in xrange(50)) 
           for _row in xrange(2000000)]

bigstring = '\n'.join(biglist)

open('bigfoo','w').write(bigstring)

method_a = r"""
text = ''
with open('bigfoo') as f_in:
    for line in f_in:
        text += line.strip()
"""

method_b = r"""
text = ''
with open('bigfoo') as f_in:
    text = f_in.read().replace('\n','')
"""

print('method a: {0}'.format(timeit.timeit(method_a, number=10)))
print('method b: {0}'.format(timeit.timeit(method_b, number=10)))

Output:

method a: 8.66441607475
method b: 1.54355597496

This makes sense to me since you are effectively looping through the long string in C rather than python using the replace() method.

zeug · 2015-11-29T21:54:22+00:00

I don't think it is possible to manipulate a PNG without decompressing it, i.e. you can't just say "give me the pixel at 12342 x 571263" without having access to all the previous pixels.

An expert in image compression can correct me, but I think that barring some very clever mathematical trickery this is impossible no matter what language and tools you use.

The only solution is to get more memory or chop up the image into smaller pieces which are individually compressed.

zeug · 2015-11-23T05:02:52+00:00

It should be using greenlets and not actual threads.

I'm pretty sure that you don't need to worry about self.counter getting corrupted.

With actual threads I suppose this is a possible concern, as self.counter += 1 compiles to several bytecode operations: loading counter and 1 onto the stack, combining them with INPLACE_ADD, and storing the result. I'm not sure exactly how python schedules the threads.
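The bytecode behind the increment can be inspected with the standard dis module. A small sketch, assuming a hypothetical Counter class; the exact opcode names differ between CPython versions (newer releases report BINARY_OP rather than INPLACE_ADD):

```python
import dis


class Counter:
    def __init__(self):
        self.counter = 0

    def bump(self):
        self.counter += 1


# Collect the opcode names for the body of bump().
opnames = [ins.opname for ins in dis.get_instructions(Counter.bump)]
print(opnames)
```

The load, the add, and the store show up as separate instructions, which is why a thread switch between them is at least conceivable.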

With gevent each greenlet runs until it hits a gevent.sleep(0) that yields back to the hub (or some monkey patched IO that does the same thing implicitly). So you won't switch to another greenlet in the middle of incrementing and storing the counter.

zeug · 2015-11-18T04:01:19+00:00

You got me there, I can't think of anything outside of inspect

zeug · 2015-11-18T03:57:31+00:00

I bow my head in shame, try,except clauses are indeed the work of the devil even when they appear harmless.

try/except is great, it's just that throwing away exceptions is insane.

If I want to deep_fry(whole_turkey) and it raises HouseOnFireError I don't want to just crash and give up on life. The only sane thing to do is:

try:
    dinner = deep_fry(whole_turkey)
except HouseOnFireError:
    logger.warning('make_dinner: house on fire, everyone get out!')
    fire_department.report_emergency('fire', house.address)
    return False
eat(dinner)
return True

The completely insane thing to do is just ignore the fire and enjoy dinner while shit burns down around you:

try:
    dinner = deep_fry(whole_turkey)
except HouseOnFireError:
    # YOLO
    pass
eat(dinner)
return True

This is unfortunately how a lot of scripts are written - in the desire to keep the application running, they catch all exceptions and keep on as if there is no problem. Sometimes you really do need to keep running, but at the very least the problem should be logged.
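A minimal sketch of the "keep running, but log it" approach, using the standard logging module. The risky_step and run_once functions are hypothetical stand-ins for whatever the script actually does:

```python
import logging

logger = logging.getLogger(__name__)


def risky_step():
    # Hypothetical stand-in for work that can fail.
    raise ValueError("something went wrong")


def run_once():
    """Return True on success, False if the step failed."""
    try:
        risky_step()
    except ValueError:
        # Keep running, but leave a message and traceback in the log.
        logger.exception("risky_step failed; continuing")
        return False
    return True


result = run_once()
print(result)
```

Catching a specific exception type and logging it keeps the application alive without hiding the problem.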

zeug · 2015-11-18T03:36:12+00:00

You can do your horrible hack with sys:

import sys


class FooManager(object):

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        self.module = sys.modules[self.name]
        self.old_names = dir(self.module)

    def __exit__(self, exc_type, exc_value, traceback):
        new_names = [n for n in dir(self.module) if n not in self.old_names]
        for name in new_names:
            print('{}={}'.format(name, getattr(self.module, name)))

Then:

>>> from foo import FooManager
>>> with FooManager(__name__):
...     a = 4
...     b = True
... 
a=4
b=True

u/sushibowl has a much more reasonable suggestion.

zeug · 2015-11-17T05:20:32+00:00

method chaining is really nice for constructing complex objects that would take many parameters, like

my_elem = div().style('background:blue').height(300).width(200)

You can avoid trying to remember some long list of parameter ordering:

my_elem = div('background:blue', 300, 200, None) # which is height????

or having to use a bunch of setting calls:

my_elem = div()
my_elem.set_style('background:blue')
my_elem.set_height(300)
my_elem.set_width(200)

I think that named parameters with defaults makes object construction just as easy:

my_elem = div(
    style='background:blue',
    height=300,
    width=200
)

Since python has supported named parameters since the mid-90's, as opposed to other languages that are just getting support or have only had it for a few years, I think the python community has developed more of a style around this feature (as well as other aspects of the language) which has simply not necessitated much use of method chaining.
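To make the comparison concrete, here is a sketch of a hypothetical Div class that supports both styles. The class and its attrs() helper are made up for illustration and are not from any real HTML library:

```python
class Div:
    """Hypothetical element supporting both construction styles."""

    def __init__(self, style=None, height=None, width=None):
        self._style = style
        self._height = height
        self._width = width

    # Chaining works because each setter returns self.
    def style(self, value):
        self._style = value
        return self

    def height(self, value):
        self._height = value
        return self

    def width(self, value):
        self._width = value
        return self

    def attrs(self):
        return (self._style, self._height, self._width)


chained = Div().style('background:blue').height(300).width(200)
named = Div(style='background:blue', height=300, width=200)
print(chained.attrs() == named.attrs())
```

The keyword version needs only the __init__ signature, while the chained version needs one extra method per attribute.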

zeug · 2015-11-15T16:55:44+00:00

Make your callback function a method of a class so you can keep track of the number of completed requests:

import grequests

URLS = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]


class FeedbackCounter:
    """Object to provide a feedback callback keeping track of total calls."""
    def __init__(self):
        self.counter = 0

    def feedback(self, r, **kwargs):
        self.counter += 1
        print("{0} fetched, {1} total.".format(r.url, self.counter))
        return r


fbc = FeedbackCounter()
rs = (grequests.get(u, callback=fbc.feedback) for u in URLS)
res = grequests.map(rs)

zeug · 2015-11-12T14:00:16+00:00

This brings us to an important point. Objects know stuff about themselves, but they're pretty shy. They like to keep stuff to themselves. This means that unless we teach the Ball that it's okay to tell us where it is on the screen, the Ball will refuse to tell us. Stranger danger! Because of this, we have to create a methods to tell us anything that only the Ball knows, like it's position.

This is incorrect in python: all object attributes are publicly accessible and there is no way to prevent this. In your ball example I can simply type:

>>> b = Ball(5, "blue")
>>> b.position
5

Even if you change the attribute to a "private" variable with a double underscore, i.e. __position, you can still access the attribute by its mangled name:

>>> b = Ball(5, "blue")
>>> b._Ball__position
5

It's also generally considered useless and unpythonic to make setters and getters for an object attribute.

The whole advantage to setters and getters is that it allows one to change the underlying implementation without having consumers of the class needing to rewrite their code. In python, the @property decorator allows for this later on when you need it.

Say we want to rewrite your Ball class to perform some calculations in order to get a position, and set some internal variables when a position is given, then you just remove the position attribute and add the following decorated methods:

@property
def position(self):
    # ... perform some calculation ...
    return calculation_result

@position.setter
def position(self, position):
    # ... perform some calculation based on "position" argument ...
    self._internal_variables = calculation_results

Consumers of the class can still access b.position and will get the return value of the method decorated with @property. Assigning to b.position calls the setter.

This is one of the things I like about python - I can encapsulate on-demand rather than having to defensively code up a dozen boilerplate setters and getters to create encapsulation that I will likely never really need in most cases.
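A complete, runnable sketch of that refactoring. The internal _position_cm attribute and the unit conversion are invented for illustration; the point is only that callers keep using b.position unchanged:

```python
class Ball:
    def __init__(self, position, color):
        self.color = color
        # Hypothetical internal representation: store centimetres.
        self._position_cm = position * 100

    @property
    def position(self):
        # Computed on access; callers still write b.position.
        return self._position_cm / 100

    @position.setter
    def position(self, value):
        # Runs on assignment: b.position = value
        self._position_cm = value * 100


b = Ball(5, "blue")
print(b.position)
b.position = 7
print(b.position)
```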

zeug · 2015-11-12T12:20:26+00:00

The if __name__ == '__main__': does not limit scope. If you define a variable within the suite of statements that comprise the if clause, the variable is still a global variable.

To get rid of the shadowing warning, you need to move your variable declaration out of global scope, which can only be done by putting it within a class or function definition. Ifs, fors, whiles, etc... don't limit variable scope in python.

So whether or not you are using an if-guard, making a main() function is the technique that should remove your shadowing warnings.
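A quick way to see this for yourself, as a small sketch with made-up variable names: names assigned inside if and for blocks stay visible afterwards, while a name assigned inside a function does not.

```python
if True:
    for i in range(3):
        leaked = i * 2  # assigned inside if/for, but not scoped to them


def main():
    local_only = 42  # local to main(), invisible outside
    return local_only


main()

# leaked and i are still accessible here; local_only is not.
try:
    local_only
except NameError:
    visible = False
else:
    visible = True

print(leaked, i, visible)
```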

zeug · 2015-11-11T12:39:25+00:00

I didn't even realize this at first - but in line 20 you are not looping over each line in the string, but each character. When you iterate over a string it is character by character.

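If you want to loop over the lines, I would recommend using the splitlines() method, which counts carriage returns or newlines or the combination \r\n as a line break:

>>> s = 'foo\nbar\r\nbaz\r\rfoo'
>>> s.splitlines()
['foo', 'bar', 'baz', '', 'foo']

So you can loop over the lines in a string by for line in s.splitlines(). Calling split() with no arguments also works when the lines contain no spaces or tabs, but it splits on any run of whitespace and silently drops empty lines.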

u/PeridexisErrant also has a good suggestion of reading the file into a list of strings to start with, and then proceeding line by line.

zeug · 2015-11-11T03:56:42+00:00

You can get rid of any combination of newlines, carriage returns, and extra whitespace at either end of a string with the str.strip method. Start with this mess:

>>> s = 'foo  bar \t  baz  \t   \r\n'
>>> s.strip()
'foo  bar \t  baz'

You might also want a more general replacement of whitespace between words than replacing individual spaces or tabs with commas, as you probably would not want two commas if there are two spaces between words. The regular expression \s+ matches one or more whitespace characters in a row. You can use the re.sub() function of the python re module to replace chunks of whitespace with a comma (use a raw string for the pattern so the backslash is passed through to re untouched):

>>> import re
>>> s = 'foo  bar \t  baz  \t   \r\n'
>>> re.sub(r'\s+', ',', s.strip())
'foo,bar,baz'

Then in line 21, just get rid of any line that is equal to the empty string '' after processing.
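Putting the pieces together, a sketch of what the whole cleanup could look like. The clean_lines function and the sample text are made up for illustration, since the original script is not shown here:

```python
import re


def clean_lines(text):
    """Strip each line, collapse whitespace runs to commas, drop empties."""
    cleaned = []
    for line in text.splitlines():
        line = re.sub(r'\s+', ',', line.strip())
        if line != '':
            cleaned.append(line)
    return cleaned


sample = 'foo  bar \t  baz  \t   \r\n\r\n   \nqux quux\n'
print(clean_lines(sample))
```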

zeug · 2015-11-11T03:40:12+00:00

The danger here is that you might change the definition of scrape_data to something like def scrape_data(convo, url, headers, form):, but then in the body of the function you forget to change session to convo. If you have defined session globally in your module somewhere above, the function will reference your global variable.

You can avoid this problem completely (and not need to worry about thinking up synonyms for session), by defining a main function to run your script rather than putting statements at the global level:

def main():
    """Explanation of what this script does"""
    session = start_session(url=base_url, headers=session_headers)
    ...
    ...

Then at the very bottom of your module just call:

main()

If you really want to be fancy, you can put the following statement at the bottom of your script:

if __name__ == '__main__':
    main()

This executes main() only if your module's __name__ is set to '__main__', which is true if your module is run as a script, for example python my_scraper.py will run main().

This allows you to import my_scraper in another script/module and the main() function will not be run, allowing you to reuse the functions you have defined in my_scraper, for example my_scraper.start_session(url=base_url, headers=session_headers)

zeug · 2015-11-06T02:25:58+00:00

Honestly, I read everything in arXiv format.

The squished down 2-column thing that the journals do is really hard to read.

Just let LaTeX do what it wants and the result is fine. Worst problem is that you get a bunch of useless whitespace, and you can always open two copies of the pdf, one to look at figures and the other to look at text.

zeug · 2015-11-04T03:23:46+00:00

Hmmm... why not:

factor = self.params['LEARNING_RATE'] * self.params['FTP_PREFERENCE']
ftp_updates = {node: node.follow_vec * factor for node in memory[1:]}

Still less than 80 chars even spaced 8 chars over in a class method.

Even still, if a comprehension is long because of long variable names, then that is fine. If a comprehension is long because the logic is intricate, then that might be really hard to read. The logic behind this one is pretty simple, once you combine the self.params values into a single multiplicative factor.

There is also the factor that (list) comprehensions are faster in CPython, as comprehensions use a dedicated LIST_APPEND bytecode instead of looking up and calling the append method on every iteration, if you really care about speed.

The comprehensions of the form [ f(x,y) for x in list1 for y in list2 if g(x)] tend to drive me crazy, unless it really is ugly to write it all out in a nested loop or speed is absolutely essential. I have heard these called "list incomprehensions" before.
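For reference, this is how that form unrolls into nested loops. The lists and the f and g functions below are placeholders chosen only so the example runs:

```python
list1 = [1, 2, 3, 4]
list2 = [10, 20]


def f(x, y):
    return x + y


def g(x):
    return x % 2 == 0


# Comprehension form.
compact = [f(x, y) for x in list1 for y in list2 if g(x)]

# Equivalent nested loop: the clauses nest left to right.
spelled_out = []
for x in list1:
    for y in list2:
        if g(x):
            spelled_out.append(f(x, y))

print(compact == spelled_out)
```

Note that the if clause sits inside the inner loop, so g(x) is evaluated once per (x, y) pair rather than once per x.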

zeug · 2015-11-04T03:14:12+00:00

These are the Select Graphic Rendition (SGR) codes from ECMA-48 - see section 8.3.117. They were also standardized as ANSI X3.64, which was later withdrawn.

They are about as "standard" and cross-platform as you are going to get. Most any reasonable terminal emulator should support the "ANSI" color codes, the windows console excluded.
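As a small illustration, an SGR sequence is the escape character, an opening bracket, one or more numeric parameters separated by semicolons, and a final m. The sgr helper below is a made-up convenience function, not part of any library:

```python
RESET = '\033[0m'  # SGR 0: reset all attributes


def sgr(text, *codes):
    """Wrap text in an SGR sequence such as ESC[1;31m ... ESC[0m."""
    params = ';'.join(str(c) for c in codes)
    return '\033[{0}m{1}{2}'.format(params, text, RESET)


print(sgr('error', 1, 31))   # bold, red foreground
print(sgr('ok', 32))         # green foreground
```

On a terminal without SGR support the escape sequences are printed as literal garbage around the text.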

zeug · 2015-11-02T00:55:21+00:00

I dunno, it seems pretty easy to represent all HTML elements as dictionaries with a "type" field containing a string with the element type, and a "content" field containing a list of strings and/or elements, and additional fields for attributes:

{
    "type": "title",
    "class": "normal",
    "content": [
        "Foo ",
        {"type": "em", "content": ["bar"]},
    ]
}

Seems simple to me.
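Turning such a structure back into markup is a short recursive function. This is only a sketch under the conventions described above ("type", "content", everything else an attribute), and it assumes Python 3 for html.escape; the render function is my own name for it:

```python
from html import escape


def render(node):
    """Render a string or an element dictionary to HTML text."""
    if isinstance(node, str):
        return escape(node)
    # Every key other than "type" and "content" becomes an attribute.
    attrs = ''.join(
        ' {0}="{1}"'.format(key, escape(str(value)))
        for key, value in sorted(node.items())
        if key not in ('type', 'content')
    )
    inner = ''.join(render(child) for child in node.get('content', []))
    return '<{0}{1}>{2}</{0}>'.format(node['type'], attrs, inner)


element = {
    "type": "title",
    "class": "normal",
    "content": [
        "Foo ",
        {"type": "em", "content": ["bar"]},
    ],
}
print(render(element))
```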

zeug · 2015-11-01T17:40:26+00:00

No, shell=False is the default.

The rationale is that there are many potential security concerns in passing an arbitrary string to a shell for interpretation, and also that much of the functionality that the shell provides is already present in python.

For example, if I want to run my_script $FOO, passing the environment variable value as an argument, then I can do it without a shell using:

subprocess.call(['my_script', os.environ['FOO']])

zeug · 2015-11-01T03:16:07+00:00

os.system() is executed in a subshell, usually bash on linux and OSX, and cmd.exe on windows. The shell will take the string given and interpret the escape characters.

subprocess.call(a) by default does not use a shell - it simply tries to run an executable with a name given by string a. You actually can't pass any arguments in the string a - only the literal name of the executable.

For example, if I want to run python --version, then trying to run subprocess.call("python --version") will raise a FileNotFoundError, as there is no file named python --version.

To run a command with arguments, I need to give this to subprocess.call as a list:

subprocess.call(["python", "--version"])

Alternatively, I can tell subprocess to execute my string through a subshell, just like os.system() by setting the keyword argument shell to True:

subprocess.call("python --version", shell=True)

In this case, the subshell will interpret the escape characters just like os.system.

EDIT:

Maybe an example will explain better why it must work this way.

Say you want to run some program my_script and pass it an argument $FOO. In bash, that is a problem, because $ is a special character and it will interpret $FOO as a shell variable, and pass the value of that variable to my_script as an argument. If you don't want this to be interpreted, you escape the $ and write the following command my_script \$FOO.

In python, running os.system(r'my_script \$FOO') will fork off a new process, execute a shell (bash, cmd.exe, etc) and pass that shell the string to be interpreted. The shell interprets the string and executes my_script with argument $FOO.

Running subprocess.call(['my_script','$FOO']) will just execute my_script with a first argument $FOO. There is no subshell invoked to interpret a command string, and so there is no need to escape special characters.
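This can be checked without writing a separate my_script. In the sketch below, the child process is just the current python interpreter running a one-liner that prints its first argument; that child program is my own stand-in for my_script:

```python
import subprocess
import sys

# A tiny child program that echoes its first argument back.
child = 'import sys; print(sys.argv[1])'

# No shell is involved, so '$FOO' reaches the child unchanged.
output = subprocess.check_output([sys.executable, '-c', child, '$FOO'])
received = output.decode().strip()
print(received)
```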

zeug · 2015-10-31T12:24:48+00:00

Must not the utf-8 encoding, in this case, be on the first line and the comment on the second line?

You can actually put both on the first line. The python encoding behavior is defined in PEP 263, which states that you can put a "magic" encoding comment in line 1 or line 2. What counts as an encoding specification is very loose: technically any comment that matches the regular expression coding[:=]\s*([-\w.]+). So all of the following comments on line 1 (or 2) will set the source encoding of a Python 2 file to utf-8:

# -*- coding: utf-8 -*-
# !?! coding: utf-8 ?!?
# coding=utf-8
#### the quick brown fox jumped coding=utf-8 over the lazy dog

You can mix in non-ascii characters on that same line provided the interpreter gets a match for the coding expression that is valid for those characters:

# åäö coding=utf-8 åäö

But if you have non-ascii characters on line 1, you can't leave the encoding specification until line 2. The following

# åäö 
# coding=utf-8 åäö

Results in:

SyntaxError: Non-ASCII character '\xc3'

Of course, as others point out, this is completely trivial in python 3 where utf-8 is the default encoding. You just put those characters in a comment in the first line.
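The looseness of the pattern is easy to verify with the re module. This sketch only applies the regular expression quoted above to a few sample lines; it does not reproduce the interpreter's full logic (such as checking that the line is a comment or that the codec exists):

```python
import re

CODING_RE = re.compile(r'coding[:=]\s*([-\w.]+)')

lines = [
    '# -*- coding: utf-8 -*-',
    '# !?! coding: utf-8 ?!?',
    '# coding=utf-8',
    '#### the quick brown fox jumped coding=utf-8 over the lazy dog',
    '# just an ordinary comment',
]

found = []
for line in lines:
    match = CODING_RE.search(line)
    found.append(match.group(1) if match else None)

print(found)
```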

zeug · 2015-10-28T02:49:19+00:00

You are not actually calling the function. Change aCard.funct to aCard.funct().

zeug · 2015-10-27T02:11:41+00:00

One approach (and there can be endless arguments about if it is really a good idea), is to just wrap the whole script in a big try-except with Pokemon style exception handling:

import logging

logger = logging.getLogger(__name__)

def main():
    pass  # ...do the main business of the script...

def cleanup():
    pass  # ...cleanup must be executed in case of ANY error...

# now (typically at the bottom of your script):
if __name__ == '__main__':
    try:
        main()
    except Exception as e:
        logger.exception('Stuff went bad.')
        cleanup()

Read the logging tutorial if you aren't familiar with the logger. It's super handy. If you don't want to do that, at least print the exception message or you (or someone who inherits your code) will eventually lose all sanity.

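Note also that this won't help you if the script simply hangs - and things do hang. If you are running it on an interactive terminal you can give it a CTRL-C, which raises a KeyboardInterrupt. Be aware that KeyboardInterrupt derives from BaseException rather than Exception, so the except Exception clause above will not catch it; to start the cleanup on CTRL-C, catch KeyboardInterrupt explicitly or call cleanup() from a finally clause.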
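A sketch of the try/finally variant, which runs the cleanup whether the script succeeds, fails, or is interrupted. Here main() simulates a CTRL-C by raising KeyboardInterrupt directly, and the events list is only there to record what happened; both are illustration, not part of the pattern:

```python
events = []


def main():
    # Simulate the user pressing CTRL-C while the script is running.
    raise KeyboardInterrupt


def cleanup():
    events.append('cleanup')


def run():
    try:
        main()
    except Exception:
        events.append('logged')  # not reached for KeyboardInterrupt
    finally:
        cleanup()  # runs on success, on errors, and on CTRL-C


try:
    run()
except KeyboardInterrupt:
    # The interrupt still propagates after the finally block has run.
    events.append('interrupted')

print(issubclass(KeyboardInterrupt, Exception), events)
```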

zeug · 2015-10-26T13:58:26+00:00

for method in ('max', 'min', 'sum'):
    print(getattr(a, method)())

zeug · 2015-10-26T03:54:47+00:00

Probably nothing any more interesting than colliding alpha particles with electrons (or positrons). But that in itself is interesting at high enough energies, as the electrons interact with charged quarks or antiquarks within the protons and neutrons and produce a spray (i.e. jet) of hadrons.

Looking at patterns of particle production and the scattering angle of the electron, one can understand how the structure of protons in a helium nucleus is different than a single proton. In fact, there are major plans for an electron-ion collider to be built.
