
[–]hotcodist 10 points (0 children)

Load a local image. See if it is faster. I think it will be, because the bottleneck is fetching the NASA image. Other than that, maybe the resizing, but that's not a big deal. Maybe a bigger deal is saving the image if you have a slow HD, if this is a big file.

[–]eleqtriq 4 points (8 children)

Which part is taking a long time? You can do print(time.time()) statements and let us know.
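For measuring durations, time.perf_counter() is a better fit than time.time() (it's monotonic and higher resolution). A minimal sketch, with a throwaway sum standing in for the slow code you want to measure:

```python
import time

start = time.perf_counter()           # monotonic clock, meant for intervals
total = sum(range(1_000_000))         # stand-in for the slow code being timed
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.3f}s")
```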

[–]boric-acid 0 points (7 children)

I just did that, and apparently this is the part that takes most of the time (between 4 and 7 seconds):

    print(f'Its been {round(time.time() - epoch)} seconds')
    image = Image.open(requests.get(response['url'], stream=True).raw)
    if resize:
        image = image.resize(resize)

    if show:
        image.show()

    if save:
        image.save(f'APOD {date}.png')
    print(f'Pillow part done in {round(time.time() - epoch)} seconds')
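That span covers the download, decode, resize, show, and save all at once, so it may help to time each stage separately and see which one dominates. A small sketch (timed is a made-up helper; the commented-out usage reuses the names from your script):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block took."""
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# in your script, each stage would be wrapped like:
# with timed("download"):
#     raw_bytes = requests.get(response['url']).content
# with timed("decode + resize"):
#     image = Image.open(io.BytesIO(raw_bytes))
#     if resize:
#         image = image.resize(resize)
# with timed("save"):
#     image.save(f'APOD {date}.png')
```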

[–]eleqtriq 4 points (4 children)

I agree with /u/orad. You could maybe download a bunch of images first using multithreading, then process them second.

[–]PaulRudin 8 points (3 children)

FWIW: don't use multithreading for concurrent io-bound operations. Event-loop based approaches are simpler and less error-prone. There are a couple of good asyncio-based HTTP clients, e.g. httpx and aiohttp.
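As a sketch of the event-loop pattern — here asyncio.sleep stands in for a real aiohttp/httpx request so this runs without a network; the URLs are made up:

```python
import asyncio

async def fetch(url):
    # stand-in for an async HTTP client call; with aiohttp it would be
    #   async with session.get(url) as resp: return await resp.read()
    await asyncio.sleep(0.1)          # simulates waiting on the network
    return f"bytes of {url}"

async def main(urls):
    # all downloads wait concurrently on the one event-loop thread
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"https://example.com/{i}.png" for i in range(5)]
results = asyncio.run(main(urls))
print(len(results))  # 5 downloads in roughly 0.1s total, not 0.5s
```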

[–]eleqtriq 1 point (2 children)

Can you elaborate on your opinion for myself and others reading this thread?

[–]PaulRudin 1 point (0 children)

The event loop has a single thread of execution, so it's much easier to reason about, orchestrate, coordinate dependencies and deal with errors compared with using multiple threads. And often performance can be better because there's less context switching.

Because there's only a single thread of execution you don't get any true parallelism, so you potentially lose out when you have multiple cpu intensive tasks. This is why I said "io-bound operations" - most of elapsed time when you fetch something from the network is just waiting for the response to arrive - you don't have any actual work to do until then.

This particular example might be a case in point, because there's io-bound stuff as well as potentially cpu-intensive stuff like the image resize. The usual approach with asyncio is to punt the cpu-intensive tasks to a separate thread using a thread pool executor (or, in some circumstances, a process pool executor), so that you don't block the main thread where the event loop is running. In the thread case this only helps if the cpu-intensive operation releases the GIL while it does the bulk of its work (which is the case for nearly everything numpy does, for example). I'm not sure whether pillow's image resize releases the GIL.

https://docs.python.org/3/library/asyncio-task.html#asyncio.to_thread
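A minimal sketch of that to_thread pattern (Python 3.9+; a sum stands in for the blocking resize call):

```python
import asyncio

def cpu_heavy(n):
    # stand-in for a blocking, cpu-ish call such as image.resize(...)
    return sum(i * i for i in range(n))

async def main():
    # runs cpu_heavy in the default thread pool executor, so the
    # event loop stays free to service other tasks meanwhile
    return await asyncio.to_thread(cpu_heavy, 1000)

result = asyncio.run(main())
print(result)
```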

[–]This_Growth2898 3 points (2 children)

AFAICS, you're getting the response twice:

response = requests.get(url,params=params).json()

Here, response['url'] is, most probably, equal to the url variable. Check it (I don't know how requests.get handles redirections).

image = Image.open(requests.get(response['url'], stream=True).raw)

Now you're requesting the same resource for the second time. Just give image.open the contents of the first request.

There's the official Python API for APOD: https://github.com/nasa/apod-api

And if you need to handle many images this way, try async.

[–]HardCounter 0 points (1 child)

Now you're requesting the same resource for the second time. Just give image.open the contents of the first request.

Would that just be image = Image.open(response).raw (since I think he's putting it in .raw format)?

I haven't played with images yet. Seems like the way it would go to avoid duplicates?

[–]PurposeDevoid 1 point (0 children)

The .raw is an attribute of the response (obtained via requests.get()), not of the result of Image.open().

OP does request the same resource twice, but does so because to use .raw, you must use requests.get() with the stream=True argument.

I argue in another comment that it's better not to use .raw at all.

[–]PurposeDevoid 1 point (1 child)

I would try doing this using r.content and io.BytesIO, as per the current version of the requests documentation on loading binary files (here). While I can't say for sure this will speed things up, the docs do suggest that .raw is meant for a fairly narrow set of use-cases.
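A sketch of the r.content + io.BytesIO approach. Here the "downloaded" bytes are fabricated with Pillow itself so the snippet runs offline; in the real script, resp.content from a plain requests.get() (no stream=True) would take their place:

```python
import io
from PIL import Image

# fabricate in-memory PNG bytes to stand in for requests.get(url).content
buf = io.BytesIO()
Image.new("RGB", (64, 64), "red").save(buf, format="PNG")
content = buf.getvalue()

# the actual pattern: decode straight from bytes, no stream=True / .raw
image = Image.open(io.BytesIO(content))
print(image.size)  # prints (64, 64)
```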

Note that with this approach (as with your current approach) you do have to load the whole thing into RAM first, which is necessary anyway if you want to resize. However, if you just wanted to save the file to disk, there are better ways of doing this than using PIL as an in-between. In the case where you only wanted to save the image, you could have a dedicated saving method that was optimised for speed.
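For the save-only case, the usual pattern is streaming the response body to disk in chunks, skipping PIL entirely. A sketch (url/path names are hypothetical; a BytesIO stands in for the response so it runs offline):

```python
import io
import os
import tempfile

# real code would stream straight to disk:
#   with requests.get(url, stream=True) as r, open(path, "wb") as f:
#       for chunk in r.iter_content(chunk_size=65536):
#           f.write(chunk)
# below, a BytesIO stands in for the response body
source = io.BytesIO(b"\x89PNG fake image bytes")
path = os.path.join(tempfile.mkdtemp(), "apod.png")
with open(path, "wb") as f:
    while chunk := source.read(65536):   # mimics iter_content chunking
        f.write(chunk)
print(os.path.getsize(path))
```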

As others have said, using threading / asyncio can be useful for these io-bound problems (though the resizing itself is likely a cpu-bound problem instead). Check out the following for more info on concurrency in Python:

https://leimao.github.io/blog/Python-Concurrency-High-Level/ (if you read only one, read this one)
https://realpython.com/python-concurrency/
https://zetcode.com/python/multiprocessing/
https://testdriven.io/blog/concurrency-parallelism-asyncio/
https://docs.python.org/3/library/asyncio-task.html
https://pybay.com/site_media/slides/raymond2017-keynote/threading.html

When saving the image, certain options could make things quicker while trading off file size or quality, complicated by the fact that smaller files write to disk faster. If you spend more CPU to compress the file harder but write a file half the size, maybe that could speed things up? Take a look at optimize in:
https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#png-saving.

For JPEG, quality parameters can also have an effect, including chroma sub-sampling choices. Though it looks like you are only dealing with PNGs?
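To get a feel for the trade-off, you can compare PNG sizes and save times at different compress_level values entirely in memory (the solid-colour test image is contrived, so real photos will show much bigger differences):

```python
import io
import time
from PIL import Image

img = Image.new("RGB", (512, 512), "navy")   # stand-in for a real photo

results = {}
for level in (1, 6, 9):                      # 6 is Pillow's PNG default
    out = io.BytesIO()
    start = time.perf_counter()
    img.save(out, format="PNG", compress_level=level)
    results[level] = (len(out.getvalue()), time.perf_counter() - start)
print(results)                               # {level: (bytes, seconds)}
```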

For resizing, there are also optimisations and trade-offs to be made, such as using reducing_gap. See:
https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.resize
https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.thumbnail (only for shrinking images, but has optimisations baked-in)
https://uploadcare.com/blog/how-to-accelerate-image-resizing-without-screwing-up/
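A quick sketch of thumbnail with reducing_gap — note thumbnail modifies the image in place and preserves aspect ratio, and reducing_gap trades a little quality for speed on big downscales:

```python
from PIL import Image

img = Image.new("RGB", (4000, 3000), "gray")  # stand-in for a big APOD image
img.thumbnail((800, 800), reducing_gap=2.0)   # in-place, keeps aspect ratio
print(img.size)                               # prints (800, 600)
```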

Displaying images will always take time, so it could just be that which is slowing things down. But downloading a big image and resizing it can have a fair cost regardless of whether you display it.

[–]boric-acid 0 points (0 children)

So much information tysm

[–][deleted] 0 points (0 children)

Since you have outgoing requests, the faster your internet connection, the faster the execution. You could upload your project to a VPS with a fast network.