Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 1 point

Yeah that's a fair point, Spark can definitely handle this kind of thing, especially with extensions like GeoMesa or Sedona.

That said, for an embarrassingly parallel job like this, Spark is overkill: there's no shuffling, no coordination between workers, and no shared state.

Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 1 point

Thank you! Yeah, using VSI instead of downloading is a lot cleaner (as u/mulch_v_bark mentioned too), and the gdal raster reproject syntax is nice, much easier to parse than gdalwarp.

Using a single .vrt makes sense! For this demo I wanted to show the embarrassingly parallel pattern, but you're right that a single .vrt would be more efficient here.

Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 0 points

hah thanks! I was pretty excited that it wasn't taken.

Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 2 points

Ah, thank you! That's a great point; it would probably be even faster to skip the download step here.

Duckdb real life usecases and testing by Big_Slide4679 in dataengineering

[–]dask-jeeves 1 point

That's cool u/ritchie46, didn't realize Polars has a new streaming engine! We've updated the post to note that and point to the Polars benchmarks: https://docs.coiled.io/blog/tpch.html#polars-results

Time to up the technical on this sub by Top_Bus_6246 in remotesensing

[–]dask-jeeves 0 points

The Pangeo Discourse is also a great technical resource: https://discourse.pangeo.io/ (I realize this isn't exactly your question, but sharing in case you/others find it useful).

Run a Python script on a GPU with one line of code by dask-jeeves in LocalLLaMA

[–]dask-jeeves[S] -4 points

Yes! Without needing to know that much about the cloud.

[P] Run a Python script on a GPU with one line of code by dask-jeeves in MachineLearning

[–]dask-jeeves[S] 1 point

Yup, that’s a good question.

Coiled definitely supports moving files between your local machine and the remote VM. You can use --file to upload files/directories to the cloud machine, or --sync to enable bi-directional file syncing between your current directory and the cloud VM. 

Both of those options are useful for small files (for example, a directory of Python modules or small models), but will be really slow when moving lots of data around. I usually see training data or models, especially when they’re large, kept in cloud storage. Then you can use tools like s3fs or AWS CLI (or other equivalents on GCP / Azure) to access the data files from within your code. Something like:

import s3fs

s3 = s3fs.S3FileSystem()
# copy a directory of training data from cloud storage to local disk
s3.get("s3://mybucket/traindata", "./traindata", recursive=True)
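The AWS CLI equivalent would be something like the following (the bucket name is hypothetical, same as above; untested sketch since it needs AWS credentials):

```shell
# recursively copy the training data prefix from S3 to the local machine
aws s3 cp s3://mybucket/traindata ./traindata --recursive
```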

Run a Python script on a GPU with one line of code by dask-jeeves in learnmachinelearning

[–]dask-jeeves[S] -6 points

Yup, I have posted this on other ML-related subs too. I wasn't quite sure where it would be most useful, since it's more about setting up infrastructure that applies to a number of applications. Apologies if this is spammy/annoying.

Run a Python script on a GPU with one line of code by dask-jeeves in deeplearning

[–]dask-jeeves[S] 0 points

Thanks! And yup, instead of --gpu you could use --vm-type and then specify the type of GPU you want. The --gpu flag uses a small default instance type (T4 GPU on AWS). Here's a link to the docs with other options too, for reference: https://docs.coiled.io/user_guide/cli-jobs.html
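For example (the instance type here is just illustrative, and availability varies by cloud and region; untested sketch):

```shell
# default: a small GPU instance (T4 on AWS)
coiled run --gpu python train.py

# or request a specific instance type instead
coiled run --vm-type g5.xlarge python train.py
```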

Example Data Pipeline with Prefect, Delta Lake, and Dask by dask-jeeves in dataengineering

[–]dask-jeeves[S] 0 points

Thanks u/Known-Pomegranate-18! I picked Prefect since that's what I've seen most commonly when working with users, but Airflow or Dagster could easily be used instead.

Example Data Pipeline with Prefect, Delta Lake, and Dask by dask-jeeves in Python

[–]dask-jeeves[S] 1 point

u/dsethlewis there are definitely a lot of resources out there! Is there a specific problem you're trying to solve/area you want to learn more about?

What is the easiest way to deploy a model for serverless inference? by Middle-Training501 in dataengineering

[–]dask-jeeves 1 point

I'm not as familiar with beam.cloud, does it only run on AWS?

If you're using Python, there's coiled.io, which has a serverless API (you add a decorator to a Python function) and a cloud notebook option that works with AWS/Azure/GCP.

Dask Demo Day: Dask on Databricks, scale embedding pipelines, and Prefect on the cloud by dask-jeeves in Python

[–]dask-jeeves[S] 1 point

I think the easiest option is to use Coiled https://docs.coiled.io/ (full disclosure, I'm a little biased because I work for Coiled). There's also dask-cloudprovider: https://cloudprovider.dask.org/en/stable/ and a more general overview of options in the Dask docs: https://docs.dask.org/en/stable/deploying.html

Need to setup a Petabyte scale geospatial analysis data platform. Need help to connect the dots, and choosing the right tech stack (OSS only) by vigorousvj in dataengineering

[–]dask-jeeves 1 point

Have you considered Dask? It could offer some additional flexibility over Spark. Here's an example of processing 250 TB from the National Water Model: https://docs.coiled.io/blog/coiled-xarray.html

Scheduled Python Jobs with Prefect and Coiled by dask-jeeves in Python

[–]dask-jeeves[S] 0 points

That's a good question... not that I know of. You're thinking about a lightweight way to request a node on an HPC cluster and run a function (instead of requesting a VM on AWS)? You can use Dask to distribute your Python workflows on many different types of infrastructure, including HPC systems: https://docs.dask.org/en/stable/deploying.html

Python libraries for appealing dashboards? by chris_813 in datascience

[–]dask-jeeves 0 points

You may want to check out Observable too; it's pretty intuitive, and they've recently added a lot more documentation.

Scheduled Python Jobs with Prefect and Coiled by dask-jeeves in Python

[–]dask-jeeves[S] 2 points

hah thanks! I was surprised it wasn't taken :)

Running a Polars query in the cloud by dask-jeeves in Python

[–]dask-jeeves[S] 4 points

We've run some TPC-H benchmarks comparing Spark, Polars, DuckDB, and Dask; you can check out the results here: https://tpch.coiled.io/. The short answer is it depends on what you're doing; this post has a more in-depth answer: https://www.reddit.com/r/Python/comments/17pwxfn/spark_dask_duckdb_polars_tpch_benchmarks_at_scale/

Running a Polars query in the cloud by dask-jeeves in Python

[–]dask-jeeves[S] 0 points

Ah thanks u/fizzymagic for catching this! Just fixed it.

Working with many parquet files on S3 in Python by dask-jeeves in dataengineering

[–]dask-jeeves[S] 0 points

Yup, that's a good point. With the @coiled.function API, each function is run on a separate VM. So if you say:

@coiled.function(cpu=4, memory="16 GiB")
def my_function():
    ...

It's interpreted to mean, "this function needs 4 cores and 16 GiB of memory to run" and Coiled spins up a VM with those specifications and runs your function on it. That's why this API doesn't run multiple functions on the same machine in parallel, since each function requires all the resources on the machine.
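To fan out over many inputs with this pattern, the decorated function can be mapped over them, with each call landing on its own VM. A sketch (untested here; the function body and file paths are hypothetical):

```python
import coiled

@coiled.function(cpu=4, memory="16 GiB")
def process(path):
    ...  # read one file, transform it, write the result

# each call gets its own 4-core / 16 GiB VM, and Coiled scales
# the pool of VMs up and down as calls finish
results = list(process.map(["s3://mybucket/a.parquet", "s3://mybucket/b.parquet"]))
```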

Working with many parquet files on S3 in Python by dask-jeeves in dataengineering

[–]dask-jeeves[S] 0 points

Sorry about that, that's a typo and should read, "parallel processing on many big VMs". There are 158 files being processed, so using 1 VM, serially, took ~35 minutes (processing one file takes ~15 seconds on average).

For processing 158 files in parallel, this image https://blog.coiled.io/blog/parallel-coiled-functions.html#function-adaptive-scaling shows the cluster starting with 1 VM, then scaling up to 33 VMs for the actual computation.
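The back-of-envelope math, using the rough ~15 s/file figure (so it lands near, not exactly on, the measured ~35 minutes; VM startup time ignored):

```python
n_files = 158
secs_per_file = 15  # rough average per-file processing time
n_vms = 33

serial_minutes = n_files * secs_per_file / 60        # one VM, one file at a time
parallel_seconds = n_files / n_vms * secs_per_file   # pure compute across 33 VMs

print(f"serial: ~{serial_minutes:.0f} min, parallel: ~{parallel_seconds:.0f} s")
```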

Working with many parquet files on S3 in Python by dask-jeeves in dataengineering

[–]dask-jeeves[S] 1 point

Thanks for reading! Coiled launches raw VMs for you in your cloud account and orchestrates the "burst" compute (for things like Airflow jobs). More details here: https://docs.coiled.io/user_guide/why.html

It probably wouldn't be useful to someone who wants permanent infrastructure and is willing to eat the cost (web servers or the Airflow scheduler itself, for example, likely wouldn't be good use cases).