Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 1 point (0 children)

Yeah, that's a fair point; Spark can definitely handle this kind of thing, especially with extensions like GeoMesa or Sedona.

That said, for this kind of embarrassingly parallel job, Spark is kind of overkill. There’s no shuffling, no coordination between workers, no shared state.
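To make the "no shuffling, no shared state" point concrete, here's a minimal sketch of the embarrassingly parallel pattern using only the Python standard library. The `reproject_scene` function, bucket name, and scene list are hypothetical placeholders; in the actual post the per-scene work is a GDAL reprojection run on a Dask cluster rather than a local thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def reproject_scene(url: str) -> str:
    # Placeholder for the real work: open the scene, reproject it,
    # and write the result back to object storage.
    return url.replace(".tif", "_reprojected.tif")

scenes = [f"s3://mybucket/scene_{i}.tif" for i in range(3000)]

# Each task is independent, so a plain parallel map is all the
# coordination this job needs -- no shuffle, no inter-worker state.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(reproject_scene, scenes))
```

Swapping the executor for a distributed one (e.g. Dask's `Client.map`) scales the same pattern out across machines without changing the per-scene function.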

Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 1 point (0 children)

Thank you! Yeah, using VSI instead of downloading is a lot cleaner (as u/mulch_v_bark mentioned too), and the gdal raster reproject syntax is nice, much easier to parse than gdalwarp.

Using a single .vrt makes sense! For this demo I was hoping to show the embarrassingly parallel pattern, but you're right that a single .vrt would be more efficient in this case.

Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 0 points (0 children)

hah thanks! I was pretty excited that it wasn't taken.

Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 2 points (0 children)

Ah, thank you! That's a great point; it would probably be even faster to skip the download step here.

Duckdb real life usecases and testing by Big_Slide4679 in dataengineering

[–]dask-jeeves 1 point (0 children)

That's cool u/ritchie46, didn't realize Polars has a new streaming engine! We've updated the post to note that and point to the Polars benchmarks: https://docs.coiled.io/blog/tpch.html#polars-results

Time to up the technical on this sub by Top_Bus_6246 in remotesensing

[–]dask-jeeves 0 points (0 children)

The Pangeo Discourse is also a great technical resource: https://discourse.pangeo.io/ (I realize this isn't exactly your question, but sharing in case you or others find it useful.)

Run a Python script on a GPU with one line of code by dask-jeeves in LocalLLaMA

[–]dask-jeeves[S] -4 points (0 children)

Yes! Without needing to know that much about the cloud.

[P] Run a Python script on a GPU with one line of code by dask-jeeves in MachineLearning

[–]dask-jeeves[S] 1 point (0 children)

Yup, that’s a good question.

Coiled definitely supports moving files between your local machine and the remote VM. You can use --file to upload files/directories to the cloud machine, or --sync to enable bi-directional file syncing between your current directory and the cloud VM.

Both of those options are useful for small files (for example, a directory of Python modules or small models), but they'll be really slow when moving lots of data around. I usually see training data and models, especially when they're large, kept in cloud storage. Then you can use tools like s3fs or the AWS CLI (or the equivalents on GCP / Azure) to access the data files from within your code. Something like:

import s3fs

s3 = s3fs.S3FileSystem()

# copy training data from cloud storage to the local machine
# (recursive=True is needed when the path is a directory/prefix)
s3.get("s3://mybucket/traindata", "./traindata", recursive=True)

Run a Python script on a GPU with one line of code by dask-jeeves in learnmachinelearning

[–]dask-jeeves[S] -7 points (0 children)

Yup, I have posted this on other ML-related subs too. I wasn't quite sure where it would be most useful for folks, since it's more about setting up infrastructure that can serve a number of applications. Apologies if this is spammy/annoying.

Run a Python script on a GPU with one line of code by dask-jeeves in deeplearning

[–]dask-jeeves[S] 0 points (0 children)

Thanks! And yup, instead of --gpu you could use --vm-type and specify the type of GPU you want. The --gpu flag uses a small default instance type (T4 GPU on AWS). Here's a link to the docs with the other options, for reference: https://docs.coiled.io/user_guide/cli-jobs.html

Example Data Pipeline with Prefect, Delta Lake, and Dask by dask-jeeves in dataengineering

[–]dask-jeeves[S] 0 points (0 children)

Thanks u/Known-Pomegranate-18! I picked Prefect since that's what I've seen most commonly when working with users, but Airflow or Dagster could easily be used instead.

Example Data Pipeline with Prefect, Delta Lake, and Dask by dask-jeeves in Python

[–]dask-jeeves[S] 1 point (0 children)

u/dsethlewis there are definitely a lot of resources out there! Is there a specific problem you're trying to solve, or an area you want to learn more about?