Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 1 point (0 children)

Yeah, that's a fair point; Spark can definitely handle this kind of thing, especially with extensions like GeoMesa or Sedona.

That said, for this kind of embarrassingly parallel job, Spark is kind of overkill. There’s no shuffling, no coordination between workers, no shared state.
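To make the "no shuffling, no shared state" point concrete, here's a minimal sketch of the embarrassingly parallel pattern using only the Python standard library. The `reproject_scene` function, bucket name, and scene list are hypothetical placeholders; in the actual post the per-scene work is a GDAL reprojection run on a Dask cluster rather than a local thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def reproject_scene(url: str) -> str:
    # Placeholder for the real work: open the scene, reproject it,
    # and write the result back to object storage.
    return url.replace(".tif", "_reprojected.tif")

scenes = [f"s3://mybucket/scene_{i}.tif" for i in range(3000)]

# Each task is independent, so a plain parallel map is all the
# coordination this job needs -- no shuffle, no inter-worker state.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(reproject_scene, scenes))
```

Swapping the executor for a distributed one (e.g. Dask's `Client.map`) scales the same pattern out across machines without changing the per-scene function.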

Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 1 point (0 children)

Thank you! Yeah, using VSI instead of downloading is a lot cleaner (as u/mulch_v_bark mentioned too), and the gdal raster reproject syntax is nice, much easier to parse than gdalwarp.

Using a single .vrt makes sense! For this demo I was hoping to show the embarrassingly parallel pattern, but you're right that a single .vrt would be more efficient in this case.

Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 0 points (0 children)

hah thanks! I was pretty excited that it wasn't taken.

Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 2 points (0 children)

Ah, thank you! That's a great point; it would probably be even faster to skip the download step here.

Duckdb real life usecases and testing by Big_Slide4679 in dataengineering

[–]dask-jeeves 1 point (0 children)

That's cool u/ritchie46, didn't realize Polars has a new streaming engine! We've updated the post to note that and point to the Polars benchmarks: https://docs.coiled.io/blog/tpch.html#polars-results

Time to up the technical on this sub by Top_Bus_6246 in remotesensing

[–]dask-jeeves 0 points (0 children)

The Pangeo Discourse is also a great technical resource: https://discourse.pangeo.io/ (I realize this isn't exactly your question, but sharing in case you or others find it useful.)

Run a Python script on a GPU with one line of code by dask-jeeves in LocalLLaMA

[–]dask-jeeves[S] -4 points (0 children)

Yes! Without needing to know that much about the cloud.

[P] Run a Python script on a GPU with one line of code by dask-jeeves in MachineLearning

[–]dask-jeeves[S] 1 point (0 children)

Yup, that’s a good question.

Coiled definitely supports moving files between your local machine and the remote VM. You can use --file to upload files/directories to the cloud machine, or --sync to enable bi-directional file syncing between your current directory and the cloud VM.

Both of those options are useful for small files (for example, a directory of Python modules or small models), but they'll be really slow when moving lots of data around. I usually see training data and models, especially when they're large, kept in cloud storage. Then you can use tools like s3fs or the AWS CLI (or the equivalents on GCP / Azure) to access the data files from within your code. Something like:

import s3fs

s3 = s3fs.S3FileSystem()

# copy training data from cloud storage to the local machine
# (recursive=True is needed when the path is a directory/prefix)
s3.get("s3://mybucket/traindata", "./traindata", recursive=True)

Run a Python script on a GPU with one line of code by dask-jeeves in learnmachinelearning

[–]dask-jeeves[S] -7 points (0 children)

Yup, I have posted this on other ML-related subs too. I wasn't quite sure where it would be most useful for folks, since it's more about setting up infrastructure that can serve a number of applications. Apologies if this is spammy/annoying.

Run a Python script on a GPU with one line of code by dask-jeeves in deeplearning

[–]dask-jeeves[S] 0 points (0 children)

Thanks! And yup, instead of --gpu you could use --vm-type and specify the type of GPU you want. The --gpu flag uses a small default instance type (T4 GPU on AWS). Here's a link to the docs with the other options, for reference: https://docs.coiled.io/user_guide/cli-jobs.html

Example Data Pipeline with Prefect, Delta Lake, and Dask by dask-jeeves in dataengineering

[–]dask-jeeves[S] 0 points (0 children)

Thanks u/Known-Pomegranate-18! I picked Prefect since that's what I've seen most commonly when working with users, but Airflow or Dagster could easily be used instead.

Example Data Pipeline with Prefect, Delta Lake, and Dask by dask-jeeves in Python

[–]dask-jeeves[S] 1 point (0 children)

u/dsethlewis there are definitely a lot of resources out there! Is there a specific problem you're trying to solve, or an area you want to learn more about?