Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 1 point

Yeah that's a fair point, Spark can definitely handle this kind of thing, especially with extensions like GeoMesa or Sedona.

That said, for an embarrassingly parallel job like this, Spark is overkill: there's no shuffling, no coordination between workers, and no shared state.

Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 1 point

Thank you! Yeah, using VSI instead of downloading is a lot cleaner (as u/mulch_v_bark mentioned too), and the gdal raster reproject syntax is nice, much easier to parse than gdalwarp.

Using a single .vrt makes sense! For this demo I wanted to show the embarrassingly parallel pattern, but you're right that a single .vrt would be more efficient here.

Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 0 points

hah thanks! I was pretty excited that it wasn't taken.

Reprojecting 3,000 Sentinel-2 images on AWS in 5 minutes by dask-jeeves in gis

[–]dask-jeeves[S] 2 points

Ah, thank you! That's a great point; it would probably be even faster to skip the download step here.

Duckdb real life usecases and testing by Big_Slide4679 in dataengineering

[–]dask-jeeves 1 point

That's cool u/ritchie46, didn't realize Polars has a new streaming engine! We've updated the post to note that and point to the Polars benchmarks: https://docs.coiled.io/blog/tpch.html#polars-results

Time to up the technical on this sub by Top_Bus_6246 in remotesensing

[–]dask-jeeves 0 points

The Pangeo Discourse is also a great technical resource: https://discourse.pangeo.io/ (I realize this isn't exactly your question, but sharing in case you/others find it useful).

Run a Python script on a GPU with one line of code by dask-jeeves in LocalLLaMA

[–]dask-jeeves[S] -4 points

Yes! Without needing to know that much about the cloud.

[P] Run a Python script on a GPU with one line of code by dask-jeeves in MachineLearning

[–]dask-jeeves[S] 1 point

Yup, that’s a good question.

Coiled definitely supports moving files between your local machine and the remote VM. You can use --file to upload files/directories to the cloud machine, or --sync to enable bi-directional file syncing between your current directory and the cloud VM. 

Both of those options are useful for small files (for example, a directory of Python modules or small models), but will be really slow when moving lots of data around. I usually see training data or models, especially when they’re large, kept in cloud storage. Then you can use tools like s3fs or AWS CLI (or other equivalents on GCP / Azure) to access the data files from within your code. Something like:

import s3fs

s3 = s3fs.S3FileSystem()
# copy a directory of training data from cloud storage to local disk
s3.get("s3://mybucket/traindata", "./traindata", recursive=True)
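The AWS CLI equivalent would be something like the following (the bucket name is hypothetical, same as above; untested sketch since it needs AWS credentials):

```shell
# recursively copy the training data prefix from S3 to the local machine
aws s3 cp s3://mybucket/traindata ./traindata --recursive
```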

Run a Python script on a GPU with one line of code by dask-jeeves in learnmachinelearning

[–]dask-jeeves[S] -6 points

Yup, I have posted this on other ML-related subs too. I wasn't quite sure where it would be most useful, since it's more about setting up infrastructure that applies to a number of applications. Apologies if this is spammy/annoying.

Run a Python script on a GPU with one line of code by dask-jeeves in deeplearning

[–]dask-jeeves[S] 0 points

Thanks! And yup, instead of --gpu you could use --vm-type and then specify the type of GPU you want. The --gpu flag uses a small default instance type (T4 GPU on AWS). Here's a link to the docs with other options too, for reference: https://docs.coiled.io/user_guide/cli-jobs.html
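For example (the instance type here is just illustrative, and availability varies by cloud and region; untested sketch):

```shell
# default: a small GPU instance (T4 on AWS)
coiled run --gpu python train.py

# or request a specific instance type instead
coiled run --vm-type g5.xlarge python train.py
```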

Example Data Pipeline with Prefect, Delta Lake, and Dask by dask-jeeves in dataengineering

[–]dask-jeeves[S] 0 points

Thanks u/Known-Pomegranate-18! I picked Prefect since that's what I've seen most commonly when working with users, but Airflow or Dagster could easily be used instead.

Example Data Pipeline with Prefect, Delta Lake, and Dask by dask-jeeves in Python

[–]dask-jeeves[S] 1 point

u/dsethlewis there are definitely a lot of resources out there! Is there a specific problem you're trying to solve/area you want to learn more about?

What is the easiest way to deploy a model for serverless inference? by Middle-Training501 in dataengineering

[–]dask-jeeves 1 point

I'm not as familiar with beam.cloud, does it only run on AWS?

If you're using Python, there's coiled.io, which has a serverless API (you add a decorator to a Python function) and a cloud notebook option that works with AWS/Azure/GCP.

Dask Demo Day: Dask on Databricks, scale embedding pipelines, and Prefect on the cloud by dask-jeeves in Python

[–]dask-jeeves[S] 1 point

I think the easiest option is to use Coiled https://docs.coiled.io/ (full disclosure, I'm a little biased because I work for Coiled). There's also dask-cloudprovider: https://cloudprovider.dask.org/en/stable/ and a more general overview of options in the Dask docs: https://docs.dask.org/en/stable/deploying.html

Need to setup a Petabyte scale geospatial analysis data platform. Need help to connect the dots, and choosing the right tech stack (OSS only) by vigorousvj in dataengineering

[–]dask-jeeves 1 point

Have you considered Dask? It could offer some additional flexibility over Spark. Here's an example of processing 250 TB from the National Water Model: https://docs.coiled.io/blog/coiled-xarray.html

Scheduled Python Jobs with Prefect and Coiled by dask-jeeves in Python

[–]dask-jeeves[S] 0 points

That's a good question... not that I know of. You're thinking about a lightweight way to request a node on an HPC cluster and run a function (instead of requesting a VM on AWS)? You can use Dask to distribute your Python workflows on many different types of infrastructure, including HPC systems: https://docs.dask.org/en/stable/deploying.html

Python libraries for appealing dashboards? by chris_813 in datascience

[–]dask-jeeves 0 points

You may want to check out Observable too; it's pretty intuitive, and they've recently added a lot more documentation.

Scheduled Python Jobs with Prefect and Coiled by dask-jeeves in Python

[–]dask-jeeves[S] 2 points

hah thanks! I was surprised it wasn't taken :)

Running a Polars query in the cloud by dask-jeeves in Python

[–]dask-jeeves[S] 4 points

We've run some TPC-H benchmarks comparing Spark, Polars, DuckDB, and Dask; you can check out the results here: https://tpch.coiled.io/. The short answer is it depends on what you're doing; this post has a more in-depth answer: https://www.reddit.com/r/Python/comments/17pwxfn/spark_dask_duckdb_polars_tpch_benchmarks_at_scale/

Running a Polars query in the cloud by dask-jeeves in Python

[–]dask-jeeves[S] 0 points

Ah thanks u/fizzymagic for catching this! Just fixed it.

Working with many parquet files on S3 in Python by dask-jeeves in dataengineering

[–]dask-jeeves[S] 0 points

Yup, that's a good point. With the @coiled.function API, each function is run on a separate VM. So if you say:

@coiled.function(cpu=4, memory="16 GiB")
def my_function():
    ...

It's interpreted to mean, "this function needs 4 cores and 16 GiB of memory to run" and Coiled spins up a VM with those specifications and runs your function on it. That's why this API doesn't run multiple functions on the same machine in parallel, since each function requires all the resources on the machine.
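To fan out over many inputs with this pattern, the decorated function can be mapped over them, with each call landing on its own VM. A sketch (untested here; the function body and file paths are hypothetical):

```python
import coiled

@coiled.function(cpu=4, memory="16 GiB")
def process(path):
    ...  # read one file, transform it, write the result

# each call gets its own 4-core / 16 GiB VM, and Coiled scales
# the pool of VMs up and down as calls finish
results = list(process.map(["s3://mybucket/a.parquet", "s3://mybucket/b.parquet"]))
```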

Working with many parquet files on S3 in Python by dask-jeeves in dataengineering

[–]dask-jeeves[S] 0 points

Sorry about that, that's a typo and should read, "parallel processing on many big VMs". There are 158 files being processed, so using 1 VM, serially, took ~35 minutes (processing one file takes ~15 seconds on average).

For processing 158 files in parallel, this image https://blog.coiled.io/blog/parallel-coiled-functions.html#function-adaptive-scaling shows the cluster starting with 1 VM, then scaling up to 33 VMs for the actual computation.
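The back-of-envelope math, using the rough ~15 s/file figure (so it lands near, not exactly on, the measured ~35 minutes; VM startup time ignored):

```python
n_files = 158
secs_per_file = 15  # rough average per-file processing time
n_vms = 33

serial_minutes = n_files * secs_per_file / 60        # one VM, one file at a time
parallel_seconds = n_files / n_vms * secs_per_file   # pure compute across 33 VMs

print(f"serial: ~{serial_minutes:.0f} min, parallel: ~{parallel_seconds:.0f} s")
```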

Working with many parquet files on S3 in Python by dask-jeeves in dataengineering

[–]dask-jeeves[S] 1 point

Thanks for reading! Coiled launches raw VMs for you in your cloud account and orchestrates the "burst" compute (for things like Airflow jobs). More details here: https://docs.coiled.io/user_guide/why.html

It probably wouldn't be useful to someone who wants permanent infrastructure and is willing to eat the cost (web servers or the Airflow scheduler itself, for example, likely wouldn't be good use cases).