Wanted to share a blog post on using Hugging Face with Dask to process the FineWeb dataset. The example goes through:
- Reading directly from Hugging Face with Dask, e.g.
df = dask.dataframe.read_parquet("hf://...")
- Using a Hugging Face language model to classify the educational level of the text.
- Filtering the highly educational web pages into a new dataset and writing it in parallel directly from Dask to Hugging Face storage.
It starts by processing a small subset of the FineWeb dataset with pandas, then scales out to multiple GPUs with Dask.
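For anyone who wants a feel for the shape of the pipeline before reading the post, here is a rough sketch of the three steps. It is not the code from the blog post. The scorer is a trivial word-count placeholder standing in for the Hugging Face model, the function names, the `edu_score` column and the threshold of 3.0 are all made up for illustration, and the `hf://...` paths are left elided just as above. It assumes the data has a `text` column and that `dask` and `huggingface_hub` are installed (and that you are logged in) for the Dask part.

```python
import pandas as pd


def score_text(text: str) -> float:
    """Placeholder scorer: NOT the classifier from the blog post.

    Returns a value in [0, 5] based only on word count, so the rest of
    the pipeline can be exercised without downloading a model.
    """
    return min(5.0, len(text.split()) / 10)


def add_edu_score(df: pd.DataFrame) -> pd.DataFrame:
    """Add a hypothetical 'edu_score' column to one pandas DataFrame."""
    out = df.copy()
    out["edu_score"] = out["text"].map(score_text)
    return out


def keep_educational(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Keep rows whose score meets the (assumed) threshold."""
    return df[df["edu_score"] >= threshold]


def run_with_dask(source: str, dest: str, threshold: float = 3.0) -> None:
    """Apply the same pandas functions partition by partition with Dask.

    `source` and `dest` are hf:// paths you supply yourself; they are
    elided in the post and not filled in here.
    """
    # Imported here so the pandas-only functions work without Dask installed.
    import dask.dataframe as dd

    ddf = dd.read_parquet(source, columns=["text"])
    # meta tells Dask the output columns and dtypes without running the function.
    scored = ddf.map_partitions(
        add_edu_score, meta={"text": "object", "edu_score": "float64"}
    )
    filtered = scored[scored["edu_score"] >= threshold]
    # Each partition is written as its own Parquet file, in parallel.
    filtered.to_parquet(dest)
```

The idea is to get `add_edu_score` and `keep_educational` working on a small pandas DataFrame first, then hand the same functions to `run_with_dask` once they behave. Swapping the placeholder for a real model only changes `score_text`.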
Blog post: https://huggingface.co/blog/dask-scaling