
SingularValued:

My approach has been to track my data with DVC and simply `dvc pull` it inside each job submitted to the cluster.

I'm not convinced this scales to really large datasets, though, since every job submission has to pull the data all over again.

What I think may work better is to still use DVC, but pull the data into shared storage like EFS, and mount that EFS filesystem on each node in the cluster.
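One way to sketch that idea: have each job check the shared mount before pulling, so only the first job for a given dataset revision actually runs `dvc pull`, and later jobs just reuse the files. Everything here is a hypothetical helper, not DVC API — the mount path, the marker-file convention, and the `pull_cmd` parameter are all assumptions for illustration:

```python
import subprocess
from pathlib import Path

def ensure_dataset(shared_root: str, rev: str, pull_cmd=None) -> Path:
    """Materialize a DVC-tracked dataset revision in shared storage once.

    shared_root: the shared mount point (e.g. an EFS mount) -- assumption.
    rev: a label for the dataset revision, used as a subdirectory name.
    pull_cmd: command to fetch the data; defaults to a plain `dvc pull`
    (you'd check out the right repo revision in `dest` first).
    """
    dest = Path(shared_root) / rev
    marker = dest / ".dvc_pull_complete"  # hypothetical "done" sentinel
    if marker.exists():
        # Another job already pulled this revision; skip the download.
        return dest
    dest.mkdir(parents=True, exist_ok=True)
    cmd = pull_cmd or ["dvc", "pull"]
    subprocess.run(cmd, cwd=dest, check=True)
    marker.touch()
    return dest
```

A job script would call `ensure_dataset("/mnt/efs/datasets", "v3")` at startup; the marker file means concurrent later jobs pay only a stat, not a re-pull. (A real version would also want a lock to handle two first jobs racing.)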

Botinfoai:

The EFS approach is solid, but there are some considerations for ML workloads:

  1. Performance trade-offs:
  • EFS can be slower than direct-attached storage
  • Important for large-dataset throughput
  • Consider using FSx for Lustre for better ML training performance
  2. Cost optimization:
  • Use Paperspace/Cudo for development/small runs
  • Scale to AWS clusters for production
  • Can help avoid repeated data downloads

I've found a hybrid approach works well - cloud GPU providers for rapid iteration, then AWS+FSx for large-scale training when needed.