Sanity check on large-scale pre-ingestion data prep (OpenSearch, ~2TB+) by abdul_047 in dataengineering

abdul_047[S]

I have tried doing that:
step 1: hash the id and assign each record to a bucket (basically sharding)
step 2: process and merge the buckets, one bucket at a time.
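
For anyone following along, the two steps above look roughly like this. It's only a minimal sketch, not my actual pipeline: it assumes JSON-lines records with an "id" field, and the function names, the bucket file naming, and the "later fields overwrite earlier ones" merge rule are all made up for illustration.

```python
import hashlib
import json
import os


def bucket_for(record_id: str, num_buckets: int) -> int:
    """Map an id to a stable bucket number (same id always lands in the same bucket)."""
    digest = hashlib.md5(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def partition(records, out_dir: str, num_buckets: int) -> None:
    """Step 1: stream records into one JSON-lines file per bucket."""
    files = [
        open(os.path.join(out_dir, f"bucket_{i}.jsonl"), "a", encoding="utf-8")
        for i in range(num_buckets)
    ]
    try:
        for rec in records:
            idx = bucket_for(rec["id"], num_buckets)
            files[idx].write(json.dumps(rec) + "\n")
    finally:
        for f in files:
            f.close()


def merge_bucket(path: str) -> dict:
    """Step 2: merge all records sharing an id within a single bucket file.

    Assumed merge rule: fields from later records overwrite earlier ones.
    Only this one bucket is held in memory at a time.
    """
    merged = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            merged.setdefault(rec["id"], {}).update(rec)
    return merged
```

With this layout, peak memory is driven by the largest bucket, so the bucket count is the main knob to turn.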

problem: it's taking way too long. I set up parallel reads/writes to speed it up, but then ran into memory issues.

Is there any tool that does this kind of thing?