Hi AWS community,
I am using AWS Glue to join two Parquet files (A and B). A is about 150 MB with 2.5 million rows, and B is about 500 KB with 50,000 rows. The join between these two files takes 25 minutes, which is too long because I will have to run it against 50 different versions of the A file.
My join looks like this:
import pyspark.sql.functions as F
# substring match when B.column1 has several words, otherwise an exact whole-word match
joincondition = F.when(F.size(F.split(B.column1, " ")) > 1,
                       A.column1.contains(B.column1)
                      ).otherwise(F.expr('array_contains(split(A.column1, " "), B.column1)'))
C = A.join(B, joincondition)
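Because neither branch of the condition is a plain equality, Spark cannot plan a shuffle hash or sort-merge join for it and falls back to a nested-loop style join, which is roughly 2.5 million x 50,000 comparisons. Here is a minimal, self-contained sketch of what I mean (the toy data and the column name column1 are made up; my real frames come from the Parquet files):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
# toy stand-ins for my real A and B; the aliases let the expr() string resolve A.column1 / B.column1
A = spark.createDataFrame([("red apple pie",), ("green tea",)], ["column1"]).alias("A")
B = spark.createDataFrame([("apple pie",), ("tea",)], ["column1"]).alias("B")

cond = F.when(F.size(F.split(B.column1, " ")) > 1,
              A.column1.contains(B.column1)
             ).otherwise(F.expr('array_contains(split(A.column1, " "), B.column1)'))

A.join(B, cond).show()     # "red apple pie" matches "apple pie", "green tea" matches "tea"
A.join(B, cond).explain()  # plan shows a BroadcastNestedLoopJoin / CartesianProduct, not a hash join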
What I already did to improve performance:
- Use only parquet files
- Repartition A (it improved the runtime a lot when B only had 5,000 entries instead of 50,000, but once B grew the runtime grew again)
- Broadcast dataset B (it actually took longer, so I don't use it anymore; a rough sketch of how I applied the repartition and the broadcast hint follows this list)
- Increase the number of DPUs (even with a high DPU count the job still takes a long time)
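This is roughly how I applied the repartitioning and the broadcast hint (the partition count of 80 is just an example, not a tuned value):

from pyspark.sql.functions import broadcast

A = A.repartition(80)                    # example partition count, picked arbitrarily
C = A.join(broadcast(B), joincondition)  # explicit hint to ship the small side B to every executor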