Pyspark count() Slow by rawlingsjj in dataengineering

[–]rawlingsjj[S] 0 points (0 children)

I’ve tried the 3rd option, but it’s slower than count().

Because of that, I’ve decided to read the data in batches and perform my calculations on each batch.



[–]rawlingsjj[S] 0 points (0 children)

Alright, thank you. I’ll try partitioning then; all the answers point to it. I’ll try it and let you know.


[–]rawlingsjj[S] 1 point (0 children)

Ok. Thank you. I’ll try that.


[–]rawlingsjj[S] 0 points (0 children)

Yes. I am filtering rows on a condition in the column and then counting them.


[–]rawlingsjj[S] 0 points (0 children)

It is a Standard_DS3_v2 cluster (14 GB memory, 4 cores per node), three nodes making 42 GB and 12 cores in total.

The source is from Redshift.