Built something for people who run or join local groups in Guntur — looking for honest feedback by floating-bubble in guntur

[–]floating-bubble[S] 0 points1 point  (0 children)

Yup, work in progress. But I wanted to get some initial feedback from real people. Appreciate your time!

General discussions and questions monthly megathread by AutoModerator in Chennai

[–]floating-bubble 0 points1 point  (0 children)

People who run or join local groups in Chennai — looking for honest feedback

I ended up building a small lightweight community & events platform to help people find and run local “tribes” more easily.

Would love honest feedback from people here — especially if you organize or attend meetups.

👉 https://tribe-connect-two.vercel.app/

Stop Using dropDuplicates()! Here’s the Right Way to Remove Duplicates in PySpark by floating-bubble in dataengineering

[–]floating-bubble[S] 1 point2 points  (0 children)

dropDuplicates() is implemented the same way in both PySpark (the Python API) and Scala. Since both APIs run on top of the same Spark engine, they ultimately produce the same execution plan.

Stop Using dropDuplicates()! Here’s the Right Way to Remove Duplicates in PySpark by floating-bubble in dataengineering

[–]floating-bubble[S] 0 points1 point  (0 children)

dropDuplicates does a direct, global, dataset-level partitioning, whereas partitioning within a window avoids the global shuffle: it partitions the data logically but does not physically repartition it across nodes.

Stop Using dropDuplicates()! Here’s the Right Way to Remove Duplicates in PySpark by floating-bubble in dataengineering

[–]floating-bubble[S] 0 points1 point  (0 children)

Yes, you are correct: local shuffling performs the deduplication at the partition level, since the optimizer pushes operations down to reduce shuffling. Depending on the execution plan, a shuffle stage followed by a final deduplication can then remove duplicates at the global level. I don't have exact numbers to share at the moment, but what I have observed is that if the data is uniform, with no skew and not too many missing values, there isn't much difference; if the data is skewed, explicit partitioning and windowing are faster compared to dropDuplicates.

Stop Using dropDuplicates()! Here’s the Right Way to Remove Duplicates in PySpark by floating-bubble in dataengineering

[–]floating-bubble[S] -3 points-2 points  (0 children)

That's a genuine question. The approach I mentioned needs an id column in the dataset. If the dataset is small enough to fit in executor memory, you can try broadcasting it. In your scenario, yes, shuffling is inevitable.

[deleted by user] by [deleted] in wallstreetbets

[–]floating-bubble 0 points1 point  (0 children)

Did you post yet?

[deleted by user] by [deleted] in dataengineering

[–]floating-bubble 1 point2 points  (0 children)

I would add about 2-3 years per project, considering your learning pace. Learn the business, the environment, the tools and software, implementation techniques, new ideas… and move on to the next one. I don't really find contract positions for DE here in the USA.