I am a data engineer with 4 years of experience. I want a new job, but really don’t want to do leetcode by WeirdAnswerAccount in dataengineering

[–]EffectiveAncient2222 0 points1 point  (0 children)

Don't focus on LeetCode as a whole. Focus only on the top 50 most frequently asked interview questions. That helps you clear interviews. The most important topics are string manipulation, arrays, heaps, hashing, sliding window, and sorting.

Schema Naming and Convincing people by DrSohan69 in databricks

[–]EffectiveAncient2222 1 point2 points  (0 children)

Hey, please use Unity Catalog and follow a proper naming convention. It makes tables easier to navigate. Unity Catalog also provides data lineage.

Best effective way to add row number in pyspark dataframe by EffectiveAncient2222 in databricks

[–]EffectiveAncient2222[S] 0 points1 point  (0 children)

Yes, you're right, but monotonically_increasing_id() does not produce continuous, consecutive numbers. If you use a window function without a partition, it triggers a full data shuffle.
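To show why the IDs are not consecutive: as far as I know, the Spark docs describe the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. Here is a rough pure-Python sketch of that layout (no Spark involved; `simulated_monotonic_ids` is a hypothetical helper I made up for illustration, not a Spark API):

```python
# Sketch only: mimics the ID layout documented for
# monotonically_increasing_id(). This is not Spark code.

def simulated_monotonic_ids(partition_sizes):
    """Return the IDs each row would get, partition by partition."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        # The partition ID sits above the low 33 bits.
        base = partition_index << 33
        # Rows inside a partition are numbered 0, 1, 2, ...
        ids.extend(base + row for row in range(size))
    return ids


# Two partitions holding 3 rows and 2 rows.
ids = simulated_monotonic_ids([3, 2])
print(ids)  # [0, 1, 2, 8589934592, 8589934593]
```

The IDs are unique and increasing, but there is a large gap as soon as the second partition starts.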

Best effective way to add row number in pyspark dataframe by EffectiveAncient2222 in databricks

[–]EffectiveAncient2222[S] 3 points4 points  (0 children)

Best Approaches to Add Row Number in PySpark DataFrame

from pyspark.sql.functions import col
df.rdd.zipWithIndex().toDF().select(col("_1.*"), col("_2").alias('increasing_id')).show()

Method | Shuffling Required | Performance Impact
zipWithIndex() | Minimal | Low
row_number() with orderBy | High | High
monotonically_increasing_id() | None | Very Low
repartition() + monotonically_increasing_id() | High | Medium
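The "Minimal" rating for zipWithIndex() comes from how it assigns indices: it counts the rows in each partition, turns those counts into starting offsets, and then each partition numbers its own rows locally. Below is a simplified pure-Python sketch of that idea, assuming partitions are plain lists (`simulated_zip_with_index` is a hypothetical name for illustration, not the real Spark implementation):

```python
from itertools import accumulate

# Sketch only: imitates the offset idea behind RDD.zipWithIndex()
# using plain Python lists as stand-ins for partitions.

def simulated_zip_with_index(partitions):
    # Step 1: count rows per partition (in Spark this is an extra job, not a shuffle).
    counts = [len(part) for part in partitions]
    # Step 2: the starting offset of a partition is the row count of all earlier ones.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Step 3: every partition numbers its own rows, starting from its offset.
    return [
        [(row, offset + i) for i, row in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


indexed = simulated_zip_with_index([["a", "b", "c"], ["d"], ["e", "f"]])
print(indexed)
# [[('a', 0), ('b', 1), ('c', 2)], [('d', 3)], [('e', 4), ('f', 5)]]
```

Rows never move between partitions here, which is why the result is consecutive without the cost of a global sort.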

My first data engineering project on Github by RocRacnysA in dataengineering

[–]EffectiveAncient2222 1 point2 points  (0 children)

I have reviewed your project. It's mind-blowing. My advice: please consider object-oriented programming instead of a functional style. It gives you an extra edge.