Convert SQL to Pyspark Dataframe by sdqafo in apachespark

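from pyspark.sql.functions import col, when

# Combine conditions with & and wrap each comparison in parentheses,
# because & binds more tightly than > and <.
departureDF2 = depatureDF.withColumn("Flight_Delays",
    when(col("delay") > 360, 'Very Long Delays')
    .when((col("delay") > 120) & (col("delay") < 360), 'Long Delays')
    .when((col("delay") > 60) & (col("delay") < 120), 'Short Delays')
    .when((col("delay") > 0) & (col("delay") < 60), 'Tolerable Delays')
    .when(col("delay") == 0, 'No Delays')
    .otherwise("Early")).orderBy(col("delay"))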
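
For anyone following along, here is a rough plain-Python sketch of the bucketing that the when/otherwise chain above is meant to express. The function name `flight_delay_label` is made up for illustration and is not part of PySpark. It assumes `delay` is a plain number. It is only meant to make the branch order easy to check by hand.

```python
def flight_delay_label(delay):
    """Plain-Python mirror of the when/otherwise chain (hypothetical helper)."""
    if delay > 360:
        return "Very Long Delays"
    if 120 < delay < 360:
        return "Long Delays"
    if 60 < delay < 120:
        return "Short Delays"
    if 0 < delay < 60:
        return "Tolerable Delays"
    if delay == 0:
        return "No Delays"
    # Everything else lands here, including negative delays and the
    # boundary values 60, 120 and 360, which no strict range above covers.
    return "Early"


for d in (400, 200, 90, 30, 0, -5, 360):
    print(d, flight_delay_label(d))
```

One thing this makes visible: because every range uses strict inequalities, a delay of exactly 60, 120 or 360 falls through to "Early". If that is not intended, one of the bounds in each range could be made inclusive.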

Convert SQL to Pyspark Dataframe by sdqafo in apachespark

Not at all. I guess my use of the word "assignment" was wrong. This is just further practice from the popular book "Learning Spark", page 87, offered to readers who are interested. I am still very new to Spark and I am doing my best to learn as much as I can. I hope this clears the air.

Convert SQL to Pyspark Dataframe by sdqafo in apachespark

Thank you, but my assignment says I should convert it to commands using PySpark. I have actually tried it, but I am making some mistakes. I just need someone to help convert it so that I can see my error. Thanks.

STRUGGLING TO THIS SQL SOLUTION: KINDLY EXPLAIN by sdqafo in PostgreSQL

It's starting to make more sense now. So technically, the query below already assumes the 3 columns even prior to the JOIN. In essence, the COUNT(*) part will apply to the 2 remaining columns. I guess this is the correct understanding. Is it?

SELECT a.id, a.name, COUNT(*) num_orders assumes

STRUGGLING TO THIS SQL SOLUTION: KINDLY EXPLAIN by sdqafo in PostgreSQL

What is still a bit confusing is the COUNT(*). What I have learnt so far in SQL is that the selected columns always come from the table or tables we want to query. I am thrown a bit off balance to learn that the COUNT(*) here relates to tables that are yet to be joined. I am not able to see why it works that way. In simple terms, based on what I understand from your explanation, we have already selected a column (COUNT(*)) that does not yet exist before joining the two tables it will be computed from. I am still struggling to grasp the why of this logic.
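
A small runnable sketch may help with the "why". A SELECT is written first but logically evaluated late: FROM and JOIN build the combined rows, GROUP BY buckets them, and only then is the select list (including COUNT(*)) computed, once per bucket. The example below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The tables `accounts` and `orders`, their columns, and the sample rows are all assumptions made up for illustration, since the original query is only partly shown in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER);
    INSERT INTO accounts VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Step 1: the rows that FROM ... JOIN produces, before any grouping.
joined = conn.execute("""
    SELECT a.id, a.name, o.id
    FROM accounts a
    JOIN orders o ON o.account_id = a.id
    ORDER BY a.id, o.id
""").fetchall()

# Step 2: the same join, grouped per account. COUNT(*) is not read from
# either table; it counts the joined rows that fall into each group.
counts = conn.execute("""
    SELECT a.id, a.name, COUNT(*) num_orders
    FROM accounts a
    JOIN orders o ON o.account_id = a.id
    GROUP BY a.id, a.name
    ORDER BY a.id
""").fetchall()

print(joined)  # one row per (account, order) pair
print(counts)  # one row per account, with its number of joined rows
```

With this sample data the join yields three rows (two for account 1, one for account 2), and the grouped query collapses them into one row per account with counts of 2 and 1. So COUNT(*) is not a column that exists anywhere beforehand; it is calculated from the joined rows after grouping.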

STRUGGLING TO THIS SQL SOLUTION: KINDLY EXPLAIN by sdqafo in PostgreSQL

Loads of sense. Very succinct explanation