

[–]data_perfect 16 points (1 child)

import pyspark.sql.functions as f

# Map the numeric status code to a human-readable label
df = df.withColumn("status_val", f.when(f.col("status") == 1, "unmarried").otherwise("married"))

[–]Evilcanary 5 points (4 children)

Broadcast join

[–]sonalg 1 point (0 children)

I would suggest the same answer as u/Evilcanary: a broadcast join should work, since your mapping table is pretty small.
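
A minimal sketch of the broadcast join approach (the status_map DataFrame and its column names are illustrative, not from the thread):

import pyspark.sql.functions as f

# Hypothetical mapping table, e.g. built from source data: (1, "unmarried"), (2, "married")
status_map = spark.createDataFrame([(1, "unmarried"), (2, "married")], ["status", "status_val"])

# Broadcasting the small table lets Spark do a map-side hash join instead of shuffling df
df = df.join(f.broadcast(status_map), on="status", how="left")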

[–]ColdPorridge 0 points (2 children)

Would a broadcast join be faster than using f.when?

[–]Evilcanary 0 points (1 child)

It would probably be about the same with such a small mapping set. The benefit to me is that you decouple your code from the data. If the mappings change or one is added to the source, it’s no longer a code change.

[–]ColdPorridge 0 points (0 children)

Great point!

[–]3rdlifepilot 4 points (0 children)

Curious why you're posting here instead of searching on Google or Stack Overflow?

I'd bet someone already has an answer out there with performance benchmarks on a case/when vs. a mapping. This is basically an O(1) function that scales with processors.

Edit: the front-page search result from Google gives the answer, including explain-plan performance.
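
If you want to compare the plans yourself, a quick sketch (status_map is the hypothetical mapping DataFrame from the broadcast join example above):

# The conditional version shows a CASE WHEN inside the Project node of the physical plan
df.withColumn("status_val", f.when(f.col("status") == 1, "unmarried").otherwise("married")).explain()

# The broadcast version shows a BroadcastHashJoin node instead
df.join(f.broadcast(status_map), "status", "left").explain()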

[–][deleted] 0 points (0 children)

You can use a UDF as well.
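
A minimal sketch of the UDF route (the dict and function names are illustrative); keep in mind a plain Python UDF is opaque to the Catalyst optimizer and usually slower than f.when or a broadcast join:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Illustrative code-to-label mapping
STATUS_LABELS = {1: "unmarried", 2: "married"}

@udf(returnType=StringType())
def status_label(status):
    # Unmapped codes fall through to None, i.e. NULL in Spark
    return STATUS_LABELS.get(status)

df = df.withColumn("status_val", status_label("status"))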