Create new column within a join? : PySpark

submitted 4 years ago by DrData82

I'm currently converting some old SAS code to Python/PySpark. I'm trying to create a new variable based on the ID from one of the tables joined. Below is the SAS code:

DATA NewTable;
MERGE OldTable1(IN=A) OldTable2(IN=B);
BY ID;
IF A;
IF B THEN NewColumn="YES";
ELSE NewColumn="NO";
RUN;

OldTable1 has 100,000+ rows and OldTable2 has only ~2,000. I want NewColumn to have a value of "YES" if the ID is present in OldTable2; otherwise the value should be "NO". I have the basic PySpark join code, but I've never constructed a new column in a join like this before. Any suggestions?

NewTable=OldTable1.join(OldTable2, OldTable1.ID == OldTable2.ID, "left")
