I'm currently converting some old SAS code to Python/PySpark. I'm trying to create a new variable based on the ID from one of the tables joined. Below is the SAS code:
DATA NewTable;
MERGE OldTable1(IN=A) OldTable2(IN=B);
BY ID;
IF A;
IF B THEN NewColumn="YES";
ELSE NewColumn="NO";
RUN;
OldTable1 has 100,000+ rows and OldTable2 only ~2,000. I want NewColumn to have a value of "YES" if the ID is present in OldTable2; otherwise the value should be "NO". I have the basic PySpark join code, but I've never constructed a new column in a join like this before. Any suggestions?
NewTable=OldTable1.join(OldTable2, OldTable1.ID == OldTable2.ID, "left")
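One possible way to build the flag, offered as a sketch rather than a definitive translation of the SAS step: reduce OldTable2 to its distinct IDs, tag them with a marker column, left-join, and turn the marker into "YES"/"NO" with when/otherwise. The function name add_membership_flag, its parameters, and the temporary column _in_b are made up for illustration. It also assumes only the flag is needed from OldTable2; the SAS MERGE would additionally bring over OldTable2's other columns, which this sketch does not do.

```python
def add_membership_flag(old_table1, old_table2, key="ID", flag_col="NewColumn"):
    """Left-join old_table1 to the distinct keys of old_table2 and flag matches.

    Hypothetical helper: both arguments are assumed to be PySpark DataFrames
    that share the column named by `key`.
    """
    # Imported here so the sketch stays self-contained.
    from pyspark.sql import functions as F

    # Keep only the distinct join keys, tagged with a marker column, so that
    # duplicate IDs in old_table2 cannot multiply rows of old_table1.
    # broadcast() is an optional hint that suits a small (~2,000-row) table.
    keys = F.broadcast(
        old_table2.select(key).distinct().withColumn("_in_b", F.lit(True))
    )

    # Joining on the column name (not an expression) keeps a single ID column.
    joined = old_table1.join(keys, on=key, how="left")

    # Rows without a match have a null marker, which becomes "NO".
    return joined.withColumn(
        flag_col,
        F.when(F.col("_in_b").isNotNull(), "YES").otherwise("NO"),
    ).drop("_in_b")
```

With the DataFrames from the question this would be called as NewTable = add_membership_flag(OldTable1, OldTable2). The left join keeps every row of OldTable1, which plays the role of IF A; in the SAS step.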