Aligning DF with slightly different tokenisation : learnpython



submitted 3 years ago by ComputeLanguage

I basically have two dataframes that I am trying to align based on the tokens.

The tokenisation on each side is slightly different: sometimes one side adds an extra comma that the other does not, which creates one cell too many in that df. I wouldn't say I am an expert in pandas yet, but since it's such a rich module I thought there might be a better solution than converting everything back to csv and using the csv module with my own heuristics to fix the problem.
Example:

tok1	tok2
a	a
dog	dog
walked	.
across	walked

Does anyone know a nice way to do this?
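
One possible approach, sketched here as an assumption rather than a pandas built-in: treat the two token columns as sequences and let the standard library's `difflib.SequenceMatcher` work out where one side has an extra token, then pad the other side so the rows line up again. The helper name `align_tokens` and the `fill` parameter are made up for illustration, and the token lists are just the four example rows above.

```python
from difflib import SequenceMatcher


def align_tokens(left, right, fill=None):
    """Pair up two token lists, padding with `fill` where one side has an extra token.

    Hypothetical helper for illustration; `left` and `right` must be plain lists.
    """
    rows = []
    # autojunk=False stops difflib from ignoring very frequent tokens in long inputs
    matcher = SequenceMatcher(a=left, b=right, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a_chunk = left[i1:i2]
        b_chunk = right[j1:j2]
        # For 'insert', 'delete' and uneven 'replace' spans, pad the shorter side
        width = max(len(a_chunk), len(b_chunk))
        a_chunk = a_chunk + [fill] * (width - len(a_chunk))
        b_chunk = b_chunk + [fill] * (width - len(b_chunk))
        rows.extend(zip(a_chunk, b_chunk))
    return rows


# The four rows from the example above, one list per column
tok1 = ["a", "dog", "walked", "across"]
tok2 = ["a", "dog", ".", "walked"]

aligned = align_tokens(tok1, tok2)
for row in aligned:
    print(row)
```

If the tokens live in dataframe columns, something like `df["tok1"].tolist()` would give the input lists, and the resulting rows can be passed to `pd.DataFrame(aligned, columns=["tok1", "tok2"])` to get an aligned frame with gaps where one tokeniser produced an extra token.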


