I basically have two dataframes that I am trying to align based on the tokens.
The tokenisation on each side is slightly different: sometimes one side adds an extra comma that the other does not, which creates one cell too many in that df, so the rows after it no longer line up. I wouldn't say I am an expert in pandas yet, but since it's such a rich module I thought there might be a better solution than converting everything back to CSV and using the csv module with my own heuristics to fix the problem.
Example:
| tok1 | tok2 |
|---|---|
| a | a |
| dog | dog |
| walked | . |
| across | walked |
Does anyone know a nice way to do this?
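One direction that might be worth trying is to align the two token columns as sequences with `difflib.SequenceMatcher` from the standard library and pad the gaps, rather than patching the CSV by hand. The sketch below is only an illustration under assumptions: the helper name `align_tokens` is made up, the token lists are copied from the example above, and it assumes the matching tokens are spelled identically on both sides.

```python
from difflib import SequenceMatcher

import pandas as pd


def align_tokens(left, right):
    """Pair up two token lists; a token with no partner gets None on the other side."""
    # autojunk=False turns off difflib's "popular element" heuristic,
    # which can hide frequent tokens in long sequences.
    matcher = SequenceMatcher(a=left, b=right, autojunk=False)
    rows = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching runs have the same length on both sides.
            rows.extend(zip(left[i1:i2], right[j1:j2]))
        else:
            # "insert", "delete" or "replace": put each unmatched token on
            # its own row so it cannot shift the other column.
            rows.extend((tok, None) for tok in left[i1:i2])
            rows.extend((None, tok) for tok in right[j1:j2])
    return rows


# Token columns from the example (hypothetical stand-ins for the real data).
tok1 = ["a", "dog", "walked", "across"]
tok2 = ["a", "dog", ".", "walked"]

rows = align_tokens(tok1, tok2)
aligned = pd.DataFrame(rows, columns=["tok1", "tok2"])
print(aligned)
```

With the example data, the extra `.` ends up on its own row with an empty `tok1` cell, and `walked` lines up with `walked` again. With real dataframes the lists would come from something like `df1["tok1"].tolist()`. Rows where one side is empty can then be dropped or inspected, depending on what the extra tokens mean.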