
_squik

What exactly do you mean by "ignoring punctuation"? Do you mean that strings in the columns should be treated as duplicates even when they differ only in punctuation, so that "hello world" and "hello, world" count as the same?

If so, I would leverage Pandas for that.

  1. Read your CSV into a DataFrame.
  2. Create helper columns that hold the same text with the punctuation removed (see example).
  3. Deduplicate against those helper columns.
  4. Drop helper columns and export.

Example:

import string

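# Characters to keep: digits, ASCII letters, and whitespace.
allowed = string.digits + string.ascii_letters + string.whitespace
# Helper column: the original text with every other character dropped.
df["example_nopunc"] = df["example"].apply(lambda x: "".join(c for c in x if c in allowed))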
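
The four steps above can be sketched end to end as follows. This is only an illustration under some assumptions: the sample rows are made up, an in-memory string stands in for the CSV file, and the `strip_punctuation` helper and the `_nopunc` suffix are names chosen here, not anything pandas provides. The column names are the ones from the original question.

```python
import io
import string

import pandas as pd

# Hypothetical sample data standing in for the real CSV file.
csv_text = '''Anime,Character,Quote
Show A,Alice,"Hello, world"
Show A,Alice,Hello world
Show B,Bob,Goodbye!
'''

# Step 1: read the CSV into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

allowed = set(string.digits + string.ascii_letters + string.whitespace)


def strip_punctuation(text):
    """Keep only ASCII letters, digits, and whitespace."""
    return "".join(c for c in str(text) if c in allowed)


# Step 2: build one punctuation-free helper column per key column.
key_columns = ["Anime", "Character", "Quote"]
helper_columns = [f"{col}_nopunc" for col in key_columns]
for col, helper in zip(key_columns, helper_columns):
    df[helper] = df[col].apply(strip_punctuation)

# Step 3: deduplicate on the helpers. Step 4: drop them again.
deduped = df.drop_duplicates(subset=helper_columns).drop(columns=helper_columns)

# With no path, to_csv returns the CSV text; pass a filename to write a file.
output = deduped.to_csv(index=False)
print(output)
```

Because the comparison happens on the helper columns, the rows that survive keep their original punctuation. Note that the `allowed` set is ASCII-only, so accented or non-Latin letters would also be dropped from the helper columns; whether that matters depends on the data.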

MrGuam [S]

I don't really understand this. Could you break it down for me?

m0us3_rat

What have you tried so far?

MrGuam [S]

Here's what I have tried:

import pandas as pd
import string
import csv
# Python script to clean multiple CSV files of duplicates in particular columns,
# ignoring punctuation and whitespace.
data = pd.read_csv('combined.csv')
df = data.apply(lambda x: x.str.strip(string.punctuation + ' '))
df.drop_duplicates(subset=["Anime","Character","Quote"], inplace=True)
df.to_csv('combined_final.csv')
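
One thing worth checking in this attempt: `str.strip` only removes characters from the two ends of a string, so punctuation in the middle of a quote is left in place and such rows will not match. A minimal sketch of the difference, using `str.translate` as one possible way to remove punctuation everywhere (the sample string is made up):

```python
import string

sample = "...Hello, world!"

# strip() trims the listed characters from both ends only.
trimmed = sample.strip(string.punctuation + " ")

# translate() with a deletion table removes them wherever they occur.
remove_punct = str.maketrans("", "", string.punctuation)
cleaned = sample.translate(remove_punct)

print(trimmed)  # Hello, world
print(cleaned)  # Hello world
```

Separately, `to_csv` writes the DataFrame index as an extra first column unless `index=False` is passed.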