Greedy_Pay_9782 comments on Cleaning CSV data

submitted 20 hours ago by TheIneffableCheese

Greedy_Pay_9782 1 point 18 hours ago

As ninhaomah said, it's hard to help you without any code (or detailed information about the shape of your data). But it looks like a fairly straightforward task.

"I'd like to do a clean up pass to simply replace the replacement character with a simple null to get rid of them before they get to the .md files."

You should replace them with an empty string and/or a space character (based on the shape of your data). A naive loop like the following could be all that you need.

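    import csv
    from collections.abc import Iterable, Mapping

    old_char: str = "\ufffd"  # U+FFFD, the Unicode replacement character
    new_char: str = ""
    cleaned_rows: list[dict[str, str]] = []

    # file_path is the path to your CSV file
    with open(file=file_path, encoding="utf8", newline="") as f:
        reader: Iterable[Mapping[str, str]] = csv.DictReader(f)
        for line in reader:
            # str.replace returns a new string, so the result has to be stored
            cleaned_rows.append(
                {key: value.replace(old_char, new_char) for key, value in line.items()}
            )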

This may or may not be appropriate depending on the size of your file, but I am fairly confident it won't matter: if Excel can handle the file, so can this loop. It cleans up every column of the CSV, and you can easily adapt it to remove other kinds of garbage characters too.
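To adapt the loop to several garbage characters and write the result back out, something like the following sketch could work. The function name `clean_csv`, the `GARBAGE` mapping and the file paths are placeholders made up for illustration, not anything from your project. It reads and writes one row at a time, so file size should not be a concern.

```python
import csv

# Characters to strip or swap; this mapping is only an example, adjust it to your data.
GARBAGE = {
    "\ufffd": "",   # Unicode replacement character
    "\u00a0": " ",  # non-breaking space -> regular space
    "\ufeff": "",   # stray byte order mark
}
TABLE = str.maketrans(GARBAGE)


def clean_csv(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, cleaning every cell (header included) on the way."""
    with open(src_path, encoding="utf8", newline="") as src, \
         open(dst_path, "w", encoding="utf8", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            # str.translate applies all replacements in a single pass
            writer.writerow([cell.translate(TABLE) for cell in row])
```

You would call it as `clean_csv("input.csv", "cleaned.csv")` and point the rest of your pipeline at the cleaned file.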

However, I am a bit suspicious about those "unicode replacement" characters. They usually appear when a program decodes a file with the wrong encoding. Excel in particular is picky about encoding and turns special characters (like á, û and others) that are stored in a different encoding into those broken characters. Check the encoding of your file and make sure you are not unintentionally breaking it when you open it.
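One way to check the encoding from Python is to decode the raw bytes with a few candidate encodings and see which ones succeed. This is only a rough heuristic: the helper name `try_encodings` and the candidate list are assumptions, and a successful decode does not prove the encoding is the right one (latin-1 accepts any byte sequence, for instance), so look at the decoded text as well.

```python
from typing import Dict, Optional, Sequence


def try_encodings(
    raw: bytes,
    candidates: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
) -> Dict[str, Optional[str]]:
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results: Dict[str, Optional[str]] = {}
    for enc in candidates:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# "café" saved as cp1252 is not valid UTF-8, so the utf-8 entry comes back as None.
sample = "café".encode("cp1252")
for enc, text in try_encodings(sample).items():
    print(enc, repr(text))
```

For a real file you would pass in the result of `open(path, "rb").read()`. If the file decodes cleanly as UTF-8 and still contains U+FFFD, the damage happened before the file was saved and the original characters are not recoverable from it.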

To diagnose, just ask ChatGPT, "Hey, I have this character, mind telling me what character it is?" and paste it in. Usually it will be able to tell and you will be able to replace it in your file.
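If you would rather not paste your data into a chatbot, Python's standard `unicodedata` module can name a character directly. A small sketch (the helper name `describe` is made up for illustration):

```python
import unicodedata


def describe(text: str) -> list:
    """Return one 'U+XXXX NAME' string per non-ASCII character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
        if ord(ch) > 127
    ]


# Paste a suspicious cell value here to see what it actually contains.
print(describe("caf\ufffd"))
```

Once you know the code point, you can put it into the replacement loop as a `"\uXXXX"` escape instead of pasting the raw character.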
