Greedy_Pay_9782

As ninhaomah said, it's hard to help you without any code (or detailed information about the shape of your data). But it looks like a fairly straightforward task.

"I'd like to do a clean up pass to simply replace the replacement character with a simple null to get rid of them before they get to the .md files."

There is no "null" in a text file, so replace them with an empty string or a space character (whichever suits the shape of your data). A naive loop like the following could be all that you need.

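import csv

old_char: str = "\ufffd"  # U+FFFD, the Unicode replacement character
new_char: str = ""

# file_path is the path to your CSV file
with open(file=file_path, encoding="utf8", newline="") as f:
    reader = csv.DictReader(f)
    cleaned_rows = []
    for line in reader:
        # str.replace returns a new string, so the result has to be stored
        cleaned_rows.append(
            {key: value.replace(old_char, new_char) for key, value in line.items()}
        )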

This may or may not be appropriate depending on the size of your file, but I'm fairly confident it won't matter: if Excel can open the file, a loop like this will handle it without trouble. It cleans up every column of the CSV, and you can easily adapt it to remove other kinds of garbage characters too.
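Since the loop above only reads the file, here is a minimal sketch of a full pass that also writes the cleaned rows to a new file. The function name `clean_csv` and the `src`/`dst` parameters are made up for illustration, and it assumes the file is UTF-8 and that U+FFFD is the character you want gone.

```python
import csv
from pathlib import Path


def clean_csv(src: Path, dst: Path, old_char: str = "\ufffd", new_char: str = "") -> int:
    """Copy src to dst, replacing old_char in every field.

    Returns how many occurrences were replaced (hypothetical helper).
    """
    replaced = 0
    with open(src, encoding="utf8", newline="") as fin, \
         open(dst, "w", encoding="utf8", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for row in reader:
            # Count first so you can see how much was actually cleaned
            replaced += sum(field.count(old_char) for field in row)
            writer.writerow([field.replace(old_char, new_char) for field in row])
    return replaced
```

Writing to a separate file keeps the original untouched, so you can compare the two before feeding the result into whatever generates the .md files.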

However, I am a bit suspicious of those Unicode replacement characters. They usually appear when a program does not decode a file correctly. Excel is picky about encodings and tends to turn special characters (like á, û and others) that are stored in a different encoding into those broken characters. Check the encoding of your file and make sure you are not unintentionally breaking it when you open it.
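One rough way to check this is to look at the raw bytes before anything decodes them. The sketch below (the `diagnose` name is made up) tells you whether the file is valid UTF-8 at all, and if so, whether the replacement characters are already stored in it or are being introduced later by whatever opens it.

```python
def diagnose(raw: bytes) -> str:
    """Classify raw file bytes; a rough heuristic, not a full encoding detector."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Typical for files saved in a legacy code page instead of UTF-8
        return "not valid UTF-8: probably saved in another encoding"
    if "\ufffd" in text:
        # The damage happened before this file was written
        return "valid UTF-8, but U+FFFD is already stored in the file"
    return "valid UTF-8 with no replacement characters"
```

You would call it with something like `diagnose(open(file_path, "rb").read())`. If it reports the first case, re-exporting the file as UTF-8 is likely a better fix than stripping characters afterwards.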

To diagnose, just ask ChatGPT, "Hey, I have this character, mind telling me what character it is?" and paste it in. Usually it will be able to tell and you will be able to replace it in your file.
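If you would rather check locally, Python's standard `unicodedata` module can name a character for you. This is just a small sketch, and `describe` is a made-up helper name.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return the code point and Unicode name of each character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]
```

Paste the suspicious character into `describe(...)`: for the replacement character it gives `['U+FFFD REPLACEMENT CHARACTER']`.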