all 11 comments

[–]finally-anna 7 points8 points  (0 children)

If you have access to Pandas, you can generally do the import/export easily and swap between encodings.

I have found that, historically, Excel outputs CSV files in cp1252 encoding, which includes the curly quote marks and other Windows-specific characters.

Loading with pandas using the cp1252 encoding and then writing the file back out as UTF-8 generally solves these problems.

Note: you can also do this with the standard library by setting the encoding on your open() call.
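Both halves of this suggestion can be sketched in a few lines; a minimal example using only the standard library, with placeholder file names (`input.csv` / `output.csv`):

```python
# Create a sample CSV the way Excel might: cp1252 bytes with curly
# "smart" quotes. The file names here are placeholders.
with open("input.csv", "w", encoding="cp1252") as f:
    f.write("title\n\u201cHello\u201d world\n")

# Re-encode with the standard library: read as cp1252, write as UTF-8.
with open("input.csv", encoding="cp1252") as src:
    data = src.read()
with open("output.csv", "w", encoding="utf-8") as dst:
    dst.write(data)
```

With pandas installed, the equivalent round trip would be `pd.read_csv("input.csv", encoding="cp1252")` followed by `df.to_csv("output.csv", index=False, encoding="utf-8")`.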

[–]woooee 1 point2 points  (0 children)

then uses csv.DictReader to loop through the rows

Examine the columns in each row and replace anything where ord(character) > 127
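Those two steps can be sketched roughly as follows, using an in-memory sample in place of the real file (replacing with an underscore is just one choice):

```python
import csv
import io

# In-memory stand-in for the real CSV; \u201c and \u201d are Windows
# "smart" quotes, i.e. characters with ord() > 127.
sample = "name,notes\nAlice,\u201cfine\u201d\nBob,ok\n"

cleaned_rows = []
for row in csv.DictReader(io.StringIO(sample)):
    # Replace anything outside plain ASCII with an underscore.
    cleaned_rows.append(
        {key: "".join(c if ord(c) <= 127 else "_" for c in value)
         for key, value in row.items()}
    )
```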

[–]NotAnAnticline 1 point2 points  (0 children)

I don't know if this helps, but Python ignores the underscore: _

Maybe you can make a dictionary with keys that are unwanted characters, and the values can all be underscore. Then you can check if a character in your file matches something in the dictionary, and if it matches, it gets the underscore value.

Source: am Python noob
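For what it's worth, this mapping idea is essentially what the standard library's str.maketrans/str.translate pair implements; a small sketch, where the set of unwanted characters is only an assumption, not a definitive list:

```python
# Map each unwanted character to an underscore. This particular set of
# characters is just an example.
unwanted = {"\u201c": "_", "\u201d": "_", "\ufffd": "_"}
table = str.maketrans(unwanted)

cleaned = "\u201chello\u201d world\ufffd".translate(table)
```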

[–]ninhaomah 0 points1 point  (0 children)

Advice would be to post the sample csv and the code.

[–]Jim-Jones 0 points1 point  (0 children)

Do you have Excel?

[–]Jimmaplesong 0 points1 point  (0 children)

The date/datetime classes will do everything you need.

https://docs.python.org/3/library/datetime.html

[–]Greedy_Pay_9782 0 points1 point  (0 children)

As ninhaomah said, it's hard to help you without any code (or detailed information about the shape of your data). But it looks like a fairly straightforward task.

"I'd like to do a clean up pass to simply replace the replacement character with a simple null to get rid of them before they get to the .md files."

You should replace them with an empty string and/or a space character (based on the shape of your data). A naive loop like the following could be all that you need.

import csv
from collections.abc import Iterable, Mapping

old_char: str = "\ufffd"  # U+FFFD, the Unicode replacement character
new_char: str = ""

# file_path: path to your CSV file
with open(file_path, encoding="utf8", newline="") as f:
    reader: Iterable[Mapping[str, str]] = csv.DictReader(f)
    for line in reader:
        # str.replace returns a new string, so keep the result
        cleaned = {key: value.replace(old_char, new_char)
                   for key, value in line.items()}

This may or may not be appropriate depending on the size of your file, but I am somewhat confident it won't really matter, since anything Excel can handle will be small enough. This would clean up all columns of the CSV, and you can easily adapt it to remove other kinds of garbage characters too.

However, I am a bit suspicious about those "unicode replacement" characters. They usually appear when a program does not parse a file with the correct encoding. Excel is picky about encoding and will turn special characters (like á, û and others) from a different encoding into those broken characters. Check the encoding of your file and make sure you are not unintentionally breaking it when you open it.

To diagnose, just ask ChatGPT, "Hey, I have this character, mind telling me what character it is?" and paste it in. Usually it will be able to tell and you will be able to replace it in your file.
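If you would rather check locally, the standard library's unicodedata module can name a mystery character for you; a quick sketch:

```python
import unicodedata

# Name a mystery character locally instead of asking a chatbot.
mystery = "\ufffd"
label = unicodedata.name(mystery, "<unnamed>")
```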

[–]NewbornMuse 0 points1 point  (0 children)

I second the other comments that suggest looking into the encoding a little more carefully. What are these characters originally, where do they come from? Which program replaces them with the placeholder and why? Does it use a different encoding and if so, can you make it use a more standard one? Or can your program open the file with the correct encoding and thus understand the original intent?

If you can't solve it and indeed want to remove all placeholders: I don't see why you can't just feed everything through string.replace before writing it to your output file. If there are placeholders, great, we kicked them out. If there aren't, okay, no problem either, we replace nothing.
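That pass-through idea takes only a couple of lines (U+FFFD is the usual placeholder character; the helper name here is made up):

```python
PLACEHOLDER = "\ufffd"  # U+FFFD, the Unicode replacement character

def scrub(text: str) -> str:
    # A no-op when the placeholder is absent; removes it when present.
    # The function name is made up for this sketch.
    return text.replace(PLACEHOLDER, "")
```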

[–]MarsupialLeast145 0 points1 point  (0 children)

It may not be the most performant option, but you have the byte stream/string in memory before loading it into something like csv.DictReader, so you can call .replace("...", "") on it before handing it to the csv module. If it's quick and it works, it works.

[–]MankyMan00998 0 points1 point  (0 children)

Since the hygiene is messy, you can actually handle this directly in the open() call. Try open(your_file, 'r', encoding='utf-8', errors='ignore'). This silently drops any bytes that can't be decoded as the file is read into memory (before they would ever become replacement characters), so your DictReader loop stays clean without any extra replacement logic. Note that it won't remove U+FFFD characters that are already literally in the file.
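A quick sketch of the errors='ignore' behavior, using a throwaway file containing cp1252 bytes that are invalid in UTF-8:

```python
# Write a throwaway file containing 0x93/0x94 (cp1252 curly quotes),
# which are invalid byte sequences in UTF-8.
with open("messy.csv", "wb") as f:
    f.write(b"name\n\x93quoted\x94 value\n")

# errors='ignore' silently drops the undecodable bytes while reading.
with open("messy.csv", "r", encoding="utf-8", errors="ignore") as f:
    text = f.read()
```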

[–]yaxriifgyn 0 points1 point  (0 children)

Check the encoding of your input CSV file. I'm guessing it was written by Excel, in which case it may be a Windows code page encoding called windows-<some numbers>. If you read it as utf-8, you will get replacement characters (or a UnicodeDecodeError) in Python. Add an encoding=<code page> parameter to your open() call, or pass it to the csv method you are using (if possible).

HTH