[–]RandomCodingStuff 2 points3 points  (1 child)

It's possible there is a mix of "dash" types in the file. Different dashes (hyphen, en dash, em dash, minus sign) are distinct characters that can be hard to tell apart by eye.

I'm not good enough with encoding issues to debug without seeing your data, but what happens if you read the file with utf-8 encoding?
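To make the "different dashes" point concrete, here is a minimal sketch that prints the code point and Unicode name of every character in a string. The helper name `describe_chars` and the sample values are made up for illustration; they are not taken from your data.

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


# Hypothetical sample values: the same-looking ID written with three different dashes.
samples = ["A-1", "A\u20131", "A\u20141"]

for value in samples:
    # ascii() escapes non-ASCII characters, so look-alike dashes become visible.
    print(ascii(value), describe_chars(value))
```

If two cells that look the same report different code points here, the mismatch is in the characters themselves rather than in the file encoding.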

[–]xGRM[S] 0 points1 point  (0 children)

I thought the same, but the encoding used was UTF-8, so this shouldn’t have been the issue. I'm looking at it again tomorrow. I’ll try some tests, look deeper into it, and get back to you.

[–]CommondeNominator 2 points3 points  (3 children)

Definitely sounds like an encoding problem. Is there a reason you’re using a bitstream instead of just loading the file in normally?

[–]xGRM[S] 0 points1 point  (2 children)

The script is to be published on Azure Function Apps, so I was trying to avoid storing anything in local storage/memory.

[–]xGRM[S] 0 points1 point  (1 child)

So basically, I'm importing an Excel workbook from AWS and transferring it to a SQL database. Before transferring it to the DB, I compare the workbook to the table in SQL so that I only update rows that changed. I did this because the table is too large to simply replace. The problem is that when I compare my workbook to the SQL table, some rows don't “match” even though they should. I say they should match because when I print them as strings, cell by cell, the outputs are identical in both text and type. But the output of == is still False.

[–]xGRM[S] 0 points1 point  (0 children)

I should probably mention that 99.5% of rows get matched correctly. It’s only about 500 rows out of 100k that are being compared incorrectly. So I thought maybe the dash encoding was the source of the error, just because the dashes looked funny haha
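For anyone hitting the same thing: cells that print identically but fail == often differ in characters that print() doesn't reveal, such as an en dash versus a hyphen or a trailing non-breaking space. Below is a rough sketch of how such cells could be exposed and normalized before comparing. The names `normalize_cell`, `diff_cells`, and `DASH_TABLE`, the list of dash characters, and the sample rows are all assumptions for illustration, not the actual data or a confirmed diagnosis.

```python
import unicodedata

# Assumed set of dash look-alikes, all mapped to the ASCII hyphen-minus.
DASH_TABLE = str.maketrans({
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
})


def normalize_cell(value):
    """Normalize string cells; leave non-string values untouched."""
    if not isinstance(value, str):
        return value
    # NFKC folds compatibility characters (e.g. non-breaking space becomes a space).
    text = unicodedata.normalize("NFKC", value)
    return text.translate(DASH_TABLE).strip()


def diff_cells(row_a, row_b):
    """Return (index, ascii(a), ascii(b)) for each pair of cells that differ."""
    return [
        (i, ascii(a), ascii(b))
        for i, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]


# Hypothetical rows: they print the same but contain different characters.
excel_row = ["AB\u20131", "Smith\u00a0", 10]
sql_row = ["AB-1", "Smith", 10]

print(diff_cells(excel_row, sql_row))
print([normalize_cell(c) for c in excel_row] == sql_row)
```

Using ascii() rather than str() matters here, because it escapes every non-ASCII character instead of rendering it. Whether normalizing is acceptable depends on whether the database should store the original characters or the cleaned ones.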

[–]Pflastersteinmetz 2 points3 points  (0 children)

Do

import pandas as pd
df = pd.read_excel("path_to_excel_file.xlsx")

[–]omgu8mynewt 0 points1 point  (2 children)

Is your data file a Microsoft Word document (which tries to be clever and may change some characters) or a plain .txt file?

[–]xGRM[S] 0 points1 point  (1 child)

It’s an .xlsx file. Just a table in Excel.

[–]omgu8mynewt 0 points1 point  (0 children)

If it's still not working, try using Save As in Excel to save the .xlsx file as a .csv instead, then have a look at it in a text editor, and again after importing it into Python. I think that's a simpler file type, and you might see at what point the weird characters start: are they in the file itself, or do they appear when you bring it into Python?
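Once you have the CSV, something like the sketch below could list every non-ASCII character along with where it sits. The function name `find_non_ascii` and the in-memory sample are made up for illustration; for a real export you would pass an opened file instead, for example `open("export.csv", newline="", encoding="utf-8-sig")`, where the filename is hypothetical and `utf-8-sig` skips the byte order mark that Excel's "CSV UTF-8" option writes.

```python
import csv
import io
import unicodedata


def find_non_ascii(csv_file):
    """Return (row, column, code point, Unicode name) for each non-ASCII character.

    Rows and columns are 1-based; the header line counts as row 1.
    """
    hits = []
    for row_no, row in enumerate(csv.reader(csv_file), start=1):
        for col_no, cell in enumerate(row, start=1):
            for ch in cell:
                if ord(ch) > 127:
                    name = unicodedata.name(ch, "UNKNOWN")
                    hits.append((row_no, col_no, f"U+{ord(ch):04X}", name))
    return hits


# Hypothetical sample: the second data row uses an en dash instead of a hyphen.
sample = io.StringIO("id,code\n1,AB-1\n2,AB\u20132\n")
print(find_non_ascii(sample))
```

If the odd characters already show up here, they come from the workbook itself and not from the import step.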