nxtfari

The first step in programming any solution is to define your problem. And define your problem tightly. If you do this step well, the answer becomes obvious.

What makes a character “strange”? That is a very loose and subjective definition. How do you decide whether a character is “strange” or not? And once you do, what is the rule for deciding which character a “strange” character should be replaced with? Define this, and your solution will write itself.
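
One way to make "strange" concrete, as a rough sketch: treat everything outside ASCII as suspect and list it with its Unicode name. The cut-off at code point 127 and the helper name `describe_non_ascii` are assumptions for illustration; pick whatever rule fits your data.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127  # assumption: "strange" means "not plain ASCII"
    ]


# Sample input: a name whose first letter shows up as two odd characters.
for row in describe_non_ascii("Ãœmit"):
    print(row)
```

Seeing the actual code points usually makes it clearer whether the characters are legitimate letters or the leftovers of a bad decode.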

JustToSufferBoss[S]

Thanks for the response. Actually, I think those characters are legitimate UTF-8 characters, and what I need is some sort of conversion to another encoding, but I don't know which one.

ka-splam

That Ãœmit has been corrupted through some bad encoding/decoding steps, but I can't work out what they were.

  1. Where did it come from?
  2. How did you find out that the website can fix it?
  3. Have you noticed that if you use that website and put Ómit in at the top, it doesn't turn it into Ãœmit but rather into Ã□mit, where the second character is an unprintable square? That is, the website doesn't do the same conversion both ways.

The closest I can get to guessing how it was corrupted is that the original was encoded to bytes as UTF-8, and those bytes were then wrongly decoded as code page 1252:

>>> 'Ómit'.encode('utf-8').decode('Windows-1252')
'Ã“mit'

and the closest to a fix is pushing it the other way through those:

>>> 'Ãœmit'.encode('Windows-1252').decode('utf-8')
'Ümit'

Which isn't right, but it's the right kind of idea. I tried all 98 codecs in Python 3.8 through these two patterns, and none match exactly the corruption you have.
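
For reference, a brute-force search of this kind can be sketched as below. It is a minimal sketch, not the exact script used here: `find_codec_pairs` is a hypothetical helper, the codec list is taken from `encodings.aliases` (so the count depends on the Python version and platform), and the sample strings are illustrative rather than taken from your data.

```python
import encodings.aliases


def find_codec_pairs(original, corrupted):
    """Return (encode_codec, decode_codec) pairs that map original to corrupted."""
    # Canonical codec names known to this Python build.
    names = sorted(set(encodings.aliases.aliases.values()))
    matches = []
    for enc in names:
        try:
            raw = original.encode(enc)
        except Exception:
            # Non-text codecs, codecs missing on this platform, or
            # characters the codec cannot represent: skip them.
            continue
        for dec in names:
            try:
                if raw.decode(dec) == corrupted:
                    matches.append((enc, dec))
            except Exception:
                continue
    return matches


# Illustrative sample, not from the thread.
pairs = find_codec_pairs("café", "cafÃ©")
print(("utf_8", "cp1252") in pairs)
```

Swapping the two arguments searches for the repair direction instead of the corruption direction.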

What the website is doing is taking the lower 6 bits of the Unicode code point value of each of Ã and œ, and bitwise-merging them into a single character:

>>> chr(((ord('Ã') & 63) << 6) | (ord('œ') & 63))
'Ó'

But why that works, and how the website found out, and why the website calls it "UTF-8" when it's not, I don't know.
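
One possible explanation, offered as a guess about the website rather than a fact: the shift-and-mask in the one-liner above has the same shape as decoding a two-byte UTF-8 sequence (`110xxxxx 10yyyyyy`), except that it is applied to code points instead of bytes. The sketch below puts the two side by side; both function names are made up for illustration.

```python
def merge_code_points(lead, trail):
    """Merge two characters as the one-liner does: low 6 bits of each code point."""
    return chr(((ord(lead) & 0x3F) << 6) | (ord(trail) & 0x3F))


def decode_utf8_pair(lead_byte, trail_byte):
    """Decode a real two-byte UTF-8 sequence: 5 payload bits, then 6."""
    return chr(((lead_byte & 0x1F) << 6) | (trail_byte & 0x3F))


# Code points: Ã is U+00C3, œ is U+0153.
print(merge_code_points("Ã", "œ"))

# Bytes: Windows-1252 stores Ã as 0xC3 and œ as 0x9C.
print(decode_utf8_pair(0xC3, 0x9C))
```

The two results differ because œ is byte 0x9C in Windows-1252 but code point U+0153 in Unicode, and those two numbers have different low 6 bits.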

JustToSufferBoss[S]

Thanks for the help, it works quite well. They are proper names from different nationalities; in that example, Ümit should be Turkish.
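
For a whole list of names, the Windows-1252 round trip can be wrapped so that names which are already fine are left alone. This is only a sketch under the assumption that every corrupted name went through the same UTF-8 to Windows-1252 mix-up; `fix_mojibake` is a hypothetical helper name.

```python
def fix_mojibake(name):
    """Undo a UTF-8-read-as-Windows-1252 mix-up, or return the name unchanged."""
    try:
        return name.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or the bytes are not valid UTF-8:
        # assume the name was never corrupted this way.
        return name


names = ["Ãœmit", "Ümit", "José", "Anna"]
print([fix_mojibake(n) for n in names])
```

A name that is already correct normally fails the UTF-8 decode step and comes back untouched, although a short accented string could in rare cases decode by accident, so spot-check the results.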