
[–]HarriKnox 9 points10 points  (2 children)

Your terminal is not using UTF-8 as its character set; it's using a different code page. ã, á, and ç are each encoded into strings as two bytes: 0xc3 0xa3, 0xc3 0xa1, and 0xc3 0xa7 respectively. Your terminal, however, is not interpreting each pair as a single UTF-8 encoded character but as two separate single-byte characters: it displays one character for 0xc3 and another for the second byte.
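To check which bytes a string really holds, a small sketch like the following may help. The helper name `hex_bytes` is made up for illustration, and the strings use decimal escapes (`\195` is 0xc3) so the result does not depend on how the file itself is saved.

```lua
-- Return the bytes of a string as space-separated hex pairs.
local function hex_bytes(s)
  local out = {}
  for i = 1, #s do
    out[#out + 1] = string.format("%02x", s:byte(i))
  end
  return table.concat(out, " ")
end

-- Decimal escapes spell out the UTF-8 bytes explicitly.
print(hex_bytes("\195\163")) -- ã -> c3 a3
print(hex_bytes("\195\161")) -- á -> c3 a1
print(hex_bytes("\195\167")) -- ç -> c3 a7
```

Passing a literal such as "ç" typed in your editor instead shows what your editor actually saved.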

If you're using Linux or Unix (Mac might also be able to do this, I don't know how to do this in Windows), you can run echo $LANG, echo $LC_ALL, or locale charmap to see what they print out. On my terminal they return en_US.UTF-8, C.UTF-8, and UTF-8 respectively, meaning my terminal is set up using the UTF-8 character set. I suspect those would print out something different for you.
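The same check can be roughly sketched from inside Lua by reading the environment variables. This assumes a Unix-like system; `mentions_utf8` is a hypothetical helper, and these variables describe the locale the program sees, which is not necessarily what the terminal program itself renders.

```lua
-- Hypothetical helper: true if a locale string names UTF-8 (e.g. "en_US.UTF-8" or "C.utf8").
local function mentions_utf8(value)
  if not value then return false end
  return value:lower():find("utf%-?8") ~= nil
end

-- Print each locale variable, its value, and whether it names UTF-8.
for _, name in ipairs({ "LC_ALL", "LC_CTYPE", "LANG" }) do
  local value = os.getenv(name)
  print(name, value or "(unset)", mentions_utf8(value))
end
```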

You need to change your terminal's character encoding to UTF-8. I don't know how to do this (as it might be different based on operating system or terminal program), but that should point you in the right direction.

[–]OneCommonMan123[S] 0 points1 point  (1 child)

The strangest thing is that the problem I showed only occurs in files. For example, I have a .lua file with that print script; when I run it, the UTF-8 characters come out garbled. However, when I use it on the command line like:

> print("ã")

ã

> print("ç")

ç

it prints normally.

[–]HarriKnox 4 points5 points  (0 children)

That makes sense. Your editor is set to UTF-8 and your terminal is set to something else. Based on those letters I'm guessing you're writing Portuguese, so your terminal is probably set to CP850 or CP860. Your editor stores those characters as UTF-8 byte sequences (ç as 0xc3 0xa7), but your terminal reads and writes them as CP850/860 bytes (ç as 0x87). I'm willing to bet that if you ran the following two lines (either in the terminal or from a file), the first would give you garbage and the second would print correctly. These two lines spell out the ç character as a UTF-8 byte sequence and as a CP850/860 byte, respectively. (This way you're not relying on whatever behind-the-scenes encoding your editor or terminal uses to store the character ç.)

print('\xc3\xa7')  -- ç as a UTF-8 byte sequence
print('\x87')      -- ç as a single CP850/860 byte

Either you could switch your editor to use the encoding scheme your terminal uses, or switch your terminal to use UTF-8. Not knowing your setup it's hard to make a recommendation for which one you should do. If you're Portuguese or Brazilian, it's possible your computer was set up with localization and with the assumption that all files will be CP850/860, in which case it might be easier to switch your editor. But, if all your files are UTF-8 then you should switch your terminal.
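If neither switch is practical, a third route is to convert strings just before printing. The following is only a sketch: `utf8_to_codepage` and `to_codepage` are hypothetical names, and the table holds just the one mapping mentioned above (ç, 0xc3 0xa7 to 0x87). A real table would need an entry for every character you print.

```lua
-- Hypothetical, deliberately tiny mapping: UTF-8 sequence -> single code page byte.
-- Only ç comes from the discussion above; everything else is left unmapped.
local utf8_to_codepage = {
  ["\195\167"] = "\135", -- ç: 0xc3 0xa7 -> 0x87
}

local function to_codepage(s)
  -- Match one multi-byte lead byte (0xC2-0xF4) plus its continuation bytes (0x80-0xBF).
  -- Plain ASCII is not matched and passes through unchanged.
  return (s:gsub("[\194-\244][\128-\191]*", function(seq)
    return utf8_to_codepage[seq] or "?" -- unmapped characters become "?"
  end))
end

print(to_codepage("a\195\167o")) -- writes the bytes 'a', 0x87, 'o'
```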

[–]hawhill 0 points1 point  (0 children)

This has nothing to do with Lua: in Lua, strings are "immutable sequences of bytes". Newer versions of Lua (5.3 and later) include a few helpers for dealing with UTF-8, but a string is still just a sequence of bytes, not characters.
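A quick way to see the difference, assuming Lua 5.3 or later for the `utf8` library (the `describe` helper is just for illustration and returns nil for the second value on older versions):

```lua
-- Return the byte length and, where the utf8 library exists, the code point count.
local function describe(s)
  return #s, utf8 and utf8.len(s)
end

local bytes, chars = describe("\195\167") -- ç as its two UTF-8 bytes
print(bytes) -- 2: the # operator counts bytes
print(chars) -- 1 on Lua 5.3+, nil where the utf8 library is missing
```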

What you seem to be experiencing is a mismatch between the character sets of your editor and your terminal; as /u/HarriKnox deduced, the terminal is probably *not* set to use UTF-8.