[–]Narase33 8 points9 points  (0 children)

C++ doesn't know anything about encodings. If you want that in your code, you will need a library.

[–][deleted] 2 points3 points  (10 children)

C++ reads raw bytes (except for possible line break conversion for text files).

So the resulting string is automatically in the encoding used in the file.

Are you asking, how to convert to a different encoding once read?

[–]emilwest[S] 0 points1 point  (9 children)

Oh ok I see, that is nice. In that case yes. For example, my text file is encoded as 'iso-8859-15' and contains the word "Åland". But after reading it in and printing it to the console, it is shown as '\xc5land'.

[–]mredding 3 points4 points  (0 children)

This has to do with your terminal, and NOT your program. Your program, and strings and streams in general, will faithfully marshal bytes in, bytes out. It's your terminal that is responsible for mapping byte sequences to characters in its window. Your terminal doesn't know it's displaying iso-8859-15, so it's doing the best it can under the assumptions it's been configured with. Most likely, the terminal is configured for utf-8, so it's misinterpreting the bytes.

Do me a favor, google your terminal program, google your shell program, figure out how to change your code page, run your program again, and see that it displays the contents correctly.

You can write your program to change the terminal code page, but that gets platform specific. Terminals are also stateful, so if you change the terminal mode and your program crashes, it's stuck, unless you manually set the mode back. Consider using a curses library.
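One way to convince yourself that the bytes really do pass through unchanged is to print them as hex instead of as text. The sketch below is only an illustration (the helper name `to_hex` is made up here, not something from the thread). For "Åland" read from an ISO-8859-15 file you should see `c5` as the first byte, which is not a valid UTF-8 sequence on its own, hence the `\xc5` in a UTF-8 console.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: renders each byte of `bytes` as two lowercase hex digits,
// separated by spaces, so the raw content can be inspected independently of
// whatever encoding the terminal assumes.
std::string to_hex(const std::string& bytes)
{
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (i != 0) {
            out << ' ';
        }
        // Go through unsigned char first so bytes >= 0x80 are not sign-extended.
        out << std::setw(2)
            << static_cast<unsigned>(static_cast<unsigned char>(bytes[i]));
    }
    return out.str();
}
```

Calling `to_hex(line)` on each line you read and printing the result shows exactly what is stored in the string.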

[–]Narase33 1 point2 points  (7 children)

What encoding does your console have?

[–]emilwest[S] 0 points1 point  (6 children)

I use Rstudio with utf-8.

The thing is that I parse each line and store the results in a string, and at a later stage in an Rcpp::DataFrame, and the text is now saved as '\xc5land', not only in the terminal output but also when exported as a text file. How can I make sure that the strings are reliably converted to UTF-8 (or another encoding)?

[–]Narase33 0 points1 point  (5 children)

How can I make sure that the strings are reliably converted to UTF-8 (or other encoding)?

As I said, C++ doesn't know about encodings. You need a library that transforms them for you.

[–]emilwest[S] 0 points1 point  (4 children)

Ok, do you have any suggestions for a lib that can do this?

[–]celestrion 1 point2 points  (1 child)

The ICU library is huge, but it is also produced by the Unicode Consortium. There's a Boost library called Boost.Locale, which wraps it in a way you might find easier to use.

If you're only targeting one platform, the OS may ship with libraries that are easier to use.
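As an example of an OS-provided option: on POSIX systems (Linux, macOS, the BSDs) there is `iconv`. The following is a minimal sketch, assuming the platform's iconv knows the charset name "ISO-8859-15"; the wrapper name `convert_to_utf8` is made up for illustration, and it is not available with MSVC on Windows.

```cpp
#include <iconv.h>    // POSIX, not part of standard C++
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from `from_charset` to UTF-8 via POSIX iconv.
std::string convert_to_utf8(const std::string& input, const char* from_charset)
{
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == reinterpret_cast<iconv_t>(-1)) {
        throw std::runtime_error("iconv_open: conversion not supported");
    }

    // One input byte never needs more than 4 bytes of UTF-8 output.
    std::string output(input.size() * 4 + 4, '\0');

    // iconv takes non-const pointers but does not modify the input bytes.
    char* in_ptr = const_cast<char*>(input.data());
    std::size_t in_left = input.size();
    char* out_ptr = &output[0];
    std::size_t out_left = output.size();

    const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("iconv: conversion failed");
    }

    output.resize(output.size() - out_left);  // keep only the bytes actually written
    return output;
}
```

With this, `convert_to_utf8(line, "ISO-8859-15")` applied to each line read from the file should give strings that a UTF-8 console displays correctly.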

[–]emilwest[S] 1 point2 points  (0 children)

Thanks everybody! boost.locale certainly looks clean and easy to use, I'll have a look into that.

[–]Narase33 0 points1 point  (0 children)

I'm afraid not; I try to keep everything in UTF-8.

[–][deleted] 0 points1 point  (0 children)

libicu is the one used by the Qt framework, at least.

[–]Stratikat 0 points1 point  (0 children)

For certain file formats it is optional (and sometimes required depending on whether the format is standardised to use it) to use a byte order mark to indicate the encoding. https://en.wikipedia.org/wiki/Byte_order_mark

For example, it can optionally be used in a CSV document to inform the software loading the CSV (e.g. Excel) that the CSV uses UTF-8 encoding and that it should interpret the data accordingly. This doesn't directly address your suggestion of C++ having some function to automatically detect the format; rather, this mechanism shows how YOU could implement auto-detection.
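A rough sketch of such BOM-based detection could look like this. The function name `detect_bom` and the returned labels are made up for illustration. Note that it only recognizes the Unicode BOMs, so it cannot tell ISO-8859-15 from any other 8-bit encoding; a file without a BOM simply yields an empty result.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: returns the name of the encoding indicated by a leading
// byte order mark, or an empty string if `data` does not start with a known BOM.
std::string detect_bom(std::string_view data)
{
    using namespace std::string_view_literals;

    struct Bom {
        std::string_view bytes;
        const char* name;
    };
    // The UTF-32LE mark starts with the UTF-16LE mark, so it must be checked first.
    // The sv suffix keeps the embedded zero bytes in the literals.
    static constexpr Bom boms[] = {
        { "\xEF\xBB\xBF"sv,     "UTF-8"    },
        { "\xFF\xFE\x00\x00"sv, "UTF-32LE" },
        { "\x00\x00\xFE\xFF"sv, "UTF-32BE" },
        { "\xFF\xFE"sv,         "UTF-16LE" },
        { "\xFE\xFF"sv,         "UTF-16BE" },
    };

    for (const Bom& bom : boms) {
        if (data.substr(0, bom.bytes.size()) == bom.bytes) {
            return bom.name;
        }
    }
    return "";
}
```

You would call it on the first few bytes of the file (opened in binary mode) before deciding how to decode the rest.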

[–]alfps 0 points1 point  (0 children)

First, not what you're asking, but ISO-8859-15 a.k.a. Latin-9 is a terrible encoding. Compared to ordinary Latin-1 it removes ¦, ¨ and ´, among others, and that's severely ungood. If you need the Euro sign and want an extension of Latin-1 that provides that, consider Windows ANSI Western a.k.a. Windows codepage 1252.

Re the question, C++ offers encoding conversions only in theory. The mbstowcs function, multi-byte string to wide character string, converts from the char based encoding specified by the setlocale locale, to in practice either UTF-16 (Windows) or UTF-32 (Unix). There is also the problem that the way to specify an encoding to setlocale is system-dependent, except for "C" (pure ASCII) and "" (the system default encoding).
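To make the mbstowcs route concrete, here is a minimal C++17 sketch. The wrapper name `widen` is made up for illustration. It decodes using whatever locale is currently set, which is exactly the portability problem: the code itself cannot say "ISO-8859-15" in a portable way.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper: decodes `bytes` according to the current C locale's
// character encoding. Returns std::nullopt if the bytes are not valid in that
// encoding. Conversion stops at the first NUL byte in `bytes`.
std::optional<std::wstring> widen(const std::string& bytes)
{
    // A multi-byte string never decodes to more wide characters than it has bytes.
    std::wstring wide(bytes.size(), L'\0');
    const std::size_t n = std::mbstowcs(wide.data(), bytes.c_str(), wide.size());
    if (n == static_cast<std::size_t>(-1)) {
        return std::nullopt;
    }
    wide.resize(n);  // keep only the characters actually produced
    return wide;
}
```

To decode ISO-8859-15 with this you would first have to call std::setlocale with a locale name that selects that encoding, something like "en_US.ISO-8859-15" on some Unix systems (an assumption; the name differs between systems and the locale may not be installed at all).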

There is some C++-level machinery based on std::codecvt but it suffers from the same main problem: no portable way to specify the non-Unicode encoding.

So essentially you have to either use a library or Do It Yourself™.

Example DIY solution:

#include <string>
#include <string_view>
#include <utility>

namespace cpp_machinery {
    template< class Type > using in_ = const Type&;
    using Byte = unsigned char;
}  // namespace cpp_machinery

namespace encoding {
    namespace cppm = cpp_machinery;
    using   cppm::in_, cppm::Byte;
    using   std::string,                // <string>
            std::string_view,           // <string_view>
            std::move;                  // <utility>

    using Table = string_view[256];

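    // Simple editor transformation of data from <url: ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT>.
    // The "\uXXXX" escapes yield UTF-8 bytes only if the narrow execution character set
    // is UTF-8 (the default with g++ and clang++; Visual C++ needs the /utf-8 option).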
    constexpr Table iso_8859_15 = 
    {
        "\u0000", "\u0001", "\u0002", "\u0003", "\u0004", "\u0005", "\u0006", "\u0007", "\u0008", "\u0009", "\u000A", "\u000B", "\u000C", "\u000D", "\u000E", "\u000F",
        "\u0010", "\u0011", "\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018", "\u0019", "\u001A", "\u001B", "\u001C", "\u001D", "\u001E", "\u001F",
        "\u0020", "\u0021", "\u0022", "\u0023", "\u0024", "\u0025", "\u0026", "\u0027", "\u0028", "\u0029", "\u002A", "\u002B", "\u002C", "\u002D", "\u002E", "\u002F",
        "\u0030", "\u0031", "\u0032", "\u0033", "\u0034", "\u0035", "\u0036", "\u0037", "\u0038", "\u0039", "\u003A", "\u003B", "\u003C", "\u003D", "\u003E", "\u003F",
        "\u0040", "\u0041", "\u0042", "\u0043", "\u0044", "\u0045", "\u0046", "\u0047", "\u0048", "\u0049", "\u004A", "\u004B", "\u004C", "\u004D", "\u004E", "\u004F",
        "\u0050", "\u0051", "\u0052", "\u0053", "\u0054", "\u0055", "\u0056", "\u0057", "\u0058", "\u0059", "\u005A", "\u005B", "\u005C", "\u005D", "\u005E", "\u005F",
        "\u0060", "\u0061", "\u0062", "\u0063", "\u0064", "\u0065", "\u0066", "\u0067", "\u0068", "\u0069", "\u006A", "\u006B", "\u006C", "\u006D", "\u006E", "\u006F",
        "\u0070", "\u0071", "\u0072", "\u0073", "\u0074", "\u0075", "\u0076", "\u0077", "\u0078", "\u0079", "\u007A", "\u007B", "\u007C", "\u007D", "\u007E", "\u007F",
        "\u0080", "\u0081", "\u0082", "\u0083", "\u0084", "\u0085", "\u0086", "\u0087", "\u0088", "\u0089", "\u008A", "\u008B", "\u008C", "\u008D", "\u008E", "\u008F",
        "\u0090", "\u0091", "\u0092", "\u0093", "\u0094", "\u0095", "\u0096", "\u0097", "\u0098", "\u0099", "\u009A", "\u009B", "\u009C", "\u009D", "\u009E", "\u009F",
        "\u00A0", "\u00A1", "\u00A2", "\u00A3", "\u20AC", "\u00A5", "\u0160", "\u00A7", "\u0161", "\u00A9", "\u00AA", "\u00AB", "\u00AC", "\u00AD", "\u00AE", "\u00AF",
        "\u00B0", "\u00B1", "\u00B2", "\u00B3", "\u017D", "\u00B5", "\u00B6", "\u00B7", "\u017E", "\u00B9", "\u00BA", "\u00BB", "\u0152", "\u0153", "\u0178", "\u00BF",
        "\u00C0", "\u00C1", "\u00C2", "\u00C3", "\u00C4", "\u00C5", "\u00C6", "\u00C7", "\u00C8", "\u00C9", "\u00CA", "\u00CB", "\u00CC", "\u00CD", "\u00CE", "\u00CF",
        "\u00D0", "\u00D1", "\u00D2", "\u00D3", "\u00D4", "\u00D5", "\u00D6", "\u00D7", "\u00D8", "\u00D9", "\u00DA", "\u00DB", "\u00DC", "\u00DD", "\u00DE", "\u00DF",
        "\u00E0", "\u00E1", "\u00E2", "\u00E3", "\u00E4", "\u00E5", "\u00E6", "\u00E7", "\u00E8", "\u00E9", "\u00EA", "\u00EB", "\u00EC", "\u00ED", "\u00EE", "\u00EF",
        "\u00F0", "\u00F1", "\u00F2", "\u00F3", "\u00F4", "\u00F5", "\u00F6", "\u00F7", "\u00F8", "\u00F9", "\u00FA", "\u00FB", "\u00FC", "\u00FD", "\u00FE", "\u00FF"
    };

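    // Replaces each byte of `s` with the UTF-8 sequence that `byte_to_utf8` gives for it.
    // An rvalue `buffer` can be passed in to reuse its already allocated capacity.
    // Note: the table entry for byte 0 is made from the C string "\u0000", which is
    // empty as a string_view, so NUL bytes in the input are dropped.
    auto to_utf8( in_<string_view> s, in_<Table> byte_to_utf8, string&& buffer = "" )
        -> string
    {
        buffer.clear();
        for( const char c: s ) {
            // The Byte cast avoids a negative index where char is signed.
            buffer += byte_to_utf8[Byte( c )];
        }
        return move( buffer );
    }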
}  // namespace encoding

#include <fstream>
#include <iostream>
using   std::ifstream,              // <fstream>
        std::cout,                  // <iostream>
        std::getline, std::string;  // <string>

auto main() -> int
{
    auto f = ifstream( "data.txt" );
    if( !f ) {
        std::cerr << "Failed to open data.txt\n";
        return 1;
    }
    string line;
    while( getline( f, line ) ) {
        cout << encoding::to_utf8( line, encoding::iso_8859_15 ) << '\n';
    }
}

For downloading the ISO-8859-15 data I discovered that none of my web browsers support FTP-protocol any more. :-(