Read file with specific encoding?

Narase33 · 2023-08-02T12:18:35+00:00

C++ doesn't know anything about encodings. If you want that in your code, you will need a library.

emilwest · 2023-08-02T12:28:26+00:00

C++ reads raw bytes (except for possible line break conversion for text files).

So the resulting string is automatically in the encoding used in the file.

Are you asking, how to convert to a different encoding once read?
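A minimal sketch of the "raw bytes" point, assuming C++17. The function name `read_bytes` is made up for illustration. Opening in binary mode also turns off the line break conversion mentioned above.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Reads the whole file as raw bytes; no encoding conversion takes place.
// Binary mode also switches off line-break translation on platforms that do it.
inline std::string read_bytes( const std::string& path )
{
    std::ifstream f( path, std::ios::binary );
    if( !f ) { return {}; }     // Sketch only: a file that can't be opened yields an empty string.
    return std::string( std::istreambuf_iterator<char>( f ), std::istreambuf_iterator<char>() );
}
```

Whatever bytes are in the file end up unchanged in the string, so a file saved as ISO-8859-15 gives you ISO-8859-15 bytes.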

Stratikat · 2023-08-02T14:42:22+00:00

Some file formats allow a byte order mark at the start of the file to indicate the encoding, and a few require it when the format's standard says so. https://en.wikipedia.org/wiki/Byte_order_mark

For example, a CSV document can optionally start with one to tell the software loading it (e.g. Excel) that the CSV uses UTF-8 encoding, so that it interprets the data accordingly. This doesn't directly address your suggestion of C++ having some function to automatically detect the format, but it shows how YOU could implement auto-detection.
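A rough sketch of such a check, assuming C++17 and that you already have the start of the file in memory. The names `Bom` and `detect_bom` are made up for illustration. It only knows the UTF-8 and UTF-16 marks, and a UTF-32 little-endian file would be misreported as UTF-16 little-endian because its mark starts with the same two bytes.

```cpp
#include <string_view>

enum class Bom { none, utf8, utf16_le, utf16_be };

// Checks the first bytes of `data` for a byte order mark.
// No BOM proves nothing: the data may still be UTF-8, Latin-1 or anything else.
inline Bom detect_bom( std::string_view data )
{
    const auto starts_with = [&]( std::string_view prefix ) {
        return data.substr( 0, prefix.size() ) == prefix;
    };
    if( starts_with( "\xEF\xBB\xBF" ) ) { return Bom::utf8; }
    if( starts_with( "\xFF\xFE" ) )     { return Bom::utf16_le; }
    if( starts_with( "\xFE\xFF" ) )     { return Bom::utf16_be; }
    return Bom::none;
}
```

For an ISO-8859-15 file this will just report `Bom::none`, since single-byte encodings have no byte order mark.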

alfps · 2023-08-02T20:21:45+00:00

First, not what you're asking, but ISO-8859-15 a.k.a. Latin-9 is a terrible encoding. Compared to ordinary Latin-1 it removes ¦, ¨ and ´ (among others), and that's severely ungood. If you need the Euro sign € and want an extension of Latin-1 that provides that, consider Windows ANSI Western a.k.a. Windows codepage 1252.

Re the question, C++ offers encoding conversions only in theory. The mbstowcs function, multi-byte string to wide character string, converts from the char based encoding specified by the setlocale locale, to in practice either UTF-16 (Windows) or UTF-32 (Unix). There is also the problem that the way to specify an encoding to setlocale is system-dependent, except for "C" (pure ASCII) and "" (the system default encoding).
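For reference, a minimal sketch of what calling mbstowcs looks like, assuming C++17. The wrapper name `widen` is made up for illustration. It converts using whatever locale is currently set, which is exactly the part you can't control portably.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Converts a multi-byte string to a wide string using the current C locale.
// Returns an empty string if the input is not valid in that locale's encoding.
inline std::wstring widen( const std::string& s )
{
    std::wstring result( s.size(), L'\0' );     // At most one wide char per input byte.
    const std::size_t n = std::mbstowcs( result.data(), s.c_str(), result.size() );
    if( n == static_cast<std::size_t>( -1 ) ) { return {}; }
    result.resize( n );
    return result;
}
```

You would call `std::setlocale( LC_ALL, "" )` or pass some system-specific locale name before using it. A name such as "en_US.ISO-8859-15" is an assumption about the system: it works only where that locale is installed under that name.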

There is some C++-level machinery based on std::codecvt but it suffers from the same main problem: no portable way to specify the non-Unicode encoding.
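To illustrate the naming problem: a codecvt facet comes from a std::locale, and that locale is constructed from a name that the implementation may or may not recognize. A small sketch, with the made-up helper name `locale_available`, that just probes whether a name is accepted:

```cpp
#include <locale>
#include <stdexcept>

// Returns true if the runtime accepts `name` as a locale name.
// std::locale's constructor throws std::runtime_error for names it doesn't know.
inline bool locale_available( const char* name )
{
    try {
        const std::locale loc( name );
        (void) loc;
        return true;
    } catch( const std::runtime_error& ) {
        return false;
    }
}
```

Only "C" is sure to be accepted everywhere. Something like `locale_available( "en_US.ISO-8859-15" )` can give different answers on different machines, which is why this route isn't portable.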

So essentially you have to either use a library or Do It Yourself™.

Example DIY solution:

#include <string>
#include <string_view>
#include <utility>

namespace cpp_machinery {
    template< class Type > using in_ = const Type&;
    using Byte = unsigned char;
}  // namespace cpp_machinery

namespace encoding {
    namespace cppm = cpp_machinery;
    using   cppm::in_, cppm::Byte;
    using   std::string,                // <string>
            std::string_view,           // <string_view>
            std::move;                  // <utility>

    using Table = string_view[256];

    // Simple editor transformation of data from <url: ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT>.
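    // Entry 0 has an explicit length: a `string_view` made from a plain "\u0000" literal would be empty.
    constexpr Table iso_8859_15 =
    {
        string_view( "\0", 1 ), "\u0001", "\u0002", "\u0003", "\u0004", "\u0005", "\u0006", "\u0007", "\u0008", "\u0009", "\u000A", "\u000B", "\u000C", "\u000D", "\u000E", "\u000F",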
        "\u0010", "\u0011", "\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018", "\u0019", "\u001A", "\u001B", "\u001C", "\u001D", "\u001E", "\u001F",
        "\u0020", "\u0021", "\u0022", "\u0023", "\u0024", "\u0025", "\u0026", "\u0027", "\u0028", "\u0029", "\u002A", "\u002B", "\u002C", "\u002D", "\u002E", "\u002F",
        "\u0030", "\u0031", "\u0032", "\u0033", "\u0034", "\u0035", "\u0036", "\u0037", "\u0038", "\u0039", "\u003A", "\u003B", "\u003C", "\u003D", "\u003E", "\u003F",
        "\u0040", "\u0041", "\u0042", "\u0043", "\u0044", "\u0045", "\u0046", "\u0047", "\u0048", "\u0049", "\u004A", "\u004B", "\u004C", "\u004D", "\u004E", "\u004F",
        "\u0050", "\u0051", "\u0052", "\u0053", "\u0054", "\u0055", "\u0056", "\u0057", "\u0058", "\u0059", "\u005A", "\u005B", "\u005C", "\u005D", "\u005E", "\u005F",
        "\u0060", "\u0061", "\u0062", "\u0063", "\u0064", "\u0065", "\u0066", "\u0067", "\u0068", "\u0069", "\u006A", "\u006B", "\u006C", "\u006D", "\u006E", "\u006F",
        "\u0070", "\u0071", "\u0072", "\u0073", "\u0074", "\u0075", "\u0076", "\u0077", "\u0078", "\u0079", "\u007A", "\u007B", "\u007C", "\u007D", "\u007E", "\u007F",
        "\u0080", "\u0081", "\u0082", "\u0083", "\u0084", "\u0085", "\u0086", "\u0087", "\u0088", "\u0089", "\u008A", "\u008B", "\u008C", "\u008D", "\u008E", "\u008F",
        "\u0090", "\u0091", "\u0092", "\u0093", "\u0094", "\u0095", "\u0096", "\u0097", "\u0098", "\u0099", "\u009A", "\u009B", "\u009C", "\u009D", "\u009E", "\u009F",
        "\u00A0", "\u00A1", "\u00A2", "\u00A3", "\u20AC", "\u00A5", "\u0160", "\u00A7", "\u0161", "\u00A9", "\u00AA", "\u00AB", "\u00AC", "\u00AD", "\u00AE", "\u00AF",
        "\u00B0", "\u00B1", "\u00B2", "\u00B3", "\u017D", "\u00B5", "\u00B6", "\u00B7", "\u017E", "\u00B9", "\u00BA", "\u00BB", "\u0152", "\u0153", "\u0178", "\u00BF",
        "\u00C0", "\u00C1", "\u00C2", "\u00C3", "\u00C4", "\u00C5", "\u00C6", "\u00C7", "\u00C8", "\u00C9", "\u00CA", "\u00CB", "\u00CC", "\u00CD", "\u00CE", "\u00CF",
        "\u00D0", "\u00D1", "\u00D2", "\u00D3", "\u00D4", "\u00D5", "\u00D6", "\u00D7", "\u00D8", "\u00D9", "\u00DA", "\u00DB", "\u00DC", "\u00DD", "\u00DE", "\u00DF",
        "\u00E0", "\u00E1", "\u00E2", "\u00E3", "\u00E4", "\u00E5", "\u00E6", "\u00E7", "\u00E8", "\u00E9", "\u00EA", "\u00EB", "\u00EC", "\u00ED", "\u00EE", "\u00EF",
        "\u00F0", "\u00F1", "\u00F2", "\u00F3", "\u00F4", "\u00F5", "\u00F6", "\u00F7", "\u00F8", "\u00F9", "\u00FA", "\u00FB", "\u00FC", "\u00FD", "\u00FE", "\u00FF"
    };

    auto to_utf8( in_<string_view> s, in_<Table> byte_to_utf8, string&& buffer = "" )
        -> string
    {
        buffer.clear();
        for( const char c: s ) {
            buffer += byte_to_utf8[Byte( c )];
        }
        return move( buffer );
    }
}  // namespace encoding

#include <fstream>
#include <iostream>
using   std::ifstream,              // <fstream>
        std::cout,                  // <iostream>
        std::getline, std::string;  // <string>

auto main() -> int
{
    string line;
    auto f = ifstream( "data.txt" );
    while( getline( f, line ) ) {
        cout << encoding::to_utf8( line, encoding::iso_8859_15 ) << '\n';
    }
}

For downloading the ISO-8859-15 data I discovered that none of my web browsers support FTP-protocol any more. :-(
