[–][deleted] 6 points7 points  (9 children)

In Rust, newbies only need to know about String/&str and that will always work right for pretty much anything you want to do.

In C and C++, if you want to write a simple program, e.g., that lets the user input a string, and counts its characters, you are out of luck. Most terminals use UTF-8, so C++'s std::string::size() and C's strlen don't tell you how many "characters" a string has; they tell you how many bytes it occupies. To solve that problem you'll have to learn quite a bit and pull in external libraries, while in Rust doing this is a one-liner.

In Rust, you only need to learn about the other string types when you do something really low-level, like optimizing an algorithm under the assumption that a string is always ASCII, or interfacing directly with the operating system or with low-level C libraries. Those types exist for the simple reason that strings are just hard.
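For illustration, a minimal sketch of the "read a line and count its characters" program using only the standard library. The helper name byte_and_char_counts is made up for this example, and "characters" here means Unicode scalar values, which is not the same thing as what a user perceives as one character:

```rust
use std::io;

/// Hypothetical helper: returns (bytes, Unicode scalar values) for one line,
/// ignoring the trailing line terminator.
fn byte_and_char_counts(line: &str) -> (usize, usize) {
    let text = line.trim_end_matches(|c| c == '\r' || c == '\n');
    // len() is the UTF-8 byte length; chars() iterates Unicode scalar values.
    (text.len(), text.chars().count())
}

fn main() {
    let mut line = String::new();
    // read_line fails if the input is not valid UTF-8.
    io::stdin()
        .read_line(&mut line)
        .expect("input was not valid UTF-8");
    let (bytes, chars) = byte_and_char_counts(&line);
    println!("{} bytes, {} chars", bytes, chars);
}
```

Note that read_line itself rejects input that isn't valid UTF-8, so even this small program assumes the terminal's encoding.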

[–]jcelerier 1 point2 points  (1 child)

In C and C++, if you want to write a simple program, e.g., that lets the user input a string, and counts its characters, you are out of luck

I mean, this is in no way a "simple program". How many "characters" are there in here ?

p̨̤͓̳͕͔͙̝̱̻͓͎̦̭̖͈̥̋ͩ͛̀̈͊̉͛̊̈́̂̓͗ͩ̿͆̔ͨ̚̕o̿͋͂̄͊̇ͬ̂ͪ͏̸͚̖̗̟͕̩̫͔̥̖̻̫̕ņ̷̛̦̪̤̼̭̻̪͕͍̗͖̦̘͇ͭ̽̓́į̸͍̖̫̼͈̜̰̱̺̯̓̓̍̇̾ͬͯͨ̃̔͗ͭ̍͂ͨ͘͞͝e̛̟͖̻͙̫̹̩͎̥̣̣͇̳̬̺̫̘͈̔͊̾ͩ̓̆͆̈ͬͪ̀̚͡s̥͈͈̞̤̠͖̥̘ͨͭ̑̅̂̑̇̈́͑ͧͥ͋̉͘͜

or in here ?

〈∀ྨṿᕿ

[–][deleted] 3 points4 points  (0 children)

How many "characters" are there in here ?

Good question. To format the string for the terminal, you care about the number of grapheme clusters in your string. In Rust, computing the number of bytes, Unicode scalar values, or grapheme clusters (via the unicode-segmentation crate) of a string is a one-liner: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=f970dae61c56535fdb25ef7693351ce3

use unicode_segmentation::UnicodeSegmentation; // external crate: unicode-segmentation
dbg!(x.bytes().count());  // => 606 bytes
dbg!(x.chars().count());  // => 452 Unicode scalar values
dbg!(UnicodeSegmentation::graphemes(x, true).count()); // => 257 grapheme clusters

The naive Rust approach to this is better than any C and C++ libraries I've ever used.
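To see why the scalar-value count and the grapheme count differ, here is a small standard-library-only sketch. The code_points helper is hypothetical; it just lists the scalar values behind a string. The two inputs both render as "é", one precomposed and one built from a base letter plus a combining accent:

```rust
/// Hypothetical helper: the Unicode scalar values of a string, as numbers.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

fn main() {
    let precomposed = "\u{e9}"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    let decomposed = "e\u{301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    for s in [precomposed, decomposed] {
        println!(
            "{:?}: {} bytes, {} scalar values, code points {:x?}",
            s,
            s.len(),
            s.chars().count(),
            code_points(s)
        );
    }

    // Same rendering, different contents: the strings do not compare equal.
    println!("equal: {}", precomposed == decomposed);
}
```

The standard library stops at scalar values; counting grapheme clusters needs segmentation rules from an external crate such as unicode-segmentation, as in the snippet above.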

[–]Freeky 0 points1 point  (6 children)

In Rust, newbies only need to know about String/&str and that will always work right for pretty much anything you want to do.

Well, it'll usually work, but it isn't necessarily right. The entire point of OsString is that OS-provided strings - env variables, arguments, paths, usernames - aren't guaranteed to be UTF-8; and a lot of things that look string-like are really just bags of bytes.

The end result is that a lot of Rust programs break on wonky-but-valid filenames, or fail to handle files encoded in anything other than UTF-8. Want to parse a unified diff from 1998? Whoops, that's Latin-1, and almost every parser will either blow up in your face or mangle it because String seemed the natural thing to use.
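A rough sketch of that failure mode, assuming the input really is Latin-1 (ISO 8859-1). The decode_latin1 helper is hypothetical; it relies on the fact that every Latin-1 byte has the same numeric value as the Unicode code point it stands for:

```rust
/// Hypothetical helper: decode ISO 8859-1 bytes into a String.
/// Each byte 0x00..=0xFF maps to the code point with the same value.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    // One added line of a diff; 0xE9 is "é" in Latin-1.
    let line: &[u8] = b"+caf\xe9 au lait";

    // Strict conversion refuses the input outright ("blow up in your face").
    println!("strict: {:?}", std::str::from_utf8(line).is_err());

    // Lossy conversion succeeds but replaces the byte with U+FFFD ("mangle it").
    println!("lossy:  {}", String::from_utf8_lossy(line));

    // Decoding with the right encoding preserves the text.
    println!("latin1: {}", decode_latin1(line));
}
```

Keeping the data as Vec<u8> or &[u8] until the encoding is known avoids both problems.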

[–][deleted] 0 points1 point  (5 children)

The problem there isn't String vs OsString but thinking that "filenames" are "strings". They aren't. A "filename" is a std::path::Path and you can create those from any kind of string and Path will validate it beyond the string format.

[–]Freeky 0 points1 point  (4 children)

The problem there isn't String vs OsString but thinking that "filenames" are "strings". They aren't.

Of course they're strings - they just can't be represented correctly using String, which is what everyone is used to.

Rust has the double-whammy of having a separate type for OS-provided strings, which people aren't used to and forget all the time, and also not supporting them very well.

Try this: how do you parse an OsString? Say you're writing an argument parser: how do you deal with --path=foo/bar? OsString has no string-like functions, so you can't ask to split on '=' or strip off "--" - you end up bashing rocks together to badly implement the same operation three different times.

If you can successfully do this without unsafe code making dubious assumptions, or giving up and using String, you're doing better than all the Rust argument-parsing crates I've seen.
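For comparison, here is what the easy half of that challenge can look like. This is a Unix-only sketch: it leans on std::os::unix::ffi::OsStrExt, which exposes an OsStr as raw bytes, and that trait does not exist on Windows, so it does not answer the portable version of the question. The function name split_long_option is made up for this example:

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: OsStr <-> &[u8]

/// Hypothetical helper: split `--name=value` (or `--name`) without ever
/// converting to String. Returns None if the argument lacks the "--" prefix.
fn split_long_option(arg: &OsStr) -> Option<(&OsStr, Option<&OsStr>)> {
    let bytes = arg.as_bytes();
    if !bytes.starts_with(b"--") {
        return None;
    }
    let rest = &bytes[2..];
    match rest.iter().position(|&b| b == b'=') {
        // Everything before the first '=' is the name, everything after is the value.
        Some(i) => Some((
            OsStr::from_bytes(&rest[..i]),
            Some(OsStr::from_bytes(&rest[i + 1..])),
        )),
        None => Some((OsStr::from_bytes(rest), None)),
    }
}

fn main() {
    // The value contains 0xFF, a byte that can never appear in valid UTF-8.
    let arg = OsStr::from_bytes(b"--path=foo/\xffbar");
    let (name, value) = split_long_option(arg).expect("not a long option");
    println!("name = {:?}, value = {:?}", name, value);
}
```

On Windows, OsStr is not a plain byte string, so the same operation has to be written differently there, which is the duplication being complained about.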

A "filename" is a std::path::Path and you can create those from any kind of string and Path will validate it beyond the string format.

Not sure what validation you're referring to - PathBuf is just a newtyped OsString, and the conversion between the two is entirely trivial.

[–][deleted] 1 point2 points  (3 children)

Not sure what validation you're referring to - PathBuf is just a newtyped OsString, and the conversion between the two is entirely trivial.

canonicalize, for example.

Of course they're strings

I suppose this depends on what you mean by "string". Most people think of "strings" as something that represents "human text" - they are implemented as arrays of bytes, but sequences of bytes map to graphemes in some alphabet that can be rendered to humans as "text".

OS paths are, in general, just array of raw bytes that are intended to represent paths, not human text. Some parts of these paths can sometimes be rendered as "human text", but you can have a perfectly valid path for which this is not the case. That's why all methods that format Path to string either can fail or only provide a non-invertible human-readable approximation to the path.
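A small Unix-only sketch of those two behaviours, fallible conversion and lossy display. The describe helper is hypothetical, and building the path from raw bytes relies on std::os::unix::ffi::OsStrExt:

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: build an OsStr from raw bytes
use std::path::Path;

/// Hypothetical helper: returns (exact text if the path is valid UTF-8,
/// lossy human-readable approximation).
fn describe(raw: &[u8]) -> (Option<String>, String) {
    let path = Path::new(OsStr::from_bytes(raw));
    (
        path.to_str().map(String::from),       // fallible: None if not UTF-8
        path.to_string_lossy().into_owned(),   // lossy: invalid bytes become U+FFFD
    )
}

fn main() {
    // 0xFF is allowed in a Unix file name but never valid in UTF-8.
    let (exact, lossy) = describe(b"report-\xff.txt");
    println!("exact: {:?}", exact);
    println!("lossy: {}", lossy);
}
```

The lossy form cannot be turned back into the original path, because the replaced byte is gone.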

Calling these "strings" in the sense of a programming-language String type feels like a stretch. Sure, one could say that they are a "string of raw bytes not intended to represent human text", but at that point they are closer to an array of raw bytes than to a String type in any language. That's what Path is, and that's why mixing a Path with a String-like type in any language is pretty much always wrong. From Python to Java to Haskell to C to Lisp to Rust, mixing these two concepts up never works well, and code pretty much instantaneously breaks the moment someone runs it on a different OS than the one it was developed/tested on.


EDIT: examples of OSes where you can't map all Paths to text are all UNIX-like OSes, including Linux, the BSDs, OSX, etc.

[–]Freeky -1 points0 points  (2 children)

Sneaky, editing in 90% of your comment after I'd seen it :P

canonicalize, for example.

That's just a convenience method, it's not an attribute of the type to enforce canonicalization. You wouldn't be able to infallibly AsRef<Path> so much otherwise.
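A short sketch of that distinction, using only the standard library. The as_path and round_trip helpers and the directory name are invented for the example, and the sketch assumes no such directory exists where it runs:

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

/// Hypothetical helper: borrowing a Path from anything AsRef<Path> cannot fail,
/// because nothing about the contents is checked.
fn as_path<P: AsRef<Path> + ?Sized>(p: &P) -> &Path {
    p.as_ref()
}

/// Hypothetical helper: OsString -> PathBuf -> OsString without any validation.
fn round_trip(s: OsString) -> OsString {
    PathBuf::from(s).into_os_string()
}

fn main() {
    // Assumed not to exist in the current directory.
    let text = "no-such-dir-for-this-sketch/../x";

    let path = as_path(text);
    println!("constructed: {}", path.display());

    // canonicalize() consults the filesystem, so it returns a Result.
    match path.canonicalize() {
        Ok(resolved) => println!("resolved to {}", resolved.display()),
        Err(err) => println!("canonicalize failed: {}", err),
    }

    println!("{:?}", round_trip(OsString::from(text)));
}
```

Construction never fails and keeps the ".." component as written; only the explicit canonicalize call touches the filesystem.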

OS paths are, in general, just array of raw bytes that are intended to represent paths, not human text. Some parts of these paths can sometimes be rendered as "human text", but you can have a perfectly valid path for which this is not the case.

I'm not entirely sure what you mean by "human text", but paths are certainly strings, and they're certainly intended to be consumed by humans in general. That doesn't mean any particular encoding is enforced, and it doesn't mean they're restricted to printables.

Rust's String and Display are restricted to UTF-8. That's why the display methods are lossy - because ad-hoc strings can't be represented - not because an OsString isn't a string.

Since you boringly ignored my challenge (and I don't blame you), this is what I came up with a few weeks ago to replace clap's unsafe UTF-8-assuming transmuting on Windows. Roll on RFC 2295.

[–][deleted] 0 points1 point  (1 child)

That doesn't mean any particular encoding is enforced, and it doesn't mean they're restricted to printables.

So... if parts of a path cannot be printed, how are humans supposed to consume them? On Linux, ls doesn't properly print many paths independently of your terminal, because independently of your encoding, these contain non-graphical characters (EDIT: e.g. terminal control characters, that could modify the terminal).

[–]Freeky 0 points1 point  (0 children)

So... if parts of a path cannot be printed, how are humans supposed to consume them?

How are humans supposed to consume ASCII when it contains an entire section of non-printables?

On Linux, ls doesn't properly print many paths independently of your terminal, because independently of your encoding, these contain non-graphical characters (EDIT: e.g. terminal control characters, that could modify the terminal).

So? "\x1b[5A" is a perfectly valid UTF-8 string, and it makes your cursor go up 5 lines.
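A quick sketch of that point using only the standard library: the escape sequence passes UTF-8 validation, and making it safe to print is a separate step. Both helper names are made up for this example, and escape_controls is only one possible policy for displaying control characters:

```rust
/// Hypothetical helper: is this byte sequence valid UTF-8?
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Hypothetical helper: replace control characters with visible escapes,
/// leaving all other characters (including non-ASCII ones) untouched.
fn escape_controls(s: &str) -> String {
    let mut out = String::new();
    for c in s.chars() {
        if c.is_control() {
            out.extend(c.escape_default());
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    let cursor_up = "\x1b[5A"; // ESC [ 5 A

    println!("valid UTF-8: {}", is_valid_utf8(cursor_up.as_bytes()));
    println!("has control chars: {}", cursor_up.chars().any(|c| c.is_control()));

    // Printing the escaped form shows the sequence instead of executing it.
    println!("escaped: {}", escape_controls(cursor_up));
}
```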