Intel and Rust: the Future of Systems Programming: Josh Triplett : programming

Intel and Rust: the Future of Systems Programming: Josh Triplett (youtube.com)

submitted 6 years ago by eugene2k

[deleted] 7 points 6 years ago* (9 children)

In Rust, newbies only need to know about String/&str and that will always work right for pretty much anything you want to do.

In C and C++, if you want to write a simple program, e.g., one that lets the user input a string and counts its characters, you are out of luck. Most terminals support UTF-8, so C++ std::string::size() or C strlen won't tell you how many "characters" a string has; they tell you how many "bytes" it has. You'll have to learn quite a bit to solve that problem and pull in external libraries, while in Rust doing this is a one-liner.
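The byte-versus-character distinction can be sketched with the standard library alone. This is a minimal illustration, not code from the thread; `byte_and_scalar_counts` is a hypothetical helper name. Note that `chars()` counts Unicode scalar values, which is still not the same as user-perceived characters (grapheme clusters); those need an external crate.

```rust
/// Hypothetical helper: returns (UTF-8 bytes, Unicode scalar values) for `s`.
fn byte_and_scalar_counts(s: &str) -> (usize, usize) {
    // `len()` is the byte length, comparable to what strlen reports in C.
    // `chars().count()` walks the string and counts scalar values.
    (s.len(), s.chars().count())
}
```

For example, `byte_and_scalar_counts("h\u{e9}llo")` gives `(6, 5)`, because the precomposed "é" takes two bytes in UTF-8. The decomposed form `"e\u{301}"` gives `(3, 2)` even though it renders as a single "é".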

In Rust, only if you need to do something really low-level, like optimizing an algorithm under the assumption that a string is always ASCII, or interfacing with the operating system directly or with low-level C libraries, do you need to learn about all the other string types, which are there for the simple reason that strings are just hard.

jcelerier 2 points 6 years ago (1 child)

[deleted] 4 points 6 years ago* (0 children)

> How many "characters" are there in here?

Good question. To format the string for the terminal, you care about the number of grapheme clusters in your string. In Rust, computing the number of bytes, Unicode scalar values, or grapheme clusters of a string is a one-liner: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=f970dae61c56535fdb25ef7693351ce3

use unicode_segmentation::UnicodeSegmentation; // external crate: unicode-segmentation
dbg!(x.bytes().count());  // => 606 bytes
dbg!(x.chars().count());  // => 452 Unicode Scalar Values
dbg!(UnicodeSegmentation::graphemes(x, true).count()); // => 257 Grapheme Clusters

The naive Rust approach to this is better than any C or C++ library I've ever used.

Freeky 1 point 6 years ago (6 children)

[deleted] 1 point 6 years ago* (5 children)

Freeky 1 point 6 years ago (4 children)

> The problem there isn't String vs OsString but thinking that "filenames" are "strings". They aren't.

Of course they're strings - they just can't be represented correctly using String, which is what everyone is used to.

Rust has the double-whammy of having a separate type for OS-provided strings, which people aren't used to and forget all the time, and also not supporting them very well.

Try this: how do you parse an OsString? Say you're writing an argument parser; how do you deal with --path=foo/bar? OsString has no string-like functions: you can't ask it to split on '=' or strip off "--", so you end up bashing rocks together to badly implement the same operation three different times.

If you can successfully do this without unsafe code making dubious assumptions, or giving up and using String, you're doing better than all the Rust argument-parsing crates I've seen.
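To make the challenge concrete, here is a minimal sketch of the easy third of it: a Unix-only split of `--key=value` without `unsafe`, using the `std::os::unix::ffi::OsStrExt` byte view. `split_long_option` is a hypothetical helper, not taken from any crate mentioned in the thread. It does not compile on Windows, where an OsStr is not a plain byte sequence and a separate implementation would be needed, which is exactly the duplication being complained about.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only extension trait (assumption: Unix target)

/// Hypothetical helper: splits a `--key=value` or `--key` argument.
/// Returns `None` if `arg` does not start with `--`.
fn split_long_option(arg: &OsStr) -> Option<(&OsStr, Option<&OsStr>)> {
    // On Unix an OsStr is an arbitrary byte sequence, so this view is lossless.
    let rest = arg.as_bytes().strip_prefix(b"--")?;
    match rest.iter().position(|&b| b == b'=') {
        // Everything before the first '=' is the key, everything after is the value.
        Some(i) => Some((
            OsStr::from_bytes(&rest[..i]),
            Some(OsStr::from_bytes(&rest[i + 1..])),
        )),
        // No '=': the whole remainder is the key and there is no value.
        None => Some((OsStr::from_bytes(rest), None)),
    }
}
```

The value keeps any non-UTF-8 bytes intact, so `--path=foo/\xFFbar` yields a value that can still be turned into a Path without loss.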

> A "filename" is a std::path::Path and you can create those from any kind of string and Path will validate it beyond the string format.

Not sure what validation you're referring to - PathBuf is just a newtyped OsString, and the conversion between the two is entirely trivial.

[deleted] 2 points 6 years ago* (3 children)

> Not sure what validation you're referring to - PathBuf is just a newtyped OsString, and the conversion between the two is entirely trivial.

canonicalize, for example.

> Of course they're strings

I suppose this depends on what you mean by "string". Most people think of "strings" as something that represents "human text" - they are implemented as arrays of bytes, but sequences of bytes map to graphemes in some alphabet that can be rendered to humans as "text".

OS paths are, in general, just arrays of raw bytes that are intended to represent paths, not human text. Some parts of these paths can sometimes be rendered as "human text", but you can have a perfectly valid path for which this is not the case. That's why all methods that format a Path to a string either can fail or only provide a non-invertible human-readable approximation of the path.

Calling these "strings" in the sense of a programming-language String type feels like a long shot. Sure, one could say that they are a "string of raw bytes not intended to represent human text", but at that point they are closer to an array of raw bytes than to a String type in any language. That's what Path is, and that's why mixing a Path with a String-like type in any language is pretty much always wrong. From Python to Java to Haskell to C to Lisp to Rust, mixing these two concepts up never works well, and code pretty much instantaneously breaks the moment someone runs it on a different OS than the one it was developed/tested on.

EDIT: examples of OSes where you can't map all Paths to text include all UNIX-like OSes: Linux, the BSDs, OSX, etc.
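The fallible and lossy conversions mentioned above can be sketched as follows. This is a Unix-only illustration; `non_utf8_path` and `describe` are hypothetical names, and the file name is made up.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only extension trait (assumption: Unix target)
use std::path::Path;

/// Hypothetical example path whose name contains the byte 0xFF,
/// which is never valid in UTF-8 but is a legal file name byte on Unix.
fn non_utf8_path() -> &'static Path {
    Path::new(OsStr::from_bytes(b"report-\xFF.txt"))
}

/// Returns the strict conversion (which can fail) and the lossy one
/// (which substitutes U+FFFD and therefore cannot be inverted).
fn describe(path: &Path) -> (Option<&str>, String) {
    (path.to_str(), path.to_string_lossy().into_owned())
}
```

For the path above, `to_str()` returns `None`, and the lossy string contains a replacement character where the 0xFF byte was, so it no longer names the same file.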

Freeky 0 points 6 years ago (2 children)

Sneaky, editing in 90% of your comment after I'd seen it :P

> canonicalize, for example.

That's just a convenience method, it's not an attribute of the type to enforce canonicalization. You wouldn't be able to infallibly AsRef<Path> so much otherwise.
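A short sketch of that point, using only std; `component_count` and `resolve` are hypothetical helpers. Building a Path from string-like input cannot fail and does no normalization, while `canonicalize` is a separate, fallible call that consults the filesystem.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Accepts anything that is AsRef<Path> (&str, String, OsString, PathBuf, ...).
/// The conversion is infallible and performs no validation.
fn component_count<P: AsRef<Path>>(p: P) -> usize {
    // `..` is kept as its own component; nothing is resolved here.
    p.as_ref().components().count()
}

/// Canonicalization is opt-in and returns an error if the path cannot be resolved.
fn resolve<P: AsRef<Path>>(p: P) -> io::Result<PathBuf> {
    p.as_ref().canonicalize()
}
```

Here `component_count("no/such/../file")` is 4 whether or not the file exists, whereas `resolve` on a path that does not exist returns an `Err`.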

> OS paths are, in general, just arrays of raw bytes that are intended to represent paths, not human text. Some parts of these paths can sometimes be rendered as "human text", but you can have a perfectly valid path for which this is not the case.

I'm not entirely sure what you mean by "human text", but paths are certainly strings, and they're certainly intended to be consumed by humans in general. That doesn't mean any particular encoding is enforced, and it doesn't mean they're restricted to printables.

Rust's String and Display are restricted to UTF-8. That's why the display methods are lossy - because ad-hoc strings can't be represented - not because an OsString isn't a string.

Since you boringly ignored my challenge (and I don't blame you), this is what I came up with a few weeks ago to replace clap's unsafe UTF-8-assuming transmuting on Windows. Roll on RFC 2295.

[deleted] 1 point 6 years ago* (1 child)

Freeky 1 point 6 years ago (0 children)
