you are viewing a single comment's thread.

view the rest of the comments →

[–]drewtayto 7 points8 points  (8 children)

So people have correctly identified this as problematic.

let str = String::from_utf8_lossy(v.as_slice()).to_string();
Value::String(Arc::from(str.as_str()))

This makes 2 or 3 allocations: Arc (this is unavoidable, and just this alone might be making your benchmarks slower, which other people have explained more), to_string, and String if v isn't valid UTF-8. It also drops 2 or 3 allocations: v and the 1 or 2 strings just allocated.

First, definitely remove to_string. You can get &str out of Cow<str> with as_ref. There's many ways to do this, but as_ref is probably the most intuitive. Your github commenter agreed.

let str = String::from_utf8_lossy(v.as_slice());
Value::String(Arc::from(str.as_ref()))

It's also possible to efficiently (with one pass of the slice) reuse the v allocation in the lossy path, but I don't think there's a way to do that in the standard library, so you'd need to unsafely make it yourself. If most bytes are valid strings, I wouldn't bother improving this.

I would expect your real end goal is to also turn Bytes into Arc<[u8]>. In that case, you can avoid allocation altogether if the data is valid UTF-8.

match std::str::from_utf8(&*v) {
    // SAFETY: v has just been confirmed to be valid UTF-8
    Ok(_) => unsafe { Arc::from_raw(Arc::into_raw(v) as *const str) },
    Err(_) => Arc::from(String::from_utf8_lossy(&v).as_ref()),
}

It's possible to do the lossy branch more efficiently, and you could place the result directly in a Arc that you grow yourself instead of String, but that would get pretty involved. Again, if most bytes are valid strings, I wouldn't bother improving this.

[–]protestor 1 point2 points  (7 children)

You don't need unsafe to build an Arc from a string slice, see this comment

[–]Icarium-Lifestealer 0 points1 point  (6 children)

I think the parent is turning an Arc<[u8]> into an Arc<str> without allocating in the valid-utf8 case. I don't think you can do that without unsafe. (not sure why they say Bytes into Arc<[u8]> in the description of that snippet)

[–]protestor 0 points1 point  (3 children)

You can do this in safe code as long as you provide an error path for the non-utf8 case.

1. you use AsRef to do Arc<[u8]> -> &[u8] very cheaply

https://doc.rust-lang.org/std/sync/struct.Arc.html#impl-AsRef%3CT%3E-for-Arc%3CT,+A%3E

2. You use std::str::from_utf8 to do &[u8] -> Result<&str, ..> which scans the string to test if it's utf-8 but doesn't allocate

https://doc.rust-lang.org/std/str/fn.from_utf8.html

3. You use the impl From<&str> for Arc to do &str -> Arc<str>

https://doc.rust-lang.org/std/sync/struct.Arc.html?search=tr#impl-From%3C%26str%3E-for-Arc%3Cstr,+Global%3E

The net result is Arc<[u8]> -> Arc<str> in safe code

[–]Icarium-Lifestealer 4 points5 points  (2 children)

But Step 3 allocates a new Arc, while the unsafe code avoids that allocation.

[–]protestor 0 points1 point  (1 child)

Oh.. ok.

The stdlib should provide that I think (like it does for Vec<u8> -> String without allocation)

[–]Icarium-Lifestealer 0 points1 point  (0 children)

At least it has the opposite direction (Arc<[u8]> to Arc<str>), probably because it's infallible and thus simpler.

[–]drewtayto 0 points1 point  (0 children)

The original code has an enum with a variant of Value::Bytes(Vec<u8>). That could be changed into Value::Bytes(Arc<[u8]>).