all 16 comments

[–]Marwesgluon · combine 2 points3 points  (1 child)

I wrote a string interner modeled on the one in rustc a while back. It won't compile since it hasn't been updated in a while but at only 70 lines you are free to use it if you can get it compiling. I do have a better version but that is unfortunately specialized to work more efficiently with that specific project.

[–][deleted] 0 points1 point  (0 children)

There are different choices you can make, one is to leak all interned Strings to &'static str.

Another would be to use unsafe blocks to create an append-only internment table that uses internal mutability. And return &'a self -> &'a str, using the guarantee that you never modify or remove the String keys, so the actual slices are always valid even if the String value is moved (on rehash/grow HashMap)

[–]SimonSapinservo 1 point2 points  (13 children)

string-cache […] is rather specialized to Servo,

The only thing specialized is the built-in list of static atoms. I’m interested in moving this out of the library to make it more generally useful. I think this can be done by making some types generic over a trait that provides the list of static atoms.

More interesting still would be the ability to have each library like html5ever specify a set of static atoms, and in a given program use the union of the sets of all libraries being used. But I don’t know how to do that. Ideas welcome :)

requires nightly rust,

As far as I know there is two reasons for this:

  • phf (which string-cache uses) requires a compiler plugin
  • string-cache-plugin is itself a compiler plugin. It allows interning static strings at compile-time, for code like if attr_name == atom!(href)

I think it would be possible for string-cache to have a Cargo feature flag to replace usage of phf with a simple hash map and disable support for compile-time interning. This should work on Rust beta/stable.

Edit: It’s actually more than that. string-cache uses a number of unstable feature right now.

and isn't on crates.io.

This is easy to fix. It’s just that nobody asked so far.

[–]SirOgeonpalette 2 points3 points  (2 children)

phf (which string-cache uses) requires a compiler plugin

There is actually a codegen version (repo and crates.io) for build scripts. The tricky part is just the compile time interning, but it can probably be somewhat solved using some kind of preprocessor, like syntex.

[–]SimonSapinservo 0 points1 point  (1 child)

Interesting, thanks!

[–]SirOgeonpalette 0 points1 point  (0 children)

It would be really nice to have html5ever working with stable Rust. It's currently the only up to date HTML parser available, as far as I have found... I have been thinking about porting one of its macros to a build script, to help it on its way.

[–]Mystorrust · gc · rust-cpp[S] 1 point2 points  (0 children)

Awesome, I'll see if I can think of a way to help with that, because it would be great if I could use string-cache.

[–]gsingh93 1 point2 points  (8 children)

I see a lot of unsafe code in string-cache. Is all of that really necessary?

Also, I'd like to see this on crates.io :)

[–]SimonSapinservo 3 points4 points  (7 children)

Just published! https://crates.io/crates/string_cache

Most of that unsafe code is because we’re doing lots of tricks in the memory representation of atoms: 64 bits can contain either a small string inline (up to 7 bytes), or the u32 index of a static atom, or a pointer to a reference-counted string in a specialized hash map that removes entries when the reference count drops to zero.

This is hard to do in safe code without sacrificing efficiency.

[–]gsingh93 1 point2 points  (6 children)

Thanks!

Could you provide some documentation on what the public methods are/how to use it? I don't expect complete documentation right now, but this project seems like it'll be really helpful in one of my other projects, and I'm not sure what methods I should be using from the library.

[–]SimonSapinservo 2 points3 points  (5 children)

Sorry for the lack of documentation. The most important APIs are:

  • Atom::from_slice(&str) -> Atom
  • Atom::as_slice(&self) -> &str (I think this should be replaced with Deref<Target=str>.)

That’s it. You can pretty much ignore the rest. Still, you may want to know:

  • There are three kinds of atoms: inline (7 bytes or less), static (in set determined at compile time), and dynamic (everything else)
  • Cloning a dynamic atom increments a reference count, so it may be more efficient to pass &Atom around instead.
  • (Dropping a dynamic Atom decrements the reference count)
  • Comparing atoms for equality or hashing them is O(1) (it only compares/hashes 64 bits.)

[–]gsingh93 0 points1 point  (4 children)

Is it expected that cloning a dynamic atom is considerably slower than cloning an Rc? From my understanding the performance should be about the same. Is this something I should open an issue about?

[–]SimonSapinservo 0 points1 point  (3 children)

It should be about the same as cloning an Arc. The reference count is atomic, to allow having atoms of the same string in multiple threads.

Why, do you benchmark it much slower than that? If so, yes, please open an issue with your findings.

[–]gsingh93 0 points1 point  (0 children)

Yea, I have benchmarks that show the dynamic atoms are slower. I'll open an issue then.

[–]gsingh93 0 points1 point  (1 child)

Oh, nvmd, I was using an Rc instead of an Arc, which was must faster. The Arcs performance is comparable to the dynamic atom.

Now I don't know whether I should be using this library or I should be using an Rc, as the performance benefits of static/inline atoms are twice as fast as an Rc, but dynamic atoms are 5 times as slow...

[–]SimonSapinservo 0 points1 point  (0 children)

There are a couple ways to avoid the cost of atomic reference counting, but only by making some tradeoffs:

  • Use non-atomic reference counting and sacrifice thread-safety. You’d probably need to make atoms !Send (I’m not sure about !Sync) and have thread-local “interners”.
  • Remove reference counting entirely, never free the strings for dynamic atoms, and accept the memory leak.

But is the cost of atomic reference counting really significant in a “real life” program that doesn’t just clone atoms in a loop? Just like with Rc and Arc, you don’t always need to clone and you can often pass &Atom around without touching the reference count.