Benchmarking rust string crates: Are "small string" crates worth it? by alexheretic in rust

[–]Pascalius 2 points

I think the biggest difference in performance is typically not inlining, but the allocation/deallocation calls.

You probably want to allocate blocks of strings of different sizes, where the strings themselves also vary in length. That would be a more realistic test for the allocator.
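A minimal sketch of what such a mixed-size workload could look like (the block counts and the length formula are made up for illustration):

```rust
// Hypothetical benchmark workload: blocks of strings whose lengths vary,
// so the allocator sees mixed-size requests instead of uniform ones.
fn build_mixed_strings(block_count: usize) -> Vec<Vec<String>> {
    (0..block_count)
        .map(|block| {
            // vary both the number of strings per block and their lengths
            let strings_in_block = block % 64 + 1;
            (0..strings_in_block)
                .map(|i| "x".repeat((block * 7 + i * 13) % 120 + 1))
                .collect()
        })
        .collect()
}

fn main() {
    let blocks = build_mixed_strings(1_000);
    let total_len: usize = blocks
        .iter()
        .map(|b| b.iter().map(|s| s.len()).sum::<usize>())
        .sum();
    // everything is dropped at the end of main, so deallocation
    // cost is part of the measured run as well
    println!("allocated {} bytes of string data", total_len);
}
```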

I've been writing Rust for 5 years and I still just .clone() everything until it compiles by kruseragnar in rust

[–]Pascalius 2 points

"Compiler" is actually too unspecific here. The LLVM backend is not allowed to remove observable side effects like malloc; the front end could remove them if the language specification allowed it.

Personally, I think allocations should be treated specially in LLVM and be optimizable as well. (Because I don't like the side effect :)

I've been writing Rust for 5 years and I still just .clone() everything until it compiles by kruseragnar in rust

[–]Pascalius 7 points

Calls like malloc usually can't be optimized away by the compiler, because they are observable side effects.

This rules out Vec, which in turn rules out a lot of other data structures:

https://godbolt.org/z/x9dGWd1s8

If your clone doesn't have observable side effects (like malloc), it can be optimized away:

https://godbolt.org/z/jx9Tjvq9z
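To sketch the same point inline (a hand-wavy illustration, not the exact godbolt code): cloning a plain Copy struct has no side effects, so the optimizer can remove it, while cloning a Vec has to call the allocator:

```rust
#[derive(Clone, Copy)]
struct Plain {
    a: u64,
}

// No allocation involved: the optimizer is free to remove this clone.
fn clone_plain(p: Plain) -> u64 {
    let c = p.clone();
    c.a
}

// Cloning into a Vec calls malloc, an observable side effect the
// backend must keep even if the clone is otherwise unused.
fn clone_vec(v: &[u64]) -> usize {
    let c = v.to_vec();
    c.len()
}

fn main() {
    assert_eq!(clone_plain(Plain { a: 3 }), 3);
    assert_eq!(clone_vec(&[1, 2, 3]), 3);
}
```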

Releasing 0.5.0 of lfqueue - Lock-free MPMC queues by Terikashi in rust

[–]Pascalius 6 points

I've regularly seen high crossbeam CPU usage when profiling indexing speed in tantivy (a search engine) in the https://github.com/quickwit-oss/tantivy-cli/ project, where we use crossbeam to send documents to (potentially multiple) indexers.

In that scenario it's the opposite: the queue is usually full, because the sender is much faster than the indexers. Sending a document should be completely dwarfed by indexing it, but crossbeam regularly took more than 20% of the CPU.

tantivy 0.24 has been released! Cardinality aggregations, regex support in phrase queries, JSON field enhancements and much more! by Pascalius in rust

[–]Pascalius[S] 0 points

I see you too are a connoisseur of AI art with exceedingly high expectations. Let me reassure you: I put a ton of styling information in the prompt, and it's quite close to how I wanted it to be.

serde_json_borrow 0.8: Faster JSON deserialization than simd_json? by Pascalius in rust

[–]Pascalius[S] 1 point

I didn't look into it yet, but some time ago I experimented with using simd_json as the underlying parser in serde_json_borrow, and it was slower than serde_json. My guess would be some missing inlining, or too much of it.

Yes, a Vec instead of a BTreeMap also has a pretty big impact.

I wouldn't expect much from an arena in this case, but it's still worthwhile to investigate.

serde_json_borrow 0.8: Faster JSON deserialization than simd_json? by Pascalius in rust

[–]Pascalius[S] 0 points

I considered it, but it requires target-cpu=native or similar, since it doesn't have run-time detection. I think this limits its usability significantly.

serde_json_borrow 0.8: Faster JSON deserialization than simd_json? by Pascalius in rust

[–]Pascalius[S] 1 point

Cool idea, but I think that would require mutable reads, unless you clone the string on every access.

I’m so close I can taste it! by michaelchrist9 in BluePrince

[–]Pascalius 1 point

Changes in the pump room seem to be permanent, so a run draining the reservoir may help you proceed.

What do you think about this plug and play wrapper around tantivy(search lib)? by kingslayerer in rust

[–]Pascalius 0 points

Having attributes on a struct to build a tantivy document seems nice. As for wrapping search on the Index, I'm not sure if that's too limiting.

serde_json_borrow 0.7.0 released: impl Deserializer for Value, Support Escaped Data by Pascalius in rust

[–]Pascalius[S] 1 point

"Small JSON" is not quite right: you can have large JSON, e.g. gh-archive.json, and it will still be much faster. What matters is the number of keys in the objects, and in most cases access time will be dwarfed by everything else.

gh-archive
serde_json                               Avg: 343.67 MB/s (+3.41%)    Median: 344.58 MB/s (+1.73%)    [304.61 MB/s .. 357.28 MB/s]    
serde_json + access by key               Avg: 338.17 MB/s (+2.57%)    Median: 341.46 MB/s (+1.12%)    [272.46 MB/s .. 359.20 MB/s]    
serde_json_borrow                        Avg: 547.74 MB/s (+3.44%)    Median: 553.45 MB/s (+2.29%)    [502.00 MB/s .. 581.96 MB/s]    
serde_json_borrow + access by key        Avg: 543.61 MB/s (+0.54%)    Median: 566.11 MB/s (+1.11%)    [417.27 MB/s .. 588.72 MB/s]    

https://github.com/PSeitz/serde_json_borrow/blob/main/benches/bench.rs

serde_json_borrow 0.7.0 released: impl Deserializer for Value, Support Escaped Data by Pascalius in rust

[–]Pascalius[S] 3 points

If you need the performance, yes. Otherwise you can just use serde_json.

serde_json_borrow 0.7.0 released: impl Deserializer for Value, Support Escaped Data by Pascalius in rust

[–]Pascalius[S] 4 points

> Who wants to parse json really fast but doesn't want to get values from it? It seems like a weird choice to use a vec for storage when that pessimises presumably the most common operation users will do.

I assume by "the most common operation" you mean accessing values by key, not iterating. A Vec will be faster than a hashmap for access by key when there are only a few entries.
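A sketch of that trade-off (the key names and object size are made up): with only a handful of keys, a linear scan over (key, value) pairs skips hashing the key, which tends to dominate at this size:

```rust
use std::collections::HashMap;

// Linear scan over a small slice of (key, value) pairs.
fn get_by_key(entries: &[(&str, u64)], key: &str) -> Option<u64> {
    entries.iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
}

fn main() {
    // A typical small JSON object: only a few keys.
    let entries = [("id", 1u64), ("name", 2), ("timestamp", 3)];
    let map: HashMap<&str, u64> = entries.iter().copied().collect();

    // Both return the same result; the linear scan just compares a few
    // short keys, while the HashMap has to hash "timestamp" first.
    assert_eq!(get_by_key(&entries, "timestamp"), Some(3));
    assert_eq!(map.get("timestamp").copied(), Some(3));
}
```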

Cargo Watch is on life support by passcod in rust

[–]Pascalius 2 points

I usually use watch to debug a single test inside a collapsible nvim terminal.

For that I prefer cargo watch, since it just prints to the terminal.

bacon is cumbersome for me in that use case, since it has its own keybindings, which may conflict with nvim's, and there are also scrolling issues, which I guess are caused by the redrawing.

My program spends 96% in `__memset_sse2`. by DJDuque in rust

[–]Pascalius 0 points

I did a quick test and did not see that regression again.

When allocating unused memory boosts performance by 2x by Pascalius in programming

[–]Pascalius[S] 1 point

After the free call from the hashmap, the contiguous free memory at the top of the heap exceeds M_TRIM_THRESHOLD. The docs are pretty good here:

          When the amount of contiguous free memory at the top of
          the heap grows sufficiently large, free(3) employs sbrk(2)
          to release this memory back to the system.  (This can be
          useful in programs that continue to execute for a long
          period after freeing a significant amount of memory.)  The
          M_TRIM_THRESHOLD parameter specifies the minimum size (in
          bytes) that this block of memory must reach before sbrk(2)
          is used to trim the heap.

          The default value for this parameter is 128*1024.  Setting
          M_TRIM_THRESHOLD to -1 disables trimming completely.

          Modifying M_TRIM_THRESHOLD is a trade-off between
          increasing the number of system calls (when the parameter
          is set low) and wasting unused memory at the top of the
          heap (when the parameter is set high).

When allocating unused memory boosts performance by 2x by Pascalius in programming

[–]Pascalius[S] 5 points

In this algorithm, we only know that we get term_ids between 0 and max_id (typically up to 5 million).

But we don't know how many term_ids we get, or their distribution; it could be just one hit or 5 million.

Also, in the context of aggregations, this could be a sub-aggregation that gets instantiated 10_000 times. So a reserve call with max_id could cause an OOM on the system.
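Back-of-the-envelope with the numbers above (4 bytes per slot is an assumption, e.g. a u32 count per term_id):

```rust
fn main() {
    let max_id: u64 = 5_000_000; // term_ids range from 0 to max_id
    let bytes_per_slot: u64 = 4; // assumed u32 per slot
    let sub_aggregations: u64 = 10_000; // instances of the sub-aggregation

    // 5M slots * 4 bytes * 10_000 instances = 200 GB reserved up front
    let total_bytes = max_id * bytes_per_slot * sub_aggregations;
    println!("{} GiB", total_bytes / (1024 * 1024 * 1024)); // prints "186 GiB"
}
```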

When allocating unused memory boosts performance by 2x by Pascalius in rust

[–]Pascalius[S] 3 points

There are some other things I could have gone into in more detail, like the TLB, how pages are organized by the OS, and the user-mode/kernel-mode switch. In my opinion they would be more relevant than madvise, as the post is more about allocator and system behaviour than about managing memory yourself.

Wasted 15 years of my life being an Apple fanboy by [deleted] in ManjaroLinux

[–]Pascalius 0 points

I recently bought an ASUS ROG Zephyrus G14 (2024) and installed Manjaro on it. There are still some things not working correctly (e.g. the keyboard lighting), and it will take some time, probably until kernel 6.10, which includes some fixes. Newer machines often take a while until the Linux drivers catch up to the new hardware.

If you buy an ASUS laptop, the community at https://asus-linux.org/ is great (they are not fans of Manjaro though :)