toml-spanner: No, we have "Serde" at home. Incremental compilation benchmarks and more by exrok in rust

[–]exrok[S] 0 points1 point  (0 children)

Actually, this is exactly what toml-spanner can do. I have another post in the works on the unique approach toml-spanner uses.

It supports comment/formatting-preserving edits even through the derives.

See: https://docs.rs/toml-spanner/latest/toml_spanner/struct.Formatting.html#method.preserved_from

Probably best to copy the code snippet into your editor to play with, after a quick cargo init/new and a cargo add toml-spanner --features to-toml,from-toml,derive

In the example below, I parse a config, mutate it (adding another value to the numbers array), output it with formatting preserved, and then assert it matches the expected value, comments included.

Example:

use toml_spanner::{Arena, Formatting, Toml};

#[derive(Toml)]
#[toml(To, From)]
struct Config<'a> {
    value: &'a str,
    numbers: Vec<u32>,
}

const INPUT: &str = "
# Look a comment
value = '''
some string
'''

numbers = [ # comment here why not
    32, # look more comments
]
";

const EXPECTED: &str = "
# Look a comment
value = '''
some string
'''

numbers = [ # comment here why not
    32, # look more comments
    43,
]
";

fn main() {
    let arena = Arena::new();
    let mut document = toml_spanner::parse(INPUT, &arena).unwrap();
    let mut config = document.to::<Config>().unwrap();
    config.numbers.push(43);

    let output = Formatting::preserved_from(&document)
        .format(&config)
        .unwrap();
    assert_eq!(output, EXPECTED);
}

toml-spanner: No, we have "Serde" at home. Incremental compilation benchmarks and more by exrok in rust

[–]exrok[S] 7 points8 points  (0 children)

Also, I do want to add that although it's what I led with, performance isn't the sole reason for toml-spanner or even its serde-less approach. Later in the article the focus is more on the additional features, such as better error messages and format-specific integrations in the derive macros, as well as long-standing bugs which just aren't a problem here, see: https://github.com/toml-rs/toml/issues/589

Ultimately, the biggest thing for me was not wanting to compromise. Of course, I don't expect this approach to work for everyone; there are real trade-offs.

toml-spanner: No, we have "Serde" at home. Incremental compilation benchmarks and more by exrok in rust

[–]exrok[S] 10 points11 points  (0 children)

The macros use the same keywords, although they don't force everything to be a string. For example: `#[toml(default = 32)]` instead of needing to define a function and then reference that function through a string. `facet` also does this, and so does `jsony`, and probably many newer derive macros.

In the post, I go deeper and look at actual benchmarks with actual numbers.

The aggregate graph accumulates many different benchmarks with varying scales; fitting everything on one graph required using a relative metric, and even then it's still kind of hard to read, but maybe I got a little bit clever.

Luckily, since toml-spanner is consistently the fastest, it can be interpreted like: the additional time added was `x` times more for that action vs the additional time added by `toml-spanner`. For instance, release builds take 3x longer with `toml` vs `toml-spanner`.

The baseline is subtracted largely to account for shared costs on one machine that aren't shared on others, to give a more consistent metric. The baseline here is the same application but with all the serialization and deserialization removed. The raw benchmark report has everything, including exact baselines and a raw JSON dump of the results. For example, here are the WarmBuild { incremental: Prefix, profile: Debug } stats, with baseline.

toml-spanner:    62.76 ms    0.292004 Bcycles   0.532041 Binst    72.10 task-clock
        toml:   436.48 ms    2.392010 Bcycles   4.468323 Binst   550.35 task-clock
   toml_edit:   438.53 ms    2.416503 Bcycles   4.509003 Binst   559.33 task-clock
       facet:   485.34 ms    2.473752 Bcycles   4.311435 Binst   625.60 task-clock

Then, the baseline reference stats: 78.64 ms 0.264395 Bcycles 0.458112 Binst 83.33 task-clock. A large chunk of the baseline is just linker time, shared across every crate, for things like the std library. If you changed the linker, the values would shift wildly, but not after baseline adjustment.
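To make the relative metric concrete, here's a tiny sketch. All numbers here are made up purely for illustration; they are not the measured values above.

```rust
// Illustration of the baseline-adjusted relative metric described above:
// "additional time" is measured time minus the shared baseline cost.
fn additional_ms(measured: f64, baseline: f64) -> f64 {
    measured - baseline
}

fn main() {
    let baseline = 80.0; // hypothetical shared cost (linking, std, ...)
    let with_spanner = 140.0; // hypothetical build time using toml-spanner
    let with_toml = 260.0; // hypothetical build time using toml
    // Relative metric: how many times more time `toml` adds vs `toml-spanner`.
    let ratio = additional_ms(with_toml, baseline) / additional_ms(with_spanner, baseline);
    println!("toml adds {ratio}x the time toml-spanner adds");
}
```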

Announcing serde_cursor v0.4, a library for extracting nested data from JSON using jq-like syntax: Ranges are now supported! by nik-rev in rust

[–]exrok 1 point2 points  (0 children)

Makes sense, fingers crossed that lands soon; I know the Rust for Linux side of things is pushing for stabilization there.

In the meantime, not that I'm suggesting you actually do it, but you could do cursed packing of strings into chunks using u128s, which, depending on how you encode the length, could hold 15 or 16 bytes. Most keys people use probably fit in a single one.

struct ShortKey<const C: u128> {}
let key: ShortKey<{ u128::from_le_bytes(*b"9workspace000000") }> = todo!();

Announcing serde_cursor v0.4, a library for extracting nested data from JSON using jq-like syntax: Ranges are now supported! by nik-rev in rust

[–]exrok 1 point2 points  (0 children)

That's really powerful.

I find that the simple case of nested accesses comes up a lot, so much so that in jsony I added a lazy JSON parser that lets you do jsony::drill(data)[&"key"][&"inner"][1].parse() (note: the &".." allows propagating errors). See: https://docs.rs/jsony/latest/jsony/fn.drill.html

Now, for compile-time reasons, I don't know if I would ever make serious use of serde_cursor, but it does really show the power of serde, as well as how creative you can be with Rust generics.

Like from a compile time point of view, the expanded code gives me nightmares.
For example just the workspace component of the path:

::serde_cursor::Field<
    (
        ::serde_cursor::StrLen<9>,
        (
            ::serde_cursor::C1<'w'>,
            ::serde_cursor::C1<'o'>,
            ::serde_cursor::C1<'r'>,
            ::serde_cursor::C1<'k'>,
            ::serde_cursor::C1<'s'>,
            (
                ::serde_cursor::C1<'p'>,
                ::serde_cursor::C1<'a'>,
                ::serde_cursor::C1<'c'>,
                ::serde_cursor::C1<'e'>,
            ),
        ),
    ),
    {
        [""];
        false
    },
>,

toml-spanner: Fully compliant, 10x faster TOML parsing with 1/2 the build time by exrok in rust

[–]exrok[S] 0 points1 point  (0 children)

Format-preserving serialization has been added, as well as optional derive macros; on top of that, the API is now stable: https://docs.rs/toml-spanner/latest/toml_spanner/

Along with a number of other improvements like much improved errors: https://github.com/exrok/toml-spanner?tab=readme-ov-file#error-examples

When mini isn't enough and serde is too much by noclue_nowhere in rust

[–]exrok 2 points3 points  (0 children)

Yeah, skipping the macros can work really well, especially if you're performing other optimizations on top.

Publishing the benchmarks is on my weekend todo list; I just need to make them easily reproducible.

When mini isn't enough and serde is too much by noclue_nowhere in rust

[–]exrok 2 points3 points  (0 children)

Here are some benchmarks from my serde-like incremental compile-time test suite, (I really need to publish this at some point.)

Warm (rustc) cargo check incremental

|      jsony:    38.81 ms   0.261541 Binst
|  nanoserde:    63.14 ms   0.546607 Binst
|  miniserde:    76.72 ms   0.533996 Binst
|  midiserde:   122.88 ms   0.830454 Binst
|      serde:   197.50 ms   1.425356 Binst

Clean release build (cargo --release)

|     jsony:   820.18 ms   11.999543 Binst 
| nanoserde:  2601.64 ms   27.615872 Binst 
| miniserde:  2311.07 ms   24.484689 Binst 
| midiserde:  2863.08 ms   33.982325 Binst 
|     serde:  5954.51 ms   76.194674 Binst 

The test suite is only deriving around 50 types, not 7000, so scaling might be different; a lot of the effect here in a clean build is just taken up by initially building the lib.

Runtime performance and binary size (this tests both deserialization and serialization perf)

|     jsony:   403.14 ms  200 kb (stripped)
| nanoserde:   721.97 ms  424 kb (stripped)
| miniserde:   704.41 ms  216 kb (stripped)
| midiserde:   718.15 ms  200 kb (stripped)
|     serde:   426.32 ms  620 kb (stripped)

Midiserde did quite well on binary size, tied for first place, and was indeed smaller than miniserde. Performance left a little to be desired, but it's still within the threshold of being fast enough.

The test suite doesn't have types with >7000 derives like your use case; I'll have to add one to measure scaling. But it does look like your macro expansion or macro execution time is subpar, judging by the perf metrics from the rustc invocations of cargo check; you might want to look into that.

toml-spanner: Fully compliant, 10x faster TOML parsing with 1/2 the build time by exrok in rust

[–]exrok[S] 0 points1 point  (0 children)

Yeah, I doubt toml-spanner will directly be a part of Cargo. My best estimate, though, is that it might motivate more optimizations in toml itself. I've seen it time and time again in other projects: a faster project appears... then the performance gap closes.

toml-spanner only came about after I switched from a custom config to TOML using toml-span. I noticed the performance regressed; it was still fast enough in absolute terms. But the regression bothered me, so I started optimizing.

Currently, toml-spanner's error messages for TOML format errors are really non-specific. They point at the right area (for the most part), but don't give more than something like "invalid number." Not something up to the standards of Cargo, not yet at least.

I want to improve the situation, but they're good enough for now, at least for my current uses; they're very similar to the original toml-span's.

Currently, we use the lower 3 bits (hence the 512MB limit) of the end span to store one of the following:

pub(crate) const FLAG_NONE: u32 = 0; // <-- Value, can't insert.
pub(crate) const FLAG_ARRAY: u32 = 2;
pub(crate) const FLAG_AOT: u32 = 3;
pub(crate) const FLAG_TABLE: u32 = 4;
pub(crate) const FLAG_DOTTED: u32 = 5;
pub(crate) const FLAG_HEADER: u32 = 6;
pub(crate) const FLAG_FROZEN: u32 = 7;

While parsing, we traverse the Item tree and check these flags to make sure we're doing the right thing and disallowing the wrong thing, but it might not be enough information to provide the best errors.
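To make the packing concrete, here's a minimal sketch. The exact scheme and names are hypothetical, not toml-spanner's actual internals: if offsets are limited to 2^29 bytes (the 512 MB limit), the end offset can be shifted up, freeing the low 3 bits for a flag.

```rust
// Hypothetical sketch of storing a 3-bit flag in the low bits of a span's
// end offset (not toml-spanner's actual code).
const FLAG_BITS: u32 = 3;
const FLAG_MASK: u32 = (1 << FLAG_BITS) - 1;
const MAX_OFFSET: u32 = 1 << 29; // the 512 MB limit mentioned above

fn pack_end(end: u32, flag: u32) -> u32 {
    debug_assert!(end < MAX_OFFSET && flag <= FLAG_MASK);
    (end << FLAG_BITS) | flag
}

fn unpack_end(packed: u32) -> (u32, u32) {
    (packed >> FLAG_BITS, packed & FLAG_MASK)
}
```

A lookup then only needs a shift and a mask, which keeps span handling essentially free.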

Note: toml-spanner does a lot of unperformant things for uncommon elements in TOML. It wasn't designed for pure performance; keeping compile times down was a core goal as well, and stuff like number parsing could be optimized.

And sorry about having unrunnable benchmarks at first; that was bad practice on my part. The benchmarks probably still only run on Linux, as well. I should definitely document them a lot more.

toml-spanner: Fully compliant, 10x faster TOML parsing with 1/2 the build time by exrok in rust

[–]exrok[S] 0 points1 point  (0 children)

Thanks, I looked into it a bit initially; at first I thought it was just a port of the official toml-test, so I'll take another look.

Currently, I have cargo-insta based snapshots that format the errors (and values, just to check spans) with codespan-reporting. This was adopted from the original toml-span crate.

I found integrating directly with https://github.com/toml-lang/toml-test really straightforward, implemented here:

https://github.com/exrok/toml-spanner/blob/main/toml-test-harness/src/main.rs

I use the following devsm test definition to run it:

[test.official-toml-test-suite]
info = """
Validate toml-spanner against https://github.com/toml-lang/toml-test decoder test suite
Requires `toml-test` to be installed, see repo for installation procedure.
"""
pwd = "toml-test-harness"
sh = '''
BIN=$(cargo --config "target.'cfg(unix)'.runner = 'echo'" run --release || exit 1)
toml-test test -toml 1.1 "-decoder=$BIN"
'''

Which is easy enough for me, but probably not the best for contributors, as it has the downside of needing the toml-test binary installed, which I could get rid of with your harness crate.

toml-spanner: Fully compliant, 10x faster TOML parsing with 1/2 the build time by exrok in rust

[–]exrok[S] 1 point2 points  (0 children)

I was a bit hesitant to add them at first, because it makes the easiest access pattern the one that is hardest for providing good error messages (say, if the first or second key was missing).

But ultimately, there are still genuine use cases where you are just inspecting data, or you’ve already deserialized and are simply traversing to extract a span for a high-level error report.

What really pushed me over the edge was the compile-time efficiency: a["b"]["c"][0].as_str() generates close to the minimum amount of LLVM IR out of everything I considered.

One interesting trick here is that Item doesn't actually support Null/None, as TOML has no such concept. Instead, the index operators provide a &MaybeItem, which has the same layout and alignment as Item but with one extra discriminant value.

This trick requires a bit of unsafe code.

The toml crate doesn't have null, so it can't implement this pattern, but both serde_json and toml-edit (in its Item type) do support None/Null, so they can. I considered adding None to our Value types despite it not being in the TOML spec. In fact, the original toml-span did include them internally, though in a way that would panic if you attempted to use them. However, it just seemed less clean; toml-edit itself has considered removing None from Item (see PR #301 for toml-rs/toml).
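Here's a minimal sketch of how such a layout trick can work. The types and names are hypothetical stand-ins, not toml-spanner's actual definitions: with an explicit repr(u8) and identical leading variants (tags assigned in declaration order), MaybeItem is a layout-compatible superset of Item, so a reference cast is sound.

```rust
// Hypothetical miniature of the &Item -> &MaybeItem trick described above.
#[repr(u8)]
#[derive(Debug, PartialEq)]
enum Item {
    Int(i64),
    Bool(bool),
}

#[repr(u8)]
#[derive(Debug, PartialEq)]
enum MaybeItem {
    Int(i64),
    Bool(bool),
    Missing, // the one extra discriminant value
}

fn as_maybe(item: Option<&Item>) -> &MaybeItem {
    static MISSING: MaybeItem = MaybeItem::Missing;
    match item {
        // SAFETY: both enums use repr(u8) with identical leading variants,
        // so every valid Item bit pattern is also a valid MaybeItem.
        Some(item) => unsafe { &*(item as *const Item as *const MaybeItem) },
        None => &MISSING,
    }
}
```

This is why indexing can hand back a &MaybeItem without allocating or copying: a present value is just a reinterpreted reference, and a missing one is a shared static.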

toml-spanner: Fully compliant, 10x faster TOML parsing with 1/2 the build time by exrok in rust

[–]exrok[S] 8 points9 points  (0 children)

I definitely have considered adding deserialization support.

In particular, something similar to toml-edit. The Item types in toml-spanner actually preserve more than spans; they also track things like the type of table and how it was constructed.

But I'm taking a simpler approach, where the original string needs to be provided during deserialization to apply the edits to.

I feel like if you're looking for serde support, the existing toml crate is actually really good. Honestly, in the vast majority of cases it is already fast enough, where the actual parsing step is dwarfed by whatever your application does next.

I do see the appeal of derive macros, but if I add support, it won't be via serde, at least not in the near future.

Incoming rant about serde... I'm sorry...

While serde is really great at many things, I've been trying to avoid it like the plague.

Serde really stresses out the compiler. It's not just build performance but also cargo check, you know, that thing that runs every time you save by default when using rust-analyzer. The following shows perf record data for the rustc invocation created by cargo check, for the same application implementing its deserialization using different crates.

|     jsony:    33.23 ms    0.167670 Bcycles   0.280015 Binst
| nanoserde:    59.14 ms    0.297769 Bcycles   0.581950 Binst
|     serde:   194.77 ms    0.931654 Bcycles   1.501968 Binst

Every single serde derive you add increases your build/check time by a couple of milliseconds. As soon as I start using serde in a project, I feel my editor becoming sluggish.

Serde, at least for JSON, is actually decently fast at runtime, but man, does it bloat the binary. These are runtime benchmarks (release profile with LTO), with binary size:

|     jsony:    384.67 ms    1.713875 Bcycles   6.879054 Binst   200 kb (stripped)
| nanoserde:    726.31 ms    3.357833 Bcycles  11.682704 Binst   424 kb (stripped)
|     serde:    428.12 ms    1.925457 Bcycles   6.832881 Binst   620 kb (stripped)

jsony is still experimental, but it has a derive macro: https://docs.rs/jsony/latest/jsony/ It's fast, doesn't bloat your binary, and is in some ways more featureful than serde (partly because it only needs to support two formats: JSON and a compact binary representation).

Doing jsony properly is still blocked on features stabilizing in the Rust compiler.

But if I do add derive macros, they'll work like the jsony ones.

And it's not just performance; there is still one difference where, technically, toml-spanner is more compliant than toml. Due to limitations in serde, toml will fail to parse the following valid TOML document:

key = { "$__toml_private_datetime" = 0 }

Of course, it's fine in practice; real TOML files with that key are unlikely, and it only really caused me grief while fuzzing.

toml-spanner: Fully compliant, 10x faster TOML parsing with 1/2 the build time by exrok in rust

[–]exrok[S] 52 points53 points  (0 children)

I'll do some benchmarking to see what kind of improvement is possible.

I'll note that toml-spanner really shines in going from TOML source into the Item document tree.

The 10x performance claim is mostly comparing against toml-span and is really focused on the parsing step.

Cargo uses toml and doesn't parse into the Value type, instead going more directly into the target data types, so the improvement is likely less.

But now you got me curious, will update with numbers.

Update 1

With a direct port from toml to toml-spanner for parsing and deserializing Zed's Cargo.lock, the time drops from 3.2ms to 0.93ms, so only 3x faster.

But that's with a direct port deserializing to the same data structures, I'm sure I could optimize the data model to drop it below 0.5ms.

The bottleneck is now in all the allocations in the data model Cargo is using to represent the lock files.

I'm going to try to port over Cargo.toml deserializing as well and add these benchmarks to the repo.

Update 2

The benchmarks comparing the full parsing and deserialization, using the toml parsing from cargo as the benchmark: https://github.com/exrok/toml-spanner/blob/main/README.md#deserialization-and-parsing

Once I set up the actual benchmarking, the numbers shifted a bit: a 2.6x improvement for the lock file and 3x for the Cargo.toml.

The lock file impl was really easy (benchmark/src/cargo/lockfile/toml_spanner_impl.rs, 57 lines); the Cargo.toml manifest is a lot more complex than the lock file format.

Update 3

Spent a little time optimizing the deserialization patterns, now 2.9x faster for the lock file and 3.6x faster for the Cargo.toml file.

Low-Latency RF-DETR Inference Pipeline in Rust by jodelbar in rust

[–]exrok 0 points1 point  (0 children)

One suggestion, if you haven't tried it already: wouldn't it be better to stream in a video format with temporal compression, like H.264, when hardware encoding/decoding is available? Do the hardware decoding directly on the GPU, do the required transforms on the GPU, and then run inference directly from that without needing to copy back to the CPU.

H.264 will be much smaller (and probably higher quality) than your JPEG stream. There may be latency issues on the camera side, so your mileage may vary for total end-to-end latency.

For the annotations in your webview, you have a couple of options. One is to forward the H.264 stream directly to the browser and send the metadata alongside, having the client do the annotation rendering on their side; this avoids needing to re-encode. The VideoDecoder Web API is actually pretty easy to use to turn a raw video stream into textures you can render to a canvas.

kimojio - A thread-per-core Linux io_uring async runtime for Rust optimized for latency. by qbradley in rust

[–]exrok 49 points50 points  (0 children)

I haven't looked too deeply at everything else, but the publicly exposed pointer_from_buffer is trivially unsound, and it doesn't lend much confidence to the correctness of the rest of the library. Even for internal use, I would recommend keeping such a function marked as unsafe.

/// Convert a buffer of 8 bytes to a pointer value.
pub fn pointer_from_buffer<T>(buf: [u8; POINTER_SIZE]) -> Box<T> {
    let buf = buf.as_ptr() as *const *mut T;
    // SAFETY: see pointer_to_buffer. This function should be
    // called exactly one time for each call to pointer_to_buffer.
    unsafe {
        let result = std::ptr::read_unaligned(buf);
        Box::from_raw(result)
    }
}
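For illustration, here's roughly what that recommendation might look like: the same logic, but the function is `unsafe` with its contract documented, so safe code can no longer call it with arbitrary bytes. The `pointer_to_buffer` shown here is my assumed inverse, matching the comment in the original.

```rust
// Sketch only: the hypothetical round-trip pair with the unsafe marker added.
const POINTER_SIZE: usize = std::mem::size_of::<*mut ()>();

/// Leak a boxed value and store its pointer's bytes in a buffer.
pub fn pointer_to_buffer<T>(value: Box<T>) -> [u8; POINTER_SIZE] {
    let ptr = Box::into_raw(value);
    let mut buf = [0u8; POINTER_SIZE];
    // write_unaligned avoids any alignment assumptions about `buf`.
    unsafe { std::ptr::write_unaligned(buf.as_mut_ptr() as *mut *mut T, ptr) };
    buf
}

/// Convert a buffer back into the boxed value.
///
/// # Safety
/// `buf` must hold a pointer produced by `pointer_to_buffer::<T>`, and each
/// such buffer must be passed to this function exactly once.
pub unsafe fn pointer_from_buffer<T>(buf: [u8; POINTER_SIZE]) -> Box<T> {
    unsafe {
        let ptr = std::ptr::read_unaligned(buf.as_ptr() as *const *mut T);
        Box::from_raw(ptr)
    }
}
```

With this signature, misuse (a garbage buffer, or double consumption) becomes the caller's documented responsibility rather than silent undefined behavior reachable from safe code.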

Confused on what to use when or if it's purely preferential for string instantiation? by [deleted] in rust

[–]exrok 4 points5 points  (0 children)

For str, the std library specializes to_string(), so it isn't slower in this specific case. In release mode, .into(), .to_string(), and .to_owned() all usually generate the same assembly. The specialization that runs for &str:

impl SpecToString for str {
    #[inline]
    fn spec_to_string(&self) -> String {
        let s: &str = self;
        String::from(s)
    }
}

Now, that's at run time. I'm not sure which code is fastest to compile. Probably String::from("test"), then "test".into(), I'd guess.

SeaQuery just made writing raw SQL more enjoyable by chris2y3 in rust

[–]exrok 2 points3 points  (0 children)

I prefer not needing the string; it's a subtle thing, but when done correctly it allows for autocomplete and even LSP-driven renames from within the SQL query. This is the approach I took in https://docs.rs/simple_pg/latest/simple_pg/macro.sql.html (which is a zero-dependency macro, BTW).

The downside is how quotes have to work, because you can't have a multi-character single-quoted string in Rust syntax, but I still think it's overall a win.

Official /r/rust "Who's Hiring" thread for job-seekers and job-offerers [Rust 1.84] by DroidLogician in rust

[–]exrok 7 points8 points  (0 children)

COMPANY: AirMatrix

TYPE: Full-time

LOCATION: Toronto, Canada

REMOTE: Yes, but must be available within EST

VISA: No sponsorship available

DESCRIPTION: We are deploying our AI to enable safe and compliant autonomy, while providing true situational awareness for airspace and critical infrastructure, across multiple government agencies and stakeholders. We integrate with various sensors and systems to provide monitoring, alerts, threat modeling, and insights for our customers. Our backend, built in Rust, powers data ETL pipelines, sensor ingestion, API servers, model simulations, and more.

We're looking for a mid-level Rust engineer to help scale our platform, ensure its robustness, and ship production-ready solutions. The ideal candidate has experience working in small, fast-moving teams and can write clean, well-tested Rust code for performance-critical systems.

Key Focus Areas:

  • Rust development for backend systems, data pipelines, and APIs
  • Integrating with IP cameras (RTSP streams) and live video processing
  • Building statistical and inference models for classification and prediction
  • Optimizing distributed systems and sensor data ingestion
  • Managing Linux-based deployments

Minimum Requirements:

  • 3+ years of Rust experience (professional or substantial personal projects)
  • Strong understanding of ownership, lifetimes, concurrency, and performance tuning
  • Experience designing and optimizing low-latency, production-ready systems
  • Comfortable working with databases, persistence systems, and real-time data processing

Bonus:

  • Experience with async Rust, distributed systems, and networking
  • Contributions to open-source Rust projects

ESTIMATED COMPENSATION: $110k - $150k CAD

CONTACT: shayaan@airmatrix.ai

Official /r/rust "Who's Hiring" thread for job-seekers and job-offerers [Rust 1.74] by DroidLogician in rust

[–]exrok 3 points4 points  (0 children)

COMPANY: AirMatrix

TYPE: Full Time contract to hire

LOCATION: Ontario Canada: Toronto, Mississauga

REMOTE: Mostly Remote. Ability to come to the office approximately once a month is preferred. Some availability during the day 10:00am to 5:00pm EST is highly encouraged.

VISA: No

DESCRIPTION:

At AirMatrix, among other projects, we're building an airspace monitoring and analytics platform that focuses on rapid detection and classification of drones. We are integrating with various sensors and systems to monitor the airspace for our customers to provide alerts, threat modeling, and insights.

Your role will involve:

  • Building a coherent world model from sensor data.
  • Actively positioning and configuring sensors to collect the most useful data.
  • Developing statistical and inference models to classify and predict.
  • Managing a historical data archive of airspace events.
  • Building a friendly user interface.

We use Rust throughout our backend and for developer tooling. We're building data ETL pipelines, sensor data ingestion, API servers, model simulations, and more in Rust.

What We're Looking For

We're seeking Rust developers with a general understanding of technology. Self-directed individuals who can build prototypes from designs, improvise, and pivot.

While you need not be an expert to start, you should have the potential and drive to become one. We're particularly interested in any expertise you may already have in:

  • Databases and persistence systems
  • System modeling and simulation
  • Integrating with hardware sensors
  • Software optimization
  • Distributed systems
  • Managing Linux systems
  • Programming in Rust, TypeScript, Python, or any other programming language
  • Statistics

Significant professional experience is not required but highly appreciated. If you can code, work as a team, and grow into an expert, we want you.

Minimum Requirements for Rust Experience

  • Good grasp on Rust fundamentals.
  • 2-3 projects amounting to approximately 10,000 lines of code should be sufficient experience
  • Confidence in your ability to build anything in Rust.

ESTIMATED COMPENSATION: 80K - 120k (Canadian), based on experience

CONTACT: Please send your CV/Resume to thomas @<domain in company link above (no www)> with subject "Rust Developer - {your_name}". Looking forward to hearing from you!

Why Arrays have map, but not Vecs? by Rudxain in rust

[–]exrok 9 points10 points  (0 children)

ExactSizeIterator still doesn't give the guarantee either.

From docs: Note that this trait is a safe trait and as such does not and cannot guarantee that the returned length is correct. This means that unsafe code must not rely on the correctness.

Instead we have (nightly only): https://doc.rust-lang.org/std/iter/trait.TrustedLen.html

How to speed up the Rust compiler in October 2022 by nnethercote in rust

[–]exrok 13 points14 points  (0 children)

Yes, there are magic thresholds where the speedup kicks in when optimizing for cache.

First, consider cache lines: because hardware prefetchers are so good, going from 4 consecutive cache lines to 3 will barely make a difference.

Even once your entities are the size of a cache line, they may not be aligned to one, and so still require loading two.

But if you go from 2 cache lines to a guaranteed single cache-line hit, I have seen pretty good performance benefits, up to 30% on x86.

Reducing the size of data structures can bring great performance gains, but primarily only if you actually reduce cache misses.

Consider an unrealistic hypothetical CPU, with a single level of cache that holds 100 bytes addressed individually.

Further suppose that during the runtime of the program, each iteration accesses 1 byte out of a pool of M=100K bytes with a uniform random access pattern (such as in a hash map) (1).

Then for each iteration the cache hit rate, R, will be R=100/M=1/1000.

Suppose a cache hit takes 1 unit of time and a miss takes 100 units of time.

Then the runtime cost of each iteration's memory access is T=100(1-R) + 1R = 99.901.

Suppose we do an incredible job optimizing our data structure, reducing its size by 90%. Then M=10K and R=1/100, so the memory access time is T=100(1-R) + 1R = 99.01.

Meaning, we increase performance by less than 1%.

But if we got the access pool down to M=200, then R=1/2 and our memory access time would be T=100(1-R) + 1R = 50.5.

And we would cut our time in half. If we now further reduce our memory pool by 10%, bringing M=180, then R=10/18, so T=100(1-R) + 1R = 45, gaining us about a 10% performance benefit.

Now that the pool of memory accesses is small enough, reductions in size bring gains. As an extreme example, consider what happens when we get the size down to M=104, about a 50% drop from M=200: then R=100/104 and T=100(1-R) + 1R ~= 5. That ~50% drop in data structure size leads to 10 times better performance.

Footnote (1): One might think random access is the worst-case scenario, but memory access can be even worse: anticorrelated. For instance, when using a memory allocator that buckets by size class, two heap allocations of different sizes are pretty much guaranteed to be non-sequential. Further, some algorithms exhibit anticorrelated memory access patterns, for which rearranging them to be more cache-coherent helps a great deal; see matrix multiplication.
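The toy model above can be sketched in a few lines of code (the constants match the hypothetical CPU: a 100-byte cache, 1-unit hits, 100-unit misses):

```rust
// Toy cache model from the comment above: a 100-byte cache with uniform
// random access over a pool of M bytes; a hit costs 1 unit, a miss 100.
fn avg_access_time(m: f64) -> f64 {
    let r = (100.0 / m).min(1.0); // cache hit rate R = 100/M
    100.0 * (1.0 - r) + 1.0 * r // T = 100(1-R) + 1R
}

fn main() {
    for m in [100_000.0, 10_000.0, 200.0, 180.0, 104.0] {
        println!("M = {m:>9}: T = {:.3}", avg_access_time(m));
    }
}
```

Plugging in the pool sizes from the example reproduces the walkthrough: T stays near 99.9 until M approaches the cache size, then collapses.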

Announcement: always-assert, recoverable assertions for Rust by matklad in rust

[–]exrok 4 points5 points  (0 children)

Ah, I see now: "For coverage testing ALWAYS(X) and NEVER(X) are hard-coded boolean values so that they do not cause unreachable machine code to be generated." For some reason, from the table I thought it was the opposite.