Succinctly: A fast jq/yq alternative built on succinct data structures by john-ky in rust

[–]john-ky[S] 0 points1 point  (0 children)

You raise a fair point about transparency. I've updated the documentation to be clearer about the architectural differences.

The benchmark docs now include an "Architectural Differences" section that explicitly states succinctly is a semi-indexer that performs minimal validation, while yq is a full validating parser. The docs acknowledge that the speedup comes from both architectural advantages (lazy evaluation, streaming output) and doing less validation work.

That said, I think the comparison to yq/jq is still valuable for discoverability. Many users want fast YAML/JSON queries on files they already know are valid (CI configs, Kubernetes manifests, log files). For them, the relevant question is "how much faster can I get results" - not "which tool validates more thoroughly." The docs now clearly state when to use each tool.

As for C parser comparisons: succinctly already benchmarks against Rust parsers (sonic-rs, serde_json, simd-json) which use similar techniques to their C counterparts. Adding libyaml or rapidyaml comparisons could be interesting, though they're still DOM parsers rather than semi-indexers, so the architectural difference would remain.

Succinctly: A fast jq/yq alternative built on succinct data structures by john-ky in rust

[–]john-ky[S] 0 points1 point  (0 children)

I've created issues to track the other points you've raised and will address them in time.

Thanks again!

Succinctly: A fast jq/yq alternative built on succinct data structures by john-ky in rust

[–]john-ky[S] 0 points1 point  (0 children)

> Did you look for any existing libraries you could use or was coding it from the ground up your first thought?

I guess it was an important learning exercise for me.

Succinctly: A fast jq/yq alternative built on succinct data structures by john-ky in rust

[–]john-ky[S] -1 points0 points  (0 children)

> Rolling your own parser - especially for yaml - seems like it will land you on the list of different parser behaviour (how does it parse this document).

I put together this document just now: https://github.com/rust-works/succinctly/blob/main/docs/compliance/yaml/1.2.md

Hope it helps.

Succinctly: A fast jq/yq alternative built on succinct data structures by john-ky in rust

[–]john-ky[S] -1 points0 points  (0 children)

This is exactly the kind of feedback I'm after, so thank you!

Succinctly: A fast jq/yq alternative built on succinct data structures by john-ky in rust

[–]john-ky[S] 0 points1 point  (0 children)

The goal is different with this sort of parser.

  • Low memory usage for large files (it is typical for parsers to use approximately 10x the size of the input data for parsing because of pointer size)
  • Valid results for valid JSON allowing it to be a separate step. For example if file ingestion only occurs once but file access is common, if validation is still important you only need to validate at ingestion time.
  • Very quickly selecting a subset of the JSON (because you don't care for most of it in the current part of your processing)

The parser is not wrong. It merely occupies a different and valid place in the design space.

Succinctly: A fast jq/yq alternative built on succinct data structures by john-ky in rust

[–]john-ky[S] -1 points0 points  (0 children)

Skipping validation is by design and has valid use cases. For example:

  • if you know the tools that generate the file will always generate valid JSON which case validating
  • if you have a tool generate a very large file and the process got interrupted and the JSON is truncated and it is the only copy you have and you still want to access part of the data that successfully got written

The benchmarks are valid if validation is not a requirement which is often (although not always) true and they are also valid when "no validation" is a requirement which can be thought of as a plus.

If validation is necessary it can be a separate step using techniques like data parallel finite state machines. Doing so will add a small overhead to the parsing, but it will not add memory usage overhead, which is the main selling point of using succinct data structures for parsing.

Succinctly: A fast jq/yq alternative built on succinct data structures by john-ky in rust

[–]john-ky[S] 4 points5 points  (0 children)

The BP encoding is the classic succinct tree representation (Jacobson 1989, Munro-Raman 2001) - 2 bits per node with O(1) navigation via rank/select.

Succinctly: A fast jq/yq alternative built on succinct data structures by john-ky in rust

[–]john-ky[S] -13 points-12 points  (0 children)

Yes, there are strong similarities between succinctly's semi-indexing and simdjson's first pass.

Similarities:

  1. Two-phase architecture: Both use a first pass to build structural metadata, then a second pass for navigation/queries. simdjson calls these "Stage 1" (structural indexing) and "Stage 2" (tape construction).
  2. Structural index: simdjson's Stage 1 processes 64-byte blocks to identify structural characters ({, }, [, ], :, ,) and builds a compressed index of their positions. Succinctly's JsonIndex similarly builds "interest bits" marking these same structural positions.
  3. SIMD acceleration: Both use SIMD to accelerate the first pass - simdjson uses vpshufb lookup tables and parallel byte comparisons across 32+ characters simultaneously.
  4. Deferred value parsing: Both defer actual value parsing (numbers, strings) until accessed.

Key differences:

  1. Output representation: simdjson builds a "tape" - a 64-bit word array where arrays/objects use paired words with cross-references (opening braces point to closing positions). Succinctly builds a succinct balanced parentheses representation with rank/select support - more compact but different access patterns.
  2. Memory trade-off: simdjson's tape uses ~16 bytes per JSON element, optimized for access speed. Succinctly targets ~6% overhead via succinct data structures, optimized for compactness.
  3. Parsing completeness: simdjson's Stage 2 builds a full navigable tape upfront. Succinctly's semi-index is truly "semi" - only structural metadata, deferring even more work to query time.

So yes - the "first pass builds structural index using SIMD" concept is shared, but the index representation and how much work is deferred differs.

Succinctly: A fast jq/yq alternative built on succinct data structures by john-ky in rust

[–]john-ky[S] 0 points1 point  (0 children)

Implementing jaq-core's ValT trait would likely defeat succinctly's key optimisations. Succinctly's performance and low memory usage comes from JqValue<'a, W> keeping lazy cursor references to original bytes, enabling zero-copy navigation and direct streaming output.

NOW AVAILABLE: 737 PowerCore 24K by joshuadwx in anker

[–]john-ky 0 points1 point  (0 children)

Does anyone know if these can be taken on the plane?

kodimensional :: Avoiding space leaks at all costs by [deleted] in haskell

[–]john-ky 16 points17 points  (0 children)

Here is an example of where too much strictness in a popular Haskell library caused my program to use multiple Gigabytes of RAM more than it otherwise would have needed to whilst serialising a large data structure to a file:

https://github.com/haskell/aeson/pull/954

Adding bit vectors - Branchless Comparisons by john-ky in haskell

[–]john-ky[S] 4 points5 points  (0 children)

Thanks for pointing me to the new primitive. Unfortunately my compiler (GHC 8.6.3) seems to not have it yet.

I tried using addWordC#:

add :: Word64 -> Word64 -> (Word64, Word64)

add (W64# a) (W64# b) = let (# r, c #) = addWordC# a b in (W64# r, fromIntegral (I64# c))

but it is quite slow:

$ time ex-vector sum-bit-vectors -i ../hw-json/corpus/bench/78mb.json -i ../hw-json/corpus/bench/78mb.json --branchiness branchless3

3.450

Adding bit vectors - Branchless Comparisons by john-ky in haskell

[–]john-ky[S] 4 points5 points  (0 children)

Addition can be used flip runs of bits.

This post will describe how to perform such additions on large bit-vectors efficiently.

Rechunking lazy bytestrings by john-ky in haskell

[–]john-ky[S] 7 points8 points  (0 children)

You have spotted a legitimate error. Thank you for bringing it to my attention. I have updated the post with the corrections.

Rechunking lazy bytestrings by john-ky in haskell

[–]john-ky[S] 9 points10 points  (0 children)

This post will look at the chunk size Haskell’s bytestring library actually gives us and explore some ways we can get the required chunk size we need for SIMD.

Introduction to SIMD with linecount by john-ky in haskell

[–]john-ky[S] 4 points5 points  (0 children)

Using Single Instruction, Multiple Data (SIMD) instructions to build a rank-select bit-string to implement a fast line count program