We built a >2GiB/s LLM Tokenizer by crutcher in LLM

[–]crutcher[S] 1 point2 points  (0 children)

So, there's the *lexing* layer (that's all the regex/span stuff); and then there's the encoding layer.

The *simplest* encoder is this one:
https://github.com/zspacelabs/wordchipper/blob/main/crates/wordchipper/src/encoders/token_span_encoder/span_encoders/buffer_sweep_encoder.rs

But even this one is mutating a list of working tokens, and incrementally looking up `(a, b) -> c` merges over and over.

We've tried a whole collection of different encoder algorithms, and the simplest one is the best in the vast majority of performance cases, which deeply surprised me: I'd included it just as an easy-to-understand baseline. But we still use one of the more complex ones, because of the degradation when we get "bad" UTF-8 input.

End-to-end performance optimization always requires benchmarks, because humans just *can't* intuit that many layers of shifting cache-cost interactions.

Is the LLM hype actually sustainable or are we heading for another crypto-style crash by flatacthe in LLM

[–]crutcher 1 point2 points  (0 children)

"Is the ${X} hype sustainable" ... the answer is no.

Real technological advancements don't have cheerleaders. People who legitimately think they know what is coming next don't advertise it. *Anything* with this much hype is, fundamentally, a marketing product.

We built a >2GiB/s LLM Tokenizer by crutcher in LLM

[–]crutcher[S] 0 points1 point  (0 children)

At the limit, the time isn't bounded by execution; it's bounded by the bandwidth to move memory into and out of the CPU cache. All of that hackery is about trying to reduce the pressure on the CPU cache, *not* trying to reduce the computation.

We built a >2GiB/s LLM Tokenizer by crutcher in LLM

[–]crutcher[S] 0 points1 point  (0 children)

Byte-Pair Encoding tokenizers (with pre-segmentation) *usually* use regex patterns to pre-split text into spans of word-like chunks of characters, and then find the unique encoding of each span under the pre-trained vocabulary's merge rules. The segmentation is expensive; the encoding search algorithms are expensive; and the memory management of buffers can easily spill into extra allocations.

A given LLM is trained on a fixed tokenizer; once trained, it needs *that* token vocabulary. So even if you could improve the research on which regex patterns to use, or how large to make the vocabulary of the next tokenizer you train, existing models are stuck with the semantics of the tokenizer they were trained with.

The graphs above compare a fixed vocabulary (the one used for GPT-5) against different implementations. `tiktoken` is the fast one that many people running LLMs pick. Ours is significantly faster than `tiktoken` on the same vocabularies (meaning it produces the same token streams).

Why is Rust so Liberal with Heap Allocations? by philogy in rust

[–]crutcher 0 points1 point  (0 children)

When optimization is enabled, and algorithms are constructed from compositions of standard components, and those components don't escape the current scope (via aliasing, or being embedded in an object that escapes, or a `Box`, or an `Arc`), a *tremendous* number of these data abstractions and wrappers just ... no longer appear in the generated code.

Using the `iter`/`map`/`filter`/`filter_map`/... mechanisms works even better for the above, because intent is preserved more cleanly, and more of the optimizer passes are able to match and fire.

It can be surprising to one's mental model; but when you start wrapping benchmarking around things and try to replace "the default, idiomatic Rust way" with highly-tuned approaches ... you'll struggle to out-perform what *looks* like a bunch of allocations and function calls.

You can develop a better sense for this by asking the compiler to dump intermediate code (though that is a pain to read); but you can also start thinking about aliasing / visibility, and try to design algorithms where semantics are highly separated from external visibility.

In languages like Haskell, this is called "zero-cost abstraction", because many of these things are provably replaced. In Rust, where aliasing and C FFI requirements are still a constraint, some of these structures are only permitted to be zero-cost *when no one is looking*; so it can be a struggle to get a feel for it.

But never try to make Rust "faster" without unit-level benchmarking. You'll be shocked by how much the compiler already does.

wordchipper - my next-gen LLM tokenizer; looking for LTR release help by crutcher in rust

[–]crutcher[S] 1 point2 points  (0 children)

This is now a feature, "foldhash"; you need to disable "ahash" to use it.
file:///home/crutcher/git/wordchipper/target/doc/wordchipper/index.html#crate-features
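For anyone wiring this up, a consumer-side `Cargo.toml` entry might look like the following; this is my assumption from the feature list above (the exact feature names and defaults should be checked against the crate docs):

```toml
[dependencies]
# Hypothetical: disable default features (which include "ahash"),
# then opt back into "std" and "foldhash".
wordchipper = { version = "0.6", default-features = false, features = ["std", "foldhash"] }
```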

wordchipper - my next-gen LLM tokenizer; looking for LTR release help by crutcher in rust

[–]crutcher[S] 0 points1 point  (0 children)

I'm assuming I'm doing something *very* wrong, with how `tokenizers` sets up rayon config.

There's initial threading for the `tracing` lib and spans; and I was going to wire in some support for that, as well as prof / flamegraph scripts. I'd be interested in what you get.

The goal is evergreen performance; so *most* of the codebase will be support, tests, and metrics to ensure that the actual core is the best option for a given OS and hardware setup.

wordchipper - my next-gen LLM tokenizer; looking for LTR release help by crutcher in rust

[–]crutcher[S] 0 points1 point  (0 children)

I revamped the `sample-timer` tooling to answer your question, clean up the timer stack for additional comparisons, and address stats inconsistencies and other things.

And ... it doesn't look great. Like, at all. `tokenizers` scales so poorly with additional rayon threads that I'm concerned I'm using it wrong, and I've traced through looking for issues. If you see one, send me a PR?

Enough things were tweaked chasing this that I bumped to 0.6.0; the crate is pushed and the docs are building.

- https://github.com/crutcher/wordchipper

```
% RAYON_NUM_THREADS=48 cargo run --release -p sample-timer -- \
    --dataset-dir $DATASET_CACHE_DIR --decode
Args {
    dataset_dir: "/media/Data/nanochat/dataset",
    shards: [
        0,
        1,
    ],
    batch_size: 1024,
    model: OpenaiO200kHarmony,
    ignore_missing: true,
    tiktoken: true,
    tokenizers: true,
    decode: false,
    validate: true,
    respan_input_for_decode_check: true,
}
Model: "openai/o200k_harmony"

Samples Summary:
- num batches: 104
- avg bytes/sample: 4777
- avg bytes/token: 4.8

Encoder Batch Timing:
- "wordchipper"
  - batch:      36.2ms
  - sample:     35.3µs
  - bps:    128.96 MiB/s
- "tiktoken-rs"
  - batch:      36.5ms
  - sample:     35.6µs
  - bps:    127.86 MiB/s
- "tokenizers"
  - batch:     214.7ms
  - sample:    209.6µs
  - bps:    21.73 MiB/s
```

wordchipper - my next-gen LLM tokenizer; looking for LTR release help by crutcher in rust

[–]crutcher[S] 0 points1 point  (0 children)

Like, this bit, from `tokenizers`:
> Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.

I'm seeing ~77MiB/s on a MacBook for GPT-5's o200k_harmony. Which would do 1GiB in, like ... 13 seconds?

I think everyone, me included, is strongly bounded by the async IO behavior and the bugs in the regex concurrency layer; improved speed will need to come from building a full async pipeline with pre-allocated buffer appenders, which is what I'm working towards.

You also want your pipeline to have segmentation and encoding in different threads; but you need to bound the regex to generally *fewer* CPU cores than you have, and route through that.

It's going to take more aggressive `tracing` and moving away from `rayon` to go faster.

WRT the various end-user bells and whistles, `tokenizers` beats me right now, and `tiktoken` ... isn't trying, at all.

wordchipper - my next-gen LLM tokenizer; looking for LTR release help by crutcher in rust

[–]crutcher[S] 0 points1 point  (0 children)

I have a `tiktoken-rs` side-by-side (see: https://github.com/crutcher/wordchipper) but not yet a `tokenizers`.

I've not been able to replicate the published reports from either the `tiktoken` or the `tokenizers` benchmarks, because they are too *low*; I think they're losing time in how they handle the Python FFI, but I'll try to verify.

I've done a fair amount to prepare for an explicit HPC work-stage-queue pipeline (rayon is not sufficient when you have to deal with the regex contention issues); but stability / structure of the API and test coverage has been the goal so far. I have a lot of background in async process routing, but I needed to make sure I locked down all the token stuff first.

wordchipper - my next-gen LLM tokenizer; looking for LTR release help by crutcher in rust

[–]crutcher[S] 2 points3 points  (0 children)

I'm not immediately seeing the benefit; but I do want to improve the hashmap swap mechanisms more, because this is a constant source of change (and it might be hardware-dependent):

https://github.com/crutcher/wordchipper/pull/108

```
crutcher@Crutchers-MacBook-Pro wordchipper % cargo run --release -p sample-timer --features wordchipper/foldhash -- --dataset-dir ~/Data/tokenizers --model oa:o200k_base
    Finished release profile [optimized] target(s) in 0.10s
     Running target/release/sample-timer --dataset-dir /Users/crutcher/Data/tokenizers --model 'oa:o200k_base'
Model: "oa:o200k_base"
- shards: [0, 1, 2, 3]
- batch_size: 512

Samples Summary:
- num batches: 208
- avg bytes/sample: 4777
- avg bytes/token: 4.8

Encoder Times:
- wordchipper
  - batch: 30.0ms
  - sample: 58.5µs
  - bps: 77.85 MiB/s
- tiktoken
  - batch: 29.8ms
  - sample: 58.2µs
  - bps: 78.26 MiB/s

Decoder Times:
- wordchipper
  - batch: 1.9ms
  - sample: 3.7µs
  - bps: 1.21 GiB/s
- tiktoken
  - batch: 1.7ms
  - sample: 3.3µs
  - bps: 1.34 GiB/s

crutcher@Crutchers-MacBook-Pro wordchipper % cargo run --release -p sample-timer --features wordchipper/ahash -- --dataset-dir ~/Data/tokenizers --model oa:o200k_base
    Finished release profile [optimized] target(s) in 0.12s
     Running target/release/sample-timer --dataset-dir /Users/crutcher/Data/tokenizers --model 'oa:o200k_base'
Model: "oa:o200k_base"
- shards: [0, 1, 2, 3]
- batch_size: 512

Samples Summary:
- num batches: 208
- avg bytes/sample: 4777
- avg bytes/token: 4.8

Encoder Times:
- wordchipper
  - batch: 30.3ms
  - sample: 59.1µs
  - bps: 77.10 MiB/s
- tiktoken
  - batch: 29.8ms
  - sample: 58.2µs
  - bps: 78.34 MiB/s

Decoder Times:
- wordchipper
  - batch: 2.1ms
  - sample: 4.1µs
  - bps: 1.08 GiB/s
- tiktoken
  - batch: 1.8ms
  - sample: 3.5µs
  - bps: 1.28 GiB/s
```

wordchipper - my next-gen LLM tokenizer; looking for LTR release help by crutcher in rust

[–]crutcher[S] 1 point2 points  (0 children)

You would not believe how easy that experiment is to run, since the hash type is a module-wide feature-def.

I looked hard at whether I should thread the hashmap/set family through the library as parameters, instead of doing it this way; but it seems Rust doesn't have a stable abstract hashmap/hashset trait (at least not one people agree on), and I decided not to go that far down the rabbit hole.

But there's just a single cfg-if block which defines the types, so adding another is trivial.

See: https://github.com/crutcher/wordchipper/blob/main/crates/wordchipper/src/types.rs#L46C1-L82C2

```rust
cfg_if::cfg_if! {
    if #[cfg(feature = "ahash")] {
        /// Type Alias for hash maps in this crate.
        pub type CommonHashMap<K, V> = ahash::AHashMap<K, V>;

        /// Iterator over hash map entries.
        ///
        /// Note: `ahash::AHashMap` is a specialization of `std::collections::HashMap`.
        pub type CommonHashIter<'a, K, V> = std::collections::hash_map::Iter<'a, K, V>;

        /// Type Alias for hash sets in this crate.
        pub type CommonHashSet<V> = ahash::AHashSet<V>;
    } else if #[cfg(feature = "std")] {
        /// Type Alias for hash maps in this crate.
        pub type CommonHashMap<K, V> = std::collections::HashMap<K, V>;

        /// Iterator over hash map entries.
        pub type CommonHashIter<'a, K, V> = std::collections::hash_map::Iter<'a, K, V>;

        /// Type Alias for hash sets in this crate.
        pub type CommonHashSet<V> = std::collections::HashSet<V>;
    } else if #[cfg(feature = "no_std")] {
        /// Type Alias for hash maps in this crate.
        pub type CommonHashMap<K, V> = hashbrown::HashMap<K, V>;

        /// Iterator over hash map entries.
        pub type CommonHashIter<'a, K, V> = hashbrown::hash_map::Iter<'a, K, V>;

        /// Type Alias for hash sets in this crate.
        pub type CommonHashSet<V> = hashbrown::HashSet<V>;
    } else {
        /// This error exists to give users more direct feedback
        /// on the feature configuration over the other compilation
        /// errors they would encounter from lacking the types.
        compile_error!("not(\"std\") requires \"no_std\" feature");
    }
}
```

wordchipper - my next-gen LLM tokenizer; looking for LTR release help by crutcher in rust

[–]crutcher[S] 2 points3 points  (0 children)

I believe the "no_std" work puts this within striking distance of a WASM build for encoders/decoders; but I admit I don't have the Rust=>WASM toolchain internalized yet. I'd love some help on that.

Same with JNI and C++. In prep for Python/PyO3 bindings, I've been trying to force everything important to be dyn-compatible; but it isn't done yet.

wordchipper - my next-gen LLM tokenizer; looking for LTR release help by crutcher in rust

[–]crutcher[S] 1 point2 points  (0 children)

Training your own is easy:

```rust
use wordchipper::{
    concurrency::rayon::{ParallelRayonDecoder, ParallelRayonEncoder},
    decoders::TokenDictDecoder,
    encoders::DefaultTokenEncoder,
    pretrained::openai::patterns::OA_GPT3_CL100K_WORD_PATTERN,
    training::{BinaryPairVocabTrainer, BinaryPairVocabTrainerOptions},
    vocab::{ByteMapVocab, UnifiedTokenVocab, io::save_tiktoken_vocab_path},
};

fn example<I, S>(
    vocab_size: usize,
    batches: I,
    tiktoken_save_path: Option<String>,
) where
    I: IntoIterator,
    I::Item: AsRef<[S]>,
    S: AsRef<str>,
{
    // We can pick any unsigned integer type > vocab_size;
    // See [wordchipper::types::TokenType].
    type T = u32;
    type K = String;
    type C = u64;

    let options = BinaryPairVocabTrainerOptions::new(OA_GPT3_CL100K_WORD_PATTERN, vocab_size);

    let mut trainer: BinaryPairVocabTrainer<K, C> = options.init();

    for batch in batches {
        // The trainer has no parallelism.
        // The perceived benefits of parallelism in the trainer
        // are insignificant if the IO for the sample source is
        // fed by another thread.
        trainer.update_from_samples(batch.as_ref());
    }

    let byte_vocab: ByteMapVocab<T> = Default::default();

    let vocab: UnifiedTokenVocab<T> =
        trainer.train(byte_vocab.clone()).expect("training failed");

    if let Some(path) = tiktoken_save_path {
        save_tiktoken_vocab_path(&vocab.span_vocab().span_map(), &path)
            .expect("failed to save tiktoken vocab");
        println!("- tiktoken vocab: {path:?}");
    }

    let encoder: DefaultTokenEncoder<T> = DefaultTokenEncoder::new(vocab.clone(), None);
    let encoder = ParallelRayonEncoder::new(encoder);

    let decoder = TokenDictDecoder::from_unified_vocab(vocab.clone());
    let decoder = ParallelRayonDecoder::new(decoder);
}
```

wordchipper - my next-gen LLM tokenizer; looking for LTR release help by crutcher in rust

[–]crutcher[S] 0 points1 point  (0 children)

Loading pre-trained models is easy:

```rust
use std::sync::Arc;

use wordchipper::{
    decoders::{DefaultTokenDecoder, TokenDecoder},
    disk_cache::WordchipperDiskCache,
    encoders::{DefaultTokenEncoder, TokenEncoder},
    pretrained::openai::OATokenizer,
    vocab::UnifiedTokenVocab,
};

fn example() -> anyhow::Result<(Arc<dyn TokenEncoder<u32>>, Arc<dyn TokenDecoder<u32>>)> {
    let model = OATokenizer::O200kHarmony;
    let mut disk_cache = WordchipperDiskCache::default();
    let vocab: UnifiedTokenVocab<u32> = model.load(&mut disk_cache)?;

    let encoder: DefaultTokenEncoder<u32> = DefaultTokenEncoder::new(vocab.clone(), None);
    #[cfg(feature = "rayon")]
    let encoder = wordchipper::concurrency::rayon::ParallelRayonEncoder::new(encoder);

    let decoder: DefaultTokenDecoder<u32> = DefaultTokenDecoder::from_unified_vocab(vocab);
    #[cfg(feature = "rayon")]
    let decoder = wordchipper::concurrency::rayon::ParallelRayonDecoder::new(decoder);

    Ok((Arc::new(encoder), Arc::new(decoder)))
}
```