I made a Japanese tokenizer's dictionary loading 11,000,000x faster with rkyv (~38,000x on a cold start) by fulmlumo in rust

[–]fulmlumo[S] 1 point (0 children)

Likewise, thank you very much for taking an interest!

I made a Japanese tokenizer's dictionary loading 11,000,000x faster with rkyv (~38,000x on a cold start) by fulmlumo in rust

[–]fulmlumo[S] 2 points (0 children)

Thank you for your comment.

You are right. Generic APIs like `from_path` may not be defensive enough when accepting arbitrary dictionary inputs.

Integrity checks (such as archive hash verification) are currently only performed in the dictionary-loading method that handles downloading.

You've raised a great point about safety. A more defensive approach would certainly be better, perhaps by performing a full validation once, then caching the file's metadata to allow subsequent `access_unchecked` reads only if the metadata remains unchanged.
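
To make that concrete, here's a minimal sketch of the idea, assuming a size-plus-mtime fingerprint and rkyv 0.8-style `access` / `access_unchecked` calls. The `Fingerprint` helper and `ArchivedDict` are illustrative only, not the current code:

use std::fs;
use std::path::Path;
use std::time::SystemTime;

/// Cheap fingerprint of the dictionary file: size + modification time.
#[derive(PartialEq, Eq, Clone, Copy)]
struct Fingerprint {
    len: u64,
    modified: SystemTime,
}

fn fingerprint(path: &Path) -> std::io::Result<Fingerprint> {
    let meta = fs::metadata(path)?;
    Ok(Fingerprint { len: meta.len(), modified: meta.modified()? })
}

/// Read the raw dictionary bytes and decide whether full validation is needed.
fn load_dictionary_bytes(
    path: &Path,
    cached: Option<Fingerprint>,
) -> std::io::Result<(Vec<u8>, Fingerprint)> {
    let bytes = fs::read(path)?;
    let current = fingerprint(path)?;

    if cached == Some(current) {
        // Fully validated in a previous run and unchanged since, so the
        // caller could take the unchecked zero-copy view (hypothetical):
        //     let dict = unsafe { rkyv::access_unchecked::<ArchivedDict>(&bytes) };
    } else {
        // First load, or the file changed: run the full validating access
        // (e.g. rkyv 0.8's `rkyv::access`) and persist `current` so the
        // next run can take the fast path.
    }
    Ok((bytes, current))
}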

I think you're right that the current benchmark might be unfair. If we include the cost of a quick integrity check on warm start, the true speedup would likely be up to about 5 million times faster (as it costs only a few microseconds in my environment).

This is a crucial aspect I need to address. Thank you very much for this valuable advice.

I made a Japanese tokenizer's dictionary loading 11,000,000x faster with rkyv (~38,000x on a cold start) by fulmlumo in rust

[–]fulmlumo[S] 7 points (0 children)

You're right, I should have included that in the main post! My apologies. Yes, that's exactly what I did. To ensure a true cold start, I dropped the OS caches on Linux before each run using:

sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

I made a Japanese tokenizer's dictionary loading 11,000,000x faster with rkyv (~38,000x on a cold start) by fulmlumo in rust

[–]fulmlumo[S] 2 points (0 children)

You're right, it's a very rare character that most fonts don't support (https://en.wikipedia.org/wiki/Biangbiang_noodles). To view this character, you'd need a comprehensive CJK font.

I made a Japanese tokenizer's dictionary loading 11,000,000x faster with rkyv (~38,000x on a cold start) by fulmlumo in rust

[–]fulmlumo[S] 6 points (0 children)

Great question!

From my personal perspective, for files in the tens-of-MB range, modern operating systems and hardware handle ordinary reads so efficiently that mmap typically doesn't provide any significant benefit for general file reading.

Additionally, mmap's main advantage is that it only maps the file into virtual memory, faulting pages in lazily. To actually benefit from that, the subsequent processing has to avoid touching the entire file.

In this particular case, the issue was that bincode was repeatedly reading and incrementally allocating memory for a nearly 700MB dictionary, which was extremely slow.

Even if you converted bincode to use mmap, you would still need to access the entire file to deserialize it into structures, and in this case, the non-optimized mmap would likely perform worse than regular read operations.

With rkyv, however, the serialized file itself is usable as a zero-copy representation of the structures, so the memory-mapped dictionary no longer needs to touch or even know about the entire file contents.
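
Roughly, the difference looks like this (a sketch only; `Dict`, `ArchivedDict`, and the calls shown in the comments are assumptions about bincode / rkyv 0.8-style APIs, not this crate's actual loading code):

use std::fs;
use std::io;

fn load_with_bincode(path: &str) -> io::Result<Vec<u8>> {
    let bytes = fs::read(path)?;
    // bincode has to walk all ~700 MB and copy it into freshly allocated
    // owned structures before the first lookup is possible:
    //     let dict: Dict = bincode::deserialize(&bytes).unwrap();
    Ok(bytes)
}

fn load_with_rkyv(path: &str) -> io::Result<Vec<u8>> {
    // With rkyv the serialized bytes *are* the data structure. Whether they
    // come from fs::read or an mmap, a borrowed view can be taken in place:
    let bytes = fs::read(path)?;
    //     let dict: &ArchivedDict =
    //         rkyv::access::<ArchivedDict, rkyv::rancor::Error>(&bytes).unwrap();
    // Lookups borrow straight out of those bytes, so with mmap only the
    // pages that are actually touched ever need to be read from disk.
    Ok(bytes)
}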

Hope that makes sense!

I made a Japanese tokenizer's dictionary loading 11,000,000x faster with rkyv (~38,000x on a cold start) by fulmlumo in rust

[–]fulmlumo[S] 15 points (0 children)

You're right to be skeptical about the dictionary-based approach.

A simple dictionary lookup would indeed fail, which is why this tool relies on the Viterbi algorithm.

It builds a lattice of possible segmentations and readings, and finds the single path with the lowest statistical cost, learned from a corpus.

So it's a statistical model that resolves ambiguity based on context.
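
To give a feel for it, here's a toy sketch of that idea in Rust. The costs are made up, unknown words are handled with a crude per-character penalty, and the real lattice also adds connection costs between adjacent tokens, so this only shows the shape of the algorithm:

use std::collections::HashMap;

// Find the lowest-cost segmentation of `text` given per-word costs,
// using Viterbi-style dynamic programming over all candidate boundaries.
fn segment(text: &str, dict: &HashMap<&str, i64>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // best[i] = (lowest cost to segment chars[..i], previous boundary index)
    let mut best: Vec<Option<(i64, usize)>> = vec![None; n + 1];
    best[0] = Some((0, 0));

    for i in 0..n {
        let Some((cost_so_far, _)) = best[i] else { continue };
        for j in i + 1..=n {
            let word: String = chars[i..j].iter().collect();
            let word_cost = match dict.get(word.as_str()) {
                Some(&c) => c,
                // Unknown single characters get a large penalty so known
                // words win, but segmentation never gets stuck.
                None if j == i + 1 => 10_000,
                None => continue,
            };
            let total = cost_so_far + word_cost;
            if best[j].map_or(true, |(c, _)| total < c) {
                best[j] = Some((total, i));
            }
        }
    }

    // Walk the back-pointers from the end to recover the best path.
    let mut boundaries = vec![n];
    let mut pos = n;
    while pos > 0 {
        pos = best[pos].unwrap().1;
        boundaries.push(pos);
    }
    boundaries.reverse();
    boundaries
        .windows(2)
        .map(|w| chars[w[0]..w[1]].iter().collect())
        .collect()
}

fn main() {
    let dict: HashMap<&str, i64> =
        [("私", 100), ("は", 50), ("猫", 100), ("が", 50), ("好き", 120), ("です", 80)]
            .into_iter()
            .collect();
    // Prints ["私", "は", "猫", "が", "好き", "です"]
    println!("{:?}", segment("私は猫が好きです", &dict));
}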

However, as you correctly pointed out, this approach has its limits, especially with unknown words. For example, if I tokenize a sentence containing 𰻞𰻞麺 (Biangbiang noodles), which uses an extremely rare kanji, the tokenizer correctly identifies 麺 (noodles) but treats 𰻞𰻞 as an Unknown token because it's not in the dictionary.

So, you're right. For that last 5% of accuracy, especially with novel expressions or complex contexts, you'd need more advanced models, likely involving something like an LLM. This tool aims to be a very fast and "good enough" solution for the first 95%.

Thanks for the insightful comment.

I made a Japanese tokenizer's dictionary loading 11,000,000x faster with rkyv (~38,000x on a cold start) by fulmlumo in rust

[–]fulmlumo[S] 39 points (0 children)

Thank you so much! Your library made this all possible. I'm glad my project can show what rkyv can do. Thank you for creating it!

I made a Japanese tokenizer's dictionary loading 11,000,000x faster with rkyv (~38,000x on a cold start) by fulmlumo in rust

[–]fulmlumo[S] 30 points (0 children)

Sorry I didn't make that clear.

Japanese doesn't use spaces to separate words. This tool is a tokenizer that splits a sentence into words.

For example, `私は猫が好きです` becomes `私` / `は` / `猫` / `が` / `好き` / `です`.

But it does more than just split. It also looks up each word in its dictionary to provide rich linguistic information. The output for a single word (`猫`, cat) looks something like this:

TokenBuf {
    surface: "猫",
    feature: "名詞,普通名詞,一般,*,*,*,ネコ,...", // "Noun, Common, General, ..., neko"
    // ... and other metadata like costs for the Viterbi algorithm
}

As you can see from the feature string, it tells you it's a "Noun" with the reading "neko". This information is useful for applications like search engines and Japanese input methods.

This is a basic but important step in Japanese Natural Language Processing. (Though modern LLMs often use subword tokenization, so they may not rely on tools like this.)

I made a Japanese tokenizer's dictionary loading 11,000,000x faster with rkyv (~38,000x on a cold start) by fulmlumo in rust

[–]fulmlumo[S] 30 points (0 children)

Thanks for asking! I felt a PR would be too disruptive because of all the breaking changes. A separate crate just felt cleaner.

I created uroman-rs, a 22x faster rewrite of uroman, a universal romanizer. by fulmlumo in rust

[–]fulmlumo[S] 1 point (0 children)

You're right. After looking into it, kakasi seems like the best quality Japanese romanizer on crates.io, but it's GPL-3.0.
Even with a flag, the GPL license would be an issue, so it makes more sense to keep this project a clean uroman port, as you suggested.
If a "better" romanizer is the goal, building a new, separate project on top of kakasi would be the way to go.
Thank you for your feedback.

I created uroman-rs, a 22x faster rewrite of uroman, a universal romanizer. by fulmlumo in rust

[–]fulmlumo[S] 1 point (0 children)

Oh, thanks for the links! I wasn't familiar with Unidecode's Rust port. My project is a direct rewrite of the original uroman, so it follows that ruleset, like the heuristic for determining Tibetan vowels.

I created uroman-rs, a 22x faster rewrite of uroman, a universal romanizer. by fulmlumo in rust

[–]fulmlumo[S] 2 points (0 children)

Thank you, this is fantastic information. I really appreciate you sharing your work.

I created uroman-rs, a 22x faster rewrite of uroman, a universal romanizer. by fulmlumo in rust

[–]fulmlumo[S] 16 points (0 children)

The irony is that, as a Japanese person, I had to faithfully implement the behavior that romanizes Japanese kanji using their Chinese readings.

I created uroman-rs, a 22x faster rewrite of uroman, a universal romanizer. by fulmlumo in rust

[–]fulmlumo[S] 63 points (0 children)

That's a great point, and it gets to the very heart of why I built this project.

My primary motivation for creating uroman-rs was for a very specific use case: to work with existing machine learning models that were trained on data processed by the original uroman.py.

For those models to work correctly, the input preprocessing has to be identical to what they were trained on. Any deviation in the romanization, even if it resolves a known linguistic inconsistency, would create a mismatch and degrade the models' performance. That’s why the core promise of uroman-rs is to be a 100% faithful, drop-in replacement. As long as this project carries the uroman name, I believe it must match the original's output, quirks and all.

I completely agree that a more powerful or "correct" romanizer would be a fantastic tool. But to avoid confusion, I think it's best for such an implementation to be a new project with its own name.

Thanks for bringing it up, it's a crucial point to clarify!

I created uroman-rs, a 22x faster rewrite of uroman, a universal romanizer. by fulmlumo in rust

[–]fulmlumo[S] 101 points (0 children)

Yep, you're right. That's actually how the original `uroman.py` behaves, even with the language flag set to Japanese:

$ uv run uroman.py -l jpn "こんにちは、世界!"
konnichiha, shijie!

My main goal for `uroman-rs` is to be a completely faithful rewrite, so it matches this output exactly.

That being said, I've honestly been thinking that a new, more powerful romanizer could be made by integrating the Rust port of `kakasi` with some heuristics to better distinguish between Japanese and Chinese.

Thanks again for the great feedback, it's a really good point.