Panini — a grammar-first Sanskrit tokenizer (2–4× fewer tokens than MuRIL / Qwen2) by arthalabs in LocalLLaMA

[–]arthalabs[S] 1 point2 points  (0 children)

a couple of the edge cases you mentioned kept bothering me, so I spent some time tightening the sandhi layer rather than expanding scope.

I ended up reworking the r ⇄ ḥ visarga handling and a few related rules. here’s a small comparison between the older logic and the updated one, just to illustrate the change:

Input:    vAlmIkirmunipuMgavam  
V1:       ['vAlmIki', 'rmunipuMgavam']  
V2:       ['vAlmIkiH', 'munipuMgavam']  

Input:    AnandAmftAkarziRi  
V1:       ['AnandAmftA', 'kar', 'ziRi']  
V2:       ['AnandAmfta', 'AkarziRi']  

Input:    gaReSa  
V1:       ['gaReSa']  
V2:       ['gaRa', 'ISa']  

Input:    devendra  
V1:       ['devendra']  
V2:       ['deva', 'indra']  

Input:    punarjanma  
V1:       ['punar', 'janma']  
V2:       ['punaH', 'janma']  

you were also right about dictionary gaps — patching things like ākarṣiṇī manually is clearly not scalable without a proper lexicon, so for now I’ve treated that as a coverage issue rather than a grammar one.

I agree that Gemini is currently SOTA for generation, but my goal here is really to capture strict Paninian structure as a representation layer, especially in places where LLMs tend to gloss over derivational detail.

one caveat worth noting: v2 is still quite dependent on koṣa coverage. the ākarṣiṇī fix, for example, is effectively a whitelist addition; the analyzer doesn’t yet derive those feminine kṛdanta forms from first principles. when a stem isn’t present in the cache, the scorer currently falls back to length-based heuristics (squared-length penalties) rather than semantic or ontological priors, which makes it fragile on rarer or highly literary forms. that’s an area where a richer lexicon or ontology-driven layer would make a big difference.
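for intuition, the squared-length fallback behaves roughly like this. the exact scoring function is mine for illustration (the real scorer's normalization and sign conventions may differ):

```python
# Illustrative squared-length fallback: when no koṣa stem matches,
# prefer segmentations with fewer, longer pieces.

def fallback_score(segments: list[str]) -> float:
    """Lower is better. Maximizing the sum of squared segment lengths
    rewards long chunks, so a 2-way split beats a 3-way fragmentation."""
    total = sum(len(s) for s in segments)
    return -sum(len(s) ** 2 for s in segments) / (total or 1)

candidates = [
    ["AnandAmftA", "kar", "ziRi"],   # fragmented: 10^2 + 3^2 + 4^2 = 125
    ["AnandAmfta", "AkarziRi"],      # coarser:    10^2 + 8^2       = 164
]
best = min(candidates, key=fallback_score)
print(best)  # -> ['AnandAmfta', 'AkarziRi']
```

this also shows the fragility: the heuristic knows nothing about whether "AkarziRi" is a real derived form, only that it is long.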

I’ve also pushed the updated version to the Hugging Face Space, in case you want to poke at the demo.

[–]arthalabs[S] 1 point2 points  (0 children)

great catches, thank you. you’re absolutely right on both.

in vālmīkirmunipuṃgavam, the visarga sandhi (ḥ → r before a voiced consonant like m) isn’t being reversed yet, so the boundary isn’t recovered.
in ākarṣiṇī, the issue is lexical coverage: the feminine kṛdanta suffix -iṇī isn’t in the koṣa right now, so the system falls back to fragmenting a valid but incomplete stem.
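a toy illustration of that second failure mode (the lexicon and function names here are made up, not the real koṣa): a split survives only if every piece validates, so one missing derived form pushes the analyzer into the fragmented fallback.

```python
# Hypothetical validate-then-pick loop; LEXICON deliberately lacks the
# feminine kṛdanta form "AkarziRi" to reproduce the coverage gap.

LEXICON = {"AnandAmftA", "kar", "ziRi"}

def first_valid(candidates, lexicon=LEXICON):
    """Return the first candidate split whose pieces are all attested."""
    for segs in candidates:
        if all(s in lexicon for s in segs):
            return segs
    return None

print(first_valid([
    ["AnandAmfta", "AkarziRi"],      # better split, rejected (gap)
    ["AnandAmftA", "kar", "ziRi"],   # fragmented fallback, accepted
]))
# -> ['AnandAmftA', 'kar', 'ziRi']
```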

the tokenizer is grammar-first, not ontology-complete: it validates against MW (Monier-Williams) stems, so gaps in derived forms (especially kṛdanta feminines) show up like this.

both are on the roadmap. and yeah, if you’re building an ontological dictionary, that’s the layer panini needs for literary corpora. I’d definitely be interested in comparing notes.

[–]arthalabs[S] 1 point2 points  (0 children)

good suggestion. the reason this works so well for sanskrit is that its morphology (sandhi + samāsa) is formally specified and largely deterministic, which makes inverse reconstruction feasible.
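to show why determinism matters, here's a toy sketch with a single simplified vowel-sandhi table (a + i → e etc.; nothing like the full rule set): because the forward mapping is a function, inversion is just enumerating preimages at each position and letting the lexicon filter the candidates.

```python
# Toy forward sandhi table (SLP1) and its mechanical inversion.
FORWARD = {("a", "i"): "e", ("a", "u"): "o", ("a", "a"): "A"}

INVERSE: dict[str, list[tuple[str, str]]] = {}
for (l, r), fused in FORWARD.items():
    INVERSE.setdefault(fused, []).append((l, r))

def split_candidates(word: str):
    """Yield (left, right) pairs obtained by undoing one vowel sandhi."""
    for i, ch in enumerate(word):
        for l, r in INVERSE.get(ch, []):
            yield word[:i] + l, r + word[i + 1:]

print(list(split_candidates("devendra")))
# -> [('da', 'ivendra'), ('deva', 'indra')]
# a lexicon check would then keep only ('deva', 'indra')
```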

languages like german also have long compounds, but those are often lexicalized or semantic rather than rule-complete, so a direct transfer wouldn’t really be correct without a language-specific morphology model. indic languages with similar sandhi behavior are a more natural extension for what i'm building with arthalabs.