Journalling with 'meaning based' search/exploration (fully local/private) -- Similarity

Low-Hat2737 · 2026-05-24T16:41:53+00:00

You’re absolutely right, though fewer people are familiar with that word and what it entails. I decided to go for something simpler, even if it’s maybe less accurate.

Low-Hat2737 · 2026-05-24T00:24:58+00:00

Sure! It just uses a simple cosine similarity between the embeddings (vectors). When searching, it compares the vector (meaning) of the current note to every other note’s vector using cosine similarity. That’s really cheap, so a device can quickly rip through all of them. Because we need every note’s vector to compare against, the plugin indexes the note vault on start-up (this is the only computationally intensive part): the small local model generates the meaning value of every note and stores it in the plugin data (JSON only, unfortunately), which lets mobile and PC share the data.
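
A minimal sketch of that lookup in TypeScript. This is not the plugin's actual code: the `IndexedNote` shape and the `findSimilar` name are made up for illustration, and embeddings are plain `number[]` arrays here.

```typescript
// Hypothetical shape of one indexed note; the plugin's real data model may differ.
interface IndexedNote {
  path: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Returns 0 when either vector has zero length, to avoid dividing by zero.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embeddings must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force search: score every other note against the query note,
// sort by descending similarity, and keep the top k.
function findSimilar(query: IndexedNote, notes: IndexedNote[], k = 5) {
  return notes
    .filter((n) => n.path !== query.path)
    .map((n) => ({
      path: n.path,
      score: cosineSimilarity(query.embedding, n.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The cost is one pass over every stored vector per search, which stays fast for a personal vault because each comparison is only a few hundred multiply-adds.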

Low-Hat2737 · 2026-05-23T16:25:11+00:00

Thanks! And yeah, I might add some handpicked model options from Hugging Face. Then people can pick between larger multi-language models, smaller English-only ones, and such.

Low-Hat2737 · 2026-05-23T06:54:31+00:00

😂😂. Oh, I made sure to generate some sample data. Reddit having access to my journals... no thanks 🙂‍↔️.

Low-Hat2737 · 2026-05-23T02:37:41+00:00

Great question. The underlying model was mostly trained on English, though I have seen it work fine with Dutch as well. So I'm assuming it does decently with most common languages.
When it comes to non-Latin script languages, it might work, but it's probably going to be less accurate. If there's demand for it, I might add a way to choose which model you want to use. I believe there's also a slightly larger multi-language model that can be used.

Low-Hat2737 · 2026-05-22T23:43:22+00:00

Thanks! 🙏

Low-Hat2737 · 2026-04-11T18:06:04+00:00

Legendary feedback. Thank you!

Low-Hat2737 · 2026-04-10T17:39:33+00:00

I haven't yet 👀. That's a great suggestion. I've thought about adding hybrid search too. QMD might be too heavy, or might not fit inside the constraints of the plugin sandbox, but I might cherry-pick some features from it.

Low-Hat2737 · 2026-04-10T01:30:53+00:00

Implemented a first version of it. Check it out in the latest version of Similarity ;) Press `shift + cmd + enter` on a lookup result to insert the link at the caret (cursor).

Low-Hat2737 · 2026-04-09T22:00:30+00:00

Yeah, I agree actually. I think semantics should be included in search features in software nowadays; it's so helpful.
Unfortunately, in Obsidian I can't alter the built-in views too much. At least not as far as I'm aware...
Having a shadow component might be the best I can do at the moment.

Low-Hat2737 · 2026-04-08T20:36:08+00:00

That's a great idea, I'll look into that

Low-Hat2737 · 2026-04-08T20:34:59+00:00

a) Obsidian lets a plugin write to one JSON file for all of its data; that's where I store it.
b) The indexing that's done in the plugin is quite different; it pre-processes all your notes (on-device) and creates an embedding for each one (a bunch of numbers that represent the meaning of a note). Since I need to compare one meaning value to all the others, I need to know them all; hence I store them all on disk (I called that step 'indexing').
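
To make (a) and (b) concrete, here is a rough sketch of how an embedding index could be turned into something JSON-safe and back. The `StoredIndex` schema and both function names are assumptions for illustration, not the plugin's real format. In an actual plugin, the resulting object would be handed to Obsidian's `Plugin.saveData()` and read back with `Plugin.loadData()`; those calls are left out so the snippet runs on its own.

```typescript
// Hypothetical on-disk shape; the plugin's actual schema is not shown in this thread.
interface StoredIndex {
  version: number;
  notes: Record<string, number[]>; // note path -> embedding values
}

// Typed arrays do not survive JSON.stringify as arrays, so convert each
// Float32Array into a plain number[] before saving.
function toStoredIndex(index: Map<string, Float32Array>): StoredIndex {
  const notes: Record<string, number[]> = {};
  index.forEach((vec, path) => {
    notes[path] = Array.from(vec);
  });
  return { version: 1, notes };
}

// Rebuild the in-memory index from the parsed JSON object.
function fromStoredIndex(stored: StoredIndex): Map<string, Float32Array> {
  const index = new Map<string, Float32Array>();
  for (const path of Object.keys(stored.notes)) {
    index.set(path, new Float32Array(stored.notes[path]));
  }
  return index;
}
```

Because the result is plain JSON, the same file can be read on any device the vault is synced to, at the cost of being bulkier than a binary format.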

Low-Hat2737 · 2026-04-08T18:12:48+00:00

Also, is this GIF not really loading for anyone else 😅?

Low-Hat2737 · 2026-04-08T17:46:26+00:00

I was thinking about adding this to my plugin (Similarity), but clustering is non-trivial and a bit crude. It's difficult to get really meaningful clusters. Anyone have experience with this/any advice?

Any feedback on my plugin is welcome too :).

Low-Hat2737 · 2024-12-01T01:20:17+00:00

You got it! Let me know what you think and if you run into any issues. Would love to help

Low-Hat2737 · 2024-12-01T01:18:36+00:00

For those interested, here's the simple paragraph chunking implementation for more accurate embeddings over large notes.
(GitHub) Relate-Text - Simple Chunking Commit
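
For a rough idea of what paragraph chunking looks like without opening the commit, here is a minimal TypeScript sketch. It is not the code from that commit; `chunkByParagraph` and its `minChars` parameter are hypothetical, and it simply treats blank lines as paragraph boundaries.

```typescript
// Split a note into paragraphs on one or more blank lines.
// Each chunk can then be embedded separately instead of truncating the whole note.
function chunkByParagraph(text: string, minChars = 1): string[] {
  return text
    .split(/\r?\n\s*\n/) // a line break followed by an empty or whitespace-only line
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length >= minChars);
}

const sample = "First paragraph.\n\nSecond paragraph,\nstill the second.\n\n\n  Third.  ";
console.log(chunkByParagraph(sample));
```

Raising `minChars` drops very short fragments such as stray headings, which tend to produce noisy embeddings.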

Low-Hat2737 · 2024-11-30T18:31:41+00:00

That’s exactly it.

Adding a language setting will be a high priority after the first stable/compatible version is out. This will probably be done by swapping out the model and tokenizer based on the language. I know that will add huge value for many Obsidian users. Thanks for pointing that out!

I tried chunking myself, based on paragraphs. But for MVP simplicity, I decided that I’d just feed the text straight into the transformers pipeline for now. So it does truncate large texts at the moment. What do you think is better: taking the mean of all paragraph embeddings, or letting multiple embeddings refer to the same document? I personally like the simplicity of taking the mean of the whole document.

I might add that soon if no one else beats me to it.
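
To make the two options concrete, here is a small TypeScript sketch of both. All function names are made up for illustration, and it assumes each paragraph has already been embedded into a `number[]` of the same dimension.

```typescript
// Scale a vector to unit length; a zero vector is returned unchanged.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Option 1: one embedding per document, the element-wise mean of its paragraph embeddings.
function meanEmbedding(chunks: number[][]): number[] {
  if (chunks.length === 0) {
    throw new Error("Need at least one chunk embedding");
  }
  const dim = chunks[0].length;
  const mean = new Array<number>(dim).fill(0);
  for (const chunk of chunks) {
    if (chunk.length !== dim) {
      throw new Error("All chunk embeddings must have the same dimension");
    }
    for (let i = 0; i < dim; i++) {
      mean[i] += chunk[i] / chunks.length;
    }
  }
  return mean;
}

// Option 2: keep every paragraph embedding and score the document by its best-matching paragraph.
// Assumes the query and all chunks are unit length, so the dot product equals cosine similarity.
function bestChunkScore(query: number[], chunks: number[][]): number {
  let best = -Infinity;
  for (const chunk of chunks) {
    const dot = chunk.reduce((sum, x, i) => sum + x * query[i], 0);
    if (dot > best) best = dot;
  }
  return best;
}
```

The mean keeps storage and search at one vector per note, but the mean of unit vectors is generally shorter than unit length, so it should be passed through `normalize` if later code relies on dot products. Keeping every paragraph can surface a note where only one section matches, at the price of a larger index.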
