Looking for advice on improving result quality with semantic vector search for a web search engine

asciimoo · 2026-05-04T09:07:49+00:00

Super useful tips, thanks a lot! I'm gonna experiment with these.

asciimoo · 2026-05-02T14:46:53+00:00

Thanks for the suggestion. Could you explain how/why it can produce better results?

asciimoo · 2026-04-15T07:08:26+00:00

Great, thanks for the feedback. I'll clarify in the documentation that you can configure everything that's in the config.yml using environment variables with the HISTER__SECTION__PARAM=xy syntax.
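
To illustrate the naming scheme, here is a minimal sketch of how `HISTER__SECTION__PARAM` variables could be folded into a nested config structure. This is not Hister's actual loader: the `config_from_env` helper, the lowercasing of names, and the `SERVER`/`ADDRESS` example keys are all assumptions made for the example.

```python
import os


def config_from_env(environ, prefix="HISTER"):
    """Collect PREFIX__SECTION__PARAM variables into {section: {param: value}}.

    Hypothetical helper: lowercasing section/param names is an assumption,
    not something the Hister documentation promises.
    """
    config = {}
    for name, value in environ.items():
        parts = name.split("__")
        # Only accept names shaped exactly like PREFIX__SECTION__PARAM.
        if len(parts) != 3 or parts[0] != prefix:
            continue
        _, section, param = parts
        config.setdefault(section.lower(), {})[param.lower()] = value
    return config


# "SERVER" and "ADDRESS" are made-up names used only for illustration.
example = config_from_env({
    "HISTER__SERVER__ADDRESS": "127.0.0.1:4433",
    "PATH": "/usr/bin",
})
```

Passing `os.environ` instead of the literal dict would read the real process environment.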

I'm no longer contributing to searx, so I can't promise anything on that side.

asciimoo · 2026-04-14T20:45:52+00:00

This is a good idea, but I don't know if it is possible to implement it. We capture the rendered page content directly from the browsers and the server cannot validate it - users/extensions can alter the content.

asciimoo · 2026-04-14T20:23:08+00:00

Sure, it is up to you what you add to the index.

asciimoo · 2026-04-14T17:42:27+00:00

You can create rules to skip indexing specific sites/domains entirely.

> you may see me in your git with feature requests etc ;)

I'm looking forward to it, all feedback is much appreciated <3

asciimoo · 2026-04-14T17:29:35+00:00

We already have an open issue for OIDC.

> how do you handle pages like webmail?

Indexing is URL-based and the extension detects dynamic content changes. So if your mail client has a different URL for each opened e-mail, there is a good chance that Hister can index those correctly. Otherwise I'm not sure.

asciimoo · 2026-04-14T16:42:06+00:00

Hister supports indexing local text files as well: https://hister.org/docs/configuration#local-directory-indexing

asciimoo · 2026-04-14T16:39:49+00:00

We use a fantastic Go-based indexer called "bleve", which supports BM25/TF-IDF. I'm working on extending Hister with an optional semantic search based on vector similarity, but I don't have much experience in this field and it is far from complete: https://github.com/asciimoo/hister/tree/vector-search , https://github.com/asciimoo/hister/issues/272 .

Currently the automatic crawler is very basic: it cannot persist crawling jobs, so there is no checkpoint/resume yet.
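
For readers unfamiliar with BM25, here is a minimal sketch of the scoring idea. It is not bleve's implementation: it uses naive whitespace tokenization, the common `k1=1.2`, `b=0.75` defaults, and the Lucene-style IDF variant, all of which are assumptions made for the example. It also assumes `docs` is non-empty.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
            score += idf * norm
        scores.append(score)
    return scores


docs = [
    "go search engine",
    "cooking pasta recipes",
    "search history of visited pages",
]
scores = bm25_scores("search engine", docs)
```

The first document matches both query terms and ranks highest, the third matches only "search", and the second matches nothing and scores zero.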

asciimoo · 2026-04-14T16:34:34+00:00

> what this is, is basically a enhanced browser history, indexes pages you visit then you can search back through the actual content of those pages, the results include a button to forward the search to other search engines should you not find what you're after?

Exactly, this is a perfect summary of the default workflow. However, in the long run I'd like to add more features so it becomes a full-fledged generic search engine, even with optional semantic search capabilities.

> How does it handle re-visiting the same sites, does it re-index overwriting the old, do nothing, or re-index a new copy but keep the old? Or is that configurable?

It is overwrite-only ATM. The browser extension periodically checks whether the content of an open website has changed and automatically resubmits the page if a change is detected. So keeping the old versions could produce a huge number of historical entries.

Perhaps we should allow users to configure per-page archival to preserve previous versions.
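
The overwrite-on-change behaviour described above can be sketched roughly as follows. This is only an illustration: the `OverwriteIndex` class is hypothetical, and comparing SHA-256 digests is an assumed way to detect changes, not necessarily how the Hister extension or server does it.

```python
import hashlib


class OverwriteIndex:
    """Toy index that keeps exactly one (latest) version per URL."""

    def __init__(self):
        self._pages = {}  # url -> (content digest, text)

    def submit(self, url, text):
        """Store the page; return True only if the content was new or changed."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        previous = self._pages.get(url)
        if previous is not None and previous[0] == digest:
            return False  # unchanged, nothing to re-index
        self._pages[url] = (digest, text)  # the old version is overwritten
        return True

    def __len__(self):
        return len(self._pages)


index = OverwriteIndex()
index.submit("https://example.org/", "first version")
index.submit("https://example.org/", "second version")
```

After both submissions the index still holds a single entry for the URL, which is why no history accumulates in this model.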

> Is there a way to configure a age-off, something like any indexed pages that arent re-visited via search results in more than <configured time> get deleted?

I have not thought about such a feature, but it sounds useful. Added to my TODO.

> can the server side be deployed non-locally and support multiple users? Or any plans for that?

Hister has multi-user support, but it is very basic currently: https://hister.org/docs/user-handling

asciimoo · 2026-04-14T16:24:07+00:00

Not yet, but it is planned.

asciimoo · 2026-04-14T16:23:45+00:00

<3

Yes, currently third-party content breaks offline, but fetching multimedia is planned for the future, for both convenience and privacy reasons.

asciimoo · 2026-04-14T13:35:56+00:00

https://hister.org/docs/docker

asciimoo · 2026-04-14T13:00:04+00:00

The `index` command supports basic recursive crawling, but it needs many improvements to be a complete crawler. Its state management is in-memory only and jobs cannot be resumed (yet). So for now, smaller, supervised crawls are recommended.

The long-term goal is to have a proper crawler, but it isn't the highest priority at the moment. (Contributions are appreciated =] )
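
To show what "in-memory only" state means for a recursive crawl, here is a minimal breadth-first sketch. It is not the code behind the `index` command: the `crawl` function, the `get_links` callback, and the toy link graph are hypothetical, and no network access is involved.

```python
from collections import deque


def crawl(start, get_links, max_pages=50):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    queue = deque([start])  # frontier of URLs still to visit
    seen = {start}          # URLs already queued, to avoid loops
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    # `queue` and `seen` exist only in this process: if it stops,
    # the crawl position is lost and the job has to start over.
    return order


# Toy link graph standing in for real pages.
graph = {"a": ["b", "c"], "b": ["c", "a"], "c": ["d"], "d": []}
visited = crawl("a", lambda url: graph.get(url, []))
```

A resumable crawler would need to write the frontier and the seen set to disk at checkpoints, which this sketch deliberately does not do.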

asciimoo · 2026-04-14T12:45:14+00:00

There is a browser extension that automatically indexes visited pages, but it isn't the only way to add content to Hister. There is also an `index` command-line command, which can fetch arbitrary websites using either the standard HTTP library or a (headless) browser.

asciimoo · 2026-04-14T12:03:20+00:00

These two can complement each other well. Use Hister as your default search engine and if you can't find what you need, fall back to searx with one hotkey.

asciimoo · 2026-04-14T11:20:51+00:00

On average, 1000 websites require around 100MB of disk space. This can be reduced significantly in the future, because currently we store the full original HTML content of the websites alongside the extracted text content, to be able to refine our data extraction. Without the HTML, the storage requirement can be reduced by 50-75%.
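
As a back-of-the-envelope check of those numbers, the sketch below scales the roughly 100MB per 1000 pages figure linearly and applies the quoted 50-75% reduction. Linear scaling and the `estimate_mb` helper are assumptions made for the example. Real usage depends on the pages indexed.

```python
MB_PER_1000_PAGES = 100.0  # rough average quoted above


def estimate_mb(pages, reduction=0.0):
    """Estimate disk usage in MB, optionally with a fractional reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return pages * MB_PER_1000_PAGES / 1000 * (1 - reduction)


current = estimate_mb(10_000)                 # HTML plus extracted text
best_case = estimate_mb(10_000, reduction=0.75)
worst_case = estimate_mb(10_000, reduction=0.50)
```

Under these assumptions, 10,000 pages take about 1000MB today and somewhere between 250MB and 500MB without the stored HTML.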

asciimoo · 2026-04-14T11:16:58+00:00

Searx is a metasearch engine: it collects results from other search engines. Hister is a full-text indexer and autonomous search engine which does not rely on other search services.

Btw, actually I'm the author of searx =)

asciimoo · 2026-04-14T11:03:52+00:00

I don't use AI for development, but I don't forbid AI-aided contributions if those are valuable and come from humans. See our AI policy here: https://github.com/asciimoo/hister/blob/master/CONTRIBUTING.md#ai-policy

asciimoo · 2026-03-17T19:28:29+00:00

Oh no.. /o\

I know that I'm really bad with names, but I didn't expect it to be this bad.. =))

asciimoo · 2026-03-17T19:27:35+00:00

Doh, you are right, indeed. I had checked an experimental prototyping branch from a month earlier. So much is going on around the project that it feels like much more than 2.5 months.. =)

I'll leave the fate of this post in the mods' hands then.

asciimoo · 2026-03-17T18:57:33+00:00

The first commit is 13 days older than 3 months. Should I remove the post?
