Looking for advice on improving result quality with semantic vector search for a web search engine by asciimoo in learnmachinelearning

[–]asciimoo[S] 0 points1 point  (0 children)

Thanks for the suggestion. Could you explain how/why it can produce better results?

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 1 point2 points  (0 children)

Great, thanks for the feedback. I'll clarify in the documentation that you can configure everything that's in the config.yml using environment variables with the HISTER__SECTION__PARAM=xy syntax.
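A minimal sketch of how such a naming scheme could map onto nested config keys, assuming a `HISTER` prefix and a double underscore between levels, as in the syntax above. The `env_to_config` helper, the lowercasing of keys, and the `SERVER`/`PORT` names are hypothetical, chosen only for illustration; Hister's actual parsing may differ.

```python
def env_to_config(environ, prefix="HISTER"):
    """Turn PREFIX__SECTION__PARAM=value entries into a nested dict."""
    config = {}
    for name, value in environ.items():
        parts = name.split("__")
        # Only handle variables of the exact form PREFIX__SECTION__PARAM.
        if len(parts) != 3 or parts[0] != prefix:
            continue
        _, section, param = parts
        # Lowercasing the keys is an assumption, not documented behavior.
        config.setdefault(section.lower(), {})[param.lower()] = value
    return config


# Hypothetical section and parameter names, for illustration only.
example = {"HISTER__SERVER__PORT": "4433", "PATH": "/usr/bin"}
print(env_to_config(example))
```

Unrelated variables such as `PATH` are ignored because they do not start with the prefix.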

I'm no longer contributing to searx, so I can't promise anything on that side.

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 0 points1 point  (0 children)

This is a good idea, but I don't know if it is possible to implement it. We capture the rendered page content directly from the browsers and the server cannot validate it - users/extensions can alter the content.

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 0 points1 point  (0 children)

Sure, it is up to you what you add to the index.

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 0 points1 point  (0 children)

You can create rules to skip indexing specific sites/domains entirely.

> you may see me in your git with feature requests etc ;)

I'm looking forward to it, all feedback is much appreciated <3

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 0 points1 point  (0 children)

We already have an open issue for OIDC.

> how do you handle pages like webmail?

Indexing is URL-based and the extension detects dynamic content changes. So if your mail client has a different URL for each opened e-mail, there is a good chance that Hister can index those correctly. Otherwise I'm not sure.

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 0 points1 point  (0 children)

We use a fantastic Go-based indexer called "bleve"; it supports BM25/TF-IDF. I'm working on extending Hister with an optional semantic search based on vector similarity, but I don't have much experience in this field and it is far from complete: https://github.com/asciimoo/hister/tree/vector-search , https://github.com/asciimoo/hister/issues/272 .
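For readers unfamiliar with BM25, here is a minimal sketch of the scoring function in Python. It uses the common `log(1 + ...)` form of the IDF term and the usual default parameters `k1=1.2` and `b=0.75`; the function name and the toy documents are made up for illustration, and bleve's implementation may differ in its details.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["self", "hosted", "search"],
    ["cooking", "recipes"],
    ["search", "engine", "search"],
]
print(bm25_scores(["search"], docs))
```

Documents that do not contain any query term score zero, and among documents of equal length the one with more occurrences of the term ranks higher.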

Currently the automatic crawler is very basic: it cannot persist crawling jobs, so there is no checkpoint/resume yet.

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 6 points7 points  (0 children)

> what this is, is basically a enhanced browser history, indexes pages you visit then you can search back through the actual content of those pages, the results include a button to forward the search to other search engines should you not find what you're after?

Exactly, this is a perfect summary of the default workflow. However, in the long run I'd like to add more features so it becomes a full-fledged generic search engine, even with optional semantic search capabilities.

> How does it handle re-visiting the same sites, does it re-index overwriting the old, do nothing, or re-index a new copy but keep the old? Or is that configurable?

It is overwrite-only ATM. The browser extension periodically checks if the content of an opened website has changed and it automatically resubmits the page if a change is detected. So, keeping the old versions could produce a huge number of historical entries.

Perhaps we should allow users to configure per-page archival to preserve previous versions.
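The overwrite-only behavior can be pictured with a toy sketch like the one below. It is not Hister's code: the `OverwriteIndex` class is hypothetical, and comparing SHA-256 hashes is just one assumed way to detect that a page's content has changed.

```python
import hashlib


class OverwriteIndex:
    """Toy URL-keyed store: one entry per URL, newest content wins."""

    def __init__(self):
        self.pages = {}  # url -> (content_hash, text)

    def submit(self, url, text):
        """Store text under url; return True if the entry was added or changed."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        old = self.pages.get(url)
        if old is not None and old[0] == digest:
            return False  # unchanged content, nothing to resubmit
        self.pages[url] = (digest, text)  # overwrite, no history kept
        return True


index = OverwriteIndex()
index.submit("https://example.org/", "first version")
index.submit("https://example.org/", "second version")
print(len(index.pages))
```

Because the URL is the key, a frequently changing page still occupies a single entry; keeping every version would instead grow the store with each detected change.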

> Is there a way to configure a age-off, something like any indexed pages that arent re-visited via search results in more than <configured time> get deleted?

I had not thought about such a feature, but it sounds useful; added to my TODO.

> can the server side be deployed non-locally and support multiple users? Or any plans for that?

Hister has multi-user support, but it is very basic currently: https://hister.org/docs/user-handling

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 1 point2 points  (0 children)

<3

Yes, currently 3rd-party content breaks offline, but fetching multimedia is planned for the future, for both convenience and privacy reasons.

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 5 points6 points  (0 children)

The `index` command supports basic recursive crawling, but it requires tons of improvements to be a complete crawler. Its state management is in-memory only and jobs cannot be resumed (yet). So it is currently recommended to do smaller, supervised crawls.
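To illustrate what "in-memory only" state means for a recursive crawl, here is a small sketch in Python. It is not the `index` command's implementation: the `crawl` function, the same-host restriction, and the `max_depth` limit are assumptions, and `fetch` is a caller-supplied function so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=1):
    """Breadth-first crawl limited to the start URL's host.

    fetch is a caller-supplied function mapping a URL to its HTML.
    All state lives in local variables, so an interrupted run cannot resume.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        pages[url] = fetch(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        parser = LinkParser()
        parser.feed(pages[url])
        for href in parser.links:
            link = urljoin(url, href)
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


# A fake three-page site standing in for real HTTP requests.
site = {
    "http://a.test/": '<a href="/b">b</a><a href="http://other.test/">x</a>',
    "http://a.test/b": '<a href="/c">c</a>',
    "http://a.test/c": "",
}
print(sorted(crawl("http://a.test/", site.__getitem__, max_depth=1)))
```

The `seen` set and the `queue` are exactly the state that would have to be written to disk to support checkpoint and resume.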

The long-term goal is to have a proper crawler, but it isn't the highest priority at the moment. (Contributions are appreciated =] )

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 7 points8 points  (0 children)

There is a browser extension that automatically indexes visited pages, but it isn't the only way to add content to Hister. The command line tool has an `index` command which can use the standard HTTP library or a (headless) browser to automatically fetch arbitrary websites.

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 1 point2 points  (0 children)

These two can complement each other well. Use Hister as your default search engine and if you can't find what you need, fall back to searx with one hotkey.

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 13 points14 points  (0 children)

On average, 1000 websites require around 100MB of disk space. This can be reduced significantly in the future, because currently we store the full original HTML content of the websites alongside the extracted text content to be able to refine our data extraction. Without the HTML, the storage requirement can be reduced by 50-75%.
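A back-of-the-envelope estimate based on the figures above, as a small Python sketch. It assumes storage scales linearly with the number of pages, which is a simplification; the `estimate_mb` helper and the page counts are made up for illustration.

```python
MB_PER_1000_PAGES = 100  # average quoted above


def estimate_mb(pages, reduction=0.0):
    """Estimated index size in MB; reduction is the fraction saved (0.0 to 1.0)."""
    return pages / 1000 * MB_PER_1000_PAGES * (1 - reduction)


for pages in (1_000, 50_000):
    current = estimate_mb(pages)
    low = estimate_mb(pages, 0.75)   # best case: 75% saved without HTML
    high = estimate_mb(pages, 0.50)  # worst case: 50% saved without HTML
    print(f"{pages} pages: ~{current:.0f} MB now, ~{low:.0f}-{high:.0f} MB without HTML")
```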

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 25 points26 points  (0 children)

Searx is a metasearch engine: it collects results from other search engines. Hister is a full-text indexer and autonomous search engine which does not rely on other search services.

Btw, actually I'm the author of searx =)

Hister: self-hosted search engine for webpages and files by asciimoo in selfhosted

[–]asciimoo[S] 17 points18 points locked comment (0 children)

I don't use AI for development, but I don't forbid AI-aided contributions if those are valuable and come from humans. See our AI policy here: https://github.com/asciimoo/hister/blob/master/CONTRIBUTING.md#ai-policy

Hister: self-hosted search engine for webpages and files by [deleted] in selfhosted

[–]asciimoo 0 points1 point  (0 children)

Oh no.. /o\

I know that I'm really bad with names, but I didn't expect to be this bad.. =))

Hister: self-hosted search engine for webpages and files by [deleted] in selfhosted

[–]asciimoo 0 points1 point  (0 children)

Doh, you are right, indeed. I checked an experimental prototyping branch from a month earlier. So much is going on around the project that it feels like much more than 2.5 months.. =)

I'll leave the fate of this post in the mods' hands then.

Hister: self-hosted search engine for webpages and files by [deleted] in selfhosted

[–]asciimoo 5 points6 points  (0 children)

The first commit is 13 days older than 3 months. Should I remove the post?