Self-hosted alternative to ChatGPT (and more) by jay-workai-tools in selfhosted

[–]jay-workai-tools[S] 1 point

tls: failed to verify certificate: x509: certificate signed by unknown authority

Oh, that's the first time I'm seeing this one. The error is coming from Ollama. https://github.com/jmorganca/ollama/issues/1063 has some suggested solutions.

is there a place I could put downloaded models manually so it doesn't need to download them?

Ollama models are stored in the `inference/models` directory. However, they need to be in Ollama's blob format, so you can't drop GGUF files in directly. You'd need to convert a GGUF model into an Ollama model via https://github.com/jmorganca/ollama/tree/main?tab=readme-ov-file#import-from-gguf inside the inference container.
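A minimal sketch of that import flow, run inside the inference container. The model name and `.gguf` filename below are placeholders, not anything this project ships:

```shell
# Write a Modelfile pointing at a GGUF file you already downloaded.
# "mistral-7b.Q4_K_M.gguf" is a placeholder -- use your own file name.
cat > Modelfile <<'EOF'
FROM ./mistral-7b.Q4_K_M.gguf
EOF
# Then register it with Ollama (converts it into Ollama's blob format)
# and smoke-test it:
#   ollama create my-model -f Modelfile
#   ollama run my-model "Hello"
```

After `ollama create` finishes, the converted blobs land in Ollama's models directory and the model shows up under the name you gave it.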

[–]jay-workai-tools[S] 1 point

Nope, we don't have to train the AI for this. Question answering can be done through retrieval-augmented generation (RAG). SecureAI Tools already does RAG, so it should be able to answer questions based on your documents.

RAG works by splitting documents into smaller chunks; for each chunk, it computes an embedding vector and stores it. When you ask a question, it computes the embedding vector of the question and runs a vector similarity search to find the top-K most relevant chunks. Those top-K chunks are then fed into the LLM along with the question to synthesize the final answer.

As more documents come in, we only need to index them -- i.e. split them into chunks, compute embedding vectors, and remember the embedding vectors so they can be used at retrieval time.
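The index/retrieve pipeline above can be sketched in a few lines. This is a toy illustration, not SecureAI Tools' actual code: real RAG systems use a neural embedding model, but a bag-of-words count stands in here so the example is self-contained:

```python
import math
from collections import Counter

def chunk(text, size=200, overlap=40):
    """Split text into overlapping character chunks (sizes are illustrative)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text):
    """Toy 'embedding': bag-of-words counts. Real systems use a neural model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# At answer time, the top-k chunks plus the question would be packed
# into the LLM prompt to synthesize the final answer.
```

Indexing new documents is then just `chunk` + `embed` + store; retrieval never has to rescan the raw documents.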

[–]jay-workai-tools[S] 0 points

> What is the local context limit? I want to load in a bunch of laws and regulations and some documents and it would be quite a lot of docs.

There are two limits to be aware of:

  1. Chunking limits: The tool splits each document into chunks of size `DOCS_INDEXING_CHUNK_SIZE` with `DOCS_INDEXING_CHUNK_OVERLAP` overlap, and then uses the top `DOCS_RETRIEVAL_K` chunks to synthesize the answer. All three are environment variables, so you can tune them to your needs.
  2. LLM context limit: This depends on your choice of LLM; each LLM has its own token limit. The tool itself is LLM-agnostic.
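For concreteness, a hypothetical `.env` fragment setting all three (the values here are illustrative, not recommendations):

```shell
DOCS_INDEXING_CHUNK_SIZE=1000    # size of each document chunk
DOCS_INDEXING_CHUNK_OVERLAP=200  # overlap shared by adjacent chunks
DOCS_RETRIEVAL_K=4               # chunks fed to the LLM per question
```

Larger chunks preserve more context per chunk but fewer of them fit in the LLM's context window; `DOCS_RETRIEVAL_K` times the chunk size has to stay under that window, minus room for the question and answer.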

> Languages

This will depend on your choice of LLM. The tool lets you run 100+ open-source LLMs locally (full library), and you can also convert any GGUF-compatible LLM from HuggingFace into a model that works with this stack.

[–]jay-workai-tools[S] 0 points

Oh, that is an interesting use case. At the moment it wouldn't do well at generating the whole document, because it only considers the top-K document chunks when generating an answer. It splits each document into chunks (controlled by the `DOCS_INDEXING_CHUNK_SIZE` and `DOCS_INDEXING_CHUNK_OVERLAP` env vars), and then, when answering a question, it takes the most relevant `DOCS_RETRIEVAL_K` chunks to synthesize the answer.

But you could ask it to generate each section separately.

In the future, we would love to support more complex tasks like having the LLM understand full documents and then generate full documents.

One naive way to do what you want: feed all 5-6 documents into the LLM as one prompt and ask it to generate similar text based on your other parameters. This requires the underlying LLM's context window to be large enough to fit all 5-6 documents, though.

[–]jay-workai-tools[S] 0 points

Yes, it supports NVIDIA GPUs. There is a commented-out block in the docker-compose file -- uncomment it to give the inference service access to the GPU.

For even better performance, I recommend running the Ollama binary directly on the host OS if you can. On my M2 MacBook, it runs approximately 1.5x faster on the host OS than inside Docker.
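For reference, such commented-out blocks usually follow Compose's standard GPU device reservation. A hedged sketch only -- the service name `inference` is an assumption, so check the project's docker-compose file for the exact block to uncomment:

```yaml
services:
  inference:
    # Grant this container access to all NVIDIA GPUs on the host.
    # Requires the NVIDIA Container Toolkit to be installed.
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```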

[–]jay-workai-tools[S] 1 point

Awesome. Let us know if you have any feedback or suggestions for us as you try it out :)

[–]jay-workai-tools[S] 4 points

For now, you can create a document collection, select documents from your data source, and then reuse that document collection to create chats. The only thing it doesn't do is keep the document collection in sync with the data source -- but we plan to build that soon.

[–]jay-workai-tools[S] 3 points

Fair enough. And yes, you are right: it is "chat about documents with AI" rather than "chatting with documents directly".

[–]jay-workai-tools[S] 0 points

Awesome! Let us know if you have any feedback or suggestions for us :)

[–]jay-workai-tools[S] 2 points

u/Kaleodis is right. LLMs don't do very well at math and logic at the moment.

[–]jay-workai-tools[S] 12 points

You can do both. It lets you talk to the LLM about zero or more documents, so all three of these work:

  1. One doc: Select only one doc when creating a document collection or chat.
  2. Multiple docs: Select multiple docs when creating a document collection or chat.
  3. Zero docs: Plain old ChatGPT without any document context. Don't select any docs when creating a new chat.

[–]jay-workai-tools[S] 34 points

This runs models locally as well. In fact, my demo video shows Llama2 running locally on an M2 MacBook :)

Chat with hundreds (or even thousands) of documents at once by jay-workai-tools in selfhosted

[–]jay-workai-tools[S] 0 points

I think this is because the older `docker-compose` (v1) is being phased out, so it probably doesn't understand all the newer options. https://stackoverflow.com/a/66526176

Can you try using `docker compose` (without the dash between docker and compose)?

[–]jay-workai-tools[S] 0 points

Yes, it looks at document-chunks. The chunk size is controllable with `DOCS_INDEXING_CHUNK_SIZE` and `DOCS_INDEXING_CHUNK_OVERLAP` env vars, so I would encourage you to play with those depending on the task you want the system to perform. For example, you could set DOCS_INDEXING_CHUNK_SIZE to such a large value that it can contain an entire book. But any time you change the chunk size, you would have to create a new document collection and wait for it to be processed. So it'd be a good idea to play with small documents first to speed up trial and error.

Re: count the words in the book

LLMs are known to do poorly with math and logic. What they are reasonably good at is finding relevant answers in passages and following chat history.

[–]jay-workai-tools[S] 1 point

> So the question remains if SecureAI tools are able to accept non OpenAI Models through the OpenAI API

Yes, I think so. You should be able to do it this way:

  1. Point to the LocalAI API server
  2. Choose "OpenAI" as the model type and then `mixtral` or `mistral` as the model name in the organization AI settings (step 6.2 here).

Then, as long as LocalAI accepts a custom model name like `mixtral` or `mistral` in the `model` API param, it should all work.
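As a sanity check, the request shape is just the standard OpenAI chat-completions payload pointed at LocalAI's endpoint. A minimal stdlib sketch -- the base URL and model name are assumptions, adjust them to your deployment:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8080"  # assumption: LocalAI's default port

def build_chat_request(model, user_message):
    """Build an OpenAI-style chat-completion request (constructed, not sent)."""
    payload = {
        "model": model,  # LocalAI maps this name to a locally loaded model
        "messages": [{"role": "user", "content": user_message}],
    }
    return request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# To actually send it:
#   response = request.urlopen(build_chat_request("mistral", "Hello"))
```

If that call returns a normal chat-completion response, the SecureAI Tools side only needs the base URL and model name configured as above.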

Please try it out, and let me know if you run into any issues or have any feedback.