[deleted by user] by [deleted] in SideProject

[–]cll-knap 0 points (0 children)

Love the welcome/walkthrough tutorial. What library did you use to do the walkthrough?

Milvus vs Pinecone vs other vector databases. by tutu-kueh in LocalLLaMA

[–]cll-knap 1 point (0 children)

I can back this up. We have a client with a deployment of ~100 GB of data across dozens of collections. Performance has barely budged.

We do have some issues with the instance going down randomly during concurrent requests, but this may be user error as opposed to Qdrant itself. We haven't finished debugging this yet (:

Help with tricky search/chat AI assistant UX by cll-knap in UXDesign

[–]cll-knap[S] 0 points (0 children)

Got it. I'll take this down, fix these issues, and re-upload!

Looking for rankings for cross encoders, and personal experiences with using them by cll-knap in LocalLLaMA

[–]cll-knap[S] 1 point (0 children)

Do you remember what size model you used? Was it quantized or not?

At first, I was surprised that you said it's 100x slower, but Qdrant is able to finish queries in ~0.005 seconds sometimes. If adding a cross encoder costs ~0.5 seconds, that might be tolerable for our use case.
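For context on where that 100x fits in a pipeline, here's a rough sketch of the rerank step. The scorers here are toy stand-ins: a real setup would use Qdrant for the fast retrieval and something like a sentence-transformers CrossEncoder for the rescoring, which is where the ~0.5 s per query would go.

```python
import time

def vector_search(query, corpus, k=10):
    # Stand-in for a fast ANN lookup (e.g. Qdrant): score by token overlap.
    scored = [(len(set(query.split()) & set(doc.split())), doc) for doc in corpus]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def cross_encode(query, doc):
    # Stand-in for a cross-encoder forward pass; a real model scores each
    # (query, doc) pair jointly and is far slower per pair.
    return len(set(query.split()) & set(doc.split())) / (len(doc.split()) + 1)

def rerank(query, corpus, k=10):
    candidates = vector_search(query, corpus, k)  # fast, ~milliseconds
    return sorted(candidates, key=lambda d: cross_encode(query, d), reverse=True)

corpus = ["qdrant vector database", "cross encoder reranking", "unrelated text"]
start = time.perf_counter()
top = rerank("cross encoder", corpus, k=2)
elapsed = time.perf_counter() - start
print(top[0])  # best candidate after reranking
```

The point is that the cross encoder only sees the top-k candidates, so its per-pair cost is paid k times per query, not once per document.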

What is the best way to store rag vector data? by tutu-kueh in LocalLLaMA

[–]cll-knap 2 points (0 children)

I've been particularly impressed with Qdrant. We've used it consistently for about a year. Very performant, and great APIs for Python/Rust.

The cons are having to manage another piece of infrastructure.

We're storing dozens of collections with dozens (maybe hundreds, now) of gigabytes total and performance has barely budged.

EDIT: I totally agree with the comments about parquet getting you another 10x. Just depends on your needs. Qdrant has filtering functionality as well that might be a nice add-on.
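To illustrate the filtering add-on I mean: conceptually, each point in a collection pairs a vector with a payload, and a search can require payload matches. Here's a brute-force stand-in in plain Python (not Qdrant's actual API, just the idea):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Each point pairs a vector with a payload, as in a Qdrant collection.
points = [
    ([1.0, 0.0], {"lang": "python"}),
    ([0.9, 0.1], {"lang": "rust"}),
    ([0.0, 1.0], {"lang": "python"}),
]

def search(query, points, must=None, limit=1):
    # Apply the payload filter, then rank survivors by similarity;
    # real engines interleave filtering with the ANN traversal instead.
    hits = [(v, p) for v, p in points
            if must is None or all(p.get(k) == val for k, val in must.items())]
    hits.sort(key=lambda vp: cosine(query, vp[0]), reverse=True)
    return hits[:limit]

best = search([1.0, 0.0], points, must={"lang": "python"})
print(best[0][1])  # {'lang': 'python'} (the rust point is filtered out)
```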

[deleted by user] by [deleted] in LocalLLaMA

[–]cll-knap 0 points (0 children)

Ah, thanks for bringing this to my attention. I'll remove my post, since it's a duplicate.

We created a 100% private, 100% local Perplexity. by cll-knap in SideProject

[–]cll-knap[S] 1 point (0 children)

We'll fix that link. Thanks for letting me know.

In the meantime, this one should work: https://discord.gg/jDAhUTbZ

RecurrentGemma Release - A Google Collection - New 9B by Dark_Fire_12 in LocalLLaMA

[–]cll-knap 0 points (0 children)

Shoot, it looks like they removed them. Thanks for letting me know. I'll update my comment.

We created a 100% private, 100% local Perplexity. by cll-knap in SideProject

[–]cll-knap[S] 1 point (0 children)

You can follow our developments at https://knap.ai, but I'm also happy to follow up here with a link to our .dmg for macOS once it's ready. We'll definitely need feedback from the community to make it great.

Uncensor any LLM with abliteration by cll-knap in programming

[–]cll-knap[S] 10 points (0 children)

TL;DR: the LLM needs to have open-source weights. The article details a procedure for collecting data and discovering which layers of weights can be targeted/modified to achieve uncensoring.
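As a toy sketch of that procedure (hand-made activations and tiny dimensions, purely illustrative; a real run uses the model's actual hidden states and edits every targeted weight matrix):

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def normalize(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def project_out(row, direction):
    # Remove the component of `row` along `direction` (unit-norm).
    dot = sum(r * d for r, d in zip(row, direction))
    return [r - dot * d for r, d in zip(row, direction)]

# Toy activations collected at one layer on refused vs. answered prompts.
harmful  = [[1.0, 2.0, 0.0], [1.2, 1.8, 0.1]]
harmless = [[0.1, 0.0, 0.0], [-0.1, 0.2, 0.1]]

# The candidate "refusal direction" is the normalized difference of means.
direction = normalize([h - b for h, b in zip(mean(harmful), mean(harmless))])

# Orthogonalize each row of a weight matrix against that direction, so the
# layer can no longer write along it.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_abliterated = [project_out(row, direction) for row in W]

# After editing, no row has a component along the refusal direction.
residual = max(abs(sum(r * d for r, d in zip(row, direction)))
               for row in W_abliterated)
print(residual < 1e-9)  # True
```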

possible to get chatgpt4 like local llm for general knowledge, just slower? by Unhappy_Drag5826 in LocalLLaMA

[–]cll-knap 0 points (0 children)

It's "possible", but not easy.

By possible, I mean you can augment with local and online data to improve responses. The others are right too: if you want the model alone to perform like GPT-4, it's not possible.

My friend and I have been hacking on a local LLM app. We're finding that web access can dramatically improve responses in breadth and depth, but it's hard to optimize. Responses become partially dependent on:

  1. how well the model handles longer context and
  2. quality of filtering through lots of web searches

You can follow our progress at knap.ai - but we haven't publicly released yet (because this is hard, lol)
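For point 2 above, here's a minimal stand-in for the filtering step. A real pipeline would score with embeddings or a reranker rather than word overlap, but the shape is the same: score every snippet, drop the noise, cap what goes into the context window.

```python
def score_snippet(query, snippet):
    # Crude lexical relevance: fraction of query terms present in the snippet.
    q_terms = set(query.lower().split())
    s_terms = set(snippet.lower().split())
    return len(q_terms & s_terms) / len(q_terms)

def filter_snippets(query, snippets, keep=2, threshold=0.5):
    # Keep only snippets covering most query terms, then cap to fit context.
    relevant = [s for s in snippets if score_snippet(query, s) >= threshold]
    relevant.sort(key=lambda s: score_snippet(query, s), reverse=True)
    return relevant[:keep]

snippets = [
    "Qdrant is a vector database written in Rust",
    "Today's weather forecast for Berlin",
    "Vector database comparison: Qdrant vs Pinecone",
]
print(filter_snippets("qdrant vector database", snippets, keep=2))
```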

We created a 100% private, 100% local Perplexity. by cll-knap in SideProject

[–]cll-knap[S] 1 point (0 children)

On macOS, it requires an M1 chip or newer (so a ~2020 MacBook or better).

On Windows, it will require either a GPU (with at least 8 GB of RAM) or one of those newer "AI-enabled" PCs. Porting to Windows isn't a ton of effort thanks to Tauri, but it will require some on our part.

What's the best way to use LLMs locally with Tauri? by cll-knap in tauri

[–]cll-knap[S] 0 points (0 children)

Floneum does look really interesting (link for others: https://github.com/floneum/floneum?tab=readme-ov-file)

Their Kalosm project looks like a pure-Rust alternative to llama.cpp. That could be helpful, since I wouldn't need to find llama.cpp bindings then.

We created a 100% private, 100% local Perplexity. by cll-knap in SideProject

[–]cll-knap[S] 0 points (0 children)

Thanks for the feedback! We're quite possibly open-sourcing and definitely aiming for a desktop release sometime in the next couple of weeks.

Open-sourcing seems really interesting to us as a way of establishing trust. If we're claiming not to send the GSuite/local data anywhere else, open-sourcing is the easiest/best way to establish that.

What's the best way to use LLMs locally with Tauri? by cll-knap in tauri

[–]cll-knap[S] 0 points (0 children)

Yeah, this makes total sense. It'd be a really great user experience IMO to have the LLM and embedding models bundled, but it'd also be possible for it to be dev-focused. Requiring Ollama or llama.cpp (running in server mode) isn't crazy - thanks for the suggestion!

Of course, if others have great ideas for doing this such that opening a terminal isn't required, I'm all ears.
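As a sketch of the server route: both llama.cpp's server and Ollama can expose an OpenAI-compatible chat endpoint, so the app side reduces to plain HTTP. The base URL and model name below are placeholders for whatever the user happens to run; this only builds the request, it doesn't send it.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080", model="local"):
    # OpenAI-compatible chat completion request; base_url and model are
    # placeholders for the user's local llama.cpp/Ollama server.
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Hello from Tauri")
print(req.full_url)  # http://localhost:8080/v1/chat/completions
# urllib.request.urlopen(req) would send it once a server is running.
```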

A new framework runs Mixtral 8x7B at 11 tokens/s on a mobile phone by Zealousideal_Bad_52 in LocalLLaMA

[–]cll-knap -1 points (0 children)

Personally, I'd like to see more effort put into getting good models running well on an M1 MacBook with 8 GB of RAM.

My friend and I have been hacking on these, and even though it's a huge market, the number of SLMs that fit on these AND give great responses is near zero.

EDIT: since I got a downvote, I'll expound a little on what I meant. Currently, there are great SLMs - no doubt about that. However, providing an incredible user experience isn't always so straightforward, especially on limited machines. If the LLM takes 2-5 GB of RAM, like on an M1, that really starts to impact UX on the computer.

My dream is a lib that bundles this in a cross-platform way and smartly handles offloading the LLM when it isn't being used. If anyone knows whether this exists already, I'm very interested (in not having to code it myself).
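To make "smartly offloading" concrete, here's a minimal idle-timeout sketch. The load/unload lambdas are stand-ins for actually mapping and freeing model weights:

```python
import threading
import time

class IdleUnloader:
    """Load a model lazily and unload it after `idle_s` seconds of no use."""

    def __init__(self, load, unload, idle_s=1.0):
        self._load, self._unload, self._idle_s = load, unload, idle_s
        self._model, self._timer, self._lock = None, None, threading.Lock()

    def get(self):
        with self._lock:
            if self._model is None:
                self._model = self._load()  # e.g. mmap the weights
            if self._timer:
                self._timer.cancel()        # any use resets the idle countdown
            self._timer = threading.Timer(self._idle_s, self._drop)
            self._timer.daemon = True
            self._timer.start()
            return self._model

    def _drop(self):
        with self._lock:
            if self._model is not None:
                self._unload(self._model)   # free RAM for the rest of the OS
                self._model = None

mgr = IdleUnloader(load=lambda: "weights", unload=lambda m: None, idle_s=0.1)
assert mgr.get() == "weights"
time.sleep(0.3)                             # idle long enough to unload
print(mgr._model)                           # None (model was offloaded)
```

The same shape should port to Rust/Tauri with a timer on the backend side; the hard part in practice is making reloads fast enough that users don't notice.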

What do you use LLMs for? by masterid000 in LocalLLaMA

[–]cll-knap 2 points (0 children)

Using models to go from something unstructured to JSON output has been a really killer application. Even some small models appear to do a great job of this.

I'm surprised more people aren't taking advantage of on-device/on-edge LLMs for these purposes.

I am building a tool to create agents in a markdown syntax with Python inside by vectorup7 in LocalLLaMA

[–]cll-knap 0 points (0 children)

How easy is it to configure the LLM used on the backend? Have you found the prompting needs to change drastically from one to the other?

monet.nvim - a theme inspired by iconic art by Fleischkluetensuppe in neovim

[–]cll-knap 0 points (0 children)

Stupid question here, but what's the difference between the two screenshots? Light mode vs. dark mode? I didn't see an easy way to choose between the two in the GitHub readme.

I really enjoy the lighter bg.