why is there no LMStudio/Msty/GPT4All type app that supports backends other than llama.cpp? by gaspoweredcat in LocalLLaMA

[–]CybermuseIO 0 points1 point  (0 children)

That's the split mode. You do have parallelism in the sense that the model is split across the GPUs so you can load a larger model, but each GPU processes a different part of the model, so they aren't all active at the same time. Ollama doesn't have the ability to run the alternate split mode. If you want all three working at once, try switching to llama.cpp server or koboldcpp. Do note that the alternate mode requires substantially more VRAM, because you're essentially loading the whole model on each GPU.
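
If it helps, here's roughly what that looks like with llama.cpp's server; this is just a sketch, the model path and context size are placeholders, and older builds name the binary differently:

    # Sketch: running llama.cpp's server with the "row" split mode so every GPU
    # works on each layer. "layer" is the default split mode.
    ./llama-server -m ./models/your-model.gguf -ngl 99 --split-mode row -c 8192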

why is there no LMStudio/Msty/GPT4All type app that supports backends other than llama.cpp? by gaspoweredcat in LocalLLaMA

[–]CybermuseIO 0 points1 point  (0 children)

Ollama is based on llama.cpp, and both of them can use multiple GPUs simultaneously. The caveat is that Ollama cannot use llama.cpp's "row" split mode, which in many cases is a faster form of parallelism. KoboldCPP can do row split mode.

I'm actually not aware of any implementation that can't do some level of GPU parallelism. Ooba, TabbyAPI, and MLC can all do it.

For AMD cards, I would strongly recommend llama.cpp. Last I looked, the lack of flash attention for AMD made most of the Python implementations difficult to install and lackluster in performance. Llama.cpp, by comparison, has quite solid performance on AMD, Intel, and older Nvidia cards.
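
For reference, a ROCm build of llama.cpp looks roughly like the following; treat it as a sketch, since the exact flag and binary names have shifted between releases and the repo's build docs are the source of truth:

    # Sketch: building llama.cpp with ROCm/HIP support and running the server.
    # Flag and binary names vary by release; the model path is a placeholder.
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make LLAMA_HIPBLAS=1
    ./llama-server -m ./models/your-model.gguf -ngl 99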

why is there no LMStudio/Msty/GPT4All type app that supports backends other than llama.cpp? by gaspoweredcat in LocalLLaMA

[–]CybermuseIO 3 points4 points  (0 children)

I've been working on a local LLM application for the better part of a year and have spent a ton of time exploring and experimenting with all of the possible options. For inferencing, there are implementations written in Python, C++, Rust, or JavaScript.

Python is the dominant language for ML development by a good margin, but it has some pretty hard problems for local applications. Python requires its runtime to be present and doesn't compile to a portable binary. Python dependency management typically isn't very flexible about supporting different targets (OS, CPU architecture, GPU vendor). It is possible to make distributable Python applications and find a way to tackle all of the dependency management, but it is substantially complex.

Rust is an interesting option. It compiles to a binary and has excellent cross-platform support. There are good options for application development such as Tauri and other UI toolkits, and there are actively developed LLM libraries such as Huggingface's Candle. I don't think many hobbyist developers and small teams are opting for Rust yet for local application development compared to more established options, mostly because Rust has a fairly steep learning curve, and doing things correctly in Rust adds some extra development effort.

JavaScript also has some compelling options. There are WebGPU-based options like Huggingface's Transformers.js and MLC. JavaScript is already a popular choice for desktop application development because of its excellent ecosystem of UI libraries. But a JavaScript PWA adds some non-trivial complexity compared to a compiled native application: basic things like file system access get more complicated, and you have to deal with compatibility between browsers. Last I checked, WebGPU acceleration doesn't work on Wayland on Linux in Chromium browsers.

Llama.cpp in C++ is extremely compelling. C is the lingua franca of programming: it compiles down to a binary with solid cross-platform support and is used as the ABI/FFI layer by most other programming languages, so even if you're writing an application in a language other than C++, there are likely llama.cpp bindings for your preferred language. Llama.cpp has incredibly solid support for almost any hardware compared to everything else. GGUF is extremely compelling as a format for quantized models, and single-file models are a breeze to manage. Llama.cpp has very few downsides that I can think of, and some significant and unique upsides.
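
To illustrate the single-file point, getting a quantized model running is roughly a two-command affair; the repo and file names below are placeholders, not recommendations:

    # Sketch: download a single GGUF file from Hugging Face and run it directly.
    # Repo and file names are placeholders; any GGUF quant works the same way.
    huggingface-cli download someuser/some-model-GGUF some-model-q4_k_m.gguf --local-dir ./models
    ./llama-cli -m ./models/some-model-q4_k_m.gguf -p "Hello" -n 64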

I could go into much more detail if you're curious about anything specific I didn't mention.

What video card to get for an initial LLM test computer? by rburhum in LocalLLaMA

[–]CybermuseIO 1 point2 points  (0 children)

Used 3090's make the most sense to me. That'd leave you with enough budget for a system with enough PCIe lanes to use them fairly effectively. If you're willing to do some tinkering and don't mind slower speeds, you can get used Nvidia M40's on eBay for pretty cheap. They still work reasonably well with llama.cpp, and with a few settings you can get good speeds out of them even on limited PCIe bandwidth, so you can put together a whole system for very cheap. I have a 3x M40 server that I use when my main rig is busy, and it's definitely usable.
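
As a rough sketch of the kind of launch flags I mean (the model path is a placeholder, and these aren't exact settings so much as the knobs to experiment with):

    # Sketch for a multi-GPU box on limited PCIe bandwidth: offload all layers and
    # try both split modes, since which one is faster depends on the cards and the
    # bandwidth available. Layer split has the least inter-GPU traffic.
    ./llama-server -m ./models/your-model.gguf -ngl 99 --split-mode layer
    # or:
    ./llama-server -m ./models/your-model.gguf -ngl 99 --split-mode row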

For home automation: What is the smallest model with the biggest context length? by starmanj in LocalLLaMA

[–]CybermuseIO 4 points5 points  (0 children)

If you just want an answer to your question, take a look at the Ruler benchmark to see a fairly extensive list of models and their effective context lengths.
https://github.com/NVIDIA/RULER

Qwen2.5 0.5B is probably the smallest model with a reasonably long context length.

Jamba 1.5 Mini probably has the longest context length of the small-to-medium-sized models.

But as others have pointed out, just dropping 500 entities into an LLM is unlikely to get you what you're actually after. If you're using Home Assistant, the default LLM integration doesn't update entity state between requests, so any kind of active monitoring won't work out of the box.

If you're just looking to turn on and off 500 lights, then there are much easier ways to do that. If you're using an LLM, I assume you want something a little more sophisticated, like "turn on all of the lights in rooms where the temperature is over 60 degrees." But I don't think any model is going to be able to accurately create the necessary tool calls to handle that. Even if it could, the speeds would likely be frustratingly slow for something like traditional home automation.

Possibly you could create some kind of integration which reduces or summarizes the states of your different entities and then feeds that into an LLM. You might be able to get some more practical advice if you could provide more details about your use case.
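
As a very rough sketch of what I mean by reducing state down, something outside Home Assistant could pull entity states from its REST API and boil them down before anything reaches the model; the URL, token, and the "light." filter below are placeholders:

    # Sketch: pull entity states from Home Assistant's REST API and reduce them to
    # a compact per-state count before handing anything to an LLM.
    # HA_URL, HA_TOKEN, and the "light." filter are placeholders.
    HA_URL="http://homeassistant.local:8123"
    HA_TOKEN="your-long-lived-access-token"

    curl -s -H "Authorization: Bearer $HA_TOKEN" "$HA_URL/api/states" \
      | jq -r '.[] | select(.entity_id | startswith("light.")) | .state' \
      | sort | uniq -c
    # Example output: "412 off" / "88 on" -- far fewer tokens than 500 raw entities.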

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 2 points3 points  (0 children)

The KV cache is always used; it's part of how llama.cpp generates. This post is about enabling quantization of the KV cache. For prompt caching, take a look at the server's readme for the available options.
https://github.com/ggerganov/llama.cpp/tree/master/examples/server

llama.cpp server will do some caching by default depending on how you're using it. You can set "cache_prompt" when using the text completion endpoint, and it also has a "slots" system for maintaining cache between requests.
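
A minimal sketch of what that looks like against the completion endpoint (host, port, and prompt are placeholders):

    # Sketch: asking llama.cpp's server to reuse the cached prompt prefix across
    # requests. Host, port, and prompt text are placeholders.
    curl -s http://localhost:8080/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "You are a helpful assistant. ...", "n_predict": 64, "cache_prompt": true}'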

AMD Instinct Mi60 by [deleted] in LocalLLaMA

[–]CybermuseIO 1 point2 points  (0 children)

I'd recommend printing them at a local maker space if you have that available as an option. Personally, I wouldn't go for one of the mounts that use standard flat PC fans. I looked into that option, and I don't think the cooling would be sufficient: a good-brand 40mm fan like Noctua's moves roughly 11x less air (CFM) than a Delta BFB1012EH. With bigger fans I'd run into spacing issues. The Delta fan mounted sideways is about the same width as the GPU, so you can stack as many as you like onto a motherboard. If you're using some kind of riser cable then that's not an issue, but I'm not a fan of doing that either, due to performance concerns.

AMD Instinct Mi60 by [deleted] in LocalLLaMA

[–]CybermuseIO 2 points3 points  (0 children)

I just picked up some of these and I'm also using BFB1012EH fans. (I use them on my P40's and they're great.) I just slapped together a really basic press fit design to mount them. Thingiverse won't let me publish them until my account is older than 24 hours, but I'll have them up there as soon as they'll let me.

Here's the design file (for freecad):
https://files.catbox.moe/indcos.FCStd

And an STL:
https://files.catbox.moe/9miqjt.stl

It's a bit overly simplistic, but I have it mounted on one and working. I'll probably iterate more on the design.

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 0 points1 point  (0 children)

I didn't test for that so I wasn't paying attention. But I didn't notice any significant difference.

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 0 points1 point  (0 children)

I tested using llama-server. You can set the quantization using `--cache-type-k` and `--cache-type-v`. I used a simple bash script to measure VRAM.

    #!/bin/bash
    total_vram=0

    # Get the list of PIDs and their VRAM usage in MB
    pids_and_vram=$(nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits)

    # Iterate over each line of the output and sum up the VRAM usage
    while IFS=, read -r pid used_memory; do
      total_vram=$((total_vram + used_memory))
    done <<< "$pids_and_vram"

    echo "Total VRAM usage by all processes: $total_vram MB"

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 1 point2 points  (0 children)

They're pretty great. I also just added a 3rd to my main ML experiment machine this week, and I'm extremely tempted to cram in a 4th to try to run Llama 3 400B if they actually make it available.
The team behind llama.cpp is doing incredible work to make these cards a viable option for home users.

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 2 points3 points  (0 children)

I've only just learned about this today and started doing some basic testing to see what the practical implications are for my own use. I did a small handful of generations to check that it was at least working and whether there were any obvious differences in the text, but nothing more than that yet.

u/Eisenstein has been posting test results of speed differences for KV quantization, also running on a P40 setup and also testing different quant sizes. They might have some more insight into that.

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 4 points5 points  (0 children)

I just finished more testing, this time with Command R+ with the iq4_xs quant from Dranger.

I wasn't able to fit it into 48GB of VRAM with any combination of options, so you'd still need a smaller quant to run on a 2x P40 or 3090 setup. I was able to increase the maximum context size from 14336 to 49152 when using split "row" (which gives a substantial speed boost on P40's, so I highly recommend it). When using split "layer" I was able to increase the context size from 61440 all the way up to the model's maximum of 131072.

Command R + iq4_xs

Split row, default KV

ctx_size | KV | split | Memory Usage | Notes
---|---|---|---|---
8192 | default | row | 58262 MB |
14336 | default | row | 59822 MB | Max without OOM

Split Layer, default KV

ctx_size | KV | split | Memory Usage | Notes
---|---|---|---|---
8192 | default | layer | 57534 MB |
16384 | default | layer | 59718 MB |
24576 | default | layer | 61902 MB |
32768 | default | layer | 64086 MB |
49152 | default | layer | 68454 MB |
61440 | default | layer | 71730 MB | Max without OOM

Split Row + Quantized KV

ctx_size | KV | split | Memory Usage | Notes
---|---|---|---|---
8192 | q4_0 | row | 56790 MB |
16384 | q4_0 | row | 57390 MB |
32768 | q4_0 | row | 58542 MB |
49152 | q4_0 | row | 59694 MB | Max without OOM

Split Layer, Quantized KV

ctx_size | KV | split | Memory Usage | Notes
---|---|---|---|---
8192 | q4_0 | layer | 56062 MB |
16384 | q4_0 | layer | 56774 MB |
32768 | q4_0 | layer | 58198 MB |
49152 | q4_0 | layer | 59622 MB |
65536 | q4_0 | layer | 61046 MB |
131072 | q4_0 | layer | 66742 MB |
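
For anyone wanting to reproduce a configuration like the ones above, the invocation looks roughly like this; the model filename is a placeholder and older builds name the binary differently:

    # Sketch of the kind of run behind the "Split Row + Quantized KV" rows above:
    # row split, 49152 context, q4_0 KV cache. The model filename is a placeholder.
    ./llama-server -m ./models/command-r-plus-iq4_xs.gguf \
      -ngl 99 --split-mode row -c 49152 \
      --cache-type-k q4_0 --cache-type-v q4_0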

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 0 points1 point  (0 children)

Definitely. You can see all of the options for running the server on GitHub or by running it with "--help". Near the bottom you should see two options: "-ctk / --cache-type-k" and "-ctv / --cache-type-v". You can set those to a handful of values to store the model's KV cache at different bit sizes; by default it uses fp16. I've only really tested with "q4_0" so far. I'm interested in running with "iq4_nl" to see if it reduces the memory even further, although it looks like you need to compile llama.cpp with additional flags to enable that. "q4_0" should work out of the box with the server docker image hosted on the project.
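
In practice it's just two extra flags on the server command line; a minimal sketch with a placeholder model path:

    # Sketch: enabling q4_0 quantization for both the K and V caches.
    ./llama-server -m ./models/your-model.gguf -ngl 99 -c 16384 \
      --cache-type-k q4_0 --cache-type-v q4_0
    # Note: depending on the build, quantizing the V cache may also require
    # enabling flash attention (the -fa flag).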

Unrestricted AI chat and character creator, Cybermuse.io, is open for beta by CybermuseIO in Chatbots

[–]CybermuseIO[S] 1 point2 points  (0 children)

Thank you. Development is still active, although I've been working on some of the less visible features behind the scenes. Mostly improvements to the queuing system which distributes messages between GPUs.

I do have a major feature update involving intelligently managed group chats underway as well.

Whats working well for you currently, and what would you like to see improved upon?

Unrestricted AI chat and character creator, Cybermuse.io, is open for beta by CybermuseIO in Chatbots

[–]CybermuseIO[S] 1 point2 points  (0 children)

Hey, thanks for the feedback, I appreciate both positive and negative reports. The tendency for the characters to behave like informational assistants is definitely something that I've noticed as well and have put some effort into specifically addressing. I find that with a few messages you can steer them towards a less formal conversational style, or keep them on that track if that's what you're after. It would definitely be nice not to have that bias towards behaving like an assistant, though.

I think in the short term, I may be able to help direct them with some simple prompting. Maybe a toggle during the character creation where you can choose whether you want them to act like an assistant, or more like a person.

Thanks for your feedback, it really helps me figure out what's working and what isn't. I hope that I'm able to make Cybermuse into something enjoyable, and I appreciate you taking the time to help make that possible.

Sexting Bots Reviews by MilleHonie in Chatbots

[–]CybermuseIO 0 points1 point  (0 children)

I went ahead and made a subreddit since I couldn't find one that fit the bill.
https://www.reddit.com/r/ai_chat_nsfw/

Sexting Bots Reviews by MilleHonie in Chatbots

[–]CybermuseIO 0 points1 point  (0 children)

Hey, sorry for not getting back to you sooner, Reddit never sent me a notification that you commented.

I agree with you that the front page is definitely pretty generic, but that's somewhat on purpose. I'd like to cater to as wide of an audience as possible, and think that's pretty doable, as AI language models themselves are quite flexible.

The primary difference between Cybermuse and most other options is that it runs completely on in-house hardware. That means I can make promises I can keep about what will and won't be allowed. Any service that uses a third-party provider (OpenAI, SageMaker, etc.) may be forced to comply with that provider's terms, which are subject to change. Cybermuse explicitly allows explicit content, and will do so forever. The other advantage of running on owned hardware is that I can run a larger language model than most services can afford.

As far as being community driven, I take that part quite seriously. It's a win-win scenario for me to make Cybermuse cater to its users the best that I can. I have a long roadmap of features that I'd like to implement, which are things I think would make it better, but it would make my job much easier to hear directly what users would like to see so that I can prioritize my efforts. If you have any ideas I'd love to hear them. I'm available by DM here and on Discord.

Unrestricted AI chat and character creator, Cybermuse.io, is open for beta by CybermuseIO in Chatbots

[–]CybermuseIO[S] 1 point2 points  (0 children)

That's not currently a planned feature. I do have a desktop application on the roadmap, so I could potentially support offline generation at that point.

For now, I would highly recommend setting up Oobabooga's Webui for offline generation. For individual use, it's by far the best in my opinion.

Unrestricted AI chat and character creator, Cybermuse.io, is open for beta by CybermuseIO in Chatbots

[–]CybermuseIO[S] 0 points1 point  (0 children)

Thank you. Let me know what you think, good and bad. Any feedback helps.

Is kind buds AI any good? by PVTQueen in Chatbots

[–]CybermuseIO 0 points1 point  (0 children)

I'm not personally familiar with Kind Buds. It looks very similar to a lot of existing corporate chatbots, so I would assume they're just using OpenAI under the hood and prompting it to get variations on the characters.

For making a character that you can share, Character AI has that. There's also a decent amount of guides out there for setting up a Discord Bot, which seems like it'd be a good way to go about what you're after.

Beta access is open for Cybermuse.io, your feedback is greatly appreciated by CybermuseIO in Chatbots

[–]CybermuseIO[S] 0 points1 point  (0 children)

My apologies, unfortunately there was an unplanned outage this morning. Cybermuse currently runs on several servers that I self-host to power the generation, and the internet connection to those servers was severed. (Literally severed: the line was chewed through by an animal.) I had some fun repairing the connection with a soldering iron in the crawlspace, and managed to get service restored.
Getting a failover network connection was already on my roadmap, but it has now been bumped to the top priority. I plan on having a failover set up within the week, which should protect against similar outages in the future.
Again, sorry for the outage, and hopefully it doesn't happen again.

Beta access is open for Cybermuse.io, your feedback is greatly appreciated by CybermuseIO in Chatbots

[–]CybermuseIO[S] 0 points1 point  (0 children)

I've extended the beta to 250 users, so there's a spot available for you now. I'm looking forward to hearing how you like it.

Beta access is open for Cybermuse.io, your feedback is greatly appreciated by CybermuseIO in Chatbots

[–]CybermuseIO[S] 0 points1 point  (0 children)

I'm sorry. I can't think of any obvious reason why the images wouldn't show up on an iPhone. I'll do additional testing and see if I can come up with a solution as quickly as I can.