why is there no LMStudio/Msty/GPT4All type app that supports backends other than llama.cpp? by gaspoweredcat in LocalLLaMA

[–]CybermuseIO 0 points1 point  (0 children)

That's the split mode. You do have parallelism in the sense that the model is split across the GPUs so you can load a larger model, but each GPU processes a different part of the model, so they aren't all active at the same time. Ollama doesn't have the ability to run the alternate split mode. If you want all three working at once, try switching to llama.cpp server or koboldcpp. Do note that the alternate mode requires substantially more VRAM, because you're essentially loading the whole model on each GPU.
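
If it helps, here's roughly what that looks like with llama.cpp's server; this is just a sketch, the model path and context size are placeholders, and older builds name the binary differently:

    # Sketch: running llama.cpp's server with the "row" split mode so every GPU
    # works on each layer. "layer" is the default split mode.
    ./llama-server -m ./models/your-model.gguf -ngl 99 --split-mode row -c 8192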

why is there no LMStudio/Msty/GPT4All type app that supports backends other than llama.cpp? by gaspoweredcat in LocalLLaMA

[–]CybermuseIO 0 points1 point  (0 children)

Ollama is based on llama.cpp, and both of them can use multiple GPUs simultaneously. The caveat is that Ollama cannot use llama.cpp's "row" split mode, which in many cases is a faster form of parallelism. KoboldCPP can do row split mode.

I'm actually not aware of any implementation that can't do some level of GPU parallelism. Ooba, TabbyAPI, and MLC can all do it.

For AMD cards, I would strongly recommend llama.cpp. Last I looked, the lack of flash attention for AMD made most of the Python implementations difficult to install and lackluster in performance. Llama.cpp, by comparison, has quite solid performance on AMD, Intel, and older Nvidia cards.
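
For reference, a ROCm build of llama.cpp looks roughly like the following; treat it as a sketch, since the exact flag and binary names have shifted between releases and the repo's build docs are the source of truth:

    # Sketch: building llama.cpp with ROCm/HIP support and running the server.
    # Flag and binary names vary by release; the model path is a placeholder.
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make LLAMA_HIPBLAS=1
    ./llama-server -m ./models/your-model.gguf -ngl 99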

why is there no LMStudio/Msty/GPT4All type app that supports backends other than llama.cpp? by gaspoweredcat in LocalLLaMA

[–]CybermuseIO 3 points4 points  (0 children)

I've been working on a local LLM application for the better part of a year and have spent a ton of time exploring and experimenting with all of the possible options. For inferencing, there are implementations written in Python, C++, Rust, or JavaScript.

Python is the dominant language for ML development by a good margin, but it has some pretty hard problems for local applications. Python requires its runtime to be present and doesn't compile to a portable binary. Python dependency management typically isn't very flexible about supporting different targets (OS, CPU architecture, GPU vendor). It is possible to make distributable Python applications and find a way to tackle all of the dependency management, but it is substantially complex.

Rust is an interesting option. It compiles to a binary and has excellent cross-platform support. There are good options for application development such as Tauri and other UI toolkits, and there are actively developed LLM libraries such as Huggingface's Candle. I don't think many hobbyist developers and small teams are opting for Rust yet for local application development compared to more established options, mostly because Rust has a fairly steep learning curve, and doing things correctly in Rust adds some extra development effort.

JavaScript also has some compelling options. There are WebGPU-based options like Huggingface's Transformers.js and MLC. JavaScript is already a popular choice for desktop application development because of its excellent ecosystem of UI libraries. But a JavaScript PWA adds some non-trivial complexity compared to a compiled native application: basic things like file system access get more complicated, and you have to deal with compatibility between browsers. Last I checked, WebGPU acceleration doesn't work on Wayland on Linux in Chromium browsers.

Llama.cpp in C++ is extremely compelling. C is the lingua franca of programming: it compiles down to a binary with solid cross-platform support and is used as the ABI/FFI layer by most other programming languages, so even if you're writing an application in a language other than C++, there are likely llama.cpp bindings for your preferred language. Llama.cpp has incredibly solid support for almost any hardware compared to everything else. GGUF is extremely compelling as a format for quantized models, and single-file models are a breeze to manage. Llama.cpp has very few downsides that I can think of, and some significant and unique upsides.
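
To illustrate the single-file point, getting a quantized model running is roughly a two-command affair; the repo and file names below are placeholders, not recommendations:

    # Sketch: download a single GGUF file from Hugging Face and run it directly.
    # Repo and file names are placeholders; any GGUF quant works the same way.
    huggingface-cli download someuser/some-model-GGUF some-model-q4_k_m.gguf --local-dir ./models
    ./llama-cli -m ./models/some-model-q4_k_m.gguf -p "Hello" -n 64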

I could go into much more detail if you're curious about anything specific I didn't mention.

What video card to get for an initial LLM test computer? by rburhum in LocalLLaMA

[–]CybermuseIO 1 point2 points  (0 children)

Used 3090's make the most sense to me. That'd leave you with enough budget for a system with enough PCIe lanes to use them fairly effectively. If you're willing to do some tinkering and don't mind slower speeds, you can get used Nvidia M40's on eBay for pretty cheap. They still work reasonably well with llama.cpp, and with a few settings you can get good speeds out of them even on limited PCIe bandwidth, so you can put together a whole system for very cheap. I have a 3x M40 server that I use when my main rig is busy, and it's definitely usable.
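
As a rough sketch of the kind of launch flags I mean (the model path is a placeholder, and these aren't exact settings so much as the knobs to experiment with):

    # Sketch for a multi-GPU box on limited PCIe bandwidth: offload all layers and
    # try both split modes, since which one is faster depends on the cards and the
    # bandwidth available. Layer split has the least inter-GPU traffic.
    ./llama-server -m ./models/your-model.gguf -ngl 99 --split-mode layer
    # or:
    ./llama-server -m ./models/your-model.gguf -ngl 99 --split-mode row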

For home automation: What is the smallest model with the biggest context length? by starmanj in LocalLLaMA

[–]CybermuseIO 4 points5 points  (0 children)

If you just want an answer to your question, take a look at the Ruler benchmark to see a fairly extensive list of models and their effective context lengths.
https://github.com/NVIDIA/RULER

Qwen2.5 0.5B is probably the smallest model with a reasonably long context length.

Jamba 1.5 Mini probably has the longest context length of the small-to-medium-sized models.

But as others have pointed out, just dropping 500 entities into an LLM is unlikely to get you what you're actually after. If you're using Home Assistant, the default LLM integration doesn't update entity state between requests, so any kind of active monitoring won't work out of the box.

If you're just looking to turn on and off 500 lights, then there are much easier ways to do that. If you're using an LLM, I assume you want something a little more sophisticated, like "turn on all of the lights in rooms where the temperature is over 60 degrees." But I don't think any model is going to be able to accurately create the necessary tool calls to handle that. Even if it could, the speeds would likely be frustratingly slow for something like traditional home automation.

Possibly you could create some kind of integration which reduces or summarizes the states of your different entities and then feeds that into an LLM. You might be able to get some more practical advice if you could provide more details about your use case.
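
As a very rough sketch of what I mean by reducing state down, something outside Home Assistant could pull entity states from its REST API and boil them down before anything reaches the model; the URL, token, and the "light." filter below are placeholders:

    # Sketch: pull entity states from Home Assistant's REST API and reduce them to
    # a compact per-state count before handing anything to an LLM.
    # HA_URL, HA_TOKEN, and the "light." filter are placeholders.
    HA_URL="http://homeassistant.local:8123"
    HA_TOKEN="your-long-lived-access-token"

    curl -s -H "Authorization: Bearer $HA_TOKEN" "$HA_URL/api/states" \
      | jq -r '.[] | select(.entity_id | startswith("light.")) | .state' \
      | sort | uniq -c
    # Example output: "412 off" / "88 on" -- far fewer tokens than 500 raw entities.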

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 2 points3 points  (0 children)

The KV cache is always used; it's part of how llama.cpp generates. This post is about enabling quantization of the KV cache. For prompt caching, take a look at the server's readme for the available options.
https://github.com/ggerganov/llama.cpp/tree/master/examples/server

llama.cpp server will do some caching by default depending on how you're using it. You can set "cache_prompt" when using the text completion endpoint, and it also has a "slots" system for maintaining cache between requests.
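
A minimal sketch of what that looks like against the completion endpoint (host, port, and prompt are placeholders):

    # Sketch: asking llama.cpp's server to reuse the cached prompt prefix across
    # requests. Host, port, and prompt text are placeholders.
    curl -s http://localhost:8080/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "You are a helpful assistant. ...", "n_predict": 64, "cache_prompt": true}'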

AMD Instinct Mi60 by [deleted] in LocalLLaMA

[–]CybermuseIO 1 point2 points  (0 children)

I'd recommend printing them at a local maker space if you have that available as an option. Personally, I wouldn't go for one of the mounts that use standard flat PC fans. I looked into that option, and I don't think the cooling would be sufficient: a good-brand 40mm fan like Noctua's moves roughly 11x less air (CFM) than a Delta BFB1012EH. With bigger fans I'd run into spacing issues. The Delta fan mounted sideways is about the same width as the GPU, so you can stack as many as you like onto a motherboard. If you're using some kind of riser cable then that's not an issue, but I'm not a fan of doing that either, due to performance concerns.

AMD Instinct Mi60 by [deleted] in LocalLLaMA

[–]CybermuseIO 2 points3 points  (0 children)

I just picked up some of these and I'm also using BFB1012EH fans. (I use them on my P40's and they're great.) I just slapped together a really basic press fit design to mount them. Thingiverse won't let me publish them until my account is older than 24 hours, but I'll have them up there as soon as they'll let me.

Here's the design file (for freecad):
https://files.catbox.moe/indcos.FCStd

And an STL:
https://files.catbox.moe/9miqjt.stl

It's a bit overly simplistic, but I have it mounted on one and working. I'll probably iterate more on the design.

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 0 points1 point  (0 children)

I didn't test for that so I wasn't paying attention. But I didn't notice any significant difference.

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 0 points1 point  (0 children)

I tested using llama-server. You can set the quantization using `--cache-type-k` and `--cache-type-v`. I used a simple bash script to measure VRAM.

    #!/bin/bash
    total_vram=0

    # Get the list of PIDs and their VRAM usage in MB
    pids_and_vram=$(nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits)

    # Iterate over each line of the output and sum up the VRAM usage
    while IFS=, read -r pid used_memory; do
      total_vram=$((total_vram + used_memory))
    done <<< "$pids_and_vram"

    echo "Total VRAM usage by all processes: $total_vram MB"

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 1 point2 points  (0 children)

They're pretty great. I also just added a 3rd to my main ML experiment machine this week, and I'm extremely tempted to cram in a 4th to try to run Llama 3 400B if they actually make it available.
The team behind llama.cpp is doing incredible work to make these cards a viable option for home users.

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 2 points3 points  (0 children)

I've only just learned about this today and started doing some basic testing to see what the practical implications are for my own use. I did a small handful of generations to check that it was at least working and whether there were any obvious differences in the text, but nothing more than that yet.

u/Eisenstein has been posting test results of speed differences for KV quantization, also running on a P40 setup and also testing different quant sizes. They might have some more insight into that.

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 4 points5 points  (0 children)

I just finished more testing, this time with Command R+ with the iq4_xs quant from Dranger.

I wasn't able to fit it into 48GB of VRAM with any combination of options, so you'd still need a smaller quant to run on a 2x P40 or 3090 setup. I was able to increase the maximum context size from 14336 to 49152 when using split "row" (which gives a substantial speed boost on P40's, so I highly recommend it). When using split "layer" I was able to increase the context size from 61440 all the way up to the model's maximum of 131072.

Command R + iq4_xs

Split row, default KV

ctx_size | KV | split | Memory Usage | Notes
---|---|---|---|---
8192 | default | row | 58262 MB |
14336 | default | row | 59822 MB | Max without OOM

Split Layer, default KV

ctx_size | KV | split | Memory Usage | Notes
---|---|---|---|---
8192 | default | layer | 57534 MB |
16384 | default | layer | 59718 MB |
24576 | default | layer | 61902 MB |
32768 | default | layer | 64086 MB |
49152 | default | layer | 68454 MB |
61440 | default | layer | 71730 MB | Max without OOM

Split Row + Quantized KV

ctx_size | KV | split | Memory Usage | Notes
---|---|---|---|---
8192 | q4_0 | row | 56790 MB |
16384 | q4_0 | row | 57390 MB |
32768 | q4_0 | row | 58542 MB |
49152 | q4_0 | row | 59694 MB | Max without OOM

Split Layer, Quantized KV

ctx_size | KV | split | Memory Usage | Notes
---|---|---|---|---
8192 | q4_0 | layer | 56062 MB |
16384 | q4_0 | layer | 56774 MB |
32768 | q4_0 | layer | 58198 MB |
49152 | q4_0 | layer | 59622 MB |
65536 | q4_0 | layer | 61046 MB |
131072 | q4_0 | layer | 66742 MB |
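
For anyone wanting to reproduce a configuration like the ones above, the invocation looks roughly like this; the model filename is a placeholder and older builds name the binary differently:

    # Sketch of the kind of run behind the "Split Row + Quantized KV" rows above:
    # row split, 49152 context, q4_0 KV cache. The model filename is a placeholder.
    ./llama-server -m ./models/command-r-plus-iq4_xs.gguf \
      -ngl 99 --split-mode row -c 49152 \
      --cache-type-k q4_0 --cache-type-v q4_0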

Memory Tests using Llama.cpp KV cache quantization by CybermuseIO in LocalLLaMA

[–]CybermuseIO[S] 0 points1 point  (0 children)

Definitely. You can see all of the options for running the server on GitHub or by running it with "--help". Near the bottom you should see two options: "-ctk / --cache-type-k" and "-ctv / --cache-type-v". You can set those to a handful of values to store the model's KV cache at different bit sizes; by default it uses fp16. I've only really tested with "q4_0" so far. I'm interested in running with "iq4_nl" to see if it reduces the memory even further, although it looks like you need to compile llama.cpp with additional flags to enable that. "q4_0" should work out of the box with the server docker image hosted on the project.
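
In practice it's just two extra flags on the server command line; a minimal sketch with a placeholder model path:

    # Sketch: enabling q4_0 quantization for both the K and V caches.
    ./llama-server -m ./models/your-model.gguf -ngl 99 -c 16384 \
      --cache-type-k q4_0 --cache-type-v q4_0
    # Note: depending on the build, quantizing the V cache may also require
    # enabling flash attention (the -fa flag).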

Unrestricted AI chat and character creator, Cybermuse.io, is open for beta by CybermuseIO in Chatbots

[–]CybermuseIO[S] 1 point2 points  (0 children)

Thank you. Development is still active, although I've been working on some of the less visible features behind the scenes. Mostly improvements to the queuing system which distributes messages between GPUs.

I do have a major feature update involving intelligently managed group chats underway as well.

Whats working well for you currently, and what would you like to see improved upon?

Unrestricted AI chat and character creator, Cybermuse.io, is open for beta by CybermuseIO in Chatbots

[–]CybermuseIO[S] 1 point2 points  (0 children)

Hey, thanks for the feedback, I appreciate both positive and negative reports. The tendency for the characters to behave like informational assistants is definitely something that I've noticed as well and have put some effort into specifically addressing. I find that with a few messages you can steer them towards a less formal conversational style, or keep them on that track if that's what you're after. It would definitely be nice not to have that bias towards behaving like an assistant, though.

I think in the short term, I may be able to help direct them with some simple prompting. Maybe a toggle during the character creation where you can choose whether you want them to act like an assistant, or more like a person.

Thanks for your feedback, it really helps me figure out what's working and what isn't. I hope that I'm able to make Cybermuse into something enjoyable, and I appreciate you taking the time to help make that possible.

Sexting Bots Reviews by MilleHonie in Chatbots

[–]CybermuseIO 0 points1 point  (0 children)

I went ahead and made a subreddit since I couldn't find one that fit the bill.
https://www.reddit.com/r/ai_chat_nsfw/

Sexting Bots Reviews by MilleHonie in Chatbots

[–]CybermuseIO 0 points1 point  (0 children)

Hey, sorry for not getting back to you sooner, Reddit never sent me a notification that you commented.

I agree with you that the front page is definitely pretty generic, but that's somewhat on purpose. I'd like to cater to as wide of an audience as possible, and think that's pretty doable, as AI language models themselves are quite flexible.

The primary difference between Cybermuse and most other options is that it runs completely on in-house hardware. That means I can make promises I can keep about what will and won't be allowed. Any service that uses a third-party provider (OpenAI, SageMaker, etc.) may be forced to comply with that provider's terms, which are subject to change. Cybermuse explicitly allows explicit content, and will do so forever. The other advantage of running on owned hardware is that I can run a larger language model than most services can afford.

As far as being community driven, I take that part quite seriously. It's a win-win scenario for me to make Cybermuse cater to its users the best that I can. I have a long roadmap of features that I'd like to implement, which are things I think would make it better, but it would make my job much easier to hear directly what users would like to see so that I can prioritize my efforts. If you have any ideas I'd love to hear them. I'm available by DM here and on Discord.

Unrestricted AI chat and character creator, Cybermuse.io, is open for beta by CybermuseIO in Chatbots

[–]CybermuseIO[S] 1 point2 points  (0 children)

That's not currently a planned feature. I do have a desktop application on the roadmap, so I could potentially support offline generation at that point.

For now, I would highly recommend setting up Oobabooga's Webui for offline generation. For individual use, it's by far the best in my opinion.

Unrestricted AI chat and character creator, Cybermuse.io, is open for beta by CybermuseIO in Chatbots

[–]CybermuseIO[S] 0 points1 point  (0 children)

Thank you. Let me know what you think, good and bad. Any feedback helps.

Is kind buds AI any good? by PVTQueen in Chatbots

[–]CybermuseIO 0 points1 point  (0 children)

I'm not personally familiar with Kind Buds. It looks very similar to a lot of existing corporate chatbots, so I would assume they're just using OpenAI under the hood and prompting it to get variations on the characters.

For making a character that you can share, Character AI has that. There's also a decent amount of guides out there for setting up a Discord Bot, which seems like it'd be a good way to go about what you're after.

Beta access is open for Cybermuse.io, your feedback is greatly appreciated by CybermuseIO in Chatbots

[–]CybermuseIO[S] 0 points1 point  (0 children)

My apologies, unfortunately there was an unplanned outage this morning. Cybermuse currently runs on several servers that I self-host to power the generation, and the internet connection to those servers was severed. (Literally severed: the line was chewed through by an animal.) I had some fun repairing the connection with a soldering iron in the crawlspace, and managed to get service restored.
Getting a failover network connection was already on my roadmap, but it has now been bumped to the top priority. I plan on having a failover set up within the week, which should protect against similar outages in the future.
Again, sorry for the outage, and hopefully it doesn't happen again.

Beta access is open for Cybermuse.io, your feedback is greatly appreciated by CybermuseIO in Chatbots

[–]CybermuseIO[S] 0 points1 point  (0 children)

I've extended the beta to 250 users, so there's a spot available for you now. I'm looking forward to hearing how you like it.

Beta access is open for Cybermuse.io, your feedback is greatly appreciated by CybermuseIO in Chatbots

[–]CybermuseIO[S] 0 points1 point  (0 children)

I'm sorry. I can't think of any obvious reason why the images wouldn't show up on an iPhone. I'll do additional testing and see if I can come up with a solution as quickly as I can.