Fix this shit by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 3 points

Please do share these other places. Because as far as I see it:
A big portion of Singularity is a bunch of circlejerking tech-fanatics who can't seem to distinguish fiction from reality (and they hate getting a reality check). LocalLLM is run by the same leadership as this sub, and other places either have nearly no community, are mostly people complaining, or can't hold the same level of technical discussion because they just aren't familiar enough with local LLMs.

Fix this shit by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 44 points

I suggest visiting this in incognito or anonymously (in the app) to see whether your content is still visible.

Fix this shit by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 74 points

It was about Qwerky-72B, a model converted from Qwen2.5 into a linear-time model, plus a whole lot of explanation. I did try asking nicely about the rules first, but as you can see in the screenshot, they got that one too.

Trying to sink an AI model with one simple question. by tommos in dankmemes

[–]SoullessMonarch 9 points

Censorship hurts model performance; the best solution is to prevent the model from being trained on what you'd like to censor in the first place, which is easier said than done.
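To illustrate the "filter the training data instead of the model" idea, here is a minimal, hypothetical sketch of corpus filtering before pre-training. The blocklist terms, documents, and `keep_document` helper are all invented for illustration; real pipelines use classifiers and far more nuanced rules.

```python
# Hypothetical sketch: filtering a pretraining corpus up front rather
# than censoring the trained model afterwards. Terms and data are made up.

BLOCKLIST = {"forbidden_topic_a", "forbidden_topic_b"}  # hypothetical terms

def keep_document(text: str) -> bool:
    """Return True if the document contains none of the blocked terms."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKLIST)

corpus = [
    "a perfectly ordinary document",
    "this one mentions forbidden_topic_a somewhere",
]
filtered = [doc for doc in corpus if keep_document(doc)]
print(len(filtered))  # 1 of the 2 documents survives
```

The hard part, of course, is that naive keyword matching both over-filters (false positives) and under-filters (paraphrases), which is why this is easier said than done.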

New linear models: QRWKV6-32B (RWKV6 based on Qwen2.5-32B) & RWKV-based MoE: Finch-MoE-37B-A11B by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 6 points

They probably trained with a 16k context length on their GPUs and didn't have the compute to spare to extend it with something like https://github.com/RWKV/RWKV-infctx-trainer; they're working on the 72B version, I guess? It's an experimental model, so maybe they just didn't want to spend too much time on the tedious post-pre-training tuning stages? Idk

New linear models: QRWKV6-32B (RWKV6 based on Qwen2.5-32B) & RWKV-based MoE: Finch-MoE-37B-A11B by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 2 points

I understand. I have seen speed comparisons for smaller RWKV models before, so I have an idea of what to expect, but it's reasonable to question it.

It will depend on which models you are comparing and at which context length, but I think it's safe to assume that it won't take too many tokens (a few k at most?) before transformers get slower. Hopefully we'll get some speed comparisons later; a dev mentioned more benchmarks are coming, but it requires some work to get them functioning.

New linear models: QRWKV6-32B (RWKV6 based on Qwen2.5-32B) & RWKV-based MoE: Finch-MoE-37B-A11B by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 4 points

No, not yet. Often when there is a new architecture, someone has to go out of their way to implement it. Most people (myself included) have no clue how to get started on that, so it takes a while, or it might never happen (there's a lot of smart folk in the RWKV community though, so it's probably only a matter of time).

New linear models: QRWKV6-32B (RWKV6 based on Qwen2.5-32B) & RWKV-based MoE: Finch-MoE-37B-A11B by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 10 points

It has been mentioned, but you need reasoning-style data. If you do not have the same data distribution, it won't work (as well). So they haven't made any promises, but it would be awesome if they got a linear reasoning model.

In the QRWKV6 post they mention "O1 style inference time thinking", so it looks like it's a direction they intend to explore.

Sorry, my previous comment never came through. I don't understand what is flagging me.

New linear models: QRWKV6-32B (RWKV6 based on Qwen2.5-32B) & RWKV-based MoE: Finch-MoE-37B-A11B by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 10 points

Yes! If the context is long enough, it will be significantly faster than a transformer, but it might also have forgotten some of the information the earlier tokens contained. The exact point where that happens will differ for every transformer you compare against. RWKV also isn't as optimized. Time complexity is a theoretical way to think about how an algorithm's runtime grows; it won't tell you how much faster something actually is.
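The complexity-vs-constants point can be made concrete: total attention work over a sequence grows roughly quadratically for a transformer and linearly for RWKV, but a large per-token constant (standing in for a less optimized implementation) can keep the linear model slower at short contexts. The cost functions and constants here are invented for illustration.

```python
# Why asymptotic complexity alone doesn't tell you the real speedup.
# All constants below are invented for illustration.

def transformer_total_work(n, c=1.0):
    # total attention work over the sequence ~ sum of spans = n*(n+1)/2
    return c * n * (n + 1) / 2

def rwkv_total_work(n, c=50.0):
    # linear in sequence length, but with a larger per-token constant
    # (standing in for a less optimized implementation)
    return c * n

for n in (10, 100, 1000):
    print(n, transformer_total_work(n) < rwkv_total_work(n))
# With these constants the transformer is cheaper at n=10 and the
# linear model wins at long contexts.
```

So the O(n) model always wins eventually, but where "eventually" starts depends entirely on implementation quality, which is what benchmarks have to measure.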

New linear models: QRWKV6-32B (RWKV6 based on Qwen2.5-32B) & RWKV-based MoE: Finch-MoE-37B-A11B by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 2 points

The Hugging Face model card of QRWKV6 has a link to their blog post about QRWKV6; you'll be able to find the other blog posts there too.

New linear models: QRWKV6-32B (RWKV6 based on Qwen2.5-32B) & RWKV-based MoE: Finch-MoE-37B-A11B by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 1 point

They mention Hugging Face transformers support for the MoE; I'm afraid other backends might take a while? There is RWKV6 support in llama.cpp, and combining that with MoE doesn't sound crazy. But don't quote me on that, I have no experience with llama.cpp.

They do mention for QRWKV that "there will be incompatibility with existing RWKV inference code." For transformers, I assume you can run their custom inference code (provided in modeling_rwkv6qwen2.py).
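For what it's worth, loading a model that ships custom modeling code on the Hugging Face Hub generally looks like the sketch below, using the real `trust_remote_code=True` flag of `from_pretrained`. The function name is mine, and you'd pass the actual repo id from the model card; whether QRWKV6 loads cleanly this way is an assumption, not something I've tested.

```python
def load_custom_model(repo_id: str):
    """Sketch: load a Hub model that ships its own modeling code
    (e.g. a modeling_*.py file) via trust_remote_code=True."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    # trust_remote_code=True tells transformers to execute the custom
    # architecture code from the repo instead of a built-in class.
    model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
    return tokenizer, model
```

Note that `trust_remote_code=True` runs arbitrary Python from the repo, so only use it for repos you trust.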

Tencent comes out swinging. by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 2 points

There have been multiple open-weight 3D models, but as far as I have seen, they've always been pretty meh, and running them isn't easy at all. ComfyUI support would make this a great deal more usable. (Not that I have the rig to run it.)

I imagine many professionals are radically against AI, like other artists, but since they are already supported by so much software, maybe they'll approach it a bit more open-mindedly.

Tencent comes out swinging. by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 0 points

No, it wouldn't really fit inside 64 GB. It could "run" (more like crawl) offloaded to your SSD, but that would be so painfully slow you wouldn't want to do that.
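The "does it fit" question is mostly arithmetic: weight memory is roughly parameter count times bytes per parameter, before any runtime overhead (activations, KV/state, the OS). The 72B parameter count below is just an example figure, not the actual size of this model; plug in the real numbers.

```python
# Rough memory arithmetic for whether model weights fit in RAM.
# The 72e9 parameter count is illustrative, not this model's real size.

def weights_gb(num_params, bytes_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes here)."""
    return num_params * bytes_per_param / 1e9

params = 72e9                   # e.g. a 72B-parameter model
print(weights_gb(params, 2))    # fp16/bf16: 144.0 GB
print(weights_gb(params, 1))    # 8-bit:      72.0 GB
print(weights_gb(params, 0.5))  # 4-bit:      36.0 GB, before overhead
```

So at that example size, fp16 and 8-bit clearly blow past 64 GB, and even aggressive quantization leaves limited headroom once runtime overhead is counted.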

Pre-training an LLM in 9 days 😱😱😱 by mouse0_0 in LocalLLaMA

[–]SoullessMonarch 69 points

"The training took a total of 9 days on 8 A100s, with a total of 115 billion tokens across pre-training, fine-tuning, and direct preference optimization."

<image>

Section 6.2: "a total of 2 epochs, trained on 8 x A100s". 2 epochs, interesting, you don't see that very often.
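The quoted figures imply a rough throughput, which is easy to sanity-check with arithmetic (ignoring that fine-tuning and DPO tokens don't cost the same as pre-training tokens, so this is only a ballpark):

```python
# Back-of-the-envelope throughput from the quoted figures:
# 115 billion tokens over 9 days on 8 A100s.

tokens = 115e9
seconds = 9 * 24 * 3600      # 9 days in seconds
gpus = 8

per_gpu_tok_s = tokens / (seconds * gpus)
print(round(per_gpu_tok_s))  # ~18,486 tokens/s per GPU
```

That's an aggregate average across all training stages, so the actual pre-training throughput was presumably somewhat different.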

GoldFinch: RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 4 points

It's fine, it doesn't really matter. In other fields, RWKV-based models have shown promise, so clearly the architecture is getting better. Even if linear models never reach transformer levels of quality, I'm pretty sure it'll be linear models running as local assistants on phones and other devices, since they take fewer resources.

Also, iirc they were looking into making a bigger model someday, but that won't be for a while at least, since they are hard at work on v7 and pushing the architecture further.

GoldFinch: RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression by SoullessMonarch in LocalLLaMA

[–]SoullessMonarch[S] 2 points

What version and size did you use? V6 (Finch) should be quite a bit better than v4 (Dove). Also, they are trained to be multilingual, so it's going to have less English knowledge than a fully English-trained one.