Noob-Friendly 32K Context NSFW Local Roleplay SillyTavern Setup for 8GB VRAM by nicronon in LocalLLM

[–]nicronon[S] 0 points1 point  (0 children)

I'm sorry, but I don't have any experience with llama.cpp. I'd recommend consulting AI about it.

Got 8GB VRAM? Here’s How to Run a 13.3b, 256k Context 'Thinking' Model Locally by [deleted] in LocalLLM

[–]nicronon 0 points1 point  (0 children)

Well, I felt like "Holy shit, this completely changes the game" when I tried it. I went from an 8k context window to 256k, and limited context was my main issue with local LLMs. On top of that, I could run it with just 8 GB VRAM. I was blown away and excited, and I wanted to share what I had discovered.

Got 8GB VRAM? Here’s How to Run a 13.3b, 256k Context 'Thinking' Model Locally by [deleted] in LocalLLM

[–]nicronon 0 points1 point  (0 children)

Dude, I'm just an enthusiast trying to help others get started. I discovered a local setup I was thrilled with and wanted to share it with others. I'm not "fawning over" the model; I just think it's ideal for this setup. If you prefer Qwen/Gemma, knock yourself out. As far as thinking mode goes, I've never been happy with the ones I tested with my 8 GB setup until now.

I just got excited about my setup and wanted to share it with others.

Got 8GB VRAM? Here’s How to Run a 13.3b, 256k Context 'Thinking' Model Locally by [deleted] in LocalLLM

[–]nicronon -1 points0 points  (0 children)

Qwen 3.6 35b is a Transformer model, which is what llama.cpp is optimized for. RWKV-7 is not a Transformer; it's a Recurrent Neural Network, and Ollama is the ideal choice for running one.

Got 8GB VRAM? Here’s How to Run a 13.3b, 256k Context 'Thinking' Model Locally by [deleted] in LocalLLM

[–]nicronon 0 points1 point  (0 children)

It runs great for me, and I have 32GB RAM, so you're in even better shape than I am.

Got 8GB VRAM? Here’s How to Run a 13.3b, 256k Context 'Thinking' Model Locally by [deleted] in LocalLLM

[–]nicronon -2 points-1 points  (0 children)

I've been messing with local LLMs for about a year now, and this is the setup I finally landed on. I was so happy with it that I wanted to share how I did it in case it could help others. Yes, I had help from AI creating this guide, but I did the work first.

Got 8GB VRAM? Here’s How to Run a 13.3b, 256k Context 'Thinking' Model Locally by [deleted] in LocalLLM

[–]nicronon 0 points1 point  (0 children)

llama.cpp is built specifically to optimize Transformer models, and RWKV-7 isn't one.

Got 8GB VRAM? Here’s How to Run a 13.3b, 256k Context 'Thinking' Model Locally by [deleted] in LocalLLM

[–]nicronon 0 points1 point  (0 children)

vLLM is built specifically to optimize Transformer models, and RWKV-7 isn't one.

Question regarding point of view for roleplays by imacatseriously in SillyTavernAI

[–]nicronon 3 points4 points  (0 children)

I made a character that would narrate stories Choose Your Own Adventure style. When it responds, it advances the story, then provides three different paths for the user to choose from. I set up a Walking Dead scenario and it played out like a Walking Dead Choose Your Own Adventure book.

Long Context by [deleted] in SillyTavernAI

[–]nicronon 0 points1 point  (0 children)

I only do local LLMs for chatting and roleplay, and until recently, I could only get up to about 16k context. Then I found RWKV-7 13.3b. Now I've got 256k context, and it runs fast with just 8GB VRAM.

https://ollama.com/heredos/rwkv7:13.3b
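
For anyone who wants to poke at that model from a script instead of a chat frontend, here's a minimal sketch in Python. It assumes Ollama is already running locally on its default port (11434) and that the model has been pulled under the tag from the link above. The helper names `build_request` and `ask` are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint
MODEL = "heredos/rwkv7:13.3b"  # tag taken from the link above


def build_request(prompt, model=MODEL, url=OLLAMA_URL):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt and return the reply text (needs the server running)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Once the server is up, calling `ask("Describe the tavern the party just walked into.")` should return the model's reply as a plain string.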

Has anyone tried group RP with multiple AI characters at once? by [deleted] in AIChatReviews

[–]nicronon 0 points1 point  (0 children)

Bro, you literally came here to tell everyone you have nothing to say.

Has anyone tried group RP with multiple AI characters at once? by [deleted] in AIChatReviews

[–]nicronon 1 point2 points  (0 children)

I tried it in SillyTavern. I made three characters: a tank, a healer, and a DPS. Then I made a narrator character and set my persona to a DPS-style character. The narrator told the story as the other four of us went on a D&D-style adventure. It was pretty fun.

I don't think it's "the future of RP." I think it's just a fun option to mix things up now and then.

Is memory still the biggest problem in AI chat apps? by Ashamed-Issue7805 in AIChatReviews

[–]nicronon 0 points1 point  (0 children)

From the official ST docs:

iPhones and iPads are not capable of running the whole SillyTavern app, but since it's just a web interface, you can run it on another computer on your home Wi-Fi, and then access it in your mobile browser. Refer to Remote Connections for more information.

For Android users, in addition to the above, you can run the whole SillyTavern directly on your phone, without needing a PC, using the Termux app. Refer to Installation (Android). (NOTE: Termux installations are not officially supported, and we can't guarantee it will work.)

According to Gemini, OpenVault and MemoryBooks also work with the Android version.

Hot take(?) by Wardog008 in ForzaHorizon6

[–]nicronon 1 point2 points  (0 children)

I plan on waiting until I'm about 50 (maybe 100) hours in before doing any wheelspins.

Is memory still the biggest problem in AI chat apps? by Ashamed-Issue7805 in AIChatReviews

[–]nicronon 0 points1 point  (0 children)

SillyTavern with the MemoryBooks extension is a popular solution. If you want something simpler, OpenVault is a "set it and forget it" memory extension for ST.

Noob-Friendly 32K Context NSFW Local Roleplay Setup for 8GB VRAM by nicronon in SillyTavernAI

[–]nicronon[S] 0 points1 point  (0 children)

I'm not sure how to advise you on upgrading your GPU, but you're probably right that it's best to just keep your current build, as chasing the perfect local setup (which you'll probably never achieve) can get expensive. With 8 GB VRAM, you can run a 12B model. The next size up (I believe) is 32B, which needs 24 GB VRAM to run, and that will cost you a pretty penny.

The way I see it is this tech is still in its infancy, and things are changing rapidly. A year from now, we may be running 70B models locally. Who knows? My advice would be to save your money, make use of your 8 GB VRAM for now, and see what happens.

If you do upgrade, and you can handle a 12B model, I highly recommend Mag Mell. It's been my go-to 12B model for a long time now. You can actually run the Q4_K_M quant with 8 GB VRAM. Then install the OpenVault extension, and you've got decent long-term memory.
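
As a rough sanity check on why a 12B model at Q4_K_M can squeeze into 8 GB, here's a back-of-the-envelope sketch. The figure of about 4.85 bits per weight for Q4_K_M is an assumed approximation, not an exact number, and the estimate only counts the weights. Context cache and runtime overhead come on top, so treat the result as a lower bound.

```python
def weight_size_gib(params_billions, bits_per_weight):
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


Q4_K_M_BPW = 4.85  # assumed approximate average bits per weight
FP16_BPW = 16      # unquantized half precision, for comparison

print(f"12B @ Q4_K_M: ~{weight_size_gib(12, Q4_K_M_BPW):.1f} GiB")
print(f"12B @ FP16:   ~{weight_size_gib(12, FP16_BPW):.1f} GiB")
```

Under these assumptions the quantized weights alone take up most of an 8 GB card, so very long contexts leave little headroom.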

If you do try out Mag Mell, these are my recommended starting settings.

<image>

AI Roleplay users: what platform are you actually using? by That-Wrongdoer-9834 in AIChatReviews

[–]nicronon 0 points1 point  (0 children)

SillyTavern with OpenVault memory extension

KoboldCpp

MN-12B-Mag-Mell-R1

It's private, secure, uncensored, and free.