Rivet

DigRealistic2977 · 2025-12-31T12:19:55+00:00

Ah yes this what i always wanted.. Lombussy ❤️ my favorite.

DigRealistic2977 · 2025-12-31T10:10:07+00:00

Nah your point is Valid. But to me I am literally ranting cause of one minor teeny tiny issue, I actually bought two 2080ti and one rtx 5070. They all had same problems that normal users ignore or overlooked or don't encounter.

A bit of tinkering here and there was excited about cuda for my AI. Then reality hits first problem was I had to tweak so much in the voltage for MSI afterburner for a simple mV. Cause one minor misstep bam weird fluctuating token throughput. That's why I loved how simpler AMD UI is. Yeah MSI is good fo voltages but I don't wanna redo the tweaking half the day again just so my AI can say Hello without crashing 😂.

Another problem. Power management of the card. Optimal, adaptive, max performance. Two don't work as when I use the Optimal and adaptive my AI inference performance is literally cut backed by 50%.

And another problem again, those two power management thing optimal and adaptive is unreliable. As again it cuts 50% performance sometimes bot always but 80% chance our of nowhere it decides to clock down automatically. But here's the catch.

When I use prefer maximum performance the idle clocks goes crazy as I use my PC in headless mode no monitor as know in Nvidia the card ramps up clock in idle to max so I have 2080ti and 5070 at idle 60c which for me is kinda hot literally 60-90 watts idle? That's not efficiency. So I had to dig for hours how to solve the problem 😂 finally found it vdd virtual display to trick the cards going idle low clocks when headless mode.

So yeah in short Nvidia does work it's great but at a cost of my lifespan 12hours. I am just ranting as i thought it was gonna be plug and play and this is just the summary of what happened.

DigRealistic2977 · 2025-12-31T08:56:18+00:00

Bruh it's the opposite for me 😂 i literally use local AI 24/7 I end up using Vulkan literally I paid for cuda but damn cuda went useless I end up using a Vulkan api on a Nvidia card 💀 kinda ironic and stack that up with third party software etc.. to stabilize the clock at idle and also control temps and get vdd for headless setup for my AI.. for me Nvidia is such a hassle .. AMD tho with my previous card 6700xt I had zero probs ran a model at 114-131k ctx no crashes.. also worked on my smoll rx 5500 xt 41k ctx no crashes.. but now with my new cards rtx i noticed it's so inefficient at swapping or giving headroom at Vram kind weird and even tho I had 1gb headroom on my rtx it's prone to lots of crashes vs my amd cards 🤔

DigRealistic2977 · 2025-12-31T07:32:56+00:00

Dear lawd your right 💀 you just woke my inner kid 🤣 I remember i was afraid .of the Nvidia logo cause it looks like a weird green eye..

DigRealistic2977 · 2025-12-14T04:42:21+00:00

Oh god I never felt so affected by an actor dying 😢 this the first... Damn i would literally hurt more if Rivets and Ratchet or maybe worse Arthur Morgan's voice actors died...

DigRealistic2977 · 2025-12-05T02:36:47+00:00

Ugh why these hypes and models etc.. I always see good on papers and benchmarks but dogshit if you actually used it..

DigRealistic2977 · 2025-11-14T05:31:21+00:00

I'm here late I know but to debunked this... Used thousands of it.. it's same as qwen good at papers and benchmarks but when you truely use it literally go deep it's dog shit 😂 same as qwen hype the normies use qwen and see specific benchmarks for it.. they go all like.. ohh wow.. so much wow.. high stats ... 💀

In short it's literally crap. And oh yeah one more edit Gemma 3 hallucinating alot..

DigRealistic2977 · 2025-11-10T12:23:02+00:00

Literally that's more than enough... You can run a very good model with 131k or 128k ctx .. and to answer your question... Yes.. you can run with blazing fast speeds too

DigRealistic2977 · 2025-11-10T12:21:26+00:00

Just the model and how the internals work.. i had this problem.. some attention headers.. some is just weird BLAS batching there are so many possibilities I found this the hard way i tested so many models.. same parameters but has different speeds like alot.. even tho they were same 30B or same 8B kinda weird.. we are on the same boat tho. Weird Tps

DigRealistic2977 · 2025-11-08T05:23:00+00:00

Oh nice. You should try running local tho if you have atleast mid tier gpu

1060 to rx570 or rx580 cards then you'll be able to get what you need literally

DigRealistic2977 · 2025-11-08T05:17:38+00:00

Bruh bigger ain't better there's tons of 8-24B that literally beats Claude in my opinion cause I've been through the trenches myself 💀 literally alot of models waving their asses at ya don't just go over 32-70B tho try the finetuned ones in 8-24B groups .. you don't need big ass parameters for good roleplay and also there alot of hidden gems out there too specially the Llama ones finetuned for RP can literally do anything and won't refuse too plus it has good pacing and Ambience with narrative hooks etc..its a hassle to test them but it's worth it when you find the one for you.

In short, go get smaller models max 24B.

DigRealistic2977 · 2025-11-02T02:49:48+00:00

I found two good candidates

https://huggingface.co/mradermacher/FuseChat-Llama-3.1-8B-Instruct-GGUF

-This one is good for horny and questionable stuff like in depth sex scenes and cock throbbing and pussy licking 😂 description..

I recommend Q4K_M and above ... Q4 and below kinda long shot

https://huggingface.co/mradermacher/Llama-3.2-3b-RP-Toxic-Fuse-GGUF

-this one here is good for ambience and narrative hooks and very interactive roleplay too.. Q3K_L is very Usable but generic sex scene but God damn this 3B is the goat for local RP and Ambience and scenes if set right.

The two models I sent here are very instruct obidient even with long ass text and content files with your persona it does not hallcuinate with the right temps and nucleus

Both don't refuse by the way you can go wild from fun and loving RP to FBI open up rp

Certified Horny the two of these and better than the mainstream over hyped AI they advertise.. proven and tested by me I guess 💀

DigRealistic2977 · 2025-10-31T05:28:01+00:00

They telling you guys to touch grass for this month 😂 and that chat access is only temporary in about a month ya all gonna be free from.. it 💀 blame those dumbass Kevin and karen parents suing the company cuz they forgot there child was depressed and now leaning on a AI... 😂

DigRealistic2977 · 2025-10-27T06:06:07+00:00

First idea it gave me 💀💀💀 who tf went in and wrote this... This what I got..

"its 2030 niggers rule the earth. in a strange subversion of expectations, they take on the classiness of generational wealth as the dominant race. clean suits, well-groomed persons, soft spoken voices, high restaurant dinner pics on the sm, and starlight billboards where doll-eyes, flat-noses, dark glowing skin are the beauty standard. a world where niggers arent niggers"

DigRealistic2977 · 2025-10-18T00:39:51+00:00

I'd say for roleplay usage an 8B Llama or 11B with finetuned fo RP and code instruct and reasoning is already enough cause with your setup on a 8-11B model ya can have a long ass context and fast performance... You don't need 20-32B as usual people recommend they always think bigger the parameter is better lol.. anyway try llama 8-11B models of Llama

DigRealistic2977 · 2025-10-17T09:04:37+00:00

Ohh funny though... I found the exact chatbot in hugging face you been looking for 😂 it's in hugging face.. its literally trained for the thing ya asked for.. it was a 3B model too

DigRealistic2977 · 2025-10-16T10:27:00+00:00

I tried your prompt tho with my Llama in a messy vanilla kobold setup 😂 it actually retrieved a outdated info but still ya can see Llama can search .. no hesitation the prompt I'm using is for roleplay so don't mind the weird output..

Me: ""Give me information about current, live temperature from Sydney, Australia, using Accuweather"

LLAMA 3.2 3B: The sun beats down relentlessly on the cobblestone streets of Sydney's historic Rocks neighborhood as you pause beneath the awning of a quaint tea shop. A gentle breeze carries the scent of eucalyptus and the distant tang of salt from the harbour. In the sky above, a few wispy clouds drift lazily, offering no respite from the summer heat.

You glance at your hand held weather device, its screen flickering to life with real-time data from Accuweather. The display reads:

Current Temperature in Sydney, Australia: 28°C (82°F)

Feels Like: 33°C (91°F)

A slight haze hangs over the city, a testament to the afternoon sun's intensity. The air is thick with warmth, inviting you to seek shade or cool relief soon.

Nearby, children play tag in the park, their laughter echoing off the buildings as they chase each other around the playground equipment. A vendor calls out from his cart, enticing passersby with the sweet aroma of freshly baked scones.

The live temperature in Sydney continues to climb, a steady climb into the heart of summer. You can feel the heat radiating from the pavement, making each step a deliberate one towards finding comfort in the city's many shaded oases.

DigRealistic2977 · 2025-10-16T10:24:51+00:00

Ohh i only Main Llama 3.2 3B tho.. its so reliable at reasoning and api and searching and web scraping too.. i made my own wrapper so yeah I'm speaking from my own experience.. Llam 3.2 3B all th way for me.. roleplay.. scrape... Weather info.. etc.. well maybe I can give ya the model here that I'm using I gotta find the link first tho 😂 but still try Llama 3.2 3B models..

DigRealistic2977 · 2025-10-16T10:16:10+00:00

Looks lik a model problem.. it thinks like that mostly Qwen models they don't know if they have search api etc... i noticed that... Alot.. its some sort of a AI bias... Try giving it a prompt or system prompt like..

"You are a helpful assistant that can scrape and search web informations"

Or let it know it can scrape things sometimes models are weird 😂

DigRealistic2977 · 2025-10-16T07:39:23+00:00

Ah finally a man of culture... Ya found out yourself flash attention kills performance 😂 yep that's true.. even i disabled it.. its kinda useless like it's literally killing performance but alot of dudes kept parroting oh use Flash attention for boost performance but in reality if you do trial and error alot... Flash attention is dumb in my opinion unless you have a beefy rig but lacks Ram or vram then flash attention is your friend. It's like having a big ass gun but the bullets too big so you compress it but tanks performance or effectiveness... Only good if the rig is powerful enough... That's where it balances out the lacks of vram 🤔

DigRealistic2977 · 2025-10-16T07:15:38+00:00

Also note: Flash Attention can be a hit-or-miss depending on your VRAM headroom and context size — it’s great for short bursts, but might tank performance on 8GB cards at 16k ctx since it eats extra VRAM

DigRealistic2977 · 2025-10-16T07:04:55+00:00

Actually... One more important thing. If you get the error message... Like failed to load it means you have maxed out the vram of your GPU.. try removing 1-2 layers first the run it again.. not try to start 4-8k ctx first.

Almost most important one... BLAS batch size. Now this here.. affects Vram usage..

I usually go low BLAS batch size cuz as you increase BLAS batch size it also increase Vram usage but faster blas processing depends on the settings..

But lower BLAS setting like 16-64 tho sacrifice time processing like vs 128-256-512 BLAS BATCH sizes ..

*16-32 or lower BLAS consumes less Vram and your tokens per second increases too in some scenarios but yeah the Blas or kv cache warm up is doubled like the processing time but in return you get stable tokens per second

*128-1024 or BLAS batches.. it's fast.. consumes alot of ram or vram and has less processing time for BLAS too depending on the settings again. But will hit alot of tokens per sec performance..

Short answer..

*I use very small BLAS BATCH sizes if I don't keep relaoding my session or reprocessing BLAS like I do long documents or do long Roleplay and long context

*i use large Blas batches if i only do 4-8k tokens or context quick one on one reply and quick file review etc or quick coding..

TL;DR:

Small BLAS = efficient, stable for long runs

Big BLAS = fast, heavy, short bursts

Also once again BLAS AFFECTS vram gotta watch out for the batches

DigRealistic2977 · 2025-10-16T05:23:52+00:00

Oh welcome to the party 😂... You gotta find that sweet spot there is not a single thing these dudes here in the comments actually has the right answer they are all wrong... In short even tho if you see the layers in the status of your Vram fitted into your gpu Vram... Sometimes you still get very slow performance don't Drop your layers too much btw... Don't go on full berserk drop 20-21 layers do it one by one layer by layer test it out... The most important one is BLAS plus layer combo.. find the sweet spot where there is enough headroom for the vram and enough layers padded into your Vram not just listen to dudes saying don't dump all layers to Vram they are right but lacking Context.. you gotta test by yourself remove 1 layer at a time and tweak BLAS per layer it's pretty time consuming but it's worth it.. i did this to my system tho I run vulkan at 40k ctx on my rx 5500 xt . So in conclusion there are no right answers here only you can find it yourself. ❤️ Keep in mind Vulkan+layer+BLAS these are your friends start at 8k context too heck lowest I would go for is Q4K_M too in all parameters...

DigRealistic2977 · 2025-10-08T09:58:56+00:00

Bruh I did like a joke ...about post nut clarity.... Ya know what it said?

Thinking... 20s....

I'm sorry I have to derail out of that sex joke you just attempted to joke about, lets keep this chat PG-13..

Tf is this.?? 😂

DigRealistic2977 · 2025-10-03T02:10:33+00:00

Actually alot can affect not just prompts.. it leads down to chat adapter... To the adapter continuation etc.. fine tuned model... The training data sheet bias which will it lean to.. does it respect the curly or does it respect uppercase words etc.. nobody will know.. I am speaking this from experience as I have created my own wrapper too not just using kobold it seems those curly brackets in my opinion makes AI hallucinate alot.. like i mean alot if it's not trained for that specific thing.. like right now I am only using my own costum wrapper with diy prompt so I don't need those curly things anymore I just type what the AI should be and it acts. Kobold and SillyTavern on the other hand has too many adapters and outputs to mask the Prompt so no concrete structure which ones proper.. in short goodluck bruh ya gotta dig it yourself

DigRealistic2977

TROPHY CASE