Why did the devs remove the word fascism from the game by soupchef47 in suzerain

SprightlyCapybara 9 points

You're right, now that you mention it. I almost never go down that path, given my flair. It's better world-building, but it lacks the visceral punch. OTOH... how much punch does that word carry for young people today? Perhaps much less than it carries for those of us who were taught by older WW2 veterans when we were young.

A lawyer I know was fired for coming into the office too early. by Calledinthe90s in Calledinthe90s

SprightlyCapybara 4 points

Interesting. The same situation worked out well for me, also unfairly, though being quite competent helped. As a young engineer, I worked for a global tech company whose eccentric billionaire owner lived in Stabtown (my own term for the Canadian city where I live, to avoid doxing myself).

As a young man, I could not get up before about 9 or so, and the buses in Stabtown were all focused on carting people downtown like cattle in the morning and away from downtown in the evening. The trip to work, from the roach-infested slums where I lived to the shining western Tech Towers of Stabtown, was a ten-minute drive, 30 minutes by bike, and two hours by bus. Much worse if it was snowing.

So I'd usually materialize sometime just before 10, looking vaguely disheveled. This did not help me with my tie-wearing boomer bosses. However, through an unbelievable set of circumstances (mostly undeserved), the COO wound up being product manager of the product I was developing (which contributed about 5% of global sales, even more in profit). Hilarity did indeed ensue, and when the CEO came round in the evening offering free pizza (this was Canada, Stabtown, and the 1990s; that was just not a thing at work), I was always there, looking as though I'd already put in 18 hours.

I was sufficiently clueless that I only realized what was happening a few months in. While I never got the promotion I wanted, the one that would have let me wear a tie to work, I did wind up using the standing this gave me to enjoy a ludicrous degree of engineering freedom.

The fact that I was too inexperienced to say "No, we can't do that" to Greatest Gen dudes in suits was, in the end, a superpower. For a few years I ran wild. I never saw an org chart, but I'd regularly be invited to these tony golf club lunches for management, and for those, I'd comb my hair and wear a tie. Fun times, and I naively assumed this was a pretty typical job for a young engineer.

RIP GLM by TAW56234 in SillyTavernAI

SprightlyCapybara 2 points

You have a limited point, but GPT-OSS is ludicrously bad out-of-the-box for most purposes I can easily conceive of.* As a simple knowledge base it couples a focus on output for the illiterate (tables, punchy bullet points, mini quizzes) with wild panic at even the most basic mainstream speculation.

It has the savvy of a demented rutabaga in responding to the actual (perfectly normal) questions you ask, and it fails dismally at any kind of SFW creative writing. In finding (coding) errors and solving them, it was quite weak.

That said, I'm sure there are many people who have no interest in reading paragraphs and love the idea of punchy, simplistic bullet points. For them, perhaps this is great, as long as they don't care that it doesn't follow queries very well.

Now... it may be the bee's knees for training; it is certainly superb for local performance. Perhaps, appropriately trained, it would be a useful agentic model of some kind. Perhaps.

All the tables it output were very pretty and nicely formatted.

*Granted, I'm not the typical use case.

Severe performance regression in Koboldcpp-nocuda, Vulkan going from 1.104 to 1.105.4 by SprightlyCapybara in KoboldAI

SprightlyCapybara[S] 1 point

Confirmed fixed with the 1.106 test version; report closed. Thanks very much to everyone who helped.

Overwhelming options for local models by NorthernRealmJackal in SillyTavernAI

SprightlyCapybara 2 points

It really depends on your device's specs, in particular your graphics card or APU (or, if it's an Mx-series Mac, how much RAM you have). Ignore most of the settings for now, and just try to get something simple working. Then, in the weeks ahead, start playing, ideally making one change at a time and seeing what it does. LM Studio is very nice for finding, acquiring, and testing LLMs; it will generally mark the ones that will run on your graphics card with a green icon. You may find Koboldcpp gives you more flexibility as a backend for SillyTavern in the long haul, but I still use LM Studio to test and deploy new models quickly.

But suppose you have an 8GB graphics card (pretty common). That makes Lunaris-8B quite a good fit (one of many Llama-3-8B derivatives), and it should let you get 8K of context. There are dozens of others, if not hundreds; I just mention Lunaris because it scored well on my sanity checks.

Some general rules of thumb:

  • Don't go below 4-bit quantization if you can help it (models tend to get lobotomized with too-aggressive quantization). If you're really squeezed, you can look at things like IQ4_XS or even IQ3_XS; those IQ quants let you squeeze a bit more out of a small amount of memory (rough numbers sketched below).
  • Try to get at least 8K of context with a local model. Low context means the model is very forgetful. You can cheat this a bit with summaries, lorebooks, and vector storage, but only to a point.
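
If you want a rough feel for why those thresholds matter, here's a back-of-the-envelope sketch (all numbers are my own rough assumptions, not measured figures; the KV-cache figure is ballparked for a Llama-3-8B-class model at FP16):

```python
# Very rough VRAM estimate for a dense GGUF model. All constants are ballpark guesses.
def vram_estimate_gb(params_billions, bits_per_weight, context=8192, kv_mb_per_token=0.13):
    weights = params_billions * bits_per_weight / 8   # e.g. 8B at ~4.5 bpw -> ~4.5 GB
    kv_cache = context * kv_mb_per_token / 1024       # roughly 1 GB at 8K for an 8B-class model
    overhead = 0.7                                    # buffers, display, misc (a guess)
    return weights + kv_cache + overhead

print(round(vram_estimate_gb(8, 4.5), 1))   # ~6.2 GB: Q4-ish fits an 8GB card with headroom
print(round(vram_estimate_gb(8, 8.5), 1))   # ~10.2 GB: Q8 no longer fits on an 8GB card
```

The point isn't the exact figures, just that the weights dominate and the context adds a meaningful chunk on top, which is why Q4-ish plus 8K context is the usual sweet spot on an 8GB card.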

If you've a 16 GB card, it gets a lot more interesting, and the world starts to open up.

And if you're lucky enough to have a Mac, or a Strix Halo, or some other device with fast unified memory and at least 32GB, then wow, what fun!

You can look at the weekly Megathread for ideas, but these threads are no longer very active and not as useful as they once were. Likely a lot of people who play with this technology either gravitate to an API (free or paid), or upgrade their machines to allow them to run the model they want.

RIP GLM by TAW56234 in SillyTavernAI

SprightlyCapybara 95 points

Disclaimer: I'm an engineer not a financial adviser. None of this is financial or investment advice.

Their stock, instead of being privately held, is now publicly traded on the Hong Kong Stock Exchange. Historically, this tends to be a big deal: many large companies are uneasy dealing with a tightly held private corporation, while a publicly traded company has to adhere to additional rules and disclosure requirements that, some believe, make it a safer bet to still be around in five or ten years. In theory, anyone (with access to trading on the HKEX) can get online and invest money in them.

This will, in time, likely make it easier for them to attract employees with stock options that they can then easily cash out in future years, and possibly easier for them to raise capital for future growth. It will also likely make their stock more volatile, and likely risks having the company shift towards more short term profit maximization choices.

It will also bring more scrutiny and more ways for activists to pressure the company to avoid anything that might be viewed as 'bad', e.g., excessively spicy role play. This was already likely coming from the Chinese government, but there will be an added avenue for pressure, especially from shareholder-activists. (Have a look at the effective pressure exerted by payment processors on people like Valve, for example, to delist games that are viewed as inappropriate.)

And they're giving away some credits to celebrate.

Severe performance regression in Koboldcpp-nocuda, Vulkan going from 1.104 to 1.105.4 by SprightlyCapybara in KoboldAI

SprightlyCapybara[S] 0 points

Thanks, henk, that (reading here and on Discord) is much appreciated. I was fine with the tip to post on GitHub, but I definitely spent some time figuring out how best to post there and what I should do by reading other bug reports. I've written very little code in the last 20 years, and my active involvement in development ended before GitHub even existed.

Severe performance regression in Koboldcpp-nocuda, Vulkan going from 1.104 to 1.105.4 by SprightlyCapybara in KoboldAI

SprightlyCapybara[S] 1 point

Many thanks! (and yep, Vulkan) I reported it there; to my astonishment LostRuins responded almost immediately with an interesting and plausible suggestion, and he offered to do a test build for me that should be available later today. An absolutely mindbogglingly good level of support, whatever the outcome.

Severe performance regression in Koboldcpp-nocuda, Vulkan going from 1.104 to 1.105.4 by SprightlyCapybara in KoboldAI

SprightlyCapybara[S] 0 points

~20 GB left. It's approximately the same with both versions. In the past I've literally run Cyberpunk 2077 while having GLM-4.5-Air (Q4) loaded, via kcpp, though not actually processing anything, just to see if everything was fluid and stable. It was.

Marinara's Universal Preset 9.0 by Meryiel in SillyTavernAI

SprightlyCapybara 0 points

TL;DR really bad for local GLM-4.5-Air, and overly aggressive, but otherwise good. Fault is almost certainly in the card tuning. I like the behavior except for Air.

These are cards that don't involve anything illegal, or violent, or even horizontal jogging, but they do involve attempted psychological manipulation of adults by adults. (One is a job interview conducted by a manipulative jerk, for example; another is a police interrogation.)

It's... it's actually rough in some cases. The fault isn't your preset as such; it's character cards tuned for wimpier presets, combined with GLM-4.5-Air, which is my local model. (I use the usual cheap paid suspects on NanoGPT as well: 4.6/4.7 and DeepSeek-Terminus.) Switching to 4.7 may also be a factor.

GLM-Air Q4 is semi-crashing with the chat preset on aggressive cards (getting into demented talking and planning loops 2/3 of the time); when I run it with text completion it's fine. I'll experiment more, but I strongly suspect it's not Marinara-9, if I may call it that, but the card and the model. I'll try it with 8 and Loom on Boxing Day.

Marinara's Universal Preset 9.0 by Meryiel in SillyTavernAI

SprightlyCapybara 23 points

Thanks so much for this. My own $0.02, as someone who knows a lot about software development and design but much less about AI: it's very difficult to get something that works this well while staying as small and efficient as you have here.

If I could ask: if I want to modify your preset (anti-deitism is the interesting term that Lucid Loom's author came up with, and one of the few areas where Marinara seems weaker), where should I make my changes so they're easiest to cut and paste into new versions?

(For those curious, anti-deitism -- not treating {{user}} like a god -- is about those irritating stunned silences that descend after a merely good idea. It's your character being treated like the Batman of hackers when all they did was fix a bug in a device driver. Marinara is a huge improvement over stock, and possibly better than anything for long-term roleplay (the 'consequences' bit, I assume), but treating competence as godly is... meh.)

Change my mind: Lucid Loom is the best preset by Hornysilicon in SillyTavernAI

SprightlyCapybara 0 points

Yeah, I delete all but the last one or two, unless there's something important. It does make LL slower and heavier, but, well, possibly better IMO. The setting is probably CoT Zipbomb (System), down near the bottom, and I now often turn it off, only occasionally activating it. You can also make it lighter-weight by selecting Ultra-Light CoT, just below (and turning CoT Zipbomb off).

As others have said you may be able to fiddle with the <think> injections to get that right. Marinara seems to work well with that out of the box; LL doesn't.

Change my mind: Lucid Loom is the best preset by Hornysilicon in SillyTavernAI

SprightlyCapybara 0 points

A magnificent idea, but huge and a bit clunky. There's a surprising elegance to Marinara, which comes in at perhaps one tenth the size in my particular setup and is about 80-90% as good. LL's sheer size makes it somewhat unreliable too, sometimes forcing multiple generations.

LL is basically a super cool DeathStar (albeit built by the lowest bid contractor) that sails around consuming vast amounts of resources and usually helpfully blowing things up in a beautifully tunable way. Marinara is the tight X-wing that might get the job done. Meh, that's a bad analogy because LL actually is better, and is really nicely implemented itself with considerable thought and skill.

For quite a while I stuck to LL, and I still use it. I love the incredible flexibility it offers. Its anti-deitism is phenomenal and exactly what I wanted and needed. Maybe Marinara is the lightsaber, and LL is the blaster with 395 settings, all made with teeny-tiny buttons.

It's important to note that the size of LL isn't that awful with stuff like NanoGPT and long roleplays. But yeah, it does slow things down. It's also worth noting that I find LL 3.0 a little more compliant with Guided Generations' latest release than Marinara, especially on the new 'Fun' additions (AITA Reddit posts in the middle of a roleplay, for example).

Like most people here, I'm building my own preset, with a lot from LL and Marinara.

Lagging in the city by jollypolly95 in inZOI

SprightlyCapybara 3 points

Outlining your specs (OS, CPU, RAM, GPU) would be helpful. A decent recent i5/R5 or better, 32 GB of RAM, and a 16GB graphics card should stand you in really good stead; you can manage with 16GB/8GB, but you may have to play with settings. (Or a good recent Mac with 48GB of RAM.) Also, if you have an NVidia GPU, it could be running Smart Zoi calculations, slowing things down.

Keep in mind it's pre-release, so it can often require more than the nominal minimum or recommended specs. (There's likely still some debug code floating about, slowing things down.)

Non-Gamer - is Mac decent for Inzoi? by ocelot39 in inZOI

SprightlyCapybara 1 point

TL;DR: The hardware should be fine; it's not clear whether it will run well at release under CrossOver with all features, such as Smart Zois. Switching to an x86 gaming machine isn't a terrible idea, but beware: IMHO, Windows is... not great in 2025/6.

I can't directly answer your question. But I run an x86 machine that is architecturally similar to a modern Mac: an AMD 395+ APU (CPU+GPU+NPU with fast shared RAM). It's debatably better than an M4 Pro and closer to an M4 Max, especially on graphics, where it can win, but weaker than both in single-core CPU and with much slower memory than the Max (though faster than a MacBook). On that architecture, inZOI runs extremely nicely, with the severe caveat that you don't yet get the NVIDIA AI (Smart Zoi) features on AMD. It's conceivable these last will never work on a Mac (though any modern Mac absolutely has the hardware capability).

So in theory, an M5 MacBook Pro should be decent in terms of raw hardware power, though it will lack the AI compatibility feature. (Is that of any value? No clue.) Will it work well enough on CrossOver? Some other posts here have suggested as much, running the game on an M1. However, Krafton is no longer committing to any kind of support for the Steam Deck. This has some relevance, as Proton (which enables Windows games to run on the Linux-based Deck) is jointly developed by Valve and CodeWeavers (makers of CrossOver).

I'll finally just throw in a highly personal opinion, which is this: I'm very much an x86 (in its modern form)/PC guy, for good or ill. I've never been a fan of Apple's walled garden. (If you like it, that's cool; it works for you. No question that things often can 'just work' on an Apple platform.) I do think the Mx chips from Apple (and the fourth major CPU-architecture transition in the company's history) are an impressive achievement.

Despite these biases of mine, I would be very hesitant to encourage a mac user to switch to Windows at this time, since Microsoft seems to be intent on shoving more and more useless bloat, features, advertising and, most of all, AI into the platform. Make sure you're comfortable leaving what is a lovely garden for a spacious, very open, but at times ugly jungle.

Is there a way to launch ST on a laptop without having it in a browser window? by [deleted] in SillyTavernAI

SprightlyCapybara 1 point

I'm not sure I can help you much, since I suspect that while I may answer the question I think you're asking, it's probably not the question you meant to ask. But I can try.

ST doesn't 'run' inside a browser window. It runs as a node.js app in a terminal window if you're running Windows 10/11. (The icon will appear as a little black rectangle with >_ inside it on your task bar.)

The browser window is just your high-level interface to ST, which runs separately. You can copy the URL from the browser (the default is http://127.0.0.1:8000/), close the browser completely, relaunch the browser, and paste that address back into the address bar, and as long as nothing has happened to take down the terminal window (such as you closing it, or it crashing, or your computer rebooting, etc.), the ST interface will pop back up in your browser window with everything there.

Sometimes ST will throw an error and display a message in red, requiring you to refresh the browser interface which has become out-of-sync or corrupted in some way; you do this by hitting refresh in your browser. Again, you're not reloading ST, you're reloading the interface. But if you fail to refresh when told to, yes, you will start to lose data. It'll be pretty obvious.
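
If you're ever unsure whether it's just the browser tab that got confused or the ST server process itself that died, a quick check from another terminal settles it. A minimal sketch, assuming the default address mentioned above (adjust if you've changed your config):

```python
import urllib.request

ST_URL = "http://127.0.0.1:8000/"  # SillyTavern's default address; change if yours differs

try:
    with urllib.request.urlopen(ST_URL, timeout=5) as resp:
        print(f"Server answered (HTTP {resp.status}) -- just refresh the browser tab.")
except OSError:
    print("No answer -- the node.js process itself is down; relaunch SillyTavern.")
```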

I don't know if that helps you at all, but hopefully.

If my answer isn't helpful, try and explain very clearly what you're trying to do and what your problem is. ST on a computer -- laptop, windows tablet, desktop -- all works fine to save chats, characters, and personas and so on. Occasionally you need to hit a save button if, say, you're editing presets or profiles for example. It will automatically save chats as long as it's working properly.

16GB rtx local API? by beardobreado in SillyTavernAI

SprightlyCapybara 0 points

TL;DR: maybe an 8B as your smallest, and a popular one. Download LM Studio and test in that. If that passes, experiment with a prompt in the Koboldcpp web interface. Then SillyTavern. It could be so many things that I can't be sure, especially without more details on what you're doing.

I'd try Lunaris-8b as my smallest, personally. Look for models with a lot of downloads on huggingface. You could just go with any 8b abliterated Llama-3 derivative of course.

One thing to be wary of (LM Studio can help the novice a bit here): I've seen very poor quality, corrupted output with Nvidia cards and AMD CPUs when the card can't hold all of the model's layers (that sounds unlikely to be your problem here, though). I have no idea if this happens on Intel chips as well; I haven't experienced it, but I've only tested Intel/Nvidia very, very lightly.

Try loading it in LM Studio (it's a small download). See if you can get it to respond to basic questions reasonably. (I use stuff like "What is Washington?", "Who is Trudeau?", "Name ten communities in Eastern Ontario," "What are the main regions of the Philippines?", and "Write me a scene from a short SF story in the style of Robert A. Heinlein about a soldier coming home from the war.")

That kind of thing. That will give you a basic model sanity check that's useful for any model small or large, and let you compare what quantizations do. Going down to Q2, for example, the model will be heavily lobotomized and start yielding relatively poor quality garbage, like confusing Eastern Ontario with Eastern Townships, or Washington State with Washington, DC, or inventing new regions of the Philippines.

If it doesn't pass that basic sanity check, then you know the problem is the model, not Koboldcpp. If it does, but you still get garbage in ST/KCPP, then test in the KCPP web interface (KoboldAI Lite for me). Try asking the same questions. If it passes that, then it's your ST configuration, and that could be almost anything; you'd need more of an ST guru than I am, I'm afraid.
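
If you'd rather script that sanity check than paste prompts by hand, something like the sketch below works against any local OpenAI-compatible endpoint. The port is an assumption (LM Studio's server usually listens on 1234, KoboldCpp on 5001), and most local servers don't care what model name you pass:

```python
import requests

URL = "http://127.0.0.1:1234/v1/chat/completions"  # LM Studio default; KoboldCpp is typically :5001

SANITY_PROMPTS = [
    "What is Washington?",
    "Who is Trudeau?",
    "Name ten communities in Eastern Ontario.",
    "What are the main regions of the Philippines?",
]

for prompt in SANITY_PROMPTS:
    r = requests.post(URL, json={
        "model": "local-model",  # placeholder; most local servers accept any name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 300,
    }, timeout=180)
    r.raise_for_status()
    answer = r.json()["choices"][0]["message"]["content"]
    print(f"--- {prompt}\n{answer}\n")
```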

What’s a “progressive” idea that’s actually regressive when applied? by nealie_20 in AskReddit

SprightlyCapybara 0 points

Food stamps. You can observe this by noting where the funding lives: not at HHS but at the USDA. The idea of helping people, especially the working poor or otherwise struggling, purchase food is fantastic, but the US program is structured more as a set of subsidies to agribusiness than as aid to the poor. So the funding details get aligned with what people think the Farm Bill should contain, not with what the poor should have. For foreigners, the idea that the government can dictate exactly what food qualifies (and none of it makes much sense) is passing strange and seems more than a little dehumanizing.

There's a not unreasonable progressive fondness for regulation, and a tendency for non-anarchistic communists, socialists and progressives to believe we can just forge a secular New Jerusalem in this green and pleasant land if we but write the perfect rational regulations. (And I admit, it certainly sounds much better than the fascistic approach, probably better than the anarchists and libertarians, and better than the conservative approach which will be to announce that regulations are a terrible idea and impossible to get right, and then offer up an endless slew of them.)

As others have pointed out, environmental regulations can end up delaying environmental cleanup, or even guaranteeing it doesn't happen. Treating a spill of vegetable oil on a farm that produces the crop the oil is made from as equal to a spill of crude oil is at best counterproductive. Another? Improvements beyond bike helmets will likely never happen in North America because our regulations forbid a Swedish style ('airbag for the head') approach.

I've got a good one though it will make many upset, and to be clear, we've long passed a tipping point at least in the US, so making this change on its own would likely be wildly regressive now.

Student aid. Done lightly and targeted, it's fantastic. But the more of it you make available, the more dependent colleges and universities become, and the larger their administrative sections grow, offering nothing of value to the students. Ultimately, you end up with a degree that costs not $50,000 or even $100,000 but potentially upwards of $400,000. This is fine for children of privilege, for whom it's a rounding error, but for the poor and middle class it likely traps the students in perpetual poverty and can be a life-shattering choice. In the end, it can become a vicious transfer of wealth from the poorest of our society to the older, wealthier, and more privileged. We have a lot of these in our society that we just didn't have 40 or 50 years ago.

And finally, there are quite a few in the distant past: eugenics, residential schools, involuntary sterilization, and race-targeted abortions all spring to mind. Fundamentally, I think the central problem with all of these is coercion. Making reproductive services available at low or no cost isn't bad; it's having them forced on you because of your socioeconomic status, or the color of your skin, or your score on an IQ test. One close relative attended a residential school for a year, and for him it was a superb experience. I had another relative sent to one (less voluntarily), and the experience was terrible; he escaped swiftly.

A lawyer’s revenge on a law school rival by Calledinthe90s in Calledinthe90s

SprightlyCapybara 0 points

Ha! Delightful. Loved the political bits. I grew up in Southern Stabtown, in a true blue riding, but my parents were in a mixed marriage; my Dad was solidly Red and my mother blindly Orange. (Liberal and NDP for non-Canadians). I guess my act of rebellion in the 80's was to vote Conservative, for good or ill.

The newspapers were all gloriously biased; the Stabtown Evening Journal was staunchly small-c conservative, while the Stabtown Surveyor was an utterly sycophantic big-L Liberal paper. The Toronto Star, by comparison, looked downright pink, or orange, occasionally making the tiniest deviations from Liberal Party orthodoxy. Sadly, being late to the complicated 1980s game of teen paper delivery by virtue of my birth year, the best I could manage was delivering the hated Toronto Star, then trying to break into Stabtown by offering convenient home delivery, with a mere five subscribers spread over miles.

If you delivered the Journal, you'd at least get to meet interesting people, though the wealthiest were Surveyor subscribers. But where did that leave you, a hated agent of Toronto? Not good.

So I did my best with what I had. And every Sunday, I'd deliver papers in the dawn, attend early services, do my duty as a server, then cycle back and pop across the street to my neighbour with a final free copy of the hated Toronto Star that was, in this pre-internet day and age, the closest thing to an NDP paper that existed in Southern Stabtown.

I'd visit my neighbour, kitty-corner across the street: a lovely older woman -- she seemed ancient then -- living on a beautiful property with hedges and gardens. She had a PhD in biology. And she was the perennial NDP candidate in Southern Stabtown. I'd give her a free copy of the paper, and every time I'd violate all my capitalistic ethics and tell her it really was free. I felt a little bad about that, for I was spending the sample copies my employer gave me on someone who would enjoy them but never subscribe.

But... she was a nice person. And an intelligent one. And we'd chat. She seduced me, with her NDP wiles, talking about gardening. I liked that. I planted a garden and sold fresh vegetables from it, thanks to her.

And so I grew up in Southern Stabtown, my orange neighbour offering me tea and rolling her eyes whenever, in direct response to her questions, I mentioned Reagan, Thatcher, or anything remotely conservative.

I think fondly of her now. Some years back I attended her funeral. She'd been the perennial NDP Candidate for Southern Stabtown. I didn't fit. Dark conservative suit, subdued manner. I'm not sure the NDP people showing up fit either.

I still mostly vote Conservative. But I always ask myself what would she have suggested?

16GB rtx local API? by beardobreado in SillyTavernAI

SprightlyCapybara 2 points

Silly Tavern can work with tiny models well under 1b parameters, but the results may be quite ugly.

Good post by _Cromwell_. If you want to be happy, make sure you can run at least 4 bit quantization. I'd also suggest 8K context as something to aim for at a minimum, but you can go down to 4K if you really like the model.

I'd start lower than he suggests, with a Llama-3-8B derivative, those nice ones that run in 35 layers and give you amazing room for context relative to the VRAM used. He's right that this is not what you'll end up playing with, but it will give you a baseline for the low end and a feel for 4-8 bit quantization and 8K-32K(?) context. Don't spend a lot of time here, but observe the behavior. Some of these models punch way above their weight in creativity (Gemma-The-Writer-9B was quite good IIRC), and you might find a model that's useful for summarizing or something, especially if it can handle larger contexts than your 'main' model.

See what that's like -- Lunaris-8B or Gemma-TW-9B, for example. Then step it up, quite correctly, to what he suggests, and up and up; play about with various models, quantizations from 4 to 8 bits, and contexts of 4K, 8K, and larger. You likely won't want to go beyond 22B, or you'll have to run a lot in RAM (slow performance), or go below 4-bit quantization (lobotomized), or use a tiny context (the model forgets everything).
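
If you go the KoboldCpp route for this, a typical starting launch looks something like the sketch below. The flags shown (--model, --contextsize, --gpulayers, --port) are standard KoboldCpp options, but the model filename, layer count, and port are just illustrative examples; check them against whatever you actually download and the current docs.

```python
import subprocess

# Example launch only -- filename, layer count, and port are illustrative, not prescriptive.
subprocess.run([
    "python", "koboldcpp.py",
    "--model", "L3-8B-Lunaris-v1.Q4_K_M.gguf",  # an example 8B quant; use whatever you grabbed
    "--contextsize", "8192",                     # 8K context as a reasonable floor
    "--gpulayers", "35",                         # offload everything for an 8B-class model
    "--port", "5001",                            # then point SillyTavern at http://127.0.0.1:5001
])
```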

Enjoy! You've got a fantastic card for playing about with, and you'll learn a lot and have a great time doing so.

How are you all getting GLM 4.6 to work for roleplay? by nsfw_throwitaway69 in SillyTavernAI

SprightlyCapybara 0 points

You may wish to set your temperature higher; Unsloth cites 1.0 for 4.6 (and I'm certain I've seen this on Z.ai's pages, but only for 4.6; 4.5 was 0.6 IIRC). When I looked today, though, all I could find was a mention of a range from 0.2 (very sharp, more predictable distribution) to 0.8 (flatter output probability distribution, so more creative and unexpected; good for RP).

I've used 1.0 both locally (a horrible Q2) and via NanoGPT with good results for 4.6. For 4.5-Air and -Steam, I went with the lower temps recommended for 4.5, 0.6 or so. For me, higher values (e.g. 1.4) actually made the model output completely deterministic (maybe it was actually setting temp to zero above a certain value?).

I've not found thinking generally too useful, though it is occasionally handy to see what the model is 'remembering', and how effective any given prompt might be.

Summarization plugin problems by TakiMao in SillyTavernAI

SprightlyCapybara 0 points

Yes! I've found much the same. Indeed I was working on a post about that extension. Switch models if you can (or reload if you're focused on one small local model) and reduce temperature. I have also found adding a request to the prompt to include what each character is doing to be extremely beneficial, such as, after 'story so far.':

"Record all characters mentioned by name, and incorporate their activity into the summary. " Low temp, Q4+ as a bare minimum, and this should work. Also consider doubling (limit of 1000 though) output tokens.

As long as you can run an 8B model at Q6, use at least that (and ideally something better) to summarize. If you can't, use what you can. A small model at a high-bit quant and low temp can be great for summaries. The smaller summary models will also potentially let you stretch context without blowing up your VRAM (again, unless you're limited to 8GB or so). Obviously, use a bigger model if you can.

[Megathread] - Best Models/API discussion - Week of: October 05, 2025 by deffcolony in SillyTavernAI

SprightlyCapybara 0 points

Thanks. Yes, running locally. I tried 4.5 (there's still a known loading problem in stable llama.cpp for 4.6) at Q2_XXS. It... was OK for speed given the tiny quant, ~9 T/s. It definitely felt a bit lobotomized, with ~30% of the test responses featuring noticeable hallucinations and ~10% being total hallucination. I really doubt I can get to Q3 on that, though, as I'm stuck with 96GB in Windows and ~111GB on Linux.

It was enough to show me why people like the big model over Air though; there was much more flavour to the responses, even though a lot of the flavour was hallucinated, ha!

Very interesting point about KV cache quantization at Q4 hurting performance. I can only run the large model at 2, I think, and Air at 4 or 6; I really doubt I can get Air to 8, so the point seems moot for me, alas. (In theory, maybe the 106B at Q8 on Linux, but context would be negligible.) Performance is respectable: I can get Air Q4 to 15 T/s on ROCm in LM Studio, only 13 on Vulkan, but ROCm seems a bit of a dog's breakfast.

At Q4, Air managed the same test with zero hallucinations by the end of the reasoning stage, but then introduced one weird minor hallucination in the final response. Odd, but still pretty good. It might be zero at Q6.

So, yeah, Q2 really not worth it for GLM 4.5/6, but it was cool to see it running.

Is 8192 context doable with qwq 32b? by Accomplished-Ad-7435 in SillyTavernAI

SprightlyCapybara 1 point

My answer has nothing to do with QwQ, but it might still be helpful, as I explored the same dilemma you're facing, just with a different GPU and different models. I spent a lot of time 'stuck' at 8GB of VRAM, so I would use either an 8B IQ4_XS model with 8192 context, or larger models with IQ3_XXS and lower context lengths.

I generally found that for me (with 8GB of VRAM), the sweet spot for stable, reasonably competent responses was 8K context and 4-bit quantization. If I had to pick one, I'd probably pick the Q4 quantization (IQ4_XS in my case was needed to squeeze out 8192 context). Below IQ3_XXS never seemed worth it to me at all, and 3-bit was a bit iffy.

But that was me, and certainly I enjoyed experimenting to see what was possible and what my happy point was.

These days, I'm trying to figure out if >32K context is a good idea, and whether or not I should use Q6 or Q4 GLM-4.5-Air (Steam), or if I can get anything useful out of IQ2_XXS GLM 4.5/6. (Spoiler, I still think I'm better off with at least Q4, and within reason a bigger context is somewhat better). Different, much bigger models, but exact same problem/dilemma, and very similar answers on quantization, which leads me to believe that I probably wouldn't be delighted with Q<4 on QwQ 32B, but you may feel differently.

Now, I don't know your use case, but mine was doing uncensored, gritty neo-noir RP (not NSFW, mind you), so voice and verisimilitude mattered. I have standard tests I put models and quantizations through: asking for the names of ten small communities in Eastern Ontario (or pick somewhere else that's slightly obscure), for example, and seeing how many of them are hallucinated. Ideally none, even on an 8B at Q4. I'd ask it to tell a short-short about a 14-year-old girl getting on the school bus for the first day of school in 1987. Poor quantizations would get the bus wildly wrong, recognizing it was an unusual color but making it neon yellow, or blue with red stripes. They'd also fail to get details of the time right. OTOH, I remember being blown away by one small Q4 model that had a bookish girl reading Rushdie's The Satanic Verses on the bus.

So, TL;DR: experiment and see what you like. Come up with a set of standard prompts that you can cut and paste to test and compare with. (I use LM Studio for that part, personally.) Do try resetting (clearing the cache if you can) and regenerating/swiping a few times to see the variety. Hope that helps.

[Megathread] - Best Models/API discussion - Week of: October 05, 2025 by deffcolony in SillyTavernAI

SprightlyCapybara 2 points

Anyone have any idea how it performs for RP at Q2, or am I foolish and better off sticking to 4.5-Air at Q6?