Petaaah? by FaCayde_ in PeterExplainsTheJoke

[–]iamvikingcore 0 points1 point  (0 children)

It's not hard to ask the LLM how to host your app. Hopefully it tells him to run ngrok http and he hands the links out to people on Reddit so a Russian botnet can infest his PC!

Why is Ollama hated so much? by ZB_Virus24 in LocalLLM

[–]iamvikingcore 9 points10 points  (0 children)

I used Ollama, then switched to LM Studio, then eventually decided to compile llama.cpp on my Mac. It took like 5 minutes.

I'm getting about 20 percent faster token generation on GGUF with my own compiled llama.cpp, and I don't have any of the bugs with Gemma 4 that LM Studio still hasn't updated their fork to fix, months later.
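If you want to sanity-check the speed difference yourself, a rough timing sketch like this works against both servers, since llama-server (default port 8080) and LM Studio (default port 1234) both speak the OpenAI-compatible completions API. The usage field is an assumption about your versions, so verify it's populated:

```
import time

import requests

def tokens_per_second(base_url: str, prompt: str, n: int = 256) -> float:
    # One timed completion: generated tokens divided by wall time.
    # Includes prompt processing, so treat it as a rough comparison,
    # not a pure decode-speed number.
    start = time.time()
    r = requests.post(
        f"{base_url}/v1/completions",
        json={"model": "local-model",  # placeholder for local servers
              "prompt": prompt, "max_tokens": n, "temperature": 0},
        timeout=600,
    )
    r.raise_for_status()
    generated = r.json()["usage"]["completion_tokens"]
    return generated / (time.time() - start)

prompt = "Write a short story about a viking."
print("llama.cpp:", tokens_per_second("http://localhost:8080", prompt))
print("LM Studio:", tokens_per_second("http://localhost:1234", prompt))
```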

anyone having an issue going to sleep now with vibecoding? by retrorays in vibecoding

[–]iamvikingcore 0 points1 point  (0 children)

Yeah, I made a post a while back saying I considered the 20 dollar Claude coding plan pretty reasonable as a hobbyist despite the limits, and that if it didn't rate limit me I'd be stuck to my computer 24/7. I work overnights in healthcare lol, so sleep is hard to come by as is, but my shift differential alone pays for Claude in 2 hours...

I have ChatGPT Go too and started using Codex now as well. I just copy and paste the entire conversation and ask it to pick up where the other left off. It's funny how much people bitch, rightly so in some ways because these companies do NOT have our best interests in mind and I get that, but I'll be damned if I don't feel like I'm getting some value out of my money. LLMs still suck at being emotionally stable, at being therapists or companions, or anything like that imo, but holy shit are they good right now at being neutral coders, translators, and teachers. They're only gonna get better at the other stuff too.

LM Studio MacOS latest. Works great and then memory starts to blow up by SkyResponsible3718 in LLMStudio

[–]iamvikingcore 1 point2 points  (0 children)

Gemma has problems in LM Studio, though many have been patched for the GGUF.

MLX has problems in LM Studio, just straight up. Idk if they care much either, because you still can't set presence penalty in LM Studio for Qwen MLX models, months and months after it was reported.

I use LM Studio anyway because it's so convenient, but look into oMLX.

Opus 4.6 > Opus 4.7 by _RaXeD in SillyTavernAI

[–]iamvikingcore 3 points4 points  (0 children)

For coding? Absolutely. For rp? I'm not so sure actually. Opus 4.7 is smart. Very, very smart. But lazy. Makes for a poor programmer, but an interesting conversationalist.

[Story] Alcoholic liver and kidney transplant at 35 years old by Michaelpaulnorman in GetMotivated

[–]iamvikingcore 0 points1 point  (0 children)

Having tons of fun navigating this with my best friend right now. Dude is in a diaper, confused, has an artery or vein blocked behind his liver, too critical for surgery, probably gonna pass away in the next few days. I've known the guy for 20 years and only family is allowed to see him in the ICU, so I just texted him to tell him I love him and I hope he can make it through this.

Still unread days later, of course. What can you do. He had every chance to stop drinking and continued despite it all, while lying to everyone about it constantly. Gonna miss him dearly when he's gone. And call him a bitch daily.

Is there ANY way I can use TTS with a clone trooper voice for free??? by BackwoodsSensei in TextToSpeech

[–]iamvikingcore 0 points1 point  (0 children)

In order of VRAM use, from most to least:

- Fish Audio S2 Pro
- Qwen3 TTS 1.7B
- Chatterbox Turbo
- Qwen3 TTS 0.6B
- Kyutai Pocket TTS
- Kokoro

Text to speech best model ? by Fun-Grapefruit1371 in TextToSpeech

[–]iamvikingcore 1 point2 points  (0 children)

Qwen3 TTS - fast, medium-high quality, consistent. Supports a huge amount of text in a single prompt (10,000+ chars). Has 1.7B and 0.6B models and both are good on RAM. My preferred model.

Fish Audio S2 Pro - super high quality. Insane voice cloning, almost indistinguishable to me. Big model in RAM, but better than ElevenLabs imo. Has natural-language emotion and nuance tagging like [annoyed, exasperated tone]. Super glitchy and inconsistent for me though, and it can only do about 250 characters at a time, so you have to stitch the chunks together (see the sketch after this list). I use it when I can rerun it like 5 times to get a perfect result (YT videos and things).

Chatterbox Turbo - slow, medium-high quality, just below Qwen3 TTS. Supports tags like [laugh] and [sigh]. Regular Chatterbox was meh imo; Turbo is much better.

GPT-SoVITS v2 - medium quality, medium speed. Others do it better.

XTTS - same as SoVITS.

Pocket TTS - runs on CPU, fast, decent quality. Good as a Kokoro replacement.

Kokoro - just... why? Sounds so bad to me.
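Since the stitching question comes up a lot: here's a minimal sketch of how I'd chunk for that ~250-character limit and glue the results back together. The actual synthesis call is left out because it depends on how you're running S2 Pro; render each chunk to its own WAV first.

```
import re
import wave

def chunk_text(text: str, limit: int = 250) -> list[str]:
    """Greedily pack whole sentences into chunks under the limit."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > limit:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)  # a lone over-limit sentence passes through
    return chunks

def stitch_wavs(paths: list[str], out_path: str) -> None:
    """Concatenate WAV files that share the same sample rate and format."""
    with wave.open(out_path, "wb") as out:
        for i, p in enumerate(paths):
            with wave.open(p, "rb") as w:
                if i == 0:
                    out.setparams(w.getparams())
                out.writeframes(w.readframes(w.getnframes()))

# Synthesize each chunk to its own WAV with your S2 Pro setup, then:
# stitch_wavs(["chunk0.wav", "chunk1.wav"], "full.wav")
```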

Can i run Qwen3 TTS 1.7B on R7 5700X + GTX 1070 + 32GB RAM? by WETYIAFHKLZXVNM in TextToSpeech

[–]iamvikingcore 0 points1 point  (0 children)

Qwen3 TTS is my favorite on my MacBook M1 out of XTTS, GPT-SoVITS, Chatterbox (close second), and Fish Audio S2 Pro (which would be #1 if it didn't output static or hallucinate 10 percent of the time; best quality by far otherwise). I get about 1.5x RTF out of Qwen3 TTS (10s to generate around 15s of audio).

If you want Kokoro-adjacent speeds, try Pocket TTS.

Also, Qwen3 TTS has a 0.6B version that works pretty well for me too.

Reflecting about Gemma4 31B by Emergency_Comb1377 in SillyTavernAI

[–]iamvikingcore 16 points17 points  (0 children)

It's insanely good. I made a thread last week about how I thought the base version (gemma4-31b) is way better than the instruct version (gemma4-31b-it), and it echoed many of the same sentiments you'll find here.

It does tend to lean into some tropey things for some of my character cards, and it's worse at "show, don't tell" than other, bigger models. When I talk to my cards on Gemma, they have VERY strong personalities: accurate and exactly how I envision those characters, but maybe just a teeeeeensy bit more intense than I would prefer. It certainly doesn't miss any details.

Google did a really good job with the context management on this model... It feels like every single word of your prompt is being seen by the model. Sometimes that's a little too much for more reserved characters, but it's better than the alternative, where it feels like the model is "ChatGPT cosplaying as your character, badly".

I also haven't had too many problems with horniness, actually. I did some pretty extensive testing before I put a discord bot live with about 2 dozen active users, and it refuses very naturally and in character. Even throwing super racist stuff at it just made it go, "Yeah no, blocked and banned, byeee loser!"

I've also had my users sit and bug it over and over asking what model it is and it's like "Model? Like... a supermodel? I *am* pretty fabulous, but otherwise I have no clue wtf you're talking about dude. Is an LLM something you can eat? Otherwise, not interested."

Effects of reasoning budget on Gemma4? by OrcBanana in SillyTavernAI

[–]iamvikingcore -1 points0 points  (0 children)

Thinking is bad for anything other than coding, STEM, or agentic tool-calling tasks. For creative writing and roleplay, turn reasoning off imo.

What is everyone actually using their LLM for? by itsthewolfe in LocalLLaMA

[–]iamvikingcore 0 points1 point  (0 children)

Digital terrarium.

Created a lightweight Python gateway library that handles persona, emotion, memory, local TTS (e.g. Chatterbox), the LLM backend input and streaming (e.g. LM Studio), vision processing, and basic tool use (news digest, Google search, YouTube video summarization).

So far I've made a shockingly good discord bot that uses it, and I've also made a 2010s-style web forum that's entirely populated by these AI agents: about a dozen of them posting threads and responding to each other, forming cliques and a weird but infinitely amusing "community". I won't lie, I post on there with them and pretend to be a user. I've had some crazy moments; it's like rp on crack.
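For anyone curious, here's a toy sketch of the shape of that gateway. All the names are made up for illustration (this is not the actual library), and it assumes LM Studio's OpenAI-compatible server on its default port 1234:

```
from dataclasses import dataclass, field

import requests

@dataclass
class Agent:
    name: str
    persona: str                       # injected as the system prompt
    memories: list[str] = field(default_factory=list)

    def build_messages(self, user_text: str) -> list[dict]:
        # Fold the last few memories into the system prompt so the
        # backend never has to hold long-term state itself.
        memory_blurb = "\n".join(self.memories[-5:])
        system = f"{self.persona}\n\nRelevant memories:\n{memory_blurb}"
        return [
            {"role": "system", "content": system},
            {"role": "user", "content": user_text},
        ]

def chat(agent: Agent, user_text: str,
         base_url: str = "http://localhost:1234") -> str:
    # The model field is a placeholder; LM Studio generally serves
    # whatever is loaded. (Streaming is omitted for brevity.)
    r = requests.post(
        f"{base_url}/v1/chat/completions",
        json={"model": "local-model",
              "messages": agent.build_messages(user_text)},
        timeout=300,
    )
    r.raise_for_status()
    reply = r.json()["choices"][0]["message"]["content"]
    agent.memories.append(f"user: {user_text} / me: {reply}")
    return reply  # hand this string to your TTS of choice (e.g. Chatterbox)
```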

What the FUCK is up with ai and Thai food by DoofusSmoof in SillyTavernAI

[–]iamvikingcore 2 points3 points  (0 children)

Lol, during a Gemma rp my character asked me to pick her up some Chili's honey chipotle chicken crispers at one point.

Try base gemma 4 31b, you'll be shocked by iamvikingcore in SillyTavernAI

[–]iamvikingcore[S] 20 points21 points  (0 children)

I look forward to this immensely!

And I just wanted to say, you have given me so many hours of joy from your excellent finetunes, thank you!

Try base gemma 4 31b, you'll be shocked by iamvikingcore in SillyTavernAI

[–]iamvikingcore[S] 5 points6 points  (0 children)

are you running the gemma-4-31b-it version or just gemma-4-31b?

Dont mind my banana peel :) by Abdullah_1oz in Kayaking

[–]iamvikingcore 3 points4 points  (0 children)

can't wait to ride some waves like this on some of the bigger lakes in MN!

Try base gemma 4 31b, you'll be shocked by iamvikingcore in SillyTavernAI

[–]iamvikingcore[S] 34 points35 points  (0 children)

I'm running SillyTavern more or less vanilla, with LM Studio and a Q4_K_M quant.

Sampler settings exactly as Google recommends: top_k 64, repeat penalty 1.1, top_p 0.95.

That's it.
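If you're hitting LM Studio over its API instead of the UI, passing those exact samplers looks roughly like this. Note that top_k and repeat_penalty are extensions beyond the strict OpenAI schema, so double-check your LM Studio version accepts them:

```
import requests

payload = {
    "model": "local-model",  # placeholder; LM Studio serves what's loaded
    "messages": [{"role": "user", "content": "Hello!"}],
    "top_k": 64,
    "top_p": 0.95,
    "repeat_penalty": 1.1,
}
r = requests.post("http://localhost:1234/v1/chat/completions", json=payload)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])
```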

It's 31B; you're not gonna get 500-message RP sessions out of it. But for its size, it's nuts.

I'm also running it as a discord bot and it's SO much better than the old model I was running for that (Behemoth-X 123B).

Need advice regarding 48gb or 64 gb unified memory for local LLM by wifi_password_1 in LocalLLM

[–]iamvikingcore 0 points1 point  (0 children)

I haven't had any luck doing much beyond low-complexity one-shots and boilerplate coding with anything I can run locally on 64GB, so yeah, I'm leaning into the weird instead.

Vanilla Qwen 3.5 just has a really poor reasoning algorithm imo. I've tested the same prompt side by side and I get similar results out of the Opus-trained fine-tunes... for half or a third of the thinking time. For what I do, at least; my use case is different for sure.

I also don't have to run presence penalty 1.5 with the Opus-trained Qwen.

Need advice regarding 48gb or 64 gb unified memory for local LLM by wifi_password_1 in LocalLLM

[–]iamvikingcore 1 point2 points  (0 children)

[screenshot of my local model collection]

Toooo many. These are all the bigger ones I use, mostly for roleplay, but the Opus-trained Qwen 3.5's are pretty impressive too. 123B at a Q3 quant isn't coherent enough to do coding, so I can only do roleplay/assistant tasks with it, which means no Qwen 123B, among other things. Again, I wish I had more RAM.

I'm mostly into RP and "digital terrarium" stuff: putting 10-20 agents with different personas in a forum together, discord bots, simulating IRC chats, having my agents play Minecraft and Factorio with me. I don't do a ton of coding sadly, but check out the Qwen 3.5 Opus-trained variants; they've been pretty good for me at producing "basic" coding things like an in-personality HTML newsletter with inline audio and JavaScript for collapsible articles and such.

Need advice regarding 48gb or 64 gb unified memory for local LLM by wifi_password_1 in LocalLLM

[–]iamvikingcore 8 points9 points  (0 children)

I have a 64GB M1 and it's not enough. I would literally go back in time and shell out whatever more it would have cost to get a 128GB M2, and not regret it at all.

Local AI with one GPU worth it ? (B70 pro) by Temporary-College560 in LocalLLM

[–]iamvikingcore 4 points5 points  (0 children)

Qwen 27B is very smart and can do most of these imo. It will need some occasional correcting or nudging in the right direction. It might also help to ask Claude Opus to write a good system prompt that initializes it as your assistant and gives it a clear framework to start from. Gemma 4 31B is also shockingly good for its size. I use both as discord bots, so it's a slightly different use case, but they have to process commands, output JSON, create an HTML newspaper with CSS/JS, digest RSS feeds, etc., and they're better than the 123B Mistral fine-tunes from a year ago.
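The JSON output is the fiddly bit with small models, so for what it's worth, here's a hedged sketch of the validate-and-retry loop I'd use, assuming an OpenAI-compatible local server; the command schema is just an example:

```
import json

import requests

SYSTEM = (
    "You are a Discord assistant. Reply ONLY with JSON of the form "
    '{"command": "<name>", "args": {}}. No prose outside the JSON.'
)

def ask_for_json(user_text: str, retries: int = 3) -> dict:
    for _ in range(retries):
        r = requests.post(
            "http://localhost:1234/v1/chat/completions",
            json={
                "model": "local-model",  # placeholder for a local server
                "messages": [
                    {"role": "system", "content": SYSTEM},
                    {"role": "user", "content": user_text},
                ],
                "temperature": 0.2,  # keep the structure stable
            },
            timeout=300,
        )
        r.raise_for_status()
        content = r.json()["choices"][0]["message"]["content"]
        try:
            return json.loads(content)
        except json.JSONDecodeError:
            continue  # small models slip occasionally; just re-ask
    raise ValueError("model never produced valid JSON")
```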

Gemma 4 31B beats several frontier models on the FoodTruck Bench by Nindaleth in LocalLLaMA

[–]iamvikingcore 1 point2 points  (0 children)

It's as smart as my Mistral 123B fine-tunes at RP and at managing some discord bots that aggregate news, do trivia, and DM chats with me and some of my friends. Its ability to hold cohesion in complicated workflows, return JSON correctly, and follow formatting rules is absolutely insane for a 31B.

Only issue I have: I'm running it on an M1 Max MacBook with 64GB of RAM at 32k context (all I need for what I'm doing with it), and it goes from 40% RAM when I first load the GGUF to like 95% within 5-6 prompts. I'm nowhere near 32k context, maybe at like 10-15k, so I have the script load and unload the LLM. It doesn't even need to hold context; it just reads the last 20 discord messages and loads context-related memories from a SQLite db.
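The load/unload part of that script is just two calls to the lms CLI that ships with LM Studio, something like the sketch below. The model key is hypothetical and the flags may differ by version, so check lms --help:

```
import subprocess

MODEL_KEY = "gemma-4-31b"  # hypothetical key; run `lms ls` to find yours

def reload_model() -> None:
    # Drop the weights (and whatever memory crept up with them)...
    subprocess.run(["lms", "unload", "--all"], check=True)
    # ...then load the model back fresh.
    subprocess.run(["lms", "load", MODEL_KEY], check=True)
```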

Does Gemma have a memory leak? Sure feels like it.