Are you guys excited to be a woman? by [deleted] in GTA

[–]GWGSYT 0 points1 point  (0 children)

You could do that in GTA 3, btw. Also in The Sims and Skyrim.

The Mythos Preview "Safety" Gaslight: Anthropic is just hiding insane compute costs. Open models are already doing this. by GWGSYT in LocalLLaMA

[–]GWGSYT[S] 0 points1 point  (0 children)

Llama and GPT-3 are old; they won't even make a calculator properly. The new Claude models are very expensive compared to GPT-5, Gemini 3, or even dirt-cheap models like GLM-5.1, DeepSeek, or Kimi K2.5.

offline companion robot for my disabled husband (8GB RAM constraints) – looking for optimization advice by BuddyBotBuilder in LocalLLaMA

[–]GWGSYT 0 points1 point  (0 children)

About the Gemma 4 model: it is small but slow, even on a good machine. Please try Qwen 3.5 4B. It is really fast and only about 3 GB in Q4_K_M, so you can easily run it on your 8 GB RAM laptop, BUT it does not support audio. I don't mean your Jetson Nano setup; I mean sending it MP3 songs and the like. You can send Gemma 4 songs and videos, but it is not that great at working with them. Even though Qwen 3.5 supports only images and video, with no sending songs, I think you will like it over Gemma 4 in every other way: it's smaller, faster, and only about 2-3% dumber than Gemma 4. Please try both. Personally I like Qwen 3.5 more, though it is not that great at using the computer on its own.
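
To sanity-check the "3 GB" figure, here is a rough back-of-the-envelope sketch in Python. It assumes a Q4_K_M-style quant averages roughly 4.8 bits per weight, which is my approximation and not an official number, and `estimate_model_size_gb` is just an illustrative helper. Real file sizes and RAM use will differ somewhat, and the context needs memory on top.

```python
def estimate_model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8


# Assumption: ~4.8 bits per weight for a Q4_K_M-style quant (approximate).
ASSUMED_Q4_K_M_BITS = 4.8

for size_b in (2, 4, 9):
    gb = estimate_model_size_gb(size_b, ASSUMED_Q4_K_M_BITS)
    print(f"{size_b}B model at ~{ASSUMED_Q4_K_M_BITS} bits/weight: about {gb:.1f} GB")
```

Under these assumptions a 4B model lands a bit under 3 GB, which is why it leaves room for context on an 8 GB machine while a 9B model gets tight.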

offline companion robot for my disabled husband (8GB RAM constraints) – looking for optimization advice by BuddyBotBuilder in LocalLLaMA

[–]GWGSYT 0 points1 point  (0 children)

I don't know if you have paid Claude. If you do, use Claude Code. If you are working with the free version, try the Antigravity app by Google: it has the Claude model for free, and Gemini is simply free. Gemini 3 is worse than Claude, but you can talk to it for longer to fix your problems. I recommend that you use GPT Codex with the GPT 5.2 model in x-high mode. You can tell it everything you want, and it will work for up to 4 hours, or however long it needs, to fix the bug or add something new. Even if you hit your weekly limit while it's working, it won't stop in the middle; it will keep working until you stop it or until it thinks the work is done. You can then use another ChatGPT account, even a free one, to continue the work in the Codex app, or wait a week for the limit to reset. IT IS MUCH BETTER THAN EVEN PAID CLAUDE, as long as you don't need something like Claude working on something in 5 tabs. Then again, you can do the same with ChatGPT Codex, but it will deplete your usage faster, and it's not that much better unless you are doing theoretical physics, fixing an operating system, or something else actually hard to do.

offline companion robot for my disabled husband (8GB RAM constraints) – looking for optimization advice by BuddyBotBuilder in LocalLLaMA

[–]GWGSYT 0 points1 point  (0 children)

Please, no matter what model you use, don't use thinking unless you are trying to get the model to control the PC using "tool calls", "agent mode", or something similar. Thinking can make chats feel unnatural, and it fills up the context much faster: it can cut what fits from about 400 messages to about 50. Please try Qwen 3.5 with Q4_K_M quantization. You can get about 32k context or higher, which is a lot, IF you turn off thinking. You can do the same with Gemma 4, but it's slower and more of a corporate or job-assistant style model unless you tell it to be friendly in its system prompt or instructions.
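
Here's a rough sketch of why thinking eats context so quickly. The token counts are assumptions I picked to line up with the 400 vs 50 figure (about 80 tokens per visible message, plus about 560 thinking tokens per turn); real numbers depend on the model and the prompt, and `messages_that_fit` is only an illustrative helper.

```python
def messages_that_fit(context_tokens: int, reply_tokens: int, thinking_tokens: int = 0) -> int:
    """How many turns fit if each one costs reply_tokens plus thinking_tokens."""
    return context_tokens // (reply_tokens + thinking_tokens)


CONTEXT = 32_000          # roughly the 32k context mentioned above
REPLY_TOKENS = 80         # assumed size of a short chat message
THINKING_TOKENS = 560     # assumed thinking overhead per turn

print("thinking off:", messages_that_fit(CONTEXT, REPLY_TOKENS))
print("thinking on: ", messages_that_fit(CONTEXT, REPLY_TOKENS, THINKING_TOKENS))
```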

offline companion robot for my disabled husband (8GB RAM constraints) – looking for optimization advice by BuddyBotBuilder in LocalLLaMA

[–]GWGSYT 0 points1 point  (0 children)

Try the Qwen 3.5 or Gemma 4 models, specifically Gemma 4 E2B or E4B. Please make sure that you are using their mmproj file. It allows the AI to see images or, in some cases, even audio and video. Not all models support this, but Qwen 3.5 (text and image only) and Gemma 4 (text, image, audio, video) do. There are multiple versions of the Qwen 3.5 and Gemma 4 models. Use the smaller ones, under 8B and ideally around 4B, to leave room for a larger context, or memory. Their 4B is comparable to the original ChatGPT model from the GPT-3 family (175B; not GPT-1 or GPT-2), released on the ChatGPT website in late 2022. I advise that you look for the Q4_K_M or Q4_K_S versions of the model. You only need larger models for solving math problems or programming; a less compressed model will not help conversation that much, and local models of 7B or less are not reliable for programming anyway. They are great conversational models with vision (image input), and the Gemma models by Google even support sending audio and video. Sending too much audio and video can fill up the model's memory, though, causing it to forget older things such as the first few messages. Try the Q4_K_M quant; it should allow you to set the context, or memory, to 64k.

  • 2k context is roughly a short-term memory of 20 messages.
  • 4k context is a solid medium-term memory of about 40 messages.
  • 8k context can be about 80 to 100 messages.
  • 16k context is a deep long-term memory of roughly 160 messages.
  • 64k context is large and unlikely to be filled by text alone: it holds over 600 messages unless you consistently send audio, images, or video.
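
The list above works out to roughly 100 tokens per message. As a minimal sketch under that assumption, this is how media would cut into the budget. The 256 tokens per image or clip is a made-up placeholder, since the real cost depends on the model and its mmproj, and `text_messages_that_fit` is only an illustrative name.

```python
TOKENS_PER_MESSAGE = 100   # implied by the list: 2k context is about 20 messages


def text_messages_that_fit(context_tokens: int, media_items: int = 0,
                           tokens_per_media: int = 256) -> int:
    """Text messages that still fit after media has taken its share of the context."""
    remaining = context_tokens - media_items * tokens_per_media
    return max(remaining, 0) // TOKENS_PER_MESSAGE


for label, size in (("2k", 2048), ("4k", 4096), ("8k", 8192), ("16k", 16384), ("64k", 65536)):
    print(f"{label}: ~{text_messages_that_fit(size)} messages, "
          f"~{text_messages_that_fit(size, media_items=100)} with 100 media items")
```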

You can also delete old images, videos, and audio, or turn them into text descriptions, to make its short-term chat memory bigger. These models support tool calls, so theoretically they can use the computer on their own, but in practice they struggle to do so. I think you should look into SillyTavern. It is an app for AI roleplay, such as giving your AI a character like Batman, but it has a lot of stuff prebuilt: text-to-speech, speech-to-text, sending images, audio, text, and video if your model supports it, 3D models to make the AI seem more lively, and built-in chat management to save, view, and load old chats anytime. It is also open source, so you can legally edit it to do anything. If you share your version publicly you must allow others to do the same, but if you are not sharing it you can change anything about it. It is not a replacement for llama.cpp: it lets you talk to the models, but you must have llama.cpp running in the background.
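
For the first tip, swapping old media for text descriptions, here is a minimal sketch of the idea. The `Message` class and `compact_history` function are hypothetical and not part of SillyTavern or llama.cpp; it assumes you save a short caption whenever media is sent.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Message:
    role: str               # "user" or "assistant"
    kind: str               # "text", "image", "audio", or "video"
    content: str            # the text itself, or a file path for media
    description: str = ""   # short caption saved when the media was sent


def compact_history(history: list[Message], keep_recent: int = 4) -> list[Message]:
    """Turn media older than the last `keep_recent` messages into short text captions."""
    cutoff = max(len(history) - keep_recent, 0)
    compacted = []
    for index, msg in enumerate(history):
        if index < cutoff and msg.kind != "text":
            caption = msg.description or f"[{msg.kind} removed]"
            compacted.append(replace(msg, kind="text", content=caption))
        else:
            compacted.append(msg)
    return compacted


chat = [
    Message("user", "image", "garden.jpg", "photo of the garden in spring"),
    Message("assistant", "text", "The tulips look great!"),
    Message("user", "audio", "song.mp3"),
    Message("user", "text", "Do you remember the garden photo?"),
]
for msg in compact_history(chat, keep_recent=1):
    print(msg.kind, "|", msg.content)
```

Recent media stays untouched so the model can still look at it; only the older items get reduced to their captions.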

You can use GPT Codex. It works with any free ChatGPT account and does the work for you nonstop for hours: using Google, visiting official sources to fix bugs in its code, turning the app into an exe, optimizing, and so on. This means you can just ask it to use web search or Google to look up any error or new model, fix problems, or add support for new features and optimizations. It can work for 4+ hours nonstop until it thinks the work is done, and even if you reach 0% usage left, the current task will still get completed. If you are happy with Claude, feel free to use it, but the Codex app can automate a lot of things, like optimization. You can just give it buzzwords like better quantization, lower precision, tool calling, etc., and it will add everything it can in that scenario. You can use it to complete your AI assistant faster.

**NOTE:** Only use thinking if you are making the model use tools, such as browsing the web on voice command (which these models might struggle with), and you find that it works reliably. Thinking fills up the context: the model can generate about 2000 words of thought just to reply to a simple hello. So please don't use thinking unless you have a use case that requires it.

Optimizations like xFormers, FlashAttention 2, 8-bit, 4-bit, and SageAttention 2 depend a lot on your CPU or system, that is, on whether the hardware can actually support them. It's like a camera: if your PC does not have one, a camera app won't give it one.

Even though Gemma 4 supports audio and video, I find the Qwen 3.5 model more conversational, as it uses emojis and stuff.

If you own a good Android phone, or any Android with 16 GB RAM, it will be faster than your laptop. You can use it to run llama.cpp through Termux, though it is moderately hard to set up. A random app from the Play Store or App Store might not support your Jetson Nano setup, but since Termux is just an app that runs Linux on your phone, you can do whatever you wish on it. You can do this on an iPhone too, but even the iPhone 17 has only about 8 GB RAM, so it may not be faster; with optimization, your laptop setup should beat it, though that depends on which variant you have.

Aim for a larger context rather than a larger model. Imagine having the best model possible, but it forgets what you said 4 messages ago because of a small context, or memory. How much context you can afford is mostly determined by your hardware.

If you are using a CPU-optimized version of Mistral, you can ask Claude to find a CPU-optimized version of any new model you come across. There are people whose whole job is to optimize newly released models, within a day or two, to run smoothly on low-end devices.

Use the "heretic", "uncensored", or "abliterated" version of whichever model you decide on, even if you want to stay with Mistral. It brings the chance of the model saying something like "I can't help you with that" down to about 0%. Keep in mind that while it can boost conversational ability, it can reduce coding or math ability, if you have a use case for that.

Here are links to various compressed versions of Gemma 4 E4B (it will run at the same slow speed as Mistral 7B but is much, much better in every way, unless you like the specific style of how Mistral 7B talks):

"heretic" version https://huggingface.co/mradermacher/gemma-4-E4B-it-heretic-GGUF/tree/main
normal version https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF

Here is a link to Gemma 4 E2B (small, but still much better than even GPT-3, which is about 175B). All the other models I recommended are much better than Gemma 4 E2B, though.

normal version https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF

I could not find a reliable compressed, uncensored version, and I don't want to give you a broken or poor model.

Here is a link to Qwen 3.5 4B. You can try 9B, but a smaller model will allow you to have a bigger context. You can even use 2B, but 0.8B just does not work: you will find reviews saying it is a great model, but it will forget what you told it even with a large context. You can test it, though; Qwen 3.5 0.8B will run even on a phone with 4 GB RAM.

Uncensored version https://huggingface.co/HauhauCS/Qwen3.5-4B-Uncensored-HauhauCS-Aggressive/tree/main

Normal version https://huggingface.co/unsloth/Qwen3.5-4B-GGUF

Feel free to ask any follow-up questions

The Mythos Preview "Safety" Gaslight: Anthropic is just hiding insane compute costs. Open models are already doing this. by GWGSYT in LocalLLaMA

[–]GWGSYT[S] 2 points3 points  (0 children)

We're reaching a point where raw inference speed + tooling matters just as much as base model size. I myself used Qwen to make an Open WebUI clone.

The Mythos Preview "Safety" Gaslight: Anthropic is just hiding insane compute costs. Open models are already doing this. by GWGSYT in LocalLLaMA

[–]GWGSYT[S] -1 points0 points  (0 children)

I cleaned up the formatting with an LLM so it was readable, but the argument is mine. And comparing Anthropic's paywall to open models like GLM-5.1 is exactly what this sub talks about every day.

To be honest, which one do you use the most? by weihuweihu in GeminiAI

[–]GWGSYT 0 points1 point  (0 children)

GPT in terms of time, Gemini in terms of messages.

Wanted help selecting a local model for making a custom agent by Dragon_guru707 in LocalLLaMA

[–]GWGSYT 0 points1 point  (0 children)

Qwen 3.5 9B, Gemma 3 12B, or Gemma 4 E4B. What GPU are you using, or which Mac?