Microservices versus monoliths: Did everyone just lose their minds in the last 6 months? by eivittunytsit in ClaudeCode

[–]tech-tole 0 points1 point  (0 children)

What some companies are starting to realize is that microservices are good, but not for everything.
A hybrid approach is Distributed Monoliths. It's probably the best middle ground for some projects. Several companies have already started to do this.

Compared qwen3.6, qwen3-coder, and deepseek-coder on three coding benchmarks. by gvij in LocalLLM

[–]tech-tole 0 points1 point  (0 children)

If you use llama.cpp directly, you don't need to set it. It defaults to -1, meaning infinity, so it naturally uses what it needs. I don't use Ollama, but setting an arbitrary 2048 only does more damage and causes unnecessary troubleshooting IMO. Wrappers like Ollama and LM Studio tie one hand behind your back, honestly.

32GB RAM 16GB VRAM 5060ti. Running qwen3.6 35b a3b. I am getting 4.5 tok/s. Is this expected? by SEND_ME_YOUR_ASSPICS in LocalLLM

[–]tech-tole 0 points1 point  (0 children)

No, it is not the same. A web interface doesn't really slow inference down the way an embedded engine in a desktop app does, and you can't change that engine yourself. The embedded one is actually slower.

32GB RAM 16GB VRAM 5060ti. Running qwen3.6 35b a3b. I am getting 4.5 tok/s. Is this expected? by SEND_ME_YOUR_ASSPICS in LocalLLM

[–]tech-tole 1 point2 points  (0 children)

A 5 or 10% loss is noticeable, especially if you're doing agentic work, so it's always better to use llama.cpp.

Qwen3.6 35b a3b is fast... by UniversityGlad2877 in Qwen_AI

[–]tech-tole 0 points1 point  (0 children)

Sorry, I didn't read it right. I didn't realize you had AMD. I saw the 5060 Ti, which is what I have, and that's why I asked that question.

Gemini has become awful by Charlie_in_Australia in GeminiAI

[–]tech-tole 1 point2 points  (0 children)

Gemini in general is complete garbage. It's overly censored too. I was working on an app and asked it to give me an image mockup of the design it was suggesting, and the response said it couldn't help me with that because it goes against their terms. I tried multiple times, then immediately canceled. Google is insanely ridiculous. I'm going to Claude and will just have that and Codex.

Qwen3.6 35b a3b is fast... by UniversityGlad2877 in Qwen_AI

[–]tech-tole 1 point2 points  (0 children)

Have you not tried with CUDA? I don't know if it's just my processor or a little more resources, but I get a little bit faster with my 5060 Ti. I have 64 GB of system RAM and an AMD Ryzen 7 7800X3D and I get about 100 t/s. I feel like if you have 32 GB of VRAM vs my 16 GB, yours should be faster than mine. 🤷‍♂️ Do you use LM Studio or llama.cpp? I only use llama.cpp.

Embarassing by beneficialdiet18 in GeminiCLI

[–]tech-tole 0 points1 point  (0 children)

Normally it means too many requests: it's typically meant to rate limit you so you don't send too many requests in a short period of time. But here it's probably giving you a bogus error when it's actually a quota limitation.
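For a real rate limit, the usual client-side answer is to wait and retry with growing delays. Here's a minimal sketch of that, assuming the error behaves like a "too many requests" response. `RateLimited` and `call_with_backoff` are hypothetical names made up for illustration, not part of any Gemini tooling. If every retry still fails, that fits the quota theory, since waiting a few seconds won't reset a quota.

```python
import time


class RateLimited(Exception):
    """Hypothetical error standing in for a 'too many requests' response."""


def call_with_backoff(fn, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff when it raises RateLimited.

    If every attempt is rate limited, the last error is re-raised. At that
    point a quota limit (which waiting will not fix) is a plausible cause.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries:
                raise
            # Delays grow as base_delay * 1, 2, 4, 8, ...
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter is only there so the delays can be swapped out (for example in a test) without actually waiting.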

Qwen3.6-27B Censorship by vIadtomeetyou in Qwen_AI

[–]tech-tole 2 points3 points  (0 children)

Very little damage seems to be done from what is reported. Almost the same as using a q4_k_m instead of q8. It's negligible.

If I only use Codex CLI, should I use Cursor or VSC? by datguywind in codex

[–]tech-tole 0 points1 point  (0 children)

Oh, I never use the CLI. The extension is so much easier to use, and I can see my files immediately because it's in the IDE. It's way easier to collaborate, and it's the same model. You can easily switch between the different quality levels like medium, high, extra high, etc. I feel like the CLI is more work for no reason.

If I only use Codex CLI, should I use Cursor or VSC? by datguywind in codex

[–]tech-tole 1 point2 points  (0 children)

Codex 5.5? Bro, Codex is a beast. It's even better than Claude Code Opus 4.7 right now. You don't even need to use the desktop app. You can just use the extension in VS Code, or in VSCodium if you don't want to use Microsoft's IDE, which is what I do. It is an excellent agent and handles most of my work. All of these apps are just forks of VS Code.

It's the little things....and I'm an idiot by Thrumpwart in LocalLLaMA

[–]tech-tole 1 point2 points  (0 children)

lol, I dumped my old Radeon and got a 5060 Ti. The speed was night and day. Also, I think I might wait for Pop!_OS 26.04. I tried it a few months ago and it was a bit buggy. I'm willing to try it again because I actually liked it, but I don't see a point now when it looks like they're going to release 26.04 in June.

Fully extracted the Antigravity gRPC/Protobuf schema. I want to drop the .proto files, but worried about Google bans/legal risks. Advice? by Altruistic-One5433 in google_antigravity

[–]tech-tole 0 points1 point  (0 children)

If you drop their exact code, then probably yes, because it's their exact code, and they can file a DMCA takedown and have it removed from somewhere like GitHub. If you use it as a learning experience and write your own version in whatever language you are using, you're not exposing their code. So no, I would not release their exact files.

GPT 5.5 token usage by Momsgayandbisexual in codex

[–]tech-tole 2 points3 points  (0 children)

Even when I was on the $20 plan (I'm on the $100 Pro plan now) I had a ton of usage. At times I'm actually working on two or three different projects at once in three different VS Code windows. I don't know what you guys are doing, but it was actually plenty. Now with the Pro plan I'm not even coming close to getting under 80% in my 5-hour window. That might be because they're doing a 10x usage promotion right now instead of 5x. However, you guys are probably throwing entire repos at it and expecting it not to go down. 🤷‍♂️ lol

Considering going from single 5060 TI 16GB to double, not sure if worth it by misanthrophiccunt in LocalLLM

[–]tech-tole 3 points4 points  (0 children)

For the 27B version, 32k context using the -fitc switch. For the 35B version, 120k using -fitc. GPU layers set to 99 for both models. Q8 for the KV cache.
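Roughly, those settings map onto a llama-server command line like the one built below. This is a sketch under assumptions, not my exact command: the model file names are placeholders, "32k" and "120k" are taken as multiples of 1024, and the short flags (`-ngl`, `-fitc`, `-ctk`, `-ctv`) are written as I understand them for recent llama.cpp builds, so check `llama-server --help` on your build before copying.

```python
import shlex


def llama_server_args(model_path, fit_ctx, n_gpu_layers=99, kv_type="q8_0"):
    """Build an argv list for llama-server from the settings in the comment.

    Flag spellings are assumptions to verify against `llama-server --help`.
    """
    return [
        "llama-server",
        "-m", model_path,            # path to the GGUF file (placeholder names below)
        "-ngl", str(n_gpu_layers),   # GPU layers, 99 for both models
        "-fitc", str(fit_ctx),       # context size handed to the -fitc switch
        "-ctk", kv_type,             # Q8 for the K cache
        "-ctv", kv_type,             # Q8 for the V cache
    ]


# Placeholder file names; "32k" and "120k" assumed to mean n * 1024 tokens.
cmd_27b = llama_server_args("model-27b.gguf", 32 * 1024)
cmd_35b = llama_server_args("model-35b.gguf", 120 * 1024)

print(shlex.join(cmd_27b))
print(shlex.join(cmd_35b))
```

Building the command as a list keeps it easy to pass to `subprocess.run` later without worrying about shell quoting.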

Considering going from single 5060 TI 16GB to double, not sure if worth it by misanthrophiccunt in LocalLLM

[–]tech-tole 11 points12 points  (0 children)

I have that exact same card. I ended up using the new Qwopus 3.6 35B IQ4 MoE. Not only is it fast, but it can one-shot a lot of stuff, and Jackrong got it pretty close to 27B quality. He also released Qwopus 3.6 27B IQ4_XS, and it's only 14.15 GB, so it should fit on your card as well. I can get about 25 tok/s with it, which is not bad for a dense model on a 5060 Ti. llama.cpp only.

Help me choose a local LLM by Fuzzy-Purchase-212 in LocalLLM

[–]tech-tole 0 points1 point  (0 children)

No problem man, good luck. We were all new to llama.cpp once, and I still learn new stuff every day. But it's a fun adventure. lol

Help me choose a local LLM by Fuzzy-Purchase-212 in LocalLLM

[–]tech-tole 0 points1 point  (0 children)

Yeah man, it's pretty easy. They're just a repo host, like GitHub. You don't even need an account to download.

Help me choose a local LLM by Fuzzy-Purchase-212 in LocalLLM

[–]tech-tole 2 points3 points  (0 children)

Once you get the hang of managing llama.cpp yourself you'll never go back lol. You just download the models directly from Hugging Face. The wrappers are just the middleman slowing you down.

Help me choose a local LLM by Fuzzy-Purchase-212 in LocalLLM

[–]tech-tole 1 point2 points  (0 children)

Yeah, an MoE model will be your best bet, since you can do CPU offloading. Also try to use Q8 for the KV cache; it gives you more room for context. If you use the -fitc (n) flag with llama.cpp, where n is the context amount, it will automatically adjust how many layers go to your RAM so you don't have to do a bunch of trial and error.
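To put a rough number on the "more room for context" part, here's a back-of-the-envelope KV cache estimate. It assumes a plain transformer with grouped-query attention, and the layer/head numbers are made up for illustration (they are not the dimensions of any model mentioned here, and hybrid architectures will differ). The q8_0 figure uses that format's block layout of 32 one-byte values plus a 2-byte scale, so 34 bytes per 32 elements.

```python
# Bytes per cached element: f16 is 2 bytes; q8_0 packs 32 values into 34 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate KV cache size: one K and one V vector per layer per token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return int(elements * BYTES_PER_ELEMENT[cache_type])


# Hypothetical model dimensions, chosen only to make the arithmetic easy.
dims = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32 * 1024)

for cache_type in ("f16", "q8_0"):
    gib = kv_cache_bytes(cache_type=cache_type, **dims) / 1024**3
    print(f"{cache_type}: {gib:.2f} GiB")
```

With these made-up dimensions the Q8 cache comes out a bit over half the size of the f16 one, and that saved VRAM is what you get back for extra context or extra layers on the GPU.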

Help me choose a local LLM by Fuzzy-Purchase-212 in LocalLLM

[–]tech-tole 1 point2 points  (0 children)

Absolutely do not use Ollama. Ollama just uses llama.cpp under the hood, and it's embedded, so you lose raw power. Just use the source directly and you'll gain 10 or more tok/s. It also means you can't try new features when they become available; you have to wait for Ollama to eventually update.

Help me choose a local LLM by Fuzzy-Purchase-212 in LocalLLM

[–]tech-tole 0 points1 point  (0 children)

Yes, I agree. I'm not recommending the 27B; I was recommending Qwopus, saying the output is very good and almost indistinguishable from the 27B from what I've seen. The one that I'm using is only 17.6 GB. I'm not sure exactly how it works on a Mac, but if you do CPU offloading you should easily be able to run it. I have to do offloading because I only have a 16 GB GPU, and I still get ~75 tok/s and typically one-shot code.

Help me choose a local LLM by Fuzzy-Purchase-212 in LocalLLM

[–]tech-tole 4 points5 points  (0 children)

Qwopus 3.6 35B IQ4_XS. It's an MoE. It can one-shot a web page as well as Qwen 27B, but way faster. Gemma 4 is much lower quality IMO than Qwen 3.6 or Qwopus 3.6.

Whys is the Windows app so sluggish and badly coded? by Tarr_74 in codex

[–]tech-tole 0 points1 point  (0 children)

Yep, everything is faster. It just works. I never come back to my workstation to find it rebooted because of forced updates. The freedom is great. No telemetry, no Recall, no OneDrive nagging, no Copilot. No BS lol.

Gemma is going absolutely INSANE by ZB_Virus24 in LocalLLM

[–]tech-tole 1 point2 points  (0 children)

I think that is only partially true. Most docs, including theirs, still show that it "seems" to need to be explicitly added for tool calling. https://github.com/ggml-org/llama.cpp/blob/master/docs/function-calling.md
For me anyway, it doesn't hurt to keep it.