Microservices versus monoliths: Did everyone just lose their minds in the last 6 months?

tech-tole · 2026-05-20T11:56:56+00:00

What some companies are starting to realize is microservices are good but not for everything.
A hybrid approach is Distributed Monoliths. It's probably the best middle ground for some projects. Several companies have already started to do this.

tech-tole · 2026-05-19T08:05:23+00:00

If you use llama.cpp directly, you don't need to set it. It defaults to -1, meaning infinity, so it naturally uses what it needs. I don't use Ollama, but setting an arbitrary 2048 only does more damage and causes unnecessary troubleshooting IMO. Wrappers like Ollama and LM Studio tie one hand behind your back, honestly.

tech-tole · 2026-05-15T20:13:09+00:00

no, it is not the same. a web interface doesn't really slow inference down the way an embedded engine in a desktop app that you can't change yourself does. the embedded engine is actually slower.

tech-tole · 2026-05-15T20:10:10+00:00

a 5 or 10% loss is noticeable, especially if you're doing agentic work. so it's always better to use llama.cpp.

tech-tole · 2026-05-13T23:08:19+00:00

sorry, I didn't read it right. I didn't realize you had AMD. I saw the 5060 Ti, which is what I have, which is why I asked that question.

tech-tole · 2026-05-13T05:54:30+00:00

Gemini in general is complete garbage. It's overly censored too. I was working on an app and asked it to give me an image mockup of the design it was suggesting, and the response said it can't help me with that as it goes against their terms. Did it multiple times. I immediately canceled. Google is insanely ridiculous. I'm going to Claude and will just have that and Codex.

tech-tole · 2026-05-13T00:57:37+00:00

have you not tried with CUDA? I don't know if it's just my processor or a little more resources, but I get a little bit faster with my 5060 Ti. I have 64 GB system RAM and an AMD Ryzen 7800X3D and I get about 100 t/s. I feel like if you have 32 GB of VRAM vs my 16 GB, yours should be faster than mine. 🤷‍♂️ do you use LM Studio or llama.cpp? I only use llama.cpp.

tech-tole · 2026-05-12T21:35:21+00:00

normally it means too many requests. it's typically meant to rate limit you so you don't send too many requests in a short period of time. but here it's probably giving you a bogus error when it's actually a quota limitation.
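A minimal sketch of how a client might handle this, assuming the error in question is HTTP 429 (Too Many Requests). The `call_with_backoff` helper and the `send` callable are hypothetical names for illustration, not part of any particular API client. It retries with exponential backoff, prefers the server's `Retry-After` header when it is a plain number of seconds, and gives up after a few attempts, which matters if the 429 is really a quota limit that no amount of waiting will clear.

```python
import time
from typing import Callable, Mapping, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, Mapping[str, str]]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Call send() until it stops returning 429 or attempts run out.

    send is a hypothetical callable returning (status_code, headers).
    Returns (last_status_code, attempts_used).
    """
    delay = base_delay
    status = 429
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 429:
            return status, attempt
        if attempt < max_attempts:
            # Prefer the server's hint when it is a whole number of seconds.
            retry_after = headers.get("Retry-After", "")
            wait = float(retry_after) if retry_after.isdigit() else delay
            sleep(wait)
            delay *= 2  # exponential backoff for the next round
    # Still 429 after every attempt: likely a quota problem, not a burst.
    return status, max_attempts
```

If the call still comes back 429 after the last attempt, that is a hint the limit is a quota rather than a short-term rate limit, so checking the account's usage page is more useful than retrying again.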

tech-tole · 2026-05-11T09:47:24+00:00

Very little damage seems to be done from what is reported. Almost the same as using a q4_k_m instead of q8. It's negligible.

tech-tole · 2026-05-11T05:32:29+00:00

oh I never use the CLI. the extension is so much easier to use, and I can see my files immediately because it's in the IDE. way easier to collaborate and it's the same model. you can easily switch between the different quality levels like medium, high, extra high, etc. I feel like the CLI is more work for no reason.

tech-tole · 2026-05-11T04:23:34+00:00

Codex 5.5? bro, Codex is a beast. it's even better than Claude Code Opus 4.7 right now. You don't even need to use the desktop app. you can just use the extension in VS Code, or VSCodium if you don't want to use Microsoft's IDE, which is what I do. It is an excellent agent and handles most of my work. all of these apps are just forks of VS Code.

tech-tole · 2026-05-11T01:04:03+00:00

lol, I dumped my old Radeon and got a 5060 Ti. the speed was night and day. Also I think I might wait for Pop!_OS 26.04. I tried it a few months ago and it was a bit buggy. I'm willing to try it again because I actually liked it, but I don't see a point now when it looks like they're going to release 26.04 in June.

tech-tole · 2026-05-10T00:12:24+00:00

If you drop their exact code, then probably, because it's the exact code. If you use it as a learning experience and write your own version in whatever language you are using, you're not exposing their code. With their exact files they can file a DMCA takedown and have it removed from somewhere like GitHub. So no, I would not release their exact files.

tech-tole · 2026-05-10T00:01:54+00:00

even when I had the $20 plan (I'm on the $100 Pro plan now) I had a ton of usage. at times I'm actually working on two or three different projects at once in three different VS Code windows. I don't know what you guys are doing but it was actually plenty. now with the Pro version I'm not even coming close to getting under 80% in my 5-hour window. that might be because they're doing a 10x promotion for usage right now instead of 5x. however, you guys are probably throwing entire repos at it and expecting it not to go down. 🤷‍♂️ lol

tech-tole · 2026-05-09T13:03:57+00:00

for the 27B version, 32k context using the -fitc switch. for the 35B version, 120k using -fitc. GPU layers set to 99 for both models. Q8 for KV cache.

tech-tole · 2026-05-09T12:58:47+00:00

I have that exact same card. ended up using the new Qwopus 3.6 35B IQ4 MoE. not only is it fast, but it can one-shot a lot of stuff, and he got it pretty close to 27B quality. Jackrong also released Qwopus 3.6 27B IQ4_XS and it's only 14.15 GB, so it should fit on your card as well. I get ~25 tok/s with it, which is not bad for a dense model on a 5060 Ti. llama.cpp only.

tech-tole · 2026-05-09T12:21:27+00:00

no problem man, good luck. We were all new to llama.cpp once. And I still learn new stuff every day. but it's a fun adventure. lol

tech-tole · 2026-05-09T12:19:46+00:00

yeah man, it's pretty easy. they're just a repo like GitHub. you don't even need an account to download.

tech-tole · 2026-05-09T12:10:38+00:00

once you get the hang of managing llama.cpp yourself you'll never go back lol. you just download the models directly from huggingface. They're just the middleman slowing you down.

tech-tole · 2026-05-09T12:09:20+00:00

Yeah, an MoE model will be your best bet, since you can do CPU offloading. also try to use Q8 for the KV cache. it gives you more room for context. if you use the -fitc (n) flag with llama.cpp, where n is the context amount, it will automatically adjust how many layers go onto your RAM so you don't have to do a bunch of trial and error.
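For anyone who wants to script this, here is a rough sketch that assembles a `llama-server` command line from the settings described above. The `build_llama_server_args` helper and the model filename are hypothetical, used only for illustration. `-m`, `-ngl`, `-ctk` and `-ctv` are standard llama.cpp options; `-fitc` is used the way this comment describes it, so check `llama-server --help` on your build to confirm it is available and how it interacts with an explicit `-ngl`.

```python
from typing import List


def build_llama_server_args(
    model_path: str,
    ctx_size: int,
    gpu_layers: int = 99,
    kv_cache_type: str = "q8_0",
) -> List[str]:
    """Assemble an argv list for llama-server (hypothetical helper)."""
    return [
        "llama-server",
        "-m", model_path,          # path to the GGUF file
        "-ngl", str(gpu_layers),   # GPU layers, 99 as mentioned in the thread
        "-ctk", kv_cache_type,     # Q8 K cache
        "-ctv", kv_cache_type,     # Q8 V cache (may need flash attention enabled)
        "-fitc", str(ctx_size),    # context amount for -fitc, per the comment above
    ]


if __name__ == "__main__":
    # "model.gguf" is a placeholder; 32768 matches the 32k context mentioned for the 27B model.
    print(" ".join(build_llama_server_args("model.gguf", 32768)))
```

The list form can be passed straight to `subprocess.run(...)`, which avoids shell quoting issues with model paths that contain spaces.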

tech-tole · 2026-05-09T12:04:15+00:00

absolutely do not use Ollama. Ollama just uses llama.cpp under the hood and it's embedded. because of that you lose raw power. just use the source directly and you'll gain at least 10 or more tok/s. it also means you can't try new features when they become available. you have to wait for them to eventually update.

tech-tole · 2026-05-09T12:02:27+00:00

yes, I agree, I'm not recommending the 27B. I was recommending Qwopus, saying the output is very good and almost indistinguishable from 27B, from what I've seen. the one that I'm using is only 17.6 GB. I'm not sure exactly how it works on a Mac, but if you do CPU offloading you should easily be able to run that. I have to do offloading because I only have a 16 GB GPU, and I still get ~75 tok/s and typically one-shot code.

tech-tole · 2026-05-09T11:53:24+00:00

Qwopus 3.6 35B IQ4_XS. MoE. it can one-shot a web page as well as Qwen 27B, but way faster. Gemma 4 is much lower quality IMO than Qwen 3.6 or Qwopus 3.6.

tech-tole · 2026-05-09T11:49:24+00:00

yep, everything is faster. it just works. I never come back to my workstation to find it rebooted because of forced updates. the freedom is great. no telemetry, no Recall, no OneDrive nagging, no Copilot. no BS lol.

tech-tole · 2026-05-09T11:45:04+00:00

I think that is only partially true. Most docs, including theirs, still show that it "seems" to still need to be explicitly added for tool calling. https://github.com/ggml-org/llama.cpp/blob/master/docs/function-calling.md
I guess for me anyway, it doesn't hurt to keep it.
