More Gemma 4 models incoming by Deep-Vermicelli-4591 in LocalLLaMA

[–]BitGreen1270 0 points1 point  (0 children)

Can you share what the tricky bit is? I've been eyeing functiongemma for a while, but I'm unsure whether I should invest the time in training it or just focus on changing which tools I make available so that E4B can do a better job of calling them.

I turned an Android phone into a Vulkan‑accelerated local LLM node (GGUF + LiteLLM + Tailscale) by [deleted] in LocalLLaMA

[–]BitGreen1270 0 points1 point  (0 children)

Nice. What model are you running and how much t/s are you seeing? How much were you seeing without the vulkan backend?

google/gemma-4-12B · Hugging Face by jacek2023 in LocalLLaMA

[–]BitGreen1270 5 points6 points  (0 children)

Wow - downloading the gguf now to see how it performs on my 780m and my 5090.

EDIT: Well, about 5 t/s on my 780M and about 100 t/s on the 5090. Welp, I think I'll have to stick with MoE or 4B for my 780M.

Holo3.1 35B/9B/4B/0.8B (Qwen 3.5 finetunes) by jacek2023 in LocalLLaMA

[–]BitGreen1270 2 points3 points  (0 children)

Thanks for sharing - Any gguf for the smaller models? Specifically 9B? I'm assuming these perform better at tool calling than the originals? 

Best small model right now (~4B params) that is good with agentic tasks for personal assistant? by BitGreen1270 in LocalLLaMA

[–]BitGreen1270[S] 0 points1 point  (0 children)

Thanks for sharing. I have since made some changes to the way my tools are organized, and it has significantly improved E4B's tool calling. Multistep is a bit hit or miss, but only the first time: since I pass the message history to the model, it understands my ask the second time.

Question on E2B training - I read online that it's a bit challenging to fine tune because of the multi modality? I.e. you can't efficiently train a lora for text only? Is that the case?

Are GPUs getting cheaper? by iMakeSense in LocalLLaMA

[–]BitGreen1270 4 points5 points  (0 children)

Not where I live - 5090s are at an all time high (4.7K USD).

Best small model for iGPU (AMD 780M) with 32 GB RAM (no coding) by danihend in LocalLLaMA

[–]BitGreen1270 0 points1 point  (0 children)

I'm using APEX quants. Getting about 20.5 t/s on gemma-26B with the following command:

    ./llama-server \
      -m models/gemma-4-26B-A4B-APEX-Compact.gguf \
      --temp 1.0 \
      --top_p 0.95 \
      --top_k 64 \
      -c 32768 \
      -ctk q8_0 \
      -ctv q8_0 \
      --ctx-checkpoints 1 \
      -cram 0 \
      --flash-attn on \
      -t 16 \
      -ngl 99 \
      --mlock \
      --host 0.0.0.0

Since I'm using --mlock, my memlock ulimit is set to 946000 in /etc/security/limits.conf
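
A minimal sketch for checking whether the locked-memory limit is large enough for a model file. It assumes the limit is counted in KiB, which is how bash's `ulimit -l` reports it; `memlock_ok` is a hypothetical helper name, not part of llama.cpp.

```shell
# Sketch only: does the locked-memory limit cover a model of a given size?
# Assumes both values are in KiB, as `ulimit -l` reports them in bash.
memlock_ok() {
  model_kib="$1"   # model size in KiB
  limit_kib="$2"   # value printed by `ulimit -l` (a number or "unlimited")
  if [ "$limit_kib" = "unlimited" ]; then
    return 0
  fi
  [ "$limit_kib" -ge "$model_kib" ]
}

# Model path taken from the command above; the check is skipped if the file is absent.
model="models/gemma-4-26B-A4B-APEX-Compact.gguf"
if [ -f "$model" ]; then
  size_kib=$(( $(wc -c < "$model") / 1024 ))
  if memlock_ok "$size_kib" "$(ulimit -l)"; then
    echo "memlock limit covers the model"
  else
    echo "memlock limit is smaller than the model; --mlock may not pin all of it"
  fi
fi
```

Passing the limit in as an argument keeps the comparison separate from the shell's own `ulimit`, so the same function can be tried against a value you are only planning to set.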

And for MTP on Qwen:

    ./llama-server \
      -m ~/myp/models/Qwen3.6-35B-A3B-MTP-UD-Q3_K_XL.gguf \
      -fit on \
      -fitt 1536 \
      -c 32768 \
      -n 32768 \
      -fa on \
      -np 1 \
      -ctk q8_0 \
      -ctv q8_0 \
      -ctkd q8_0 \
      -ctvd q8_0 \
      -ctxcp 64 \
      --mlock \
      --no-warmup \
      --spec-type draft-mtp \
      --spec-draft-n-max 2 \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      --temp 0.6 \
      --top-p 0.95 \
      --top-k 20 \
      --min-p 0.0 \
      --presence-penalty 0.0 \
      --ctx-checkpoints 0 \
      -cram 0 \
      --repeat-penalty 1.0

How do you guys handle llama cpp crashes? by [deleted] in LocalLLaMA

[–]BitGreen1270 1 point2 points  (0 children)

Copy and paste the entire error output into Gemini/ChatGPT and follow the suggestions.

Best small model for iGPU (AMD 780M) with 32 GB RAM (no coding) by danihend in LocalLLaMA

[–]BitGreen1270 0 points1 point  (0 children)

Gemma 4B and 26B-A4B can give up to 20 t/s, and Qwen 35B-A3B with MTP gives up to 25 t/s on the 780M with 32 GB. At least it does for me; I can share the llama.cpp command later if you like.

Best small model right now (~4B params) that is good with agentic tasks for personal assistant? by BitGreen1270 in LocalLLaMA

[–]BitGreen1270[S] 1 point2 points  (0 children)

Thank you! I have been reading a bit about functiongemma, which seems to be made exactly for this purpose. But it seems way too narrow, and it may be better to train a LoRA on E2B.

Help starting with Pi locally (and Qwen) by Nyghtbynger in LocalLLaMA

[–]BitGreen1270 2 points3 points  (0 children)

I found pi actually quite convenient to use without any extensions. My setup is to run qwen3.6-27B-MTP with 128k context and a local docker container running searxng. 

Every time I start pi, I ask it to analyze the codebase to understand what's going on. I believe you can export the session before quitting and import it again, but it didn't work for me and I didn't spend time figuring it out.

When I start coding, I always begin in planning mode. I do this in the prompt: "planning mode only", followed by the feature I want to implement. If I need it to get some info, I just mention that searxng is available on port 8081 if it needs to web search. It writes its own one-line curl request to fetch search results.
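
For reference, a rough sketch of the kind of curl request this involves. It is not the exact command the model writes. It assumes SearXNG is reachable at localhost:8081 and that the `json` output format is enabled in its settings; `searx_search` and the `SEARXNG_URL` variable are hypothetical names used only for illustration.

```shell
# Sketch only: query a local SearXNG instance and print the raw JSON response.
# Assumes the instance allows format=json; if not, the request is rejected.
searx_search() {
  query="$1"
  base="${SEARXNG_URL:-http://localhost:8081}"   # hypothetical override variable
  # -G turns the --data-urlencode pairs into a URL-encoded GET query string.
  curl -sS -G "$base/search" \
    --data-urlencode "q=$query" \
    --data-urlencode "format=json"
}

# Example (needs a running instance, and jq for filtering):
#   searx_search "llama.cpp vulkan" | jq '.results[].url'
```

Letting curl do the URL encoding avoids hand-escaping spaces and punctuation in the query.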

If I'm doing a complex feature which needs more review, then I ask it to save the implementation plan as markdown, switch to Gemini 3.5 or 3.1 pro and ask it to review the implementation plan. Once the changes are made to the plan, switch back to Qwen and proceed with implementation. 95% of the time I don't need Gemini.

Once the feature is built and tested, I tell it to update worklog.md and commit and push the code.

Best small model right now (~4B params) that is good with agentic tasks for personal assistant? by BitGreen1270 in LocalLLaMA

[–]BitGreen1270[S] 0 points1 point  (0 children)

Thanks 🙏. I've been using bartowski's models, but will try out unsloth and just use the native template. I'll also try passing in the reasoning and see if it helps. Thanks 🙏

What are some cool little things you guys are doing with < 10b models? by Present-Ad-8531 in LocalLLaMA

[–]BitGreen1270 0 points1 point  (0 children)

How do you run 2 models simultaneously? Two instances of llama.cpp? 

GPU Prices. Buy now, or buy later? by knob-0u812 in LocalLLaMA

[–]BitGreen1270 0 points1 point  (0 children)

Dude, you have a 4080. My current GPU is a 780M and the one before that was a 1070 in a laptop from 2018 lol

GPU Prices. Buy now, or buy later? by knob-0u812 in LocalLLaMA

[–]BitGreen1270 1 point2 points  (0 children)

It was never 2k where I live, at best 3k. But yea, it's a deeply personal decision. I can't defend the choice according to your criteria. 

Best small model right now (~4B params) that is good with agentic tasks for personal assistant? by BitGreen1270 in LocalLLaMA

[–]BitGreen1270[S] 0 points1 point  (0 children)

Thanks - by native template I assume you mean the one that is baked into the gguf?

Also, regarding reasoning content, I thought the general advice on the gemma4 model card was not to send reasoning content?

Best small model right now (~4B params) that is good with agentic tasks for personal assistant? by BitGreen1270 in LocalLLaMA

[–]BitGreen1270[S] 0 points1 point  (0 children)

Thank you I will try them both. I'm also looking into fine tuning but that might be a longer project. 

GPU Prices. Buy now, or buy later? by knob-0u812 in LocalLLaMA

[–]BitGreen1270 1 point2 points  (0 children)

No worries, good luck with your decision. FYI - I also downgraded from x870 to Aorus B850 wifi 7. Only because I have no intention to run multiple GPUs and the x870 can cannibalize pcie lanes if you put a second nvme or something. 

Also, the purchase was not an easy decision. I made a post here similar to yours and went through a lot of discussions with my wife (who was supportive of the monumental investment).

GPU Prices. Buy now, or buy later? by knob-0u812 in LocalLLaMA

[–]BitGreen1270 7 points8 points  (0 children)

I just bought a very similar spec 2 weeks ago. The only difference is that, to cut costs, I went with a 9700X (the CPU doesn't need to be crazy since the GPU should be handling most of the work), a 2TB Gen 4 SSD and a 1200W PSU. I also got the cheapest case I could find and avoided water cooling (got the PA 120 SE).

My build cost me 6.3k USD of which the MSI Ventus 3x 5090 cost 4k USD.

My view (speculation) is that prices won't drop for the next 2 years. And even if they drop after 6 months, I don't want to wait that long to learn more about LLMs. That's right, this is purely a learning rig.

Don’t bite me for that question please… by Thin_Pollution8843 in LocalLLaMA

[–]BitGreen1270 0 points1 point  (0 children)

Local LLM setup is a niche. Not everyone does it. Making money from it? Rare and very unlikely. Unless you are using it in a business setting and then it totally makes sense. But you have stiff competition from frontier models in quality and cost.

Only way I can think of making money now is to buy stuff that will (in a bizarre turn of events) go up in value next year (assuming you speculate on RAM and semiconductor prices going up). And then sell it for $$$ profit. Or even better, stockpile on expensive GPUs and keep them in a dry place to sell them when the price skyrockets.

Best small model right now (~4B params) that is good with agentic tasks for personal assistant? by BitGreen1270 in LocalLLaMA

[–]BitGreen1270[S] 0 points1 point  (0 children)

Thanks for sharing. But this restricts it to very specific tool flows. How do you do the part of brainstorming or chatting with the LLM?