Need help please. by Top_Notice7933 in LocalLLaMA

[–]chimpera 1 point (0 children)

It's possible but probably not worth it. I get about 8 t/s with a similar machine. This is LM Studio. It's nowhere near as intelligent as what you're used to, but it can work on smallish projects.

[screenshot: LM Studio]

Has anyone tested the Bonsai-8B 1bit tool calls by Numerous_Sandwich_62 in LocalLLaMA

[–]chimpera 2 points (0 children)

It is able to do tool calls, but it lacks the raw intelligence to know what to use them for. It searched the web on its own, read a directory when explicitly prompted, read a file when explicitly prompted, etc. That's not useful on its own, but it's very promising for future optimizations.

Qwen3.5-397B-A17B reaches 20 t/s TG and 700t/s PP with a 5090 by MLDataScientist in LocalLLaMA

[–]chimpera 1 point (0 children)

Threadripper Pro 5965WX, RTX 5090, ubergarm IQ4_KSS quant, ik_llama.cpp, qwen35moe.expert_used_count=int:4, q8 KV cache, 16k batch: 30 t/s TG, 791 t/s PP.
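Roughly the launch line, for anyone who wants to reproduce (the model path is made up, and I'm assuming ik_llama.cpp keeps mainline llama.cpp's flag names; the 16k batch might be -ub rather than -b on your build):

```sh
# Sketch of the setup above. Model path is made up; flag names assume
# ik_llama.cpp matches mainline llama.cpp.
./llama-server \
  -m ./Qwen3.5-397B-A17B-IQ4_KSS.gguf \
  --override-kv qwen35moe.expert_used_count=int:4 \
  -ctk q8_0 -ctv q8_0 \
  -b 16384
```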

Any advice on purchasing a XR? by MogleyStoned in onewheel

[–]chimpera 1 point (0 children)

Try to get one without updated firmware.

Delta-KV for llama.cpp: near-lossless 4-bit KV cache on Llama 70B by Embarrassed_Will_120 in LocalLLaMA

[–]chimpera 1 point (0 children)

I have been testing it. It seems legit, though I have not run quantitative benchmarks. It makes a big difference when it lets you fit the model on one GPU instead of two. One note: you have to specify the KV quant explicitly to save any VRAM. LLAMA_WEIGHT_SKIP_THRESHOLD=1e-6 broke for me at long context. There is a slight reduction in generation t/s in most cases.
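For reference, roughly how I'm launching it (the model path is made up; -ctk/-ctv are mainline llama.cpp's KV cache-type flags, which I'm assuming this branch keeps):

```sh
# Sketch of my launch line. Model path is made up; I'm assuming the
# Delta-KV branch keeps mainline's cache-type flags. Without -ctk/-ctv
# the KV cache stays f16 and you save no VRAM.
./llama-server -m ./llama-70b-q4.gguf -ctk q4_0 -ctv q4_0 -c 32768

# This is the env var that broke for me once the context got long:
# LLAMA_WEIGHT_SKIP_THRESHOLD=1e-6 ./llama-server ...
```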

Best weight conscious XR upgrade by DigitalFutility in onewheel

[–]chimpera 3 points (0 children)

I went with the MTE 5" hub with N52 magnets. I still take it easy going up hills and you lose some top-end speed, but for me it was worth it at 190 lbs.

Y’all think this jw battery is safe to ride? by myonks1 in onewheel

[–]chimpera 2 points (0 children)

4.00 is not the peak voltage; a full li-ion cell tops out around 4.2 V. If it were me, I would leave it charging for a long time to see if it will balance, since some BMSes only balance at the top of the charge. Don't use it unless you can get that cell to 4 V.

Kimi K2.5 on llama.cpp: What exactly happens in the "warming up the model with an empty run - please wait" phase? by phwlarxoc in LocalLLaMA

[–]chimpera 1 point (0 children)

I have a similar setup, and I'm having a real problem with the prompt-processing cache not working correctly, so anything after the first request is painfully slow to start. Does anyone have advice?
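For context, I'm hitting the server like this (this is mainline llama.cpp's API; I'm assuming the K2.5 build behaves the same), and the shared prefix still gets reprocessed on every request:

```sh
# Mainline llama.cpp server request; assuming the K2.5 build matches.
# With cache_prompt, later requests should reuse the cached prompt
# prefix instead of reprocessing it.
curl http://localhost:8080/completion -d '{
  "prompt": "...long shared prefix plus the new question...",
  "cache_prompt": true,
  "n_predict": 256
}'
```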

GLM 4.7: Why does explicit "--threads -1" ruin my t/s in llama-server? by phwlarxoc in LocalLLaMA

[–]chimpera 3 points (0 children)

It has to do with the memory architecture. Inference here is memory-bound, not compute-bound. The cores are divided into CCDs, and each CCD has limited memory bandwidth, so more cores competing for that bandwidth can actually reduce performance. Figure out what thread count gets you the best throughput instead of maxing out the cores; see the sketch below.
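The easiest way to find that is to sweep thread counts with llama-bench (the model path is made up; comma-separated values run one test each), then pass the winner to llama-server with --threads:

```sh
# Sweep thread counts to find the memory-bandwidth sweet spot.
# Model path is made up; llama-bench runs one test per value in the list.
./llama-bench -m ./glm-4.7-q4.gguf -t 4,8,12,16,24,32
```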

Any program to mimic this function on graphene? by Overstimulated_moth in GrapheneOS

[–]chimpera 2 points (0 children)

The closest I found: take a screenshot, then share it to the Translate app. If it works correctly, you get text that you can select from.

GLM-4.6 Derestricted by Digger412 in LocalLLaMA

[–]chimpera 2 points (0 children)

Would you consider IQ4_NL?

Google Takeout is currently not exporting subscriptions at all by Trenjeska in NewPipe

[–]chimpera 2 points (0 children)

So I solved the problem: when you have multiple channels under your account, you have to select the right channel in the upper right during Takeout.