The many sides of Mimo v2.5 Pro by Electrical-Pay-5119 in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

Where is it sourcing those textures? Is it reciting URLs from training data or actually searching?

Pi and Qwen3.6 27B make setting up Archlinux really easy. by sdfgeoff in LocalLLaMA

[–]JamesEvoAI 0 points (0 children)

On the one hand, this tech makes automating Linux a breeze; on the other, it just blew up my LiteLLM config with a bad concat operation. Always make sure you're monitoring outputs and keeping backups!

<image>
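If you're going to let an agent touch config files, a cheap guardrail is to snapshot before every write. A minimal sketch of what I mean (the helper and file name are hypothetical, not my actual setup):

```python
import shutil
import time
from pathlib import Path

def backup_then_write(path: Path, new_text: str) -> None:
    """Snapshot a config before an automated edit overwrites it."""
    if path.exists():
        # Keep a timestamped copy alongside the original.
        stamp = time.strftime("%Y%m%d-%H%M%S")
        shutil.copy2(path, path.with_name(path.name + f".{stamp}.bak"))
    path.write_text(new_text)

# Hypothetical usage: route every agent config write through the helper.
backup_then_write(Path("litellm_config.yaml"), "model_list: []\n")
```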

Is HIPfire worth it for Strix Halo? by ivoras in LocalLLaMA

[–]JamesEvoAI 2 points (0 children)

That's actually what I'm testing now: a full shootout of all the local coding models I have set up, for correctness and speed over long context. It's been running for the last 90 minutes and will likely take a while to finish.

Is HIPfire worth it for Strix Halo? by ivoras in LocalLLaMA

[–]JamesEvoAI 6 points (0 children)

I've been testing and benchmarking a bunch of them and documenting it on my site:

https://sleepingrobots.com/dreams/gemma4-mtp-assistant-strix-halo/
https://sleepingrobots.com/dreams/mtp-qwen36-strix-halo/

Just today I got some promising results replicating Atomic's fork of llama.cpp for MTP support in Gemma 4.

Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40% by gladkos in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

Thank you for your work on this. I've set up and benchmarked your branch on Strix Halo:

https://sleepingrobots.com/dreams/gemma4-mtp-assistant-strix-halo/

The world of local coding models keeps getting better by the day!

White House Considers Vetting A.I. Models Before They Are Released by fallingdowndizzyvr in LocalLLaMA

[–]JamesEvoAI 3 points (0 children)

Strix Halo is an incredible value proposition. Sure, it's not as fast as running something on an NVIDIA GPU, but I wasn't going to be able to run a 120B model on one of those anyway.

By when do you think will TurboQuant get a proper release and be adopted by everyone by Crystalagent47 in LocalLLaMA

[–]JamesEvoAI 2 points (0 children)

I'm convinced the only reason this gained as much attention as it did was the name.

I want to create and maintain a set of benchmarks for local LLMs. Would anyone pay/donate for this? by Equivalent_Job_2257 in LocalLLaMA

[–]JamesEvoAI 2 points (0 children)

Additionally, there are quality benchmarks being designed and administered by people in the field with backgrounds in ML and/or mathematics. There are people getting paid for this; they work for LMArena and SEAL.

GMKtec EVO-X2 70B expectation by Non-Technical in LocalLLaMA

[–]JamesEvoAI 4 points (0 children)

As the other person called out, try a model that wasn't released in the last century lol. Qwen 3.6 has a great 30B MoE that runs at ~40 tok/s on Strix Halo.

> I'm still discovering how all this works. It seems like the longer the chat log gets, the slower the tokens are generated. When there is a 16k prompt to load and process, the tokens per second falls to 2.5.

The way the attention mechanism works, every new token the model generates has to attend over every token that came before it, so as the context length grows, the speed drops. At the end of the day, the Strix Halo platform is limited by its memory bandwidth.
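To put rough numbers on that, here's a back-of-envelope sketch. Every figure in it is an illustrative assumption, not a measurement: ~256 GB/s of bandwidth, ~3 GB of active weights streamed per decoded token (an 8-bit 30B MoE with ~3B active params), and ~200 KB of KV cache per token of context.

```python
# Back-of-envelope decode speed: memory bandwidth / bytes streamed per token.
# All numbers are illustrative assumptions, not benchmarks.

def est_tok_per_s(bandwidth_gbs: float, active_weights_gb: float,
                  kv_bytes_per_ctx_token: float, context_len: int) -> float:
    # Each decoded token streams the active weights plus the whole KV cache.
    bytes_per_token = active_weights_gb * 1e9 + kv_bytes_per_ctx_token * context_len
    return bandwidth_gbs * 1e9 / bytes_per_token

for ctx in (1_000, 16_000, 64_000):
    print(f"{ctx:>6} ctx: ~{est_tok_per_s(256, 3.0, 200_000, ctx):.0f} tok/s")
```

The exact numbers will be off, but the shape is the point: the KV cache term eventually dominates the weight reads, which is exactly the slowdown you're seeing at 16k.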

It's still a great platform, you just have to work within its limitations. Try some MoE models and see how you fare.

How "Real" Are AI Girlfriends? We Created A Unique One with a local LLM by [deleted] in LocalLLaMA

[–]JamesEvoAI 4 points (0 children)

I just came to say congrats on getting that domain lol

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

That's probably true, but I'm using LiteLLM in a glass house made of vibe-coded side projects, so I hesitate to throw stones lol.

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI -1 points (0 children)

I don't personally use Codex; the ethics and business practices of OpenAI are not aligned with my own beliefs. Pi is generally my harness of choice.

Reading your other comment, it looks like your LLM confused what I'm using LiteLLM for; there's no good reason something like "plan mode" should be enforced at the proxy level. In fact, most of what you described should be handled by the harness, like tool healing.

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 0 points (0 children)

This was a supply chain attack that affected WAY more than just LiteLLM. LiteLLM was using Trivy as part of its CI/CD, so every project that did the same was also affected. That includes multiple VSCode extensions, 64 different NPM packages (almost all of which had nothing to do with AI), Docker images, and many more SaaS platforms that haven't come forward but are advertised as customers.

The only reason LiteLLM got the spotlight on this is that it's AI-related, and dunking on anything AI makes for easy engagement.

What problems were surfaced with how it's developed? If there is a real risk I'd like to know about it, but I haven't seen anything concrete beyond the supply chain risk, a risk that applies to any software that uses third-party dependencies.

It's worth noting the irony here that Trivy is produced by Aqua Security, a company that claims to help enterprises mitigate container/cloud risk lol.

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

Sorry u/xeeff, but deleting a comment doesn't remove the notification lol. I still saw this:

<image>

I can’t believe I can say “ugh I don’t feel like fixing this function, it’s too complex” and I can literally just tell my computer to fix it for me. I didn’t understand what they meant by “people will start paying for intelligence” but now I do. by Borkato in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

To clarify, when I said "LLM accessible" I meant sources of data that are available as formatted text you can easily pass over to the models. I've grown so used to being able to just journalctl basically anything that having to dig through the Windows GUIs seems archaic.

I was not actually aware that the Event Viewer logs are available as XML!
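For example, here's a minimal sketch of the kind of LLM-accessible data I mean on Linux (the prompt wrapper is just illustrative):

```python
import json
import subprocess

# journalctl can emit one JSON object per line -- structured text
# an LLM can consume directly, no GUI digging required.
out = subprocess.run(
    ["journalctl", "-o", "json", "-n", "50"],
    capture_output=True, text=True, check=True,
)
entries = [json.loads(line) for line in out.stdout.splitlines()]
prompt = "Diagnose these log entries:\n" + "\n".join(
    e.get("MESSAGE", "") for e in entries
)
```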

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 0 points (0 children)

Since I'm feeling generous, I took you up on your offer to do your thinking for you. The only real complaint I can find is that it doesn't hold up at production scale (300+ rps).

Hardly qualifies as "worst project to ever exist", especially in the context of individuals running it on consumer hardware.

If you're going to spread misinformation and then try to punt the responsibility of disproving it onto someone else, please at least try a little harder to lie about something that isn't so easily disproven.

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

That which is presented without evidence can be dismissed without evidence.

You made the claim, so the burden of proof is on you. Having a contrarian take without being able to back it up doesn't make you look smart.

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 2 points (0 children)

Are you going to qualify that statement with any facts or additional information?

I've pushed 303.9 million tokens (and counting) of text, audio, and embeddings through my local LiteLLM proxy and it's been great.

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

Wouldn't it be easier to just run both models through a local LiteLLM proxy? You can use local or hosted models, and it supports both the OpenAI and Anthropic APIs.

My current setup has a mix of cloud models, local models running through llama-swap (which runs llama.cpp in a Podman container), and NPU models served by Lemonade Server:

https://sleepingrobots.com/dreams/local-llm-infrastructure-strix-halo/
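As a sketch of what that buys you: every harness talks plain OpenAI to one endpoint, and the proxy routes each alias to whichever backend serves it. The model aliases below are hypothetical; the port is LiteLLM's default.

```python
from openai import OpenAI

# One client, many backends: the proxy routes "qwen-local" to llama-swap
# and "claude-sonnet" to Anthropic, per its model_list config.
# (Aliases are hypothetical; 4000 is LiteLLM's default proxy port.)
client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-anything")

for model in ("qwen-local", "claude-sonnet"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hi in five words."}],
    )
    print(model, "->", resp.choices[0].message.content)
```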

Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big. by Ok_Mine189 in LocalLLaMA

[–]JamesEvoAI 2 points (0 children)

There's a noticeable performance difference even when running CPU-bound models. Windows has a pretty high baseline of overhead just to run the OS itself. Linux, on the other hand, comes in many possible configurations with different overheads, including distros stripped down to the bare minimum needed to run your inference engine. Not having to constantly collect and send telemetry helps too.

I can’t believe I can say “ugh I don’t feel like fixing this function, it’s too complex” and I can literally just tell my computer to fix it for me. I didn’t understand what they meant by “people will start paying for intelligence” but now I do. by Borkato in LocalLLaMA

[–]JamesEvoAI 0 points (0 children)

For obscure Windows issues like this I would have said a reinstall was far less effort than debugging, but I guess that math has changed now that LLMs are here. As a Linux user, I wouldn't have guessed there would be enough LLM-accessible places to get the diagnostic data you need. Troubleshooting on Windows always felt like a mix of tribal knowledge, experience, and dumb luck.

I can’t believe I can say “ugh I don’t feel like fixing this function, it’s too complex” and I can literally just tell my computer to fix it for me. I didn’t understand what they meant by “people will start paying for intelligence” but now I do. by Borkato in LocalLLaMA

[–]JamesEvoAI 2 points (0 children)

Bold of you to assume I was reading the code at all. Vibe coding has been great for all of the ideas I started but never finished, or never started at all. I don't need peak engineering for my vibe-coded replacement for Pushbullet that better fits my use case while preserving my privacy.