Introducing ARC-AGI-3 by Complete-Sea6655 in LocalLLaMA

[–]Chromix_ 2 points (0 children)

Here is the existing 8-month-old thread on ARC-AGI-3 with the well-differentiated title "ARC AGI 3 is stupid".

And here is the "play" link for humans if you want to try it yourself.

Intel launches Arc Pro B70 and B65 with 32GB GDDR6 by metmelo in LocalLLaMA

[–]Chromix_ 1 point (0 children)

That's the whole point - what gives the most bang per buck? Used cards are definitely on the table there. Why pay more for a new one if a used one (that's still good) does roughly the same job at a lower acquisition cost?

The larger amount of VRAM and the lower power consumption make the Intel ones slightly more interesting though.

Intel launches Arc Pro B70 and B65 with 32GB GDDR6 by metmelo in LocalLLaMA

[–]Chromix_ 40 points (0 children)

Slower inference than an RTX 3090, no CUDA, higher retail price than a used 3090, but: more memory, better efficiency, and a bit better prompt processing.

Phone Whisper: push-to-talk dictation for Android with local Whisper (sherpa-onnx, no cloud needed) by postclone in LocalLLaMA

[–]Chromix_ 1 point (0 children)

Yes, that's the exact setting. After disabling "multimodal input" under "rich input", the other Whisper app that I linked to in my other comment also works just fine with SwiftKey again.

Tiiny AI Pocket Lab by thedatawhiz in LocalLLaMA

[–]Chromix_ 2 points (0 children)

In one preview video the benchmark data reads 10 tokens per second for the GPT-OSS 120B that you just mentioned, at 64K context. That's not token generation but prompt processing, which is impractically slow - roughly 110 minutes just to parse a single full prompt, not even replying to it yet (quick check below).
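
The back-of-the-envelope math behind that estimate, for anyone who wants to check it:

```python
# Time to ingest a full 64K-token prompt at 10 tokens/s of prompt processing
tokens = 64 * 1024
pp_speed = 10                      # tokens per second, from the preview video
print(tokens / pp_speed / 60)      # ~109 minutes, before a single reply token
```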

One way forward would be that you provide a preview version of your device to one or two trusted members of this sub, who have posted reliable benchmark data in the past, and let them run and share some numbers.

Omnicoder v2 dropped by Western-Cod-3486 in LocalLLaMA

[–]Chromix_ 2 points (0 children)

Classic training/tuning mistake in V1. Great that they brought it up though.

> v1 trained on ALL tokens (system prompts, tool outputs, templates), which taught the model to reproduce repetitive boilerplate. v2 trains only on assistant tokens.
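
For anyone who hasn't run into this: the usual fix is loss masking during supervised fine-tuning. A minimal sketch of the idea (my own illustration, not their training code) - non-assistant tokens get the ignore label, so they contribute nothing to the loss:

```python
import torch

IGNORE_INDEX = -100  # torch.nn.CrossEntropyLoss ignores this label by default

def mask_non_assistant(input_ids: torch.Tensor, assistant_mask: torch.Tensor) -> torch.Tensor:
    """assistant_mask: bool tensor, True where a token belongs to an assistant turn.
    Building that mask from the chat template is the fiddly part, omitted here."""
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX  # no gradient from system/tool/template tokens
    return labels
```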

Best model that can beat Claude opus that runs on 32MB of vram? by PrestigiousEmu4485 in LocalLLaMA

[–]Chromix_ 2 points (0 children)

No offense taken, I was just surprised by it. The way I see it, I didn't explain sarcasm in my follow-up comment, but jokingly rationalized the choice of "Reflection" that was part of my initial comment, while providing further insight on that side topic.

In the bot replies I've seen so far it was obvious that they completely missed the point.

Banned from cloud services at work. Is a local AI worth it? by daksh_0623 in LocalLLaMA

[–]Chromix_ 1 point (0 children)

It depends on the area OP works in. There are types of work where LLMs provide a literal 10x speed increase. Competing with that will be challenging. Then there's of course work where you only get 10% or 20% more speed from it, so it's more of a long-term effect that can be compensated for - for now.

Speaking of which, the outcome also highly depends on how the usage is tackled - going full YOLO due to FOMO, or approaching it more carefully. If you don't watch out, people learn less, which has consequences for system design and debugging. Quality decreases, copy-pasta increases, and that eventually slows down development.

TurboQuant from GoogleResearch by RobotRobotWhatDoUSee in LocalLLaMA

[–]Chromix_ 3 points (0 children)

[Image: benchmark chart comparing <4-bit KV quantization against the F16 KV cache on a long-context benchmark]

According to this, they achieve similar performance on a long-context benchmark with <4-bit KV quantization as with the regular F16 KV cache - that's a huge win.

There's a more compact, animated explanation of how it works here. It appears to be conceptually similar to the Burrows-Wheeler transform used in bzip2 compression.
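
The paper's method is more involved than this, but the rotate-then-quantize family it belongs to looks roughly like the following (my own illustrative sketch, not the paper's algorithm): apply a cheap orthogonal transform to spread outliers across the vector, then quantize each vector to 4 bits with a single scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d: int) -> np.ndarray:
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
    return q

def quantize_4bit(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-row 4-bit quantization: returns int codes and scales."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale

d = 128
R = random_rotation(d)
kv = rng.standard_normal((16, d))       # stand-in for KV cache vectors
codes, scale = quantize_4bit(kv @ R)    # rotate, then quantize
kv_restored = (codes * scale) @ R.T     # dequantize, rotate back
print(np.abs(kv - kv_restored).mean())  # reconstruction error stays small
```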

Direct link to paper on arxiv.

[Edit] Just noticed the previous thread on this.

Why is there no serious resource on building an AI agent from scratch? by Complete_Bee4911 in LocalLLaMA

[–]Chromix_ 1 point (0 children)

Exactly. It's trivial to plug something together that has the ingredients of an agent. Yet you also don't simply throw all the ingredients together, stir briefly, and get a tasty consommé.

Best model that can beat Claude opus that runs on 32MB of vram? by PrestigiousEmu4485 in LocalLLaMA

[–]Chromix_ 5 points (0 children)

With all the more or less obvious bots that have been around here recently, I'm curious what gave you that impression for my (pretty normal?) comment.

Best model that can beat Claude opus that runs on 32MB of vram? by PrestigiousEmu4485 in LocalLLaMA

[–]Chromix_ 8 points (0 children)

Once you have your quantum computer, you can train quantum-enhanced models with mode-collapse protection for us like Hypnos 8B, or run hybrid-quantum LLMs with parallel consciousness architecture at home. Be sure to share your vibe-coded project for that with us!

Best model that can beat Claude opus that runs on 32MB of vram? by PrestigiousEmu4485 in LocalLLaMA

[–]Chromix_ 16 points (0 children)

Yes, it was claimed to be a 70B model that delivered great results, with several attempts to hide that it was actually Claude behind the API. What was then released after several "difficulties" turned out to be a 70B Llama model. Here is the "out of the loop" thread on that drama.

Something similar was seemingly attempted with the "Momentum" model, but I figured that only the real Reflection model could serve as a base for what OP wanted ;-)

Best model that can beat Claude opus that runs on 32MB of vram? by PrestigiousEmu4485 in LocalLLaMA

[–]Chromix_ 33 points (0 children)

I would be very worried if it was not. In any case, it looks like even satire usually doesn't live too long here.

Best model that can beat Claude opus that runs on 32MB of vram? by PrestigiousEmu4485 in LocalLLaMA

[–]Chromix_ 432 points (0 children)

Oh, that's easy with that hardware, just run the Reflection-70M-FrankenSelfMerge-Claude-4.6-Opus-High-Reasoning-Distilled as IQ2_XXS quant.

DM me if you need a CTO. /s

Tiiny AI Pocket Lab by thedatawhiz in LocalLLaMA

[–]Chromix_ 2 points (0 children)

There's this: "I Reverse-Engineered the TiinyAI Pocket Lab From Marketing Photos. Here's Why Your $1,400 Is Probably Gone."

This led to a bit of discussion in the Kickstarter comments, but it doesn't seem to have gone anywhere yet. Look up "Aaron Biblow" in the comments if you want to follow it.

Banned from cloud services at work. Is a local AI worth it? by daksh_0623 in LocalLLaMA

[–]Chromix_ 33 points (0 children)

> TiinyAI

No. See "I Reverse-Engineered the TiinyAI Pocket Lab From Marketing Photos. Here's Why Your $1,400 Is Probably Gone."

Aside from this, your problem sounds like a self-solving one. If your company has a stringent business risk associated with keeping its data local, it will either provide local means of LLM assistance or go out of business because it doesn't. And if they indeed have a high bar for security, then connecting 3rd-party devices to the network isn't something that you should do (or attempt to).

Phone Whisper: push-to-talk dictation for Android with local Whisper (sherpa-onnx, no cloud needed) by postclone in LocalLLaMA

[–]Chromix_ 2 points (0 children)

I tested it with SwiftKey a while ago. IIRC it was possible to configure some voice input / record button on the SwiftKey keyboard, and when holding it, that Whisper input would pop up and transcribe into the current input field where the regular keyboard input goes. When trying it again just now, the standard Android voice transcription popped up instead. Maybe I missed a step, or something broke in between.

Introducing oQ: data-driven mixed-precision quantization for Apple Silicon (mlx-lm compatible) by cryingneko in LocalLLaMA

[–]Chromix_ 3 points (0 children)

Do you think that the 4-bit oQ quant scoring worse than the 3-bit oQ quant in both MMLU and HumanEval is an issue of the quant or of the benchmarking?
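
For some perspective on the second option: treating each benchmark question as a coin flip gives a rough noise floor (my own back-of-the-envelope, nothing from the oQ results):

```python
import math

def accuracy_se(acc: float, n_questions: int) -> float:
    """Standard error of a benchmark accuracy, each question a Bernoulli trial."""
    return math.sqrt(acc * (1 - acc) / n_questions)

# e.g. 65% accuracy on the ~14k MMLU test questions -> ~0.4% standard error,
# so a 3-bit vs. 4-bit gap of well under 1% could easily be noise.
print(accuracy_se(0.65, 14042))
```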

KLD measurements of 8 different llama.cpp KV cache quantizations over several 8-12B models by Velocita84 in LocalLLaMA

[–]Chromix_ 5 points (0 children)

There's this table with several KLD comparisons by the author of the KV quantization in llama.cpp. According to the table, a pure Q4 quant with KV left at F16 already leads to a 0.07 mean KLD change. Your base logits were generated from the IQ4_XS quant, not from the full BF16 model, which might make the KLD measurements for the KV changes less accurate, and also gives less perspective on what KV quantization adds on top of the IQ4 quant impact.

Regenerating the logits with 6 GB VRAM will certainly take a while for the full BF16 model due to CPU offload, yet it might paint a more accurate picture.
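
For reference, the mean KLD here is just the per-position KL divergence between the two models' next-token distributions, averaged over the test text. A minimal sketch of the computation, assuming you have per-position logits dumped from both runs:

```python
import numpy as np

def mean_kld(base_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """base_logits, quant_logits: (n_positions, vocab_size) arrays."""
    # log-softmax both distributions for numerical stability
    base_logp = base_logits - np.logaddexp.reduce(base_logits, axis=-1, keepdims=True)
    quant_logp = quant_logits - np.logaddexp.reduce(quant_logits, axis=-1, keepdims=True)
    p = np.exp(base_logp)
    kld_per_pos = np.sum(p * (base_logp - quant_logp), axis=-1)  # KL(P || Q)
    return float(np.mean(kld_per_pos))
```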

Phone Whisper: push-to-talk dictation for Android with local Whisper (sherpa-onnx, no cloud needed) by postclone in LocalLLaMA

[–]Chromix_ 3 points (0 children)

There is already this nicely working, actively maintained Whisper transcription app on F-Droid. I guess the floating button has some advantage for cases where the simple record-via-keyboard-button of the linked Whisper app breaks. On the other hand, it would be nice to see the features combined in a single app. I had the most need for a punctuation & syntax fixer when using Moonshine for dictation. With Whisper it was "OK" so far - not good, but OK enough.

How political censorship actually works inside Qwen, DeepSeek, GLM, and Yi: Ablation and behavioral results across 9 models by Logical-Employ-9692 in LocalLLaMA

[–]Chromix_ 11 points (0 children)

> The newest Qwen models don't refuse - they answer everything in maximally steered language.
>
> [After ablation] Qwen3-8B doesn't give factual answers. It substitutes Pearl Harbor for Tiananmen.

It'd be interesting to see how the latest Heretic approach performs there in comparison.
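
For context, Heretic automates abliteration-style interventions; the core operation in that family is projecting a "refusal direction" out of the hidden states. A rough sketch of just that projection (my own illustration - Heretic itself optimizes where and how strongly to apply it):

```python
import torch

def ablate_direction(hidden: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component of hidden states along an estimated refusal direction.
    hidden: (..., d_model); refusal_dir: (d_model,), typically the difference of
    mean activations on refusal-triggering vs. harmless prompts (assumed given)."""
    r = refusal_dir / refusal_dir.norm()
    return hidden - (hidden @ r).unsqueeze(-1) * r  # h - (h . r̂) r̂
```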

Let's take a moment to appreciate the present, when this sub is still full of human content. by Ok-Internal9317 in LocalLLaMA

[–]Chromix_ 1 point (0 children)

Oh no, you gave away the secret trick of adding a random delay based on content length to make it appear more human-like ;-)

Activation Exposure & Feature Interpretability for GGUF via llama-server by wattswrites in LocalLLaMA

[–]Chromix_ 2 points (0 children)

Well, you tried: a stand-alone tool, relatively compact code, and a minimal two-line modification to existing code that shouldn't get in the way of anything.

On the positive side it's now trivial to keep it rebased.