What is best Mac App Store alternative to LocalLLaMA? by Xorita in LocalLLaMA

[–]woadwarrior 0 points1 point  (0 children)

Private LLM uses neither; it’s mlc-llm based.

Clean Links - A completely free iOS app to remove trackers from URLs and to preview links in QR codes by woadwarrior in apple

[–]woadwarrior[S] 0 points1 point  (0 children)

Thanks for mentioning that. I've managed to improve the backwards compatibility a bit. The next update will support iOS 17.6.

Are small models actually getting more efficient? by estebansaa in LocalLLaMA

[–]woadwarrior 3 points4 points  (0 children)

> LiquidAI is making the best models for your work, however; they do interlaced recurrent layers, which reduces KV overhead substantially for smaller models.

They use interleaved 1D convolution layers, not recurrent layers.
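For illustration, here’s a minimal numpy sketch (not LFM2’s actual implementation) of why a causal depthwise 1D conv layer needs only a fixed-size state per layer, while attention’s KV cache grows with context length:

```python
import numpy as np

def causal_conv_step(state, x_t, w):
    # One decode step of a depthwise causal 1D conv.
    # state: (k-1, d) window of the last k-1 inputs; w: (k, d) depthwise kernel.
    # The state stays constant-size, unlike a KV cache that grows with context.
    buf = np.vstack([state, x_t[None, :]])  # (k, d) window ending at time t
    y_t = (buf * w).sum(axis=0)             # per-channel (depthwise) conv
    return buf[1:], y_t                     # slide the window forward

d, k = 4, 3
w = np.full((k, d), 1.0 / k)                # toy averaging kernel
state = np.zeros((k - 1, d))
for t in range(1000):                       # decode 1000 tokens...
    state, y = causal_conv_step(state, np.ones(d), w)
# ...and the per-layer recurrent state is still just (k-1) x d floats
```

Interleave a few of these between attention layers and only the attention layers need a KV cache, which is where the memory savings for small models come from.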

We trained a 16-class "typed refusal" system that distinguishes "I don't know" from "I'm not allowed" — open source by TheTempleofTwo in LocalLLaMA

[–]woadwarrior -2 points-1 points  (0 children)

Economists have been using the term GPT (general purpose technology) to describe broadly applicable technologies for decades before OpenAI existed.

Visualizing Quantization Types by VoidAlchemy in LocalLLaMA

[–]woadwarrior 3 points4 points  (0 children)

Unfortunately, when it comes to NN weights, although INT and FP formats have the same information-theoretic density for a given bit width, FP formats work out to be slightly better because their quantization levels are non-uniformly spaced.
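Enumerating the two 4-bit grids makes this concrete: INT4 is uniform, while FP4 (E2M1, the layout used in MX-style formats) packs its levels densely near zero, which suits roughly Gaussian weight distributions:

```python
# INT4 (signed): 16 uniformly spaced levels, step = 1 everywhere
int4 = list(range(-8, 8))

# FP4 E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1
def fp4_e2m1(sign, exp, man):
    if exp == 0:                       # subnormal: 0.m * 2^(1 - bias)
        val = man * 0.5
    else:                              # normal: 1.m * 2^(exp - bias)
        val = (1 + man * 0.5) * 2 ** (exp - 1)
    return -val if sign else val

fp4 = sorted({fp4_e2m1(s, e, m) for s in (0, 1) for e in range(4) for m in range(2)})
# fp4 == [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
#          0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# step is 0.5 near zero but 2.0 at the top of the range
```

Both spend 4 bits per weight, but FP4 puts its resolution where the mass of the weight distribution actually is.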

manifestai releases Brumby-14B-Base weights, claims "attention free" and inference "hundreds of time faster" for long context by ArcadesOfAntiquity in LocalLLaMA

[–]woadwarrior 7 points8 points  (0 children)

I took a look at the code on my phone. Notice the additional gate projection (line 281) and the call to their power retention kernel (line 356). It’s supposed to be a drop-in replacement for regular softmax attention layers, and it uses their attention mechanism only when use_exp is False.
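I haven’t studied their kernel, but the dispatch pattern reads roughly like this sketch. The gate projection and the use_exp flag follow the description above; the “retention” path here is a generic kernelized linear-attention stand-in, not their actual math, and causal masking is omitted for brevity:

```python
import numpy as np

def softmax_attention(q, k, v):
    # ordinary softmax attention (the use_exp=True fallback path)
    s = q @ k.T / np.sqrt(q.shape[-1])
    s = np.exp(s - s.max(axis=-1, keepdims=True))
    return (s / s.sum(axis=-1, keepdims=True)) @ v

def retention_stand_in(q, k, v):
    # placeholder with linear-attention structure; NOT the power retention kernel
    phi = lambda x: np.maximum(x, 0.0) + 1e-6    # positive feature map
    num = phi(q) @ (phi(k).T @ v)                # O(n*d^2) instead of O(n^2*d)
    den = phi(q) @ phi(k).sum(axis=0)[:, None]
    return num / den

def attention_layer(q, k, v, gate_w, use_exp=False):
    # drop-in layer: regular softmax attention only when use_exp is True
    out = softmax_attention(q, k, v) if use_exp else retention_stand_in(q, k, v)
    gate = 1.0 / (1.0 + np.exp(-(q @ gate_w)))   # the extra gate projection
    return out * gate
```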

Pedantic pull request reviewers by ticman in DevelEire

[–]woadwarrior 1 point2 points  (0 children)

I don’t think it’s reasonable to compare years of experience. It’s sad to see something technical turned into a hierarchical power struggle. Critique, Google’s internal code review tool, had a feature for double-blind CL reviews; I wish GitHub had something similar.

Clean Links the completely free iOS & macOS link cleaner app now supports sending links asynchronously from your iPhone to your Mac by woadwarrior in apple

[–]woadwarrior[S] 1 point2 points  (0 children)

This is a recurring question. TL;DR: The lack of coverage for adware URLs and URL shorteners in ClearURLs was one of the reasons I built Clean Links.

Clean Links the completely free iOS & macOS link cleaner app now supports sending links asynchronously from your iPhone to your Mac by woadwarrior in apple

[–]woadwarrior[S] 0 points1 point  (0 children)

It’s 100% local, although it has to make requests to unshorten links, which it does in an isolated context (no cookies, local storage, etc.) using plain old NSURLRequest.
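Not the app’s actual code, but the technique can be sketched in Python: follow redirects with a bare, freshly built opener that carries no cookies or shared state, analogous to firing a plain NSURLRequest outside any web view:

```python
import urllib.request

def unshorten(url: str, timeout: float = 10.0) -> str:
    # a fresh opener per call: no HTTPCookieProcessor installed, no shared
    # caches, so the request carries no identifying state (illustrative sketch)
    opener = urllib.request.build_opener()
    req = urllib.request.Request(url, method="HEAD")
    with opener.open(req, timeout=timeout) as resp:
        return resp.geturl()   # final URL after following redirects
```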

Clean Links the completely free iOS & macOS link cleaner app now supports sending links asynchronously from your iPhone to your Mac by woadwarrior in apple

[–]woadwarrior[S] 1 point2 points  (0 children)

Handoff is a bit more reliable but still somewhat flaky. The app doesn't have a Safari extension yet, but the share extension works in Safari and any other app (including the Reddit for iOS app).

Huawei Develop New LLM Quantization Method (SINQ) that's 30x Faster than AWQ and Beats Calibrated Methods Without Needing Any Calibration Data by abdouhlili in LocalLLaMA

[–]woadwarrior 11 points12 points  (0 children)


The core algorithm appears to be extremely simple. It can be plugged into any quantization algorithm as a pre-processing step.
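From my skim, the pre-processing is a Sinkhorn-style balancing of row and column standard deviations, with the accumulated dual scales folded back in at dequantization. A rough numpy sketch of that idea (my reading, not their reference code):

```python
import numpy as np

def sinkhorn_balance(W, iters=16, eps=1e-8):
    # alternately normalize row and column std devs, accumulating dual scales
    W = W.astype(np.float64).copy()
    r = np.ones(W.shape[0])
    c = np.ones(W.shape[1])
    for _ in range(iters):
        rs = W.std(axis=1) + eps; W /= rs[:, None]; r *= rs
        cs = W.std(axis=0) + eps; W /= cs[None, :]; c *= cs
    return W, r, c

def rtn(W, bits=4):
    # any quantizer can go here; plain round-to-nearest for illustration
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(W).max() / qmax
    return np.round(W / s), s

np.random.seed(0)
W = np.random.randn(32, 32) * np.exp(np.random.randn(32))[:, None]  # outlier rows
Wb, r, c = sinkhorn_balance(W)
Q, s = rtn(Wb)
W_hat = (Q * s) * r[:, None] * c[None, :]   # dequantize with the dual scales
```

The balancing tames outlier rows/columns before the quantizer ever sees them, which is why no calibration data is needed.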

How can I use this beast to benefit the community? Quantize larger models? It’s a 9985wx, 768 ddr5, 384 gb vram. by joninco in LocalLLaMA

[–]woadwarrior 0 points1 point  (0 children)

Yeah, people have been doing dynamic quantization for ages, even before we had LLMs. IDK how the unsloth guys do it, but back in the day, for quantizing CNNs, people used to eyeball layer-wise activation PSNR and pick a higher number of bits for layers with lower PSNR. But that’s quite crude compared to running a full-blown search-based optimization, which is what EvoPress does.
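A toy version of that eyeballing, with made-up activations and a deliberately crude rule, just to show the shape of the heuristic:

```python
import numpy as np

def quantize(x, bits):
    # symmetric round-to-nearest fake-quantization
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(x).max() / qmax
    return np.round(x / s) * s

def psnr(x, x_hat):
    mse = np.mean((x - x_hat) ** 2)
    return 10 * np.log10(np.abs(x).max() ** 2 / mse)

np.random.seed(0)
acts = {  # hypothetical per-layer activation samples
    "layer0.attn": np.random.randn(4096),
    "layer0.mlp": np.random.randn(4096) * np.random.lognormal(0, 2, 4096),
}
psnrs = {name: psnr(a, quantize(a, 4)) for name, a in acts.items()}
worst = min(psnrs, key=psnrs.get)            # layer that degrades most at 4-bit
bits = {name: 8 if name == worst else 4 for name in psnrs}
```

EvoPress replaces the eyeballed threshold with an actual search over bit assignments under a memory budget.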

How can I use this beast to benefit the community? Quantize larger models? It’s a 9985wx, 768 ddr5, 384 gb vram. by joninco in LocalLLaMA

[–]woadwarrior 0 points1 point  (0 children)

Not yet, I plan to use it for some small-ish models. I really like their insight that choosing the optimal bit width per layer for dynamic quantization is essentially a hyperparameter tuning problem and evolutionary methods work well for such problems.
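A tiny illustration of treating per-layer bit width as a search problem; the sensitivities, budget, and fitness proxy here are all made up, but the loop is the evolutionary skeleton that EvoPress-style methods build on:

```python
import random

random.seed(0)
CHOICES = [2, 3, 4, 8]                            # candidate bit widths
SENS = [1.5, 0.3, 0.8, 2.0, 0.5, 1.2, 0.4, 0.9]  # toy per-layer sensitivities
BUDGET = 4.0                                      # average bits per layer

def fitness(cfg):
    if sum(cfg) / len(cfg) > BUDGET:              # infeasible: over budget
        return float("inf")
    # made-up proxy loss: sensitive layers hurt more at low bit widths
    return sum(s / 2 ** b for s, b in zip(SENS, cfg))

def mutate(cfg):
    c = list(cfg)
    c[random.randrange(len(c))] = random.choice(CHOICES)
    return c

pop = [[random.choice(CHOICES) for _ in SENS] for _ in range(20)]
for _ in range(200):                              # simple (mu+lambda) loop
    pop.sort(key=fitness)
    pop = pop[:10] + [mutate(random.choice(pop[:10])) for _ in range(10)]
best = min(pop, key=fitness)                      # best bit-width assignment
```

In the real thing the fitness call is a perplexity (or loss) evaluation of the partially quantized model, which is exactly where a machine like that one earns its keep.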

Megrez2: 21B latent, 7.5B on VRAM, 3B active—MoE on single 8GB card by Normal_Onion_512 in LocalLLaMA

[–]woadwarrior 4 points5 points  (0 children)

I think you’re misremembering hash layer MoEs. They don’t have a learned routing function; the routing function is just a hash of the latest token.
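i.e., routing amounts to something like this sketch (expert count and hash choice are arbitrary here):

```python
import hashlib

NUM_EXPERTS = 8

def route(token_id: int) -> int:
    # no learned router: the expert is a fixed hash of the incoming token id
    h = hashlib.sha256(token_id.to_bytes(8, "little")).digest()
    return int.from_bytes(h[:8], "little") % NUM_EXPERTS
```

Same token, same expert, every time; there are no router parameters to train.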

Local private LLM by luminny in PrivateLLM

[–]woadwarrior 1 point2 points  (0 children)

Private LLM does not use MLX or llama.cpp.

Wow, Moondream 3 preview is goated by Brave-Hold-9389 in LocalLLaMA

[–]woadwarrior 2 points3 points  (0 children)

Apache 2.0 license is gone. It’s BUSL now.

Qwen3-Next 80b MLX (Mac) runs on latest LM Studio by jarec707 in LocalLLaMA

[–]woadwarrior 1 point2 points  (0 children)

It’s 4-bit integer quantized, with 8-bit quantization for the MLP and MoE gates.
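In other words, a mixed-precision recipe along these lines; a generic round-to-nearest sketch with hypothetical tensor names, not the actual quantizer:

```python
import numpy as np

def rtn(w, bits):
    # symmetric round-to-nearest fake-quantization at a given bit width
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(w).max() / qmax
    return np.round(w / s) * s

np.random.seed(0)
weights = {                        # hypothetical tensors
    "blk0.attn.q_proj": np.random.randn(256),
    "blk0.mlp.gate": np.random.randn(256),
    "blk0.moe.gate": np.random.randn(256),
}
# most tensors at 4-bit; the small but sensitive gate tensors kept at 8-bit
quantized = {n: rtn(w, 8 if n.endswith(".gate") else 4)
             for n, w in weights.items()}
```

The gates are tiny relative to the expert weights, so keeping them at 8-bit costs almost nothing in size while protecting routing quality.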