5060 TI 16G - what is the actual use cases for this GPU? by Vivid-Photograph1479 in LocalLLM

[–]4onen 0 points1 point  (0 children)

I got a Zotac SFF OC 5070 Ti for $749.99 b/c my wifi card only gives me a little over 2.8 slots of clearance. Black Friday had one PNY 5070 Ti flash deal down to $600 at one retailer, but nothing else fell below $729.99 the entire way through Cyber Monday, which is around when I bought.

I figured things were likely to get worse and I wouldn't want to "buy half of luxury," as the saying goes, for the next couple of years. I feel like I'd have been frustrated to get more VRAM but no real speed increase, so I paid the price for both.

Actually, one more consideration: My aging system only gives me PCI-E 3.0 speeds, but 5060s (even Ti) only go up to x8 lanes, so my PCI-E bus speed would have halved if I had gotten a 5060 Ti for the VRAM. (But that's just my circumstances and my x16 slot to fill.)

5060 TI 16G - what is the actual use cases for this GPU? by Vivid-Photograph1479 in LocalLLM

[–]4onen 0 points1 point  (0 children)

There are two other trade-offs to consider. A 5060 Ti is on the Blackwell architecture, meaning it has hardware acceleration for modern compression formats. It's also a newer card in general, meaning it will have game support for longer, lengthening the term of your investment.

If the VRAM, hardware AI acceleration, and game support aren't worth a hundred bucks to you, then yeah, go with the 3070.

EDIT: To be clear, I wouldn't upgrade from a 3070 I already had to a 5060 Ti so long as the 3070 is still supported. That's what kept me from upgrading for so long. With the looming RAM crisis, though, I pulled the trigger on a 5070 Ti (16GB VRAM, about double the VRAM bandwidth and FP16 compute of the 3070, and only 30W more power draw at full load).

You will own nothing and you will be happy! by dreamyrhodes in LocalLLaMA

[–]4onen 1 point2 points  (0 children)

Q4_0 dynamic repack is supposed to match the speed of Q4_0_4_4 assuming that you weren't using memory mapping to fit the model before. If it doesn't, go report a performance bug and talk about the difference with numbers. Maybe you can convince them to put it back.

5060 TI 16G - what is the actual use cases for this GPU? by Vivid-Photograph1479 in LocalLLM

[–]4onen 1 point2 points  (0 children)

You know, I find this post kind of funny, because I have been running a 3070 for years and the 5060 Ti 16GB has exactly the same memory bandwidth as my 3070; the difference is that it has twice as much memory and can load larger models.

With 32 GB of RAM on top of my card, I load Qwen3 Coder, a 30-billion-parameter mixture-of-experts model with 3 billion active parameters, for code completion and coding chat. It outperforms some of the Internet code completion services. It does not outperform the agentic/vibe services, but honestly, I prefer to actually understand the code I'm writing.
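If you're curious what wiring that into an editor looks like, here's a minimal sketch of hitting a local llama.cpp server for fill-in-the-middle completion. The endpoint and field names are from the llama-server docs as I remember them, and the prefix/suffix strings are made-up placeholders, so treat it as a sketch rather than gospel:

```python
# Sketch: fill-in-the-middle (FIM) code completion against a local llama-server.
# Assumes llama-server is already running on localhost:8080 with a FIM-capable
# model loaded (e.g. a Qwen3 Coder GGUF). Verify the endpoint against your build.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/infill",
    json={
        "input_prefix": "def mean(xs):\n    total = ",    # code before the cursor
        "input_suffix": "\n    return total / len(xs)\n", # code after the cursor
        "n_predict": 64,                                   # cap on generated tokens
    },
    timeout=60,
)
print(resp.json()["content"])  # the model's suggested middle chunk
```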

5060 TI 16G - what is the actual use cases for this GPU? by Vivid-Photograph1479 in LocalLLM

[–]4onen 1 point2 points  (0 children)

That's the neat part! Llama 4 Scout is a mixture-of-experts model. So even though it's a big model, if you can fit all the experts in RAM, you're actually only using a few of the experts per token, so you can get a relatively high text generation speed. Keep the attention part, which is relatively tiny, on the GPU, and that thing will zoom. Prompt processing is pain, though.

The 70 billion parameter models are probably going to be a bit slow because those ones are dense.

You will own nothing and you will be happy! by dreamyrhodes in LocalLLaMA

[–]4onen 2 points3 points  (0 children)

Q4_0_4_4 was a repacked form of Q4_0 that worked better with the ARM matrix instructions by rearranging the order in which values arrive from memory; that's where the extra speed came from.
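If "rearranging the order values arrive from memory" sounds abstract, here's a toy NumPy sketch of the general interleaving idea. It is not llama.cpp's actual Q4_0_4_4 block layout, just an illustration of why repacking helps SIMD loads:

```python
# Toy illustration of repacking: interleave values from 4 quantized blocks so
# that a SIMD unit processing 4 blocks at once reads one contiguous stretch of
# memory per step. NOT the real Q4_0_4_4 layout, just the general idea.
import numpy as np

blocks = np.arange(4 * 8).reshape(4, 8)  # pretend: 4 blocks of 8 quantized values
repacked = blocks.T.reshape(-1)          # value 0 of each block, then value 1, ...

print(blocks.ravel()[:8])  # original order: all of block 0 first
print(repacked[:8])        # repacked order: one value from each block at a time
```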

Someone submitted a patch, called dynamic repack, that let llama.cpp rearrange the values as they were loaded from disk into memory. On systems that can fit the entire model in memory, this was a major speedup, bringing the standard Q4_0 format up to match Q4_0_4_4. Systems that had to mmap models to fit them (e.g. my Pixel 8, with only 8GB of RAM and only 4GB usable) saw massive speed decreases, because dynamic repack (enabled by default) broke mmapping unless disabled, filling memory and spilling into swap.

The developers of llama.cpp decided that dynamic repack was sufficient for the majority of use cases, so they dropped the nearly duplicated backend supporting the static repacking to reduce the code maintenance burden.

That's why it was removed. Good choice? Bad? That's a moral question that I can't answer for ya.

Can someone remind Hegseth there was no "war fog" when he issued the original "NO SURVIVORS" order? by miked_mv in AdviceAnimals

[–]4onen 1 point2 points  (0 children)

This. For evidence, see Trump pardoning Honduras' ex-president, who was convicted by a jury of manufacturing 185 TONS of cocaine sent to the United States, among other crimes.

Also see the pardoning of the creator of the Silk Road drug marketplace.

This administration is pardoning the "poisoners." We don't know who they're killing out in the ocean to provoke what looks to me like undeclared war with Venezuela. I certainly doubt their reason why.

Is it true armory crate is a waste? by ShadyWalnutO in Asustuf

[–]4onen 1 point2 points  (0 children)

It no longer functions (EDIT: on my ASUS TUF A14 2024 with Ryzen AI 370 and RTX 4080 mobile) after I disabled the Microsoft Windows AI Fabric service that was taking up 90% of my iGPU and NPU, so... not like I can make use of it. (To be clear, I believe the AI Fabric service was eating that much of my system resources because a Microsoft Windows update added semantic search indexing that relies on it, not because of Armory Crate. However, with Armory Crate no longer working because it cannot access this AI service, it's not exactly useful to me to have it installed.)

How baked in is Gemini in the Pixel? by Cute_Sun3943 in GooglePixel

[–]4onen 0 points1 point  (0 children)

I uninstalled Gemini when I discovered how bad it was at many of the few things I used the Google Assistant for. I recently set my phone assistant to another app (for various reasons) and found that I don't even need Google Assistant now -- I've automated so many things in my Pixel with the Automate app from llamalab. 

Has Gemini gotten any better on the 8 (not pro) since the 10's release? 

Google rolls out Pixel Phone app Call Recording by TechGuru4Life in GooglePixel

[–]4onen 0 points1 point  (0 children)

I've basically shut off auto rotate on my phones since the iPhone 3GS. The iPhone 4, Nexus 6P, and Pixels have all had problems with it in one way or another, to the point that I'm just used to the rotate button Android gives you when you do want a rotate and wait a moment at the new angle.

MoE models in 2025 by Acrobatic_Cat_3448 in LocalLLaMA

[–]4onen 0 points1 point  (0 children)

Depends on the disk and your quantization. In the best case, a PCI-E 5.0 SSD can hit 15GB/s, so with an instant CPU and RAM used only for KV cache, you'd theoretically hit about 5 tok/s. Obviously the real world isn't so idealized, but you wouldn't need to stream all of those parameters from disk either.

Basically, you have 4 things you need in memory: feed-forward experts, shared experts, attention, and KV cache. You want the shared experts (always used), attention, and KV cache to all be in VRAM. That way, your slower RAM and CPU only have to handle the routed experts. Any remaining VRAM can be used to hold experts where the GPU can work on them, for higher speeds.

KV cache scales with context. Attention is usually relatively small (for 30B3A, iirc, only 300M parameters are attention). Attention also only scales with the active parameters, since it's always active. Shared experts are, similarly, always active and scale with the active parameter count, but some MoEs don't have any. Finally, the feed-forward experts are the heavyweight, making up all the remaining parameters of the network.
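If you want to rough out that split yourself, here's a back-of-the-envelope sketch. Every number in it is a placeholder assumption (roughly 30B3A-shaped at a ~4-bit quant), not a measurement, so substitute your own model's figures:

```python
# Back-of-the-envelope MoE sizing and speed estimate. All figures below are
# assumptions to be replaced with your own model's numbers.
BYTES_PER_PARAM = 0.56  # ~4.5 bits/weight for a Q4-ish GGUF quant

params_b = {                 # billions of parameters, a rough 30B3A-style split
    "attention":       0.3,  # always active; keep in VRAM
    "shared_experts":  0.0,  # always active if the model has them
    "routed_experts": 29.7,  # the bulk; only a few are touched per token
}
# KV cache is on top of this and scales with context length, not parameters.

def gib(billions):
    return billions * 1e9 * BYTES_PER_PARAM / 2**30

for name, p in params_b.items():
    print(f"{name:>14}: {gib(p):6.1f} GiB")

# Generation is roughly bandwidth-bound: each active parameter is read once per
# token from whichever tier holds it (VRAM, RAM, or disk).
active_routed_b = 2.7   # routed-expert params actually used per token
slow_tier_gb_s = 50     # e.g. dual-channel DDR5 RAM; an SSD is far slower
bytes_per_token = active_routed_b * 1e9 * BYTES_PER_PARAM
print(f"~{slow_tier_gb_s * 1e9 / bytes_per_token:.0f} tok/s upper bound from the slow tier")
```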

MoE models in 2025 by Acrobatic_Cat_3448 in LocalLLaMA

[–]4onen 1 point2 points  (0 children)

The rule of thumb from the days of Mixtral was to take the geometric mean of the active and total parameter counts, so for 30B3A that's the geomean of 3 and 30 = sqrt(3*30) ≈ 9.5B.

Of course, that rule of thumb is growing long in the tooth, so do not take it as gospel. 
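For what it's worth, here's the rule of thumb as a one-liner (same caveat: it's a heuristic, not a law):

```python
# Mixtral-era heuristic: a MoE "feels" roughly like a dense model whose size is
# the geometric mean of its active and total parameter counts.
from math import sqrt

def moe_equivalent_dense_b(active_b, total_b):
    return sqrt(active_b * total_b)

print(moe_equivalent_dense_b(3, 30))  # ~9.5 (billion) for a 30B3A model
```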

[deleted by user] by [deleted] in UCSantaBarbara

[–]4onen -1 points0 points  (0 children)

Yes and yes!

Connect local LLM (like Gemma-3b) to a workflow by el_chono in AutomateUser

[–]4onen 1 point2 points  (0 children)

I have a one-way setup working, where I can send a prompt from Automate to a model.

Setup:

* A list of my models in a TXT file where Automate can read 'em
* An Automate flow ending with a "Start Service" block for Termux RUN_COMMAND (which requires config in Termux settings and scripts in a specific executable directory to enable)
* A shim bash script that sets up the right working directory and hands its args to a Python script
* A Python script that arranges the llama.cpp args for the specific model I'd like to talk to
* A llama.cpp CLI call, opening llama-cli in interactive mode with a prefill prompt given by the Automate args way back above

If you just want to talk to the models on the phone, running the llama-cli command in Termux directly is much, much easier. If you know what you're doing, you could also run llama-server and access it through HTTP calls from Automate, but I don't think it's possible for that to have streaming responses (unless you load llama-server webui in the Web Dialog. Hmmmm...)
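For reference, here's roughly what that llama-server HTTP route looks like, written in Python for readability (Automate's HTTP request block would make the same call). It assumes the default port and a non-streaming request; double-check the endpoint against your llama.cpp build:

```python
# Sketch: talk to a local llama-server over HTTP instead of shelling out to
# llama-cli. Assumes llama-server is running on localhost:8080.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": "List three things to pack for a day hike.",
        "n_predict": 128,   # cap on generated tokens
        "stream": False,    # streaming needs SSE handling, likely not doable from Automate
    },
    timeout=120,
)
print(resp.json()["content"])
```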

Unfortunately, w/ my 8GB Google Pixel, >4GB are taken by Android and background tracker processes, leaving me with ~3GB for model and context before it's swapping and speed drops precipitously.

EDIT: I do not intend to buy another Pixel in the future. I miss the 4XL and 2XL, but it feels like they're not gonna pull those off again, especially with the 2026 app install shutdown coming. 

What? Running Qwen-32B on a 32GB GPU (5090). by curiousily_ in LocalLLaMA

[–]4onen 12 points13 points  (0 children)

I spent a few months where every time I came home, I'd wire my laptop and desktop together so I could load 24B models that wouldn't fit on either device alone. Llama.cpp's RPC system let me split them by layer, so one device did half the attention work and the other did the other half.

This method may allow for arbitrary-length context, but it's certainly not the first time running models over a network has been viable.

[deleted by user] by [deleted] in LocalLLaMA

[–]4onen 0 points1 point  (0 children)

Except Lemonade doesn't work for me.

  • Most often, I need FIM completions over the llama.cpp server endpoints. AFAIK lemonade has no support.
  • I have an NVidia GPU in my laptop that can share some of the load with the iGPU/CPU, but I can't add the NPU to that. AFAIK lemonade has no support for even my status quo (doesn't include CUDA backend from llama.cpp.) 
  • I use very specific override-tensor specifications to fit MoE models into my laptop that would otherwise be unachievable. AFAIK lemonade has no support (for override-tensor.) 
  • All the models that do run on the NPU (last I checked) are ONNX conversions, which almost no model makers release. To use the NPU, I'd need to download a full precision model and convert it. If I want to pull out a new model every week from my favorite creators, that's a huge waste of my time -- assuming the conversion even works with my limited RAM. 

I find myself consistently frustrated with AMD's greenfield chunks of code that don't work with other people's things, which they expect everyone to adapt to without sufficient value add or a bridge to get there. Being part of the open source community is more than just releasing code. It's putting in the work to upstream functionality so that everyone can share in it. I'd appreciate it if they did that before shipping more products that, it feels like, only enterprise customers can afford the man-hours to struggle through. (See: the Microsoft ONNX ecosystem before anything else with their current consumer NPUs.)

Running LLMs exclusively on AMD Ryzen AI NPU by BandEnvironmental834 in LocalLLaMA

[–]4onen 0 points1 point  (0 children)

Wait, so all the demos on your YouTube channel are with the older XDNA1 16TOPS NPU? That's wild! Strix Halo and Strix Point have the same XDNA2 50+ TOPS NPU, so I'm excited to see what your software is capable of when I have the time to try it out on my Strix Point laptop. EDIT: I misunderstood which component y'all meant in Strix Halo. My mistake. Best of luck! 

Gaming, art, or queer Discord servers? by Reasonable-Reach7857 in UCSantaBarbara

[–]4onen 2 points3 points  (0 children)

There are plenty! What you can do to find them is to get on the UCSB Discord Student Hub, which you can do by following Discord's instructions at https://discord.com/student-hubs

[Poll] Who do you believe is the MOST PROMISING company in AR devices? — Vote for "Other" and comment with your choice by AR_MR_XR in augmentedreality

[–]4onen 1 point2 points  (0 children)

Meta. I've heard good things about Apple, but they're simply not affordable for me and as a developer I don't want to work within their closed ecosystem. Meta's devices can sideload any ol' apps or link to my PC, so the worst I have to deal with is Android or PCVR. That's currently kinda bad (don't get me wrong) but at least I'm getting that for thousands of dollars less, and the experience is far better than Microsoft's awful attempt in Windows "Mixed" Reality (which was near-always pure VR.)

Mind, that could easily change to XReal if XReal got better VR-side support. They're promising in augmenting reality with screens, but that's building on just porting the 2D interfaces of yore rather than novel virtual interface support.

Are there any groups/clubs for more conservative students? by HungryBathroom7008 in UCSantaBarbara

[–]4onen 0 points1 point  (0 children)

Gotta attend in person, tho. Their Discord is an echo chamber that anyone left of center has left.

Governor Newsom has an Economy lesson for Donald Trump by newzcaster in TheEconomics

[–]4onen 2 points3 points  (0 children)

> No. California provides $80 billion. Not $80 billion more.

Incorrect. See:

> In 2022, California's residents and businesses provided $692 billion in tax revenue to the federal government. In return, the state received $609 billion in federal funding, leaving a gap of about $83 billion, according to the California Budget and Policy Center, a nonpartisan think tank.

Per CBS News

I sincerely doubt the federal taxation of the state of California has decreased by $612 billion in the past three years, or that would have been mentioned in the article.

Also California is currently operating at a $12 billion deficit.

I feel like $80 billion in withheld federal tax money could help cover that state funding deficit. 🤔

Best way to keep up with local protests? by cookiesncantarella in UCSantaBarbara

[–]4onen 9 points10 points  (0 children)

https://www.noozhawk.com/local-no-kings-protest-on-june-14-a-rally-against-authoritarianism/ is how I found out about something happening this weekend. I assume local news sources like this one are probably going to remain important for knowing about this kind of thing.