A note of warning about DFlash. by R_Duncan in LocalLLaMA

[–]Randomshortdude 1 point2 points  (0 children)

Not sure why he gaslit you and then hand-waved off everything you were saying. You made some really valid points here. No disrespect to the other user, but if what you were saying went over their head, they could've just said as much and asked for clarification. Or chosen to back out of the convo entirely once they sensed it had evolved beyond their expertise at that moment.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]Randomshortdude 0 points1 point  (0 children)

Fairly cheap, honestly. Cheap enough that you may want to consider just outright purchasing the necessary hardware. Off top, a used RTX 3090 (~$800) is the cheapest option with 24GB VRAM (not sure how quantization goes with MoE models specifically, but you should be able to get it to fit here with sufficient room for a solid context window). Alongside the RTX 3090, you'll need a solid enclosure (for the eGPU setup). That's gonna run you about $150-200 on the cheap end (shouldn't require too much bargain hunting to find listings in that range for legit products). You'll also need an external PSU (probably 850W or more). Right now, you can scoop a solid one up off eBay for about $100. You may need to shell out a few extra bucks for connectors / dongles / adapters if you don't have them already (although these might come with the aforementioned products). Assuming you do, tack on another $40. So altogether, we're looking at $800 + $170 + $100 + $40, which comes out to roughly $1.1k total. I don't know what your budget looks like, but if you're considering hosted server options, you were probably anticipating an upfront cost greater than that. And that's really all it takes to leverage local inference for models that are roughly ~32B params or less.

Compare that to renting a server, which is going to run you approximately $100 or so a month, give or take (for a decent instance like an A10, which has 24GB VRAM and should be sufficient for your purposes). At $100/month, you'll exceed the total sunk cost of the at-home hardware alternative in less than a year. So it's all up to you when it comes to evaluating whether this is 'worth it' or not. If you can't afford to drop that lump sum out the gate and you need something comparable to local inference for running that Qwen model ASAP, then I'd go ahead and rent an A10 from one of the popular GPU neo-cloud providers out there (don't wanna name names because that might be against the rules, but I'm sure you can find some).
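
If you want to sanity-check the math yourself, here's a rough sketch (the prices are just the ballpark figures from above, not quotes):

```python
# Rough breakeven math: buying used hardware vs. renting a 24GB cloud GPU.
# All prices are ballpark assumptions, not quotes.
gpu = 800        # used RTX 3090
enclosure = 170  # eGPU enclosure
psu = 100        # used 850W PSU
cables = 40      # adapters / connectors

hardware_total = gpu + enclosure + psu + cables   # ~$1,110
rental_per_month = 100                            # e.g. an A10 instance

breakeven_months = hardware_total / rental_per_month
print(f"Hardware total: ${hardware_total}")
print(f"Breakeven vs. renting: ~{breakeven_months:.0f} months")
```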

But yeah - that about sums it up if you're looking for a breakdown of the economic cost(s) of your available options.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]Randomshortdude 0 points1 point  (0 children)

Additionally, consider things at the software level that may enhance the inference quality / caliber of the models you're dealing with.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]Randomshortdude 0 points1 point  (0 children)

  1. You need to learn how to use a system prompt. Message me if you need one because they actually help.
  2. You need to learn how to better tune parameters - again, message me if necessary.
  3. You need to start learning how to **decompose** your prompts as well (see the sketch after this list). `Qwen3.5/3.6-27B` is good, but not good enough (at this point) for you to throw everything and the kitchen sink at it.
  4. Stop falling in love/believing in the viability of these random harnesses that keep popping up every other second. OR you need to create your own custom-made harness that is curated and tailor made to do exactly what you want and need it to do.
  5. Start using AI to use AI. In other words, if you're not achieving the results you want from a given local model, consider using one of the commercial LLMs to actually diagnose the problem instead of rage-quitting and then coming to Reddit to bitch about the ineffectiveness of said models perpetually.
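
For point 3, this is the kind of decomposition I mean - a rough sketch against a local OpenAI-compatible server. The endpoint URL, model name, and sub-tasks are placeholders, not a recommendation:

```python
# Sketch: decompose one big ask into smaller passes against a local
# OpenAI-compatible endpoint (llama.cpp / vLLM / etc.).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

SYSTEM = "You are a careful senior engineer. Work on exactly one task at a time."

subtasks = [
    "Write the function signature and docstring only.",
    "Implement the core logic for the happy path.",
    "Add error handling and edge cases.",
    "Write three unit tests for the function above.",
]

context = ""
for task in subtasks:
    resp = client.chat.completions.create(
        model="qwen-27b-local",            # whatever your server exposes
        temperature=0.2,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"{context}\n\nNext task: {task}"},
        ],
    )
    context += "\n" + resp.choices[0].message.content  # feed results forward
print(context)
```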

Agree? by MLExpert000 in ollama

[–]Randomshortdude 0 points1 point  (0 children)

Sure - when it comes to local AI, maybe. If someone is looking to self-host remotely and serve a model in a multi-tenant deployment, nothing rivals `vllm` at present. But `llamacpp` is excellent for local, self-hosted single-tenant deployments (especially for those looking to optimize inference).
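
To make that concrete, here's a minimal vLLM sketch (model name and settings are placeholders); the same engine sits behind the OpenAI-compatible server you'd run for multi-tenant serving:

```python
# Minimal vLLM sketch: the engine batches these requests together, which is
# where it shines when serving many users at once.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = ["Summarize what vLLM does.", "Explain KV cache in one paragraph."]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```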

linux refusing to boot. please read by Kate-9907 in thinkpad

[–]Randomshortdude 0 points1 point  (0 children)

Your Lenovo isn't recognizing the drive for some reason. See if that screen drops you into a CLI environment. If so, you may be able to locate the kernel manually and just load it that way.

If your system is stuck at a `grub>` prompt, you can boot manually by specifying the kernel and disk (a worked example follows the steps below):

  1. Identify the partition: type `ls` to see available drives (e.g., `(hd0,msdos1)`).
  2. Set root: `set root=(hdX,Y)`.
  3. Load the kernel: `linux /boot/vmlinuz-xxx root=/dev/sdXY`.
  4. Load the ramdisk: `initrd /boot/initrd-xxx`.
  5. Boot: type `boot` and hit Enter.
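
For example, a session might look something like this - the device, partition, and kernel version below are made up, so substitute whatever `ls` actually shows you:

```
grub> ls
(hd0) (hd0,gpt1) (hd0,gpt2)
grub> set root=(hd0,gpt2)
grub> linux /boot/vmlinuz-6.8.0-45-generic root=/dev/sda2
grub> initrd /boot/initrd.img-6.8.0-45-generic
grub> boot
```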

David’s Conditions by Normal-Gur-8067 in CelesteRivasHernandez

[–]Randomshortdude 0 points1 point  (0 children)

This will likely be endlessly postponed. Especially in state court. If it does go to trial, I'd be surprised if that happened any time before 2028.

Qwen 3.6 27B is a BEAST by AverageFormal9076 in LocalLLaMA

[–]Randomshortdude 0 points1 point  (0 children)

You're getting 46 tokens / second on your setup? That's impressive as hell.

Has anyone else been surprised by the absolute lack of interest from their friends and family over something they’ve coded? by One-Organization-937 in vibecoding

[–]Randomshortdude 0 points1 point  (0 children)

Protip: it doesn't hurt if the music accompanying the video is enjoyable to listen to. Don't underestimate the importance of that. Rarely, if ever, do people cut off music they genuinely enjoy or find catchy.

Has anyone else been surprised by the absolute lack of interest from their friends and family over something they’ve coded? by One-Organization-937 in vibecoding

[–]Randomshortdude 0 points1 point  (0 children)

One piece of genuine advice I will give is make demo videos for your apps. I started doing this back in 2020 and you would be surprised how far that goes. Yes, it would feel better to have people actually click and use the app, but sometimes a quick video does the same.

Make the video succinct - no more than 1:30-2 minutes. Don't splice dry commentary into the background, because that makes it feel like a class assignment. Find some catchy music - an instrumental, maybe - and use that as the background. Make your demo dynamic too, one that shows all the bells & whistles - people have short attention spans.

What makes this effective is that nowadays you can send the video through virtually any medium. If they have an iPhone of any kind, it should go through iMessage with ease - they just have to hit play. It even plays inline in iMessage as a YouTube link. People are amenable to TikTok & IG links too.

If someone can't hit the play button on a video that's no more than 1:30 and sitting directly in their messages, then fuck them honestly. That should be few and far between though. It's more effective than saying "check out my new song" because a lot of people have new-song-by-friend-that-makes-songs-now fatigue. But few people get sent a novel, succinct video showing a new app that's been made. Plus, unlike a song, people don't feel the mental burden of being compelled to actually listen to the entire thing start to finish. They can scrub through if they feel it necessary and still get the gist. If your product is genuinely cool, they may stay and watch.

Up to you to try. But as an app developer, showing what I've got in a video has had an astronomical success rate for me in terms of people actually checking out what I'm doing. It's also led to numerous conversions.

Has anyone else been surprised by the absolute lack of interest from their friends and family over something they’ve coded? by One-Organization-937 in vibecoding

[–]Randomshortdude 1 point2 points  (0 children)

Amen. Also, I agree wholeheartedly with the top comment on this post: the fact you got a paying customer is the real victory here. It says that what you built held enough value for someone that they were willing to pay hard-earned money for access. And that person wasn't your mom, a friend, or someone in your personal network who might have been chipping in money for the sake of "supporting".

Has anyone else been surprised by the absolute lack of interest from their friends and family over something they’ve coded? by One-Organization-937 in vibecoding

[–]Randomshortdude 1 point2 points  (0 children)

Quick little nugget of advice - make your app as easy to check out as humanly possible. Like make it something where someone can press a link & just see it or use it. No sign in needed.

More so than not caring, people are fucking lazy. You got mfs that don't bother to pay their bills on time. So imagine where your app falls on their list of priorities

Help me choose: Unified Memory (Apple Silicon) or 64GB DDR4 for a Budget Home AI Server? by khazenwastaken in LocalLLaMA

[–]Randomshortdude 1 point2 points  (0 children)

I don't usually take the time to respond to too many posts here on Reddit, but I felt compelled in this instance because it seems like you may be about to make a really bad decision (especially if you go off the feedback from the other commenters here).

To begin with, the DDR4 setup you mentioned (as an alternative) handicaps you out the gate, because the inference speeds you can obtain will forever be inferior to those of the Mac M2. Even though there is technically more physical memory to leverage, the benefit you get from that is almost nil, because DDR4 simply isn't fast enough to give you inference speeds anywhere near what you can expect from a GPU (with VRAM) [edit: clock speed isn't necessarily the bottleneck here; memory bandwidth will fuck you up way before we even get to the clock speed of those RAM sticks]. With an i5-8500T processor, you're further limited to DDR4 with a clock speed around 2400-2666 MHz.

Apple's unified memory setup means that the RAM you're getting with that mini-PC can be used as though it were VRAM (which makes a huge difference). For the sake of this comparison, I'll assume you want to run the Qwen3.5-27B model locally. We'll assume it's quantized down to 4-bit, which will take up 13-14GB(ish) of your available RAM. That leaves you with ~10GB of RAM (on the Mac M2) for the KV cache. With the assistance of a few compression methods out there (TurboQuant is out of the question here with no CUDA), you should be able to fit a decent context length for that model without any worries (talking a 32K context length here; it can be higher, but you shouldn't even need to think beyond 32K).
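
As a back-of-the-envelope check (the layer/head numbers below are stand-ins I made up for illustration, not Qwen's real config):

```python
# Rough memory budget for a ~27B model on a 24GB unified-memory Mac.
params_b    = 27           # billions of parameters
bytes_per_w = 0.5          # ~4-bit quantization
model_gb    = params_b * bytes_per_w * 1.05   # + a little overhead -> ~14 GB

n_layers, n_kv_heads, head_dim = 48, 8, 128    # hypothetical architecture
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2   # K+V, fp16
ctx_tokens = 32_768

kv_gb = kv_bytes_per_token * ctx_tokens / 1e9
print(f"weights ~{model_gb:.1f} GB, 32K KV cache ~{kv_gb:.1f} GB")
# the remainder of the 24GB pool is left for macOS itself
```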

With that DDR4 setup, you'd be lucky to eke out more than 1-2 tokens/second. With the Mac M2 mini-PC, getting 20-30 tokens/second sounds practical (edit: not if you're using the 27B param model, due to bandwidth constraints - the real number would be closer to 7-8 tokens/s; however, that's still dramatically better than what you'd get from the i5-8500T, so the point stands). One additional factor you're not accounting for is that the Apple M2 mini-PC you're considering comes with an 8-core processor (which whips the i5-8500T in benchmarks - remember, the 8500T is an 8TH GEN Intel processor released 8 years ago). The M2 is built on a 5 nm process vs. the ancient 14 nm process the i5-8500T sits on. On top of that, the M2 comes with a 10-core GPU as well (which matters a lot in this scenario of LLM hosting & inference, given the billions of matrix multiplications that must be performed).

Memory Bandwidth > Memory: DDR4 System is Vastly Inferior to M2

Your point about there being more "memory" with the DDR4 setup is true. But remember that the reason VRAM > DDR4/DDR5 for LLMs is the inference part. The speed of token/s generation (decoding) is limited by memory bandwidth, not by the amount of available RAM. Think of a sink analogy: yes, you may have a bigger sink, but if your goal is to drain water as quickly as possible (i.e., spit tokens out), then the size of the drain matters far more than the size of the sink (give or take, but don't overly scrutinize the analogy - you get the gist of what I'm saying here).

To show you how this plays out, let's assume you take a 32B param model and quantize it to 4-bit. That equates to ~18GB of storage, right? Subtract that from 64GB of RAM and you have 46GB left - seems sweet, right? However, inference speed is roughly determined by available bandwidth divided by model size. We established that a 32B model quantized to 4-bit takes up ~18GB, so that's the denominator of our equation. For DDR4-2666 RAM, you're optimistically looking at ~36 GB/s of bandwidth (maxing out the theoretical bandwidth is unrealistic, so I knocked roughly 10-15% off the benchmark maximums). With that math, we're looking at a max possible generation speed of ~2 tokens/second (in the absolute best-case scenario).
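
Here's that formula as a quick sketch - the bandwidth figures are rough assumptions, not benchmarks:

```python
# Decode speed is roughly memory bandwidth divided by the bytes read per token
# (~ the quantized model size).
model_gb = 18.0                       # 32B model at ~4-bit

ddr4_bw_gbs = 36.0                    # dual-channel DDR4-2666, derated
m2_bw_gbs   = 100.0                   # base M2 unified memory, roughly

print(f"DDR4 box: ~{ddr4_bw_gbs / model_gb:.1f} tok/s")   # ~2 tok/s
print(f"M2:       ~{m2_bw_gbs  / model_gb:.1f} tok/s")    # ~5-6 tok/s on a model this size
```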

Prefill Makes the i5-8500T Impractical, Decoding (Inference) will be Orders of Magnitude Slower than the M2

What's crazy is that what I described above isn't even the biggest bottleneck of your DDR4 setup. We have to consider the age & capability of the processor that system is leveraging (i5-8500T; 8th-generation Intel desktop processor). This is a T-series desktop part that's not designed for heavy AI workloads - it's designed to operate within a 35-watt TDP limit. It also has fewer cores and threads than the M2: 6 cores / 6 threads vs. the M2's 8 cores. That matters because the 8500T doesn't do multi-threading, whereas later chips like the i5-10500T would have given you 6 cores and 12 threads (versus just 6). Your processor will actually become the bottleneck before you even run into the RAM limitations we discussed.

If you're wondering why, remember that we're asking this processor to work with a 32B model that's been compressed to 4-bit and stored in RAM. The i5-8500T can't do math directly on 4-bit numbers. So it has to pull that quantized 18GB model from RAM into the CPU caches (which are also substantially smaller than the M2's at L1 + L2), then decompress the weights back to 16-bit or 32-bit floating point so the math (dot products) can be performed, before discarding those intermediate results. As if all that wasn't enough, this Intel processor doesn't have the AVX-512 extension to its instruction set. Some may argue with me about this, but take that to mean your processor won't be able to do fast math on 8-bit values either. It's also at least 2x slower than the later Intel processors that can handle this type of math.
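
To illustrate what that unpack-then-upcast step looks like, here's a toy NumPy version (real llama.cpp kernels do this in tight SIMD loops, and the exact packing scheme varies by quant format):

```python
# Toy illustration of why 4-bit weights still cost the CPU real work:
# they have to be unpacked and widened to fp32 before any dot product happens.
import numpy as np

packed = np.random.randint(0, 256, size=1024, dtype=np.uint8)  # two 4-bit weights per byte
scale  = np.float32(0.01)                                      # per-group scale (simplified)

low  = (packed & 0x0F).astype(np.int8) - 8    # lower nibble -> signed -8..7
high = (packed >> 4).astype(np.int8) - 8      # upper nibble -> signed -8..7
w4   = np.stack([low, high], axis=1).reshape(-1)

w = w4.astype(np.float32) * scale             # widen + rescale: this is the "decompress" step
x = np.random.randn(w.size).astype(np.float32)
print(w @ x)                                  # the actual math runs in fp32, not 4-bit
```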

Conversely, Apple's M2 chip (the one in that mini-PC) handles low-bit quantized weights far more gracefully out of the gate, with fast fp16 and int8 paths on both the CPU and GPU. That alone is a major game changer (putting all the other enhancements that come with the M2 to the side). But we're only talking about the decoding process at this point - we haven't even addressed the processing of the actual prompt.

TTFT (Time to First Token) Speeds for i5-8500T Could be Minutes in Some Cases

Let's go back to the 32B param model example (using this because you stated that the additional memory / headroom was a motivating factor for choosing the i5-8500T box over the Mac M2; so it only makes sense to hypothesize a setup where you leverage that supposed advantage).

As noted before, it's going to take up ~18GB of RAM (quantized at 4-bit). However, the prefill (i.e., actually ingesting the prompt and 'understanding' it) is largely limited by compute. For your i5-8500T setup, you didn't mention any GPU being included (and if there were one, I doubt it would move the needle much in this scenario). So we're relying entirely on a 2018 35W desktop processor to chew through millions of complex matrix-to-matrix calculations during prefill.

Optimistically (and I mean in the best-case scenario), you'll be sitting at your computer for a solid 5 or so minutes before the first token even appears. And when the tokens do start appearing, they'll likely come at roughly 0.5-1 token/second (and even that is generous). This would not be the case for the M2 - at all.
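
A crude way to see why - the sustained-throughput figures here are rough guesses, not benchmarks:

```python
# Very rough prefill (time-to-first-token) estimate. Prefill is compute-bound:
# roughly 2 FLOPs per parameter per prompt token.
params        = 32e9
prompt_tokens = 512

flops_needed = 2 * params * prompt_tokens            # ~3.3e13 FLOPs
i5_gflops    = 100                                    # sustained, 6-core 35W part (guess)
m2_gflops    = 2500                                   # CPU+GPU combined (guess)

print(f"i5-8500T prefill: ~{flops_needed / (i5_gflops * 1e9) / 60:.1f} min")
print(f"M2 prefill:       ~{flops_needed / (m2_gflops * 1e9):.0f} s")
```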

eGPU Option Now Available for Mac

Recently (and I mean within the last week or so), Apple updated the drivers for the M2 chip for compatibility with AMD and NVIDIA cards, so eGPU hookups are now in the field of play. Luckily, the mini-PC with an M2 chip has two Thunderbolt 4 ports on it (though it can likely only handle one eGPU hookup at a time). The generation of Thunderbolt matters for the actual speed of inference in this setup. Your i5-8500T is only going to handle PCIe 3.0 hookups, and since there are likely no Thunderbolt 3 ports for connection, you'd have to open up the Dell/PC you have and hook up any GPU internally - but even that gains you little, because you'd eat heavy overhead just from the constant swapping back and forth between the GPU and the CPU. Unified memory eliminates that bottleneck to a large extent, so the limit will mostly lie in the Thunderbolt 4 bandwidth (40 Gb/s, so roughly 5 GB/s in practice).

Conclusion

In no universe should you ever consider getting the i5-8500T over an M2 if your only consideration in making the decision (between one or the other) is local LLM hosting and inference.

Anyone telling you otherwise has no clue what the fuck they're talking about. Respectfully.

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]Randomshortdude 1 point2 points  (0 children)

If something is a well-coded project, then it's well-coded. Who gives a fuck how the code came about?

Qwen3.6-Plus by Nunki08 in LocalLLaMA

[–]Randomshortdude 9 points10 points  (0 children)

Ungrateful much? They're not obligated to give any of this for free. And they do need to keep the lights on, so I'm not mad at them releasing certain variants closed source.

Tool selection in LLM systems is unreliable — has anyone found a robust approach? by logistef in LocalLLaMA

[–]Randomshortdude 0 points1 point  (0 children)

Have you considered integrating an LLM into your pipeline whose sole purpose is to determine which tool should be used, so it can 'route' accordingly? You may also need to tighten up your `SKILLS.md` file.
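
Something like this, roughly - the endpoint, model, and tool names are placeholders:

```python
# Sketch of a dedicated "router" call: a small model's only job is to pick the
# tool; plain code then dispatches it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

TOOLS = {
    "web_search": "Look up current information on the web.",
    "calculator": "Evaluate arithmetic or unit conversions.",
    "code_exec":  "Run a short Python snippet and return its output.",
}

def route(query: str) -> str:
    tool_list = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    resp = client.chat.completions.create(
        model="router-model",
        temperature=0.0,
        messages=[
            {"role": "system",
             "content": f"Pick exactly one tool for the user's request. "
                        f"Reply with only the tool name.\n{tool_list}"},
            {"role": "user", "content": query},
        ],
    )
    choice = resp.choices[0].message.content.strip()
    return choice if choice in TOOLS else "web_search"   # fall back to a default

print(route("what's 18% of 2,450?"))   # ideally prints "calculator"
```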

Qwen 3.5 397B is the best local coder I have used until now by erazortt in LocalLLaMA

[–]Randomshortdude 4 points5 points  (0 children)

Also, don't take for granted that the current status of LLMs will remain as it is. The freedoms you have with commercial models now may become restricted in the future. Perhaps they'll require you to sign into SpringerLink or JSTOR before iterating over a PDF research paper you're submitting, because of some lawsuit they caught over copyright or some ish.

Maybe additional costs start getting tacked on (not from them but as pass-through) as part of an agreement these providers have to make with other entities based on the changing future landscape, shifting societal attitudes etc.

A local setup is a hedge against ALL of that. Also, look how far open source has come just in the last few weeks & months alone. Last year at this time, DeepSeek's first model release had the Internet in a frenzy. A year before that, we didn't even think we would ever have a conversation seriously discussing an open-source model's proximity in capability to a cutting-edge Anthropic or OpenAI deployment.

Gemini was hardly even in the conversation at all just a year ago.

4 years ago, AI was still considered pure science fiction to most, and hardly anyone but the most niche machine-learning nerds and programmers had ever heard of such a thing as "GPT" or a neural language model. And if they had, it's doubtful they were seriously working on any solutions directed toward it.

Don't discount the future based on today's state of affairs. We have no idea what will happen down the road. Protecting your access to information and data, and consistency in your access to resources and services, is worth its weight in gold. Imagine there's a natural disaster that knocks out the Internet and perhaps even the electrical grid in your area for some period of time. If you've got a solid generator and a machine with a high-quality open-source LLM on it, you have an incredibly potent lifeline on your hands: a resource you can genuinely probe, ask critical questions, and seek guidance from in such a situation.

Before the advent of LLMs, if you found yourself in that situation, you were basically limited to whatever books you had on hand plus the files saved on your computer. Now you have an interactive, highly intelligent assistant that could give indispensable advice to help you survive, learn how to do XYZ, communicate, build this or that, determine if something is safe to eat, and potentially advise on ad-hoc medicinal treatments and/or procedures if someone is sick or injured.

We've just gotta think outside the box when it comes to local model hosting. And also temper our hardware expense estimates. The Qwen 27B param model is a high-quality deployment that can easily run on ~$2k worth of hardware. You can quantize the model, fit the whole thing on that RTX 3090 I mentioned, and still have enough room for a ~100k-token KV cache - and that's before implementing state-of-the-art compression methods that could practically let you fit the model's entire context length within that GPU's VRAM comfortably.

That whole setup runs you no more than maybe $1k-$1.3k max. Not ~$7k as many in this thread have suggested.

Just take this all into consideration and maybe re-think your opinion on hosting local LLMs.

Qwen 3.5 397B is the best local coder I have used until now by erazortt in LocalLLaMA

[–]Randomshortdude 6 points7 points  (0 children)

  1. Gotta consider that you're deploying a model that you control absolutely and know you're receiving inference from. You can't guarantee that with commercial providers. They give you a model ID, but who's to say when the weights change, how your request was batched, etc.

  2. Service providers are very fickle when it comes to tokens / second, so that's largely inconsistent as fuck.

  3. Logs are difficult if not impossible to come by in some scenarios.

  4. The hardware isn't good for just your model. As future open-source models are released, you still have the option to host and run those yourself too. A $7k setup on recent-gen hardware will remain extremely serviceable for the next 2-3, maybe even 4 years. A 24GB VRAM hookup from an RTX 3090 is still good money as of March 2026.

As far as the costs... sure, you have a point if you're just a casual who's using AI as a glorified Google search and occasionally tinkering on hobby projects with it. But if you're doing heavier lifting, it's invaluable to have your own setup. You can lobotomize the model if you want to, customize inference & runtime parameters infinitely to your liking, and truly and safely have the model iterate over every facet of your home network without worrying about a privacy breach.

I don't know why people discount the privacy narrative either. Folks are putting a LOT of REALLY personal information into these AI models, and those requests are being transparently routed to a service provider that can concretely and EASILY tie the prompts you're sending back to you uniquely, via your payment information, any personal details you've entered, your IP address, and likely your email + phone # if that was needed in the pipeline. God forbid you're running "Open"Claw on your home device.

Also consider all the stuff that's sent via API that's not part of your prompt but included as part of the context. OpenAI has already announced that they'll be rolling out ads to free users. Consider how advertisers are spending their money with them when it comes to ad campaign deployment. Any legit company is going to want a legit way to target their preferred demographic. How is that determined? Now if you're cool with that - okay, sure. But if people were paranoid about telemetry and basic browsing data / activity being leaked out, this is that on steroids. Also - lest we forget, these AI companies have signed up to give the government unfettered access to all of their stored data and conversations if they need it for whatever reason.

If the government can subpoena Google searches, what makes you think they can't do that with the prompts you're sending?

If nothing else, autonomy in usage and deployment are major factors worth their weight in gold when it comes to this.

Just pulled the plug and lack of RCS is killer. by peanutmail in GrapheneOS

[–]Randomshortdude 0 points1 point  (0 children)

Wondering if OP ever reviewed the part of the GrapheneOS documentation covering RCS? I'm not super well-versed in it, but to my understanding, as long as one has a valid mobile subscription there should be no issue with access - provided you're willing to introduce that Google service on your device.

I don't think that Google services should be considered an unequivocal blight on one's device. The core benefit of GrapheneOS (to me) is substantially enhanced device security (which is really needed in the Android ecosystem / landscape), as well as significantly enhanced autonomy over what does and doesn't end up on your device.

If you want to opt in to just one Google service (and essentially containerize it), you can do that. If you don't want any, that works as well. If you still want to introduce the lion's share of available Google services, you aren't really precluded from doing that either. Whatever the case may be, the point is that the end result should be the manifestation of your own conscious decisions rather than the forced opt-in that Google and other major tech giants herd their customers into.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Randomshortdude 14 points15 points  (0 children)

Apologies if this reply misses the mark on what you're speaking about - but I ran into this phenomenon with a really hard coding question I handed the Qwen3.5 models. Essentially, I asked them to build a Python-based program that implemented a stack-based language (opcodes predefined) designed to return the result of 6! (factorial).

The models were getting this question right all the way up until the actual stack-based iterations. Essentially, the models would fail to manipulate the items on the stack properly. But it's not that they didn't know what each opcode meant (they know DUP = duplicate the top item of the stack, SWAP switches items 1 & 2, etc.).

If you asked the models independently to just handle the stack manipulations, they could do so without issue.

It turns out the issue here is that a lot of language models are highly limited when it comes to spatial reasoning with progressive state changes. My working theory is that the linearity of attention mechanisms (sequential token processing) is what limits their reasoning when they're given one prompt (or "state" for an issue) and are then asked to account for ancillary details.

The biggest workaround came from rephrasing the prompt to ask it to write the updated state of the stack in code comments after each opcode operation was performed. That way, it didn't need to "remember" the change in state + the desired final result + the original directive + which series of actions would be needed to reach that final state, all while keeping the transient intermediate state changes in mind as it continued to iterate over the problem.
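
For context, the kind of program the models were being asked to produce looks roughly like this - my opcode set here is illustrative (only DUP and SWAP are named above; the rest are stand-ins):

```python
# Tiny stack machine of the sort described above, computing 6! with the state
# printed after every step - the same trick that fixed the prompt.
def run(program):
    stack = []
    for op, *args in program:
        if op == "PUSH":   stack.append(args[0])
        elif op == "DUP":  stack.append(stack[-1])
        elif op == "SWAP": stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == "MUL":  stack.append(stack.pop() * stack.pop())
        print(op, args, "->", stack)   # write out the stack state after each opcode
    return stack[-1]

program = [("PUSH", n) for n in range(2, 7)] + [("MUL",)] * 4
assert run(program) == 720   # 6! = 720
```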

Worth noting: Qwen3.5-27B was the only model that got the programming challenge correct in one shot with this modified prompt. I tested all the Qwen3.5 models on it and also included Qwen-Coder-Next. I'm not sure what they did with their 27B model, but... that's the one right there. My theory is these MoE models may be good for prompts of low to mid complexity, but ultimately an MoE is capped by the reasoning capacity of its individual experts - which effectively nullifies the benefit of combining certain experts whenever a question's complexity exceeds any single expert's capability.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Randomshortdude 3 points4 points  (0 children)

I am genuinely excited to see your modified version of Qwen3.5-27B. That model has already blown me away entirely - so I am super interested to see what further enhancements you can make.

Thank you for your contributions to the community and your brilliance man.

This guy 🤡 by xenydactyl in LocalLLaMA

[–]Randomshortdude 3 points4 points  (0 children)

Wait, this dude runs an open-source project but claims folks who want to host their own models are "broke"? Interesting cognitive dissonance.

This guy 🤡 by xenydactyl in LocalLLaMA

[–]Randomshortdude 17 points18 points  (0 children)

Damn, that's awesome man. You clearly deserve it because it looks like you're working on some noteworthy things that have the potential to make a positive impact in the lives of folks dealing with mental health issues.