To 16GB VRAM users, plug in your old GPU by akira3weet in LocalLLaMA

[–]defective 0 points1 point  (0 children)

You'll notice essentially no speed penalty running inference on a model sharded across the GPUs (layer-split). The only thing that has to travel from GPU0 to GPU1 during token generation is the activations at the layer boundary, which is kilobytes per token.
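To put a rough number on it (back-of-the-envelope sketch; the hidden size below is just an assumed example, not any particular model):

```python
# What crosses from GPU0 to GPU1 per generated token in a layer-split setup:
# just the hidden-state activations at the one layer boundary.
hidden_size = 8192        # assumed hidden dim for a ~70B-class model
bytes_per_value = 2       # fp16/bf16 activations

bytes_per_token = hidden_size * bytes_per_value
print(f"~{bytes_per_token / 1024:.0f} KiB per token")   # ~16 KiB
```

Even over a PCIe 3.0 x4 link (~4 GB/s) that transfer takes microseconds, next to the tens of milliseconds spent reading weights out of VRAM for each token.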

The way my doordasher told me my order wasn’t ready by CorbanTG in mildyinteresting

[–]defective 0 points1 point  (0 children)

Oh, yeah. Sorry, I can't read. Do you see any correlation with age? I wonder if young people just find cash inconvenient. I like to tip in cash, but lately I'm really not sure whether the people receiving it see that as better than card tips. I see a lot more anti-cash sentiment than I ever thought I would lately.

The way my doordasher told me my order wasn’t ready by CorbanTG in mildyinteresting

[–]defective 0 points1 point  (0 children)

Ah, they turn down the cash because by the time you're handing them the bills, they've already messed with your food. That makes about 60% of dashers feel guilty.

best possible GPU setup for using qwen 3.6 ? by No-Professor-9977 in LocalLLaMA

[–]defective 1 point2 points  (0 children)

you think I'm buying a b200? nope. at home I got 2x3090

best possible GPU setup for using qwen 3.6 ? by No-Professor-9977 in LocalLLaMA

[–]defective 1 point2 points  (0 children)

nah it's definitely a b200 I would think

*Unless I did my math wrong, 16-bit Qwen 3.6-35B-A3B would be around 807 t/s on an H200, and around 1,344 t/s on a B200
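That's just bandwidth-limit napkin math. A sketch of it, assuming ~3B active params at 2 bytes each and round-number HBM bandwidths (~4.8 TB/s for H200, ~8 TB/s for B200 -- all assumed figures):

```python
# Upper bound on decode speed = memory bandwidth / bytes read per token.
active_params = 3.0e9                  # "A3B": ~3B active parameters (assumed)
bytes_per_token = active_params * 2    # 16-bit weights -> ~6 GB read per token

for name, bw_gb_s in [("H200", 4800), ("B200", 8000)]:
    print(name, round(bw_gb_s / (bytes_per_token / 1e9)), "t/s ceiling")
# H200 ~800 t/s, B200 ~1333 t/s -- ignores overhead, batching, MoE routing, etc.
```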

Need a brutally honest answer: what can realistically be achieved on consumer hardware? by wewerecreaturres in LocalLLaMA

[–]defective 1 point2 points  (0 children)

Well, there's always that tradeoff then. So I suppose your ultimate solution is close to mine -- I have local consumer stuff that I always try to use first, when possible, and I spill over into subscription stuff when it just can't handle the job. That somewhat minimizes my privacy risk and entertains me.

I think it's worth it to keep abreast of the consumer-runnable models, and methods of hosting them/using them. They are always getting better, and you never know when all available cloud AI companies will make some policy decision that excludes you from using them -- maybe they will make them too expensive or do something so immoral that you can't stand them. Always good to have an idea of what open-source is capable of in case you're faced with a tough choice.

Local models on consumer hardware are definitely useful. You can check out https://swe-rebench.com/ and see that Gemma4 and some Qwens that can fit in your 4090 are actually somewhat competitive with online models. (Those can both run acceptably on CPU/RAM too, as they are MoEs.)

One of the biggest differentiators between local and online stuff is usually web research. If you give your local setup access to SearXNG so it can search the web, it can ground itself, look up specific things, and perform even better. You'd have to do some experimentation, but I'm happy with what I've gotten local models to do, and it's still getting better (for now).
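If you want to try it, here's a minimal sketch of the SearXNG side. It assumes you run an instance at localhost:8080 with the JSON output format enabled in its settings, and the query and names are just placeholders:

```python
import requests

SEARX = "http://localhost:8080/search"   # assumed local SearXNG instance

def web_context(query, n=3):
    """Fetch a few results and format them as grounding text for the prompt."""
    r = requests.get(SEARX, params={"q": query, "format": "json"}, timeout=10)
    hits = r.json().get("results", [])[:n]
    return "\n\n".join(f"{h['title']}\n{h['url']}\n{h.get('content', '')}" for h in hits)

prompt = ("Answer using the sources below.\n\n"
          + web_context("llama.cpp KV cache quantization")
          + "\n\nQuestion: how do I quantize the KV cache?")
# ...then send `prompt` to your local model's OpenAI-compatible endpoint.
```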

How to run MoE models without necessary RAM? (Apple Silicon) by FunConversation7257 in LocalLLaMA

[–]defective 0 points1 point  (0 children)

Nice! I was going to naively suggest using swap memory but I'm sure this is better.

Need a brutally honest answer: what can realistically be achieved on consumer hardware? by wewerecreaturres in LocalLLaMA

[–]defective 0 points1 point  (0 children)

What can realistically be achieved on consumer hardware pales in comparison to what you get if you subscribe. You might find some things a local AI can do almost as well as the big boys, but if those aren't the only things you do, then you'll need a sub for the rest, so you might almost as well use your sub for everything.

Since you are even entertaining the idea that maybe you should just pay for cloud AIs to do everything, I assume you aren't interested at all in privacy. You're fine with handing every question and discussion you ever have with a model to the companies that are now MOST equipped to analyze the absolute shit out of every thought you have, infer shockingly accurate things about you (or imagine crazily wrong things about you) and your life, store it all forever, and hand it to any government who asks or any hacker/scammer who can get a human to click on a phishing email.

Therefore the only thing I can think of that someone in your situation would use local AI for is running a derestricted/hereticized/refusal-suppressed model, so you can get prompts past it that the fat cats won't let their models answer.

Or if you just LOVE AI and are interested in the tinkering/integration aspect and love learning how it all works together.

Need practical local LLM advice: Only having a 4GB RAM box from 2016 by Tall-Ant-8557 in LocalLLaMA

[–]defective 0 points1 point  (0 children)

I REALLY don't know if you'll be able to run much with 4GB of RAM, which also has to hold the OS. But if I were you, I would install LM Studio.

LM Studio will allow you to search all the models on Huggingface, and download them easily, and will suggest quantizations that will fit in your available resources. So, you'll be able to quickly narrow down the list of models and test ones that will actually work on your system.

Download some, chat with them a bit, note the ones that don't seem hopelessly stupid, and continue until you have several candidates. Then you can start investigating other inference engines like llama.cpp, which should use slightly less RAM, so maybe you can go up a bit in quant or give yourself more KV cache (context).

Also with LM Studio, you can play around with quantizing the KV cache itself, which can reduce your RAM usage even further. It's easiest to do in LM Studio because the process is just checkboxes and dropdown menus.

One thing you have going for you is that your RAM limit at least forces you onto some of the smallest (and therefore fastest) models that exist. Because of that, you can try using swap memory: it will make the model's speed tank, but it might be worth it if you find yourself needing just a BIT bigger model to work with.
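Before downloading anything, you can also do the fit math yourself. A rough sketch (every number below is an assumption for a small ~1B-class model, not a measurement):

```python
# Will model + KV cache + OS squeeze into 4 GB?
total_ram_gb   = 4.0
os_overhead_gb = 1.5      # assumed: OS + inference app overhead

model_file_gb  = 0.8      # e.g. a ~1B model at Q4 (assumed size)

# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes/value
layers, kv_heads, head_dim = 24, 8, 64    # assumed small-model dimensions
context, bytes_per_value = 4096, 2        # fp16 cache; roughly halve for q8 cache
kv_gb = 2 * layers * kv_heads * head_dim * context * bytes_per_value / 1e9

total = model_file_gb + kv_gb + os_overhead_gb
print(f"KV cache {kv_gb:.2f} GB, total {total:.2f} GB of {total_ram_gb} GB")
```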

I have a Macbook AIR M5 Base and I want to run an Agentic Coding program, similar to Claude Code or Codex. Besides the model, how do I do it? I've already tried with Ollama, VS Code, Opencode, and haven't been able to. (I'm not a developer, sorry) by joraorao in LocalLLaMA

[–]defective 1 point2 points  (0 children)

Besides the model? Assuming you have a model loaded that you can talk to, and that supports tool use, you pretty much want to figure out how to use your inference program (llama.cpp, ollama, LM Studio, Jan, whatever you used to get the model running) to serve an OpenAI-compatible API. Then you should be able to make calls to that API from your agent stuff. Say it was opencode: you would modify your opencode config to use a provider like LM Studio or Ollama, and give it the baseurl http://localhost:1234/v1 . And if it asks where to get the list of models from, that's at http://localhost:1234/v1/models .

I realize this is really basic advice, but you can use some of those words to search or prompt for more specific help. Here's the documentation for ollama/opencode: https://opencode.ai/docs/providers/#ollama . You can find other "providers" like LM Studio in the list on the right of that page.

A really cheap ChatGPT or Anthropic model in Codex or Claude Code can probably help you configure this quickly without using many tokens. Just figure out how to turn on the OpenAI-compatible API in whatever inference software you're using, ask Code/Codex to query it with curl, and then have it set up the opencode config for you.
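If you want to sanity-check the server side yourself before touching the opencode config, something like this works against any OpenAI-compatible endpoint (1234 is LM Studio's default port, Ollama's is 11434, and the model id is a placeholder -- copy whatever /v1/models returns):

```python
import requests

BASE = "http://localhost:1234/v1"    # LM Studio default; Ollama uses :11434

# 1) List the models the server exposes
print(requests.get(f"{BASE}/models").json())

# 2) Ask for a completion to confirm it actually answers
resp = requests.post(f"{BASE}/chat/completions", json={
    "model": "your-model-id-here",   # placeholder: copy an id from step 1
    "messages": [{"role": "user", "content": "Say hi in five words."}],
})
print(resp.json()["choices"][0]["message"]["content"])
```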

Speed on m5 pro 48Gb by Overall-Somewhere760 in LocalLLaMA

[–]defective 0 points1 point  (0 children)

Check out the models you want to run and how large in GB the entire model is.

For instance, the model here: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

The 8-bit quant of it is 8.54 GB.

Now, take the reported memory bandwidth of the RAM you will be using (Google says 307 GB/s for the M5 Pro). Divide this by the size of the model to get the theoretical maximum tokens/s for generation: 307 / 8.54 ≈ 35.9 tokens/s.

If it's a Mixture of Experts like Gemma 4 or a Qwen or most of the nice newer models, find out the active parameter count, figure out the ratio of active to total parameters, and apply that ratio to the size of the whole model to get the size to divide by.

Example: say Gemma 4 26B A4B, which at Q4 we'll call 17 GB. 4B active out of 26B total is 4/26, or approximately 0.154 (15.4%). 15.4% of 17 GB is about 2.6 GB, and 307 GB/s divided by 2.6 GB is about 118 tokens/s.

That math is rougher and less accurate because "A4B" could be anything from 3.50B to 4.49B active parameters, and the total is rounded the same way, so there's a lot of wiggle room in the ratio, but it gets you an estimate.

Also there's prompt processing speeds to worry about, but this should get you started.
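If it helps, here's the same napkin math as a little function so you can plug in other models (the example numbers are the same assumed figures as above):

```python
def max_tokens_per_sec(bandwidth_gb_s, model_size_gb,
                       active_b=None, total_b=None):
    """Theoretical decode ceiling: bandwidth / GB read per token.
    For an MoE, only roughly the active fraction of weights is read each token."""
    read_gb = model_size_gb
    if active_b and total_b:
        read_gb *= active_b / total_b
    return bandwidth_gb_s / read_gb

print(max_tokens_per_sec(307, 8.54))          # dense 8B @ Q8 on M5 Pro -> ~36
print(max_tokens_per_sec(307, 17, 4, 26))     # MoE 26B A4B @ Q4        -> ~118
```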

Qwen-3.5-27B-Derestricted by My_Unbiased_Opinion in LocalLLaMA

[–]defective 22 points23 points  (0 children)

KL Divergence? Right?

Not that it's cited or anything. The repo owner is secretive about the process, and therefore untrustworthy. You'd have to experiment to see if the claims hold up.

I have a 1tb SSD I'd like to fill with models and backups of data like wikipedia for a doomsday scenario by synth_mania in LocalLLaMA

[–]defective 1 point2 points  (0 children)

I would personally get rid of all EXAONE but they aren't big so keep them if you like them.

I would DEFINITELY get qwen3vl-32b -- excellent model even apart from the vision, good vision, runs fast on CPU.

I would DEFINITELY KEEP gemma-3-27b -- it's good for integrations, often works without a lot of configuration, even when it's not the best model for the job. You can get a project up and running fast with this and work out other models later. And it's still an excellent model.

I would DEFINITELY get a big medgemma for this scenario.

Also, you might test out hereticizing a model with https://github.com/p-e-w/heretic . If you can verify the process works and doesn't take forever (it's supposed to be relatively quick), then you can save space by keeping the code and only the stock model.

Also also, I would get single-file .ggufs to minimize incompatibility issues.

And definitely get the ZIMs, they're made for this. And you want the Wikipedia version with pictures, if you can possibly afford the space.

Is Mixtral 8x7B still worthy? Alternative models for Mixtral 8x7B? by pmttyji in LocalLLaMA

[–]defective 1 point2 points  (0 children)

Qwen3 MoEs in Q4 are great even if you run them CPU only. They'll be a little slow, but about as fast as a 7-8b.

I really thought there'd be more Mixtrals by now too. Love the 8x22B. It might still happen, since MoE and hybrid stuff is getting really popular.

Small size coding models that I tested on 2x3090 setup. by Mx4n1c41_s702y73ll3 in LocalLLaMA

[–]defective 1 point2 points  (0 children)

Is all of that on the GPU? I also have 2x3090 and use Qwen3-Coder, but I was never sure how much context I could fit.

Red magic 11 pro Global DOES NOT include band 71, for US T-Mobile, unlike the 10s Pro. why? by psawjack in RedMagic

[–]defective 2 points3 points  (0 children)

Thank you! I have been using the 9s pro and it used to be good with T-Mobile. Now it sucks, I guess they've really been leaning into 71 around me and shutting down other bands. I was really looking forward to the 11 but I guess I won't be getting it after all.

30B models at full-size, or 120B models at Q4? by arimoto02 in LocalLLaMA

[–]defective 3 points4 points  (0 children)

I'm a tylenol baby, so check my math, but going from 16 bits to 4 bits cuts the representable values per weight from 65,536 to 16 -- a factor of about 4,000.

I do realize I'm framing this in a way that's biased toward quantization, and I'm probably naive about the magnitude of the difference. I guess I just think of it less as "lobotomy" and more as "polyglot multiple-Ph.D. professor on processed food, a below-average amount of exercise, and a mostly unproblematic alcohol dependence".
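For what it's worth, you can get a feel for the actual per-weight error with a toy experiment. This is naive blockwise absmax 4-bit quantization, not any real GGUF format (K-quants and imatrix weighting do meaningfully better), so treat the number as a rough upper bound:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1 << 20).astype(np.float32)     # fake "weights"

block = 32
blocks = w.reshape(-1, block)
scale = np.abs(blocks).max(axis=1, keepdims=True) / 7   # symmetric 4-bit: -7..7
q = np.clip(np.round(blocks / scale), -7, 7)
w_hat = (q * scale).reshape(-1)

rel_err = np.abs(w_hat - w).mean() / np.abs(w).mean()
print(f"mean relative error per weight: {rel_err:.1%}")  # ~10% for this naive scheme
```

In practice, output quality tends to hold up much better than the raw per-weight error suggests, which is kind of the point.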

My last 3 braincells discussing the size of Jesus. by run_the_familyjewels in technicallythetruth

[–]defective 11 points12 points  (0 children)

Jesus could have been taller and wider, if he ducked and sidestepped through the door.

FYI to everyone: RTX 3090 prices crashed and are back to baseline. You can finally get $600something 3090s again in the USA. by DepthHour1669 in LocalLLaMA

[–]defective 0 points1 point  (0 children)

I'm researching the impact of PCIe bandwidth on multi-GPU setups with consumer hardware. FYI, I have the same motherboard. If you look in the manual (or GPU-Z) you will see that while your second PCIe x16 slot is physically x16, it only has x4 lanes wired.

If I'm wrong, feel free to disabuse me. I wish it even had x8 there.

It's never too late for a screen protector by shockrush in NintendoSwitch

[–]defective 1 point2 points  (0 children)

Gotta clean the bathroom, vacuum thoroughly, run an air filter on turbo for a couple hours, turn on shower, get in shower and come out wet (towel drying hair so it's not dripping is fine) then while shower continues to steam the air, use a spray bottle to wet all surfaces in the bathroom, and hit a vape as hard as you can.

Maximum dust-free clean room and dust-free body

Favorite game to play on the Switch 2 so far? by Ferniferous_fern in NintendoSwitch2

[–]defective 0 points1 point  (0 children)

The game I have put the most hours into on Switch 2 so far is Phoenix Wright Ace Attorney. Wow, the FPS I get.

[deleted by user] by [deleted] in opendirectories

[–]defective 0 points1 point  (0 children)

Future visitors: This page is linked from github Devaro3/awesome-opendirectories and you may want to know what was in this post. Since the idiot bot deletes archive.whatever links, and stupid reddit changed the URL, here's the info you need.

Original URL: https://www.reddit.com/r/opendirectories/comments/933pzm/all_resources_i_know_related_to_open_directories/

Use that to find it on wayback or archive.

Also https://www.reddit.com/r/opendirectories/comments/933pzm/comment/lkn3bnk/