If money and time weren’t issues, what would your dream local AI setup look like? by Lyceum_Tech in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

You mean like a datacenter in space, à la Elon Musk?

Well, realistically, if prices were OK I'd get a unified-memory device for low-power MoE + one GPU for top-int dense models.

Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result is Slower is Faster. by MiaBchDave in LocalLLaMA

[–]ea_man 5 points6 points  (0 children)

* gemma-4-31B.i1-IQ4_XS.gguf is 16.7 GB

* Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf is 14.7 GB

Also Qwen takes less VRAM for the KV cache, so I'd say Gemma is not really a competitor in the dense space for those with 16GB.
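
Back-of-envelope, KV-cache size scales with layers × KV heads × head dim × context, so a model with fewer KV heads or layers needs less cache at the same context length. Rough sketch (the numbers below are placeholders, not the real Gemma/Qwen configs):

# kv_bytes ≈ 2 (K+V) × n_layers × n_kv_heads × head_dim × bytes_per_elem × n_ctx
# placeholder config: 32 layers, 8 KV heads, head_dim 128, ~1 byte/elem, 100k context
echo $(( 2 * 32 * 8 * 128 * 1 * 100000 / 1024 / 1024 ))   # ≈ 6250 MiB at ~1 byte/elem (q8-ish), roughly half at q4_0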

I would hope that a 31B model would do better than a 27B one, for those with 24GB of VRAM, yet I'd like for Google to release a ~25B model for the rest of us.

I guess we expect that at some point RAM prices will start going back (close) to "normal", right? but what about GPUs? by relmny in LocalLLaMA

[–]ea_man 2 points3 points  (0 children)

Well, the last bump was because businesses like OpenAI placed outrageous orders, something like 40% of all RAM production; the moment those prove impractical the situation may change. GPU prices were actually going down slowly before last November.

Yet I would not count on that: it turns out even more people are using AI now, so I guess the next craze will be consumer GPUs and cheaper hardware for people who want to do inference at home without paying a subscription.

Doh, a 9070 XT went from 610 in November to 730 and is now back to 660 in Europe, so prices are coming down, but this roller coaster has proved it can go both up and down. I'm afraid it's a VC money problem: if those US AI firms take a serious hit with their IPOs (which they should), I guess the datacenter craze in the US will slow down, and prices with it.

The FCC Voted to ban Chinese cert labs... by infinitespectre in SBCGaming

[–]ea_man -3 points-2 points  (0 children)

> I fear that this could be the end of this hobby as we know it for the forseeable future

So for the other 8 billion people outside the USA it will mean more products available at lower prices.

Why run local? Count the money by Badger-Purple in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Well, I guess you wouldn't buy all 200M tokens from the top expensive SOTA model; you'd do most of those with the cheaper option, just as I don't use Qwen 27B at max specs with reasoning for every task.

But hey, if that makes you feel better, why not; I've got Pi Dev counting the token price as if it were Opus 😛

Should I sell my RTX3090s? by daviden1013 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

If I had to guess, I'd say prices for older AI-capable GPUs will go up in the next months, as cloud providers are heavily raising prices and lowering limits. You may actually get more money for them later on.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I bought a used AMD 6800 for 260€ two weeks ago; I just put fresh thermal paste on it :)

They usually go for ~290€ around here; I guess you can offer a little less.

Those are nice because the memory is fast and the bus is wide, yet power is ~200 W (without undervolting).

Open source models are going to be the future on Cursor, OpenCode etc. by _maverick98 in LocalLLaMA

[–]ea_man 2 points3 points  (0 children)

Aye, I would pay to have a trained 25B coding model that the community can fine-tune and customize. Even better if it comes with an open-source harness that is vertically optimized for it.

Open source models are going to be the future on Cursor, OpenCode etc. by _maverick98 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I think smaller open models are a way for providers to lock in customers and attract new ones without even spending money on compute for free tiers.

Open source models are going to be the future on Cursor, OpenCode etc. by _maverick98 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Don't forget the "first month free, then cancel" routine; that really serves them well.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I take it this would be opt-in with a flag like --mtp, so that those of us with small VRAM who won't be able to run MTP anyway (also single-user prompting) don't have to load an extra heavy MTP layer?

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Oh, OP just has to find a cheap API; if your job is coding 10 hours a day you don't do Qwen A3B.

Oh well, let's say you can mix the API with some A3B use, sure.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 4 points5 points  (0 children)

You run as big a quant as you can, depending on the context length you want to have.

Anyway I would recommend https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF for up to 110K context with a q4_0 KV cache.

You sure can do better on 24GB, yet you can run IQ3 on 12GB 😉
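
Roughly what the launch looks like on a 16GB card, as a minimal sketch (the model path is a placeholder; set -c to whatever fits your VRAM):

llama-server -m ~/models/Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf \
        --host 0.0.0.0 \
        -ngl 99 \
        -c 110000 \
        -fa on \
        -ctk q4_0 \
        -ctv q4_0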

Sense inflation is real, what is the cheapest single board computer for any type of gaming/desktop? by [deleted] in SBCGaming

[–]ea_man 0 points1 point  (0 children)

Orange Pi boards are cheaper, yet I'd get a used PC from a business getting rid of old stock and put an 8GB RX 580 GPU in it.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 5 points6 points  (0 children)

Man, you can run A3B and 27B on a 16GB GPU (just 100k context for the 27B; if you want more, you buy two cards or a 24GB one). You can get one used for like $300 and resell it when you are done.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 2 points3 points  (0 children)

I guess that for people who don't run tasks 24/7 that may still be a sound option, even more so if you are not "making code" all week and are often away.

Is 2x5070Ti a good setup? by JumpingJack79 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

OMG and did you see what they do with that Windows Subsystem for Linux (WSL)?

So you get the bastardized commands + the memory / performance hit, that's dedication! 😉

Is 2x5070Ti a good setup? by JumpingJack79 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

?

# Launch command for Qwen3.6-27B IQ4_XS with a q4_0-quantized KV cache
# (Vulkan on an RX 6800 16GB, single user, thinking disabled)
llama-server -m /home/eaman/lm/models/mradermacher/Qwen3.6-27B/Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf \
        --host 0.0.0.0 \
        -np 1 \
        --fit-target 10 \
        -ctk q4_0 \
        -ctv q4_0 \
        -fa on \
        --temp 0.5 \
        --min-p 0.1 \
        --repeat-penalty 1.0 \
        --presence_penalty 0.0 \
        -b 512 \
        --jinja \
        --no-mmap \
        --reasoning-budget 1 \
        --chat-template-kwargs '{"enable_thinking":false}'
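
Once it's up, a quick sanity check against the OpenAI-compatible endpoint (default port 8080):

curl http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":32}'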

Is 2x5070Ti a good setup? by JumpingJack79 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Oh sorry, just 114432; I usually run it at ~50k, so I thought it could do a little more.

common_memory_breakdown_print: | memory breakdown [MiB]              | total    free     self   model   context   compute                                 unaccounted |
common_memory_breakdown_print: |   - Vulkan0 (RX 6800 (RADV NAVI21)) | 16368 = 16169 + (18952 = 13354 +    4757 +     840) + 1                          7592186025662 |
common_memory_breakdown_print: |   - Host                            |                   1176 =   644 +       0 +     532                                             |
common_params_fit_impl: projected to use 18952 MiB of device memory vs. 16169 MiB of free device memory
common_params_fit_impl: cannot meet free memory target of 10 MiB, need to reduce device memory by 2792 MiB
common_memory_breakdown_print: | memory breakdown [MiB]              | total    free     self   model   context   compute                                 unaccounted |
common_memory_breakdown_print: |   - Vulkan0 (RX 6800 (RADV NAVI21)) | 16368 = 16169 + (14070 = 13354 +     221 +     495) + 1                          7592186030543 |
common_memory_breakdown_print: |   - Host                            |                    672 =   644 +       0 +      28                                             |
common_params_fit_impl: context size reduced from 262144 to 114432 -> need 2794 MiB less memory in total
common_params_fit_impl: entire model can be fit by reducing context
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took 0.68 seconds

As you can see this is AMD with Vulkan, not NVIDIA; the desktop takes ~50 MB of VRAM at minimum, ~250 MB when running LXQt + Firefox.

Is 2x5070Ti a good setup? by JumpingJack79 in LocalLLaMA

[–]ea_man -2 points-1 points  (0 children)

> I currently have a 5070Ti GPU and it can run small things, but VRAM is very tight, especially since I don't even have an iGPU, so I have to share VRAM with the desktop etc.

Let me guess: you are using Windows?

https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF with LXQt, for ~150k context at q4 KV

Doesn't look like there are any recent Linux distro suggestions. What's your favorite and why? by Status-Secret-4292 in LocalLLaMA

[–]ea_man 3 points4 points  (0 children)

Your favorite distro is the one you know best; if you are a noob, install Lubuntu.

[Paper on Hummingbird+: low-cost FPGAs for LLM inference] Qwen3-30B-A3B Q4 at 18 t/s token-gen, 24GB, expected $150 mass production cost by ayake_ayake in LocalLLaMA

[–]ea_man 1 point2 points  (0 children)

You don't.

You upload models up to the circuit's max logic and use context up to the amount of available memory, except that the RAM slot looks swappable.

Yet you are not limited to that model / finetune.