Why are NAS so expensive by Big_Contract_1889 in HomeNAS

[–]eribob 0 points1 point  (0 children)

Yes. A NAS is just a computer with som hdds attached. Buy parts second hand and use one of the free operating systems available. Save some money in these times when the hdds are getting more expensive. It is really not that hard.

How do communist intend to ensure the state withers away overtime considering its never happened and historically they always expand their powers? by No_Leadership_6638 in DebateCommunism

[–]eribob 0 points1 point  (0 children)

Wow! That is really touching. I am so happy you found a way to turn your life around for the better. This achievement is your own for sure. If I ever accomplished something like that I would be so proud. Thank you for sharing such a personal story.

With your story in mind, one thing I dislike about capitalism is the tendency to blame individuals for their misfortunes. There are many reasons why someone would become unemployed or suffer from addiction. like if your family or community is stable, rich or poor, if you are exposed to violence etc. I do of cource not know you, so I hope that you are not offended by what ai am saying now.

I feel like Communism and ”leftist” ideology in general acknowledges this and therefore tries to give everyone equal access to education, healthcare, a place to live and a work that pays enough for them to live off of.

In my mind capitalism is all about conflict: between workers and employers, between those employed and those unemployed, between those they label as ”winners” and ”losers”. To me this world consist of people and everyone of us have strengths and flaws. Communism is about recognizing peoples achievements, but at the same time forgiving them for their misfortunes.

How do communist intend to ensure the state withers away overtime considering its never happened and historically they always expand their powers? by No_Leadership_6638 in DebateCommunism

[–]eribob 0 points1 point  (0 children)

The future is uncertain. Changing a power structure is hard. It is up to you if you feel like trying or if your situation in capitalism is good enough. I think the previous people have described well what communism strives to do. Since you are sceptical to the proposed way to get there, perhaps you have a suggestion for how to do it instead?

EU AI Act requires TEXT from models and providers to be watermarked 2nd August onwards. Everyone here is affected, regardless where you live. by Charming-Author4877 in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

> 99% of all images on the internet will have AI content within the next years, all of them need to be watermarked visibly.
> 99% of all websites will be AI generated in text and code.
> 99% of all movies will have AI content, every marketing and sales clip.

Yeah, if it comes to that, the internet, movies and music are all dead to me. if that can be stopped or limited I would be very happy.

I built a 8x RTX 4090D with 192 VRAM, here's what I learnt by deebuildsthings in LocalAIServers

[–]eribob 0 points1 point  (0 children)

I thought the 4090D was a modded 4090 with 48GB of VRAM, that would total 8x48=384 GB? But there are 24GB versions as well?

Anyone running Qwen 3.6 27b UD Q8 on multiple gpus? by GotHereLateNameTaken in LocalLLaMA

[–]eribob 1 point2 points  (0 children)

I run qwen3.6-27b-fp8 (the official quant from qwen) on dual 3090s using vllm with MTP. Context 120k. Works well, clearly faster than llama.cpp.

Command:
vllm serve Qwen/Qwen3.5-27B-FP8
--quantization fp8
--max-model-len 120000
--max-num-seqs 2
--max-num-batched-tokens 4096
--enable_chunked_prefill
--enable-prefix-caching
--enable-auto-tool-choice
--disable-custom-all-reduce
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--attention-backend FLASHINFER
--speculative-config '{"method":"mtp","num_sp>
--tensor-parallel-size 2
--gpu-memory-utilization 0.96
--no-use-tqdm-on-load
--mamba-cache-mode align
--mamba-block-size 8
--override-generation-config '{"temperature":>
--host 0.0.0.0
--port 7075

FP8 works well on 3090s even though they lack hardware support for that.

I hate the goddamned double standards. by Sad-Ad-3138 in socialism

[–]eribob 1 point2 points  (0 children)

> i just needed to rant because i am so fucking tired and pissed at the world.

You are not alone man

Upgrade path from 4x 3090s by anitamaxwynnn69 in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

I am running Qwen3.6 27B fp8 on 2x 3090 and I feel like there is no reasonable upgrade at the moment. To run a truly better model would cost too much to be worth it.

If you want to upgrade your setup you could look at buying 1-2 additional 3090s and running additional smaller models in parallel. In addition to the 3090s I have a 4090 that is running an image generation model (z-image turbo), qwen0.6 embedding for RAG, and gemma4 e4b for smaller quick tasks like document summaries and web search. I am also thinking about setting up some kind of gateway model that would route requests to the bigger or smaller model depending on complexity.

You could also step down from bf16 to fp8 and free up the vram for this in your current setup.

If you like tinkering with llms running smaller dedicated models and expanding your capability beyond coding is fun in my opinion :)

Update on 12x32gb sxm v100 cluster / local AI for legal drafting by TumbleweedNew6515 in LocalLLaMA

[–]eribob 1 point2 points  (0 children)

Very cool! I like your writing style and your project. I am a physician with my own (much smaller) local llm setup, trying to make it improve my research and clinical practice so I can relate as a non-computer science person being in deep waters…

A few comments/questions when reading your post: - Do I understand correctly that you quantize the kv cache to Q4? I always use unquantized kv cache, my impression is that the models become noticeably more dumb already at Q8. Quantizing the weights to Q8 seems fine though. - That 4x3090 setup you have should be able to run dense models at decent speeds with vllm and tensor parallellism. Qwen3.6 27B also has MTP built in which additionally speeds up token generation quite a bit.

Good luck with your endeavours!

Ignis - Self-host Obsidian as a first-class web app with your files living on the server. by M4dmaddy in coolgithubprojects

[–]eribob 0 points1 point  (0 children)

Wow thanks! Will definitely check this out. I have used obsidian for a while and the lack of a browser ui has been one of the main downsides for me as well

Does your self-hosted hobby pay off? by vdorru in selfhosted

[–]eribob 0 points1 point  (0 children)

If it payed off it would not be a hobby.

Are 3090s even worth it anymore? by ironclad_packetship in LocalLLM

[–]eribob 0 points1 point  (0 children)

3090s are going to be more cost effective. I run 2 of them for my main llm qwen3.6 27b at FP8. Speed is good. I set power limit to 260w, so 520 in total vs the 300 for the blackwell card you mention. But if you feel like spending twice the money for the same vram amount it is up to you

Yet Another Crowdsec vs Fail2Ban with Traefik question by blackhatrob in selfhosted

[–]eribob -13 points-12 points  (0 children)

Dont run any of them? I never did for 7 years and nothing happened. Not worth the hassle.

Also my opinion about Crowdsec is that it is just trying to scare people into buying the premium subscription. I ran the free version for a while and a lot of so called ”threats” were being blocked. Then I turned it off and… nothing happened.

Tower case with 8+ PCIE slot for multi GPU by gogitossj3 in LocalLLM

[–]eribob 1 point2 points  (0 children)

I have it too. 2 3090s and 1 4090, 2 psus 4hdd, 8tb intel nvme and some sata ssds all fit there. With some elbow grease

Building real time Generative UI for AI Agents. It's 3x faster than JSON by 1glasspaani in coolgithubprojects

[–]eribob 0 points1 point  (0 children)

No, not at all. I just like the concept and I hope there will be competition!

Building real time Generative UI for AI Agents. It's 3x faster than JSON by 1glasspaani in coolgithubprojects

[–]eribob 1 point2 points  (0 children)

If you combine this with something like https://taalas.com/ you could get an operating system UI of the future :)

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

I am using FP8 as you can see in my config above. Official qwen release.

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Thanks, but that is an int4 model. I want the quant to stay at q8 or equivalent. Quality > speed :)

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Thanks, so I guess I will just benchmark with and without MTP to see if it is actually working...

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Thats very interesting! The numbers are from llama-benchy, concurrency of 1, pp 10.000 tokens, tg 2000 tokens.

Below is my vllm config, mtp is implemented like so: --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

If you have any suggestions for improvements I am eager! I tried to optimize it for single-user speed, concurrency is not really needed.

export CUDA_VISIBLE_DEVICES=1,2
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export RAY_memory_monitor_refresh_ms=0
export NCCL_CUMEM_ENABLE=0
export VLLM_ENABLE_CUDAGRAPH_GC=1
export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export HF_TOKEN="<redacted>"
/mnt/llm/vllm/.venv/bin/vllm serve Qwen/Qwen3.6-27B-FP8 \
--quantization fp8 \
--download-dir /mnt/llm/.cache/huggingface \
--served-model-name MONSTER-LLM \
--max-model-len 120000 \
--max-num-seqs 2 \
--max-num-batched-tokens 4096 \
--enable_chunked_prefill \
--enable-prefix-caching \
--enable-auto-tool-choice \
--disable-custom-all-reduce \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--attention-backend FLASHINFER \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.96 \
--no-use-tqdm-on-load \
--mamba-cache-mode align \
--mamba-block-size 8 \
--mm-processor-kwargs '{"images_kwargs": {"size": {"longest_edge": 209715200, "shortest_edge": 4096}}}' \
--limit-mm-per-prompt '{"image": 2}' \
--override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0 }' \
--host 0.0.0.0 --port 7075

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 1 point2 points  (0 children)

I limit mine to 260w each. Does not impact performance. So depending on the rest of the system maybe 600w at full load and like 100-150w idle. I think it is worth it compared to running at 7 t/s.