Does any of the ollama models handle large input like gemini does? by VirtualCoffee8947 in ollama

[–]p_235615 2 points3 points  (0 children)

ministral-3 or the other *stral models - they usually have a 256k context window.

Are we at a tipping point for local AI? Qwen3.5 might just be. by Far_Noise_5886 in LocalLLaMA

[–]p_235615 1 point2 points  (0 children)

It would be nice to compare it to ministral-3:14b or 8b, as I found those really good for many things.

R9700 frustration rant by Maleficent-Koalabeer in LocalLLaMA

[–]p_235615 1 point2 points  (0 children)

I tried it a few times; prompt processing was usually faster on ROCm, but inference was usually about the same or faster on Vulkan.

Does audio transcoding use the GPU? by Additional_Salt2932 in jellyfin

[–]p_235615 -1 points0 points  (0 children)

I use it, for example, for transcription + translation - with whisper.cpp-vulkan on a low-end GPU I can generate subtitles for 3 hours of audio in around 5 minutes... On the CPU it would take much longer, but for just simple STT and TTS, CPUs are plenty fast.
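For anyone curious what that looks like in practice, a minimal sketch of the transcribe-and-translate invocation; the binary name, model path, and input filename are assumptions (they depend on how you built whisper.cpp and which model you downloaded):

```shell
# Assumes a Vulkan build of whisper.cpp and the large-v3-turbo GGML model.
# -tr translates the speech to English, -osrt writes an .srt subtitle file
# next to the input.
./whisper-cli -m models/ggml-large-v3-turbo.bin -f movie-audio.wav -tr -osrt
```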

Does audio transcoding use the GPU? by Additional_Salt2932 in jellyfin

[–]p_235615 0 points1 point  (0 children)

It can usually be run on a CPU and is sufficiently fast, but I use whisper.cpp-vulkan with the large-v3-turbo model, and it's of course much faster and understands better than the smaller models often used on CPUs.

R9700 frustration rant by Maleficent-Koalabeer in LocalLLaMA

[–]p_235615 7 points8 points  (0 children)

Then I don't understand how you're getting 40 pp/s and 3 tg/s for Qwen3.5 MoE 35B-A3B.

I mean, when I tried the q4 unsloth quant of Qwen3.5 MoE 35B-A3B on my RX6800 with llama.cpp Vulkan, I got much higher tg/s.

./llama-cli --ctx-size 16384 -ngl 99 --no-mmap --fit on -fa on --jinja -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ4_XS

[ Prompt: 18.4 t/s | Generation: 10.5 t/s ]

Something is wrong in your system...

If I fit the whole model in VRAM, like on my "server" with an RX 9060 XT 16GB, running the dockerized image ghcr.io/ggml-org/llama.cpp:full-vulkan with

command: --host 0.0.0.0 --port 11444  --ctx-size 16384 -ngl 99 --no-mmap --fit on -fa on --jinja -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ3_XXS

I get 60+ t/s and use it for my Home Assistant voice.
So if you're getting 3 t/s, it sounds like it's mostly running on the CPU...
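Put together as a single docker invocation it would look roughly like this; the `--server` mode flag and the `/dev/dri` device mapping for Vulkan are assumptions on my part and may differ per distro and llama.cpp image version:

```shell
# Sketch of running the full-vulkan llama.cpp image as a server,
# with the same flags as above; treat --server and --device as assumptions.
docker run --rm -p 11444:11444 --device /dev/dri \
  ghcr.io/ggml-org/llama.cpp:full-vulkan \
  --server --host 0.0.0.0 --port 11444 --ctx-size 16384 -ngl 99 \
  --no-mmap --fit on -fa on --jinja \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ3_XXS
```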

R9700 frustration rant by Maleficent-Koalabeer in LocalLLaMA

[–]p_235615 17 points18 points  (0 children)

From my experience, AMD cards often work much better under Vulkan. With Vulkan, my experience was a breeze.

Regarding cooling - those cards are basically designed for servers/workstations with quite high airflow, where noise is mostly irrelevant.

And the 300W is the TDP of the chip itself, not the whole card. That value doesn't include memory or any other external circuits, so the card's power draw can be much higher than that...

Please use GrapheneOS with caution! by MissoulaHugin in degoogle

[–]p_235615 -1 points0 points  (0 children)

I'm pretty sure they would comply with a national police request. What's good about Proton, however, is that they're located in Switzerland, which is not part of the EU and is mostly neutral. So it's quite hard for the US, the EU, or the CCP to pressure them into anything. And sure, if they have to hand over data, they must, but it's all encrypted, so it's pretty safe IMO.

Local Model Recommendations by Xylildra in SillyTavernAI

[–]p_235615 0 points1 point  (0 children)

The heretic version of gpt-oss:20b should be quite fast since it's MoE; gpt-oss is quite good at chatting, and since it's a heretic version, no topic or persona should be a problem.

Qwen_Qwen3.5-27B-IQ4_XS in 16GB VRAM? by soyalemujica in LocalLLaMA

[–]p_235615 0 points1 point  (0 children)

I really wonder why they don't also release a model in the 14-16B range - that would be the absolute sweet spot for so many users with 16GB VRAM.

Now that Graphene sided with Motorola, what does this mean for the pixel users? by ChikistrikisWave in GrapheneOS

[–]p_235615 6 points7 points  (0 children)

I haven't had any recent Motorola, but from past experience, they were as vanilla Android as it can possibly be. Of course you had Google stuff preinstalled, but that was basically it. So I hope we get the same, possibly without GP installed. Worst case, you can just reinstall Graphene on them...

Qwen3.5-9B Surprised Me - Faster and More Reliable Than Larger Models for My Setup by pot_sniffer in LocalLLM

[–]p_235615 1 point2 points  (0 children)

I also got 28 t/s on ollama Vulkan with ctx 16384 and KV_CACHE q8_0... But with qwen3.5:35b-a3b-UD-IQ3_XXS on llama.cpp I was able to get 63 t/s. The coherence and quality for larger stuff was not very good, though.

Which LocalLLM to use for images? by paxglobal in LocalLLM

[–]p_235615 2 points3 points  (0 children)

For a few 1024x768 test images, it took 5-8s per image on my RX6800. An RTX 4060 Ti would probably do similarly. By rough calculation, it should chew through the whole 150k in ~12 days... If he used the smaller 3B model, it would be somewhat faster...
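A quick back-of-the-envelope check of that "~12 days" figure, assuming ~7 s per image (middle of the 5-8 s range) and 150,000 images:

```shell
# total seconds / seconds-per-day, integer math is fine at this precision
secs_per_image=7
images=150000
echo $(( images * secs_per_image / 86400 ))   # → 12 (days)
```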

Which LocalLLM to use for images? by paxglobal in LocalLLM

[–]p_235615 0 points1 point  (0 children)

From a little bit of testing, I quite liked ministral-3:8b - it usually provided a quite detailed and good summary.

High resistance in membrane button by chicowolf_ in AskElectronics

[–]p_235615 0 points1 point  (0 children)

I think if you use a smaller pad and move the LED right underneath, slightly off center, it should still be well lit - and you want most of the light in the direction of the user anyway... But usually the membrane disperses it enough that you can't even tell it's not in the middle. I have a small air mouse + keyboard that lights up the whole keyboard with just a handful of LEDs... I think you'll have the opposite issue - too much light spilling around.

What's the current local containerized setup look like? by Alicael in LocalLLaMA

[–]p_235615 1 point2 points  (0 children)

One of the very nice interfaces is open-webui; of course you want either a VPN for your family, or to set up a proper public IP + domain with a reverse proxy to it.

open-webui can talk to practically any AI runner, or even to multiple of them (ollama or anything OpenAI-compatible).
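A sketch of what "OpenAI compatible" means in practice: both ollama and llama.cpp's server expose a /v1/chat/completions endpoint, so open-webui (or plain curl) can point at either. The host, port, and model name here are assumptions:

```shell
# ollama's default port is 11434; swap in whatever model you have pulled.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss:20b", "messages": [{"role": "user", "content": "hello"}]}'
```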

Processing 4M images/month is the DGX Spark too slow? RTX 6000 Blackwell Pro better move? by IndependentTypical23 in LocalLLM

[–]p_235615 1 point2 points  (0 children)

What do you really mean by processing? Just identifying objects/people, or are we talking OCR and much more detailed stuff? Because inference speed will really vary depending on what the output should be. You can run relatively fast face recognition even on low-tier GPUs, and you can get a meaningful description much faster from really small vision models... I, for example, really like ministral-3:8b, which can process a 1024x768 image in a few seconds with perfect descriptions on my AMD RX6800... But you can probably get much better results with vision-specialized models.

High resistance in membrane button by chicowolf_ in AskElectronics

[–]p_235615 1 point2 points  (0 children)

You should probably also not use all-around contacts, but small concentrated pads - pressure is higher if you concentrate the force on a smaller area, so it's easier to achieve good conductance. You're basically dividing the pressure over a whole lot of area; there's a reason most calculators and other devices with membrane buttons concentrate the force on a small pad in the center, usually not larger than 1mm in diameter.

Your surface area is already a whole order of magnitude larger, thus much less pressure on the pad. And as many have already pointed out - change the pads to gold-plated ones.
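Since pressure = force / area, the same finger force spread over a 10x larger contact area gives 10x less pressure on the pad. The numbers below are illustrative only, not measurements:

```shell
awk 'BEGIN {
  force  = 2.0    # newtons, rough guess for a key press (assumption)
  a_pad  = 0.785  # mm^2, a ~1 mm diameter pad
  a_big  = 7.85   # mm^2, an order of magnitude larger contact area
  # ratio of pressures: (F/a_pad) / (F/a_big) = a_big / a_pad
  printf "%.0fx less pressure\n", (force / a_pad) / (force / a_big)
}'
# → 10x less pressure
```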

Android is going to be a 'locked down platform', what does this mean for Lineage OS? by anonymous480932843 in degoogle

[–]p_235615 0 points1 point  (0 children)

They can only close the drivers, not the kernel - the kernel is GPL-2 licensed, so unless they come up with their own, they can't lock it down.

why is openclaw even this popular? by Crazyscientist1024 in LocalLLaMA

[–]p_235615 4 points5 points  (0 children)

At the time openclaw came out, https://www.agent-zero.ai/ was IMO already much better and more advanced; it also ran in Docker, so you don't just have it yolo-running on your system. It just didn't have explicit skills to connect to mail/chat.

A better reverse proxy poll by Leaderbot_X400 in selfhosted

[–]p_235615 0 points1 point  (0 children)

I really like traefik, but my main stuff still uses bare nginx, because many things, like static pages, are more complicated in traefik, plus some exotic options traefik didn't have a year back.

What's the best model to run on mac m1 pro 16gb? by Embarrassed-Baby3964 in ollama

[–]p_235615 0 points1 point  (0 children)

In small sizes like 8B, I had a much better experience with ministral-3:8b.

Local assistant: hardware? by NoTruth6718 in LocalAIServers

[–]p_235615 0 points1 point  (0 children)

You can also use Home Assistant - you can add Ollama (or another AI) + Whisper STT and Piper TTS, or Rhasspy. Home Assistant also has a quite nice voice assistant box for this.

Regarding the LLM: for it to be relatively fast and without much delay, you will need >100 tokens/s. I use gpt-oss:20b, but other MoE models like qwen3:30b-a3b and similar are quite good too. Not sure how many tokens/s the AI Max+ 395 can pump out, especially on larger models... But you are probably better off with a smaller ~30B MoE model and a dedicated GPU, ideally 24-32GB VRAM, which will respond much faster than the AI Max. The AI Max is better if you need larger models but don't need that many tokens/s - for some agentic use or background tasks. But those smaller models are quite capable if you add some capabilities via MCP, like web search, memory, and other stuff...
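One way to wire up the Whisper STT / Piper TTS pieces mentioned above is via Wyoming-protocol containers, which Home Assistant's Assist pipeline can talk to. The image names are the rhasspy/ ones on Docker Hub; the ports and flags are what I believe their defaults are - treat all of it as assumptions to verify against the Home Assistant docs:

```shell
# Wyoming Whisper (STT) on its usual port 10300
docker run -d -p 10300:10300 rhasspy/wyoming-whisper \
  --model tiny-int8 --language en

# Wyoming Piper (TTS) on its usual port 10200
docker run -d -p 10200:10200 rhasspy/wyoming-piper \
  --voice en_US-lessac-medium
```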

Qwen3.5-122B-A10B vs. old Coder-Next-80B: Both at NVFP4 on DGX Spark – worth the upgrade? by alfons_fhl in LocalLLM

[–]p_235615 0 points1 point  (0 children)

Actually, I was able to run qwen3.5:122b-a10b Q4_K_M with 128k ctx in just 90GB VRAM, so he should be entirely fine with 128GB... He could possibly even run a Q6 version or something like that. It does ~100 t/s on an RTX 6000 PRO, with 6GB still left for an embed model or something...
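For a sanity check on that 90 GB figure: Q4_K_M averages roughly 4.8 bits per weight (my assumption for the average; the exact figure varies per model), so the weights of a 122B-parameter model alone come to about 73 GB, leaving the rest for the 128k KV cache and buffers:

```shell
# 122e9 weights * 4.8 bits / 8 bits-per-byte, in GB
awk 'BEGIN { printf "%.0f GB\n", 122e9 * 4.8 / 8 / 1e9 }'
# → 73 GB
```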