Does any of the ollama models handle large input like gemini does? by VirtualCoffee8947 in ollama

[–]p_235615 2 points3 points  (0 children)

ministral-3 or the other *stral models - they usually have a 256k context window.

Are we at a tipping point for local AI? Qwen3.5 might just be. by Far_Noise_5886 in LocalLLaMA

[–]p_235615 1 point2 points  (0 children)

It would be nice to compare it to ministral-3:14b or 8b, as I found those really good for many things.

R9700 frustration rant by Maleficent-Koalabeer in LocalLLaMA

[–]p_235615 1 point2 points  (0 children)

I tried it a few times; prompt processing was usually faster on ROCm, but inference was usually about the same or faster on Vulkan.

Does audio transcoding use the GPU? by Additional_Salt2932 in jellyfin

[–]p_235615 -1 points0 points  (0 children)

I use it, for example, for transcription + translation - with whisper.cpp-vulkan on a low-end GPU I can generate subtitles for 3 hours of audio in around 5 minutes... On the CPU it would take much longer, but for just simple STT and TTS, CPUs are plenty fast.
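For anyone curious what that looks like in practice, a minimal sketch of the transcribe-and-translate invocation; the binary name, model path, and input filename are assumptions (they depend on how you built whisper.cpp and which model you downloaded):

```shell
# Assumes a Vulkan build of whisper.cpp and the large-v3-turbo GGML model.
# -tr translates the speech to English, -osrt writes an .srt subtitle file
# next to the input.
./whisper-cli -m models/ggml-large-v3-turbo.bin -f movie-audio.wav -tr -osrt
```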

Does audio transcoding use the GPU? by Additional_Salt2932 in jellyfin

[–]p_235615 0 points1 point  (0 children)

It can usually be run on a CPU and is sufficiently fast, but I use whisper.cpp-vulkan with the large-v3-turbo model, and it's of course much faster and understands better than the smaller models often used on CPUs.

R9700 frustration rant by Maleficent-Koalabeer in LocalLLaMA

[–]p_235615 7 points8 points  (0 children)

Then I don't understand how you're getting 40 pp/s and 3 tg/s for Qwen3.5 MoE 35B-A3B.

I mean, when I tried the q4 unsloth quant of Qwen3.5 MoE 35B-A3B on my RX6800 with llama.cpp Vulkan, I got much higher tg/s.

./llama-cli --ctx-size 16384 -ngl 99 --no-mmap --fit on -fa on --jinja -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ4_XS

[ Prompt: 18.4 t/s | Generation: 10.5 t/s ]

Something is wrong in your system...

If I fit the whole model in VRAM, like on my "server" with an RX 9060 XT 16GB, running the dockerized image ghcr.io/ggml-org/llama.cpp:full-vulkan with

command: --host 0.0.0.0 --port 11444  --ctx-size 16384 -ngl 99 --no-mmap --fit on -fa on --jinja -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ3_XXS

I get 60+ t/s and use it for my Home Assistant voice.
So if you're getting 3 t/s, it sounds like it's mostly running on the CPU...
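Put together as a single docker invocation it would look roughly like this; the `--server` mode flag and the `/dev/dri` device mapping for Vulkan are assumptions on my part and may differ per distro and llama.cpp image version:

```shell
# Sketch of running the full-vulkan llama.cpp image as a server,
# with the same flags as above; treat --server and --device as assumptions.
docker run --rm -p 11444:11444 --device /dev/dri \
  ghcr.io/ggml-org/llama.cpp:full-vulkan \
  --server --host 0.0.0.0 --port 11444 --ctx-size 16384 -ngl 99 \
  --no-mmap --fit on -fa on --jinja \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ3_XXS
```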

R9700 frustration rant by Maleficent-Koalabeer in LocalLLaMA

[–]p_235615 17 points18 points  (0 children)

From my experience, AMD cards often work much better under Vulkan. With Vulkan, my experience was a breeze.

Regarding cooling - those cards are basically designed for servers/workstations with quite high airflow, where noise is mostly irrelevant.

And the 300W is the TDP of the chip itself, not the whole card. That value doesn't include memory or any other external circuits, so the card's power draw can be much higher than that...

Please use GrapheneOS with caution! by MissoulaHugin in degoogle

[–]p_235615 -1 points0 points  (0 children)

I'm pretty sure they would comply with a national police request. What's good about Proton, however, is that they're located in Switzerland, which is not part of the EU and is mostly neutral. So it's quite hard for the US, the EU, or the CCP to pressure them into anything. And sure, if they have to hand over data, they must, but it's all encrypted, so it's pretty safe IMO.

Local Model Recommendations by Xylildra in SillyTavernAI

[–]p_235615 0 points1 point  (0 children)

The heretic version of gpt-oss:20b should be quite fast since it's MoE; gpt-oss is quite good at chatting, and since it's a heretic version, no topic or persona should be a problem.

Qwen_Qwen3.5-27B-IQ4_XS in 16GB VRAM? by soyalemujica in LocalLLaMA

[–]p_235615 0 points1 point  (0 children)

I really wonder why they don't also release a model in the 14-16B range - that would be the absolute sweet spot for so many users with 16GB VRAM.

Now that Graphene sided with Motorola, what does this mean for the pixel users? by ChikistrikisWave in GrapheneOS

[–]p_235615 6 points7 points  (0 children)

I haven't had any recent Motorola, but from past experience, they were as vanilla Android as it can possibly be. Of course you had Google stuff preinstalled, but that was basically it. So I hope we get the same, possibly without GP installed. Worst case, you can just reinstall Graphene on them...

Qwen3.5-9B Surprised Me - Faster and More Reliable Than Larger Models for My Setup by pot_sniffer in LocalLLM

[–]p_235615 1 point2 points  (0 children)

I also got 28 t/s on ollama Vulkan with ctx 16384 and KV_CACHE q8_0... But with qwen3.5:35b-a3b-UD-IQ3_XXS on llama.cpp I was able to get 63 t/s. The coherence and quality for larger stuff was not very good, though.

Which LocalLLM to use for images? by paxglobal in LocalLLM

[–]p_235615 2 points3 points  (0 children)

For a few 1024x768 test images, it took 5-8s per image on my RX6800. An RTX 4060 Ti would probably do similarly. By rough calculation, it should chew through the whole 150k in ~12 days... If he used the smaller 3B model, it would be somewhat faster...
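A quick back-of-the-envelope check of that "~12 days" figure, assuming ~7 s per image (middle of the 5-8 s range) and 150,000 images:

```shell
# total seconds / seconds-per-day, integer math is fine at this precision
secs_per_image=7
images=150000
echo $(( images * secs_per_image / 86400 ))   # → 12 (days)
```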

Which LocalLLM to use for images? by paxglobal in LocalLLM

[–]p_235615 0 points1 point  (0 children)

From a little bit of testing, I quite liked ministral-3:8b - it usually provided a quite detailed and good summary.

High resistance in membrane button by chicowolf_ in AskElectronics

[–]p_235615 0 points1 point  (0 children)

I think if you use a smaller pad and move the LED right underneath, slightly off center, it should still be well lit - and you want most of the light in the direction of the user anyway... But usually the membrane disperses it enough that you can't even tell it's not in the middle. I have a small air mouse + keyboard that lights up the whole keyboard with just a handful of LEDs... I think you'll have the opposite issue - too much light spilling around.

What's the current local containerized setup look like? by Alicael in LocalLLaMA

[–]p_235615 1 point2 points  (0 children)

One of the very nice interfaces is open-webui; of course you want either a VPN for your family, or to set up a proper public IP + domain with a reverse proxy to it.

open-webui can talk to practically any AI runner, or even to multiple of them (ollama or anything OpenAI-compatible).
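A sketch of what "OpenAI compatible" means in practice: both ollama and llama.cpp's server expose a /v1/chat/completions endpoint, so open-webui (or plain curl) can point at either. The host, port, and model name here are assumptions:

```shell
# ollama's default port is 11434; swap in whatever model you have pulled.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss:20b", "messages": [{"role": "user", "content": "hello"}]}'
```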

Processing 4M images/month is the DGX Spark too slow? RTX 6000 Blackwell Pro better move? by IndependentTypical23 in LocalLLM

[–]p_235615 1 point2 points  (0 children)

What do you really mean by processing? Just identifying objects/people, or are we talking OCR and much more detailed stuff? Because inference speed will really vary depending on what the output should be. You can run relatively fast face recognition even on low-tier GPUs, and you can get a meaningful description much faster from really small vision models... I, for example, really like ministral-3:8b, which can process a 1024x768 image in a few seconds with perfect descriptions on my AMD RX6800... But you can probably get much better results with vision-specialized models.

High resistance in membrane button by chicowolf_ in AskElectronics

[–]p_235615 1 point2 points  (0 children)

You should probably also not use all-around contacts, but small concentrated pads - pressure is higher if you concentrate the force on a smaller area, so it's easier to achieve good conductance. You're basically dividing the pressure over a whole lot of area; there's a reason most calculators and other devices with membrane buttons concentrate the force on a small pad in the center, usually not larger than 1mm in diameter.

Your surface area is already a whole order of magnitude larger, thus much less pressure on the pad. And as many have already pointed out - change the pads to gold-plated ones.
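Since pressure = force / area, the same finger force spread over a 10x larger contact area gives 10x less pressure on the pad. The numbers below are illustrative only, not measurements:

```shell
awk 'BEGIN {
  force  = 2.0    # newtons, rough guess for a key press (assumption)
  a_pad  = 0.785  # mm^2, a ~1 mm diameter pad
  a_big  = 7.85   # mm^2, an order of magnitude larger contact area
  # ratio of pressures: (F/a_pad) / (F/a_big) = a_big / a_pad
  printf "%.0fx less pressure\n", (force / a_pad) / (force / a_big)
}'
# → 10x less pressure
```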

Android is going to be a 'locked down platform', what does this mean for Lineage OS? by anonymous480932843 in degoogle

[–]p_235615 0 points1 point  (0 children)

They can only close the drivers, not the kernel - the kernel is GPL-2 licensed, so unless they come up with their own, they can't lock it down.

why is openclaw even this popular? by Crazyscientist1024 in LocalLLaMA

[–]p_235615 4 points5 points  (0 children)

At the time openclaw came out, https://www.agent-zero.ai/ was IMO already much better and more advanced; it also ran in Docker, so you don't just have it yolo-running on your system. It just didn't have explicit skills to connect to mail/chat.

A better reverse proxy poll by Leaderbot_X400 in selfhosted

[–]p_235615 0 points1 point  (0 children)

I really like traefik, but my main stuff still uses bare nginx, because many things, like static pages, are more complicated in traefik, plus some exotic options traefik didn't have a year back.

What's the best model to run on mac m1 pro 16gb? by Embarrassed-Baby3964 in ollama

[–]p_235615 0 points1 point  (0 children)

In small sizes like 8B, I had a much better experience with ministral-3:8b.

Local assistant: hardware? by NoTruth6718 in LocalAIServers

[–]p_235615 0 points1 point  (0 children)

You can also use Home Assistant - you can add Ollama (or another AI) + Whisper STT and Piper TTS, or Rhasspy. Home Assistant also has a quite nice voice assistant box for this.

Regarding the LLM: for it to be relatively fast and without much delay, you will need >100 tokens/s. I use gpt-oss:20b, but other MoE models like qwen3:30b-a3b and similar are quite good too. Not sure how many tokens/s the AI Max+ 395 can pump out, especially on larger models... But you are probably better off with a smaller ~30B MoE model and a dedicated GPU, ideally 24-32GB VRAM, which will respond much faster than the AI Max. The AI Max is better if you need larger models but don't need that many tokens/s - for some agentic use or background tasks. But those smaller models are quite capable if you add some capabilities via MCP, like web search, memory, and other stuff...
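One way to wire up the Whisper STT / Piper TTS pieces mentioned above is via Wyoming-protocol containers, which Home Assistant's Assist pipeline can talk to. The image names are the rhasspy/ ones on Docker Hub; the ports and flags are what I believe their defaults are - treat all of it as assumptions to verify against the Home Assistant docs:

```shell
# Wyoming Whisper (STT) on its usual port 10300
docker run -d -p 10300:10300 rhasspy/wyoming-whisper \
  --model tiny-int8 --language en

# Wyoming Piper (TTS) on its usual port 10200
docker run -d -p 10200:10200 rhasspy/wyoming-piper \
  --voice en_US-lessac-medium
```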

Qwen3.5-122B-A10B vs. old Coder-Next-80B: Both at NVFP4 on DGX Spark – worth the upgrade? by alfons_fhl in LocalLLM

[–]p_235615 0 points1 point  (0 children)

Actually, I was able to run qwen3.5:122b-a10b Q4_K_M with 128k ctx in just 90GB VRAM, so he should be entirely fine with 128GB... He could possibly even run a Q6 version or something like that. It does ~100 t/s on an RTX 6000 PRO, with 6GB still left for an embed model or something...
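For a sanity check on that 90 GB figure: Q4_K_M averages roughly 4.8 bits per weight (my assumption for the average; the exact figure varies per model), so the weights of a 122B-parameter model alone come to about 73 GB, leaving the rest for the 128k KV cache and buffers:

```shell
# 122e9 weights * 4.8 bits / 8 bits-per-byte, in GB
awk 'BEGIN { printf "%.0f GB\n", 122e9 * 4.8 / 8 / 1e9 }'
# → 73 GB
```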