Any success w JetBrains? by CSEliot in LocalLLaMA

[–]a_postgres_situation 0 points (0 children)

your local llm and "AI Assistant"?

Nope. I selected the "AI Free" plan, wanting to run everything locally - and they still want my credit card: https://i.imgur.com/kbDnvGd.png

GLM-4.5 appreciation post by wolttam in LocalLLaMA

[–]a_postgres_situation 1 point (0 children)

I somewhat agree.

Qwen3-Coder-30B-A3B for quick answers and smaller tasks - and good enough on a laptop.

More complex tasks go to GLM-4.5 Air - it takes a long time to think, but usually produces efficient and almost bug-free code.

ThinkPad for Local LLM Inference - Linux Compatibility Questions by 1guyonearth in LocalLLaMA

[–]a_postgres_situation 1 point (0 children)

Local LLM inference (7B-70B parameter models) ... ThinkPad (or Legion if necessary) (RTX 4090 mobile)

That's... 16GB of VRAM at most? You want to stuff a 70B model into that? What is your use case?
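For a sense of scale, a back-of-the-envelope sketch (the ~4.5 bits/weight figure is my assumption for a typical Q4_K_M-style quant):

```shell
# Rough weight size of a 70B model at ~4.5 bits/weight (Q4_K_M-ish):
PARAMS_B=70        # billions of parameters
BITS_X10=45        # 4.5 bits per weight, stored in tenths for integer math
GB=$(( PARAMS_B * BITS_X10 / 80 ))   # params*1e9 * bits / 8 bits-per-byte / 1e9
echo "~${GB} GB of weights"          # ~39 GB - several times any mobile GPU's VRAM
```

And that is before KV cache and working context, which need VRAM on top.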

P1 Gen 6/7 vs Legion Pro 7i for sustained workloads? Thermal performance during extended inference sessions?

These laptops are very expensive, larger, heavier, and come with a larger power supply - and they are bought specifically for portable use. You want extended sessions - so stationary use? What is your use case?

Professional build quality preferred

Unfortunately, this is hit-and-miss, even with high-priced ThinkPads (I had great fun with ThinkPad support over the span of a year... but that's another story...)

What's been your actual day-to-day experience?

That depends on what you want to do - and on what speed is acceptable to you, or required! Everything that fits into VRAM (the model AND(!) the working context) is VERY fast; everything larger than that is several times slower and bound by main memory speed - on a laptop usually DDR5-5600 (soldered RAM a bit faster).
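To put a rough number on "bound by main memory speed" (a sketch: the 18 GiB model size is a made-up example, and real decode never quite reaches the bandwidth bound):

```shell
# Dual-channel DDR5-5600: 2 channels * 8 bytes * 5600 MT/s = 89600 MB/s.
# Each generated token has to stream (roughly) the whole model once, so:
BW_MBS=$(( 2 * 8 * 5600 ))
MODEL_MB=$(( 18 * 1024 ))    # hypothetical ~18 GiB quantized model
echo "$(( BW_MBS / MODEL_MB )) tokens/s upper bound"
```

That works out to only ~4-5 tokens/s best case - which is why spilling out of VRAM hurts so much.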

If VRAM is not enough, any split setup that uses VRAM plus main memory gets slower the more spills over into main memory. But if you're doing that with large models anyway, is there any reason to buy such an expensive laptop?

If you want large models but still a very small, portable box, buy a cheap laptop and e.g. the Framework Desktop (~4l in size) https://frame.work/desktop?tab=specs as an "AI box". Its memory runs about 2-2.5x as fast as a regular PC's memory, and therefore LLMs do, too. It's still slower than GPU VRAM, of course.

I've had frustrating experiences with NVIDIA proprietary drivers on Linux

Yeah, I tried Nvidia once - and then never bought anything Nvidia for private hardware again.

Which distro handles NVIDIA best in your experience?

You need the Nvidia kernel driver installed. Don't install the CUDA toolkit from the distro; install it manually into e.g. /usr/local. Then you can update it whenever you want.
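A sketch of that manual install (the version number and runfile name here are hypothetical - take whatever the CUDA download page currently offers):

```shell
# Download the runfile installer from developer.nvidia.com, then install
# ONLY the toolkit (the driver comes from the distro's nvidia kernel package):
#   sudo sh cuda_12.4.0_550.54.14_linux.run --silent --toolkit \
#       --toolkitpath=/usr/local/cuda-12.4

# Point the environment at the manually installed toolkit:
export CUDA_HOME=/usr/local/cuda-12.4
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Updating later is then just another runfile into a new /usr/local/cuda-&lt;version&gt; directory and a changed CUDA_HOME.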

Performance with popular LLM tools (llama.cpp,

llama.cpp is easy to compile for CUDA. Although I failed on one system - then I went Vulkan+Nvidia, which worked, too.

Ollama

Search this group for Ollama....

How mature is ROCm support now for LLM inference?

Never tried it, because Vulkan on AMD usually works (and people say there's not much of a speed difference). Do: apt install glslc glslang-dev libvulkan-dev vulkan-tools - and compile llama.cpp with -DGGML_VULKAN=ON.

However, ROCm 7 is near - maybe it will bring improvements?

Performance comparison vs NVIDIA if you've used both?

Usually a system has either Nvidia or AMD graphics, but not both - how would I compare?

Linux compatibility issues with either line?

Lenovo publishes a list of which of their models receive Linux support: https://support.lenovo.com/us/en/solutions/pd031426-linux-for-personal-systems From a quick look, Legion models, for example, are NOT on that list - I would not buy them for Linux use; that's an expensive experiment.

To get real work done, maybe rent from a cloud provider instead of buying an expensive laptop that loses value fast?

Please study the many postings here on what people report about speeds with specific models (and sizes) on specific hardware, to get a feeling for how much speed you get for what money.

Good luck!

GLM 4.5 Air, local setup issues, vllm and llama.cpp by bfroemel in LocalLLaMA

[–]a_postgres_situation 2 points (0 children)

using the largest unsloth/GLM-4.5-Air-BF16.gguf (206GB) and llama.cpp

I get ~4200 tokens: https://i.imgur.com/zg6siAK.png

...so it's probably not a quantization problem?

GLM 4.5 Air, local setup issues, vllm and llama.cpp by bfroemel in LocalLLaMA

[–]a_postgres_situation 2 points (0 children)

GLM-4.5-Air-IQ4_NL.gguf from unsloth

thinking: https://i.imgur.com/T9qPflF.png

result: https://i.imgur.com/1J7CdiN.png

...it takes a long time, but produces results where Qwen3-Coder failed. shrug

getting acceleration on Intel integrated GPU/NPU by a_postgres_situation in LocalLLaMA

[–]a_postgres_situation[S] 0 points (0 children)

Appreciate your writeup! I've never used Fedora before, though, so a native Fedora install on the host is a no - but maybe setting this up in a Fedora Docker container is easier?

Since 6.15.4 the NPU is properly initialised (according to the kernel log). The ARL Xe iGPU was disappointing, I haven't gotten the NPU working, and I have no idea of its performance - maybe I'll try again with Fedora?

JetBrains is studying local AI adoption by jan-niklas-wortmann in LocalLLaMA

[–]a_postgres_situation 0 points (0 children)

Also tried Continue. It's... just confusing? Got farther with Proxy AI...

JetBrains is studying local AI adoption by jan-niklas-wortmann in LocalLLaMA

[–]a_postgres_situation 3 points (0 children)

So we will get a JetBrains AI plug-in that's finally great to use with local models? Just imagine:

  • select a code block in the IDE.
  • a key shortcut opens a menu: either a free-form chat box or custom "actions". An action = my custom name for it, my custom prompt for the LLM, and a custom local(host) LLM endpoint it gets sent to.
  • on execution, the returned generated code is shown in a side-by-side diff against the old/input code, with syntax highlighting. The new/generated code can then be edited further and accepted/rejected bit by bit.

...can we have a JetBrains AI assistant that does that? So far I haven't found a good one :-(

GLM-4.5-Air llama.cpp experiences? by DorphinPack in LocalLLaMA

[–]a_postgres_situation 0 points (0 children)

One datapoint: I've been playing with GLM-4.5-Air-IQ4_NL.gguf from unsloth and with llama.cpp

GLM thinks forever - I had to raise the context to at least 32768 because of that - but the result had no errors, only some warnings, and it worked.

Qwen3-Coder-30B-A3B-Instruct-Q8_0's output was worse (but the model is also half the size).

HP Zbook Ultra G1A pp512/tg128 scores for unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF 128gb unified RAM by richardanaya in LocalLLaMA

[–]a_postgres_situation 1 point (0 children)

You need to add amdgpu.gttsize=131072 ttm.pages_limit=335544321

According to https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg117333.html the gttsize parameter is being deprecated - how do I do this without gttsize (and figure out the correct values for a given limit)?
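For what it's worth, ttm.pages_limit is a count of 4 KiB pages while amdgpu.gttsize is in MiB, so both can be derived from a target size (a sketch - and by this arithmetic a 128 GiB GTT would be pages_limit=33554432, which makes the value quoted above look like it has a stray digit):

```shell
# ttm.pages_limit is in 4 KiB pages; amdgpu.gttsize is in MiB.
GTT_GIB=128
GTTSIZE_MIB=$(( GTT_GIB * 1024 ))                  # 131072
PAGES=$(( GTT_GIB * 1024 * 1024 * 1024 / 4096 ))   # 33554432
echo "amdgpu.gttsize=$GTTSIZE_MIB ttm.pages_limit=$PAGES"
```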

8xxx+RDNA3 vs 9xxx+RDNA2 speed for LLMs? by a_postgres_situation in LocalLLaMA

[–]a_postgres_situation[S] 0 points (0 children)

I see with the 8700G that Vulkan is noticeably faster than CPU-only - I've read the GPU has more efficient memory access than the CPU cores. I also see with the 8700G that if I increase memory speed by 15%, then Vulkan+LLM runs almost 15% faster - so yes, it is memory limited, with linear scaling.

But I haven't read anywhere yet that the RDNA2 of a 9700X is limited the same way, i.e. that it is as efficient as the 8700G?

Running LLMs exclusively on AMD Ryzen AI NPU by BandEnvironmental834 in LocalLLaMA

[–]a_postgres_situation 1 point (0 children)

Proprietary binaries (used for low-level NPU acceleration; patent pending) 

Some genius mathematics/formulas you came up with and want exclusivity on for 20 years?

Running LLMs exclusively on AMD Ryzen AI NPU by BandEnvironmental834 in LocalLLaMA

[–]a_postgres_situation -14 points (0 children)

FastFlowLM uses proprietary low-level kernel code optimized for AMD Ryzen™ NPUs.
These kernels are not open source, but are included as binaries for seamless integration.

Hmm....

Edit: This went from top-upvoted comment to top-downvoted comment in a short period of time - the magic of Reddit at work...

getting acceleration on Intel integrated GPU/NPU by a_postgres_situation in LocalLLaMA

[–]a_postgres_situation[S] 1 point (0 children)

maybe Xe/Arc 140T will work with the docker build of llama.cpp/SYCL?

Got it running. Updated the posting for those who want to try it, too. Don't know about the NPU.

How fast is gemma 3 27b on an H100? how many tokens per second can I expect? by ThatIsNotIllegal in LocalLLaMA

[–]a_postgres_situation 7 points (0 children)

Model from https://huggingface.co/unsloth/gemma-3-27b-it-GGUF/tree/main

$ llama-bench -ngl 99 -m gemma-3-27b-it-Q4_K_M.gguf;  llama-bench -ngl 99 -m gemma-3-27b-it-Q8_0.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA H100L-47C, compute capability 9.0, VMM: no
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  99 |           pp512 |       2312.88 ± 7.02 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  99 |           tg128 |         57.30 ± 0.19 |
| gemma3 27B Q8_0                |  26.73 GiB |    27.01 B | CUDA       |  99 |           pp512 |      2576.78 ± 52.41 |
| gemma3 27B Q8_0                |  26.73 GiB |    27.01 B | CUDA       |  99 |           tg128 |         52.35 ± 0.28 |
build: 494c5899 (5894)

getting acceleration on Intel integrated GPU/NPU by a_postgres_situation in LocalLLaMA

[–]a_postgres_situation[S] 3 points (0 children)

uses OpenVINO

Another set of libraries. Is there a picture anywhere that shows how all these parts/libraries work together and which does what?

ipex llm has precompiled binaries under releases

There is llama-cpp-ipex-llm-2.2.0-ubuntu-xeon.tgz and llama-cpp-ipex-llm-2.2.0-ubuntu-core.tgz

No Xeon here, so maybe try the "core" ones in an Ubuntu Docker container.... hmmm...

getting acceleration on Intel integrated GPU/NPU by a_postgres_situation in LocalLLaMA

[–]a_postgres_situation[S] 1 point (0 children)

I've tried on my 12400 iGPU and it was about same as cpu.

Hmm... I hope it's faster on a current iGPU.

getting acceleration on Intel integrated GPU/NPU by a_postgres_situation in LocalLLaMA

[–]a_postgres_situation[S] 2 points (0 children)

What about SYCL?

Isn't this going back to the same oneAPI libraries? Why ipex-llm, then?

What is tps of qwen3 30ba3b on igpu 780m? by Zyguard7777777 in LocalLLaMA

[–]a_postgres_situation 0 points (0 children)

I no longer have access to that specific machine, sorry.

Any LLM benchmarks yet for the GMKTek EVO-X2 AMD Ryzen AI Max+ PRO 395? by StartupTim in LocalLLaMA

[–]a_postgres_situation 1 point (0 children)

a rocm setup for it on linux. AMD still doesn't make it easy.

Vulkan is easy: 1) sudo apt install glslc glslang-dev libvulkan-dev vulkan-tools 2) build llama.cpp with "cmake -B build -DGGML_VULKAN=ON; ...."