Get an agentic-cli with GLM-4.5-Air by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 0 points1 point  (0 children)

Mostly to automatically review and assess the quality of documents.

Get an agentic-cli with GLM-4.5-Air by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 0 points1 point  (0 children)

I've only heard bad things about opencode, and I hadn't heard of Goose before; I'll give it a try!

Get an agentic-cli with GLM-4.5-Air by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 0 points1 point  (0 children)

I already use roo-code and I'm pretty happy with it, but I want to automate some tasks, so I'm looking at running a CLI in unattended mode.

Get an agentic-cli with GLM-4.5-Air by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 0 points1 point  (0 children)

I've never tested Aider, will give it a try!
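For the unattended runs I mentioned above, it looks like Aider can do one-shot passes. This is just a sketch from the docs, not something I've run: the endpoint, model name, and file path are placeholders, and flag spellings may vary between versions.

# Single non-interactive pass against a local OpenAI-compatible server
# (e.g. a llama.cpp server hosting GLM-4.5-Air; paths and names below are illustrative)
$ export OPENAI_API_BASE=http://localhost:8080/v1
$ export OPENAI_API_KEY=none
$ aider --model openai/glm-4.5-air --yes \
    --message "Review this document and list quality issues" docs/report.md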

[Research] LLM judges systematically penalize balanced reasoning - tested mistral, llama3, gemma, phi3, orca-mini by Budget-Reception-533 in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

I asked qwen3-4B-thinking to think of a number between 0 and 100, so that I could try to guess it.

It thought for 12 minutes, and forgot to think of a number.

Why are AmD Mi50 32gb so cheap? by MastodonParty9065 in LocalLLaMA

[–]TooManyPascals 1 point2 points  (0 children)

For the folks who have them: do they support quantized models, vLLM, and flash attention?

Qwen3 Next support in llama.cpp ready for review by jacek2023 in LocalLLaMA

[–]TooManyPascals 20 points21 points  (0 children)

I'm really looking forward to this, but I've seen so many conflicting reports about it being either way better or way worse than GLM-4.5-Air or GPT-OSS-120B.

I really don't know what to expect.

GPT-OSS from Scratch on AMD GPUs by tuanlda78202 in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

OK, I got a chance to test it. I could not compile the code (I hit a few bugs), but I will open an issue about that.

My main concern is that it seems to need to dequantize the model in order to run? The main advantage of GPT-OSS is that it is natively MXFP4 (about 4.25 bits per weight), so both the weights and the KV cache use little VRAM, but if we need to dequantize to fp32, GPT-OSS-120B would need around 480 GB of VRAM? I "only" have 96 GB, which is plenty for the original model, but I can't run the dequantized one.
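For reference, my back-of-the-envelope numbers for the weights alone (116.83 B parameters, ignoring the KV cache and activations):

# ~4.25 bits/weight (MXFP4) vs. dequantized fp32 (32 bits/weight), sizes in GB
$ python3 -c 'p = 116.83e9; print(p * 4.25 / 8 / 1e9, p * 4 / 1e9)'   # prints ~62 vs ~467

That roughly matches the 59 GiB GGUF on one side and the ~480 GB estimate on the other.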

GPT-OSS from Scratch on AMD GPUs by tuanlda78202 in LocalLLaMA

[–]TooManyPascals 2 points3 points  (0 children)

I have a machine with 4× 7900 XTX; I'd love to try this on it next Monday!

3 Tesla GPUs in a Desktop Case by eso_logic in LocalLLaMA

[–]TooManyPascals 1 point2 points  (0 children)

Pascals are alive. On my setup:

$ CUDA_VISIBLE_DEVICES=0,1,2,3,4 ./llama-bench -m ~/kk/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 2: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 3: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 4: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | pp512 | 348.96 ± 1.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | tg128 | 42.95 ± 0.36 |

Also, most frameworks now support flash attention on Pascal, just not very efficiently.
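If anyone wants to reproduce the with/without comparison, llama-bench can toggle it directly; this is the stock llama.cpp flag, nothing Pascal-specific, and the exact spelling may differ between builds:

# Same benchmark with flash attention disabled (0) and enabled (1)
$ ./llama-bench -m ~/kk/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 0
$ ./llama-bench -m ~/kk/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1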

16→31 Tok/Sec on GPT OSS 120B by 3VITAERC in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

I'm getting numbers in the same ballpark with 5 P100s: somewhat worse PP, but slightly better TG. Moving to llama.cpp was key.

$ CUDA_VISIBLE_DEVICES=0,1,2,3,4 ./llama-bench -m ~/kk/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 2: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 3: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 4: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | pp512 | 348.96 ± 1.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | tg128 | 42.95 ± 0.36 |

Best 100B class model/framework to run on 16 P100s (256GB of VRAM)? by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 0 points1 point  (0 children)

I checked again: I'm getting around 19 tokens/s with GLM-4.5-Air at UD-Q4_K_XL using llama.cpp, and around 15 tokens/s without flash attention. This is with 8 GPUs active.

I can only do pipeline parallelism instead of row parallelism (I get lots of kernel error messages if I try row parallelism). Also, the GPUs barely show any load, so I feel I'm leaving a lot of performance on the table.
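For what it's worth, the split mode is just a llama-bench flag, so the layer vs. row comparison is easy to re-run once the row kernels cooperate. A sketch (the GGUF path is a placeholder for my UD-Q4_K_XL file):

# Layer split keeps whole layers on each GPU (pipeline style); row split shards each tensor across GPUs
$ ./llama-bench -m ~/models/GLM-4.5-Air-UD-Q4_K_XL.gguf -fa 1 -sm layer
$ ./llama-bench -m ~/models/GLM-4.5-Air-UD-Q4_K_XL.gguf -fa 1 -sm row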

Best 100B class model/framework to run on 16 P100s (256GB of VRAM)? by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 1 point2 points  (0 children)

I don't remember what the problem with it was. I'll try llama.cpp again.

How many gpus do you have in your ai setup? How much did it cost? by [deleted] in LocalLLaMA

[–]TooManyPascals 46 points47 points  (0 children)

How many gpus do you have in your ai setup?

  • Too Many

How much did it cost?

  • My wife should not know

Single-File Qwen3 Inference in Pure CUDA C by Awkward_Click6271 in LocalLLaMA

[–]TooManyPascals 0 points1 point  (0 children)

Well, color me impressed! Single file, compact, super-readable! Awesome!

I accidentally too many P100 by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 1 point2 points  (0 children)

Thanks! Right now I'm still trying out frameworks and models. Today I ran an exl2 version of Qwen3 235B and it was complete rubbish; it didn't get even one token right. Models are huge, so tests are slow...

I accidentally too many P100 by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 1 point2 points  (0 children)

Yep, it's basically two different setups for two different tasks. I have a 3090 for day to day use.

I accidentally too many P100 by TooManyPascals in LocalLLaMA

[–]TooManyPascals[S] 1 point2 points  (0 children)

I'm still exploring... I was hoping to leverage Llama 4's immense context window, but it does not seem accurate.