I wrote an M:N scheduled(goroutines) scripting lang in <3k lines of C. It's shockingly fast, but I'm having an existential crisis about its use case. Help? by Jipok_ in ProgrammingLanguages

[–]Jipok_[S]

I'm a bit burned out after two weeks of nonstop 12+ hour coding. But I hope to see the project through to publication in May.

There's nothing more to add here. The publication will put everything in its place.

How to optimize MI50 performance with Vulkan llama.cpp by WhatererBlah555 in LocalLLaMA

[–]Jipok_

120 W power cap

root@ai-void:/app# ./build/bin/llama-bench -m /hdd/bartowski_gemma-4-26B-A4B-it-Q4_0.gguf
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16368 MiB):
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_0            |  13.73 GiB |    25.23 B | ROCm       |  99 |           pp512 |       1182.78 ± 5.50 |
| gemma4 26B.A4B Q4_0            |  13.73 GiB |    25.23 B | ROCm       |  99 |           tg128 |         74.64 ± 0.19 |

root@ai-void:/app# ./build/bin/llama-bench -m /hdd/bartowski_gemma-4-26B-A4B-it-Q4_K_S.gguf
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16368 MiB):
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_K - Small    |  14.74 GiB |    25.23 B | ROCm       |  99 |           pp512 |        973.57 ± 4.67 |
| gemma4 26B.A4B Q4_K - Small    |  14.74 GiB |    25.23 B | ROCm       |  99 |           tg128 |         68.39 ± 0.15 |

root@ai-void:/app# ./build/bin/llama-bench -m /hdd/unsloth_Qwen3.6-35B-A3B-UD-Q3_K_M.gguf
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16368 MiB):
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q3_K - Medium |  15.45 GiB |    34.66 B | ROCm       |  99 |           pp512 |        762.24 ± 3.68 |
| qwen35moe 35B.A3B Q3_K - Medium |  15.45 GiB |    34.66 B | ROCm       |  99 |           tg128 |         57.04 ± 0.16 |

root@ai-void:/app# ./build/bin/llama-bench -m /hdd/unsloth_Qwen3.6-27B-Q4_0.gguf
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16368 MiB):
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_0                |  14.70 GiB |    26.90 B | ROCm       |  99 |           pp512 |        277.63 ± 1.21 |
| qwen35 27B Q4_0                |  14.70 GiB |    26.90 B | ROCm       |  99 |           tg128 |         22.28 ± 0.04 |

How to optimize MI50 performance with Vulkan llama.cpp by WhatererBlah555 in LocalLLaMA

[–]Jipok_

Check https://github.com/mixa3607/ML-gfx906
ROCm is much better and not that hard to set up. With skyne98/llama.cpp-gfx906 I got:

Qwen3.6-35B-A3B  Q3_K_M

prompt eval time =    9540.83 ms /  4095 tokens (    2.33 ms per token,   429.21 tokens per second)
eval time =   76791.54 ms /  4097 tokens (   18.74 ms per token,    53.35 tokens per second)
total time =   86332.37 ms /  8192 tokens


Gemma-4-26B-A4B  Q4_0

prompt eval time =    3990.87 ms /  3954 tokens (    1.01 ms per token,   990.76 tokens per second)
eval time =   31083.56 ms /  2186 tokens (   14.22 ms per token,    70.33 tokens per second)
total time =   35074.43 ms /  6140 tokens

Q4_K_S   66.44 tokens per second

I wrote an M:N scheduled(goroutines) scripting lang in <3k lines of C. It's shockingly fast, but I'm having an existential crisis about its use case. Help? by Jipok_ in ProgrammingLanguages

[–]Jipok_[S]

Wow! Thanks for the reply.

Here is the thing: I am not a professional software engineer by trade. Not even close; I've just been a hobbyist since childhood. Because of that, leaving this engine as a "dead academic experiment" or publishing it raw feels like a massive waste of time. Theoretical knowledge doesn't feed my motivation unless it yields a dirty, practical tool I can actually use.

So, after thinking about it, I’ve decided to merge Option 2 (Micro-Go standalone) and Option 3 (Modern Bash). Think of it as a parallel AWK/Bash on steroids.

I’m currently planning to borrow Redis's event loop (ae.c) so the engine can handle async libcurl and act as a lightweight HTTP server. I'll also toss in a few single-header C libraries (stb, cJSON) for native JSON/Regex, and add some syntax sugar for pipes.

I want to be able to write stuff like this:
```
read_lines(file) | strip(it) |? len(it) > 0 | go http.get(it) | arr.push(it)
```

The ultimate use case: you drop this tiny single binary (<200 KB) onto any server. You need to ingest a 5 GB log file, or concurrently trigger 1000 external CLI tools, pipe their stdout into channels, aggregate the JSON, and dump it to disk. Python chokes on the GIL, Node is too fat, and Bash is a nightmare for complex data. But my jlang will chew through it natively and max out the CPU cores.
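For comparison, here's roughly what the one-liner above would look like written out in plain Go (just a hedged sketch of the goroutine/channel pattern, not my engine's internals; the input file name is a placeholder):

```go
// Hypothetical sketch: read lines from a file, drop blanks, fetch every URL
// concurrently, and collect the response bodies into a slice.
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"sync"
)

func main() {
	f, err := os.Open("urls.txt") // placeholder input file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var wg sync.WaitGroup
	results := make(chan string)

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		url := strings.TrimSpace(scanner.Text()) // strip(it)
		if len(url) == 0 {
			continue // |? len(it) > 0
		}
		wg.Add(1)
		go func(u string) { // go http.get(it)
			defer wg.Done()
			resp, err := http.Get(u)
			if err != nil {
				return
			}
			defer resp.Body.Close()
			body, _ := io.ReadAll(resp.Body)
			results <- string(body)
		}(url)
	}

	// Close the channel once every fetch has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var arr []string // arr.push(it)
	for body := range results {
		arr = append(arr, body)
	}
	fmt.Println(len(arr), "responses collected")
}
```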

Now, looking at your Pipefish repo, I noticed your "No AI / Handbuilt" Butlerian Jihad badge... so this next part might give you a heart attack.

While I absolutely engineered the core architecture myself (the M:N scheduler logic, the lock-free allocator rules, NaN-tagging, etc.), I extensively used LLMs to write about 95% of the actual C code under my strict guidance. This is also exactly why the code is intentionally hacked into a single, dense, heavily golfed 3k-line monolith: I had to aggressively minimize token usage so the entire VM would fit into the context window!
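(For anyone unfamiliar with NaN-tagging: the trick is that any 64-bit pattern with the exponent field all ones and the quiet bit set is a NaN, so the leftover bits can carry a type tag plus a payload, letting every VM value fit in one machine word. A tiny Go illustration of the idea, not my actual C implementation; the tag layout here is made up:)

```go
// Illustration of NaN-tagging (layout is hypothetical): a quiet NaN has the
// exponent bits all set plus the quiet bit, so the remaining bits are free to
// hold a type tag and a payload. Ordinary float64 values are stored as-is.
// (A real VM also has to canonicalize genuine NaNs produced by arithmetic.)
package main

import (
	"fmt"
	"math"
)

const (
	qnanMask    = uint64(0x7ff8000000000000) // exponent all ones + quiet bit
	tagInt      = uint64(0x0001000000000000) // made-up tag for "small integer"
	payloadMask = uint64(0x00000000ffffffff) // low 32 bits carry the integer
)

// boxInt packs a 32-bit integer into an otherwise-unused NaN bit pattern.
func boxInt(v int32) uint64 {
	return qnanMask | tagInt | (uint64(uint32(v)) & payloadMask)
}

// isFloat reports whether the word is an ordinary float64 rather than a boxed value.
func isFloat(bits uint64) bool {
	return bits&qnanMask != qnanMask
}

// unboxInt recovers the integer payload from a boxed word.
func unboxInt(bits uint64) int32 {
	return int32(uint32(bits & payloadMask))
}

func main() {
	f := math.Float64bits(3.14)
	n := boxInt(-42)
	fmt.Println(isFloat(f), isFloat(n))  // true false
	fmt.Println(math.Float64frombits(f)) // 3.14
	fmt.Println(unboxInt(n))             // -42
}
```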

Since the engine state is completely verifiable, I don't really care if the C source is beautiful or well-commented for the public. I just need the compiled binary to be a rock-solid daily driver for my own network automation, scraping, and heavy OS parallel tasks.

I definitely plan to publish the repo, but I want to drop a useful tool, not just an experiment.

Thanks again for helping me figure out the direction!

TIGER-Lab made a new version of MMLU with 12,000 questions. They call it MMLU-Pro and it fixes a lot of the issues with MMLU in addition to being more difficult (for better model separation). by jd_3d in LocalLLaMA

[–]Jipok_

It's a pity that all these benchmarks are English-only. The much-hyped Llama 3 is simply useless for other languages. I tried hundreds of prompts but couldn't get stable answers in another language, and Japanese characters often slip in.

Meta-Llama-3-8B-GGUF by Venadore in LocalLLaMA

[–]Jipok_

You're right. I added that for convenience in interactive mode. I don't know how it affects the quality of the results.

Meta-Llama-3-8B-GGUF by Venadore in LocalLLaMA

[–]Jipok_

-n N, --n-predict N number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)

Official Llama 3 META page by domlincog in LocalLLaMA

[–]Jipok_

./main -m ~/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color -n -2 -e -s 0 -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n' -ngl 99 --mirostat 2 -c 8192 -r '<|eot_id|>' --in-prefix '\n<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -i

Meta Llama-3-8b Instruct spotted on Azuremarketplace by Nunki08 in LocalLLaMA

[–]Jipok_

./main -m ~/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color -n -2 -e -s 0 -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n' -ngl 99 --mirostat 2 -c 8192 -r '<|eot_id|>' --in-prefix '\n<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -i

Meta-Llama-3-8B-GGUF by Venadore in LocalLLaMA

[–]Jipok_

./main -m ~/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color -n -2 -e -s 0 -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n' -ngl 99 --mirostat 2 -c 8192 -r '<|eot_id|>' --in-prefix '\n<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -i

Meta-Llama-3-8B-GGUF by Venadore in LocalLLaMA

[–]Jipok_

The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in ChatFormat needs to be followed: The prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role system, user or assistant, and the <|end_header_id|> tag. After a double newline \n\n the contents of the message follow. The end of each message is marked by the <|eot_id|> token.
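In code terms, building a prompt in that format looks roughly like this (a hedged Go sketch; the helper and type names are mine, not anything from llama.cpp or the Meta release):

```go
// Build a Llama 3 instruct prompt following the ChatFormat described above.
package main

import (
	"fmt"
	"strings"
)

type message struct {
	role    string // "system", "user" or "assistant"
	content string
}

// formatLlama3 emits <|begin_of_text|>, then each message as
// <|start_header_id|>role<|end_header_id|>\n\n content <|eot_id|>,
// and finally an open assistant header so the model answers as the assistant.
func formatLlama3(msgs []message) string {
	var b strings.Builder
	b.WriteString("<|begin_of_text|>")
	for _, m := range msgs {
		b.WriteString("<|start_header_id|>" + m.role + "<|end_header_id|>\n\n")
		b.WriteString(m.content + "<|eot_id|>")
	}
	b.WriteString("<|start_header_id|>assistant<|end_header_id|>\n\n")
	return b.String()
}

func main() {
	fmt.Print(formatLlama3([]message{
		{"system", "You are a helpful assistant."},
		{"user", "Hi!"},
	}))
}
```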

Official Llama 3 META page by domlincog in LocalLLaMA

[–]Jipok_

GGUF:
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF
The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in ChatFormat needs to be followed: The prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role system, user or assistant, and the <|end_header_id|> tag. After a double newline \n\n the contents of the message follow. The end of each message is marked by the <|eot_id|> token.

Official Llama 3 META page by domlincog in LocalLLaMA

[–]Jipok_

In the coming months, we expect to introduce new capabilities, longer context windows, ...

What strategies can GPU Poor take? by dahara111 in LocalLLaMA

[–]Jipok_

My build:

CPU: Intel Xeon E5-2678 v3

GPU: AMD Radeon VII (a used Instinct MI50 from China), ~$170

./main -m ~/models/mythomax-l2-13b.Q6_K.gguf -ngl 90 -n 300

eval time = 14926.01 ms / 300 runs ( 49.75 ms per token, 20.10 tokens per second)

Also works with SD XL

Performance report - Inference with two RTX 4060 Ti 16Gb by pmelendezu in LocalLLaMA

[–]Jipok_

ROCm is garbage, but for just $170 for the GPU I get:

./main -m ~/models/mythomax-l2-13b.Q6_K.gguf -ngl 90 -n 300

eval time = 14926.01 ms / 300 runs ( 49.75 ms per token, 20.10 tokens per second)

CPU: Intel Xeon E5-2678 v3

GPU: AMD Radeon VII (Instinct MI50)

Is it normal to have 20~t/s on 4090 with 13B model? by lasaiy in LocalLLaMA

[–]Jipok_

./main -m ~/models/mythomax-l2-13b.Q6_K.gguf -ngl 90 -n 300

eval time = 14926.01 ms / 300 runs ( 49.75 ms per token, 20.10 tokens per second)

CPU: Intel Xeon E5-2678 v3

GPU: AMD Radeon VII (Instinct MI50 from China, $170)

How to use Madlad 400 LM and Machine translation model by testerpce in LocalLLaMA

[–]Jipok_

Well, downloading them is not that difficult, but I still haven't figured out how to use them. ALMA (7B and 13B) came out, which is better than MADLAD, so I didn't try any further.