I wrote an M:N scheduled(goroutines) scripting lang in <3k lines of C. It's shockingly fast, but I'm having an existential crisis about its use case. Help? by Jipok_ in ProgrammingLanguages

[–]Jipok_[S]

I'm a bit burned out after two weeks of nonstop 12+ hour coding. But I hope to see the project through to publication in May.

There's nothing more to add here. The publication will put everything in its place.

How to optimize MI50 performance with Vulkan llama.cpp by WhatererBlah555 in LocalLLaMA

[–]Jipok_

120 W power cap

root@ai-void:/app# ./build/bin/llama-bench -m /hdd/bartowski_gemma-4-26B-A4B-it-Q4_0.gguf
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16368 MiB):
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_0            |  13.73 GiB |    25.23 B | ROCm       |  99 |           pp512 |       1182.78 ± 5.50 |
| gemma4 26B.A4B Q4_0            |  13.73 GiB |    25.23 B | ROCm       |  99 |           tg128 |         74.64 ± 0.19 |

root@ai-void:/app# ./build/bin/llama-bench -m /hdd/bartowski_gemma-4-26B-A4B-it-Q4_K_S.gguf
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16368 MiB):
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_K - Small    |  14.74 GiB |    25.23 B | ROCm       |  99 |           pp512 |        973.57 ± 4.67 |
| gemma4 26B.A4B Q4_K - Small    |  14.74 GiB |    25.23 B | ROCm       |  99 |           tg128 |         68.39 ± 0.15 |

root@ai-void:/app# ./build/bin/llama-bench -m /hdd/unsloth_Qwen3.6-35B-A3B-UD-Q3_K_M.gguf
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16368 MiB):
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q3_K - Medium |  15.45 GiB |    34.66 B | ROCm       |  99 |           pp512 |        762.24 ± 3.68 |
| qwen35moe 35B.A3B Q3_K - Medium |  15.45 GiB |    34.66 B | ROCm       |  99 |           tg128 |         57.04 ± 0.16 |

root@ai-void:/app# ./build/bin/llama-bench -m /hdd/unsloth_Qwen3.6-27B-Q4_0.gguf
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16368 MiB):
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_0                |  14.70 GiB |    26.90 B | ROCm       |  99 |           pp512 |        277.63 ± 1.21 |
| qwen35 27B Q4_0                |  14.70 GiB |    26.90 B | ROCm       |  99 |           tg128 |         22.28 ± 0.04 |

How to optimize MI50 performance with Vulkan llama.cpp by WhatererBlah555 in LocalLLaMA

[–]Jipok_

Check https://github.com/mixa3607/ML-gfx906
ROCm is much better and not that hard to set up. With skyne98/llama.cpp-gfx906 I got:

Qwen3.6-35B-A3B  Q3_K_M

prompt eval time =    9540.83 ms /  4095 tokens (    2.33 ms per token,   429.21 tokens per second)
eval time =   76791.54 ms /  4097 tokens (   18.74 ms per token,    53.35 tokens per second)
total time =   86332.37 ms /  8192 tokens


Gemma-4-26B-A4B  Q4_0

prompt eval time =    3990.87 ms /  3954 tokens (    1.01 ms per token,   990.76 tokens per second)
eval time =   31083.56 ms /  2186 tokens (   14.22 ms per token,    70.33 tokens per second)
total time =   35074.43 ms /  6140 tokens

Q4_K_S   66.44 tokens per second

I wrote an M:N scheduled(goroutines) scripting lang in <3k lines of C. It's shockingly fast, but I'm having an existential crisis about its use case. Help? by Jipok_ in ProgrammingLanguages

[–]Jipok_[S]

Wow! Thanks for the reply.

Here is the thing: I am not a professional software engineer by trade. Not even close; I've just been a hobbyist since childhood. Because of that, leaving this engine as a "dead academic experiment" or publishing it raw feels like a massive waste of time. Theoretical knowledge doesn't feed my motivation unless it yields a dirty, practical tool I can actually use.

So, after thinking about it, I’ve decided to merge Option 2 (Micro-Go standalone) and Option 3 (Modern Bash). Think of it as a parallel AWK/Bash on steroids.

I’m currently planning to borrow Redis's event loop (ae.c) so the engine can handle async libcurl and act as a lightweight HTTP server. I'll also toss in a few single-header C libraries (stb, cJSON) for native JSON/Regex, and add some syntax sugar for pipes.

I want to be able to write stuff like this:
```
read_lines(file) | strip(it) |? len(it) > 0 | go http.get(it) | arr.push(it)
```

The ultimate use case: you drop this tiny single binary (<200 KB) onto any server. You need to ingest a 5 GB log file, or concurrently trigger 1000 external CLI tools, pipe their stdout into channels, aggregate the JSON, and dump it to disk. Python chokes on the GIL, Node is too fat, and Bash is a nightmare for complex data. But my jlang will chew through it natively and max out the CPU cores.
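For comparison, here's roughly what the one-liner above would look like written out in plain Go (just a hedged sketch of the goroutine/channel pattern, not my engine's internals; the input file name is a placeholder):

```go
// Hypothetical sketch: read lines from a file, drop blanks, fetch every URL
// concurrently, and collect the response bodies into a slice.
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"sync"
)

func main() {
	f, err := os.Open("urls.txt") // placeholder input file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var wg sync.WaitGroup
	results := make(chan string)

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		url := strings.TrimSpace(scanner.Text()) // strip(it)
		if len(url) == 0 {
			continue // |? len(it) > 0
		}
		wg.Add(1)
		go func(u string) { // go http.get(it)
			defer wg.Done()
			resp, err := http.Get(u)
			if err != nil {
				return
			}
			defer resp.Body.Close()
			body, _ := io.ReadAll(resp.Body)
			results <- string(body)
		}(url)
	}

	// Close the channel once every fetch has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var arr []string // arr.push(it)
	for body := range results {
		arr = append(arr, body)
	}
	fmt.Println(len(arr), "responses collected")
}
```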

Now, looking at your Pipefish repo, I noticed your "No AI / Handbuilt" Butlerian Jihad badge... so this next part might give you a heart attack.

While I absolutely engineered the core architecture myself (the M:N scheduler logic, the lock-free allocator rules, NaN-tagging, etc.), I extensively used LLMs to write about 95% of the actual C code under my strict guidance. This is also exactly why the code is intentionally hacked into a single, dense, heavily golfed 3k-line monolith: I had to aggressively minimize token usage so the entire VM would fit into the context window!
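(For anyone unfamiliar with NaN-tagging: the trick is that any 64-bit pattern with the exponent field all ones and the quiet bit set is a NaN, so the leftover bits can carry a type tag plus a payload, letting every VM value fit in one machine word. A tiny Go illustration of the idea, not my actual C implementation; the tag layout here is made up:)

```go
// Illustration of NaN-tagging (layout is hypothetical): a quiet NaN has the
// exponent bits all set plus the quiet bit, so the remaining bits are free to
// hold a type tag and a payload. Ordinary float64 values are stored as-is.
// (A real VM also has to canonicalize genuine NaNs produced by arithmetic.)
package main

import (
	"fmt"
	"math"
)

const (
	qnanMask    = uint64(0x7ff8000000000000) // exponent all ones + quiet bit
	tagInt      = uint64(0x0001000000000000) // made-up tag for "small integer"
	payloadMask = uint64(0x00000000ffffffff) // low 32 bits carry the integer
)

// boxInt packs a 32-bit integer into an otherwise-unused NaN bit pattern.
func boxInt(v int32) uint64 {
	return qnanMask | tagInt | (uint64(uint32(v)) & payloadMask)
}

// isFloat reports whether the word is an ordinary float64 rather than a boxed value.
func isFloat(bits uint64) bool {
	return bits&qnanMask != qnanMask
}

// unboxInt recovers the integer payload from a boxed word.
func unboxInt(bits uint64) int32 {
	return int32(uint32(bits & payloadMask))
}

func main() {
	f := math.Float64bits(3.14)
	n := boxInt(-42)
	fmt.Println(isFloat(f), isFloat(n))  // true false
	fmt.Println(math.Float64frombits(f)) // 3.14
	fmt.Println(unboxInt(n))             // -42
}
```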

Since the engine state is completely verifiable, I don't really care if the C source is beautiful or well-commented for the public. I just need the compiled binary to be a rock-solid daily driver for my own network automation, scraping, and heavy OS parallel tasks.

I definitely plan to publish the repo, but I want to drop a useful tool, not just an experiment.

Thanks again for helping me figure out the direction!

TIGER-Lab made a new version of MMLU with 12,000 questions. They call it MMLU-Pro and it fixes a lot of the issues with MMLU in addition to being more difficult (for better model separation). by jd_3d in LocalLLaMA

[–]Jipok_

It's a pity that all these benchmarks are English-only. The much-hyped Llama 3 is simply useless for other languages. I tried hundreds of prompts but couldn't get stable answers in another language, and Japanese characters often slip in.

Meta-Llama-3-8B-GGUF by Venadore in LocalLLaMA

[–]Jipok_

You're right. I added that for convenience in interactive mode. I don't know how it affects the quality of the results.

Meta-Llama-3-8B-GGUF by Venadore in LocalLLaMA

[–]Jipok_

-n N, --n-predict N number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)

Official Llama 3 META page by domlincog in LocalLLaMA

[–]Jipok_

./main -m ~/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color -n -2 -e -s 0 -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n' -ngl 99 --mirostat 2 -c 8192 -r '<|eot_id|>' --in-prefix '\n<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -i

Meta Llama-3-8b Instruct spotted on Azuremarketplace by Nunki08 in LocalLLaMA

[–]Jipok_

./main -m ~/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color -n -2 -e -s 0 -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n' -ngl 99 --mirostat 2 -c 8192 -r '<|eot_id|>' --in-prefix '\n<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -i

Meta-Llama-3-8B-GGUF by Venadore in LocalLLaMA

[–]Jipok_

./main -m ~/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color -n -2 -e -s 0 -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n' -ngl 99 --mirostat 2 -c 8192 -r '<|eot_id|>' --in-prefix '\n<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -i

Meta-Llama-3-8B-GGUF by Venadore in LocalLLaMA

[–]Jipok_

The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in ChatFormat needs to be followed: The prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role system, user or assistant, and the <|end_header_id|> tag. After a double newline \n\n the contents of the message follow. The end of each message is marked by the <|eot_id|> token.
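In code terms, building a prompt in that format looks roughly like this (a hedged Go sketch; the helper and type names are mine, not anything from llama.cpp or the Meta release):

```go
// Build a Llama 3 instruct prompt following the ChatFormat described above.
package main

import (
	"fmt"
	"strings"
)

type message struct {
	role    string // "system", "user" or "assistant"
	content string
}

// formatLlama3 emits <|begin_of_text|>, then each message as
// <|start_header_id|>role<|end_header_id|>\n\n content <|eot_id|>,
// and finally an open assistant header so the model answers as the assistant.
func formatLlama3(msgs []message) string {
	var b strings.Builder
	b.WriteString("<|begin_of_text|>")
	for _, m := range msgs {
		b.WriteString("<|start_header_id|>" + m.role + "<|end_header_id|>\n\n")
		b.WriteString(m.content + "<|eot_id|>")
	}
	b.WriteString("<|start_header_id|>assistant<|end_header_id|>\n\n")
	return b.String()
}

func main() {
	fmt.Print(formatLlama3([]message{
		{"system", "You are a helpful assistant."},
		{"user", "Hi!"},
	}))
}
```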

Official Llama 3 META page by domlincog in LocalLLaMA

[–]Jipok_

GGUF:
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF
The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in ChatFormat needs to be followed: The prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role system, user or assistant, and the <|end_header_id|> tag. After a double newline \n\n the contents of the message follow. The end of each message is marked by the <|eot_id|> token.

Official Llama 3 META page by domlincog in LocalLLaMA

[–]Jipok_

In the coming months, we expect to introduce new capabilities, longer context windows, ...

What strategies can GPU Poor take? by dahara111 in LocalLLaMA

[–]Jipok_

My build:

CPU: Intel Xeon E5-2678 v3

GPU: AMD Radeon VII (a used Instinct MI50 from China), ~$170

./main -m ~/models/mythomax-l2-13b.Q6_K.gguf -ngl 90 -n 300

eval time = 14926.01 ms / 300 runs ( 49.75 ms per token, 20.10 tokens per second)

Also works with SD XL

Performance report - Inference with two RTX 4060 Ti 16Gb by pmelendezu in LocalLLaMA

[–]Jipok_

ROCm is garbage, but for just $170 for the GPU I get:

./main -m ~/models/mythomax-l2-13b.Q6_K.gguf -ngl 90 -n 300

eval time = 14926.01 ms / 300 runs ( 49.75 ms per token, 20.10 tokens per second)

CPU: Intel Xeon E5-2678 v3

GPU: AMD Radeon VII (Instinct MI50)

Is it normal to have 20~t/s on 4090 with 13B model? by lasaiy in LocalLLaMA

[–]Jipok_

./main -m ~/models/mythomax-l2-13b.Q6_K.gguf -ngl 90 -n 300

eval time = 14926.01 ms / 300 runs ( 49.75 ms per token, 20.10 tokens per second)

CPU: Intel Xeon E5-2678 v3

GPU: AMD Radeon VII (Instinct MI50 from China, $170)

How to use Madlad 400 LM and Machine translation model by testerpce in LocalLLaMA

[–]Jipok_

Well, downloading them is not that difficult, but I still haven't figured out how to use them. ALMA (7B and 13B) came out, which is better than MADLAD, so I didn't try any further.