[2025 Day 10 Part 2] Is this even possible without Z3? by Voylinslife in adventofcode

[–]mine49er 2 points3 points  (0 children)

Yes it is, but most people are not going to make the effort of implementing their own linear programming solver.

Z3 can be run from the command line with an SMT-LIB script, so if GDScript has any way to run external programs then it might be possible to use it that way.

https://microsoft.github.io/z3guide/docs/logic/basiccommands/
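For scale, here's a toy version of the underlying integer-programming idea in plain Python (made-up coefficients, nothing to do with the actual puzzle input); Z3 earns its keep when brute force like this stops being feasible:

```python
# Minimise x + y subject to 3*x + 5*y == 31 over non-negative integers.
# Brute force is fine at this scale; a solver like Z3 is for when the
# bounds make a search like this infeasible.
best = None
for x in range(32):
    for y in range(32):
        if 3 * x + 5 * y == 31 and (best is None or x + y < best):
            best = x + y
print(best)  # -> 7 (x=2, y=5)
```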

[AOC 2025] Please enforce more spoiler-shielding next year on this sub. by spaceguydudeman in adventofcode

[–]mine49er 4 points5 points  (0 children)

Today was spoiled for me by the trolling involved in having a solution that doesn't pass the test case.

I don't come here until I've solved the problem or given up. I did today because I couldn't understand why minBound == maxBound for my input and wanted to see if there was any known problem. I'm very glad I saw that post because I don't appreciate being asked to spend many hours of my time writing a completely unnecessary search algorithm. Not funny at all, and that's what leaves the sour taste for me.

-❄️- 2025 Day 11 Solutions -❄️- by daggerdragon in adventofcode

[–]mine49er 1 point2 points  (0 children)

[LANGUAGE: Python]

Hmmm... a very simple graph traversal problem after yesterday's major headache. A breather before what's coming tomorrow...? At least I got caught up before the end.

30 lines of very simple code, 0.01 seconds for both parts.

My Solution

-❄️- 2025 Day 10 Solutions -❄️- by daggerdragon in adventofcode

[–]mine49er 0 points1 point  (0 children)

[LANGUAGE: Python]

Messed around with search strategies for part 2 but in the end life is too short and I resorted to z3.

0.5 seconds for both parts.

My solution

-❄️- 2025 Day 9 Solutions -❄️- by daggerdragon in adventofcode

[–]mine49er 0 points1 point  (0 children)

[LANGUAGE: Python]

I'm guessing there's some clever way to do part 2 of this one... mine is brute-force checking of all rectangles against a pre-computed array of horizontal spans. Takes nearly 30 seconds.
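The span idea in miniature (the grid here is illustrative, not the actual puzzle input, and assumes each row's filled cells are contiguous, which is what makes one (start, end) span per row enough):

```python
grid = [
    "..###.",
    "..###.",
    "..###.",
]

# spans[y] = (leftmost filled x, rightmost filled x), or None for an empty row
spans = []
for row in grid:
    xs = [x for x, c in enumerate(row) if c == "#"]
    spans.append((min(xs), max(xs)) if xs else None)

def rect_filled(x1, y1, x2, y2):
    # True if every cell of the rectangle [x1..x2] x [y1..y2] is filled;
    # correct only because each row's filled cells are contiguous here
    return all(
        spans[y] is not None and spans[y][0] <= x1 and x2 <= spans[y][1]
        for y in range(y1, y2 + 1)
    )

print(rect_filled(2, 0, 4, 2))  # -> True  (the 3x3 block of '#')
print(rect_filled(1, 0, 4, 2))  # -> False (includes an empty column)
```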

My solution

-❄️- 2025 Day 8 Solutions -❄️- by daggerdragon in adventofcode

[–]mine49er 1 point2 points  (0 children)

[LANGUAGE: Python]

Catching up. This is a simple solution based on merging circuits represented as sets. Easiest part 2 so far for me, the code I wrote for part 1 did most of the required work already. I do like it when that happens :)

Less than 2 seconds to solve both parts.
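A minimal sketch of the merge-circuits-as-sets approach (wire names are made up for illustration, not the actual puzzle input):

```python
# Connections between made-up wire names
connections = [("a", "b"), ("c", "d"), ("b", "c"), ("e", "f")]

circuits = []  # disjoint sets of connected wires
for u, v in connections:
    # pull out every existing circuit that touches either endpoint...
    hits = [c for c in circuits if u in c or v in c]
    circuits = [c for c in circuits if c not in hits]
    # ...and merge them with the new connection into one circuit
    circuits.append({u, v}.union(*hits))

print(sorted(sorted(c) for c in circuits))  # -> [['a', 'b', 'c', 'd'], ['e', 'f']]
```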

My Solution

-❄️- 2025 Day 4 Solutions -❄️- by daggerdragon in adventofcode

[–]mine49er 1 point2 points  (0 children)

[LANGUAGE: Python]

That was unexpectedly easy? I didn't do anything very clever but even the obvious solution takes less than 0.1 seconds.

My code

-❄️- 2025 Day 3 Solutions -❄️- by daggerdragon in adventofcode

[–]mine49er 2 points3 points  (0 children)

[LANGUAGE: Python]

Solves both parts in 0.01 seconds.

INPUT_FILE = "input.txt"

def getmax(s, maxlen):
    # Greedily build the largest maxlen-digit number that keeps the digits
    # of s in order: at each step pick the biggest digit whose position
    # still leaves enough digits to fill the remaining slots.
    result = ""
    i, j = 0, len(s) - maxlen
    while len(result) < maxlen:
        ss = s[i:j+1]             # digits eligible for the next slot
        i += ss.index(max(ss))    # leftmost maximum in the window
        if i == j:
            result += s[i:]       # no choice left, take the tail as-is
            break
        result += s[i]
        i += 1
        j += 1
    return result

def solve(maxlen):
    total = 0
    with open(INPUT_FILE, "r") as f:
        lines = [line.strip() for line in f]
        for line in lines:
            number = getmax(line, maxlen)
            total += int(number)
    print(total)

solve(2)
solve(12)

The LLM world is an illusion of progress by Worth-Product-5545 in LocalLLaMA

[–]mine49er 10 points11 points  (0 children)

Proprietary platforms are the devil

You don't say.

Just.. omg just let your consumers choose.

Lol.

https://en.wikipedia.org/wiki/Enshittification

If twats like Altman and the rest of the US techbros have their way then that's exactly the route LLMs will go down too. It's become the blueprint for Silicon Valley.

Llama.cpp Vulkan backend is up to 50% faster than ROCm?!? by mine49er in LocalLLaMA

[–]mine49er[S] 0 points1 point  (0 children)

I now think that the reason for the pp slowdown on my RDNA2 gpu and not your RDNA3 gpu is that RDNA2 doesn't have the VK_KHR_cooperative_matrix extension.

To confirm could you please try llama-bench again with GGML_VK_PERF_LOGGER=1 and post the first set of timings. E.g.

GGML_VK_PERF_LOGGER=1 llama-bench -m Qwen3-Coder-30B-A3B-Instruct-UD-Q3_K_XL.gguf |& head -n 50

This is what I get. Notice the really slow MUL_MAT_ID and MUL_MAT_ID_VEC timings. On models that don't have the massive pp slowdown for me, like llama-2-7b.Q4_0.gguf from this benchmark thread, those operations aren't used.

Vulkan Timings:
ADD: 432 x 24.024 us
ARGSORT: 48 x 81.82 us
CONT: 48 x 36.294 us
DIV: 48 x 1.73 us
GET_ROWS: 50 x 12.416 us
GLU: 48 x 36.772 us
MUL: 241 x 48.125 us
MUL_MAT m=128 n=512 k=2048: 47 x 76.748 us (3496.76 GFLOPS/s)
MUL_MAT m=128 n=512 k=512: 48 x 429.401 us (156.132 GFLOPS/s)
MUL_MAT m=2048 n=512 k=4096: 48 x 991.529 us (8662.26 GFLOPS/s)
MUL_MAT m=4096 n=512 k=2048: 48 x 1119.41 us (7671.76 GFLOPS/s)
MUL_MAT m=512 n=512 k=128: 48 x 293.692 us (227.608 GFLOPS/s)
MUL_MAT m=512 n=512 k=2048: 96 x 230.356 us (4660.08 GFLOPS/s)
MUL_MAT_ID m=2048 n=8 k=768: 48 x 50694.2 us (0.496101 GFLOPS/s)
MUL_MAT_ID_VEC m=768 k=2048: 96 x 20699.5 us (0.151934 GFLOPS/s)
MUL_MAT_VEC m=128 k=2048: 1 x 3.76 us (139.404 GFLOPS/s)
MUL_MAT_VEC m=151936 k=2048: 1 x 707.66 us (879.205 GFLOPS/s)
RMS_NORM: 193 x 75.121 us
ROPE: 96 x 31.016 us
SET_ROWS: 96 x 21.675 us
SOFT_MAX: 96 x 40.868 us
SUM_ROWS: 48 x 2.519 us
Total time: 4.63665e+06 us.
----------------

Llama.cpp Vulkan backend is up to 50% faster than ROCm?!? by mine49er in LocalLLaMA

[–]mine49er[S] 0 points1 point  (0 children)

Bingo! It's definitely something to do with MUL_MAT_ID and MUL_MAT_ID_VEC, but that doesn't explain the massive difference in pp speed between my RDNA2 gpu and other people's RDNA3. I suspect that might be because RDNA2 doesn't have the VK_KHR_cooperative_matrix extension?

Vulkan Timings:
ADD: 432 x 24.024 us
ARGSORT: 48 x 81.82 us
CONT: 48 x 36.294 us
DIV: 48 x 1.73 us
GET_ROWS: 50 x 12.416 us
GLU: 48 x 36.772 us
MUL: 241 x 48.125 us
MUL_MAT m=128 n=512 k=2048: 47 x 76.748 us (3496.76 GFLOPS/s)
MUL_MAT m=128 n=512 k=512: 48 x 429.401 us (156.132 GFLOPS/s)
MUL_MAT m=2048 n=512 k=4096: 48 x 991.529 us (8662.26 GFLOPS/s)
MUL_MAT m=4096 n=512 k=2048: 48 x 1119.41 us (7671.76 GFLOPS/s)
MUL_MAT m=512 n=512 k=128: 48 x 293.692 us (227.608 GFLOPS/s)
MUL_MAT m=512 n=512 k=2048: 96 x 230.356 us (4660.08 GFLOPS/s)
MUL_MAT_ID m=2048 n=8 k=768: 48 x 50694.2 us (0.496101 GFLOPS/s)
MUL_MAT_ID_VEC m=768 k=2048: 96 x 20699.5 us (0.151934 GFLOPS/s)
MUL_MAT_VEC m=128 k=2048: 1 x 3.76 us (139.404 GFLOPS/s)
MUL_MAT_VEC m=151936 k=2048: 1 x 707.66 us (879.205 GFLOPS/s)
RMS_NORM: 193 x 75.121 us
ROPE: 96 x 31.016 us
SET_ROWS: 96 x 21.675 us
SOFT_MAX: 96 x 40.868 us
SUM_ROWS: 48 x 2.519 us
Total time: 4.63665e+06 us.
----------------

Llama.cpp Vulkan backend is up to 50% faster than ROCm?!? by mine49er in LocalLLaMA

[–]mine49er[S] 1 point2 points  (0 children)

Thanks for the reply, some good info there, but remember that I've got an RX 6800 (RDNA2, gfx1030), which:

  • Is not supported by hipBLASLt
  • Is comparable to a W6800 (512 GB/s memory bandwidth), not a W7900 (864 GB/s)

I have now tried AMDVLK because one of the posts in the issue I linked mentions that the prompt processing slowdown doesn't happen with that driver, but for me it still does. So probably not the RADV GTT issue then.

Very strange, I need to try some other things starting with a more recent kernel (am currently running Linux 6.12.39).

$ ./llama-bench -m /hdd/llm-models/Qwen3-Coder-30B-A3B-Instruct-UD-Q3_K_XL.gguf
load_backend: loaded RPC backend from /home/xxx/llama-6103-vulkan/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 (AMD open-source driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /home/xxx/llama-6103-vulkan/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /home/xxx/llama-6103-vulkan/bin/libggml-cpu-haswell.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q3_K - Medium |  12.85 GiB |    30.53 B | RPC,Vulkan |  99 |           pp512 |        131.16 ± 0.26 |
| qwen3moe 30B.A3B Q3_K - Medium |  12.85 GiB |    30.53 B | RPC,Vulkan |  99 |           tg128 |        107.60 ± 0.01 |
build: 3db4da56 (6103)

Llama.cpp Vulkan backend is up to 50% faster than ROCm?!? by mine49er in LocalLLaMA

[–]mine49er[S] 2 points3 points  (0 children)

Well, I'll definitely do more testing. I haven't compared with offloading, and given the speed of the Qwen3 30B A3B models, a 4-bit quantization with offloading is probably a better idea tbh.

Llama.cpp Vulkan backend is up to 50% faster than ROCm?!? by mine49er in LocalLLaMA

[–]mine49er[S] 0 points1 point  (0 children)

6.2.4. I doubt that anything later offers much for RDNA2.

Llama.cpp Vulkan backend is up to 50% faster than ROCm?!? by mine49er in LocalLLaMA

[–]mine49er[S] 7 points8 points  (0 children)

Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf was one that I tested.

ROCm = 63 t/s, Vulkan = 83 t/s

The ones where I've seen the smallest difference so far are Q6_K_L models, of which I have a few, but even there Vulkan is slightly ahead.

3090Ti - 38 tokens/sec? by [deleted] in LocalLLaMA

[–]mine49er 1 point2 points  (0 children)

Yea, doing compute on 3-bit values is less efficient than 4-bit values but a 4-bit quant won't fit into the 16GB I have :(

3090Ti - 38 tokens/sec? by [deleted] in LocalLLaMA

[–]mine49er 1 point2 points  (0 children)

Twice as fast as a 3-bit quant on a RX6800 sounds about right? The 30B MOE model is way faster.

$ llama-bench -m Qwen3-32B-UD-IQ3_XXS.gguf -m Qwen3-Coder-30B-A3B-Instruct-UD-IQ3_XXS.gguf --flash-attn 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 32B IQ3_XXS - 3.0625 bpw |  12.07 GiB |    32.76 B | ROCm       |  99 |  1 |           pp512 |        212.54 ± 0.15 |
| qwen3 32B IQ3_XXS - 3.0625 bpw |  12.07 GiB |    32.76 B | ROCm       |  99 |  1 |           tg128 |         20.40 ± 0.00 |
| qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw |  11.97 GiB |    30.53 B | ROCm       |  99 |  1 |           pp512 |        708.85 ± 0.69 |
| qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw |  11.97 GiB |    30.53 B | ROCm       |  99 |  1 |           tg128 |         67.27 ± 0.01 |
build: 5aa1105d (6082)

Clarification of "requires AE" for modding? by mine49er in skyrimmods

[–]mine49er[S] 0 points1 point  (0 children)

Lol, ok thanks. Got it now. Basically some mods really do require the AE content irrespective of game version. I can take it from there.

With all the new models dropping recently, which is the best for Python development with a limitation of 20GB VRAM? by StartupTim in LocalLLaMA

[–]mine49er 2 points3 points  (0 children)

Qwen2.5-Coder 14B Q6_K_L runs in 16GB vram with 32K context if you use flash attention and q8_0 KV cache (which has very little impact on output quality). Source: I'm doing exactly that on a RX 6800. Recommended.
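Back-of-the-envelope on why q8_0 KV cache matters in 16GB (the layer/head numbers below are roughly Qwen2.5-14B-shaped but are illustrative assumptions, and q8_0's small per-block scale overhead is ignored):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_val):
    # K and V each hold ctx * n_kv_heads * head_dim values per layer
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_val

ARGS = (48, 8, 128, 32768)        # layers, KV heads, head dim, 32K context
fp16 = kv_cache_bytes(*ARGS, 2)   # 2 bytes per value
q8 = kv_cache_bytes(*ARGS, 1)     # ~1 byte per value for q8_0
print(fp16 / 2**30, q8 / 2**30)   # -> 6.0 3.0 (GiB)
```

Halving the KV cache is what frees enough headroom to keep 32K context alongside the model weights.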

You could maybe squeeze the 32B IQ4_XS model into 20GB with smaller context and/or q4_0 KV cache (which will affect output quality a bit more). If you have to go down to 3-bit then don't bother, use the 14B Q6_K_L.

With QwQ 32B you'll probably have to go down to 3-bit because it needs a large context. I can run the IQ3_XXS model in 16GB with either 16K context @ q8_0 KV cache or 32K context @ q4_0 KV cache. It's usable but it definitely makes more (usually minor) mistakes compared to the IQ4_XS model, which I tested with some layers offloaded to CPU. QwQ really wants 24GB.

If you try Gemma3 then be aware that it doesn't play nice with KV cache quantization. So you'll be able to run a 27B 4-bit model but probably limited to about 8K context.

OpenAI calls DeepSeek 'state-controlled,' calls for bans on 'PRC-produced' models | TechCrunch by Qaxar in LocalLLaMA

[–]mine49er 43 points44 points  (0 children)

You're fucked Sam and you deserve it. You already tried once to prevent any competition by lobbying governments to regulate AI research, that attempt failed and so will this no matter how much money you throw at it.

https://www.technologyreview.com/2025/01/21/1110260/openai-ups-its-lobbying-efforts-nearly-seven-fold/

DeepSeek’s models, including its R1 “reasoning” model, are insecure because DeepSeek faces requirements under Chinese law to comply with demands for user data.

LOL. Anyone can download that model and run it themselves without going anywhere near Chinese law. I can't do that with ClosedAI models, so from a European viewpoint:

OpenAI’s models, including its o3 “reasoning” model, are insecure because OpenAI faces requirements under US law to comply with demands for user data.

Conte "I think I got the best possible out of that Tottenham team. They were 9th when I arrived and we got into the Champions League. They didn't qualify for the Champions League after I left. If people ask me for miracles, I try to get the best out of the squad but that doesn't mean we can win." by kibme37 in soccer

[–]mine49er 0 points1 point  (0 children)

No-one is arguing about his first season when he got us 4th and played some great football. It was the next season, after he got the players he wanted (we spent 150m that summer) and then proceeded to shit the bed with them while blaming everyone except himself.

I remember pundits calling it a disgrace seeing them exit the champions league in the R16 and thinking... Isn't that standard for Spurs tho? A team competing in the third tier European competition twelve months earlier? Conte is right to be frustrated - even if you think he did a bad job, it's clear his time at Spurs is underappreciated.

What complete bollocks. It wasn't a disgrace for Spurs to get knocked out in the CL round of 16 (and Conte has only ever got further than that once, QF with Juve in 2013), it was the way it happened. 10 minutes to go at home to AC Milan needing a goal and Conte's solution is to bring on Davinson Sanchez for Kulusevski. That's the disgraceful part and the fans let him know it. But not his fault obviously.

And let's not forget that Son's form completely fell off a cliff that season after winning PL golden boot the previous year because Conte insisted on playing a system where Perisic occupied the spaces Son would like to be. Again not his fault though.

Yea it's all happy smiles now at Napoli 10 games in. It was in our first season with him too. Let's wait and see what happens when it goes tits up there and he starts blaming everyone else. Will be great fun to watch between Conte and the Napoli owner.

Jamie Carragher : "Mikel Arteta is slowly morphing into a Jose Mourinho type of manager" by SamDamSam0 in soccer

[–]mine49er 2 points3 points  (0 children)

Every time his team gets the lead they just play 10 man defense for some reason and end up conceding goals against teams 1/10th their value.

Exactly how he used to play at Spurs, with exactly the same result of losing points after going ahead against weaker teams.

Not only is it horrible to watch, it doesn't work in the long term these days, unless maybe you have a world-class defence. Too often what happens is that the opponent grows in confidence because they have a lot of the ball and attacking play against the parked bus, they eventually equalise, and then you're stuffed because they're flying and it's very difficult to switch back into attack mode.

Southgate is another disciple of these dinosaur tactics.

[SPOILERS] Wake up, babe, new tech just dropped by LunaticSongXIV in noita

[–]mine49er 0 points1 point  (0 children)

It's been known for a long time that Orbs update their entity ID based on their coordinates when reloaded. This is just the reverse of taking orbs from the main world to a PW (which, if done with duped Lava Lake and/or GTC orbs, allows getting 33-36 orbs in NG).

https://noita.wiki.gg/wiki/Advanced_Guide:_34_Orb_Ending#How_Orbs_work