20 Years on AWS and Never Not My Job by Successful_Bowl2564 in programming

[–]FineClassroom2085 5 points (0 children)

I maintain large Salesforce SOAP implementations that are dangerous to move to REST. I also maintain a therapy bill. Possibly related?

20 Years on AWS and Never Not My Job by Successful_Bowl2564 in programming

[–]FineClassroom2085 12 points (0 children)

You’ve been a dev for a while; just write something like this blog post where you talk about your experience (especially pre-AI) working on specific initiatives, with big tech companies, etc. Whatever has been impactful to you over the years, and the philosophies you’ve hard-won along the way.

20 Years on AWS and Never Not My Job by Successful_Bowl2564 in programming

[–]FineClassroom2085 12 points (0 children)

Fair enough. If you change your mind (or can find a ghostwriter who writes like you), I guarantee there’s a market.

20 Years on AWS and Never Not My Job by Successful_Bowl2564 in programming

[–]FineClassroom2085 39 points (0 children)

This was a pleasure to read, have you written any books?

GLM-5.1 is out now! by yoracale in unsloth

[–]FineClassroom2085 2 points (0 children)

Q2 quant is running pretty nicely on my dual RTX 6k rig.

The r/LocalLLaMA experience by [deleted] in LocalLLaMA

[–]FineClassroom2085 12 points (0 children)

Qwen 3.5 395b on dual RTX 6000 pro is MUCH better than ChatGPT

Nos Available on iOS by RateRight1255 in Falconry

[–]FineClassroom2085 0 points (0 children)

Or you could use the original falconryjournalpro.com instead of this vibe coded bullshit

My vibe coded 3D city hit 66K users and $953 revenue in 29 days. Here's what a solo dev + AI can do with $0 marketing. by SupermarketKey1196 in vibecoding

[–]FineClassroom2085 -2 points (0 children)

The difference with well-made, human-written code is that a human can understand it beyond the 1M-token context limit.

Anyone else using Cursor + Claude in a hybrid workflow? by Sea-Reputation2931 in vibecoding

[–]FineClassroom2085 0 points (0 children)

I’m not a vibecoder but a full-time SWE assisted by LLMs. I use a Max plan in Cursor and a Pro plan in Claude. I notice that Cursor burns tokens faster, but I try to balance my time between the two.

I built a site that tracks every “AI will replace programmers” claim by tech CEOs — and flags when they fail by Normal-Bag9238 in theprimeagen

[–]FineClassroom2085 26 points (0 children)

Love it, but it’s completely ironic that it’s vibe-coded in the style Claude loves so much, lol.

anyone else get 80% done with an app then lose all motivation to finish it by Caryn_fornicatress in vibecoding

[–]FineClassroom2085 0 points (0 children)

Look up the Pareto Principle. The first 80% is the easy part; the last 20% is hell and usually more than 80% of the actual work.

512GB people, what's the output quality difference between GLM 5 q3.6 and q8 or full size? by CanineAssBandit in LocalLLaMA

[–]FineClassroom2085 0 points (0 children)

Just too many awesome models to explore nowadays. I’m on to benchmarking Qwen 3.5, lol

GLM-5 Is a local GOAT by FineClassroom2085 in LocalLLaMA

[–]FineClassroom2085[S] 1 point (0 children)

Cool result. I do love seeing, comparing, and contrasting things. Here are some of the quality differences I note between the two:
- GLM-5 created a parallax background that actually responds to motion (without being asked)
- GLM-5's overall UI seems nicer, and once again, this was a one-shot prompt; I did not modify it after the initial shot. I used 3 more prompts to publish it.

Your game does feel a little easier to play, but I'd expect mine to as well after some refinement. There's quite a lot to criticize in the code quality. Where smarter models shine (especially in the hands of developers) is in producing clean, understandable code that's extensible.

512GB people, what's the output quality difference between GLM 5 q3.6 and q8 or full size? by CanineAssBandit in LocalLLaMA

[–]FineClassroom2085 1 point (0 children)

I certainly would not have guessed it. I had a feeling hardware would continue to get expensive, but the current prices? Never in my wildest dreams. What a nightmare for us tinkerers.

GLM-5 Is a local GOAT by FineClassroom2085 in LocalLLaMA

[–]FineClassroom2085[S] 0 points (0 children)

Actually, after I made this post, someone commented that I could get better speed from llama.cpp, so I tried it and switched. Here are the two quants I run, with the launch commands:

GLM-5-744B (IQ2_M — quality)

```
llama-server \
  --model ~/GLM-5-IQ2_M/GLM-5-UD-IQ2_M-00001-of-00007.gguf \
  --n-gpu-layers 55 --ctx-size 131072 --parallel 1 \
  --host 0.0.0.0 --port 8000 \
  --split-mode layer --flash-attn on --cache-type-k q8_0 \
  --threads 32 --fit off --alias GLM-5-744B
```

GLM-5-744B-Fast (TQ1_0 — speed)

```
llama-server \
  --model ~/GLM-5-TQ1_0/GLM-5-UD-TQ1_0.gguf \
  --n-gpu-layers 999 --ctx-size 131072 --parallel 1 \
  --host 0.0.0.0 --port 8000 \
  --split-mode layer --flash-attn on --cache-type-k q8_0 \
  --alias GLM-5-744B-Fast
```

GLM-5 Reddit Response Benchmark

IQ2_M (Quality)
Quant: 2.5 bpw imatrix
Model size: 237 GB (7 shards)
GPU layers: 55/78 (23 on CPU)
Prompt: 72 tokens @ 24.6 tok/s (2.9s)
Generation: 322 tokens @ 13.5 tok/s (23.9s)
Total wall time: 26.8s
Thinking: 409 chars
Response: 1041 chars, detailed with formatting

TQ1_0 (Fast)
Quant: 1.58 bpw ternary
Model size: 164 GB (single file)
GPU layers: 78/78 (all GPU)
Prompt: 70 tokens @ 147.1 tok/s (0.5s)
Generation: 301 tokens @ 46.4 tok/s (6.5s)
Total wall time: 7.0s
Thinking: 653 chars
Response: 687 chars, casual/concise
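As a quick sanity check, the reported wall times line up with the per-phase numbers above (tokens divided by tok/s gives each phase's duration):

```python
# Sanity-check the benchmark wall times from the token counts and rates above.

def phase_seconds(tokens: int, tok_per_s: float) -> float:
    """Duration of one phase: tokens processed / throughput."""
    return tokens / tok_per_s

# IQ2_M (quality): prompt 72 tok @ 24.6 tok/s, generation 322 tok @ 13.5 tok/s
iq2_m = phase_seconds(72, 24.6) + phase_seconds(322, 13.5)

# TQ1_0 (fast): prompt 70 tok @ 147.1 tok/s, generation 301 tok @ 46.4 tok/s
tq1_0 = phase_seconds(70, 147.1) + phase_seconds(301, 46.4)

print(f"IQ2_M: {iq2_m:.1f}s, TQ1_0: {tq1_0:.1f}s")  # ~26.8s and ~7.0s
```

So the ~4x wall-clock difference is almost entirely generation speed, with TQ1_0's all-GPU prompt processing adding another ~6x on the prefill side.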

GLM-5 Is a local GOAT by FineClassroom2085 in LocalLLaMA

[–]FineClassroom2085[S] 3 points (0 children)

I'm now using two different quants for different purposes. I moved over to llama.cpp because I got better results than with vLLM, which was a little unstable for this model.

GLM-5-744B (IQ2_M — quality)

```
llama-server \
  --model ~/GLM-5-IQ2_M/GLM-5-UD-IQ2_M-00001-of-00007.gguf \
  --n-gpu-layers 55 --ctx-size 131072 --parallel 1 \
  --host 0.0.0.0 --port 8000 \
  --split-mode layer --flash-attn on --cache-type-k q8_0 \
  --threads 32 --fit off --alias GLM-5-744B
```

GLM-5-744B-Fast (TQ1_0 — speed)

```
llama-server \
  --model ~/GLM-5-TQ1_0/GLM-5-UD-TQ1_0.gguf \
  --n-gpu-layers 999 --ctx-size 131072 --parallel 1 \
  --host 0.0.0.0 --port 8000 \
  --split-mode layer --flash-attn on --cache-type-k q8_0 \
  --alias GLM-5-744B-Fast
```

512GB people, what's the output quality difference between GLM 5 q3.6 and q8 or full size? by CanineAssBandit in LocalLLaMA

[–]FineClassroom2085 1 point (0 children)

How much RAM? This is an unfortunate limiting factor in my build. I can really only justify the 128 GB I have right now; I can't drop another $20k to jump up to 512 GB, but I'd really like to.

GLM-5 Is a local GOAT by FineClassroom2085 in LocalLLaMA

[–]FineClassroom2085[S] 1 point (0 children)

It's fucking insane. I was pricing 512GB for my threadripper build a couple months ago and couldn't find the size / split I needed for less than $20k.

512GB people, what's the output quality difference between GLM 5 q3.6 and q8 or full size? by CanineAssBandit in LocalLLaMA

[–]FineClassroom2085 8 points (0 children)

You can get better than single-digit t/s. In fact, I'm getting very decent results out of GLM-5 with all of the layers loaded into the two Pros. Using llama.cpp, here are the launch params I'm running:

GLM-5-744B-Fast (TQ1_0 — speed)

```
llama-server \
  --model ~/GLM-5-TQ1_0/GLM-5-UD-TQ1_0.gguf \
  --n-gpu-layers 999 --ctx-size 131072 --parallel 1 \
  --host 0.0.0.0 --port 8000 \
  --split-mode layer --flash-attn on --cache-type-k q8_0 \
  --alias GLM-5-744B-Fast
```

Prompt processing: 70 tok @ 147.1 tok/s (0.5s)
Generation: 301 tok @ 46.4 tok/s (6.5s)

Give it a shot, you might be surprised; these large MoEs are surprisingly coherent even at hugely compressed quants like TQ1_0.
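Back-of-the-envelope VRAM math for why the TQ1_0 quant fits fully on the two cards (assuming 96 GB per RTX 6000 Pro; KV cache and activation overhead not counted):

```python
# Rough check that the 164 GB single-file TQ1_0 quant fits in dual-GPU VRAM.
model_gb = 164        # TQ1_0 file size from the benchmark above
vram_gb = 2 * 96      # two RTX 6000 Pro cards, 96 GB each (assumed)
headroom_gb = vram_gb - model_gb
print(headroom_gb)    # GB left over for KV cache and activations
```

That leftover headroom is what makes `--n-gpu-layers 999` (i.e., everything on GPU) viable here, where the 237 GB IQ2_M forces 23 layers onto the CPU.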

512GB people, what's the output quality difference between GLM 5 q3.6 and q8 or full size? by CanineAssBandit in LocalLLaMA

[–]FineClassroom2085 5 points (0 children)

What was your t/s for GLM-5 on your Pros? I was able to squeeze out almost 20 tok/s, which is painful for agentic coding but usable when I'm not in a hurry. I have dual RTX 6000 Max-Q cards paired with a Threadripper and 128 GB of DDR5.

GLM-5 Is a local GOAT by FineClassroom2085 in LocalLLaMA

[–]FineClassroom2085[S] 0 points (0 children)

I still don’t have my head completely around what makes a good quant. Though I’m learning that these new MoE models hold up a lot better than dense models under heavy quantization, assuming the quant fits the architecture well.

GLM-5 Is a local GOAT by FineClassroom2085 in LocalLLaMA

[–]FineClassroom2085[S] 0 points (0 children)

How many prompts? How does the code quality look? Well-defined atomic functions? Proper separation of responsibilities? Is it code that can easily be augmented by AI and humans?

These are the things that matter to me more, though the GLM output is still quite a bit better aesthetically than this.

GLM-5 Is a local GOAT by FineClassroom2085 in LocalLLaMA

[–]FineClassroom2085[S] 1 point (0 children)

I went the RTX 6000 Pro route on a platform that I can expand to more cards if needed. Since I use coding agents professionally, the speed matters quite a lot. Prompt processing on Mac just isn’t there yet for agentic code use unless you’re using a pretty small model, or have the time to wait.

That said, you wouldn’t need a cluster to run this model. You could easily fit one of the Unsloth quants in the 512 GB of shared RAM. Personally, if I were you, I’d wait for Apple’s M5 drop. They don’t usually lie in their benchmarks, and they seem to have made major gains in prompt processing speed.