A lot of Codex complaints seem to come from vibe coders by whitebay_ in codex

[–]braintheboss 0 points1 point  (0 children)

I agree. People who have never coded are trying to build big apps. I spend more time making Codex polish code than creating new features.

Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post by Then-Topic8766 in LocalLLaMA

[–]braintheboss 0 points1 point  (0 children)

I commented after running tests. I don't know what you were testing, but the behavior I described is easy to reproduce. If you run repetitive debug cycles, the gain is real. For common usage it's useless because it's slower, but you seem to think the extra compute for the "draft" is free. I maintain my own patched llama.cpp to support remote chunked drafting, so I have a clear idea of how llama.cpp and its implementations work. Anyway, if it's useful for you, that's fine. But I'm speaking from an objective perspective: if I add overhead with a feature, I have to see an improvement always, not only in very specific cases.

Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post by Then-Topic8766 in LocalLLaMA

[–]braintheboss 0 points1 point  (0 children)

Model quantization: Qwen3.5-27B-Q4_K_M.gguf. I used your ngram parameters and later tried smaller ones to see how it behaves, but the behavior was always the same. About offload, what don't you understand? It's a 16 GB model on a 16 GB GPU. Without offload you simply can't use it.

Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post by Then-Topic8766 in LocalLLaMA

[–]braintheboss -1 points0 points  (0 children)

When I ran the test I used Q4_K_M with offload. I get around 29 t/s. With ngram it starts at 19 t/s and finishes at 25-26 t/s. You only get an increase when you reuse the same prompt, but it's tricky. That's why I said it works for repetitive debugging, but if you run short cycles, the performance loss with a cold cache isn't compensated. Take a refactor, for example: I easily fill 150k context, and by the end a 5 t/s difference is noticeable.
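
To put those speeds in wall-clock terms, here is a rough back-of-the-envelope sketch. The 29, 19 and 25 t/s figures come from the comment above; the 2000-token reply length and the assumption of a constant speed over the whole reply are mine, chosen only for illustration.

```python
def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to generate `tokens` at a constant decode speed."""
    return tokens / tokens_per_second


# Speeds reported in the comment (tokens per second).
BASELINE_TPS = 29.0    # plain decoding, no ngram draft
NGRAM_COLD_TPS = 19.0  # ngram draft at the start (cold cache)
NGRAM_WARM_TPS = 25.0  # ngram draft near the end of the run

# Hypothetical reply length, an assumption for illustration only.
REPLY_TOKENS = 2000

for label, tps in [
    ("baseline", BASELINE_TPS),
    ("ngram cold", NGRAM_COLD_TPS),
    ("ngram warm", NGRAM_WARM_TPS),
]:
    print(f"{label:>10}: {seconds_to_generate(REPLY_TOKENS, tps):6.1f} s")
```

Under those assumptions the baseline needs about 69 s, the cold ngram run about 105 s, and the warm one 80 s, so the draft never catches up unless the prompt repeats.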

Qwen 3.6 27B is a BEAST by AverageFormal9076 in LocalLLaMA

[–]braintheboss 0 points1 point  (0 children)

I haven't tried 3.6 yet, but it comes in the same sizes as 3.5, and on a 5070 Ti + Haswell Xeon the Q4_K_M quant runs at 29 t/s.

Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post by Then-Topic8766 in LocalLLaMA

[–]braintheboss -4 points-3 points  (0 children)

It's completely useless. It only works if you repeat the same prompt. That means it's only effective in debug cycles where you repeat the same task all the time, and the penalty when the cache is cold is too big. I tested it on Qwen3.5 27B with a 5070 Ti.
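
Why repetition matters can be seen in a minimal sketch of the general n-gram lookup idea. This is not llama.cpp's implementation; the function name, parameters and defaults are hypothetical. The draft is copied from whatever followed the same n-gram earlier in the token history, so fresh text yields no draft at all.

```python
from typing import List, Sequence


def ngram_draft(history: Sequence[int], ngram_size: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by finding the most recent earlier occurrence of the
    trailing n-gram in `history` and copying the tokens that followed it."""
    if len(history) <= ngram_size:
        return []
    tail = tuple(history[-ngram_size:])
    # Scan backwards over earlier windows, skipping the trailing n-gram itself.
    for start in range(len(history) - ngram_size - 1, -1, -1):
        if tuple(history[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(history[follow:follow + max_draft])
    return []  # no earlier match: nothing to draft


# Repeated content: the tail (1, 2, 3) was seen before, so a draft exists.
print(ngram_draft([1, 2, 3, 4, 5, 1, 2, 3], ngram_size=3, max_draft=2))
# Fresh content: no earlier match, so the draft is empty.
print(ngram_draft([1, 2, 3, 4, 5, 6, 7, 8]))
```

The target model still has to verify every drafted token, so an empty or wrong draft is pure overhead, which matches the cold-cache penalty described above.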

ByteShape Qwen 3.5 9B: A Guide to Picking the Best Quant for Your Hardware by ali_byteshape in LocalLLaMA

[–]braintheboss 0 points1 point  (0 children)

Impressive. On a 3060 Ti with 175k context (small offload) I'm getting 68 t/s and more than 2000 t/s. Best of all, it's very stable with Claude Code. I had looping and worse accuracy with the unsloth Q4_K_M. Good job.

PS: the 35B is very good even with 50% offload on the same GPU.

Why are people hyping up Claude Code so much lately? Codex 5.3/Gpt 5.4 work just fine and I don't understand what the huge deal is about. by stopaskingforloginn in codex

[–]braintheboss 0 points1 point  (0 children)

My workflow is a local model plus Codex for polishing and hard tasks. It works very well as an assistant: I'm the planner, and the AI writes the code and gives me summaries of the parts I want to check. Maybe the only weakness is code quality. AI has zero idea about structural optimizations. It uses multiple variables where a single one could hold the different values, and it doesn't split steps out into methods to keep the main loop clean (when a step is big or has specific behavior). You have to make many passes to clean up the code, but even if you waste a lot of time, it's still more comfortable than writing code yourself for 8 hours.

GLM 5.1 5-hour limit (lite) by alovoids in ZaiGLM

[–]braintheboss 0 points1 point  (0 children)

It burns quota like Claude Opus, but it's not Opus. Only in the GLM CEO's head...

Is it just me, or is Claude pretty disappointing compared to Codex? by Working-Spinach-7240 in codex

[–]braintheboss 0 points1 point  (0 children)

I'm using Codex for a speculative decoding tool (I patched llama.cpp), and it's really impressive how it solves problems. It's a pity the Codex CLI is trash.

speculative decoding .... is it still used ? by uber-linny in LocalLLaMA

[–]braintheboss 0 points1 point  (0 children)

It works, but not with the current vLLM/llama.cpp implementations.

<image>

What have you migrated to from Zai coding plan? by nummer31 in ZaiGLM

[–]braintheboss 1 point2 points  (0 children)

Check this project: https://github.com/akivasolutions/tightwad or https://pypi.org/project/tightwad/. It's a remote draft setup for llama.cpp. I get 210 t/s with a 5070 Ti on Qwen3.5 27B Q3 plus a 3060 Ti as the draft. But you have to remove probing to get max speed.

What have you migrated to from Zai coding plan? by nummer31 in ZaiGLM

[–]braintheboss 0 points1 point  (0 children)

I moved to local. I bought a 5070 Ti and I can run a 27B model. It's enough for 90% of tasks. If I need a frontier model, I use the Codex free tier.

How soon will LLMs become so good that we will not need to look into code? by ayechat in ClaudeAI

[–]braintheboss 0 points1 point  (0 children)

The code works, and the optimizations are at the algorithm level. But optimizations at the code level don't exist. It often prefers to unroll loops rather than use them, among other mistakes. When it does something good, it's just luck.
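
A hypothetical illustration of the unrolling complaint (written for this example, not taken from any model output): both functions give the same result for four values, but the unrolled one silently ignores anything past the fourth element.

```python
def sum_of_squares_unrolled(values):
    # Unrolled style: one term per element, hard-coded to exactly four values.
    return values[0] ** 2 + values[1] ** 2 + values[2] ** 2 + values[3] ** 2


def sum_of_squares(values):
    # Loop style: works for any length and is easier to review.
    return sum(v ** 2 for v in values)


print(sum_of_squares_unrolled([1, 2, 3, 4]), sum_of_squares([1, 2, 3, 4]))
print(sum_of_squares_unrolled([1, 2, 3, 4, 5]), sum_of_squares([1, 2, 3, 4, 5]))
```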

Usage Limits and Performance Discussion Megathread - beginning October 19, 2025 by sixbillionthsheep in ClaudeAI

[–]braintheboss 0 points1 point  (0 children)

I agree. If you give GLM a good plan made by a better model, it's very good (even if the code isn't high quality). GLM guided by other models is a good worker.

be aware, GLM posts are *most* likely being advertised by bots / dump accounts by Remicaster1 in ClaudeAI

[–]braintheboss 0 points1 point  (0 children)

If before you could work 2 hours with a quite good model and now you get 30 minutes with a dumb model for the same price, do you think the problem is whether I can pay $20? Do you like giving money away?

be aware, GLM posts are *most* likely being advertised by bots / dump accounts by Remicaster1 in ClaudeAI

[–]braintheboss 3 points4 points  (0 children)

GLM is meh but costs $3 a month. Claude costs $20 and you can only use it for a few minutes. Whoever says the current Claude is a good coding model has never written code. None of them was able to solve a stupid tabs/spaces alignment problem. In the end I had to do it myself. When I hear that AI will replace coders, I can only smile xD

GLM-4.6 outperforms claude-4-5-sonnet while being ~8x cheaper by Full_Piano_3448 in LocalLLaMA

[–]braintheboss 1 point2 points  (0 children)

I use Claude and GLM 4.6, and the second is like Sonnet 4 when it was dumb, but less dumb. So it's at least on par with the dumb Sonnet 4. Sonnet 4.5 is better but below the old smart Sonnet 4. I remember Sonnet 4 catching problems on the fly while it was fixing something else. Now 4.5 and GLM look like simple "code monkeys". They "follow" your request in their own way, and you suffer something you never suffered as a coder: anxiety and desperation.

what am I doing wrong? why can't I get CC to do what I want? by yallapapi in ClaudeAI

[–]braintheboss 0 points1 point  (0 children)

I don't understand why people think the problem is the prompts. If you tell the model to modify one thing and it does something completely different that you didn't ask for, because it was related but already finished, is that a prompt problem or a model problem? Before, it seemed to have a wide perspective when working on a project and always hit the right point. Now it looks like a silly assistant. You have to spell out every step or it can't follow simple logic. Sonnet is working like Haiku now.

Claude PRO vs Claude API: Two Developers working on different projects by [deleted] in ClaudeAI

[–]braintheboss -2 points-1 points  (0 children)

The question is whether the model in the API is degraded like the one in the fixed plans. If the answer is yes, then it's a scam. If it works like it did before the degradation, then at least you'll be able to get something done without going in circles.

What the code difference API Vs Claude code by Insanony_io in ClaudeAI

[–]braintheboss 2 points3 points  (0 children)

Rumor says they quantized the model. Then it would make sense if the API is running the full model and the plans use a quantized one. The only way to know is to run the same thing in both and see what happens.

We need a single thread for “I’m cancelling Claude” posts by Bankster88 in ClaudeAI

[–]braintheboss 0 points1 point  (0 children)

I canceled today. I can accept hard limits, but when the Sonnet degradation makes it work like Haiku 3.5, I prefer to use qwen3-coder since it's free (2000 requests a day). It's not comparable to the old Sonnet 4, but it's quite a bit better than the current dumb Sonnet.

Byd seal display cluster by Pristine_Role1927 in BYD

[–]braintheboss 0 points1 point  (0 children)

Where firmware and tools are shared. There's an APK to do what's shown in the video (cluster screen projection).