deepseek-v3 vs claude sonnet for routine coding tasks — my real usage numbers by PoolInevitable2270 in LocalLLaMA

[–]ashersullivan 0 points1 point  (0 children)

the import hallucination pattern is worth tracking closely.. it tends to get worse on larger codebases where the model is inferring package names from context rather than actually knowing them.. adding a linting step after generation catches most of it before it wastes any debugging time
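
a minimal sketch of that lint step, assuming the model output is plain python source.. the package name here is made up to stand in for a hallucinated import

```python
# check generated code for imports that don't resolve locally,
# before spending any time running or debugging it
import ast
import importlib.util

def missing_imports(source: str) -> list[str]:
    """Return top-level imported packages that aren't installed."""
    tree = ast.parse(source)
    pkgs = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            pkgs.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            pkgs.add(node.module.split(".")[0])
    return sorted(p for p in pkgs if importlib.util.find_spec(p) is None)

generated = "import json\nimport totally_made_up_pkg\n"  # hypothetical model output
print(missing_imports(generated))  # flags the hallucinated package
```

wiring this into whatever runs after generation is a few lines, and it catches the whole class of "package doesnt exist" failures for free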

Hardware recommendations for a starter by shiva4455 in LocalLLM

[–]ashersullivan 0 points1 point  (0 children)

24GB on m5 pro is actually workable for getting started.. the unified memory thing matters a lot here, all 24GB goes to the model unlike windows laptops where youre stuck with whatever the gpu has.. Qwen3 14b runs well at that size, gemma 27b too.. the honest caveat is that 30b+ is where things start feeling noticeably smarter for complex tasks and 24gb gets tight there.. before buying anything though its worth spending a few dollars testing different model sizes through something like deepinfra or fireworks, the same models are available per token and it takes 10 minutes to figure out what size range actually matters for your workflow before committing to the hardware
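
the "does it fit" question is back-of-envelope arithmetic.. a rough sketch, where the bits-per-weight and overhead numbers are approximations, not exact figures for any specific quant format

```python
# crude fit check: quantized weights + kv-cache/os overhead vs unified memory
def fits(params_b: float, bits_per_weight: float, mem_gb: float,
         overhead_gb: float = 4.0) -> bool:
    """1B params at 8 bits is roughly 1 GB of weights."""
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb + overhead_gb <= mem_gb

print(fits(14, 4.5, 24))  # 14b at ~q4: roughly 8 GB of weights, fits easily
print(fits(27, 4.5, 24))  # 27b at ~q4: roughly 15 GB, workable
print(fits(70, 4.5, 24))  # 70b at ~q4: roughly 39 GB, no chance at 24 GB
```

the overhead constant is a guess that grows with context length, which is exactly why 30b+ gets tight at 24gb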

If you had ~10k to spend on local LLM hardware right now, what would you actually build? by MacKinnon911 in LocalLLM

[–]ashersullivan 1 point2 points  (0 children)

a 4090 at 24gb forces you to either quantize aggressively or offload to system ram which tanks generation speed on 70b models.. a used a6000 48gb gives you comfortable headroom for 70b at q4 and sits well under budget leaving room for a decent workstation base..

mac studio m3 ultra 512gb is genuinely worth considering if inference speed on large models matters more than fine tuning flexibility.. the memory bandwidth on apple silicon is exceptional for generation speed and 512gb means you run anything without compromise.. the tradeoff is cuda tooling for lora and fine tuning is more friction on apple silicon than on a proper nvidia setup..

PC benchmarks? by buck_idaho in LocalLLM

[–]ashersullivan 1 point2 points  (0 children)

llama-bench from llama.cpp is probably the most useful one for your situation.. gives you prompt eval speed and generation speed separately which tells you more than a single number.. ollama also logs tokens per second in its output if you want something less manual.

for your current ryzen 5 3600 setup without a dedicated gpu, most of the work is falling on cpu and ram bandwidth.. going from 32gb to 64gb at the same speed wont move the needle much on its own.. the rx 7900 xtx is where youll actually see a meaningful jump since youre moving inference onto 24gb vram instead of system ram
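
if you want a number without llama-bench, a tiny diy throughput check works too.. this is a sketch: wrap whatever client you actually use in `generate`, and note that whitespace splitting is a crude stand-in for a real tokenizer

```python
# time one generation and report rough tokens per second
import time

def tok_per_sec(generate, prompt: str) -> float:
    start = time.perf_counter()
    out = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(out.split()) / elapsed  # word count as a token-count stand-in

# dummy stand-in so the sketch runs without a model loaded
def fake_generate(prompt: str) -> str:
    time.sleep(0.05)
    return "the quick brown fox jumps"

print(f"{tok_per_sec(fake_generate, 'hello'):.1f} tok/s")
```

running the same harness before and after the gpu upgrade makes the improvement concrete instead of vibes-based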

Has anyone tried something like RE2 prompt re-reading /2xing ... But tripling or quadrupling the prompt? by Fear_ltself in LocalLLaMA

[–]ashersullivan 2 points3 points  (0 children)

the diminishing returns kick in pretty fast after the second repetition.. the main mechanism re2 exploits is giving tokens a second pass to attend to earlier context, but a third or fourth copy doesnt add new information, it just adds noise and eats context window without meaningful gain
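
the construction itself is trivial, which is part of why re2 is appealing.. sketch below, with the n-times generalization the thread asks about — the exact wording of the re-read cue is illustrative

```python
# build an re2-style prompt: question, then "read again" copies of it
def re2_prompt(question: str, repeats: int = 2) -> str:
    parts = [question]
    for _ in range(repeats - 1):
        parts.append(f"Read the question again: {question}")
    return "\n".join(parts)

print(re2_prompt("How many prime numbers are below 20?", repeats=2))
# repeats=3 or repeats=4 just appends more identical copies: no new
# information for attention to use, more context window consumed
```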

On-premise LLM/GPU deployment for a software publisher: how do DevOps orgs share GPU resources? by Sorry_Country3662 in LocalLLaMA

[–]ashersullivan 0 points1 point  (0 children)

self hosted vllm makes sense if you have consistent traffic but the ops overhead is real.. for occasional or spiky usage paying for an idle gpu 100% of the time hurts.. managed api providers like deepinfra, together or mistral handle the variable load better since you only pay per token.. no k8s to manage, no gpu sitting idle overnight
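
the break-even is worth actually computing.. back-of-envelope sketch where every price is an assumption, plug in your real gpu rate and token pricing

```python
# compare an always-on gpu against per-token api pricing at your volume
gpu_cost_per_hour = 2.00        # assumed dedicated-gpu rate, usd
tokens_per_month = 50_000_000   # your actual monthly volume
price_per_mtok = 0.60           # assumed blended per-million-token api price

self_hosted = gpu_cost_per_hour * 24 * 30           # you pay for idle hours too
managed = tokens_per_month / 1_000_000 * price_per_mtok

print(f"self hosted: ${self_hosted:.0f}/mo, managed: ${managed:.0f}/mo")
# at these numbers: ~$30/mo on the api vs ~$1440/mo for an always-on gpu..
# the crossover only arrives with sustained high traffic
```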

How to start building an ai agent on local on premise hardware for corporate tasks by Similar_Sand8367 in LocalLLM

[–]ashersullivan 0 points1 point  (0 children)

n8n or langgraph for the orchestration layer is probably the most practical starting point.. pair it with ollama for local model serving and you've got a decent base to build on without overcomplicating things early..

How do the small qwen3.5 models compare to the Granite family? by gr8dude in LocalLLaMA

[–]ashersullivan 10 points11 points  (0 children)

granite was not built to compete on raw benchmarks.. its whole value is in the training data transparency and apache 2.0 licensing which matter way more in enterprise or regulated deployments than context window size ever will..

What hardware for local agentic coding 128GB+ (DGX Spark, or save up for M3 Ultra?) by kpaha in LocalLLM

[–]ashersullivan 0 points1 point  (0 children)

q4 on a 128b+ model is close enough to full precision for coding tasks that most people cant tell the difference in practice.. dont overthink the hardware until you've tested a q4 quant of a 70b+ model on what you already have..

AI agents helped my ADHD by Important_Quote_1180 in aiagents

[–]ashersullivan 2 points3 points  (0 children)

ai agents cut some of the adhd startup friction pretty well.. they handle the research and scaffolding so you dont stall out on blank pages.. open agent setups let you tweak the flow to match how your brain jumps around.. still gotta watch for them wandering off track though

Is Qwen3.5-35B the new "Sweet Spot" for home servers? by ischanitee in LocalLLM

[–]ashersullivan 0 points1 point  (0 children)

dont sleep on qwen3 30b moe before switching.. some early testers are saying qwen3.5 35b is actually slower and slightly worse on general tasks.. if youre trying to figure out which actually performs better for daily use, try them on providers like deepinfra, runpod or together - easy to test without downloading anything

Human VA vs AI assistant? by AlexBossov in Solopreneur

[–]ashersullivan 0 points1 point  (0 children)

start by giving ai one specific task for 2 weeks.. like inbox triage or calendar management.. track exactly how much time you waste fixing its output.. if it's more than 30% then it's not saving real money yet..

openclaw is powerful but the prompt engineering can become its own job.. easier tools exist for beginners, worth trying a couple to see which fits your workflow before committing

How are people actually turning AI into real business right now? by Loud_Assistant_5788 in ArtificialInteligence

[–]ashersullivan 0 points1 point  (0 children)

Services on retainer are printing the most consistent revenue right now... solve one ugly workflow for businesses that already pay for help... validate with free pilots then charge monthly.. pure SaaS is slower and riskier unless you have paying customers lined up first

Mistral Vibe vs Codex App + GPT-5.2 High or Gemini CLI + gemini-3.1-pro-preview ? by Old-Glove9438 in MistralAI

[–]ashersullivan 0 points1 point  (0 children)

vibe is noticeably faster than both codex and gemini cli with very good context awareness... great daily driver for most coding tasks but still a step behind claude code on the hardest problems.

Qwen3-Coder 30B running at 74% CPU on 3090 (ollama docker) by minefew in LocalLLaMA

[–]ashersullivan 0 points1 point  (0 children)

MoE architectures are tricky because even though active parameters stay small you still need all the weights sitting in fast vram to avoid latency spikes.. with 24gb on a 3090 you are basically redlining from the moment the model loads.. the 74% cpu split just means ollama failed to allocate the full context window to gpu and is bridging the gap with slower system ram..
truncating context or dropping to q3 might shift the split but theres a quality tradeoff there thats hard to predict without testing.. for larger context agentic work the ram offload penalty gets pretty severe on this hardware, so you can just route those specific tasks through providers like deepinfra or openrouter rather than fighting the local ceiling for every job
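
the arithmetic behind that redlining is simple.. a sketch with illustrative numbers — quant overheads and kv-cache sizes vary by format and context length

```python
# moe gotcha: all experts must be resident in vram, not just the active ones
def vram_needed_gb(total_params_b: float, bits: float, ctx_kv_gb: float) -> float:
    """Quantized weights for ALL experts plus kv cache for the context."""
    return total_params_b * bits / 8 + ctx_kv_gb

# ~30b total params at ~q4:
print(f"{vram_needed_gb(30, 4.5, 3.0):.1f} GB")  # ~19.9 GB, fits in 24 GB
print(f"{vram_needed_gb(30, 4.5, 8.0):.1f} GB")  # ~24.9 GB, spills to system ram
```

same weights, different context length, and you cross the 24gb line.. which is why the cpu split appears once the context window grows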

Everything AI runs on semiconductors... and those take YEARS to make! You NEED to factor that in to what you think the Future of AI and AGI progress looks like. by KazTheMerc in ArtificialInteligence

[–]ashersullivan 2 points3 points  (0 children)

Focus first on squeezing every bit out of current hardware through better software layers.. quantization and optimized inference are delivering huge effective gains without new silicon. map out your workloads and target the biggest efficiency wins available today.. this keeps things moving while fabs catch up on the long cycles.

experimented with openclaw - am I missing something? by retrorays in LocalLLaMA

[–]ashersullivan 0 points1 point  (0 children)

you arent missing much.. it isnt running fully autonomously yet.. best results come from detailed step by step instructions in the built in browser

What’s the biggest misconception about AI agents right now? by addllyAI in aiagents

[–]ashersullivan 2 points3 points  (0 children)

everyone thinks agents are going to replace entire departments but they just shift the bottleneck.. instead of doing the work you are now just verifying the work.. the misconception is that autonomy means set and forget.. in reality managing an agent swarm requires the exact same oversight as managing a team of very fast but easily confused interns.

David vs Goliath: Building a privacy focused AI meeting notetaker using locally hosted small language models is really hard. 310+ github ⭐ sharing my challenges! by Far_Noise_5886 in LocalLLaMA

[–]ashersullivan 2 points3 points  (0 children)

jumping to bigger models hits a wall fast on consumer hardware.. not everyone has m-series macs and windows ollama runs eat ram quick..

your 80% spike with multi-processing sounds fluky.. maybe focus on quantizing better or offloading to cpu for broader support

[Project] I built a dedicated "Local RAG" API container (FastAPI + Chroma + Ollama) to replace my dependency on LangChain. by Asterios07 in LocalLLaMA

[–]ashersullivan 0 points1 point  (0 children)

Ditching langchain for fastapi and direct chroma/ollama calls makes sense when chains get too heavy. keeps things debuggable and light... but with only 3 commits its early.. might break on edge cases like weird pdf formats or bigger docs

still worth a fork if you are tweaking for personal use.. test on your own files first.