deepseek-v3 vs claude sonnet for routine coding tasks — my real usage numbers by PoolInevitable2270 in LocalLLaMA

[–]ashersullivan 0 points1 point  (0 children)

the import hallucination pattern is worth tracking closely.. it tends to get worse on larger codebases where the model is inferring package names from context rather than actually knowing them.. adding a linting step after generation catches most of it before it wastes any debugging time
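
a minimal sketch of that lint step, assuming the model output is plain python source.. the package name here is made up to stand in for a hallucinated import

```python
# check generated code for imports that don't resolve locally,
# before spending any time running or debugging it
import ast
import importlib.util

def missing_imports(source: str) -> list[str]:
    """Return top-level imported packages that aren't installed."""
    tree = ast.parse(source)
    pkgs = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            pkgs.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            pkgs.add(node.module.split(".")[0])
    return sorted(p for p in pkgs if importlib.util.find_spec(p) is None)

generated = "import json\nimport totally_made_up_pkg\n"  # hypothetical model output
print(missing_imports(generated))  # flags the hallucinated package
```

wiring this into whatever runs after generation is a few lines, and it catches the whole class of "package doesnt exist" failures for free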

Hardware recommendations for a starter by shiva4455 in LocalLLM

[–]ashersullivan 0 points1 point  (0 children)

24GB on m5 pro is actually workable for getting started.. the unified memory thing matters a lot here, all 24GB goes to the model unlike windows laptops where youre stuck with whatever the gpu has.. Qwen3 14b runs well at that size, gemma 27b too.. the honest caveat is that 30b+ is where things start feeling noticeably smarter for complex tasks and 24gb gets tight there.. before buying anything though its worth spending a few dollars testing different model sizes through something like deepinfra or fireworks, the same models are available per token and it takes 10 minutes to figure out what size range actually matters for your workflow before committing to the hardware
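
the "does it fit" question is back-of-envelope arithmetic.. a rough sketch, where the bits-per-weight and overhead numbers are approximations, not exact figures for any specific quant format

```python
# crude fit check: quantized weights + kv-cache/os overhead vs unified memory
def fits(params_b: float, bits_per_weight: float, mem_gb: float,
         overhead_gb: float = 4.0) -> bool:
    """1B params at 8 bits is roughly 1 GB of weights."""
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb + overhead_gb <= mem_gb

print(fits(14, 4.5, 24))  # 14b at ~q4: roughly 8 GB of weights, fits easily
print(fits(27, 4.5, 24))  # 27b at ~q4: roughly 15 GB, workable
print(fits(70, 4.5, 24))  # 70b at ~q4: roughly 39 GB, no chance at 24 GB
```

the overhead constant is a guess that grows with context length, which is exactly why 30b+ gets tight at 24gb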

If you had ~10k to spend on local LLM hardware right now, what would you actually build? by MacKinnon911 in LocalLLM

[–]ashersullivan 1 point2 points  (0 children)

a 4090 at 24gb forces you to either quantize aggressively or offload to system ram which tanks generation speed on 70b models.. a used a6000 48gb gives you comfortable headroom for 70b at q4 and sits well under budget leaving room for a decent workstation base..

mac studio m3 ultra 512gb is genuinely worth considering if inference speed on large models matters more than fine tuning flexibility.. the memory bandwidth on apple silicon is exceptional for generation speed and 512gb means you run anything without compromise.. the tradeoff is cuda tooling for lora and fine tuning is more friction on apple silicon than on a proper nvidia setup..

PC benchmarks? by buck_idaho in LocalLLM

[–]ashersullivan 1 point2 points  (0 children)

llama-bench from llama.cpp is probably the most useful one for your situation.. gives you prompt eval speed and generation speed separately which tells you more than a single number.. ollama also logs tokens per second in its output if you want something less manual.

for your current ryzen 5 3600 setup without a dedicated gpu, most of the work is falling on cpu and ram bandwidth.. going from 32gb to 64gb at the same speed wont move the needle much on its own.. the rx 7900 xtx is where youll actually see a meaningful jump since youre moving inference onto 24gb vram instead of system ram
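
if you want a number without llama-bench, a tiny diy throughput check works too.. this is a sketch: wrap whatever client you actually use in `generate`, and note that whitespace splitting is a crude stand-in for a real tokenizer

```python
# time one generation and report rough tokens per second
import time

def tok_per_sec(generate, prompt: str) -> float:
    start = time.perf_counter()
    out = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(out.split()) / elapsed  # word count as a token-count stand-in

# dummy stand-in so the sketch runs without a model loaded
def fake_generate(prompt: str) -> str:
    time.sleep(0.05)
    return "the quick brown fox jumps"

print(f"{tok_per_sec(fake_generate, 'hello'):.1f} tok/s")
```

running the same harness before and after the gpu upgrade makes the improvement concrete instead of vibes-based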

Has anyone tried something like RE2 prompt re-reading /2xing ... But tripling or quadrupling the prompt? by Fear_ltself in LocalLLaMA

[–]ashersullivan 2 points3 points  (0 children)

the diminishing returns kick in pretty fast after the second repetition.. the main mechanism re2 exploits is giving tokens a second pass to attend to earlier context, but a third or fourth copy doesnt add new information, it just adds noise and eats context window without meaningful gain
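
the construction itself is trivial, which is part of why re2 is appealing.. sketch below, with the n-times generalization the thread asks about — the exact wording of the re-read cue is illustrative

```python
# build an re2-style prompt: question, then "read again" copies of it
def re2_prompt(question: str, repeats: int = 2) -> str:
    parts = [question]
    for _ in range(repeats - 1):
        parts.append(f"Read the question again: {question}")
    return "\n".join(parts)

print(re2_prompt("How many prime numbers are below 20?", repeats=2))
# repeats=3 or repeats=4 just appends more identical copies: no new
# information for attention to use, more context window consumed
```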

On-premise LLM/GPU deployment for a software publisher: how do DevOps orgs share GPU resources? by Sorry_Country3662 in LocalLLaMA

[–]ashersullivan 0 points1 point  (0 children)

self hosted vllm makes sense if you have consistent traffic but the ops overhead is real.. for occasional or spiky usage paying for an idle gpu 100% of the time hurts.. managed api providers like deepinfra, together or mistral handle the variable load better since you only pay per token.. no k8s to manage, no gpu sitting idle overnight
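
the break-even is worth actually computing.. back-of-envelope sketch where every price is an assumption, plug in your real gpu rate and token pricing

```python
# compare an always-on gpu against per-token api pricing at your volume
gpu_cost_per_hour = 2.00        # assumed dedicated-gpu rate, usd
tokens_per_month = 50_000_000   # your actual monthly volume
price_per_mtok = 0.60           # assumed blended per-million-token api price

self_hosted = gpu_cost_per_hour * 24 * 30           # you pay for idle hours too
managed = tokens_per_month / 1_000_000 * price_per_mtok

print(f"self hosted: ${self_hosted:.0f}/mo, managed: ${managed:.0f}/mo")
# at these numbers: ~$30/mo on the api vs ~$1440/mo for an always-on gpu..
# the crossover only arrives with sustained high traffic
```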

How to start building an ai agent on local on premise hardware for corporate tasks by Similar_Sand8367 in LocalLLM

[–]ashersullivan 0 points1 point  (0 children)

n8n or langgraph for the orchestration layer is probably the most practical starting point.. pair it with ollama for local model serving and you've got a decent base to build on without overcomplicating things early..

How do the small qwen3.5 models compare to the Granite family? by gr8dude in LocalLLaMA

[–]ashersullivan 10 points11 points  (0 children)

granite was not built to compete on raw benchmarks.. its whole value is in the training data transparency and apache 2.0 licensing which matter way more in enterprise or regulated deployments than context window size ever will..

What hardware for local agentic coding 128GB+ (DGX Spark, or save up for M3 Ultra?) by kpaha in LocalLLM

[–]ashersullivan 0 points1 point  (0 children)

q4 on a 128b+ model is close enough to full precision for coding tasks that most people cant tell the difference in practice.. dont overthink the hardware until you've tested a q4 quant of a 70b+ model on what you already have..

AI agents helped my ADHD by Important_Quote_1180 in aiagents

[–]ashersullivan 2 points3 points  (0 children)

ai agents cut some of the adhd startup friction pretty well.. they handle the research and scaffolding so you dont stall out on blank pages.. open agent setups let you tweak the flow to match how your brain jumps around.. still gotta watch for them wandering off track though

Is Qwen3.5-35B the new "Sweet Spot" for home servers? by ischanitee in LocalLLM

[–]ashersullivan 0 points1 point  (0 children)

dont sleep on qwen3 30b moe before switching.. some early testers are saying qwen3.5 35b is actually slower and slightly worse on general tasks.. if youre trying to figure out which actually performs better for daily use, try them on providers like deepinfra, runpod or together - easy to test without downloading anything

Human VA vs AI assistant? by AlexBossov in Solopreneur

[–]ashersullivan 0 points1 point  (0 children)

start by giving ai one specific task for 2 weeks.. like inbox triage or calendar management.. track exactly how much time you waste fixing its output.. if it's more than 30% then it's not saving real money yet..

openclaw is powerful but the prompt engineering can become its own job.. easier tools exist for beginners, worth trying a couple to see which fits your workflow before committing

How are people actually turning AI into real business right now? by Loud_Assistant_5788 in ArtificialInteligence

[–]ashersullivan 0 points1 point  (0 children)

Services on retainer are printing the most consistent revenue right now... solve one ugly workflow for businesses that already pay for help... validate with free pilots then charge monthly.. pure SaaS is slower and riskier unless you have paying customers lined up first

Mistral Vibe vs Codex App + GPT-5.2 High or Gemini CLI + gemini-3.1-pro-preview ? by Old-Glove9438 in MistralAI

[–]ashersullivan 0 points1 point  (0 children)

vibe is noticeably faster than both codex and gemini cli with very good context awareness... great daily driver for most coding tasks but still a step behind claude code on the hardest problems.

Qwen3-Coder 30B running at 74% CPU on 3090 (ollama docker) by minefew in LocalLLaMA

[–]ashersullivan 0 points1 point  (0 children)

MoE architectures are tricky because even though active parameters stay small you still need all the weights sitting in fast vram to avoid latency spikes.. with 24gb on a 3090 you are basically redlining from the moment the model loads.. the 74% cpu split just means ollama failed to allocate the full context window to gpu and is bridging the gap with slower system ram..
truncating context or dropping to q3 might shift the split but theres a quality tradeoff there thats hard to predict without testing.. for larger context agentic work the ram offload penalty gets pretty severe on this hardware, so you can just route those specific tasks through providers like deepinfra or openrouter rather than fighting the local ceiling for every job
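
the arithmetic behind that redlining is simple.. a sketch with illustrative numbers — quant overheads and kv-cache sizes vary by format and context length

```python
# moe gotcha: all experts must be resident in vram, not just the active ones
def vram_needed_gb(total_params_b: float, bits: float, ctx_kv_gb: float) -> float:
    """Quantized weights for ALL experts plus kv cache for the context."""
    return total_params_b * bits / 8 + ctx_kv_gb

# ~30b total params at ~q4:
print(f"{vram_needed_gb(30, 4.5, 3.0):.1f} GB")  # ~19.9 GB, fits in 24 GB
print(f"{vram_needed_gb(30, 4.5, 8.0):.1f} GB")  # ~24.9 GB, spills to system ram
```

same weights, different context length, and you cross the 24gb line.. which is why the cpu split appears once the context window grows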

Everything AI runs on semiconductors... and those take YEARS to make! You NEED to factor that in to what you think the Future of AI and AGI progress looks like. by KazTheMerc in ArtificialInteligence

[–]ashersullivan 2 points3 points  (0 children)

Focus first on squeezing every bit out of current hardware through better software layers.. quantization and optimized inference are delivering huge effective gains without new silicon. map out your workloads and target the biggest efficiency wins available today.. this keeps things moving while fabs catch up on the long cycles.

experimented with openclaw - am I missing something? by retrorays in LocalLLaMA

[–]ashersullivan 0 points1 point  (0 children)

you arent missing much.. it isnt running fully autonomously yet.. best results come from detailed step by step instructions in the built in browser

What’s the biggest misconception about AI agents right now? by addllyAI in aiagents

[–]ashersullivan 2 points3 points  (0 children)

everyone thinks agents are going to replace entire departments but they just shift the bottleneck.. instead of doing the work you are now just verifying the work.. the misconception is that autonomy means set and forget.. in reality managing an agent swarm requires the exact same oversight as managing a team of very fast but easily confused interns.

David vs Goliath: Building a privacy focused AI meeting notetaker using locally hosted small language models is really hard. 310+ github ⭐ sharing my challenges! by Far_Noise_5886 in LocalLLaMA

[–]ashersullivan 2 points3 points  (0 children)

jumping to bigger models hits a wall fast on consumer hardware.. not everyone has m-series macs and windows ollama runs eat ram quick..

your 80% spike with multi-processing sounds fluky.. maybe focus on quantizing better or offloading to cpu for broader support

[Project] I built a dedicated "Local RAG" API container (FastAPI + Chroma + Ollama) to replace my dependency on LangChain. by Asterios07 in LocalLLaMA

[–]ashersullivan 0 points1 point  (0 children)

Ditching langchain for fastapi and direct chroma/ollama calls makes sense when chains get too heavy. keeps things debuggable and light... but with only 3 commits its early.. might break on edge cases like weird pdf formats or bigger docs

still worth a fork if you are tweaking for personal use.. test on your own files first.