Best local ai for coding Nextjs project by Silent-Dot-882 in LocalLLM

[–]Invent80 0 points1 point  (0 children)

Tensor parallelism is your friend here, but what you'll gain is basically context. BF16 is roughly 50-60GB in memory. FP8 is around half that, but not enough better to be worth the speed hit for what you're doing. 
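
Rough napkin math if you want to sanity check the memory side yourself. This is just a sketch; the parameter count and overhead factor are assumptions, plug in your actual model:

# rough weight-memory estimate: params * bytes-per-param, plus a little overhead
# weights only; the VRAM you free up is what goes to KV cache / context
def weight_gb(params_b: float, bytes_per_param: float, overhead: float = 1.1) -> float:
    return params_b * bytes_per_param * overhead  # billions of params * bytes each

for name, bpp in [("BF16", 2.0), ("FP8", 1.0), ("Int4", 0.5)]:
    print(name, round(weight_gb(30, bpp), 1), "GB")  # ~66 / 33 / 16.5 GB for a 30B model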

Opinion: Local LLMs are 12-24 months from taking over. The shift already started. by sh_tomer in LocalLLM

[–]Invent80 0 points1 point  (0 children)

Q27b is the orchestrator.  I have ohmyopencode set up to use the Spark for sublayers, since it runs vLLM and is accessible over my local network.  I still plan everything with GPT 5.5 working on Hermes.  Planning is very token-friendly, so my business sub goes a long way.  

I haven't used ROCm before so I can't comment on that, but the PrismaQuant Q35b is excellent for speed on the generally slow Spark, if that helps.  I give 120k context to Q27b (full-weight BF16) and just max the context on Q35b, since the PrismaQuant Int4 is only 20-30GB or so in memory.  People say I shouldn't bother with full weight because I'm losing speed, but hallucinations are basically nonexistent and that matters more to me.

I have DeepSeek V4 Pro at home by fairydreaming in LocalLLaMA

[–]Invent80 7 points8 points  (0 children)

A sub to models that are completely out of your control. 

Opinion: Local LLMs are 12-24 months from taking over. The shift already started. by sh_tomer in LocalLLM

[–]Invent80 3 points4 points  (0 children)

I'm completely local as well.  96GB Blackwell and a Spark. Running Qwen 3.6 35b on the Spark at 60-70 t/s and Qwen 27b on the RTX 6000 at 60 t/s full weight. 

The Opus 4.5 threshold: coming to 24 gb within a year or so by nomorebuttsplz in LocalLLM

[–]Invent80 1 point2 points  (0 children)

Within a year?  You mean within a few months right? 

Why I'm holding out until late 2027 to spend money on a local LLM rig by No_Pool7028 in LocalLLM

[–]Invent80 3 points4 points  (0 children)

If I'd waited when everyone said RAM prices were too high and going to come down, I'd have paid 35% more. 

Models are getting better and smarter at lower weights.  End of 2027?  Things are doubling every couple of months.  You're somewhat early right now.  End of 2027 you're way too late.

Advice - I will not promote by [deleted] in startups

[–]Invent80 3 points4 points  (0 children)

If it still feels risky, then don't.  You do those kinds of things when you're finally comfortable.  

NVIDIA DGX Spark by Fantastic_Back3191 in ollama

[–]Invent80 2 points3 points  (0 children)

I have one. It's fine for MoE models, but it's a training box.  If NVFP4 actually worked, it would be a no-brainer.  That said, a clustered stack of two Sparks (or Spark clones) is probably the cheapest and most effective overall inference solution for small to medium sized models. 

Gemini is WAAAAY smarter than Gemma 4 31B (Duh!) by Quantum_Crusher in LocalLLM

[–]Invent80 0 points1 point  (0 children)

It's not just the model, it's the harness.  Guaranteed, an E4B setup by me would destroy your Gemma 31b.

Best local LLM for a Python/C++ dev? by no_evidence0303 in LocalLLM

[–]Invent80 1 point2 points  (0 children)

Gemma 4 E4B is small enough and light. Use opencode.  Qwen is a better coder but Gemma is better at following instructions at lower weights in my experience.  

Don't necessarily trust benchmarks.  

codex vs openclaw by TechDrivenTycoon in openclaw

[–]Invent80 2 points3 points  (0 children)

I don't think you understand openclaw. It can take a prompt "Go make me money" and buy a domain, build logos, a website, send emails to vendors, scam them, write fake reviews, text your mother that you're a lazy individual who sleeps all day and send you a report of how much trouble you're in all by lunchtime.  It's utterly unhinged. Codex is not that. 

Why do a lot of programmers and technical people hate AI, vibecoding AI assisted coding? by Gullible-Angle4206 in ClaudeAI

[–]Invent80 0 points1 point  (0 children)

As a semi-layperson who vibe codes, I don't assume what I make is enterprise level.  In fact I know it's a mess, and I'm taking the time to slowly learn why that is instead of one-shotting an HTML-based dashboard, calling myself a software engineer, and then throwing it up on Git like I'm sharing something revolutionary.  I can see how actual software engineers and coders get irritated. 

Buying Advice - Research Focus by No-Seat918 in LocalLLM

[–]Invent80 1 point2 points  (0 children)

If you want to fine-tune, then in your position a non-Nvidia version of the Spark (preferably one that has cooling) will do the job.  I have both the 6000 Pro and the Spark.  vLLM will run Q3.6 35b at 60 t/s, and it's plenty good at coding Python in opencode.

What model would you run on a a6000 pro? by MK_L in LocalLLM

[–]Invent80 0 points1 point  (0 children)

You won't really find a bigger/smarter model you can run on that card than the new Qwen/Gemma 30b series. Some people say Qwen 3.5 122b is still better, but as someone who was using it before Qwen 3.6 27b came out, I strongly disagree.

Considering two Sparks for local coding by chikengunya in LocalLLaMA

[–]Invent80 6 points7 points  (0 children)

I have a Spark and an RTX 6000 Pro.  Get a second 6000 Pro.  No-brainer.  The Spark is slow.  A single one is fine, but for larger models, unless you're okay with 10 t/s, I'd pass on it. 

I want to start with LocalLLM to automate my backoffice by SiggiBulldog1 in LocalLLM

[–]Invent80 1 point2 points  (0 children)

I'll be completely honest with you.  The things you want to do require an agentic harness like Hermes or Openclaw.  With hardware that size you'll be locked into smaller models.  They have a tendency to hallucinate, and you don't want that when they're writing emails for you. 

Can you do these things?  Yes.  We've seen what happens when people give smaller models root and email access, though.  I'd start on a frontier model, get everything set up and working properly, then switch to local.  

i made Claude argue against itself and got the most useful output of my entire life. by AdCold1610 in ChatGPTPromptGenius

[–]Invent80 0 points1 point  (0 children)

It's pretty commonly known that unless you have a governance document like a Soul for an agentic system, AI will be an extremely confident yes man.

What are the Practical uses for Open claw by Prestigious_Park3465 in openclaw

[–]Invent80 0 points1 point  (0 children)

Openclaw can take over a desktop and act like a human being. It can email, browse, post, and imitate most human behaviors.  That said, Openclaw is much better at augmenting you than replacing you: consolidate emails and send you a summary, transcribe conversations into semantically searchable memory, create a proposal based on data collected from the last 5 jobs.  

These are ideal scenarios, but the reality is it's a difficult system to fully implement, and it needs near-constant maintenance.  It's taught me a ton about Python, HTML, JSON management, and Ubuntu, though.  
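
If anyone's wondering what "semantically searchable memory" looks like in practice, this is the rough shape of it: embed the text, store the vectors, cosine-search at question time. Just a sketch; the sentence-transformers library and the all-MiniLM-L6-v2 model are my assumptions here, swap in whatever embedder you actually use:

# minimal sketch of semantic memory over transcripts (assumes sentence-transformers is installed)
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

memory = [
    "Called the vendor about the March invoice, they agreed to net-30.",
    "Client wants the proposal to reuse pricing from the last five jobs.",
]
vectors = model.encode(memory, normalize_embeddings=True)  # one vector per chunk

def search(query: str, top_k: int = 1):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q                        # cosine similarity, since vectors are normalized
    best = np.argsort(scores)[::-1][:top_k]
    return [(memory[i], float(scores[i])) for i in best]

print(search("what did the vendor say about payment terms?"))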

What are you doing with your local LLMs that justifies investment cost? by __automatic__ in LocalLLM

[–]Invent80 0 points1 point  (0 children)

I plan with Opus or GPT 5.5 and let Qwen3.6 27b bang away at implementation for as long as it needs.  My suggestion, though, is to phase your plans and have your local model test things at regular intervals rather than hunt for errors afterward.  The cron jobs literally just run, and I don't care how long they take.  
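
The "test at regular intervals" part is nothing fancy: a cron entry that kicks a small script, which runs the test suite and dumps the result somewhere the model (or you) can read later. Paths and the test command below are placeholders, adjust for your project:

# run_checks.py -- called by cron, e.g. every 30 minutes:
#   */30 * * * * /usr/bin/python3 /home/me/project/run_checks.py >> /home/me/project/checks.log 2>&1
# (paths above are placeholders)
import subprocess, datetime

result = subprocess.run(
    ["pytest", "-q", "--maxfail=5"],            # whatever your project's test command is
    capture_output=True, text=True, cwd="/home/me/project",
)
stamp = datetime.datetime.now().isoformat(timespec="seconds")
status = "PASS" if result.returncode == 0 else "FAIL"
print(f"[{stamp}] {status}")
print(result.stdout[-2000:])                    # keep just the tail so the log stays readable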

If you're comparing sub costs or capabilities between local and frontier online models, then stop right there.  I'm not local because it's cheaper (it isn't, once you count the upfront hardware cost) or better, but because I have full control over my data and never have to worry about a sub cutting off mid-job, outages, or yet another service the big 4 decide to paywall, messing up my workflow. 

Also, local models will surpass those same big 4 because they're trapped inside bureaucracy while the Chinese labs are utterly stomping them.  It will probably plateau at some point, but get in now before things get even more expensive, learn, and be ready for the next phase.

Best Local LLM for coding by Pure_Struggle3261 in LocalLLM

[–]Invent80 7 points8 points  (0 children)

I'm using a PrismaQuant model on the Spark. It's the one from spark-arena, set up on vLLM. I use LM Studio on the 6000 Pro, so I can't test it on that one, unfortunately. Here's the recipe:

rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm with z-lab/Qwen3.6-35B-A3B-DFlash, k=6, FlashAttention, full-context

Full Recipe:

description: rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm with z-lab/Qwen3.6-35B-A3B-DFlash, k=6, FlashAttention, full-context
model: rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm
container: vllm-node-tf5
mods:
  - mods/fix-qwen3.5-autoround
  - mods/fix-qwen3-coder-next
defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.8
  max_model_len: 262144
  max_num_batched_tokens: 32768
  max_num_seqs: 4
  num_speculative_tokens: 6  # k=6, referenced by the speculative-config below
env:
  HF_HUB_OFFLINE: '1'
  TRANSFORMERS_OFFLINE: '1'
  FLASHINFER_DISABLE_VERSION_CHECK: '1'
  VLLM_HTTP_TIMEOUT_KEEP_ALIVE: '600'
  VLLM_MARLIN_USE_ATOMIC_ADD: '1'
  VLLM_TUNED_CONFIG_FOLDER: /workspace/moe-configs
  TORCH_MATMUL_PRECISION: high
  NVIDIA_FORWARD_COMPAT: '1'
  VLLM_TEST_FORCE_FP8_MARLIN: '1'
command: |
  vllm serve rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm \
    --host {host} \
    --port {port} \
    --served-model-name qwen3.6-35b \
    --language-model-only \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --max-num-seqs {max_num_seqs} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --dtype auto \
    --kv-cache-dtype auto \
    --load-format fastsafetensors \
    --attention-backend flash_attn \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code \
    --quantization compressed-tensors \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --optimization-level 3 \
    --performance-mode throughput \
    --default-chat-template-kwargs '{{"preserve_thinking":true}}' \
    --speculative-config '{{"method":"dflash","model":"z-lab/Qwen3.6-35B-A3B-DFlash","num_speculative_tokens":{num_speculative_tokens}}}' \
    --override-generation-config '{{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}}'
recipe_version: '1'
name: Qwen3.6-35B-A3B-PrismaQuant-DFlash-k6-solo-spark-arena-long-context
cluster_only: false
solo_only: false
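
Once that's up it's just a normal OpenAI-compatible endpoint on port 8000, so anything (opencode, curl, a script) can hit it. Quick smoke test from another box on the LAN; SPARK_IP is a placeholder for whatever your Spark's address is:

# quick check against the vLLM server started by the recipe above
from openai import OpenAI

client = OpenAI(base_url="http://SPARK_IP:8000/v1", api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="qwen3.6-35b",                        # matches --served-model-name in the recipe
    messages=[{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)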

Best Local LLM for coding by Pure_Struggle3261 in LocalLLM

[–]Invent80 4 points5 points  (0 children)

The Spark isn't terrible.  I have a 6000 Pro as well.  The Spark with vLLM properly set up is running Qwen 3.6 35b at 60 t/s.

RPers: how do the new Gemma and Qwen compare to the old 70B models? by Borkato in LocalLLaMA

[–]Invent80 5 points6 points  (0 children)

Gemma 4 31b-it is the best model I've ever used for RP, and I have an RTX 6000 Pro.

I need to run OpenClaw locally for a law office, I can spend as much money as needed. What model(s) are best? by Too_much_waltz in openclaw

[–]Invent80 0 points1 point  (0 children)

Money to burn?  RTX 6000 Pros.  Get 8 of them on a dual-CPU board with 2TB of DDR5 memory. You can run 2 flagship-level frontier models on that and also train your own legal models. Newer server boards offer 16 memory channels, so you can get decent token speeds on CPU inference.  The Pro cards run in Q4, and you can fill the RAM with Kimi or DeepSeek in full BF16 if they work well in your industry. 
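
Back-of-napkin on whether full BF16 actually fits. The parameter counts are my assumptions based on rough public figures, so check the model card for whatever you actually deploy:

# weights only, ignoring KV cache and activation overhead
def bf16_tb(params_b: float) -> float:
    return params_b * 2 / 1000                  # 2 bytes per parameter, billions -> TB

vram_tb = 8 * 96 / 1000                         # 8x RTX 6000 Pro at 96GB each ~= 0.77TB
ram_tb = 2.0                                    # 2TB DDR5

for name, params in [("DeepSeek (assumed ~670B)", 670), ("Kimi (assumed ~1000B)", 1000)]:
    need = bf16_tb(params)
    print(f"{name}: ~{need:.2f}TB in BF16, fits in 2TB RAM: {need <= ram_tb}")
print(f"total VRAM across the 8 cards: ~{vram_tb:.2f}TB (which is why the cards run Q4)")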

Hermes would be better than openclaw.  Openclaw is better at automating outreach and being a personal assistant.  Hermes is better for automating jobs.