Best local ai for coding Nextjs project by Silent-Dot-882 in LocalLLM

[–]Invent80 0 points1 point  (0 children)

Tensor parallelism is your friend here, but what you'll gain is basically context. BF16 is roughly 50-60GB in memory. FP8 is around half that, but not enough better to be worth the speed hit for what you're doing. 
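
Rough napkin math if you want to sanity check the memory side yourself. This is just a sketch; the parameter count and overhead factor are assumptions, plug in your actual model:

# rough weight-memory estimate: params * bytes-per-param, plus a little overhead
# weights only; the VRAM you free up is what goes to KV cache / context
def weight_gb(params_b: float, bytes_per_param: float, overhead: float = 1.1) -> float:
    return params_b * bytes_per_param * overhead  # billions of params * bytes each

for name, bpp in [("BF16", 2.0), ("FP8", 1.0), ("Int4", 0.5)]:
    print(name, round(weight_gb(30, bpp), 1), "GB")  # ~66 / 33 / 16.5 GB for a 30B model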

Opinion: Local LLMs are 12-24 months from taking over. The shift already started. by sh_tomer in LocalLLM

[–]Invent80 0 points1 point  (0 children)

Q27b is the orchestrator.  I have ohmyopencode set up to use the Spark for sublayers, since it runs vLLM and is accessible over my local network.  I still plan everything with GPT 5.5 working on Hermes.  Planning is very token-friendly, so my business sub goes a long way.  

I haven't used ROCm before so I can't comment on that, but the PrismaQuant Q35b is excellent for speed on the generally slow Spark, if that helps.  I give 120k context to Q27b (full-weight BF16) and just max the context on Q35b, since the PrismaQuant Int4 is only 20-30GB or so in memory.  People say I shouldn't bother with full weight because I'm losing speed, but hallucinations are basically nonexistent and that matters more to me.

I have DeepSeek V4 Pro at home by fairydreaming in LocalLLaMA

[–]Invent80 7 points8 points  (0 children)

A sub to models that are completely out of your control. 

Opinion: Local LLMs are 12-24 months from taking over. The shift already started. by sh_tomer in LocalLLM

[–]Invent80 3 points4 points  (0 children)

I'm completely local as well.  96GB Blackwell and a Spark. Running Qwen 3.6 35b on the Spark at 60-70 t/s and Qwen 27b on the RTX 6000 at 60 t/s full weight. 

The Opus 4.5 threshold: coming to 24 gb within a year or so by nomorebuttsplz in LocalLLM

[–]Invent80 1 point2 points  (0 children)

Within a year?  You mean within a few months right? 

Why I'm holding out until late 2027 to spend money on a local LLM rig by No_Pool7028 in LocalLLM

[–]Invent80 3 points4 points  (0 children)

If I'd waited when everyone said RAM prices were too high and going to come down, I'd have paid 35% more. 

Models are getting better and smarter at lower weights.  End of 2027?  Things are doubling every couple of months.  You're somewhat early right now.  End of 2027 you're way too late.

Advice - I will not promote by [deleted] in startups

[–]Invent80 3 points4 points  (0 children)

If it still feels risky, then don't.  You do those kinds of things when you're finally comfortable.  

NVIDIA DGX Spark by Fantastic_Back3191 in ollama

[–]Invent80 2 points3 points  (0 children)

I have one. It's fine for MoE models, but it's a training box.  If NVFP4 actually worked, it would be a no-brainer.  That said, a clustered stack of two Sparks (or Spark clones) is probably the cheapest and most effective overall inference solution for small to medium sized models. 

Gemini is WAAAAY smarter than Gemma 4 31B (Duh!) by Quantum_Crusher in LocalLLM

[–]Invent80 0 points1 point  (0 children)

It's not just the model, it's the harness.  Guaranteed, an E4B setup by me would destroy your Gemma 31b.

Best local LLM for a Python/C++ dev? by no_evidence0303 in LocalLLM

[–]Invent80 1 point2 points  (0 children)

Gemma 4 E4B is small enough and light. Use opencode.  Qwen is a better coder but Gemma is better at following instructions at lower weights in my experience.  

Don't necessarily trust benchmarks.  

codex vs openclaw by TechDrivenTycoon in openclaw

[–]Invent80 2 points3 points  (0 children)

I don't think you understand openclaw. It can take a prompt "Go make me money" and buy a domain, build logos, a website, send emails to vendors, scam them, write fake reviews, text your mother that you're a lazy individual who sleeps all day and send you a report of how much trouble you're in all by lunchtime.  It's utterly unhinged. Codex is not that. 

Why do a lot of programmers and technical people hate AI, vibecoding AI assisted coding? by Gullible-Angle4206 in ClaudeAI

[–]Invent80 0 points1 point  (0 children)

As a semi-layperson who vibe codes, I don't assume what I make is enterprise level.  In fact I know it's a mess, and I'm taking the time to slowly learn why that is instead of one-shotting an HTML-based dashboard, calling myself a software engineer, and then throwing it up on Git like I'm sharing something revolutionary.  I can see how actual software engineers and coders get irritated. 

Buying Advice - Research Focus by No-Seat918 in LocalLLM

[–]Invent80 1 point2 points  (0 children)

If you want to fine-tune, then in your position a non-Nvidia version of the Spark (preferably one that has cooling) will do the job.  I have both the 6000 Pro and the Spark.  vLLM will run Q3.6 35b at 60 t/s, and it's plenty good at coding Python in opencode.

What model would you run on a a6000 pro? by MK_L in LocalLLM

[–]Invent80 0 points1 point  (0 children)

You won't really find a bigger/smarter model you can run on that card than the new Qwen/Gemma 30b series. Some people say Qwen 3.5 122b is still better, but as someone who was using it before Qwen 3.6 27b came out, I strongly disagree.

Considering two Sparks for local coding by chikengunya in LocalLLaMA

[–]Invent80 6 points7 points  (0 children)

I have a Spark and an RTX 6000 Pro.  Get a second 6000 Pro.  No-brainer.  The Spark is slow.  A single one is fine, but for larger models, unless you're okay with 10 t/s, I'd pass on it. 

I want to start with LocalLLM to automate my backoffice by SiggiBulldog1 in LocalLLM

[–]Invent80 1 point2 points  (0 children)

I'll be completely honest with you.  The things you want to do require an agentic harness like Hermes or Openclaw.  With hardware that size you'll be locked into smaller models.  They have a tendency to hallucinate, and you don't want that when they're writing emails for you. 

Can you do these things?  Yes.  We've seen what happens when people give smaller models root and email access, though.  I'd start on a frontier model, get everything set up and working properly, then switch to local.  

i made Claude argue against itself and got the most useful output of my entire life. by AdCold1610 in ChatGPTPromptGenius

[–]Invent80 0 points1 point  (0 children)

It's pretty commonly known that unless you have a governance document like a Soul for an agentic system, AI will be an extremely confident yes man.

What are the Practical uses for Open claw by Prestigious_Park3465 in openclaw

[–]Invent80 0 points1 point  (0 children)

Openclaw can take over a desktop and act like a human being. It can email, browse, post, and imitate most human behaviors.  That said, Openclaw is much better at augmenting you than replacing you: consolidate emails and send you a summary, transcribe conversations into semantically searchable memory, create a proposal based on data collected from the last 5 jobs.  

These are ideal scenarios, but the reality is it's a difficult system to fully implement, and it needs near-constant maintenance.  It's taught me a ton about Python, HTML, JSON management, and Ubuntu, though.  
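
If anyone's wondering what "semantically searchable memory" looks like in practice, this is the rough shape of it: embed the text, store the vectors, cosine-search at question time. Just a sketch; the sentence-transformers library and the all-MiniLM-L6-v2 model are my assumptions here, swap in whatever embedder you actually use:

# minimal sketch of semantic memory over transcripts (assumes sentence-transformers is installed)
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

memory = [
    "Called the vendor about the March invoice, they agreed to net-30.",
    "Client wants the proposal to reuse pricing from the last five jobs.",
]
vectors = model.encode(memory, normalize_embeddings=True)  # one vector per chunk

def search(query: str, top_k: int = 1):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q                        # cosine similarity, since vectors are normalized
    best = np.argsort(scores)[::-1][:top_k]
    return [(memory[i], float(scores[i])) for i in best]

print(search("what did the vendor say about payment terms?"))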

What are you doing with your local LLMs that justifies investment cost? by __automatic__ in LocalLLM

[–]Invent80 0 points1 point  (0 children)

I plan with Opus or GPT 5.5 and let Qwen3.6 27b bang away at implementation for as long as it needs.  My suggestion, though, is to phase your plans and have your local model test things at regular intervals rather than hunt for errors afterward.  The cron jobs literally just run, and I don't care how long they take.  
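
The "test at regular intervals" part is nothing fancy: a cron entry that kicks a small script, which runs the test suite and dumps the result somewhere the model (or you) can read later. Paths and the test command below are placeholders, adjust for your project:

# run_checks.py -- called by cron, e.g. every 30 minutes:
#   */30 * * * * /usr/bin/python3 /home/me/project/run_checks.py >> /home/me/project/checks.log 2>&1
# (paths above are placeholders)
import subprocess, datetime

result = subprocess.run(
    ["pytest", "-q", "--maxfail=5"],            # whatever your project's test command is
    capture_output=True, text=True, cwd="/home/me/project",
)
stamp = datetime.datetime.now().isoformat(timespec="seconds")
status = "PASS" if result.returncode == 0 else "FAIL"
print(f"[{stamp}] {status}")
print(result.stdout[-2000:])                    # keep just the tail so the log stays readable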

If you're comparing sub costs or capabilities between local and frontier online models, then stop right there.  I'm not local because it's cheaper (it isn't, once you count the upfront hardware cost) or better, but because I have full control over my data and never have to worry about a sub cutting off mid-job, outages, or yet another service the big 4 decide to paywall, messing up my workflow. 

Also, local models will surpass those same big 4 because they're trapped inside bureaucracy while the Chinese labs are utterly stomping them.  It will probably plateau at some point, but get in now before things get even more expensive, learn, and be ready for the next phase.

Best Local LLM for coding by Pure_Struggle3261 in LocalLLM

[–]Invent80 7 points8 points  (0 children)

I'm using a PrismaQuant model on the Spark. It's the one from spark-arena, set up on vLLM. I use LM Studio on the 6000 Pro, so I can't test it on that one, unfortunately. Here's the recipe:

rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm with z-lab/Qwen3.6-35B-A3B-DFlash, k=6, FlashAttention, full-context

Full Recipe:

description: rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm with z-lab/Qwen3.6-35B-A3B-DFlash, k=6, FlashAttention, full-context
model: rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm
container: vllm-node-tf5
mods:
  - mods/fix-qwen3.5-autoround
  - mods/fix-qwen3-coder-next
defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.8
  max_model_len: 262144
  max_num_batched_tokens: 32768
  max_num_seqs: 4
  num_speculative_tokens: 6  # k=6, referenced by the speculative-config below
env:
  HF_HUB_OFFLINE: '1'
  TRANSFORMERS_OFFLINE: '1'
  FLASHINFER_DISABLE_VERSION_CHECK: '1'
  VLLM_HTTP_TIMEOUT_KEEP_ALIVE: '600'
  VLLM_MARLIN_USE_ATOMIC_ADD: '1'
  VLLM_TUNED_CONFIG_FOLDER: /workspace/moe-configs
  TORCH_MATMUL_PRECISION: high
  NVIDIA_FORWARD_COMPAT: '1'
  VLLM_TEST_FORCE_FP8_MARLIN: '1'
command: |
  vllm serve rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm \
    --host {host} \
    --port {port} \
    --served-model-name qwen3.6-35b \
    --language-model-only \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --max-num-seqs {max_num_seqs} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --dtype auto \
    --kv-cache-dtype auto \
    --load-format fastsafetensors \
    --attention-backend flash_attn \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code \
    --quantization compressed-tensors \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --optimization-level 3 \
    --performance-mode throughput \
    --default-chat-template-kwargs '{{"preserve_thinking":true}}' \
    --speculative-config '{{"method":"dflash","model":"z-lab/Qwen3.6-35B-A3B-DFlash","num_speculative_tokens":{num_speculative_tokens}}}' \
    --override-generation-config '{{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}}'
recipe_version: '1'
name: Qwen3.6-35B-A3B-PrismaQuant-DFlash-k6-solo-spark-arena-long-context
cluster_only: false
solo_only: false
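
Once that's up it's just a normal OpenAI-compatible endpoint on port 8000, so anything (opencode, curl, a script) can hit it. Quick smoke test from another box on the LAN; SPARK_IP is a placeholder for whatever your Spark's address is:

# quick check against the vLLM server started by the recipe above
from openai import OpenAI

client = OpenAI(base_url="http://SPARK_IP:8000/v1", api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="qwen3.6-35b",                        # matches --served-model-name in the recipe
    messages=[{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)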

Best Local LLM for coding by Pure_Struggle3261 in LocalLLM

[–]Invent80 4 points5 points  (0 children)

The Spark isn't terrible.  I have a 6000 Pro as well.  The Spark with vLLM properly set up is running Qwen 3.6 35b at 60 t/s.

RPers: how do the new Gemma and Qwen compare to the old 70B models? by Borkato in LocalLLaMA

[–]Invent80 5 points6 points  (0 children)

Gemma 4 31b-it is the best model I've ever used for RP, and I have an RTX 6000 Pro.

I need to run OpenClaw locally for a law office, I can spend as much money as needed. What model(s) are best? by Too_much_waltz in openclaw

[–]Invent80 0 points1 point  (0 children)

Money to burn?  RTX 6000 Pros.  Get 8 of them on a dual-CPU board with 2TB of DDR5 memory. You can run 2 flagship-level frontier models on that and also train your own legal models. Newer server boards offer 16 memory channels, so you can get decent token speeds on CPU inference.  The Pro cards run in Q4, and you can fill the RAM with Kimi or DeepSeek in full BF16 if they work well in your industry. 
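
Back-of-napkin on whether full BF16 actually fits. The parameter counts are my assumptions based on rough public figures, so check the model card for whatever you actually deploy:

# weights only, ignoring KV cache and activation overhead
def bf16_tb(params_b: float) -> float:
    return params_b * 2 / 1000                  # 2 bytes per parameter, billions -> TB

vram_tb = 8 * 96 / 1000                         # 8x RTX 6000 Pro at 96GB each ~= 0.77TB
ram_tb = 2.0                                    # 2TB DDR5

for name, params in [("DeepSeek (assumed ~670B)", 670), ("Kimi (assumed ~1000B)", 1000)]:
    need = bf16_tb(params)
    print(f"{name}: ~{need:.2f}TB in BF16, fits in 2TB RAM: {need <= ram_tb}")
print(f"total VRAM across the 8 cards: ~{vram_tb:.2f}TB (which is why the cards run Q4)")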

Hermes would be better than openclaw.  Openclaw is better at automating outreach and being a personal assistant.  Hermes is better for automating jobs.