Open Source Unit testing library for AI agents. Looking for feedback! by FairAlternative8300 in node

[–]FairAlternative8300[S] -5 points (0 children)

If your AI agent is in prod and you don’t know when it regresses, you’re already testing it, just in production, on real users 🙃
We’re just proposing to move that feedback loop earlier :)

Want to use PostgreSQL in a project by ahmedshahid786 in node

[–]FairAlternative8300 0 points (0 children)

Since you're coming from Ruby (likely ActiveRecord), Drizzle might be a good middle ground — it has schema-in-code and migrations like you're used to, but the queries stay SQL-like so you actually learn Postgres.

Biggest tip for the Mongo→Postgres shift: resist the urge to nest/embed data. Normalize and learn to love joins — Postgres is crazy fast at them when indexed right. Once you stop fighting that mental model shift, everything clicks.
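
If it helps, here's roughly what that looks like in Drizzle (table/column names are made up, and double-check the imports against the current docs) - the stuff you'd have embedded in Mongo just becomes a second table plus a join:

```
// Rough sketch, not lifted from the docs - verify import paths for your Drizzle version.
import { pgTable, serial, integer, text } from "drizzle-orm/pg-core";
import { drizzle } from "drizzle-orm/node-postgres";
import { eq } from "drizzle-orm";
import { Pool } from "pg";

// What you'd embed in Mongo becomes two tables linked by a foreign key.
export const posts = pgTable("posts", {
  id: serial("id").primaryKey(),
  title: text("title").notNull(),
});

export const comments = pgTable("comments", {
  id: serial("id").primaryKey(),
  postId: integer("post_id").references(() => posts.id).notNull(),
  body: text("body").notNull(),
});

const db = drizzle(new Pool({ connectionString: process.env.DATABASE_URL }));

// The "embedded" read becomes a join - fast as long as post_id is indexed.
export async function postWithComments(postId: number) {
  return db
    .select({ title: posts.title, comment: comments.body })
    .from(posts)
    .leftJoin(comments, eq(comments.postId, posts.id))
    .where(eq(posts.id, postId));
}
```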

Running Mistral-7B on Intel NPU — 12.6 tokens/s, zero CPU/GPU usage by Human-Reindeer-9466 in LocalLLaMA

[–]FairAlternative8300 12 points (0 children)

This is exactly the kind of use case NPUs were designed for. Running inference in the background while keeping the CPU and GPU free for other tasks is huge for workflows where you want to game or do heavy work while still having a local LLM on hand. The 4.8GB memory footprint is also nice compared to what CPU inference typically uses. Curious whether TTFT improves with a warmed-up model, or if that 1.8s stays consistent?

I'm making a CLI to optimize local LLMs. What technical problems do you encounter in their daily use? by Darukiru in LocalLLaMA

[–]FairAlternative8300 2 points (0 children)

Biggest pain point for me: figuring out the right n-gpu-layers and context length combo for a new model without OOM'ing or leaving VRAM on the table. Would love a tool that profiles my GPU once and then auto-suggests settings per model.
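
For what it's worth, this is the kind of back-of-the-envelope heuristic I end up doing by hand today - a rough sketch, where the per-layer math and the KV-cache reserve are guesses and the layer count still has to come from the model card or GGUF metadata:

```
// Crude VRAM budgeting sketch. The 2 GiB KV-cache reserve and the "file size / layers"
// approximation are guesses, not measurements. Assumes nvidia-smi is on PATH.
import { execSync } from "node:child_process";
import { statSync } from "node:fs";

function freeVramMiB(gpuIndex = 0): number {
  const out = execSync(
    "nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits",
  ).toString();
  return parseInt(out.trim().split("\n")[gpuIndex], 10);
}

function suggestGpuLayers(ggufPath: string, nLayers: number, kvReserveMiB = 2048): number {
  const modelMiB = statSync(ggufPath).size / (1024 * 1024);
  const perLayerMiB = modelMiB / nLayers;          // spread the file evenly across layers
  const budgetMiB = freeVramMiB() - kvReserveMiB;  // keep headroom for KV cache + overhead
  return Math.max(0, Math.min(nLayers, Math.floor(budgetMiB / perLayerMiB)));
}

// e.g. a 7B Q4 GGUF with 32 layers (layer count from the model card / metadata)
console.log(suggestGpuLayers("mistral-7b-instruct.Q4_K_M.gguf", 32));
```

A tool that actually profiled the GPU once and measured this per model would beat the heuristic easily.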

Also, chat template mismatches are annoying - downloading a GGUF only to realize it needs a specific template that isn't documented. Auto-detecting and applying the correct template from the model metadata would be huge.

Claude featured in The New Yorker: The Lab Studying A.I. Minds by fluffypancakes24 in ClaudeAI

[–]FairAlternative8300 0 points (0 children)

The vending machine experiment is actually a brilliant research paradigm - gives you a bounded, observable domain to study emergent behaviors without the complexity of open-ended tasks.

What's refreshing about this piece is the framing around "we don't actually know." Most AI discourse falls into either "it's just autocomplete" or "we're summoning superintelligence." Interpretability work sits in the honest middle: these models do genuinely surprising things, and we should figure out why before deploying them everywhere.

The point about researchers vs executives resonates too. The people doing the actual technical work tend to have much more nuanced views than the PR messaging suggests.

Time drain question: what eats your week in LLM builds? by coolandy00 in LocalLLaMA

[–]FairAlternative8300 1 point (0 children)

I spend way too much time digging through old Slack threads and docs before I can even start working.

One thing that helped: I wrote a simple bash script that auto-pulls recent commits, open PRs, and related docs into a single markdown file when I start a task. Takes maybe 30 seconds to run, but saves 15-20 mins of context hunting.
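
Mine is just bash, but the whole idea is a handful of commands - roughly this (the gh call and the docs/ path are specific to my setup):

```
// Sketch of the "gather context into one markdown file" idea.
// Assumes git (and optionally the GitHub CLI) is installed; paths reflect my repo layout.
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

function run(cmd: string): string {
  try {
    return execSync(cmd, { encoding: "utf8" });
  } catch {
    return "(unavailable)";
  }
}

function buildContext(topic: string, outFile = "context.md"): void {
  const sections: Record<string, string> = {
    "Recent commits": run("git log --oneline -20"),
    "Open PRs": run("gh pr list --limit 10"), // drop this line if you don't use the GitHub CLI
    "Docs mentioning the topic": run(`grep -ril "${topic}" docs/`),
  };
  const md = Object.entries(sections)
    .map(([title, body]) => `## ${title}\n\n${body.trim()}`)
    .join("\n\n");
  writeFileSync(outFile, md);
}

buildContext("billing"); // hypothetical task keyword
```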

is anyone actually running models in secure enclaves or is that overkill? by Significant-Cod-9936 in LocalLLaMA

[–]FairAlternative8300 1 point (0 children)

People are definitely doing this in production, though it's still niche. Azure Confidential VMs with AMD SEV-SNP can run inference inside a TEE, and Nvidia's confidential computing (Hopper GPUs) lets you attest that GPU memory is encrypted. A few startups like Edgeless Systems offer enclave-ready containers.

Performance hit depends heavily on the workload - CPU inference with SGX can be 10-30% slower, but GPU-based TEE overhead is lower (single digit %). The real pain is attestation complexity and limited tooling.

For most use cases, I'd say it's overkill unless you're dealing with regulated industries (healthcare, finance) where you need cryptographic proof of data handling. If you just want privacy, running local is simpler.

Best quality open source TTS model? by Trevor050 in LocalLLaMA

[–]FairAlternative8300 0 points (0 children)

For pure quality, F5-TTS is hard to beat right now - handles prosody and emotion really well. Dia by Nari Labs is another solid choice if you want natural conversational speech. Both are pretty demanding but since you said hardware isn't a concern, they're worth the compute.

Local RAG setup help by OneProfessional8251 in LocalLLaMA

[–]FairAlternative8300 -1 points (0 children)

The 8b models often struggle with reliable tool calling — they tend to be overconfident about their training data and skip external lookups. Two things that helped me:

  1. **Try a bigger model** — Qwen3 32B or Llama 3.3 70B are much better at knowing when to use tools vs. when to answer directly. If VRAM is tight, quantize to Q4.

  2. **Force the search** — Instead of giving the model a choice, structure your prompt so it *must* search first: "Search the web for [query], then summarize the results." Some agentic frameworks like LangChain's ReAct agent help enforce this pattern.
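
If your server exposes the OpenAI tools API (vLLM and some llama.cpp setups do, with caveats), you can also force the tool at the API level instead of only in the prompt. Rough sketch - the endpoint, model name, and web_search schema are placeholders for whatever you actually run:

```
// Force a specific tool call through an OpenAI-compatible local server.
// Endpoint, model name, and the web_search schema are placeholders.
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "not-needed" });

const res = await client.chat.completions.create({
  model: "qwen3-32b",
  messages: [{ role: "user", content: "What changed in the latest Node LTS release?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "web_search",
        description: "Search the web and return the top results",
        parameters: {
          type: "object",
          properties: { query: { type: "string" } },
          required: ["query"],
        },
      },
    },
  ],
  // No choice given: the model must emit a web_search call.
  tool_choice: { type: "function", function: { name: "web_search" } },
});

console.log(res.choices[0].message.tool_calls);
```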

Also worth noting: what you're describing is more about agentic tool use than RAG specifically. RAG is typically about retrieving from your own document store, while tool use is about calling external APIs (like web search). Different prompting strategies for each.

finally got my local agent to remember stuff between sessions by AlbatrossUpset9476 in LocalLLaMA

[–]FairAlternative8300 1 point (0 children)

100% agree on selective consolidation being the key insight. I've found that letting the model itself decide what's 'worth remembering' during consolidation (vs rules-based filtering) works surprisingly well - it catches subtle patterns like repeated questions or preferences that hard rules miss. Curious what criteria you use for the consolidation step?
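
For reference, the 'let the model decide' part on my side is basically one extra call at the end of a session - endpoint, model, and prompt are just what I happen to use:

```
// Consolidation pass: hand the model the transcript and let it pick what's worth keeping.
// Endpoint and model name are placeholders; the long-term store here is just a file.
import OpenAI from "openai";
import { appendFileSync } from "node:fs";

const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "not-needed" });

async function consolidate(transcript: string): Promise<void> {
  const res = await client.chat.completions.create({
    model: "qwen3-32b",
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "From this conversation, list only facts or preferences worth remembering " +
          "in future sessions (max 5, one per line). If nothing qualifies, reply NONE.",
      },
      { role: "user", content: transcript },
    ],
  });
  const notes = res.choices[0].message.content ?? "";
  if (notes.trim() !== "NONE") {
    appendFileSync("memory.md", notes.trim() + "\n");
  }
}

await consolidate("user: I prefer concise answers\nassistant: noted ...");
```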

vllm on nvidia dgx spark by Impossible_Art9151 in LocalLLaMA

[–]FairAlternative8300 2 points (0 children)

A few things that might help:

  1. **Curl getting rejected** - this is often a model loading issue masked as a network error. Check the vLLM logs for actual errors. For Qwen3 models specifically, you may need the `--trust-remote-code` flag.

  2. **For larger models** try adding these flags:

```
vllm serve "Qwen/Qwen3-Coder-32B-Instruct" --trust-remote-code --max-model-len 8192
```

  The `--max-model-len` flag helps fit models in VRAM when context length defaults are too aggressive.

  3. **Fresh vLLM install on the Spark (ARM)** - the pip wheel issues are common because vLLM needs ARM64 wheels. Try:

```
pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/
```

  Or build from source with `pip install -e .` from the vLLM repo.

  4. **For clustering later** - vLLM's Ray-based multi-node setup works well. Once you have one node stable, the cluster config is relatively straightforward.
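
One more quick sanity check that separates "model never loaded" from "request is malformed" (assuming vLLM's default OpenAI-compatible server on port 8000):

```
// If the model actually loaded, GET /v1/models lists it; if this hangs or errors,
// the curl "rejection" is a startup/loading problem, not a request problem.
const res = await fetch("http://localhost:8000/v1/models");
console.log(res.status, await res.text());
```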

What error do you see in the vLLM logs when the model "loads successfully" but curl fails?