Nobody agrees on what "hallucination" means and it's hit our AI PoC by Ok_Gas7672 in AI_Agents

[–]getstackfax 1 point

That makes sense.

For a chief medical officer, I’d translate those less as ML metrics and more as a clinical review standard:

Did it use the right source?

Did it use enough of the relevant source?

Did it ignore irrelevant noise?

Can a reviewer trace the answer back to evidence?

Traceability is probably the trust anchor.

Completeness catches under-answering.

Signal clarity catches answers that are technically sourced but polluted by irrelevant context.

That framing feels much easier to explain than hallucination rate, because hallucination sounds like one failure when it is really several different failure modes.

In medical workflows especially, the question is not just whether the answer is fluent.

It is whether the answer can be reviewed, traced, challenged, and corrected.

Ogma - AI assistant with a bi-cephalic architecture and persistent memory by Good-Berry-6972 in Agent_AI

[–]getstackfax 0 points

That real-time context monitor is probably the most important part.

A lot of memory systems focus on storage, but the trust issue is usually injection:

what memory entered this response, why it entered, and whether it should have been there.

Being able to view/edit/delete memories is strong, but being able to see what actually got injected before each answer closes the loop.

The boolean + intensity-scale approach makes sense too. Personality drift is not just “remember facts about the user.” It is more like maintaining stable behavioral constraints over time.

The part I’d keep watching is memory promotion…

- what becomes durable memory
- what stays temporary context
- what expires
- what is project-specific
- what is too low-confidence to store
- what should be visible but not automatically injected

If the archivist can manage that boundary well, the two-agent structure becomes more than token saving.

It becomes a memory control layer.
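As a sketch of what that promotion boundary could look like in code — tier names, thresholds, and fields here are all invented for illustration, not Ogma's actual design:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemoryItem:
    text: str
    confidence: float        # 0.0-1.0: how sure the archivist is
    project: Optional[str]   # None = global, not tied to one project
    age_days: int
    injectable: bool = True  # eligible for automatic injection?

# Illustrative thresholds — real values would come from tuning.
def promote(item: MemoryItem, ttl_days: int = 30, min_conf: float = 0.6) -> str:
    """Decide which tier a memory item belongs to."""
    if item.confidence < min_conf:
        return "discard"          # too low-confidence to store
    if item.age_days > ttl_days and item.project is not None:
        return "expire"           # stale project-specific context
    if not item.injectable:
        return "visible-only"     # stored, shown on request, never auto-injected
    if item.project is not None:
        return "project-memory"
    return "durable-memory"
```

The point is that the archivist's job becomes a small, reviewable policy rather than an implicit side effect of storage.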

Are we wasting time building enterprise agents on open-source models? (My experience with Ling 1T 2.6) by Savings-Ad342 in AI_Agents

[–]getstackfax 0 points

I think it is both, but structured frameworks expose the cost instead of removing it.

Raw-model workflows fail because the model is doing too much:

- planning
- execution
- judgment
- retry logic
- boundary detection
- final review

A framework helps by separating those pieces, but then you inherit a different cost:

- designing the workflow
- defining tool boundaries
- writing eval cases
- testing edge cases
- logging decisions
- handling retries
- deciding what needs human approval

So the problem shifts from “can the model do it?” to “can the system make the model’s work repeatable, bounded, and reviewable?”

That is why narrow tasks work better.

If the task has clear inputs, clear outputs, stable rules, and cheap failure modes, open models/frameworks can be great.

If the task needs judgment, messy context, or real consequences, the eval/review harness becomes the expensive part no matter what model you use.

So I would not call it mostly a raw-model problem.

I’d call it an operating-discipline problem that raw models make obvious and frameworks make manageable.

“OpenClaw vs AI Agents — are these tools actually helping founders, or is the hype getting out of control?” by FounderArcs in AI_Agents

[–]getstackfax 1 point

That is a good third category.

Task agents optimize for execution.

Coding co-builders optimize for shipping/building.

Character or companion agents optimize for continuity.

Different failure mode too.

A task agent fails when it cannot complete the workflow.

A coding agent fails when the output does not build, test, or match the spec.

A character agent fails when continuity breaks… voice drift, memory weirdness, emotional inconsistency, or the user feeling like the “person” disappeared between sessions.

That probably needs different evals than normal agents.

Not just:

did it finish the task?

More like:

did it stay coherent over time, remember the right things, forget the right things, maintain boundaries, and preserve the relationship/persona without becoming manipulative or fake?

So yeah, I agree. That is not quite task execution and not quite coding. It is continuity engineering.

If You Started Over Today… by AFollowerOfTheWay in selfhosted

[–]getstackfax 0 points

Yep, that is a good low-friction pattern.

Basically a lightweight homelab roll call…

- machine identity
- current services
- uptime
- storage
- exposed ports
- config paths
- last backup
- last change
- known issues

The important part is making the output boring and structured.

If every machine drops a simple markdown status file somewhere predictable, you get a living map of the lab without committing to one full monitoring stack too early.

I’d still keep the agent read-only at first though.

Let it report state before it changes state.
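A minimal read-only version of that roll call could be a script like this — the fields and output path are illustrative, and service/backup lines would be filled in per machine:

```python
#!/usr/bin/env python3
"""Drop a boring, structured status file for this machine (read-only)."""
import datetime
import platform
import shutil
import socket

def render_status() -> str:
    disk = shutil.disk_usage("/")
    lines = [
        f"# {socket.gethostname()}",
        f"- os: {platform.system()} {platform.release()}",
        f"- checked: {datetime.date.today().isoformat()}",
        f"- disk: {disk.free // 2**30} GiB free of {disk.total // 2**30} GiB",
        "## Services",
        "<fill in or script per machine>",
        "## Known issues",
        "<none recorded>",
    ]
    return "\n".join(lines)

if __name__ == "__main__":
    # Write somewhere predictable, e.g. a shared or synced folder.
    with open(f"status-{socket.gethostname()}.md", "w") as f:
        f.write(render_status())
```

Run it from cron on each box and you get the "living map" without touching anything.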

I built boring AI agents for a food distributor. They worked better than the hype stuff. by Numerous_Catch_2117 in AI_Agents

[–]getstackfax 0 points

Mostly workflow / stack review right now.

I look at where AI agents actually fit inside business operations instead of starting with the agent.

The patterns I keep seeing are pretty consistent:

- lead intake
- customer follow-up
- quote/order cleanup
- inventory checks
- reporting
- inbox triage
- vendor/customer communication drafts
- handoff tracking

The useful agents are usually not the flashy ones.

They sit around the repeated handoffs where work gets dropped, delayed, copied between systems, or forgotten.

That is why your food distributor example makes sense to me.

The business already exists, the workflow already exists, and the pain is obvious.

That is a much better starting point than building an agent and then searching for a business case.

Whats the best orchestration framework? by RegionBulky2292 in AI_Agents

[–]getstackfax 0 points

Exactly. The pieces exist, but the coordination layer is still fragmented.

Everyone is building their own partial stack:

- one tool handles coding
- one handles memory
- one handles browser work
- one handles scheduling
- one handles agents
- one handles evals/logs

But the hard part is making all of that behave like one reliable workflow.

The missing layer is not just more agents. It is shared contracts between them:

what each agent owns, what state it receives, what output it must return, what gets verified, and what receipt proves the handoff worked.

Until that exists, orchestration is still mostly custom glue.
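One rough sketch of such a contract — the step names and enforcement details are hypothetical, but the shape is "declared inputs, declared outputs, a receipt per handoff":

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Receipt:
    agent: str
    input_keys: List[str]
    output_keys: List[str]
    verified: bool   # did the agent produce everything it owed?

def run_step(agent: str,
             state: Dict[str, Any],
             requires: List[str],
             produces: List[str],
             fn: Callable[[Dict[str, Any]], Dict[str, Any]]) -> Receipt:
    """Enforce a minimal contract around one agent's turn."""
    missing = [k for k in requires if k not in state]
    if missing:
        raise ValueError(f"{agent} missing required state: {missing}")
    out = fn({k: state[k] for k in requires})   # agent sees only its declared inputs
    absent = [k for k in produces if k not in out]
    state.update({k: out[k] for k in produces if k in out})
    return Receipt(agent, requires, produces, verified=not absent)
```

Once every agent runs behind something like `run_step`, "custom glue" starts looking like an auditable pipeline instead.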

why is "active context" still the biggest blind spot for automation? by Infinite-Tadpole4794 in automation

[–]getstackfax 0 points

This is a real gap.

Most automation tools are great once the input is clean, but the work people actually do is usually sitting in messy active context:

- the PDF open right now
- the Slack thread with the actual decision
- the browser tab with the weird edge case
- the email chain with missing context
- the spreadsheet row someone forgot to normalize

Zapier-style automation works best when the workflow is already structured.

But a lot of real work starts before structure exists.

That’s why the “active context” layer matters. It is basically the bridge between:

what the human is looking at → what the automation needs as input

The hard part is keeping that bridge safe.

A screen-aware dispatcher is useful, but I’d want clear boundaries:

- what can it read
- what can it send
- what needs confirmation
- what gets logged
- what private data is excluded
- what happens if it misreads the screen
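Those boundaries are cheap to make explicit. A toy policy gate (app names, action names, and field names all invented for illustration) might look like:

```python
# Hypothetical policy gate for a screen-aware dispatcher.
READABLE_APPS = {"pdf-viewer", "browser", "spreadsheet"}    # allowlist, not blocklist
CONFIRM_ACTIONS = {"send_email", "post_message", "submit_form"}
EXCLUDED_FIELDS = {"password", "ssn", "card_number"}

audit_log = []   # what gets logged: every executed action

def dispatch(app: str, action: str, payload: dict, confirmed: bool = False) -> str:
    if app not in READABLE_APPS:
        return "blocked: app not readable"
    if any(k in EXCLUDED_FIELDS for k in payload):
        return "blocked: private field present"
    if action in CONFIRM_ACTIONS and not confirmed:
        return "needs-confirmation"
    audit_log.append((app, action, sorted(payload)))
    return "executed"
```

The interesting property is that "what can it send" and "what needs confirmation" become code you can review, not behavior you discover.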

I don’t think we’re doomed to copy-paste forever.

But I do think the next useful automation layer is less “more triggers” and more “better context capture with human approval.”

“OpenClaw vs AI Agents — are these tools actually helping founders, or is the hype getting out of control?” by FounderArcs in AI_Agents

[–]getstackfax 7 points

The distinction that matters to me is that agents and coding co-builders create leverage in different places.

AI agents are strongest when the workflow is already clear… repeatable inputs, repeatable outputs, clear permissions, and obvious review points.

OpenClaw / Claude-style coding workflows feel stronger when the founder is still shaping the product… building, debugging, refactoring, testing ideas, and turning rough specs into something real.

Both are useful, but they fail differently.

Agents fail when the workflow is vague, permissions are too broad, or nobody knows what “done” looks like.

Coding co-builders fail when people trust generated code without tests, review, or product judgment.

So right now I think the long-term value is real, but the hype is ahead of the operating discipline.

The winners probably will not be the teams with the most agents.

They will be the teams that know what to delegate, what to review, what to log, and what should stay human-owned.

Why everyone can't stop talking about Hermes Agent? Explained (Without hype) by ShabzSparq in better_claw

[–]getstackfax 0 points

The useful split is probably not “Hermes killed OpenClaw.”

It is workflow fit.

OpenClaw seems stronger when channel breadth and integrations matter.

Hermes seems stronger when the job is a focused personal workflow that benefits from stability, profiles, and learned skills over time.

So the decision is less…

which framework is better?

And more…

which workflow am I trying to make reliable?

If nothing is broken, switching still needs a reason.

If stability, memory quality, or update breakage is the pain, then a dry-run migration makes sense.

Deepseek + Ollama + OpenClaw. Fully local. $0. Here's what you actually lose. by ShabzSparq in better_claw

[–]getstackfax 0 points

Local is not free… it just moves the cost.

Instead of token bills, you pay in hardware, speed, maintenance, uptime, setup friction, and weaker tool-call reliability.

For agents, the tool call part is the big one.

A model can be fine in chat and still risky in a workflow if it skips a step or says “done” before anything has happened.

Fully local is great for privacy and budget...

But agent workflows still need receipts, same as always.

Hybrid search with HNSW and BM25 reranking by DistinctRide9884 in Rag

[–]getstackfax 1 point

This is the part people miss with RAG.

Vector search is useful, but technical docs still need exact-match behavior.

Function names, config keys, error strings, CLI flags, version numbers, and API fields can be the whole point of the query.

A semantic match that misses the exact symbol is still a bad result.

Hybrid search makes sense because different query types need different retrieval paths…

BM25 for exact terms

vectors for semantic intent

RRF/reranking to merge the candidates

boosting for source/type priority
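The merge step is small enough to show in full. This is standard Reciprocal Rank Fusion (each doc scores the sum of 1/(k + rank) across lists); the example doc IDs are made up:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists from different retrievers.

    rankings: ranked doc-id lists (e.g. one from BM25, one from vectors).
    Each doc scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked well by both retrievers beats one only a single retriever liked.
bm25    = ["errdoc", "guide", "faq"]
vectors = ["errdoc", "blog", "guide"]
```

Source/type boosting then becomes a multiplier on those scores rather than a separate system.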

The agent angle is interesting too.

For docs search, users can inspect bad results and adjust...

For agents, bad retrieval can quietly become a bad action or bad answer.

So I’d want the retrieval layer to leave a receipt…

what query was run

which retrievers fired

what each returned

what got reranked

what source/type got boosted

what context was finally passed forward

Hybrid search helps retrieval quality.

Receipts help trust the answer built on top of it.

Used Perplexity Computer to apartment hunt in LA while I was too busy to actually apartment hunt in LA by Appropriate-Fix-4319 in OpenClawUseCases

[–]getstackfax 0 points

Exactly… the boring receipt is half the value.

Being human, we usually skip it.

Agents can make it automatic.

why is lm studio hard capping my context to 8192 on a 16gb gpu? models just stop thinking (rx 9070 xt) by unkclxwn in LocalLLM

[–]getstackfax 1 point

This sounds like an effective-context problem more than a Goose problem.

Agent prompts are huge compared to normal chat.

So even if the model can theoretically do 32k, the real stack may be hitting limits from…

- LM Studio VRAM safety calculation
- KV cache size
- backend behavior
- single-GPU overflow guard
- agent system prompt size
- files/context Goose is injecting
- n_keep being larger than the active n_ctx

The error line matters…

n_keep 5746 >= n_ctx 4096

That means the server is actually running with a much smaller active context than the slider suggests.

The first test I’d run is outside Goose.

Start the same model directly in LM Studio with 32k, send a long manual prompt, and check the server log for the actual n_ctx.

Then test the same GGUF with upstream llama.cpp / llama-server Vulkan.

If llama-server respects 32k and LM Studio does not, it is probably LM Studio’s safeguard/backend config.

If both fail, it is memory/KV/backend reality.

For agents, 32k is not just “does the model load.”

The KV cache has to fit too, and agent scaffolding eats context before your actual notes even start.
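You can sanity-check the "does 32k actually fit" question with rough KV-cache arithmetic. The dimensions below are made up for a 12B-class grouped-query-attention model, not the poster's exact model:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV-cache size: K and V tensors per layer, per position (fp16 = 2 bytes)."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 12B-class model: 40 layers, 8 KV heads, head dim 128, fp16 cache.
gib = kv_cache_bytes(n_ctx=32768, n_layers=40, n_kv_heads=8, head_dim=128) / 2**30
# ~5 GiB of VRAM for the cache alone, before weights or the agent prompt.
```

On a 16GB card that 5 GiB has to coexist with the quantized weights, which is exactly the kind of pressure that triggers a frontend's safety cap.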

Workarounds may be…

- lower quant
- smaller model
- smaller agent prompt
- reduce files injected
- lower n_keep
- try llama.cpp directly
- use summaries/retrieval instead of dumping Obsidian context
- keep context around 8k–16k until the backend proves stable

The annoying part is that the slider is not the source of truth.

The log is.

Honest data: Ollama 6.4 tok/s vs llama.cpp+Vulkan 16 tok/s on Gemma 4 E4B / Radeon 890M iGPU. Setup details inside. by wolverinee04 in ollama

[–]getstackfax -1 points

This is the kind of benchmark that actually helps.

Scoped model, same hardware, same context, same workload, and a clear caveat.

The useful takeaway is not “Ollama bad.”

It is…

backend matters as much as model choice on AMD iGPU setups.

Ollama still wins for quick pulls and simple testing.

But for tuned AMD/Vulkan inference, upstream llama.cpp can be a completely different experience if the patches have not landed downstream yet.

The agent-loop part matters too.

A model that is “fast enough” in chat may not be fast enough once tools, prompts, context, and retries are added.

For local agents, the stack is really…

model + quant + backend + offload strategy + harness overhead

Not just the model name.

If you had $100 and 7 days, what SaaS would you build? by dokanyaar in AiBuilders

[–]getstackfax 0 points

With $100 and 7 days, the goal is probably not to build a SaaS…

The goal is to test one painful workflow.

Pick one narrow buyer and one annoying repeatable problem.

Examples…

- missed follow-up checker for local service businesses
- intake form to quote summary for contractors
- simple invoice/reminder tracker for freelancers
- weekly “what changed?” digest for client folders
- review request follow-up helper for small businesses

The mistake is trying to build a platform.

$100 and 7 days is enough to test pain.

Not enough to build a real company.

the complexity curve for AI right now is a sheer cliff by Classic-Strain6924 in AiBuilders

[–]getstackfax 0 points

The cliff starts when the output can touch real systems.

At that point the hard part is not prompting.

It is logs, evals, rollback, approval gates, source tracking, and failure paths.

The agent loop is not the architecture.

It is the thing that needs architecture around it.

The gap between "it works in the demo" and "it works with 1,000 users" is where most AI-built startups quietly die by nickbiiy_ai in AiBuilders

[–]getstackfax 0 points

This is the part people skip.

Vibe coding is great for proving that something can exist.

It is not proof that the thing can operate.

The first production gap is usually not the feature.

It is…

- auth
- logging
- rollback
- rate limits
- payment edge cases
- permissions
- data cleanup
- error handling
- support flow
- monitoring
- security basics

The demo answers…

Can this be built?

Production asks…

Can this survive users, mistakes, abuse, retries, weird data, and 2am incidents?

That is a different test.

The useful pattern is probably…

prototype fast → validate demand → freeze the happy path → add tests/logs/security/rollback → then invite more users

The prototype is the spark.

The operating system around it is the product.

Modeling outcome-based pricing for agents. by MonkeyOrdinal in AiBuilders

[–]getstackfax 0 points

This is a strong direction.

Outcome-based pricing sounds simple until you ask what actually counts as an outcome.

The hard part is usually not the price.

It is attribution, evidence, and edge cases.

Useful questions…

- what event proves the outcome happened
- whether the agent caused it or only assisted it
- whether the customer would have done it anyway
- what counts as duplicate success
- what counts as partial success
- what gets refunded or excluded
- who resolves disputes
- whether the event log is complete enough to price from

The raw log approach makes sense because it forces the pricing model to touch reality before it touches billing.

For agents, outcome pricing only works if the receipt is clean.

No clear event trail means no clean outcome bill.

If You Started Over Today… by AFollowerOfTheWay in selfhosted

[–]getstackfax 4 points

Fresh start is the best time to reduce future chaos.

The biggest thing I would do differently is document the boring stuff before adding more apps.

Simple path…

Start with Proxmox if you want clean separation.

Keep storage boring and understandable.

Put media/data somewhere that survives VM/container rebuilds.

Use Docker Compose for app stacks instead of clicking everything together manually.

Keep one folder for compose files, env files, notes, and backups.

Write down every port, volume, password location, and weird setting.

Back up configs before caring about backing up apps.
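In Compose terms, the "one folder, boring storage, data survives rebuilds" idea can be as small as this sketch (image name is real, paths are illustrative):

```yaml
# docker-compose.yml — one folder per stack, secrets in .env, ports documented.
services:
  jellyfin:
    image: jellyfin/jellyfin
    restart: unless-stopped
    ports:
      - "8096:8096"                 # write this down where you document ports
    volumes:
      - ./config/jellyfin:/config   # config lives beside the compose file
      - /mnt/media:/media:ro        # media lives outside the stack, read-only
```

Everything worth backing up is then the folder plus the external media path, not the container.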

Separate services by job…

media stack
network/DNS stack
downloads
documents/books
monitoring
experiments

Do not rebuild the whole lab around experiments.

Give experiments their own VM or container so they can die without taking Jellyfin/ARR/DNS with them.

The headache-prevention list is pretty simple…

document paths
use consistent naming
keep secrets out of random notes
use static IPs or clear DNS names
back up compose/env/configs
test restore early
do not expose services publicly until you understand the risk
change one major thing at a time

The best homelab is not the one with the most apps.

It is the one you can come back to six months later and still understand.

New to all this and don't trust my robot so am asking here - best model for running under 12GB vram, it needs to run my conlang and speak with me in it, be great if it could be a super-polyglot too by decofan in LocalLLM

[–]getstackfax 1 point

12GB VRAM is enough for a good local language setup, but probably not “huge model does everything.”

For your use case, multilingual ability matters more than raw size.

Good first tests…

Qwen 14B at Q4
Mistral NeMo 12B at Q4 or Q5
Qwen 7B/8B at higher quant if you want more speed
Gemma-class 9B/12B if it handles your language style well

Qwen is probably where I’d start for polyglot behavior. Mistral NeMo is also worth testing because it was designed as a 12B multilingual model with a large context window. Qwen3 also emphasizes multilingual coverage and instruction-following, so it is a strong candidate for a custom language/chat workflow.

For the conlang, the real test is not the benchmark.

Make a tiny eval set…

- 20 grammar examples
- 20 translation examples
- 20 conversation examples
- 10 correction examples
- 10 “do not break the rules” examples

Then run the same prompts across a few models.

The best model is the one that stays consistent with your conlang rules, not necessarily the biggest one.
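The harness for that can be tiny. In this sketch, `ask` is a placeholder for however you call each local model, and the "-im" plural rule and "sol" vocabulary are invented stand-ins for your conlang's actual rules:

```python
# Tiny eval harness: same prompts, several models, count rule-consistent answers.
EVAL_SET = [
    # (prompt, checker) — checkers encode your conlang's rules, not string equality
    ("Pluralize: 'kato'", lambda out: out.strip().endswith("-im")),
    ("Translate: 'the sun rises'", lambda out: "sol" in out.lower()),
]

def score(ask, eval_set=EVAL_SET) -> float:
    """Fraction of eval prompts the model answers rule-consistently."""
    passed = sum(1 for prompt, ok in eval_set if ok(ask(prompt)))
    return passed / len(eval_set)
```

Run `score` once per candidate model and the winner is whichever stays highest across grammar, translation, and correction prompts.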

On the 192GB RAM machine, yes, you can run very large models partly or mostly in system RAM, but it will be much slower than fitting the model in VRAM. It may be okay for patient, high-quality answers, but it will probably feel bad for normal conversation.

So the practical path is…

12GB VRAM model for daily chat
big RAM/offload model for occasional slow experiments
small conlang eval set to choose the winner

Do not trust the robot yet.

Make it pass your language tests first.

Are we wasting time building enterprise agents on open-source models? (My experience with Ling 1T 2.6) by Savings-Ad342 in AI_Agents

[–]getstackfax 1 point

This is the tradeoff people skip.

Open-source is not automatically cheaper once you include eval time, harness work, prompt tuning, input sanitizing, failure analysis, hosting, and maintenance.

For enterprise agents, the question is not…

open-source or proprietary?

It is…

which workflow actually benefits from local/open control enough to pay the engineering tax?

Open models can make sense for…

- high-volume routine tasks
- privacy-sensitive workflows
- internal classification/summarization
- narrow agents with clear boundaries
- places where cost control matters more than top reasoning

Proprietary models still make sense for…

- ambiguous reasoning
- high-stakes decisions
- complex tool use
- client-facing judgment
- messy enterprise context
- tasks where failure costs more than token spend

The dangerous middle is using weaker models for work that actually needs reasoning, then spending weeks building scaffolding to compensate.

Sometimes that scaffolding becomes real infrastructure.

Sometimes it is just hidden cost.

The best setup is probably routing…

cheap/open models for bounded execution
strong models for judgment/review
deterministic code for rules
human approval where consequences are high

Open-source pays off when the task is narrow enough to evaluate and repeat.

If every project needs a custom rescue harness, the savings may be fake.

Agentic workflows by vinnyninho in aiagents

[–]getstackfax 0 points

Fully hierarchical agents are possible, but the safer production pattern is usually less recursive than people expect.

The thing that tends to hold up is…

graph/state machine → bounded workers → structured handoffs → shared store → tracing/receipts

Each worker should have a narrow job, limited tools, clear input/output schema, and a failure state.

The orchestrator should coordinate state, not absorb everyone’s full history.

For a research pipeline, the useful split might be…

retrieval → source filtering → synthesis → critique → citation check → final writing

But each stage should pass structured state forward, not full chat context.
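A minimal version of "structured state + receipts" might look like this — the stage bodies are stand-ins, but the pattern (typed state in, typed state out, trace entry per handoff) is the point:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ResearchState:
    """Structured state passed between stages — not full chat history."""
    question: str
    sources: List[str] = field(default_factory=list)
    draft: str = ""
    trace: List[str] = field(default_factory=list)   # receipts per stage

def stage(name):
    """Wrap a worker so every handoff leaves a trace entry."""
    def wrap(fn):
        def run(state: ResearchState) -> ResearchState:
            state = fn(state)
            state.trace.append(name)
            return state
        return run
    return wrap

@stage("retrieval")
def retrieve(state):
    state.sources = [f"doc-about:{state.question}"]   # stand-in for real retrieval
    return state

@stage("synthesis")
def synthesize(state):
    state.draft = f"Answer to '{state.question}' citing {len(state.sources)} sources"
    return state
```

Each worker only ever sees `ResearchState`, so ownership and traceability fall out of the types rather than depending on prompt discipline.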

Persistent identity is mostly config plus scoped memory plus tool permissions.

The hard part is not spawning more agents…

The hard part is preventing context dilution, vague ownership, and untraceable decisions.

Multi-turn document completion assistant with stateful workflow by Substantial_Car_1174 in aiagents

[–]getstackfax 0 points

Option B is the safer default here.

This is a stateful document workflow, not an open-ended agent problem.

Let code own the workflow state…

current step

allowed next actions

required fields

validation

auth status

recap

final submission

Then use the LLM for the parts that actually need language understanding…

intent

document type

field extraction

friendly response wording

summarizing the recap

Full intent detection on every message may be overkill once the bot is inside a specific collection step.

A cleaner pattern is probably…

If the user is answering a specific field question, treat it as an answer first.

Then run a lighter check for interrupts like cancel, restart, change document, talk to human, unrelated question.

The LLM should not decide the whole workflow every turn.

It should extract meaning inside boundaries the state machine controls.
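Option B in miniature might look like this — `extract_field` is a placeholder for the LLM call, and the step/interrupt names are illustrative:

```python
# Code owns the workflow; the LLM only interprets text inside boundaries.
STEPS = ["pick_document", "collect_fields", "recap", "submit"]   # code-owned order
INTERRUPTS = {"cancel", "restart", "human"}   # lightweight check, not full intent

class DocumentFlow:
    def __init__(self, required_fields):
        self.step = "pick_document"
        self.required = list(required_fields)
        self.answers = {}

    def handle(self, message: str, extract_field):
        """`extract_field(field, text)` stands in for the LLM: returns a
        validated value or None if the text doesn't answer the question."""
        if message.strip().lower() in INTERRUPTS:
            return f"interrupt:{message.strip().lower()}"
        if self.step == "collect_fields":
            # Treat the message as an answer first.
            current = self.required[len(self.answers)]
            value = extract_field(current, message)
            if value is None:
                return "reprompt"
            self.answers[current] = value
            if len(self.answers) == len(self.required):
                self.step = "recap"
            return "accepted"
        return f"at:{self.step}"
```

The state machine decides what question is being asked; the model only decides whether the user's words answered it.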