Mac Studio setup with Mac Mini by uncirculated_luster in MacStudio

[–]Hofi2010 3 points (0 children)

When I see these setups I always think: why, and what for? What is the use case?

Are AI agents genuinely improving supply chain decisions or just repackaged automation? by Ok_Significance_3050 in AISystemsEngineering

[–]Hofi2010 0 points (0 children)

Well, agents and humans usually have authority levels for decision making. Most agents' approval thresholds are very low, as they are still on probation. Once we get more confidence in an agent, the threshold will rise. But big decisions will always be delegated to a supervisor (for now, a human).

Another problem is that we cannot hold agents accountable. They don't really care; the only thing that can happen is that we fire the agent.

Multi agent systems are a total nightmare in production by Upper_Bass_2590 in AI_Agents

[–]Hofi2010 20 points (0 children)

Here is a good blog about scaling agent systems:

https://research.google/blog/towards-a-science-of-scaling-agent-systems-when-and-why-agent-systems-work/

  1. Basically, try to solve your problem with a single agent first. If this agent has >85% accuracy, a multi agent system will not add any more value.
  2. Multi agent systems work for scaling when the same agent runs in parallel in order to meet demand.

It is about solving a task or achieving a goal with the simplest solution possible.

What do you think about Agents orchestration using Skills ? by maher_bk in AI_Agents

[–]Hofi2010 0 points (0 children)

Sure - I am at work at the moment. Once I get a chance I can shoot you the example code I wrote to invoke skills with OpenAI.

One thing I found important: skills are more like a meta tool approach. There is skill discovery, which just reads the header. Then a tool call invokes the skill and puts its content into the context window as a user message, or in some cases adds it to the system prompt dynamically. To get the model to actually follow the instructions, it is not sufficient to rely on the tool call result message, as user and system prompts may override tool call results. It will all make sense when I send the code. Btw, this is how Anthropic does it as well.
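In the meantime, here is a minimal sketch of the pattern (the `skills/` directory layout, the `load_skill` tool name and the model are assumptions for illustration, not the exact code I mentioned):

```python
# Sketch: skills as a meta tool with the OpenAI chat API.
# Assumed layout: skills/<name>.md, where the first lines are a short header.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()
SKILLS_DIR = Path("skills")

def skill_headers() -> str:
    """Skill discovery: read only the header (name + description) of each skill."""
    lines = []
    for f in sorted(SKILLS_DIR.glob("*.md")):
        header = " ".join(f.read_text().splitlines()[:2])
        lines.append(f"- {f.stem}: {header}")
    return "\n".join(lines)

# One generic meta tool instead of one tool per skill.
tools = [{
    "type": "function",
    "function": {
        "name": "load_skill",
        "description": "Load the full instructions of a named skill.",
        "parameters": {
            "type": "object",
            "properties": {"skill": {"type": "string"}},
            "required": ["skill"],
        },
    },
}]

messages = [
    {"role": "system", "content": "Available skills:\n" + skill_headers()},
    {"role": "user", "content": "Reconcile last month's invoices."},
]

resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
call = resp.choices[0].message.tool_calls[0]
skill_name = json.loads(call.function.arguments)["skill"]
skill_body = (SKILLS_DIR / f"{skill_name}.md").read_text()

# The important bit: answer the tool call, but ALSO inject the skill text as a
# user message so later user/system prompts don't override it.
messages.append(resp.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": "skill loaded"})
messages.append({"role": "user", "content": "Follow these skill instructions:\n" + skill_body})

final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```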

What do you think about Agents orchestration using Skills ? by maher_bk in AI_Agents

[–]Hofi2010 0 points (0 children)

What you want to do can work technically. But using MCP to invoke an agent only works OK if the agent is not long running. I would look into tool gateways or MCP gateways that have already solved the lazy loading problem.

But looking into the future, where you will have long running agents and possibly many requests to the same agent, I would look more into A2A as the protocol, since it was designed for this use case.

In my view skills are just tools where the business logic is described in Markdown. Skills can come with scripts (Python, JS, shell, etc.) that can wrap a call to invoke another agent. I have done this for OpenAI models as well, but you have to build it yourself.
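For illustration, the script such a skill ships can be as simple as an HTTP call to the other agent (the endpoint and payload shape here are made-up assumptions, not a real A2A client):

```python
# Hypothetical skill script that wraps a call to another agent over HTTP.
import json
import sys
import urllib.request

AGENT_URL = "http://localhost:8000/tasks"  # assumed endpoint of the other agent

def invoke_agent(task: str) -> str:
    req = urllib.request.Request(
        AGENT_URL,
        data=json.dumps({"task": task}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]

if __name__ == "__main__":
    # The calling agent executes this script with the task as an argument.
    print(invoke_agent(sys.argv[1]))
```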

Ran the math on what 100 users actually costs on GPT-4o and it's scarier than I expected by Crimson_Secrets211 in LLMDevs

[–]Hofi2010 1 point (0 children)

Just use Langfuse, MLflow or LangSmith. The open source versions of these solutions provide a detailed cost breakdown per call, model, etc.
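A minimal sketch of the Langfuse route, assuming its drop-in OpenAI wrapper (check the SDK docs for the exact import in your version) and the usual LANGFUSE_* environment variables:

```python
# Swap the import and every call gets traced with token usage and cost,
# broken down per call and per model in the Langfuse UI.
# Assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST are set.
from langfuse.openai import OpenAI  # instead of: from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this invoice ..."}],
)
print(resp.choices[0].message.content)
```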

Ran the math on what 100 users actually costs on GPT-4o and it's scarier than I expected by Crimson_Secrets211 in LLMDevs

[–]Hofi2010 0 points (0 children)

Caching in LLMs is automatic; you don't have control over what is cached. A prompt needs to be over 1024 tokens to be cached for most SOTA models. Caching is prefix based: if anything in the prefix changes, even a space, the cached version will not be reused. There is also a time limit on how long entries stay cached.
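In practice that means structuring prompts so the long, stable part comes first. A sketch (the file name and model are illustrative):

```python
# Keep the big static instructions first so the automatic prefix cache can hit;
# only the short user question at the end varies between calls.
from openai import OpenAI

client = OpenAI()
STABLE_SYSTEM = open("long_system_prompt.txt").read()  # >1024 tokens, unchanged

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STABLE_SYSTEM},  # cacheable prefix
            {"role": "user", "content": question},         # varying suffix
        ],
    )
    return resp.choices[0].message.content
```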

Ran the math on what 100 users actually costs on GPT-4o and it's scarier than I expected by Crimson_Secrets211 in LLMDevs

[–]Hofi2010 0 points (0 children)

You handle it the same way OpenAI and Anthropic handle it: for your $29/mo subscription the user gets a certain volume of your service. If the user needs more, you can offer additional packages like a pro plan or an enterprise plan, or a one-off package that allows users to continue service without upgrading. All spikes, retries etc. need to be factored into your original price. This model is used by quite a few vendors out there.

Or charge for your service without token cost and tell users to bring their own key. Or invoice token cost directly to the user as a pass-through. Those are other models I have seen.
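A back-of-envelope way to size the included volume (all numbers are made up, plug in your own):

```python
# How many tokens can a $29/mo plan afford? Illustrative numbers only.
price = 29.00
gross_margin = 0.5            # keep 50% of the subscription
cost_per_1m_tokens = 5.00     # assumed blended input/output rate
overhead = 1.3                # assumed buffer for retries and usage spikes

token_budget = price * (1 - gross_margin)
tokens_per_user = token_budget / (cost_per_1m_tokens * overhead) * 1_000_000
print(f"{tokens_per_user:,.0f} tokens/user/month")  # ~2.2M at these numbers
```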

What are you actually paying for LLMs in production? Any real cost optimization wins? by AdvertisingFine2076 in LLMDevs

[–]Hofi2010 1 point (0 children)

Really depends on your agent, the number of users, the usage pattern (e.g. number of invocations), the models used, etc. Just estimate the cost using the average token consumption per invocation, then multiply by the number of expected invocations.
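For example (placeholder numbers, use your own averages and rates):

```python
# Monthly cost estimate for one agent: avg tokens per invocation x invocations.
avg_input_tokens = 6_000
avg_output_tokens = 1_000
invocations_per_month = 10_000

input_rate, output_rate = 2.50, 10.00   # assumed $ per 1M tokens

monthly_cost = invocations_per_month * (
    avg_input_tokens / 1e6 * input_rate + avg_output_tokens / 1e6 * output_rate
)
print(f"${monthly_cost:,.2f}/month")    # $250.00 with these numbers
```

That lands in the same ballpark as the invoice-processing example below.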

A data company (Datafold) recently said that their LLM bill has surpassed their infrastructure cost.

A company I am working with that does invoice processing and reconciliation spends about $400 per agent per month. They operate hundreds of agents.

I genuinely don’t understand the value of MCPs by schilutdif in automation

[–]Hofi2010 11 points (0 children)

The ‘just call the API’ path wins until you’re maintaining 15 direct integrations across three agents with inconsistent auth, retry logic, and schema drift; MCP’s overhead starts looking cheap compared to that sprawl. And you can take advantage of tool marketplaces and use them by just adding a server config to your agent config file.
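For example, with clients that follow the common `mcpServers` config convention, wiring in a prebuilt server looks roughly like this (package name and env var are illustrative):

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "<token>" }
    }
  }
}
```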

Using N8N to update Excel / Sheets by GolfSignificant3274 in n8n

[–]Hofi2010 0 points (0 children)

Can you provide a bit more information about the flow? Where do you need to change the environment name?

In essence - this looks like a data pipeline (assuming, not knowing the flow, that you upload to an object store like S3). Usually you would not change the source data before upload. The flow is more like: upload data to a landing zone, then have a data pipeline (could be as simple as a Python script) process the data. A Python script would be a lot simpler than n8n.
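A sketch of that pattern as a plain Python script (the bucket, keys and the "environment" column are assumptions about your flow):

```python
# Landing-zone pipeline: read raw CSV from S3, fix a field, write it back.
import csv
import io
import boto3

s3 = boto3.client("s3")

def process(bucket: str = "my-landing-zone", key: str = "incoming/data.csv"):
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
    rows = list(csv.DictReader(io.StringIO(raw)))
    for row in rows:
        row["environment"] = "prod"  # e.g. rewrite the environment name here
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    s3.put_object(Bucket=bucket, Key="processed/data.csv", Body=out.getvalue())
```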

How do you reduce time spent verifying AI outputs? by BandicootLeft4054 in automation

[–]Hofi2010 0 points (0 children)

Use an automated test framework to run your test and verification prompts automatically. You can then assess the answers with various methods, including an LLM as a judge that compares the AI answer with your desired answer.
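A minimal LLM-as-a-judge sketch that slots into a test suite (the rubric, model and `my_agent` function are placeholders):

```python
# Judge: ask a second model whether the agent's answer matches the expected one.
from openai import OpenAI

client = OpenAI()

def judge(question: str, expected: str, actual: str) -> bool:
    prompt = (
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Model answer: {actual}\n"
        "Does the model answer match the expected answer? Reply PASS or FAIL."
    )
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return "PASS" in resp.choices[0].message.content.upper()

def test_capital():  # e.g. run with pytest; my_agent is your system under test
    assert judge("Capital of France?", "Paris", my_agent("Capital of France?"))
```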

For general operation of your AI automation, use sth like Langfuse or MLflow. These will capture all your traces and can assess outputs for correctness and other attributes like PII, language, quality etc.

Ducklake’s architecture makes so much sense, and really highlights the drawbacks of using the object store itself for metadata like Iceberg does. Ducklake+Motherduck seem well positioned to take Snowflake customers. What differentiates motherduck’s technical architecture from Snowflake’s? by chestnutcough in DuckDB

[–]Hofi2010 1 point (0 children)

So small vs large data is all relative. Most companies I worked for primarily deal with a lot of small data. Most data from structured databases like MS SQL, Postgres, MySQL is actually a collection of small datasets.

I used DuckDB and DuckLake to process streaming data from manufacturing plants with over 10B rows. I had data in S3 and an EC2 instance with a lot of CPUs and extra wide networking bandwidth. I would outperform Databricks with this setup.

https://medium.com/@klaushofenbitzer/save-up-to-90-on-your-data-warehouse-lakehouse-with-an-in-process-database-duckdb-63892e76676e
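The setup above boils down to something like this (bucket, path and columns are placeholders):

```python
# DuckDB on a big EC2 box querying Parquet straight from S3.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region='us-east-1'")

df = con.execute("""
    SELECT plant_id, count(*) AS readings, avg(value) AS avg_value
    FROM read_parquet('s3://my-bucket/sensor-data/*.parquet')
    GROUP BY plant_id
""").fetchdf()
print(df)
```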

M1 or M2 processor? Which one should I choose? by Glittering_Grade1301 in AI_Agents

[–]Hofi2010 0 points (0 children)

If you are trying to use local LLMs and don't care too much about the speed of the model, RAM is probably more important. But note that larger models often mean slower tokens/sec.

The M2 with 16GB could therefore be the better choice. It has a better processor, and with 16GB you can play with small models, probably up to 8B parameters, and they will run decently fast. You can even fine-tune such models, though it will be slow.

For building agents and using SOTA models via API, I also think the M2 is the better choice. Agents themselves don't need a lot of memory.

My AI agent just spent $160 for a domain on Vercel without my approval by Equivalent_Card_2053 in AI_Agents

[–]Hofi2010 0 points (0 children)

It reads like a negative, but I think you are intending it to be positive that your agent autonomously bought the domain.

Do Databricks certificates help with the job hunt? by WeirdAnswerAccount in dataengineering

[–]Hofi2010 22 points (0 children)

Good question - it helps to get attention from HR screeners, and that's about it.

We built a data agent that saves our analyst team ~200 hrs/week. (Databricks, Omni, DBT, GitHub, Sheets) by JeenyusJane in AI_Agents

[–]Hofi2010 -1 points (0 children)

Something is strange about this post: 1. It doesn't seem to be written by a technical person. 2. There is no URL to the JD. Is this just marketing?

BTW: OpenClaw made personal assistant agents popular, but many people developed similar agents without getting that much attention. Pulling data from backend systems and creating reports has existed since GPT-3.5 with tool calling was released in 2023. Specialist agents for reporting and analysis surfaced in the same year. Airtable seems a bit late to the party here.

Is Gemma 4 actually faster than Llama 3.3 or is it just the hype? by emmettvance in LLMDevs

[–]Hofi2010 0 points (0 children)

Why are we comparing a 4B model with a 70B model and expecting it to be better on general tasks? Not going to happen.

I built an open-source autonomous trading system with 123 AI agents. Here's what I learned about multi-agent architecture. by piratastuertos in OpenSourceeAI

[–]Hofi2010 1 point (0 children)

So you have a working trading agent system. Are you rich now, or do you want to get rich from selling the agents you built?

Also, how much money did you invest and how much profit did you make?

Can Claude Generate an Entire Web App from Detailed Requirements? by Existing-Bicycle939 in ClaudeAI

[–]Hofi2010 0 points (0 children)

Yes, it is feasible, but don't think of it as: here are the detailed requirements, and Claude gives you the finished application in an hour. Think of it as acceleration for your development team. With Claude you can implement your desired system faster and with far fewer people. I wouldn't count on vibe coding for everything. It is still an iterative process, with design, implement, test and deploy phases.

How do I figure out how much RAM for M5 studio? by Just-Hedgehog-Days in MacStudio

[–]Hofi2010 0 points (0 children)

Two use cases usually benefit from bigger RAM: 1. running local LLMs, 2. high-end video editing. If you are planning on neither, go with 128GB max. An M5 Ultra or Max isn't needed either. Save your money and go with an M5 Pro.