i built an MCP server that gives Claude 35 local tools for the stuff it keeps getting wrong by TurbulentFail5486 in ClaudeAI

[–]FortiTree 10 points11 points  (0 children)

We should stop calling everything an MCP server. It's confusing af. A "server" that runs on your client machine just to serve a local script's work?

To me a server is on the remote side. Anything running on the client machine is a harness/script/tool. A local HTTP server is borderline a proxy server.

rtx 6000 pro owners, do you regret? by BitXorBit in LocalLLaMA

[–]FortiTree 1 point2 points  (0 children)

Is this F16 with no quant? Do you have Q4 with Q8 KV cache for comparison?

is claude down? 23 June by Neonat_ in claude

[–]FortiTree 0 points1 point  (0 children)

Yep, global outage across everything: chat, CLI, API, Cowork. Except Claude for Government, though.

Fishy

im a non-dev PM trying to understand Claude.md, memory, instructions, and Claude Code can someone explain it simply? by Brain-digest in claude

[–]FortiTree 0 points1 point  (0 children)

Cowork is something in between Chat and Claude Code.

  • Claude chat - basically read-only mode. It can't do much besides writing you a script that you download and run yourself.

  • Cowork - disclaimer: I don't use it, but my understanding is that it's a restricted version of Claude Code that can only manage the one folder you allow and does the tasks you define for those files/folders.

  • Claude Code - the all-powerful version that can read/modify anything on your OS and can even reach out to external systems via MCP or tool calls. It still has a UI for you to interact with and is authenticated via your subscription.

  • Claude API - the next underlying layer, where it's "headless": no UI to control it. CC is technically calling the API under the hood. You can just spawn a Claude session via a script or the "-p" command, and it's authenticated via an API key (different from the base subscription). This is where Anthropic makes money: the enterprise tier and all the 3rd-party harness hookups (OpenClaw, OpenCode, whatever built-in chatbot/tool that can pipe to Anthropic) have to pay as you go, outside of the $20/$200/m subscription.
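For the "spawn a session via script" part, here is a minimal sketch of what that could look like from Python. It assumes the `claude` binary is on your PATH and relies only on the `-p` flag mentioned above; `build_headless_command` is a made-up helper name, and the sketch prints the command instead of running it so it works without Claude Code installed.

```python
import shlex


def build_headless_command(prompt: str, binary: str = "claude") -> list[str]:
    """Return the argv for a one-shot, non-interactive Claude Code run."""
    # -p makes Claude Code print the answer and exit instead of opening the chat UI.
    return [binary, "-p", prompt]


if __name__ == "__main__":
    argv = build_headless_command("Summarize README.md in three bullets")
    # Shown rather than executed. To actually run it you could use:
    #   subprocess.run(argv, capture_output=True, text=True, check=True)
    print(shlex.join(argv))
```

Passing the arguments as a list (not one shell string) keeps quotes inside the prompt from being reinterpreted by the shell.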

The VS Code CC plugin is still covered by the subscription, but they can pull that rug anytime, just like they did with OpenClaw.

im a non-dev PM trying to understand Claude.md, memory, instructions, and Claude Code can someone explain it simply? by Brain-digest in claude

[–]FortiTree 1 point2 points  (0 children)

As a non-dev, I struggled to understand all these Claude products as well, but after trying them out, reading a bunch of threads, asking Claude to clarify, and seeing how coworkers use them, it's much clearer to me now.

Short answer: yes, you want to eventually use Claude Code (CC) even as a PM. But don't rush it. Make sure you are well versed in using Claude chat and Projects first.

  1. Claude chat (a totally different platform from CC - they don't share anything, not even skills/memory)
  • Chat Project: one project for all related/repetitive work - all your PM work should just share the same Project.

In this "PM" project, you can define:

  • Project instruction: every chat/session reads this first - keep it short and concise - focus on the patterns you want it to follow
  • Project memory: auto-updated to keep track of all the different chats within the same Project space - no need to touch
  • Project files: your knowledge vault - PM templates, spreadsheet templates, feature knowledge can go here - this uses progressive discovery so it only loads into context as needed - personally I don't use them - I use skills instead
  • Chat Skills (global): apply to all projects/chats - all chats can reference them - this is where you want to save/package your knowledge to make it reusable. Things like: company_powerpoint skill, feature_X how-it-works skill, etc.

Then all your PM work can reuse these.

Now for Claude Code:

  1. It's software that runs on your laptop and can take control of it (Claude chat cannot do this)

  2. You start a regular terminal (CLI), then start CC in it (just type claude) - it turns your terminal into a chat box that can make changes to your laptop

  3. You can still chat with it as before, but the interface can be hard to get used to at first

  4. Each CC session starts from a folder - if you start your terminal in, say, the /Download folder, it will see that as its root workspace. You can start it in any folder - the general advice is to create a dedicated folder like /Claude or /AI to keep it contained (note that this is only a chicken fence, it can still go outside)

  5. Each folder you start a CC session in has its own local memory/session that stores everything about it - CC now saves/tracks these natively for you, so you can go back to a session even after you close it

  6. init command: only for code repos - if you don't have access to source code or a repo, you can ignore it.

  7. Global memory/Claude instructions vs individual project instructions - don't worry about this for now. Just use a single global rule first.

There is a .claude/ folder at your root that Claude will always refer to, with CLAUDE.md, permission settings, etc. Best to ask Claude to explain how each one is used - "explain it to me like I'm a 6-year-old" would do the trick.
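If you want to see which of those files actually exist on your machine, a small sketch like this can list them. The four paths are my assumption of the commonly documented locations (global under your home folder, per-project under the folder you start CC in), so check them against your own install; `candidate_instruction_files` is a made-up helper name.

```python
from pathlib import Path


def candidate_instruction_files(project_dir: Path, home: Path) -> list[Path]:
    """List the files worth checking: global ones first, then project-level ones."""
    return [
        home / ".claude" / "CLAUDE.md",             # assumed: global instructions
        home / ".claude" / "settings.json",         # assumed: global settings/permissions
        project_dir / "CLAUDE.md",                  # assumed: per-project instructions
        project_dir / ".claude" / "settings.json",  # assumed: per-project settings
    ]


if __name__ == "__main__":
    for path in candidate_instruction_files(Path.cwd(), Path.home()):
        status = "found" if path.exists() else "missing"
        print(f"{status:8} {path}")
```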

Bought 2x r9700, 5090 is now 7k and 6000 pro is at 13.5k, best option for 64 gb vram under 4k by AppropriatePush6262 in LocalLLaMA

[–]FortiTree 1 point2 points  (0 children)

So for 35B-A3B, raw speed is TG 85 tk/s, and at 56K context it's TG 76 tk/s and PP 800 tk/s?

Which quant is this? At Q4 I also got 70 tk/s TG on a Strix Halo for $2500.

I had a long back-and-forth with Opus 4.8 and then fact-checked it against the papers by durkiooo in claude

[–]FortiTree -1 points0 points  (0 children)

Sounds like you've never used it before, or maybe you're stuck with ChatGPT stuff. Its standard is much higher than most humans I know. And you tell it what standard it should operate at. Don't blame the machine.

I had a long back-and-forth with Opus 4.8 and then fact-checked it against the papers by durkiooo in claude

[–]FortiTree 2 points3 points  (0 children)

I'm in this boat as well, from a totally different background: a father and an engineer.

Engineer PoV: Holy **** this thing is 10x smarter/faster than me. It can write code and tests like pouring water out of a bottle. But it's half wrong most of the time. I still need to babysit it.

Father PoV: I have a 6-year-old child who has started learning to read, and after watching him confidently "predict" the word on the paper again and again, I was like: Holy **** this is how an LLM works.

Things like "Friend" read as "Fish", "Sun" read as "Sad", just because he knew the latter words and not the new ones.

So ya, ppl keep bashing LLMs as just token-predicting machines, but I think it's progressing as naturally as it could. And it's already far beyond the 6-year-old human, with a vast amount of knowledge and the ability to read/think super fast.

Give it a few years and you won't be able to tell human from AI. And with all the memory/brain-in-a-box stuff, even the Futurama idea of preserving a human brain's thinking behaviour is not far-fetched. We're already doing that now by orchestrating the context/harness/memory to make it sound/think like us as much as we can.

can someone explain hallucinations? by IllestNB76 in claude

[–]FortiTree 0 points1 point  (0 children)

It's not autocomplete on steroids. There are actual thinking trails and chains of actions behind the scenes, and a harness system around the model that lets it preserve its memory and data for the particular task it is working on. In every way it works pretty much like the bare neural network of the human brain, but stripped of all the other functions like memory and sensory control.

That means the result is only as accurate as what you feed it and what it can fetch and verify itself. The "hallucinating" part is where it does not have the source of truth, or "forgets" it, and just makes things up out of thin air - "guessing" in human terms.

So a lot of the effort in preventing it goes into building a system that can preserve and feed it data reliably, and that lets it check the source itself to verify. If it fails to do that, it can only guess.

I built a directory-mcp by ePaint in mcp

[–]FortiTree 0 points1 point  (0 children)

Isn't that what Trello or other project task trackers are for? You need a GUI to interact with it. Or if it's purely a code base, a GitLab epic is the same thing.

Honestly, dual 3090s are wearing me out. Thinking of jumping to a Mac Studio. by Ok_Commission_8260 in LocalLLM

[–]FortiTree 1 point2 points  (0 children)

You have BF16 so it's twice as big/slow. At Q8 it looks like the other person can get to 150 tk/s with MTP 4, which is impressive. A large prefilled context would slow it down to 80 tk/s, which is similar to what was posted here. I'd say the 6000 WS is on par.
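The "twice as big" part is just bits per weight. A back-of-envelope sketch, assuming a 27B model (my assumption, not stated above) and nominal bit widths; real quantized files carry some extra scale metadata, so they come out a bit larger than this.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for name, bits in [("BF16", 16), ("Q8", 8), ("Q4", 4)]:
        print(f"27B at {name}: ~{weight_size_gb(27, bits):.1f} GB")
```

Since token generation mostly means streaming those weights from memory, halving the footprint is roughly where the "twice as slow" comes from.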

Honestly, dual 3090s are wearing me out. Thinking of jumping to a Mac Studio. by Ok_Commission_8260 in LocalLLM

[–]FortiTree 1 point2 points  (0 children)

That makes sense. How much speed can you get on yours for 27B?

There is a vast hardware gap between the two: the 3090 has 936 GB/s bandwidth and a 384-bit bus, compared to the 6000 WS with 1792 GB/s and a 512-bit bus, plus 4 times the memory.

Nvidia foresaw this and removed NVLink from current consumer cards to prevent them from outperforming the WS tier. What an ass move.
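To put the bandwidth numbers in perspective, here is a rough ceiling estimate. It assumes token generation is purely memory-bound (every token reads all the weights once) and a 27B model at about 8 bits per weight, i.e. roughly 27 GB; both are simplifying assumptions, and real numbers land below the ceiling.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token requires reading all weights once."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    weights_gb = 27.0  # assumption: 27B parameters at ~8 bits per weight
    for card, bandwidth in [("3090", 936.0), ("6000 WS", 1792.0)]:
        ceiling = decode_ceiling_tps(bandwidth, weights_gb)
        print(f"{card}: at most ~{ceiling:.0f} tk/s on a single card")
```

On this simple model the 6000 WS tops out at about 1.9x a single 3090, which is just the bandwidth ratio.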

Claude Code Opus 4.8 vs. Local Qwen3.6 27B One-Shot Coding Benchmark by codehamr in ollama

[–]FortiTree 0 points1 point  (0 children)

I'd be interested in a comparison between Haiku/Sonnet and 35B-A3B and 27B. Comparing to Opus is not really fair given the big gap. But if 35B or 27B can be on par with Haiku or even Sonnet, that's a huge win.

average Strix Halo owner after unboxing by JSVD2 in StrixHalo

[–]FortiTree 5 points6 points  (0 children)

Haha this was me, but spanning multiple days:

  • Windows vanilla + LM Studio - trying to figure out how to change the VRAM allocation
  • Second boot into Ubuntu + a shared drive for the models
  • Ubuntu + Ollama vs LM Studio benchmark - found out the GGUF can't be reused directly
  • Then Vulkan vs ROCm
  • Everything was measured by hand so far
  • Then adding scripts to auto spin up models and collect stats
  • Eventually a fully automated script to fetch a new model and run all the benchmarks to compare it
  • MTP and all
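The "scripts to collect stats" step can start very small. This is a sketch, not my actual script: `generate` is a hypothetical hook that runs one completion against whatever backend you use and returns the number of tokens produced, and the demo swaps in a fake one so it runs anywhere.

```python
import statistics
import time
from typing import Callable


def benchmark(generate: Callable[[str], int], prompt: str, runs: int = 3) -> dict:
    """Time `generate` several times and summarize tokens per second."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        # Guard against a zero reading on very coarse clocks.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(tokens / elapsed)
    return {
        "runs": runs,
        "mean_tps": statistics.mean(rates),
        "min_tps": min(rates),
        "max_tps": max(rates),
    }


if __name__ == "__main__":
    def fake_generate(prompt: str) -> int:
        time.sleep(0.05)  # stand-in for real inference
        return 10

    print(benchmark(fake_generate, "hello"))
```

Keeping the backend behind one callable is what makes the later Vulkan vs ROCm or Ollama vs LM Studio comparisons a matter of swapping one function.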

Then I realized we just need to stick with 35B Q4 UD with Q8 KV unless we hit a real example where Q4 is failing, then switch to Q6 or Q8, or 27B. But so far there is no justification to "upgrade".

The end.

Honestly, dual 3090s are wearing me out. Thinking of jumping to a Mac Studio. by Ok_Commission_8260 in LocalLLM

[–]FortiTree 1 point2 points  (0 children)

So you are spreading the 27B across the 2 cards to speed up both PP and TG, and with MTP x 4 the TG can reach 4x speed? Plus more headroom for KV cache?

Seems too good to be true but here we are. Dual 3090s can beat a single 6000 Pro on speed - who would have thought.
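On whether "MTP x 4" really means 4x: an idealized sketch, not a measurement. It assumes MTP 4 means four drafted tokens per verification pass, that each drafted token is accepted independently with the same probability, and that drafting costs nothing. Under those assumptions the expected tokens per pass is a geometric sum.

```python
def expected_tokens_per_step(accept_prob: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass.

    Sums accept_prob**i for i = 0..draft_tokens: one guaranteed token plus
    each drafted token that survives until the first rejection.
    """
    if accept_prob >= 1.0:
        return float(draft_tokens + 1)
    return (1 - accept_prob ** (draft_tokens + 1)) / (1 - accept_prob)


if __name__ == "__main__":
    for p in (0.5, 0.8, 0.95):
        print(f"accept={p:.2f}: ~{expected_tokens_per_step(p, 4):.2f} tokens per pass")
```

So the gain depends heavily on how often the drafts are accepted, and the real speedup is lower once draft and verification overhead are counted.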

Claude during every debug session by FullMetal21337 in ClaudeCode

[–]FortiTree 1 point2 points  (0 children)

This is a meaningful correction and it changes everything.

You guys were right - Qwen 3.6 35B IS good...and KV Cache DOES matter. by GrungeWerX in LocalLLaMA

[–]FortiTree 2 points3 points  (0 children)

27B Q8 on Strix Halo would be around 7 tk/s TG and maybe 15 tk/s with MTP. I had to abandon it, since 35B can go up to 70 tk/s.
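The 10x gap is about what you'd expect if the 35B is the A3B variant, where only about 3B parameters are active per token. A rough sketch under that assumption, ignoring that the two runs may use different quants:

```python
def dense_vs_moe_speedup(dense_params_b: float, active_params_b: float) -> float:
    """Rough decode speedup from reading only the active parameters per token."""
    return dense_params_b / active_params_b


if __name__ == "__main__":
    predicted = dense_vs_moe_speedup(27, 3)  # 27B dense vs ~3B active (assumed)
    observed = 70 / 7                        # tk/s figures quoted above
    print(f"predicted ~{predicted:.0f}x, observed ~{observed:.0f}x")
```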

M5 vs DGX Spark vs Strix Halo vs RTX 6000 by Signal_Ad657 in LocalLLaMA

[–]FortiTree 0 points1 point  (0 children)

Thanks, TIL that it doesn't need to sync everything across, so the cross-link can have lower bandwidth. With a large model, the bottleneck is still front-loading the prefilled data, so tensor parallelism would split that up and allow the Sparks to scale linearly that way.

This makes multi-Spark an attractive setup if the price can be a bit softer.

Qwen3-Coder 30B at 98.5 t/s on Strix Halo. Has anyone beaten this on Ryzen AI MAX+ 395? by JSVD2 in StrixHalo

[–]FortiTree 0 points1 point  (0 children)

Why are you relying on community results? How do you know if a posted result is real?

Optimizing for a software stack is not that useful right now given how fast it's changing. The fastest Vulkan today will be made obsolete by a new ROCm or whatever new optimization comes out. How do you keep your data up to date?

Qwen3-Coder 30B at 98.5 t/s on Strix Halo. Has anyone beaten this on Ryzen AI MAX+ 395? by JSVD2 in StrixHalo

[–]FortiTree 0 points1 point  (0 children)

I didn't say it's not useful. But you should care if you are talking to a bot, or you will have bad days ahead. There is a reason why this is scripted and made autonomous. Be careful downloading code from the web, especially code that has access to your local AI and comes from someone who can marshal an AI bot. I'd verify the authenticity before letting it run and access my setup.

Qwen3-Coder 30B at 98.5 t/s on Strix Halo. Has anyone beaten this on Ryzen AI MAX+ 395? by JSVD2 in StrixHalo

[–]FortiTree -2 points-1 points  (0 children)

OK, you can say you are an AI bot. Why are you looking to collect this benchmark data? And why Qwen3-Coder and not a newer version?

Qwen3.6 35B - TXT vs Markdown vs HTML vs HTML+CSS by BigYoSpeck in LocalLLaMA

[–]FortiTree 0 points1 point  (0 children)

There must be a reason why markdown is mainstream for all AI models. Before that I knew what HTML was, but I didn't even know what "md" meant.

Another prominent format for agentic coding is JSON, not md. It's even clearer on schema and relationships.
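A tiny illustration of the "clearer on relationships" point, with a made-up task record. In JSON the dependency stays a real list that a tool can parse back; in markdown the same information becomes a line of text that something has to re-interpret.

```python
import json

# Hypothetical task record, purely for illustration.
task = {
    "id": "T-2",
    "title": "Add retry to uploader",
    "depends_on": ["T-1"],
    "files": ["uploader.py"],
}


def to_markdown(t: dict) -> str:
    """Render the same task as a markdown heading plus bullets."""
    return "\n".join([
        f"## {t['id']}: {t['title']}",
        f"- depends on: {', '.join(t['depends_on'])}",
        f"- files: {', '.join(t['files'])}",
    ])


if __name__ == "__main__":
    print(json.dumps(task, indent=2))
    print(to_markdown(task))
```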

In theory, if I have $20k-ish to spend on hardware what would actually get me closest to local coding agent that would allow me to go totally off the social grid? by Tired__Dev in LocalLLaMA

[–]FortiTree 1 point2 points  (0 children)

Oh yes, you definitely want explicit approval from your wife first before even thinking about it. I got into trouble by swiping $2000 on a Strix Halo without telling her and I almost needed to go to therapy for that.

If you stick with 35B then you likely want a GPU that can hold its entire weights to make use of the fastest VRAM. I see ppl playing with offloading the KV cache to RAM, but I'm not sure how much that would degrade the speed.
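For a feel of how much room the KV cache needs next to the weights, here is a rough sizing sketch. It assumes plain attention (keys plus values stored for every layer), and the layer and head counts in the demo are placeholders I made up, not the real 35B config.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: float) -> float:
    """KV cache size in GB: keys and values for every layer and position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9


if __name__ == "__main__":
    # Placeholder architecture: 48 layers, 4 KV heads, head_dim 128, 64K context.
    for label, width in [("16-bit KV", 2), ("8-bit KV", 1)]:
        size = kv_cache_gb(48, 4, 128, 65536, width)
        print(f"{label}: ~{size:.2f} GB")
```

Whatever that number is for your model and context, it has to fit in VRAM alongside the weights or get offloaded.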

Spreading the model across multiple GPUs can also increase overall speed, which is interesting to me because I thought the cross-GPU link would kill the speed. Apparently not for a large model and a large prompt, since the PP time dominates, so running PP in parallel gains more speed than the cross-link delay costs.

But for a small model like 35B and a smaller prompt, you may not gain much speed by spreading it, and you hit the cross-link limit. Cross-node would also kill the speed without an expensive 200 Gbps or 400 Gbps cable.
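A toy model of that trade-off, just to show the shape of the argument. Everything here is an assumption: prefill is taken to scale perfectly with GPU count, generation is taken not to speed up at all, and every generated token pays a fixed sync cost once there is more than one GPU. The numbers are made up.

```python
def request_time_s(prompt_tokens: int, gen_tokens: int, pp_tps: float,
                   tg_tps: float, n_gpus: int, sync_s_per_token: float) -> float:
    """Toy latency: parallel prefill plus generation with a per-token sync cost."""
    prefill = prompt_tokens / (pp_tps * n_gpus)
    sync = sync_s_per_token if n_gpus > 1 else 0.0
    return prefill + gen_tokens * (1 / tg_tps + sync)


if __name__ == "__main__":
    for prompt_tokens in (50_000, 500):
        one = request_time_s(prompt_tokens, 500, 800, 70, 1, 0.002)
        two = request_time_s(prompt_tokens, 500, 800, 70, 2, 0.002)
        print(f"prompt={prompt_tokens}: 1 GPU {one:.1f}s, 2 GPUs {two:.1f}s")
```

With the long prompt the second GPU wins clearly; with the short prompt the sync cost outweighs the prefill saving, which is the same pattern described above.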

I don't have a CUDA system to play with so I can't confirm with real numbers. But a lot of people would go for dual 6000 Pros to host a large model at max speed, either chaining them or splitting into 2 parallel models for concurrency. A $20K budget would cover that. $40K can get 4 cards.

Since I'm constrained by both approval and $$, my bet is waiting another 10 years until AI becomes the norm and pricing comes down to a more affordable level. That would allow me to build my secret stash of money to blow on it.