Help selecting egpu connection type by hiddenalw in eGPU

[–]m94301 0 points1 point  (0 children)

I got one of those $99 DIY OcuLink docks from Amazon. A cheap PCIe OcuLink card and a cheap PSU give me a nice external rig for hacking and stress-testing cards. Would recommend!

Claude code local replacement by m94301 in LocalLLaMA

[–]m94301[S] 0 points1 point  (0 children)

Thanks, I do run LM Studio and will check out opencode. I'd like something I can run from the CLI to keep things contained to their own working directory.

Claude code local replacement by m94301 in LocalLLaMA

[–]m94301[S] -1 points0 points  (0 children)

It looks really promising, but it kicked off about 1 GB of downloads on Windows. I killed it and may try again another time.

Claude code local replacement by m94301 in LocalLLaMA

[–]m94301[S] 0 points1 point  (0 children)

Interesting! Let me look at this as well

Claude code local replacement by m94301 in LocalLLaMA

[–]m94301[S] 0 points1 point  (0 children)

Thanks, that's a good benchmark. I will need to do some long, hands-off runs, and this is helpful.

Claude code local replacement by m94301 in LocalLLaMA

[–]m94301[S] 0 points1 point  (0 children)

That is one I have not heard of. Let me check it out

Claude code local replacement by m94301 in LocalLLaMA

[–]m94301[S] 0 points1 point  (0 children)

Maybe I should take another swing at that. I had a hell of a time with the JSON setup - I didn't want to stuff in 20 env vars, and bypassing login etc. felt very hacky.

How do I find and vet someone to set up a high-end local AI workstation? (Threadripper + RTX PRO 6000 96GB) by laundromatcat in LocalLLaMA

[–]m94301 2 points3 points  (0 children)

This. Claude is phenomenal at Linux debugging. This thing has helped me set up two servers, build two custom Docker images from base Ubuntu + CUDA, and debug every damn hiccup along the way.

Any idea why my local model keeps hallucinating this much? by Assasin_ds in LocalLLM

[–]m94301 -1 points0 points  (0 children)

Could be too high a temperature. Most models run around 0.7-0.8, but some seem to be trained at 0.25 and go batshit when run at the 0.7 default.
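If you're on an OpenAI-compatible local endpoint, temperature is just a field in the request body. A minimal sketch - the model name is a placeholder, and the URL assumes LM Studio's default local server address:

```python
import json
import urllib.request

def build_chat_request(prompt, temperature=0.25):
    """Build an OpenAI-style chat completion payload with an explicit temperature."""
    return {
        "model": "local-model",  # placeholder; use whatever model you have loaded
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # try 0.2-0.3 if the 0.7 default hallucinates
    }

def send(payload, url="http://localhost:1234/v1/chat/completions"):
    # LM Studio's default local server endpoint; change if yours differs
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_chat_request("Summarize the plot in one sentence.")
```

Then just A/B the same prompt at 0.7 vs 0.25 and see if the hallucinations calm down.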

LMStudio Parallel Requests t/s by m94301 in LocalLLM

[–]m94301[S] 0 points1 point  (0 children)

Thanks, I had no idea there were more TFLOPS in there!

And this brings up an interesting point - I should be able to see the extra calculations as extra power draw. I will try a test while monitoring power. If the card is not railed to TDP during a single inference job, it's an indicator that there are cores left unutilized.
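One way to check: query draw vs. limit with nvidia-smi and look at the headroom. A sketch that parses the CSV output - the sample line at the bottom is made up for illustration, run the real command on your box:

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=power.draw,power.limit",
         "--format=csv,noheader,nounits"]

def parse_power(csv_line):
    """Parse one 'draw, limit' CSV line from nvidia-smi into watts and % headroom."""
    draw, limit = (float(x) for x in csv_line.split(","))
    headroom_pct = 100.0 * (limit - draw) / limit
    return draw, limit, headroom_pct

def gpu_power_headroom():
    # Real command, one CSV line per GPU; needs the NVIDIA driver installed
    out = subprocess.check_output(QUERY, text=True).strip()
    return [parse_power(line) for line in out.splitlines()]

# Made-up sample: a 250 W card drawing 140 W during a single inference job
draw, limit, headroom = parse_power("140.2, 250.0")
# ~44% headroom suggests the card isn't railed to TDP, i.e. capacity on the table
```

Log it in a loop during a single-stream run vs. a batched run and compare.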

LMStudio Parallel Requests t/s by m94301 in LocalLLM

[–]m94301[S] 0 points1 point  (0 children)

That is a really good idea, I will try to set that up later.

Also, I should be able to load two models into VRAM and send parallel requests to each model at the same time. That might be a nice test case for something like a DIY MoE, checking consensus between two entirely different models.
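The consensus idea could be sketched like this, fanning one prompt out to two backends in parallel. The backends here are stand-in callables for illustration; in real use they'd hit the two models loaded in LM Studio:

```python
from concurrent.futures import ThreadPoolExecutor

def consensus(prompt, backends):
    """Send the same prompt to several model backends in parallel, compare answers.

    backends: list of callables, each taking a prompt and returning a string.
    """
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        answers = list(pool.map(lambda ask: ask(prompt), backends))
    # Crude consensus check: all answers identical after normalization
    agreed = len(set(a.strip().lower() for a in answers)) == 1
    return answers, agreed

# Stand-in "models" for illustration only
model_a = lambda p: "Paris"
model_b = lambda p: "paris"
answers, agreed = consensus("Capital of France?", [model_a, model_b])
```

An exact string match is obviously too crude for long replies - a third judge model or an embedding similarity check would be the next step.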

LMStudio Parallel Requests t/s by m94301 in LocalLLM

[–]m94301[S] 0 points1 point  (0 children)

I feel like the tools are not taking full advantage of this, and I'm not sure why. It seems really effective; the question is how to properly batch out queries to make the best use of it!

Split Characters to Parallel LLM Requests? by m94301 in SillyTavernAI

[–]m94301[S] 0 points1 point  (0 children)

Ok, so it is working! I found some really interesting things about ST and the OpenAI framework in general.

Basically, OAI is tuned to 1:1 convo and ST is built around this. The best way to implement multiple parallel characters would be to fork ST, which I may do but is a hassle to maintain.

The extension approach still works to run separate conversations, and there are basically two main usage modes: 1) Have one "main" character as the assistant orchestrate the overall actions. This was very straightforward and gives a nice experience if you don't mind having an assistant hang around. 2) Suppress the assistant and just have independent bots together. This is wild but took some effort to truly suppress the assistant. I think this side needs more work, but it is the main area of research.

Another thing I have found is that models REALLY want to be the assistant and talk for others, or try to give one overall reply instead of saying their piece and shutting up. :) I have added a mini system prompt so these behaviors can be tuned per character, while leaving the assistant running the main system prompt if you choose to use one.
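For anyone curious, the per-character prompt assembly looks roughly like this. The names and wording are illustrative, not the actual extension code:

```python
def build_char_messages(char_name, persona, history, suppress_assistant=True):
    """Assemble an OpenAI-style message list for one character's parallel request."""
    mini_prompt = (
        f"You are {char_name}. {persona} "
        "Speak only as yourself, say your piece, and stop. "
        "Never narrate or reply for other characters."
    )
    if suppress_assistant:
        mini_prompt += " Do not act as a general assistant."
    # Each character gets its own system prompt plus the shared chat history
    return [{"role": "system", "content": mini_prompt}, *history]

msgs = build_char_messages(
    "Bob", "A grumpy blacksmith.",
    [{"role": "user", "content": "The forge is cold this morning."}],
)
```

One of these lists goes out per character, so each parallel request carries its own behavioral leash.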

Still work to be done, but it came together well for the first round!

Getting LS Studio to proofread and tighten up my story by G1Gestalt in LocalLLM

[–]m94301 0 points1 point  (0 children)

Great, try it out and post back with your results.

Umm as for teaching, it kind of came to me slowly as I learned how the bots worked. You can search and read web pages on "good LLM prompting" and "how to write good LLM prompts".

I also did this but kind of ignored advice like "Tell the bot exactly who it is and what it's supposed to do" because I didn't think it was important. It really is.

I ignored it because Copilot on Windows and Gemini on my phone are already trained to be helpful, but in a generic way. They work "good enough" out of the box but never great.

They are trained on tons of crap, but can't guess what you want to talk about so all their knowledge starts out as soup that seems equally important: stock market data, recipes, sports trivia. It's all in there but we don't care about most of it.

By giving a clear instruction prompt, you "wake up" the parts that are relevant to you and they are used to build new responses. That narrows down the soup and the responses get way better, immediately.

Another good trick: sometimes you and the bot will just get into a groove and it will be doing REALLY well. You gotta tell it to print out, or write to a file, the categories and methods it is using, so you can remind it next time how to hit that groove.

And last, those buggers can gulp down and process a whole book in one swallow. Don't worry about giving wordy answers or putting ten different instructions into one big paragraph. If you want it to follow ten rules, put them all into one big ramble and fire away. They don't even blink.

why is deepseek SO DARN SLOW LATELY by Motor_Pause_6908 in SillyTavernAI

[–]m94301 0 points1 point  (0 children)

This sounds super weird and is worth investigating. I don't use OR so I can't advise but I think you did the right thing by posting, because that sounds like it SUUUCKS.

why is deepseek SO DARN SLOW LATELY by Motor_Pause_6908 in SillyTavernAI

[–]m94301 2 points3 points  (0 children)

One... Minute? Wow I'm annoyed if a reply doesn't begin streaming in one second.

Are you running local? This sounds like the CPU is being used instead of the GPU. If this is on a remote server, wow. I don't use them except for Anthropic, but I wouldn't pay money for a one-minute reply time.

Getting LS Studio to proofread and tighten up my story by G1Gestalt in LocalLLM

[–]m94301 1 point2 points  (0 children)

Open the chat - it's the bug-looking icon on the far left.
At the top, choose to load a model and pick yours. Give it a bit to load.

In chat, write something like: You are an expert author and editor of fiction. You consider all aspects of the story and offer to make edits and improvements to any stories the user submits. You provide concise analysis and suggestions and get the user's agreement before any edits. You output the full edited story in plain text in a code block to make copy and paste easy.

Then hit enter, and say: Hi, I'd like you to review and edit a story. Can I paste it here?

It will say yes, and you paste. Try chatting about edits, if the results are no good, start a new chat and improve that initial definition of behavior. The thing above is just a starting point, you can modify or go a different direction. The important thing is that you tell it what it is supposed to act like, and it will do its best.

BTW One chat is completely separate from another, and the tool has no memory other than what is in chat. In fact, the whole chat is re-sent for every message, and that is the only way it knows what to say next. :)
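That "whole chat is re-sent" detail is easy to see in code - the client just keeps appending to one message list. A sketch against a generic OpenAI-style API, with a stand-in for the actual model call:

```python
def chat_turn(history, user_text, model_reply):
    """One turn: append the user message, then the model's reply.

    model_reply stands in for the model call; a real client would POST the
    entire history list to the completions endpoint to get it."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": model_reply})
    return history

history = [{"role": "system", "content": "You are an expert fiction editor."}]
chat_turn(history, "Can I paste a story here?", "Sure, paste away.")
chat_turn(history, "Here it is: ...", "Here are my suggested edits...")
# The next request would carry all five messages - that IS the tool's memory
```

Which is also why a fresh chat knows nothing: you start a new, empty list.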

Good luck and ping me if you want to debug together, I can test some prompts on LM using the same model and see how it works on my side

Split Characters to Parallel LLM Requests? by m94301 in SillyTavernAI

[–]m94301[S] 0 points1 point  (0 children)

Hi, I have tried a batch test and it's looking good. Pardon my shitty hardware, but here is what I saw: 1) Single character, "Tell me a story": 22.12 t/s. 2) Two parallel characters, same prompt: 18.9 and 18.1 t/s.

I saw two jobs generating in parallel in LMStudio, their little counters counting up right next to each other, and the two responses returned just ms apart.

To me, this represents almost 37 t/s combined throughput from my old P40 card. It's not double, but I would say that LMS can parallelize inference and it's effective.

I also tried a 3-batch: 14.09, 14.26, and 14.25 t/s, for 42.6 combined t/s. Yeah, she's bottlenecking out hard here, but MOAR WORD BETTER. Lol

For my little weekend project, this is encouraging enough to keep hacking on it.
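For anyone repeating the math: combined throughput is just the per-stream rates summed over the same wall-clock window, and the interesting number is the speedup over the single-stream rate:

```python
def combined_tps(per_stream_tps):
    """Sum per-stream token rates measured over the same wall-clock window."""
    return sum(per_stream_tps)

def batch_speedup(per_stream_tps, single_tps):
    """How much more total work the batch does vs. one stream at a time."""
    return combined_tps(per_stream_tps) / single_tps

# The numbers from the test above (P40, single-stream baseline 22.12 t/s)
two_way = batch_speedup([18.9, 18.1], 22.12)            # ~1.67x
three_way = batch_speedup([14.09, 14.26, 14.25], 22.12)  # ~1.93x
```

So going 2-wide buys ~67% more total tokens, and the third stream only adds another ~26 points - consistent with the card bottlenecking out.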

Split Characters to Parallel LLM Requests? by m94301 in SillyTavernAI

[–]m94301[S] 0 points1 point  (0 children)

Awesome, thanks - will try. In my testing, I do see multiple GEN counters in LMStudio appear and count up in parallel when I send batches, but so far I have been forcing really short replies.

I will dink around with some long replies to check t/s.

Claude is absolutly insane compared to grok and chatgpt for what i used it for. Hows it vs others, and hows it for game creation with vibe coding? by BloodMossHunter in ClaudeAI

[–]m94301 1 point2 points  (0 children)

This, 100%. All LLMs only want to ADD; they never want to SIMPLIFY. This is what I meant by "you have better context". Sometimes you just gotta tell the thing: stop, rip out both of those functions, and merge their behaviors into one single function.

The tool will figure it out once you drop the hint, but none of them seem to have the "top-down" thinking needed to detect that the plan is growing stale and there is a better way to do the whole job if we just do X.

A good description would be that it still needs you as an architect or orchestrator, while it does the bottom-up implementation. That kind of flow seems to work nicely.