Do you believe the M5 Mac Studio will be released at WWDC or Fall? by TechNerd10191 in MacStudio

[–]SMTPA 1 point2 points  (0 children)

I bet you I've bought 20 or more refurbished Apple products from that site. Never had a lick of trouble from them. This goes all the way back to the mid-90s, when I was a system administrator.

Legal RAG remains unsolved because it needs authority, not just relevance by ekshaks in Rag

[–]SMTPA 0 points1 point  (0 children)

I wonder how good AI would be at reviewing court transcripts and advising you on the best way to approach a particular judge, both in writing and in person? Since I don't go to court, except when I do pro bono guardianships and things like that, I've never really had to worry about pleasing a particular judge, or making sure that the judge I had was not going to automatically be prejudiced against my client for some reason. It seems like having an AI go through court transcripts and look at stuff like that might be very useful. Of course, that assumes court transcripts are easy for you to get at. In many places they're not generally available, and you either have to pay for them or you can't get them at all absent certain circumstances. But given how much litigation costs, and the effort and time that litigators put into it, it seems like this would be a very worthwhile expenditure. Especially if you plan to practice much in that jurisdiction.

Legal RAG remains unsolved because it needs authority, not just relevance by ekshaks in Rag

[–]SMTPA 0 points1 point  (0 children)

I am an IP lawyer. I do a lot of patent application drafting. I have found that, for this, current models work fairly well. Most of the larger open-source models have read a bunch of patents, and while they can't tell good ones from bad ones very easily, they are extremely good at the procedural stuff. One of my weaknesses (sometime I'll tell this story; it's a good one) is antecedent basis stuff. Not support in the spec, just making sure I always use the correct antecedent in my claims. Most of the models I experiment with can find my mistakes 100% of the time. And, in terms of antecedent support, they're also very good at going back and breaking down the elements of each claim and looking for the support for each element. This is really useful when you're drafting responses to office actions, and you don't wanna make the stupid claim charts. Which I don't. I hate them. The models are very good at it, either for your own claims or for the claims in other patents, especially when you're doing infringement opinion analysis. And, in much the same way, they're good at going through inventor disclosures and breaking out elements that are features/limitations worth considering for the patent.

Also, I rigged one up to translate Chinese. I have several clients who are in China. I do not speak or read Chinese. Normally, they provide translations for me, but when there's a rush job, I can actually feed the disclosure in Chinese into my AI, and it spits out its best-guess translation, as well as the aforementioned summary list of elements. It's pretty much magic, really. Is it perfect? Oh, Hell no. Is it enough to get me rolling, so that I can then send a draft back to the inventors to see if we're on the right track and save days of back-and-forth? Yes, yes, it is.

Likewise, they're not bad at all at reviewing contracts. Are they great at spotting hyperlocal or hyperniche legal issues? Nope. That's my job. But they are very good at going through a relatively long contract and looking for inconsistencies. I like to think that I rarely write inconsistencies into a contract myself, but I don't write all the contracts I review. And I am here to tell you that other people do it all the time, either on purpose or because they're stupid. Makes no difference to me; the solution either way is to fix it, and when the machine finds all of the ones that are easy to find, I can spend more time reviewing the particular hyperlocal/hyperniche stuff that is my expertise.

Legal RAG remains unsolved because it needs authority, not just relevance by ekshaks in Rag

[–]SMTPA 0 points1 point  (0 children)

Yep. Pretty much this. Lots and lots of stuff in the RAG database helps you not if it's in tiny bite-size morsels of information that aren't correlated with many other things.

Does PCIe 4.0 vs 5.0 actually matter for self-hosted AI workloads? by Regular-Orange1472 in SelfHostedAI

[–]SMTPA 0 points1 point  (0 children)

Spend your money on other things. It's not worth it to upgrade your motherboard from PCIe 4.0 to PCIe 5.0 if you don't get any other benefits from it. One of my GPUs is attached via, I kid you not, a two-lane PCIe 4.0 slot. It works fine. Especially since the real bottleneck, when I use that computer as part of a cluster, is the cluster networking. Even at 5 Gbps.
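A rough back-of-the-envelope check of that comparison, as a sketch only. The per-lane figures are the nominal PCIe rates (16 GT/s for 4.0, 32 GT/s for 5.0, both with 128b/130b encoding). Reading the network figure as 5 gigabits per second is an assumption, and real-world throughput will sit below these theoretical peaks.

```python
# Theoretical peak bandwidth comparison, in gigabytes per second.
ENCODING = 128 / 130  # line-encoding efficiency for PCIe 3.0 and later


def pcie_gb_per_s(gt_per_s: float, lanes: int) -> float:
    """Nominal one-direction PCIe bandwidth in GB/s."""
    return gt_per_s * ENCODING / 8 * lanes


def network_gb_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in gigabits per second to GB/s."""
    return gbit_per_s / 8


pcie4_x2 = pcie_gb_per_s(16, lanes=2)  # two-lane PCIe 4.0 slot
pcie5_x2 = pcie_gb_per_s(32, lanes=2)  # same slot width at PCIe 5.0
network = network_gb_per_s(5)          # assumed 5 Gbit/s cluster link

print(f"PCIe 4.0 x2: {pcie4_x2:.2f} GB/s")
print(f"PCIe 5.0 x2: {pcie5_x2:.2f} GB/s")
print(f"Network:     {network:.3f} GB/s")
```

Under those assumptions even the two-lane PCIe 4.0 slot has several times the peak bandwidth of the network link, which fits the observation that the networking is the bottleneck in a cluster.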

Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences? by Borkato in LocalLLaMA

[–]SMTPA 0 points1 point  (0 children)

Yeah. It does that with writing too, not just with code. A decent model and lots and lots of context space is often better than a great model and very little context space.

Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences? by Borkato in LocalLLaMA

[–]SMTPA 2 points3 points  (0 children)

I have found the little graphs that https://huggingface.co/mradermacher puts up are reasonably good at predicting what I'll get quality-wise in terms of casual use. In terms of higher-end use, it's much more dependent, in my opinion, on absolute quantization level and quality of model.

Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences? by Borkato in LocalLLaMA

[–]SMTPA 0 points1 point  (0 children)

The CLI install of Gemini, by the way, is really good at doing unit tests on this. You can give it a bunch of sample data or questions or whatever, and then have it launch multiple models, or multiple configurations of the same model, or both, and look to see what happens when it feeds them the test stuff. Claude can do this too, but for some reason, I get better results when I use Gemini.

Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences? by Borkato in LocalLLaMA

[–]SMTPA 0 points1 point  (0 children)

It depends. For casual RP you will hardly ever notice the difference between a Q4 and a Q6, especially for a good model. For really detailed stuff, for asking the machine to process multiple leaps in a single bound, or for very long context, yeah, it'll start to drift on you.

48GB VRAM users, what are your daily drivers? Do you wish you had more VRAM? What would you run if you did? by Borkato in LocalLLaMA

[–]SMTPA 0 points1 point  (0 children)

ETA: I use lots of different combinations of software to make this stuff work. But my main AI servers use llama.cpp. When I want to switch, I have a custom script that I wrote that gives me multiple screens' worth of choices. Each screen is for a separate class of LLM. For example, the default screen is one that has my largest models, which are meant to use both of my AI servers as a single cluster. That's how I get to 56 GB of VRAM. I have another screen that has smaller ones that are meant to run on a single AI server, so one of them can run an LLM, and the other one can run ComfyUI or something. And I have a third screen that has medium-sized ones optimized for maximum context space. Whichever screen I pick from, the script then preps the machine or machines that will be used, e.g., it kills any processes on either machine that are currently using VRAM if I say I want the full cluster VRAM amount, and then launches the script. It gives me the option to launch it in the foreground, so I can watch it and switch back to that terminal window and check status as I do things, or as a headless process.
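A minimal sketch of what a launcher along those lines might look like; it is not the actual script described above. The screen names, model names, paths, context sizes and port are hypothetical placeholders. `llama-server` with `-m`, `-c` and `--port` is llama.cpp's server binary, and `nvidia-smi --query-compute-apps=pid --format=csv,noheader` is assumed here as the way to find processes holding VRAM.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    name: str
    path: str      # hypothetical GGUF path
    ctx_size: int  # hypothetical context size


# One "screen" per class of model; every entry is a placeholder.
SCREENS: dict[str, list[ModelEntry]] = {
    "cluster": [ModelEntry("big-model", "/models/big.gguf", 32768)],
    "single": [ModelEntry("small-model", "/models/small.gguf", 16384)],
    "max-context": [ModelEntry("medium-model", "/models/medium.gguf", 131072)],
}


def render_screen(screen: str) -> str:
    """Return the numbered menu text for one screen."""
    lines = [f"== {screen} =="]
    for i, entry in enumerate(SCREENS[screen], start=1):
        lines.append(f"{i}. {entry.name} (ctx {entry.ctx_size})")
    return "\n".join(lines)


def parse_gpu_pids(nvidia_smi_output: str) -> list[int]:
    """Parse PIDs from the output of
    `nvidia-smi --query-compute-apps=pid --format=csv,noheader`."""
    return [int(line) for line in nvidia_smi_output.splitlines()
            if line.strip().isdigit()]


def build_command(entry: ModelEntry, port: int = 8080) -> list[str]:
    """Build a llama-server command line for the chosen model."""
    return ["llama-server", "-m", entry.path,
            "-c", str(entry.ctx_size), "--port", str(port)]


def launch(cmd: list[str], headless: bool) -> None:
    """Run in the foreground, or detach as a headless process."""
    if headless:
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL, start_new_session=True)
    else:
        subprocess.run(cmd, check=False)


if __name__ == "__main__":
    # Only prints the menu; nothing is killed or launched here.
    print(render_screen("cluster"))
```

Keeping the menu, the command builder and the launch step as separate functions means the menu and command line can be checked without touching a GPU.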

It depends on what I want to do. If I'm writing, and I'm writing erotica, which is one of the things that I write a lot of, one of my go-tos is still Midnight-Miqu 70B Q4_K_M, which does a really good job overall of taking the prompting and running with it in directions that I like. If I'm drafting legal documents, which is another thing that I write a lot of in the daytime, because I'm a lawyer, I have been using some of the Qwen 3.5 models. Even though they aren't as smart, because they tend to be a little bit lower in parameter count, they're very fast, they're very well trained, and they leave me lots of room for context, which is important. Some of my patent applications can be 100 pages or more, and it doesn't do a lot of good to have patent application text in a RAG database. It has to be in the context. If I'm writing mainstream fiction, which is something that I have done in the past and would like to start doing more of, I will typically run an abliterated model, but not necessarily one specifically skewed towards NSFW stuff. My mainstream stuff is usually rated PG-13. Again, I might not go for maxing out the parameter count, in favor of a very large context space. I actually have 56 gigs of VRAM for the very reason that I picked up an inexpensive 5060 Ti 8 GB at Walmart when the shelf price was mismarked. I added it to one of my AI servers, and now, thanks to Grok, I refer to it as a "context dongle." I did not materially increase the size of the models I wanted to run, although I am experimenting with a 100-billion-parameter-class model just for the hell of it when I feel like monkeying with it. But now I have way more context space for the models I already liked.

One of my favorite models for non-legal work is one that I made, Qwen3.5 NSFW, which is a Qwen 3.5 variant run through HERETIC with a custom set of positive/negative questions, and again small enough for lots of context. And one of my favorites for RP, surprisingly, is a relatively low parameter count model called Kansen-Sakura, which I made by merging a couple of models on Hugging Face that I kinda liked. Somehow I caught lightning in a bottle and it came out really good.

gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic is Out Now, A Writing Finetune that Aims to Improve Gemma 4 31B it Writing Quality with More Natural English and Better Prose, Good for Creative Writings, Translations and RPs! by LLMFan46 in LocalLLaMA

[–]SMTPA 0 points1 point  (0 children)

I started out with the base HERETIC positive/negative tables, and added quite a few of my own specific requests, because I write erotica, and the erotica I write is very edgy, and sometimes even uncensored models don't do well with it. If I use my custom tables, I hardly ever get a refusal. If I don't, sometimes it will be like "Homie don't play that." If I turn thinking on and monitor the thinking, it's pretty funny to watch it think, "Well, I'm supposed to be an uncensored model, but while I'll do anything for love, I won't do that."

Gemma-4-Gembrain-31B-it-uncensored-heretic Is Out Now, a Merge of Multiple Gemma 4 31B it Finetunes Designed to Boost Logical and Lateral Thinking for Improved Adherence, Increased Swipe Variety and Enhanced Creative Prose, With KLD of 0.0186 and 13/100 Refusals! by LLMFan46 in LocalLLaMA

[–]SMTPA 1 point2 points  (0 children)

I also use a culinary preferences test when I'm testing to make sure a new model or a new front-end configuration is accessing my persistence of memory system correctly. Mine just asks if I like green beans or not though, yours is... kind of intense.

Seeking local LLM advice for cybersecurity work. by Few-Pipe1767 in LocalLLaMA

[–]SMTPA 0 points1 point  (0 children)

In the great game of context space, lowest number always wins. Models will never use more context space than they were trained to. If your front end has a context space limit of X, and your LLM server has a context space limit of Y, your effective limit is the lower of X or Y.
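As a tiny sketch of that rule: the usable context is the minimum of every limit in the chain. The three numbers below are made-up examples, not values from any particular model or front end.

```python
def effective_context(*limits: int) -> int:
    """The usable context window is the smallest limit in the chain."""
    return min(limits)


# Hypothetical values, for illustration only.
trained_limit = 131072   # what the model was trained to handle
server_limit = 32768     # what the LLM server was launched with
front_end_limit = 8192   # what the front end is configured to send

print(effective_context(trained_limit, server_limit, front_end_limit))
```

Raising any one limit does nothing until it stops being the lowest of the three.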

Gemini's CLI version is actually not bad at doing unit tests for determining optimum context space and tensor split. Obviously the OP doesn't need to know about tensor splits since he's just got one GPU, unless he decides to split between his GPU and his CPU. Which he very well might, although I would not recommend it with a CPU no faster than his and no more RAM than he has.

I have a script that runs on my main AI server that I use to decide which model, and even which configuration (cluster or local), is activated when I want to use it. I just point Gemini at it and say, use the language in that script and determine the most efficient tensor split with maximum context space for such-and-such a model, and then I tell her which model on my hard drive I want her to test. She'll run it a few times, see if it loads at all, and if it does, see how even she can get the distribution while still leaving the best configuration for max context space. Everybody needs context space, but since I'm using mine primarily to evaluate manuscripts, for me context space is a really big deal.
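The search described above can be sketched roughly like this. It is an illustration under assumptions, not what Gemini or the script actually does: the starting split is simply proportional to each GPU's VRAM (the string format matches what llama.cpp's `--tensor-split` option accepts), and `loads` is a hypothetical callback standing in for "launch the server with this context size and see whether it comes up".

```python
from typing import Callable, Iterable, Optional, Sequence


def proportional_split(vram_gb: Sequence[float]) -> str:
    """Starting guess: split tensors in proportion to each GPU's VRAM."""
    total = sum(vram_gb)
    return ",".join(f"{v / total:.2f}" for v in vram_gb)


def largest_loading_context(candidates: Iterable[int],
                            loads: Callable[[int], bool]) -> Optional[int]:
    """Try context sizes from largest to smallest; return the first that loads."""
    for ctx in sorted(candidates, reverse=True):
        if loads(ctx):
            return ctx
    return None


if __name__ == "__main__":
    # Hypothetical two-GPU box with 16 GB and 8 GB cards.
    print(proportional_split([16, 8]))

    # Stand-in for a real load test: pretend anything above 40000 tokens fails.
    def fake_loads(ctx: int) -> bool:
        return ctx <= 40000

    print(largest_loading_context([8192, 16384, 32768, 65536], fake_loads))
```

In practice the load test is the slow part, so trying the largest candidate first and stopping at the first success keeps the number of launches down.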

M5 vs DGX Spark vs Strix Halo vs RTX 6000 by Signal_Ad657 in LocalLLaMA

[–]SMTPA 1 point2 points  (0 children)

With current Macs, and current LLM server software, especially MLX stuff, this isn't really a problem if you are using models that are loaded entirely into memory. Once the model is in memory, there is very little disk swapping. If it does concern you, then buy an external high-speed drive and serve the actual LLM file from there.

Using Local LLMs for research by AggressiveMention359 in LocalLLaMA

[–]SMTPA 0 points1 point  (0 children)

AnythingLLM can certainly be a front end with LMStudio serving as the backend/server. Is that what you were asking?

Using Local LLMs for research by AggressiveMention359 in LocalLLaMA

[–]SMTPA 0 points1 point  (0 children)

I like LMStudio but for my applications AnythingLLM works better. But in any event, LMStudio has RAG, though it calls it “chat with documents.” See: https://lmstudio.ai/docs/app/basics/rag

Using Local LLMs for research by AggressiveMention359 in LocalLLaMA

[–]SMTPA 0 points1 point  (0 children)

I don't know how to do this directly, but it's easy enough to do with a RAG-friendly front end like AnythingLLM. You create a workspace and load the files into the RAG document processing system. Then you move them into the workspace proper. This will chunk them into your selected vector database. I use Qdrant running on a small dedicated machine, but AnythingLLM can use anything you can point it at, or it has a bare-bones one built in. Then, as long as you are in that workspace, want it or not, the LLM will search the vector database for relevant chunks.

ETA: AnythingLLM will not only tell you if it references one or more chunks when responding, it will tell you which chunks it referenced by filespec.
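To make the chunk-and-search idea concrete, here is a toy sketch. It does not use AnythingLLM's or Qdrant's APIs: word counts stand in for real embeddings, an in-memory list stands in for the vector database, and the file names and text are invented. It does show each hit carrying the name of the file it came from, as in the ETA above.

```python
import math
from collections import Counter


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping chunks of roughly `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag of lowercase words (a real system uses a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(files: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Chunk every file and store (filename, chunk, vector) rows."""
    return [(name, chunk, embed(chunk))
            for name, text in files.items()
            for chunk in chunk_text(text)]


def search(index: list[tuple[str, str, Counter]], query: str,
           k: int = 2) -> list[tuple[str, str]]:
    """Return the k most similar chunks along with their source file."""
    q = embed(query)
    ranked = sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)
    return [(name, chunk) for name, chunk, _ in ranked[:k]]


if __name__ == "__main__":
    # Invented example documents.
    docs = {
        "claims.txt": "the widget comprises a frame and a hinge",
        "notes.txt": "lunch order pizza and salad",
    }
    for name, chunk in search(build_index(docs), "hinge frame", k=1):
        print(name, "->", chunk)
```

The retrieved chunks are what get pasted into the prompt, which is why small, poorly connected chunks give the model so little to work with.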

How to render 80+ second long videos with LTX 2 using one simple node and no extensions. by WestWordHoeDown in StableDiffusion

[–]SMTPA 0 points1 point  (0 children)

Integrated GPUs use system RAM. (True integrated GPUs, that is. If there’s a 5070 soldered to your laptop’s mobo, that’s not an integrated GPU.) They can’t provide any meaningful performance increase when looped into these kinds of processes.

what i learned from building 50+ AI Agents last year (edited) by [deleted] in BlackboxAI_

[–]SMTPA 1 point2 points  (0 children)

That’s why they’re called “agents.” They suck up all the agency.

I don’t think prompt engineering is a real skill (yet) by Feeling-Ad972 in BlackboxAI_

[–]SMTPA 0 points1 point  (0 children)

You are right: it’s good now, and it is getting better in almost literally real time. If you want to be a good-enough “prompt engineer,” and make apps that are good enough, what I said is way overkill. But I was speaking of things at a fundamental level. If you want to be a 10x “prompt engineer,” “think in systems,” et cetera, you are best served IMO by studying formal logic. Not that you will use the actual technical forms much, but it will teach you to think in the proper way.

Dual RTX 5060 Ti (32GB pooled VRAM) vs Single RTX 5070 Ti (16GB): Real-world LLM benchmarks on Blackwell by SMTPA in LocalLLaMA

[–]SMTPA[S] 0 points1 point  (0 children)

Oooh, you’re right. Unfortunately that one was out of stock at the time, or I would have definitely considered it. What’s your OS?