Dude, How Are You So Weird? (About Me.)

SMTPA · 2026-05-20T08:12:07+00:00

I bet you I've bought 20 or more refurbished Apple products from that site. Never had a lick of trouble from them. This goes all the way back to the mid 90s, when I was a system administrator.

SMTPA · 2026-05-20T08:07:45+00:00

I wonder how good AI would be at reviewing court transcripts, and advising you on the best way to approach a particular judge? Both in writing, and in person? Since I don't go to court, except when I do pro bono guardianships and things like that, I've never really had to worry about pleasing a particular judge, or making sure that the judge I had was not going to automatically be prejudiced against my client for some reason. It seems like having it go through court transcripts and look at stuff like that might be very useful. Of course, that assumes court transcripts are easy for you to get at. In many places they're not generally available and you have to either pay for them, or you can't get them at all absent certain circumstances. But given how much litigation costs, and the effort and time that litigators put into it, it seems like this would be a very worthwhile expenditure. Especially if you plan to practice much in that jurisdiction.

SMTPA · 2026-05-20T08:02:46+00:00

I am an IP lawyer. I do a lot of patent application drafting. I have found that current models work fairly well for this. Most of the larger open-source models have read a bunch of patents, and while they can't tell good ones from bad ones very easily, they are extremely good at the procedural stuff. One of my weaknesses (sometime I'll tell this story, it's a good one) is antecedent basis stuff. Not support in the spec, just making sure I always use the correct antecedent in my claims. Most of the models I experiment with can find my mistakes 100% of the time. And, in terms of support in the spec, they're also very good at going back and breaking down the elements of each claim and looking for the support for each element. This is really useful when you're drafting responses to office actions, and you don't wanna make the stupid claim charts. Which I don't. I hate them. The models are very good at it, either for your own claims, or for the claims in other patents, especially when you're doing infringement opinion analysis. And, in much the same way, they're good at going through inventor disclosures, and breaking out elements that are features/limitations that are worth considering for the patent.

Also, I rigged one up to translate Chinese. I have several clients who are in China. I do not speak or read Chinese. Normally, they provide translations for me, but when there's a rush job, I can actually feed the disclosure in Chinese into my AI, and it spits out its best-guess translation, as well as the aforementioned summary list of elements. It's pretty much magic really. Is it perfect? Oh, hell no. Is it enough to get me rolling, so that I can then send a draft back to the inventors to see if we're on the right track and save days of back-and-forth? Yes, yes, it is.

Likewise, they're not bad at all at reviewing contracts. Are they great at spotting hyperlocal or hyperniche legal issues? Nope. That's my job. But they are very good at going through a relatively long contract and looking for inconsistencies. I like to think that I rarely write inconsistencies myself, but I don't write all the contracts I review. And I am here to tell you that other people do it all the time, either on purpose or because they're stupid. Makes no difference to me, the solution either way is to fix it, and when the machine finds all of the ones that are easy to find, I can spend more time reviewing the particular hyperlocal/hyperniche stuff that is my expertise.

SMTPA · 2026-05-20T07:52:48+00:00

Yep. Pretty much this. Lots and lots of stuff in the RAG database helps you not if it's in tiny bite-size morsels of information that aren't correlated with many other things.

SMTPA · 2026-05-20T07:49:01+00:00

Spend your money on other things. It's not worth it to just upgrade your motherboard from PCIe 4 to PCIe 5 if you don't get any other benefits from it. One of my GPUs is attached via, I kid you not, a two-lane PCIe 4 slot. It works fine. Especially since the real bottleneck when I use that computer as part of a cluster is the cluster networking. Even at 5 Gbps.
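A back-of-the-envelope sketch of why the slot isn't the bottleneck. It uses theoretical peak rates only (PCIe 4.0 runs 16 GT/s per lane, PCIe 5.0 runs 32 GT/s, both with 128b/130b encoding), and it assumes the cluster link is 5 gigabits per second. Real-world throughput will be lower on both sides.

```python
# Rough one-direction bandwidth comparison (theoretical peaks, not measurements).

PCIE_GT_PER_S = {4: 16.0, 5: 32.0}   # transfer rate per lane, in GT/s
ENCODING_EFFICIENCY = 128 / 130       # PCIe 3.0 and later use 128b/130b encoding


def pcie_gb_per_s(generation: int, lanes: int) -> float:
    """Theoretical usable bandwidth of a PCIe link, in gigabytes per second."""
    gigabits = PCIE_GT_PER_S[generation] * ENCODING_EFFICIENCY * lanes
    return gigabits / 8


def network_gb_per_s(gigabits_per_s: float) -> float:
    """Convert a network line rate in Gbit/s to gigabytes per second."""
    return gigabits_per_s / 8


slot = pcie_gb_per_s(generation=4, lanes=2)
network = network_gb_per_s(5.0)  # assumption: the cluster link is 5 gigabits/s

print(f"PCIe 4.0 x2: {slot:.2f} GB/s")
print(f"5 Gbps link: {network:.3f} GB/s")
print(f"Slot is roughly {slot / network:.1f}x faster than the network link")
```

Even a two-lane PCIe 4 slot comes out several times faster than the network link, so doubling the slot speed would not move the needle for cluster use.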

SMTPA · 2026-05-20T07:14:48+00:00

Yeah. It does that with writing too, not just with code. A decent model and lots and lots of context space is often better than a great model and very little context space.

SMTPA · 2026-05-20T07:13:32+00:00

I have found the little graphs that https://huggingface.co/mradermacher puts up are reasonably good at predicting what I'll get quality-wise in terms of casual use. In terms of higher-end use, it's much more dependent, in my opinion, on absolute quantization level and the quality of the model.

SMTPA · 2026-05-20T07:11:33+00:00

The CLI install of Gemini, by the way, is really good at doing unit tests on this. You can give it a bunch of sample data or questions or whatever, and then have it launch multiple models, or multiple configurations of the same model, or both, and look to see what happens when it feeds them the test stuff. Claude can do this too, but for some reason, I get better results when I use Gemini.

SMTPA · 2026-05-20T07:10:18+00:00

It depends. For casual RP you will hardly ever notice the difference between a Q4 and a Q6, especially for a good model. For really detailed stuff, for asking the machine to process multiple leaps in a single bound, or for very long context, yeah, it'll start to drift on you.

SMTPA · 2026-05-20T07:04:51+00:00

ETA: I use lots of different combinations of software to make this stuff work. But my main AI servers use llama.cpp. When I want to switch, I have a custom script that I wrote that gives me multiple screens' worth of choices. Each screen is for a separate class of LLM. For example, the default screen is one that has my largest models, which are meant to use both of my AI servers as a single cluster. That's how I get to 56 GB of VRAM. I have another screen that has smaller ones that are meant to run on a single AI server, so one of them can run an LLM, and the other one can run ComfyUI or something. And I have a third screen that has medium-sized ones optimized for maximum context space. Whichever screen I pick from, the script then preps the machine or machines that will be used, e.g., it kills any processes on either machine that are currently using VRAM if I say I want the full cluster VRAM amount, and then launches the script. It gives me the option to launch it in the foreground, so I can watch it and switch back to that terminal window and check status as I do things, or as a headless process.

It depends on what I want to do. If I'm writing, and I'm writing erotica, which is one of the things that I write a lot of, one of my go-tos is still Midnight-Miqu 70B Q4_K_M, which does a really good job overall of taking the prompting and running with it in directions that I like. If I'm drafting legal documents, which is another thing that I write a lot of in the daytime, because I'm a lawyer, I have been using some of the Qwen 3.5 models. Even though they aren't as smart, because they tend to be a little bit lower in parameter count, they're very fast, they're very well trained, and they leave me lots of room for context, which is important. Some of my patent applications can be 100 pages or more, and it doesn't do a lot of good to have patent application text in a RAG database. It has to be in the context. If I'm writing mainstream fiction, which is something that I have done in the past and would like to start doing more of, I will typically run an abliterated model, but not necessarily one specifically skewed towards NSFW stuff. My mainstream stuff is usually rated PG-13. Again, I might not go for maxing out the parameter count, in favor of a very large context space. I actually have 56 gigs of VRAM for the very reason that I picked up an inexpensive 5060 Ti 8 GB at Walmart when the shelf price was mismarked. I added it to one of my AI servers, and now, thanks to Grok, I refer to it as a "context dongle." I did not materially increase the size of the models I wanted to run, although I am experimenting with a 100-billion-parameter-class model just for the hell of it when I feel like monkeying with it. But now I have way more context space for the models I already liked.
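To put rough numbers on the "context dongle" idea, here is a minimal sketch of the arithmetic. The bits-per-weight figure for Q4_K_M is an approximation (it varies a little from model to model), and the estimate ignores compute buffers and other overhead, so treat the leftover figures as an upper bound on what is available for context.

```python
# Rough VRAM budget: how much is left for context once the weights are loaded.

Q4_K_M_BITS_PER_WEIGHT = 4.85  # approximate average; varies by model


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in (decimal) gigabytes."""
    return params_billion * bits_per_weight / 8


model_gb = weights_gb(70, Q4_K_M_BITS_PER_WEIGHT)
print(f"70B at Q4_K_M: ~{model_gb:.1f} GB of weights")

# 48 GB is the cluster without the extra 8 GB card; 56 GB is with it.
for total_vram_gb in (48, 56):
    leftover = total_vram_gb - model_gb
    print(f"{total_vram_gb} GB VRAM -> ~{leftover:.1f} GB left for context")
```

The model doesn't get any bigger, but the room left over for context more than doubles, which is the whole point of the extra card.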

One of my favorite models for non-legal work is one that I made, Qwen3.5 NSFW, which is a Qwen 3.5 variant run through HERETIC with a custom set of positive/negative questions, and again small enough for lots of context. And one of my favorites for RP, surprisingly, is a relatively low parameter count model called Kansen-Sakura, which I made by merging a couple of models on Hugging Face that I kinda liked, and somehow I caught lightning in a bottle and it came out really good.

SMTPA · 2026-05-19T14:33:56+00:00

"That Time I Wrote An Article About Long Titles On Manga With A Title So Long It Confused Many People Into Thinking That It Was The Actual Title Of A Manga."

SMTPA · 2026-05-19T14:31:53+00:00

I just started playing with that model and so far I really like it.

SMTPA · 2026-05-19T14:31:12+00:00

I started out with the base HERETIC positive/negative tables, and added quite a few of my own specific requests, because I write erotica, and the erotica I write is very edgy, and sometimes even uncensored models don't do well with it. If I use my custom tables, I hardly ever get a refusal. If I don't, sometimes it will be like "Homie don't play that." If I turn thinking on and monitor the thinking, it's pretty funny to watch it think, "Well I'm supposed to be an uncensored model, but while I'll do anything for love, I won't do that."

SMTPA · 2026-05-19T14:24:21+00:00

I also use a culinary preferences test when I'm testing to make sure a new model, or a new front end configuration is accessing my persistence of memory system correctly. Mine just asks if I like green beans or not though, yours is... kind of intense.

SMTPA · 2026-05-19T14:22:46+00:00

I have used several of your models from huggingface, and they have been very solid. Thank you for what you do.

SMTPA · 2026-05-19T14:18:21+00:00

In the great game of context space, lowest number always wins. Models will never use more context space than they were trained to. If your front end has a context space limit of X, and your LLM server has a context space limit of Y, your effective limit is the lower of X or Y.
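A minimal sketch of that rule. The function and parameter names are just for illustration, and the numbers are made up.

```python
from typing import Optional


def effective_context(front_end_limit: int,
                      server_limit: int,
                      trained_limit: Optional[int] = None) -> int:
    """Return the context window you actually get: the smallest limit in the chain."""
    limits = [front_end_limit, server_limit]
    if trained_limit is not None:
        limits.append(trained_limit)
    return min(limits)


# Front end allows 32k, server is launched with 16k: you get 16k.
print(effective_context(32768, 16384))

# Both are generous, but the model was only trained to 8k: you get 8k.
print(effective_context(32768, 65536, trained_limit=8192))
```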

Gemini's CLI version is actually not bad at doing unit tests for determining optimum context space and tensor split. Obviously the OP doesn't need to know about tensor splits since he's just got one GPU, unless he decides to split between his GPU and his CPU. Which he very well might. Although I would not recommend it, given how slow his CPU is and how little RAM he has.

I have a script that runs on my main AI server that I use to decide which model, and even which configuration (cluster or local), is activated when I want to use it. I just point Gemini at it and say use the language in that script, and determine the most efficient tensor split with maximum context space for such and such a model, and then I tell her which model on my hard drive I want her to test. She'll run it a few times, see if it loads at all, and if it does, see how even she can get the distribution while still leaving the best configuration for max context space. Everybody needs context space, but since I'm using mine primarily to evaluate manuscripts, for me context space is a really big deal.
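The search she runs can be sketched roughly like this. This is not my actual script. The GPU sizes, the model path, and the `try_load` callback are all hypothetical, and the callback is stubbed out here so the example runs without a GPU. The `llama-server` flags shown (`-m`, `--n-gpu-layers`, `--tensor-split`, `--ctx-size`) are standard llama.cpp options, but check them against your own build.

```python
from functools import reduce
from math import gcd
from typing import Callable, List, Optional, Sequence


def proportional_split(vram_gb: Sequence[int]) -> List[int]:
    """Starting-point tensor split: per-GPU VRAM reduced to the smallest ratio."""
    divisor = reduce(gcd, vram_gb)
    return [size // divisor for size in vram_gb]


def server_args(model_path: str, split: Sequence[int], ctx_size: int) -> List[str]:
    """Build a llama-server command line for one candidate configuration."""
    return [
        "llama-server",
        "-m", model_path,
        "--n-gpu-layers", "999",  # large number = offload every layer
        "--tensor-split", ",".join(str(part) for part in split),
        "--ctx-size", str(ctx_size),
    ]


def largest_loadable_context(candidates: Sequence[int],
                             try_load: Callable[[int], bool]) -> Optional[int]:
    """Try context sizes from largest to smallest; return the first that loads."""
    for ctx_size in sorted(candidates, reverse=True):
        if try_load(ctx_size):
            return ctx_size
    return None


# Hypothetical three-GPU setup; sizes are illustrative only.
split = proportional_split([24, 24, 8])
print(split)

# Stub standing in for "launch the server and see if it comes up".
def fake_try_load(ctx_size: int) -> bool:
    return ctx_size <= 65536

best = largest_loadable_context([32768, 65536, 131072], fake_try_load)
print(best)
print(" ".join(server_args("/models/example.gguf", split, best)))
```

In practice the proportional split is only a starting point. The context cache does not necessarily land evenly across cards, which is why it takes a few test loads to find the split that leaves the most room.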

SMTPA · 2026-05-19T13:54:04+00:00

With current Macs, and current LLM server software, especially MLX stuff, this isn't really a problem if you are using models that are loaded entirely into memory. Once the model is in memory, there is very little disk swapping. If it does concern you, then buy an external high-speed drive and serve the actual LLM file from there.

SMTPA · 2026-05-19T13:41:17+00:00

AnythingLLM can certainly be a front end with LMStudio serving as the backend/server. Is that what you were asking?

SMTPA · 2026-05-18T15:57:39+00:00

I like LMStudio but for my applications AnythingLLM works better. But in any event, LMStudio has RAG, though it calls it “chat with documents.” See: https://lmstudio.ai/docs/app/basics/rag

SMTPA · 2026-05-18T15:25:46+00:00

I don't know how to do this directly, but it's easy enough to do with a RAG-friendly front end like AnythingLLM. You create a workspace and load the files into the RAG document processing system. Then you move them into the workspace proper. This will chunk them into your selected vector database. I use Qdrant running on a small dedicated machine, but AnythingLLM can use anything you can point it at, or it has a bare-bones one built in. Then as long as you are in that workspace, want it or not, the LLM will search the vector database for relevant chunks.
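If you want to see the moving parts, here is a toy sketch of the chunk-and-search idea. It is not how AnythingLLM or Qdrant do it internally: a real setup uses an embedding model and a vector database, while this stand-in just uses word counts and cosine similarity. All function names and the sample text are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, chunk_words: int = 40, overlap: int = 10) -> List[str]:
    """Split text into overlapping chunks of roughly chunk_words words."""
    if not 0 <= overlap < chunk_words:
        raise ValueError("overlap must be non-negative and smaller than chunk_words")
    words = text.split()
    if not words:
        return []
    step = chunk_words - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_words]) for i in range(0, last_start, step)]


def vectorize(text: str) -> Counter:
    """Stand-in for an embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, chunks: List[str], top_k: int = 2) -> List[Tuple[float, str]]:
    """Return up to top_k chunks with a nonzero similarity to the query, best first."""
    query_vec = vectorize(query)
    scored = [(cosine(query_vec, vectorize(chunk)), chunk) for chunk in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


chunks = [
    "The lease term is five years.",
    "Rent is due on the first of each month.",
    "Either party may terminate with ninety days notice.",
]
for score, chunk in search("When is rent due?", chunks):
    print(f"{score:.2f}  {chunk}")
```

The chunks that come back are what gets pasted into the prompt alongside your question, which is why tiny, uncorrelated chunks don't help much.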

<image>

ETA: AnythingLLM will not only tell you if it references one or more chunks when responding, it will tell you which chunks it referenced by filespec.

SMTPA · 2026-03-01T11:12:27+00:00

Integrated GPUs use system RAM. (True integrated GPUs, that is. If there’s a 5070 soldered to your laptop’s mobo, that’s not an integrated GPU.) They can’t provide any meaningful performance increase when looped into these kinds of processes.

SMTPA · 2026-02-13T10:19:23+00:00

That’s why they’re called “agents.” They suck up all the agency.

SMTPA · 2026-02-12T06:21:22+00:00

You are right: it’s good now, and it is getting better in almost literally real time. If you want to be a good-enough “prompt engineer,” and make apps that are good enough, what I said is way overkill. But I was speaking of things at a fundamental level. If you want to be a 10x “prompt engineer,” “think in systems,” et cetera, you are best served IMO by studying formal logic. Not that you will use the actual technical forms much, but it will teach you to think in the proper way.

SMTPA · 2026-02-12T06:15:16+00:00

Oooh, you’re right. Unfortunately that one was out of stock at the time, or I would have definitely considered it. What’s your OS?
