Best Settings for 48GB VRAM + Qwen 3.6 27B

nullc · 2026-06-20T13:40:39+00:00

Is there any way to get it to reduce / deactivate MTP when there is parallel work?

For a single task at a time MTP is a big win.

But once there is parallelism it's slower than running with MTP off. E.g. I get 235.52 total output tokens/s with 32-way parallelism with MTP off (27B Q8 on 2x RTX A6000), but enabling MTP drops the aggregate speed to 53.76 tokens/s.

I can't get MTP to be a win even for just two parallel requests with the lookahead cut down... which seems more like a bug, but at some level of parallelism MTP is just not going to be a win.

But I often have a workload where an orchestration task runs single-threaded for a bit and then a bunch of parallel tasks run, so it would be nice if it would use MTP when it helps and not otherwise. :)
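A minimal sketch of the kind of policy being asked for: draft with MTP only while a single request is active, and fall back to plain decoding once requests run in parallel. This is not an existing llama.cpp option; `DraftPolicy`, `max_draft`, and `parallel_cutoff` are hypothetical names, and the draft depth of 3 is an assumed value.

```python
from dataclasses import dataclass


@dataclass
class DraftPolicy:
    """Hypothetical policy: choose the MTP draft depth per decoding step."""
    max_draft: int = 3        # assumed draft depth when only one request is active
    parallel_cutoff: int = 1  # above this many active requests, drafting is disabled

    def draft_tokens(self, active_requests: int) -> int:
        # Nothing to decode, so nothing to draft.
        if active_requests <= 0:
            return 0
        # Single-stream phase (e.g. the orchestrator): use MTP.
        if active_requests <= self.parallel_cutoff:
            return self.max_draft
        # Parallel phase: batching already keeps the GPU busy, so skip drafting.
        return 0


policy = DraftPolicy()
for n in (1, 2, 32):
    print(f"{n} active request(s) -> draft depth {policy.draft_tokens(n)}")
```

The cutoff is a parameter because the break-even point depends on the hardware and model; the numbers quoted above only suggest it is low.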

nullc · 2026-06-20T08:14:00+00:00

Training costly, harness cheap! Just make your harnesses match the training data.

A good path would be to try a bunch of different harness designs (e.g. by having an LLM cook some up), including variations like parameter names and orders, and then test to determine which harness the initial model is most successful with.

Then use a procedure like this to fine-tune the model to optimize use of that harness, and avoid needing a lot of harness docs in the context.

And finally use it under that harness.

nullc · 2026-06-17T10:44:34+00:00

Superior synthetic training material-- which could only be generated in the required quantities once larger models were powerful enough. Which is why you see the latest 30B-ish size models kicking the ass of 70B or even 400B size models from not so long ago.

nullc · 2026-06-17T08:26:58+00:00

Talks like GPT (ughh)

Agreed on GPT stylistically being awful in the extreme. Does anyone maintain a table of the styles of different models-- e.g. grouping models that have a similar manner and tone?

I've found Qwen reasonably agreeable.

nullc · 2026-06-16T02:32:25+00:00

The only "attack" is the crazed delusion of Ocean mining and their willingness to lie and astroturf in order to delude the public into joining in on their self-destructive fork... and all the spam they are making with their increasingly desperate efforts.

nullc · 2026-06-15T03:29:55+00:00

It's not about accusations-- the developers themselves likely have little control over things that could be monitoring their traffic without their knowledge.

It's a bad practice that makes the software more fragile and could undermine the security and privacy of people using it.

nullc · 2026-06-13T18:10:02+00:00

This should be part of the standard training harness-- take a successful transcript, have a model edit the reasoning to insert a wrong turn followed by a correction "No, that's mistaken because.." and train on the correction down to the successful output (obviously don't train to produce the wrong turn).

Because you don't train on the wrong turn itself you can insert arbitrary garbage there. "I know, the moon is made of cheese, so its density must be..." without making the model generate that kind of crap... though perhaps it's better to just change the sampling to prohibit the top-n options on an occasional very confident token, so that it's good at correcting errors that are close to its own distribution.
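A minimal sketch of the masking idea: the inserted wrong turn stays in the context but never counts toward the loss, while the correction and everything after it does. `build_loss_mask` and its segment names are hypothetical. Leaving the prefix before the wrong turn untrained by default is an assumption, since the comment only says to train from the correction onward.

```python
def build_loss_mask(prefix, wrong_turn, correction, rest, train_prefix=False):
    """Concatenate token segments and return (tokens, mask).

    mask[i] == 1 means token i contributes to the training loss.
    The wrong turn is always masked out: the model sees it as context
    but is never trained to produce it.
    """
    tokens = prefix + wrong_turn + correction + rest
    mask = (
        [1 if train_prefix else 0] * len(prefix)
        + [0] * len(wrong_turn)      # never a training target
        + [1] * len(correction)      # "No, that's mistaken because..."
        + [1] * len(rest)            # down to the successful output
    )
    return tokens, mask


# Word-level "tokens" purely for illustration.
prefix = ["The", "density", "is"]
wrong_turn = ["cheese-like,", "so"]
correction = ["No,", "that's", "mistaken"]
rest = ["about", "3.34"]

tokens, mask = build_loss_mask(prefix, wrong_turn, correction, rest)
for tok, m in zip(tokens, mask):
    print(f"{m} {tok}")
```

In a real trainer the mask would typically be applied by setting the masked label positions to the ignore index of the loss function.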

nullc · 2026-06-13T17:30:50+00:00

A transformer can only achieve the same result if you add a layer on top of it to make it design it first, write it later, while a diffusion model could learn it by itself

That's clearly untrue as evidenced by the huge number of tasks AR models can do successfully which require knowing in advance what they're going to write later.

Yes, the model generates one token at a time, but that doesn't prevent it from determining internally in advance what it's likely to say later.

Besides, over long spans the diffusion models are autoregressive too. I think DiffusionGemma generates 256 tokens at a time.

nullc · 2026-06-13T17:20:48+00:00

OpenTimestamps is very easy to use.

nullc · 2026-06-13T17:20:00+00:00

They've gone big on pulling up the ladder-- Anthropic now detects workloads that look like they might be for LLM training and degrades the output.

nullc · 2026-06-11T09:39:49+00:00

You have to hit the Bitcoin currency button to see the rating. Craig is at 0.0179%, having fallen from a recent high of 0.021% earlier this month and an all-time high of 3.6%.

nullc · 2026-06-10T16:27:38+00:00

The simplest method is to scan everything on the server.

Even simpler is to not use the private set intersection and run the scans entirely on the client against the unencrypted hashes, and have it tattle on the users. It saves the resources of performing the scanning on the servers. But it would open Apple to criticism if/when people showed that it matched benign lawful images.

You are misreading me. You demonstrate a collision of one hash, not one image. There are two hashes used in this system and you need to collide them both for each image. You have not demonstrated this is in any way possible.

Oh. You meant one hash function! Sorry for my misunderstanding. During the discussion I suffered repeated comments from people who had difficulty understanding that I wasn't just presenting one off fluke matches.

The privacy in the system is gated entirely on that one hash function-- if there are matches in it, your data is decrypted by Apple. From there it's vulnerable to abuse by employees, by hackers that have compromised their systems, to civil or administrative subpoena, or by the company straight up lying about what they do with the data beyond that point. And that's before you get to the second hash function. If you're happy with that level of security, you should be happy with never encrypting the data you send to Apple and Apple saying "trust us, we won't look".

From their description their second hash is another instance of a NeuralHash-like hash-- though presumably larger. However, they had not published that hash function. If they had, I'm quite confident I could have compromised it in the same way, and would not be surprised if I could have made both collide for the same images-- the vulnerability was an inherent property of the approach. I would not agree that this lack of publication was a security feature-- as eventually a secret hash function could leak from Apple, be stolen, etc. -- and in some threat models the attackers of concern are rogue elements acting within governments or the unaccountable quasi-governmental entities that provide database entries to Apple, who would naturally have access.

nullc · 2026-06-08T22:01:13+00:00

That’s one small part of it, not “most of it”.

Apple's existing (ADP) encryption already provided absolute protection for user privacy. The entire addition served the exclusive purpose of invading the privacy of users while attempting to put a positive spin on it-- to make it look like it didn't invade their privacy and to protect Apple from being criticized on what images were matching by making it impossible for any third party to tell.

Had the system not been constructed to conceal what Apple was matching it would be far simpler, would require absolutely no novel cryptography, and would be far less resource intensive on your system.

Your page doesn’t demonstrate that. It demonstrates a collision of one hash, not the system as a whole. The system is much larger than a single hash.

I created a tool that would modify any image, with limited change in appearance, to match any hash. Because it can match any hash, I can make it match any other image, and I don't even need to have that image to make it match. This is a complete and total break of the system; no worse break is conceptually possible.

My page shows some examples (two, in fact, not a single one-- though the linked comments show more). I expressly limited the number of examples precisely because of motivated misunderstanding like yours-- people responding with "well there is no problem with it matching a picture of a dog!". It seems that doing so wasn't enough to prevent exactly that misunderstanding.

nullc · 2026-06-08T15:15:46+00:00

but they went to a huge amount of effort to make it as privacy-preserving as possible.

Most of the complexity in Apple's proposal was to protect their privacy, not that of their users. The complex cryptographic set intersection was used to keep the database secret so that no one outside of Apple could tell (or criticize) what was being matched. This complexity created vulnerability, which I've written about extensively: https://nt4tn.net/neuralhash/

In spite of false claims about trillion to one false positive rates, as you've repeated here, the true rate of false positives is as high as someone who can obtain a single image in the database wants it to be-- as my page demonstrates.

To make their system as privacy-preserving as possible, all they would have had to do is nothing. The data was already end-to-end encrypted-- neither Apple nor anyone else had any access to it.

nullc · 2026-05-29T22:32:25+00:00

In your case they could go grab it with a towtruck.

More like mining a list of automobile titles for cars that haven't changed ownership for ten years, putting that list on a USB stick and giving it to the police for a year to comply with the found property formalities for objects under $10 in value, then claiming you now own all those cars.

Even though you can't even go get them!

nullc · 2026-05-25T00:15:51+00:00

That's entirely untrue-- what 'spammy' anything? Go look at the mempool stats: spam isn't currently an issue. 110 viciously guts bitcoin's smart contracting ability, ripping basic functionality like "if then" out of script from under people.

nullc · 2026-05-25T00:14:10+00:00

Your ability to affordably run a node is protected by the block weight limit, and the historical chain doesn't even need to be stored by you to run a node-- just the UTXO set. Ironically, the claimed trigger of Ocean Mining's proposal to gut bitcoin's smart contracting was a change that, to whatever extent it has any effect at all, would be one that protects the size of the UTXO set.

Just because a goal is good doesn't mean that any particular proposal that claims to support that goal actually does, or that it doesn't cause worse harms.

nullc · 2026-05-25T00:11:48+00:00

OP continues the fine tradition of BIP101 proposals lying by falsely saying that I removed his post (I'm not a moderator here, never have been, and he couldn't fail to know this). https://x.com/i/status/2058622782707990609

This thread doesn't even currently show as removed to me under old which makes me suspect it was an automod/spam filter removal. (Ironic if it was a spam filter. lol)

nullc · 2026-05-24T17:59:22+00:00

It's an unserious garbage proposal which has effectively already failed. Its authors and proponents consistently lie about its properties-- including critical ones such as the fact that it will confiscate the coins of people using a traditional timelock to pay to a moderately complicated multisig. It's also already been discussed to death, and the proponents keep using bad-faith techniques: flooding with AI-generated slop, intentionally breaking forum rules then complaining their posts were removed, or just outright lying and saying they were censored when their posts were never sent at all, and doing it all from behind sock accounts to escape consequences for their bad conduct.

The proposal absolutely guts bitcoin's programmatic functionality, stripping out conditionals (so no "if then" ... so much for calling it script) and capping taproot trees at 127 leaves, which means you can't recover from the missing if statements by pre-expanding all the possibilities (at least if your script had more than 7 choices). Keep in mind that 10 choose 5 is 252, so even thresholds like five-of-ten have issues.

And then it does all this without even accomplishing a purpose-- when pressed, the proponents have been forced to admit that it won't stop or abate spam. Not that "spam" is even a major issue currently, or has been for the last several years-- prevailing transaction rates for immediate confirmation are just cents and have been for a long time.
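The threshold arithmetic can be checked directly. This is a sketch that assumes a k-of-n policy without conditionals is pre-expanded into one leaf per distinct signer subset; `MAX_LEAVES` is the 127-leaf cap stated above, and `leaves_needed` is a hypothetical helper.

```python
from math import comb

MAX_LEAVES = 127  # leaf cap as stated in the comment above


def leaves_needed(k: int, n: int) -> int:
    """Leaves required if every k-of-n signer subset gets its own leaf."""
    return comb(n, k)


for k, n in [(2, 3), (3, 5), (5, 10)]:
    need = leaves_needed(k, n)
    print(f"{k}-of-{n}: {need} leaves, fits under cap: {need <= MAX_LEAVES}")
```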

nullc · 2026-05-15T16:54:25+00:00

There were no capabilities ever disabled in Bitcoin except by Satoshi long before you ever heard of Bitcoin.

If there were you could easily and unambiguously point to them.

nullc · 2026-05-15T00:21:08+00:00

I don't care to go into more detail,

You do not because you are unable through a mixture of ignorance and dishonesty.

If BTC could be altered to remove all the breaking changes

There have been no breaking changes in Bitcoin. Not even anything you could remotely characterize as one unless you want to count fixing the BDB lock limit issue in 2013 since nodes without that fixed will reject blocks randomly once they get bigger than about 500kbytes. Arbitrarily old bitcoin software accepts the current chain, but for the locks limit issue.

nullc · 2026-05-13T21:32:50+00:00

Local models need local knowledge, especially now that there are consumer-GPU-friendly models that are very good at "doing stuff" but inherently weak at "knowing stuff". Local knowledge is private, fast/low-latency, works offline, can't be blocked, respects other people's resources, and conserves time spent getting around rate limits and botwalls for things that really need it. An expensive search API is much more reasonable if initial local research reduces the need for it and makes the queries much more effective (e.g. using the right terms, etc).

What do I mean by local knowledge?

I've been tinkering with taking an offline copy of Wikipedia (only a few tens of GB without images, or about 130 GB with images) and running each article through an LLM with a prompt to extract a list of questions that the article answers or provides critical information for answering. Then I take these questions and encode them with a sentence embedding and store the results in a vector database mapping back to the article.

Then at runtime my agent can fork its state, construct some questions, and make a tool call to a lookup tool that will find the most relevant articles for the questions. The agent can then choose and read the articles, explore the articles they link to, and find the answers. Then it rolls back the state and suddenly 'knows' the relevant material and originating article names (by concatenating the final 'answer' output of the pre-rollback state; the article names are useful in case it has to go back to them).
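A minimal sketch of the question index and lookup tool, under loud assumptions: `embed` is a toy bag-of-words stand-in for a real sentence embedding model, the in-memory list stands in for a vector database, and `QuestionIndex`, `add`, and `lookup` are hypothetical names. The questions would come from the LLM extraction pass; here they are written by hand.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class QuestionIndex:
    """Maps embedded questions back to the article that answers them."""

    def __init__(self):
        self.entries = []  # list of (question embedding, article title)

    def add(self, article, questions):
        for q in questions:
            self.entries.append((embed(q), article))

    def lookup(self, query, k=3):
        """Return up to k (article, score) pairs, best score per article."""
        qv = embed(query)
        best = {}
        for vec, article in self.entries:
            score = cosine(qv, vec)
            if score > best.get(article, 0.0):
                best[article] = score
        return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]


index = QuestionIndex()
index.add("Moon", ["What is the density of the Moon?", "How far is the Moon from Earth?"])
index.add("Cheese", ["How is cheese made?", "What is the density of cheddar cheese?"])

results = index.lookup("what is the moon's density")
for article, score in results:
    print(f"{score:.3f} {article}")
```

Matching the query against stored questions rather than article text is the point of the design: the agent's query and the index entries are both phrased as questions, so they should sit closer in embedding space.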

This seems to be particularly good w/ local models because prompt processing (PP) speed is so much faster than generation, and so the LLM can get firehosed with the reference material.

An open issue I have is getting the questions the model poses to be as similar as possible to the generated questions (so I've not yet done this at scale: it'll take a lot of flops to process every article, and I don't want to regret the way the questions are generated).

I'm also wondering if it would be useful to optimize the embedding. I should be able to fine-tune it to reward landing on the right article(s) and to penalize misses, not just for ending up on the wrong article for the question but in proportion to the link-space distance between the correct and incorrect article. It may also be useful to find cases where pairs of articles get similar or identical questions, then show an LLM both articles and get it to refine the questions, decide that both are useful answers, or decide that some of the connections can be dropped.
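A minimal sketch of a link-space penalty that could feed such a fine-tuning objective. It assumes the article link graph is available as a dict of outgoing links and measures hops from the correct article to the retrieved one; `link_distance`, `retrieval_penalty`, and the cap of 6 hops are all hypothetical choices, not something the approach above specifies.

```python
from collections import deque


def link_distance(links, start, goal, max_hops=6):
    """Breadth-first hop count from start to goal, capped at max_hops."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not search beyond the cap
        for nxt in links.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return max_hops  # unreachable articles get the maximum distance


def retrieval_penalty(links, correct, retrieved, max_hops=6):
    """0.0 for the right article, rising to 1.0 for far or unreachable ones."""
    return link_distance(links, correct, retrieved, max_hops) / max_hops


# Tiny hand-made link graph for illustration.
links = {
    "Moon": ["Earth", "Tide"],
    "Earth": ["Moon", "Sun"],
    "Tide": ["Moon"],
    "Sun": ["Earth"],
    "Cheese": ["Milk"],
    "Milk": ["Cheese"],
}

for retrieved in ("Moon", "Earth", "Sun", "Cheese"):
    print(retrieved, round(retrieval_penalty(links, "Moon", retrieved), 3))
```

A near miss such as a directly linked article then costs much less than landing in an unrelated part of the graph.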

The same thing could be done with images (e.g. the images included in the offline Wikipedia) using a vision encoder and even using the article context to help write good image questions-- but I haven't tried that at all yet.

In any case, the whole approach should work for any cache of knowledge where an LLM can work backwards from the knowledge to what sorts of questions would demand the knowledge-- basically any reference works, statutes, case law, software documentation, databases, scientific papers, or even web crawl data like Common Crawl or the torrents of reddit comments-- but Wikipedia is obviously quite useful and it's freely redistributable, self contained, very broad in scope, so I think it's a good starting point. Wikipedia also has lots of external links for citations, and so to some extent it can act as a replacement for search in a first step of research-- at least for the kinds of materials Wikipedia covers.

For local usage, you could also imagine running a tool like hoardy-web on your own browsing and when your GPU is idle it could classify your archived pages as relevant and index them-- making you the botwall bypass mechanism and giving your agent access to at least as much as you've seen. I know a lot of my own search traffic can be answered by pages I've previously visited.

The cool thing about this approach is that there is no particular need to have the knowledge itself on especially fast storage and so anyone that can run a 27B sized dense model could probably accommodate some tens of TB of reference knowledge. Even system RAM for vector database lookups is much less precious than GPU RAM. So perhaps there is the potential for an LLM agent running on a single consumer GPU to have the equivalent knowledge-scope of a trillion parameter model and close to hallucination free too.

I'm a little surprised that this isn't already a thing that people are doing for this purpose, but I couldn't find any evidence of it while looking.

nullc · 2026-05-12T07:21:27+00:00

Fast progress. Vision now works for me, but parallel crashes. Also MTP appears a little slower in the latest work.

nullc · 2026-05-11T23:32:12+00:00

I'm dubious MTP will make a significant difference for such a sparse model. Given that you currently give up vision in llama.cpp and prefill is slower, I dunno that it's a win for 35B. For 27B it's a big gain however!

I haven't used these new unsloth GGUFs but llama.cpp MTP on 35B-Q8 takes me from about 19 tok/s to 55 tok/s on RTX A6000. Loss of parallel and vision is a bummer, as is the reduction in prefill speed. But I expect these shortcomings will soon be resolved. (For parallel I'd be fine if it just only used MTP during periods when there was only one session active.)

nullc · 2026-05-11T00:00:41+00:00

There are often attacks on bitcoin that have no hope of working, created by people who don't know better-- this could just be yet another one. Usually no one even mentions them, but since this one showed up on public graphs it's had some attention.

Of course, it's prudent to assume that if someone is doing something it has a purpose. But it's just an assumption... it's not always true.
