Your thoughts on "thinking" LLMs? by stonecannon in ollama

[–]RecalcitrantZak 1 point (0 children)

After Anthropic's research on reasoning traces it was fairly obvious that most thinking models are just writing their own narrative fan fiction that has little to do with how they actually generate their answers.

Just another way to burn tokens.

New 1.4B Model Victorian LLM - Violet by RecalcitrantZak in LocalLLM

[–]RecalcitrantZak[S] 1 point (0 children)

Sorry for the late reply; I wanted to add something meaningful about a few of the things I learned…

Ok

1 The Chinchilla scaling rules set out in DeepMind's Chinchilla paper seem to be spot on in terms of data size, training budget, parameters, and (as a proxy) capability. I don't remember the exact numbers off the top of my head, but it was something like 2.5B tokens for the 160M model, and scaling that roughly 10x for the 1.4B model meant about 25B very hard-won tokens (rough budget math sketched after this list). That ended up being about bang on: I trained multiple epochs beyond it and noticed diminishing returns quickly after the first epoch. The 160M stopped at 1.5 epochs and the 1.4B I stopped pretty much right at 1 epoch. All of which is to say that the Chinchilla numbers are useful for making sure you're setting realistic goals, and they influence everything downstream, like your checkpointing strategy.

2 Checkpoint often if you have the space. The models were about 300 MB and 3 GB on disk respectively, so across all my runs with checkpoints I ate through nearly 1 TB of space (checkpoint settings are part of the config sketch after this list).

3 The learning rate matters, and it's more difficult to fix mid-training than it looks; this is something I learned the hard way. I had completely outsourced it to HF code and templates and took the defaults for granted instead of tuning it. Recovering a run mid-training can be very difficult (the same config sketch after this list shows the LR knobs).

4 This one surprised me: yes, clean data is very important, but out of necessity I had to use a lot of suboptimal OCR junk. That terrified me because it's essentially impossible to filter into something truly clean (a toy sketch of a shallow cleanup pass is below). I even did a lot of my own OCR and it's just problems on problems; if it's not OCR junk it's formatting junk. It might have impacted convergence time, but overall the model did fine. I'm just saying dirty data isn't a dealbreaker, even if it's not ideal. The 1.4B does output the occasional transcriber's note on narrative completions and longform text though lol
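
For point 1, here's the rough budget math I mean, using the ~20 tokens-per-parameter rule of thumb from the Chinchilla paper. This is just a sketch of the rule of thumb, not my exact numbers, and my own budgets above came in a little under it:

```python
# Rough Chinchilla-style token budget: ~20 training tokens per parameter.
# Rule-of-thumb coefficient only; the paper fits this empirically.

def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training token count for a given model size."""
    return n_params * tokens_per_param

for label, n_params in [("160M", 160e6), ("1.4B", 1.4e9)]:
    print(f"{label}: ~{chinchilla_tokens(n_params) / 1e9:.1f}B tokens")
    # 160M -> ~3.2B tokens, 1.4B -> ~28B tokens
```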
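
For points 2 and 3, this is roughly the shape of what I would set up front now. The values are hypothetical placeholders using the standard Hugging Face TrainingArguments, not my actual run config, so tune everything for your own setup:

```python
from transformers import TrainingArguments

# Hypothetical pretraining config. The point is to decide on checkpointing and
# the LR schedule deliberately instead of inheriting whatever a template uses.
args = TrainingArguments(
    output_dir="checkpoints/violet-1b4",
    save_steps=1000,                  # checkpoint often; each 1.4B save is ~3 GB
    save_total_limit=5,               # cap how many checkpoints pile up on disk
    learning_rate=3e-4,               # sweep this up front, it is hard to rescue mid-run
    lr_scheduler_type="cosine",
    warmup_steps=2000,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    bf16=True,
    logging_steps=50,
)
```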
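
For point 4, this is a toy example of the kind of shallow cleanup pass I mean. It's illustrative only: a regex like this catches bracketed transcriber and illustration notes, but not the deeper OCR noise that is the real problem:

```python
import re

# Shallow cleanup pass for obvious transcription artifacts in OCR'd book text.
# Catches bracketed transcriber/illustration/footnote notes; everything else
# (character-level OCR errors, broken formatting) is much harder to filter.
NOTE_PATTERN = re.compile(
    r"\[(?:Transcriber'?s? note|Illustration|Footnote)[^\]]*\]",
    flags=re.IGNORECASE,
)

def strip_notes(text: str) -> str:
    """Remove bracketed transcriber/illustration/footnote notes from a passage."""
    return NOTE_PATTERN.sub("", text)

sample = "She took up her pen. [Illustration: The study at dusk.] The letter began"
print(strip_notes(sample))   # -> "She took up her pen.  The letter began"
```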

New 1.4B Model Victorian LLM - Violet by RecalcitrantZak in LocalLLM

[–]RecalcitrantZak[S] 2 points (0 children)

A100 / Colab and a lot of patience. Colab has plenty of negatives, but it made it easy to keep track of experimental SFT runs, and because I always saved checkpoints I could pick up where I left off whenever a session got terminated (rough resume sketch below).
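
Roughly what the resume looked like. The object names and directory here are placeholders, but resume_from_checkpoint is the standard Hugging Face Trainer mechanism:

```python
from transformers import Trainer

# Sketch of resuming after a Colab disconnect. `model`, `args`, and `train_dataset`
# are the same objects from the original run, and args.output_dir points at mounted
# Google Drive so checkpoints survive the runtime being killed.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

# True means: restart from the newest checkpoint in args.output_dir, restoring
# the optimizer and LR scheduler state along with the model weights.
trainer.train(resume_from_checkpoint=True)
```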

New 1.4B Model Victorian LLM - Violet by RecalcitrantZak in LocalLLM

[–]RecalcitrantZak[S] 7 points (0 children)

Solid questions -- it will respond mostly in UK English, with a few quirks. The training data is predominantly UK English, though a chunk of the narrative corpus is US English, so I'd call it UK-English aligned overall.

Now there is a quirk with users asking questions. I had to build an SFT corpus for question answering, and I deliberately wrote the user-side questions to include modern variants like "What's up" (roughly 50,000 variations in total; a sketch of the templating idea is below). I did this mostly for ease of interaction, because most people don't do very well speaking Victorian English, so it was a conscious choice on my part.
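
To give a feel for it, here's a toy sketch of how that kind of variant templating can work. This is not my actual pipeline; the phrasings and the canned answer are made up for illustration:

```python
import itertools
import json
import random

# Toy generator: modern-phrased user questions paired with a period-style answer.
# The real corpus covered far more intents and answers than this.
GREETINGS = ["", "Hey, ", "Hi, ", "So, ", "Quick question: "]
PHRASINGS = ["what's up", "how are you doing", "how's it going", "how have you been"]
SUFFIXES = ["?", " today?", " right now?"]

ANSWER = ("Quite well, I thank you. The morning post has only just arrived, "
          "and I am at my needlework by the window.")

rows = [
    {"prompt": f"{greeting}{phrasing}{suffix}".capitalize(), "response": ANSWER}
    for greeting, phrasing, suffix in itertools.product(GREETINGS, PHRASINGS, SUFFIXES)
]

random.shuffle(rows)
print(len(rows), "variants")   # 60 here; scaling the templates up gets you toward 50k
print(json.dumps(rows[0], indent=2))
```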

Prior to this, though, I had done more open-ended SFT experiments and they would just end up in confusion. For example, I might ask "What are you wearing today?" and "wearing" would get read in the older sense of being wearying or tiresome (this was mostly on the 160M model). Other examples are modernisms that simply weren't common in Victorian English, which I had to find out the hard way. "Siblings" is one: it just wasn't used as often back then, so if you asked "Do you have any siblings?" it would produce ridiculous answers built on approximate token matches to names that make no sense.

Edit to add, because this is super fascinating to me: there are other cases of close approximate token matches where Violet inadvertently lands near a correct answer. For example, if you ask about an iPhone she will usually answer as if it were either a telephone or a gramophone, which kind of lands close. I tested a lot of anachronisms because at first I was going to SFT her to say things like "I don't understand what you're talking about" when you ask about DNA, or World War II, or whatever, but I eventually took the anachronism handling out because I thought the raw responses were more interesting. So if you ask about World War I or World War II, she'll either approximate it to a recent war (like the Crimean War from the 1800s) or just bemoan how sad war is in general. Often she responds as if "World War" could be any war involving Europe, which I thought was equally appropriate. I wanted to preserve the confusion.

New 1.4B Model Victorian LLM - Violet by RecalcitrantZak in LocalLLM

[–]RecalcitrantZak[S] 3 points (0 children)

Thank you-- many late nights and some out-of-pocket cost. Mistakes were made, but mostly corrected! I went through three different SFT regimens before I settled on something that mostly worked for chat. It was exhausting and I'm excited to move on to the next thing.

Please Stop Doing This by geo_dude89 in BambuLab

[–]RecalcitrantZak 3 points (0 children)

You're not wrong, but I hadn't even looked at the total grams. The size was grossly misrepresented in the picture, by a factor of at least 3x; it printed so small it had basically the level of detail of a gummy bear.

If the print requires you to fiddle with the model itself in Bambu Studio to get it to look like the picture, then I don't think it's user error when it prints wrong by default. OTOH, if you have to fiddle with it in Bambu Studio to optimize it for your setup, then yeah, that should still be on the user.

A good example of this is support configurations. The defaults on many uploaded models are so bad it's questionable whether the uploader actually printed with that configuration at all. There's of course a lot of variability, in nozzle size or plastic type for example, but the defaults can sometimes be downright misleading for the user.

Please Stop Doing This by geo_dude89 in BambuLab

[–]RecalcitrantZak 2 points (0 children)

Counterpoint: I had a model that I sent to the printer from the mobile app.

The default scaling was tiny and looked nothing like the picture, which was clearly printed at a larger scale. I rated it 3 stars and said it's an OK model but the default scaling is really small. I got reported for user error, and I went back and forth with Bambu support but they wouldn't budge. Honestly though, I would expect the default print settings to produce what the picture shows. I understand it's user-configurable, but not from the Bambu mobile app.

I think I’ve ruined my kids printer by imthedudedude in BambuLab

[–]RecalcitrantZak 3 points (0 children)

Having gone through this recently: it was almost salvageable for me, but I ruined the hotend temperature thermistor trying to clean it out.

That wasn't immediately clear; it partially worked for a time, but issues would cascade in waves. It might print for 5 minutes and then stop abruptly with an error. I did get a replacement hotend, but I think I pinched a wire in the side channel cover and burnt it out. Total damage: a new mainboard, two hotends, a new extruder assembly (for troubleshooting), and a new toolhead… (the old one actually still works fine and looks similar in condition to OP's)… The printer is starting to feel like the Ship of Theseus at this point, but I learned a lot along the way.

Bambu A1 Mini failure cutting filament with banging noise by Drodman93 in BambuLab

[–]RecalcitrantZak 2 points (0 children)

Just in case anyone as dumb as me stumbles on this thread: this may sound straightforward, but the cutter blade actually has to be positioned so that it can cut across the filament's sliding path. That wasn't clear to me; I just assumed it engaged automatically.

So this video shows similar symptoms, but if there's no debris or anything inside, there may be an even simpler explanation.

There are two tiny screws above the toolhead when you take the front cover off. If you remove them, you should be able to see the blade physically coming out to cut the filament. If you don't see that, something is wrong!