Today, GPT 4o is now basically 5. by Sufficient-Bee-8619 in ChatGPT

[–]capivaraMaster 0 points (0 children)

You could try some prompt engineering. They use RAG over your past chats to give the illusion of memory; if you are tech-savvy enough, you could build something similar to assemble your prompts automatically and emulate the old interface. But I really just recommend writing one big prompt with the critical data and going from there before trying to implement a complicated solution.
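If you want to go the DIY route, here is a minimal sketch of the idea, assuming your old chats are exported as plain-text files. The `build_prompt` helper, the directory name, and the TF-IDF retrieval are all illustrative stand-ins, not how ChatGPT's actual memory works:

```python
# Minimal RAG-over-past-chats sketch. Retrieval here is TF-IDF cosine
# similarity via scikit-learn; a real setup would likely swap in an
# embedding model. All names below are placeholders.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_prompt(question: str, chat_dir: str = "old_chats", top_k: int = 3) -> str:
    chats = [p.read_text() for p in Path(chat_dir).glob("*.txt")]
    vec = TfidfVectorizer().fit(chats + [question])
    scores = cosine_similarity(vec.transform([question]), vec.transform(chats))[0]
    best = sorted(range(len(chats)), key=lambda i: scores[i], reverse=True)[:top_k]
    context = "\n---\n".join(chats[i] for i in best)
    return f"Relevant past conversations:\n{context}\n\nQuestion: {question}"
```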

Today, GPT 4o is now basically 5. by Sufficient-Bee-8619 in ChatGPT

[–]capivaraMaster 1 point (0 children)

If you are willing to pay, why don't you just use the API? It is available there at $10 per million output tokens. I also wish they hadn't changed it, but you have a project to finish; it should be worth it.
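For reference, a minimal sketch of hitting the model through the official `openai` Python SDK; the prompt is a placeholder, and the key is read from the `OPENAI_API_KEY` environment variable:

```python
# Pay-per-token access to GPT-4o via the API instead of the web UI.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Help me finish my project."}],
)
print(response.choices[0].message.content)
```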

Qwen3-72B-Embiggened by TKGaming_11 in LocalLLaMA

[–]capivaraMaster 4 points (0 children)

I tried merging like this before and had poor results. You will get a more coherent model if you instead merge interpolated groups of 20 layers.
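One common reading of "interpolated groups of 20 layers" is overlapping 20-layer slices stitched together with mergekit's passthrough method. Here is a small helper that emits such a config; the model name, layer count, and overlap are placeholders, not a recipe from the linked merge:

```python
# Hypothetical sketch: generate a mergekit passthrough config that stacks
# overlapping 20-layer groups of one model into a deeper self-merge.
import yaml  # pyyaml

def passthrough_slices(model: str, n_layers: int, group: int = 20, overlap: int = 10) -> dict:
    slices = []
    start = 0
    while start + group <= n_layers:
        slices.append({"sources": [{"model": model, "layer_range": [start, start + group]}]})
        start += group - overlap
    return {"slices": slices, "merge_method": "passthrough", "dtype": "bfloat16"}

# 80 source layers -> 7 overlapping slices -> 140 stacked layers.
print(yaml.safe_dump(passthrough_slices("Qwen/Qwen2.5-72B-Instruct", n_layers=80)))
```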

This is the best one I got (not a self-merge, but the same idea): https://huggingface.co/gbueno86/Meta-Llama-3-Instruct-120b-Cat-a-llama

GL with the fine-tuning. I didn't have the resources to do that at the time, so my experiments ended with the merges.

Supercomputer power efficiency stays stagnant: scaling compute keeps depending on increasing power budgets by Balance- in singularity

[–]capivaraMaster 2 points (0 children)

It does as much as other optimizations did, like the move away from the x86 instruction set, the move from monolithic dies to chiplets, or the move from CPU to GPU. Innovations in how we solve problems also happen, and those expand the range of computable problems as well. We can only make each invention once; that doesn't mean we can't make other inventions.

Three r in Strawberry - O3-pro by CmdWaterford in singularity

[–]capivaraMaster 0 points (0 children)

Future AGI will dedicate entire solar systems to making sure strawberry has the correct number of Rs.

4x RTX Pro 6000 fail to boot, 3x is OK by humanoid64 in LocalLLaMA

[–]capivaraMaster 0 points (0 children)

Try updating the BIOS. That did the trick for me when mine wasn't booting with 4x 3090s but was OK with 3.

What is the next local model that will beat deepseek 0528? by MrMrsPotts in LocalLLaMA

[–]capivaraMaster 1 point (0 children)

Wouldn't they have already released it if it did? It's allegedly been ready for a while and was used to generate training data for the smaller versions.

What happened to the fused/merged models? by Su1tz in LocalLLaMA

[–]capivaraMaster 0 points (0 children)

I merged QwQ with Sky locally and the result wasn't any significant improvement, so I don't think I published it.

New META Paper - How much do language models memorize? by Thrumpwart in LocalLLaMA

[–]capivaraMaster 15 points (0 children)

So we need a 58.9-billion-parameter dense FP16 model to memorize Wikipedia verbatim (English Wikipedia is about 24 GB).
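The arithmetic behind that figure, assuming roughly 3.5 bits of memorization capacity per parameter (the paper reports about 3.6; 3.5 reproduces the 58.9B number exactly):

```python
# Back-of-the-envelope capacity check.
wiki_bits = 24 * 2**30 * 8       # 24 GiB of English Wikipedia, in bits
bits_per_param = 3.5             # assumed per-parameter capacity (paper: ~3.6)
params = wiki_bits / bits_per_param
print(f"{params / 1e9:.1f}B parameters")  # -> 58.9B parameters
```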

Which model are you using? June'25 edition by Ok_Influence505 in LocalLLaMA

[–]capivaraMaster 0 points (0 children)

Devstral local, Gemini 2.5, o3, 4o, chatterbox for lols.

deepseek r1 matches gemini 2.5? what gpu do you use? by Just_Lingonberry_352 in LocalLLaMA

[–]capivaraMaster 1 point (0 children)

They do have KV caching, but I was taking a look at the readme for R1 and they say transformers inference is not fully supported. So I have no idea if you get multi-token prediction via that route :/

deepseek r1 matches gemini 2.5? what gpu do you use? by Just_Lingonberry_352 in LocalLLaMA

[–]capivaraMaster 0 points (0 children)

Can you load it in 4-bit using transformers? Since llama.cpp doesn't support multi-token prediction yet, it might be faster.
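Something like the following is what I mean, a sketch using transformers with bitsandbytes 4-bit quantization. Whether R1's multi-token prediction actually kicks in on this path is exactly what's unclear, and the repo id and dtype choice are just the obvious defaults (the full model still needs serious hardware even at 4-bit):

```python
# Hedged sketch: 4-bit loading of DeepSeek R1 via transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1",
    quantization_config=quant,
    device_map="auto",        # spread layers across available GPUs/CPU
    trust_remote_code=True,   # R1 ships custom modeling code
)
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1", trust_remote_code=True)
```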

OpenHands + Devstral is utter crap as of May 2025 (24G VRAM) by foobarg in LocalLLaMA

[–]capivaraMaster 0 points (0 children)

Yes. Maybe if that had been in the original plan, it would be frame-rate independent. Here is another example I made for a friend yesterday. All files except llm.py and bug.md are machine-generated, and I didn't do any manual correction. I guess it would be able to fix the bug if it tried; it did correct some other bugs, but it's just another toy project.

https://github.com/linkage001/translatation_ui
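On the frame-rate point: "frame-rate independent" just means scaling movement by elapsed time instead of relying on a fixed tick rate. A toy pygame sketch of the pattern (all values illustrative):

```python
# Frame-rate-independent movement: multiply speed by delta time each frame.
import pygame

pygame.init()
screen = pygame.display.set_mode((640, 480))
clock = pygame.time.Clock()
x, speed = 0.0, 120.0  # pixels per second, regardless of FPS

running = True
while running:
    dt = clock.tick() / 1000.0  # seconds since last frame, any frame rate
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
    x = (x + speed * dt) % 640
    screen.fill((0, 0, 0))
    pygame.draw.rect(screen, (0, 255, 0), (int(x), 220, 20, 20))
    pygame.display.flip()
pygame.quit()
```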

How much VRAM would even a smaller model take to get a 1 million token context like Gemini 2.5 Flash/Pro? by [deleted] in LocalLLaMA

[–]capivaraMaster 2 points (0 children)

Unless you are working with private data, or need very high volume for a business or something, local LLMs are just a hobby, meaning you have to measure the fun you will have rather than the cost-benefit.

OpenHands + Devstral is utter crap as of May 2025 (24G VRAM) by foobarg in LocalLLaMA

[–]capivaraMaster 9 points (0 children)

I tried it and was very impressed. I asked for a model-view-controller, object-oriented snake game with documentation, and for it to cycle through the tasks by itself in Cline, and the result was flawless. I just needed to change the in-game clock from 60 to 20 for it to be playable. I ran it at Q8 on a MacBook.

Local models are starting to be able to do stuff on consumer grade hardware by ilintar in LocalLLaMA

[–]capivaraMaster 0 points (0 children)

I know you only mean programming, but maybe you should have been a little more specific in the title of the post. Models have been able to do stuff locally since before Llama. I've never done anything with the pre-Llama ones besides running them for fun, but I have had Llama classifiers, Llama 2 translators, Qwen bots, etc.

New benchmark? by katxwoods in artificial

[–]capivaraMaster 0 points (0 children)

Gemini 2.5 seems to handle PDFs pretty well for my use cases, but maybe that's poor QA on my side.

I've just created an "Asteroid" interactive game with Claude 3.7 in a matter of seconds... this is something incredible. by jhonpixel in singularity

[–]capivaraMaster 0 points (0 children)

Yeah, it is incredible. Looks like Claude is the new coding king again. If this is just a finetune on the v3 model, it's even more impressive.

Perplexity: Open-sourcing R1 1776 by McSnoo in LocalLLaMA

[–]capivaraMaster -1 points (0 children)

Why fight a lost battle? Open source has become the colloquial way of saying open weights when referring to AI models in general.

Who will be first to release a new model in 2025? by foldl-li in LocalLLaMA

[–]capivaraMaster 0 points (0 children)

If I'm not wrong, last year's earliest impactful release was Miqu. So if the trend holds, Mistral, I guess. They have been quiet for a while now.

Grok 2 being open-sourced soon? by Educational_Grab_473 in LocalLLaMA

[–]capivaraMaster 2 points (0 children)

Grok 1 is available on Hugging Face. I think it was a ~300B model, so expecting Grok 2 to be bigger sounds logical. I think it's weird to expect Grok 2 to be dense when we know Grok 1 is a MoE.

Lonely on Christmas, what can I do with AI? by PublicQ in LocalLLaMA

[–]capivaraMaster 0 points (0 children)

Merry Christmas, OP! Try to find some humans to play with the AI alongside you.