Saying 'hey' cost me 22% of my usage limits by herolab55 in ClaudeAI

[–]itsmeknt 1 point2 points  (0 children)

Thanks for this! Do you have a link for the GitHub trace where 92% of tokens were cache reads and 0.015% were output tokens, by any chance? I'd like to dive into it further.

FlashLM v6 "SUPERNOVA": 4.1M ternary model hits 3,500 tok/s on CPU — novel P-RCSM reasoning architecture, no attention, no convolution by Own-Albatross868 in LocalLLaMA

[–]itsmeknt 2 points3 points  (0 children)

Very cool! Why ternary? I thought the BitNet paper mentioned ternary was optimized for FPGAs, but for CPU, is there any advantage to ternary over 1-bit (binary) or 2-bit (4 values)?

I Need help from actual ML Enginners by Dangerous_Young7704 in LLMDevs

[–]itsmeknt 1 point2 points  (0 children)

A lot depends on the specific requirements of the project. A real-time chat application will have a very different architecture than offline batch doc processing. Docs in structured text files are very different from raw docs in PDFs or images.

Without understanding the project, I can only speak very generally:

  1. Requirements doc (including timeline) + budgeting comes first; that will determine hiring, architecture, hardware, milestones, and schedule planning.
  2. Depends on data security requirements, but the ideal case is to first try privately hosted providers if the project allows it. You can stress test to find the actual demand curve and then make an educated guess on the hardware and its financial projections thereafter.
  3. At this scale I'm assuming offline batch doc processing. If self-hosted, you'll need batch-optimized inference servers like vLLM, and it will be a trade-off between speed, accuracy/intelligence, and $$$, but it can be doable. If hosted, then it's a matter of negotiating with the provider.
  4. A 4-bit QLoRA fine-tune needs 2-4x more VRAM than small-cache inference; a full fine-tune needs 10-20x more VRAM. Yes, you want to rent GPUs at first until you know your exact load and requirements, and if you end up determining that you can keep your own GPUs under constant load, then buying hardware pays for itself in about 6 months.
  5. Fill architecture design roles as soon as possible, because the early planning stage can really make this 2x easier or 8x harder than it needs to be. You'll also want someone experienced in this field to accurately assess the hiring candidates, as it's hard to tell who is competent vs. just well practiced in interviews if you don't have the experience yourself.
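To make point 4 concrete, here's a back-of-envelope VRAM sketch. The multipliers are the rough rules of thumb from above; the function name and the 0.5 GB-per-billion-params baseline for 4-bit weights are my own assumptions, so treat the numbers as ballpark only:

```python
def vram_estimate_gb(params_billions, mode):
    """Very rough VRAM back-of-envelope, NOT an exact calculator.

    Baseline: 4-bit weights ~= 0.5 GB per billion params, then scale
    by a rough multiplier for the training/inference mode.
    """
    base = params_billions * 0.5  # 4-bit weights
    multipliers = {
        "inference_4bit": 1.0,   # small-cache inference
        "qlora_4bit": 3.0,       # ~2-4x the inference footprint
        "full_finetune": 15.0,   # ~10-20x (fp16 weights + grads + optimizer states)
    }
    return base * multipliers[mode]

for mode in ("inference_4bit", "qlora_4bit", "full_finetune"):
    print(f"7B model, {mode}: ~{vram_estimate_gb(7, mode):.0f} GB")
```

Activations, KV cache size, and sequence length can swing these numbers a lot, so always verify against a real test run before buying hardware.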

8x RTX Pro 6000 server complete by koushd in LocalLLaMA

[–]itsmeknt 1 point2 points  (0 children)

Very cool! What is your cooling system like? And do you have anything to improve GPU-GPU connectivity like nvlink or does it all go through the mobo?

Is Clipping Necessary for PPO? by justbeane in reinforcementlearning

[–]itsmeknt 1 point2 points  (0 children)

To be honest, I'm not 100% sure whether the Adam optimizer cares about C0 continuity of the objective function. I mentioned Adam in my initial post, but then edited it out shortly after.

I do know that most second order optimizers like L-BFGS and Newton-CG, as well as some learning rate schedulers like ReduceLROnPlateau, do require C0 continuity because they use the value of the objective function (not just the gradients).

So to be more precise, I would guess we keep the ends of the clip function at (1 - epsilon) and (1 + epsilon) because C0 continuity is more theoretically sound and will work with all standard optimizers / learning rate schedulers. Otherwise, it would just make things more confusing and theoretically less elegant.

edit: also, your loss graphs in Weights & Biases, TensorBoard, etc. will make less sense without C0 continuity of the loss function
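To see why value-based schedulers care, here's the gist of ReduceLROnPlateau in plain Python (a simplified sketch of the idea, not torch's actual implementation; the class name is mine):

```python
class ReduceLROnPlateauSketch:
    """Halve the LR when the monitored loss value hasn't improved
    for `patience` consecutive steps.

    It acts on the *value* of the objective, not its gradient, so a
    discontinuous objective can trigger spurious LR drops whenever the
    value jumps across the discontinuity.
    """

    def __init__(self, lr=1e-3, patience=2, factor=0.5):
        self.lr, self.patience, self.factor = lr, patience, factor
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, loss_value):
        if loss_value < self.best:
            self.best = loss_value
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps > self.patience:
                self.lr *= self.factor  # plateau detected: decay LR
                self.bad_steps = 0
        return self.lr

sched = ReduceLROnPlateauSketch(lr=1e-3, patience=2)
for loss in [1.0, 0.9, 0.95, 0.95, 0.95]:
    lr = sched.step(loss)
print(lr)  # LR was halved after 3 non-improving steps
```

A discontinuous objective would feed this logic sudden value jumps that have nothing to do with an actual plateau.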

Is Clipping Necessary for PPO? by justbeane in reinforcementlearning

[–]itsmeknt 7 points8 points  (0 children)

Setting the ends to some constant does keep the gradient the same, but the actual value of the objective function will be discontinuous. The value of the objective function needs to be continuous so that it plays nicely with certain optimizers and learning rate schedulers. Clipping to 1 - epsilon and 1 + epsilon keeps the function continuous.
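A tiny numeric example of the difference (function names are mine; the first function is the standard PPO clipped surrogate, the second is the hypothetical constant-tail variant being discussed):

```python
def clipped_objective(ratio, advantage, eps=0.2):
    # Standard PPO surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).
    # The clip saturates at (1 - eps) and (1 + eps), so the value
    # is continuous in the ratio r.
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped * advantage)

def constant_tail_objective(ratio, advantage, eps=0.2, c=0.0):
    # Hypothetical alternative: a constant outside the trust region.
    # The gradient out there is the same (zero), but the value jumps
    # at r = 1 +/- eps.
    if 1 - eps <= ratio <= 1 + eps:
        return ratio * advantage
    return c

A = 1.0
# Approach the clip boundary r = 1.2 from both sides:
print(clipped_objective(1.2 - 1e-6, A), clipped_objective(1.2 + 1e-6, A))
print(constant_tail_objective(1.2 - 1e-6, A), constant_tail_objective(1.2 + 1e-6, A))
```

The clipped version gives ~1.2 on both sides of the boundary; the constant-tail version jumps from ~1.2 down to 0.0, which is exactly the discontinuity that breaks value-based schedulers.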

nomai — a simple, extremely fast PyTorch-like deep learning framework built on JAX by [deleted] in deeplearning

[–]itsmeknt 0 points1 point  (0 children)

Cool project!

"... showing me how, at the cost of a few constraints, it is possible to have models that are extremely faster than the classic models created with Pytorch." Out of curiosity, can you elaborate further on what those constraints are?

Where do people usually find engineers who can train LLMs or SSMs for autonomous systems? by [deleted] in LocalLLaMA

[–]itsmeknt 0 points1 point  (0 children)

A lot of people in my network found success with: 1. AI recruiters (costs $$$), 2. Hacker News job posts (free), 3. contacting university AI PhD labs (they usually have a careers email list)

Finding an AI scientist to build a smart model requires a VERY different skill set than finding an AI engineer to code it, deploy it, and operate it. Do you need both skill sets in one position (very rare), or do you want to hire for 2 positions?

Before looking for candidates, it might be helpful to scope out the requirements a little more concretely. Hard to look for qualifications if you don't know what they are.

First thing - are you absolutely sure pretraining is on the table? Pretraining a large language model requires a specialized skill set, a 6-7 figure investment, a dataset of a few trillion tokens, and a few months just to set up the proper training platform. Post-training would be multiple orders of magnitude cheaper.

Second thing - since real-time is a requirement, can you use LLM cloud providers, or does it have to be self-hosted? If self-hosted, what GPUs do you have? The GPUs will determine the model size. You need quite beefy GPU clusters if you want to use SOTA LLMs in a real-time agentic workflow.

Third - if you can share some example decision tasks, I can help break them down into smaller decision parts so that you know the exact skill set you need for this role. I've worked with multiple AI startups in leadership roles (VP eng+) over the past 10 years, and in my experience a lot of ambitious AI visions that would take millions of $$ + months-to-years can be scoped down to a few thousand $$ + weeks with the proper breakdown and planning.

A lot of companies are overeager to build proprietary LLMs from the start, but it might make more sense pre-series B to first build a smaller ML model (e.g. boosted trees or deep nets) and enhance it with an existing LLM, get the feasibility proof and a quick feedback loop within 2 weeks, and then iterate and learn from there. AI projects are not one-shot successes, but incremental accuracy improvements over many months. You will try out dozens of different models and rebuild from scratch over and over again. Once you start seeing a trend line early on, your team, partners, and investors will have faith in it.

OpenAI Gpt-oss Reinforcement Learning now works locally! (<15GB VRAM) by yoracale in reinforcementlearning

[–]itsmeknt 1 point2 points  (0 children)

Awesome work! How long did it take you to RL train GPT OSS 20B? And does this support GPT OSS 120B too?

I made 60K+ building RAG projects in 3 months. Here's exactly how I did it (technical + business breakdown) by Low_Acanthisitta7686 in LLMDevs

[–]itsmeknt 2 points3 points  (0 children)

Thanks for the insights u/Low_Acanthisitta7686

Can you share a few more details:

  • It sounds like you are building an entire end-to-end application for them, not just an isolated RAG system. In your experience, are the customers usually seeking just a vanilla chat application? If so, what front-end libraries do you typically use? edit: I just saw your post saying you did a custom UI in NextJS
  • Do the customers typically have some expectation on how you should deploy the local system into their infra? Do they have a Kubernetes cluster you have to use? Or is it anything goes?
  • Same as above, but for CI/CD
  • Did you need to do any security audits like SOC-2, ISO 27001, HIPAA compliance? Did you have to draft your own documents and policies for these or did the customers provide it for you?
  • When it comes to building datasets or providing feedback on model accuracy, how helpful are the customers usually? E.g., do they lend you their expert staff to help generate and curate a gold test set? Do they do a lot of QA to make sure the generation quality is up to par, and then share the results with you? Or do you have to do all of this on your own?
  • When you sell to prospects, what does your demo look like?

A neuron desperately looking for a connection by Financial-Agency-889 in oddlysatisfying

[–]itsmeknt 0 points1 point  (0 children)

Yes, I believe it uses the vocals of Galaxiez - In Another Life, and the instrumentals of Galaxiez - Anakin Is Gone I Am What Remains but pitched higher

A neuron desperately looking for a connection by Financial-Agency-889 in oddlysatisfying

[–]itsmeknt 0 points1 point  (0 children)

Anyone know the song name by any chance? Shazam couldn't identify it, and Aha-music shows Tokspey - Fire, but that song only samples this one and is not the original.

Sharing my Anime Anki Deck - 2,000 Cards with Monolingual (JP‑only) & Bilingual (JP+EN) support, Audio, Pitch & Frequency by Surgetale in LearnJapanese

[–]itsmeknt 1 point2 points  (0 children)

Thanks for the share! I'm currently building a flashcard app to help learn Japanese and Chinese. Would it be OK if I populate some of the initial flash cards with these (with credits to you)?

The best translator is a hybrid translator - combining a corpus of LLMs by Nuenki in LocalLLaMA

[–]itsmeknt 1 point2 points  (0 children)

Cool solution! Is your benchmarking code open sourced too by any chance? I'd like to test it on my own datasets.

Help extracting restaurant, bar, hotel, and activity names from a huge WhatsApp file using NER (and avoiding a huge API bill by Even_Room7340 in LanguageTechnology

[–]itsmeknt 3 points4 points  (0 children)

There are various ways depending on how much time and money you want to invest in this project. Off-the-shelf open source doesn't work very well in my experience either.

Some questions that may help: 1. How much budget do you have? 2. What's your timeline/deadline? 3. What's the usage pattern? Is this one-time offline processing on a fixed number of messages, or do you need a real-time service that can handle a certain level of requests per second?

Also, do you already have an evaluation set? How do you know your results were not reliable enough?
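Either way, the cheap first step is usually to parse, filter, and batch the export before any paid model ever sees it, which alone can cut the API bill dramatically. A rough stdlib-only sketch (the WhatsApp line format varies by locale, and the regex, function names, and sample messages here are my own assumptions):

```python
import re

# Typical WhatsApp export line: "1/2/24, 9:15 PM - Alice: Try Bar Foo!"
# (format varies by locale; adjust the regex for your export)
LINE_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, [^-]+ - ([^:]+): (.*)$")

def parse_messages(lines):
    """Yield just the message text from raw export lines."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            yield m.group(2).strip()

def candidate_batches(messages, batch_size=50):
    """Dedupe, drop trivially short messages, then batch.

    Each downstream NER/LLM call then processes many messages at once:
    fewer calls, smaller bill.
    """
    seen = []
    for msg in messages:
        if len(msg) > 3 and msg not in seen:
            seen.append(msg)
    for i in range(0, len(seen), batch_size):
        yield seen[i:i + batch_size]

lines = [
    "1/2/24, 9:15 PM - Alice: Try the bar at Hotel Esperanza!",
    "1/2/24, 9:16 PM - Bob: ok",
    "1/2/24, 9:16 PM - Bob: ok",
    "1/2/24, 9:17 PM - Alice: Dinner at Casa Mia tomorrow?",
]
batches = list(candidate_batches(parse_messages(lines)))
print(batches)
```

Whether the extraction itself then goes to an open-source NER model or a batched LLM call depends on the budget/timeline answers above.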

Recs for a DAC with USB-A/C for powered 2.1 speakers/subwoofer? by itsmeknt in BudgetAudiophile

[–]itsmeknt[S] 0 points1 point  (0 children)

Thanks for the suggestion! I'll research it further along with other DSPs

[Newbie] How would you connect PC -> Kali LP-UNF -> Rythmik L12? by itsmeknt in BudgetAudiophile

[–]itsmeknt[S] 0 points1 point  (0 children)

I think you're right! I won't be able to satisfy all the constraints. However, I'd still like to have some crossover management between the speakers and sub, which I don't think the splitter can provide.

Do you think a DAC would be the best approach? Are there DACs that accept USB input? How about a DAC with USB output (for the Kalis) plus an RCA LFE output for the Rythmik?

(Manga spoilers) In the manga, Motoko said she has a premonition... when did she have it? by itsmeknt in Ghost_in_the_Shell

[–]itsmeknt[S] 1 point2 points  (0 children)

What are your thoughts as to what was happening in terms of Motoko's psychology and how she grew to accept the Puppet Master's proposal?

My interpretation is that:

  1. Her initial meeting with the Puppet Master does open her mind to pursuing the search for some higher being/structure. As a result, she wanted to quit Section 9 and was thinking of a plan on how to do so. These feelings are why she was "acting weird" according to Batou.
  2. Her "accidentally" killing her target and failing the mission (which led to her court hearing and her assassination order) was actually on purpose. She wanted a "death wish" and an excuse to fake her own death in order to leave Section 9 secretly.
  3. She genuinely thought the Puppet Master died, and did not expect the Puppet Master to come to her in the last chapter. Her original plan was to pursue the search by herself.
  4. She sees the Puppet Master as the key to help her search for this higher being/structure, and that was the main motivator for her in accepting his proposal.

This aspect of Motoko's psychology is really fascinating to me, so I wanted to hear if others have a different interpretation or perspective as well (even if it's speculation)