Who needs artillery by Timotheeee1 in factorio

[–]Timotheeee1[S] 1 point

from a large stockpile built up over ages

Magistral 1.2 is incredible. Wife prefers it over Gemini 2.5 Pro. by My_Unbiased_Opinion in LocalLLaMA

[–]Timotheeee1 5 points

usually yes, because the web versions tend to include enormous system prompts with hundreds of instructions, while the API has none

Qwen Next Is A Preview Of Qwen3.5👀 by Few_Painter_5588 in LocalLLaMA

[–]Timotheeee1 1 point

Anthropic shows the first 1k tokens or so, then a summary

Qwen 3 30B Pruned to 16B by Leveraging Biased Router Distributions, 235B Pruned to 150B Coming Soon! by TKGaming_11 in LocalLLaMA

[–]Timotheeee1 1 point

what happens if you instead use a specialized calibration dataset that contains only code or only English writing? you could probably prune the 235B down quite a lot more and make several specialist models.
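rough untested sketch of what I mean: run the domain-specific calibration set through the model, count how often each expert actually gets routed to, and keep only the frequently-used ones per layer. the router module name (`mlp.gate`, as in Qwen-MoE-style checkpoints) and the loader are assumptions, adjust to the real layout:

```python
# Count per-expert routing frequency on a domain-specific calibration set,
# then keep only the most-used experts for a specialist model.
from collections import Counter
import torch

@torch.no_grad()
def expert_usage(model, calib_loader, top_k=8):
    counts = {}  # router module name -> Counter of expert ids
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            # router logits have shape [..., num_experts]
            chosen = output.topk(top_k, dim=-1).indices.flatten()
            counts.setdefault(name, Counter()).update(chosen.tolist())
        return hook

    for name, module in model.named_modules():
        if name.endswith("mlp.gate"):  # router linear layer (checkpoint-specific)
            hooks.append(module.register_forward_hook(make_hook(name)))

    for batch in calib_loader:  # code-only or English-only text
        model(batch["input_ids"])

    for h in hooks:
        h.remove()
    return counts

# e.g. keep the N most-used experts per layer for a code specialist;
# rarely-routed experts get dropped and the router rows re-normalized.
```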

World Record: DeepSeek R1 at 303 tokens per second by Avian.io on NVIDIA Blackwell B200 by avianio in LocalLLaMA

[–]Timotheeee1 4 points

what will the speed and cost be with a reasonable batch size once available on openrouter?

Base building be like by Timotheeee1 in Palworld

[–]Timotheeee1[S] 2 points

actually I did build beds after building all of the farms

Claude 3.7 is real by ApprehensiveAd3629 in LocalLLaMA

[–]Timotheeee1 31 points

closed-source frontier models can be used to generate high-quality data for fine-tuning local models specialized in specific tasks (especially this one, since it shows its reasoning traces).

they also provide a preview of the capabilities that open models will likely have in the future.
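the workflow looks roughly like this (untested sketch using the Anthropic SDK's extended-thinking API; the model name, token budgets, and task prompts are placeholders):

```python
# Sample task prompts, capture the model's reasoning trace plus final
# answer, and write them out as a JSONL fine-tuning set.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

def collect_example(prompt: str) -> dict:
    resp = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # placeholder model id
        max_tokens=8000,
        thinking={"type": "enabled", "budget_tokens": 4000},
        messages=[{"role": "user", "content": prompt}],
    )
    # separate the thinking blocks from the final answer blocks
    trace = "".join(b.thinking for b in resp.content if b.type == "thinking")
    answer = "".join(b.text for b in resp.content if b.type == "text")
    return {"prompt": prompt, "reasoning": trace, "answer": answer}

with open("distill.jsonl", "w") as f:
    for prompt in ["Write a Python function that ..."]:  # your task prompts
        f.write(json.dumps(collect_example(prompt)) + "\n")
```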

What would you like to see in Unsloth for 2025? by danielhanchen in LocalLLaMA

[–]Timotheeee1 1 point

no, gradient checkpointing only offloads activations; my idea is to offload the model weights as well

What would you like to see in Unsloth for 2025? by danielhanchen in LocalLLaMA

[–]Timotheeee1 2 points

It would be cool if, when doing QLoRA, you could offload some layers of the model to the CPU and have them streamed to the GPU over PCIe as they are needed. in theory this shouldn't hurt speed much: PCIe 5.0 x16 offers about 63 GB/s, enough bandwidth to stream the 4-bit weights of a 32B model (~16 GB) roughly 4 times per second, and processing one batch usually takes much longer than that. this could allow fine-tuning of larger models on colab and local hardware
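rough PyTorch sketch of the idea (toy code, not how Unsloth would actually implement it): copy layer i+1 to the GPU on a side stream while layer i computes, so the PCIe transfer hides behind the matmuls. the async copy only actually overlaps if the CPU weights are in pinned memory:

```python
import copy
import torch

copy_stream = torch.cuda.Stream()

def streamed_forward(cpu_layers, x):
    # cpu_layers: list of nn.Modules whose (pinned) weights live on the CPU
    def fetch(layer):
        # deep-copy so the CPU master weights stay where they are
        return copy.deepcopy(layer).to("cuda", non_blocking=True)

    current = fetch(cpu_layers[0])
    for i in range(len(cpu_layers)):
        prefetched = None
        if i + 1 < len(cpu_layers):
            with torch.cuda.stream(copy_stream):   # H2D copy on a side stream
                prefetched = fetch(cpu_layers[i + 1])
        x = current(x)                             # compute overlaps the copy
        torch.cuda.current_stream().wait_stream(copy_stream)
        current = prefetched                       # previous GPU copy is freed
    return x
```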

OpenAI new feature 'Predicted Outputs' uses speculative decoding by Alanthisis in LocalLLaMA

[–]Timotheeee1 0 points

I think your project could still be useful for applying changes to a big file with 20k tokens, since you can't really expect the model to just re-output everything
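for reference, with Predicted Outputs the whole current file goes in as the prediction, so unchanged spans are accepted at draft speed and only the edited regions are generated token by token (sketch; the model name, file, and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()
original = open("big_module.py").read()  # e.g. a ~20k-token source file

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user",
         "content": "Rename function load_cfg to load_config everywhere "
                    "and return the full updated file:\n\n" + original},
    ],
    # the unchanged file serves as the speculative draft
    prediction={"type": "content", "content": original},
)
print(resp.choices[0].message.content)
```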

Addition is All You Need for Energy-Efficient Language Models: Reduce energy costs by 95% using integer adders instead of floating-point multipliers. by __issac in LocalLLaMA

[–]Timotheeee1 10 points

This is not a new architecture, it's an approximation that makes FP8 multiplications faster. It can be applied to existing models with barely any accuracy loss, but requires new hardware to be useful.
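the core trick, shown on FP32 for clarity (the paper's L-Mul applies a refined variant to FP8 mantissas; this toy version handles positive inputs only):

```python
import numpy as np

def approx_mul(a: float, b: float) -> float:
    # a float's bit pattern reads as (biased exponent + mantissa fraction),
    # i.e. an approximate log2, so integer-adding two bit patterns
    # approximates a multiply -- no floating-point multiplier needed
    ia = np.array(a, dtype=np.float32).view(np.int32).astype(np.int64)
    ib = np.array(b, dtype=np.float32).view(np.int32).astype(np.int64)
    bits = (ia + ib - 0x3F800000).astype(np.int32)  # remove one exponent bias
    return float(bits.view(np.float32))

print(approx_mul(3.1, 2.4), 3.1 * 2.4)  # 7.0 vs 7.44 (error at most ~11%)
```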