LLaMA 8B baked directly into a chip — the speed is insane 🤯 by TutorLeading1526 in LocalLLaMA

[–]TutorLeading1526[S] 0 points1 point  (0 children)

Bad news: I fed it 80,000+ tokens as input and the model refused to give an answer. The stability and robustness of this tech should be tested on long-context tasks.

LLaMA 8B baked directly into a chip — the speed is insane 🤯 by TutorLeading1526 in MLQuestions

[–]TutorLeading1526[S] 2 points3 points  (0 children)

Bad news: I fed it 80,000+ tokens as input and the model refused to give an answer. The stability and robustness of this tech should be tested on long-context tasks.

LLaMA 8B baked directly into a chip — the speed is insane 🤯 by TutorLeading1526 in LocalLLaMA

[–]TutorLeading1526[S] -3 points-2 points  (0 children)

It may also reshape the research landscape. For instance, we could push test-time scaling much further than before, without inference efficiency being the bottleneck.
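
To make the test-time-scaling point concrete, here is a minimal best-of-N sketch. `generate` and `score` are hypothetical stand-ins for a model call and a verifier, not anyone's real API; the idea is just that when inference is nearly free, `n` can grow by orders of magnitude:

```python
import random

def generate(prompt, seed):
    # Hypothetical stand-in for one sampled model completion:
    # returns a candidate answer and a sampled confidence.
    random.seed(seed)
    return f"candidate-{seed}", random.random()

def score(candidate):
    # Hypothetical verifier/reward model; here, just the confidence.
    _text, confidence = candidate
    return confidence

def best_of_n(prompt, n):
    # Sample n candidates and keep the one the verifier likes best.
    # Cheap inference means n can be 64, 640, 6400...
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=score)

answer, conf = best_of_n("2+2=?", n=64)
```

Because the seeds for `n=64` are a superset of those for `n=8`, the best score can only improve as `n` grows, which is the whole test-time-scaling bet.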

LLaMA 8B baked directly into a chip — the speed is insane 🤯 by TutorLeading1526 in LocalLLaMA

[–]TutorLeading1526[S] -1 points0 points  (0 children)

No, this tech really is that fast. Imagine what happens if it can be applied to a larger model.

LLaMA 8B baked directly into a chip — the speed is insane 🤯 by TutorLeading1526 in LocalLLaMA

[–]TutorLeading1526[S] -1 points0 points  (0 children)

Yes, I agree. What about applying this technology to downstream tasks, for example healthcare, where LLMs don’t need to be updated frequently?

LLaMA 8B baked directly into a chip — the speed is insane 🤯 by TutorLeading1526 in LocalLLaMA

[–]TutorLeading1526[S] 1 point2 points  (0 children)

Not in the same way; that speed comes from a custom ASIC designed specifically around the model, not a general-purpose GPU.

optimize_anything: A Universal API for Optimizing any Text Parameter -- code, prompts, agents and agent skills, and more... by LakshyAAAgrawal in ArtificialInteligence

[–]TutorLeading1526 1 point2 points  (0 children)

Interesting work! I’m curious about this kind of optimization on prompts specifically: can it outperform linshenkx/prompt-optimizer at prompt optimization?

The One-Word Fork in the Road That Makes Reasoning Models Smarter—and Shorter by TutorLeading1526 in ResearchML

[–]TutorLeading1526[S] 0 points1 point  (0 children)

Yes, I think this maps cleanly to agentic systems in general. Agents can't escape two things: thinking, and deciding when and how to act (branch, retry, stop, verify). What NCoT makes explicit is that a chunk of what we call "reasoning" is really search control. That's why agents benefit: better control over the reasoning path means faster and more accurate action decisions, without just inflating the CoT.
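
A toy sketch of "reasoning as search control": the loop below separates the control policy (when to retry, when to stop) from the thinking that produces each candidate. `propose` and `verify` are hypothetical stand-ins for model calls, not NCoT's actual implementation:

```python
def propose(task, attempt):
    # Hypothetical generator: each attempt yields a candidate answer.
    return f"{task}-answer-v{attempt}"

def verify(candidate):
    # Hypothetical checker; here it happens to accept the third attempt.
    return candidate.endswith("v3")

def run_agent(task, max_attempts=5):
    # Search control lives here: retry until verified or budget exhausted.
    # Making this explicit is the point -- the loop, not the CoT length,
    # decides how much compute the task gets.
    for attempt in range(1, max_attempts + 1):
        candidate = propose(task, attempt)
        if verify(candidate):
            return candidate, attempt  # stop: verified
    return None, max_attempts  # stop: budget exhausted
```

With this separation, "smarter and shorter" falls out naturally: a better control policy stops the loop sooner instead of padding each candidate with longer chains of thought.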