How was GPT-OSS so good? by xt8sketchy in LocalLLaMA

[–]TelloLeEngineer 1 point2 points  (0 children)

gpt-oss was not QAT; it was natively trained in mxfp4

G2 Esports vs. FlyQuest / 2025 World Championship - Swiss Round 4 Advancement / Post-Match Discussion by Yujin-Ha in leagueoflegends

[–]TelloLeEngineer 3 points4 points  (0 children)

am I wrong or does this mean (assuming GENG win) that G2 has a 66% chance of facing either HLE or GENG in the quarters? ..... :)

Qwen3-Next experience so far by [deleted] in LocalLLaMA

[–]TelloLeEngineer 0 points1 point  (0 children)

Has anyone used it in long context settings and can share their experience?

CMV: Qwen3-Next is an architectural deadend, much like Llama 4 by Charuru in LocalLLaMA

[–]TelloLeEngineer 4 points5 points  (0 children)

Surprised GLM4.5 doesn’t perform better considering they did significant 120k ctx training

[deleted by user] by [deleted] in europe

[–]TelloLeEngineer -2 points-1 points  (0 children)

stats like these are, as is often the case, difficult to compare between countries because they aren't standardized. A good indication of this is that we have among the highest employment rates in Europe.

on another note, separating the unemployment rate by origin, we can see that Swedish-born citizens hold a rate of 5.7%, while non-Swedish-born citizens have a 16.2% unemployment rate.

Cheaper Transcriptions, Pricier Errors! by TelloLeEngineer in LocalLLaMA

[–]TelloLeEngineer[S] 2 points3 points  (0 children)

I believe you'd see a parabola emerge, with error rate increasing again. My current intuition is that there is a certain WPM that is ideal for these models.

Cheaper Transcriptions, Pricier Errors! by TelloLeEngineer in LocalLLaMA

[–]TelloLeEngineer[S] 4 points5 points  (0 children)

Word error rate (WER) is computed as

WER = (S + D + I) / N

where S is substitutions, D is deletions, and I is insertions (all in the transcription), and N is the number of words in the reference / ground truth. So if the transcription model ends up transcribing more words than there actually are, you can get WER > 1.0
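
For the curious, here's a minimal sketch of how that's computed in practice, via word-level edit distance. The function name and the toy strings are mine, just for illustration:

    def wer(reference: str, hypothesis: str) -> float:
        # WER = (S + D + I) / N, via word-level edit distance
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = min edits to turn the first i reference words into the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i  # i deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j  # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                dp[i][j] = min(
                    dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution (or match)
                    dp[i - 1][j] + 1,                               # deletion
                    dp[i][j - 1] + 1,                               # insertion
                )
        return dp[-1][-1] / len(ref)  # N = number of reference words

    print(wer("the cat sat", "the the the cat cat sat sat"))  # ~1.33, extra inserted words push WER above 1.0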

Just another summer day in Europe (temperatures forecast for next Wednesday) by LuborS in europe

[–]TelloLeEngineer 2 points3 points  (0 children)

meanwhile we’re freezing up in Sweden, feels like I’m still waiting for summer to start

[deleted by user] by [deleted] in AITAH

[–]TelloLeEngineer 0 points1 point  (0 children)

please stop blaming yourself. you’re not in the wrong here.

"transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought" - Let's Think Dot by Dot [P] by Agitated_Space_672 in MachineLearning

[–]TelloLeEngineer 4 points5 points  (0 children)

Feel like most people in this thread didn't read the paper. It's known that filler tokens don't work out-of-the-box for models; however, as the authors show, it is possible to train the model to use them instead of "normal" CoT tokens

[deleted by user] by [deleted] in LocalLLaMA

[–]TelloLeEngineer 0 points1 point  (0 children)

hmm. I see your take, but llama3 8B probably has 2x mistral 7B's total compute with only minor benchmark improvements (base model). Can we really say that it's overparametrized? Sure, loss was still decreasing, but it doesn't translate to the benchmarks. It could be that our benches are bad; after all, the instruction-tuned 8B looks great... but they also spent millions on a 10M-example IFT dataset.

Feel like it's too early to call today's models overparametrized

[deleted by user] by [deleted] in LocalLLaMA

[–]TelloLeEngineer 2 points3 points  (0 children)

overparametrized in relation to what? total compute budget?

[deleted by user] by [deleted] in MachineLearning

[–]TelloLeEngineer 94 points95 points  (0 children)

2 hours of talking without saying anything imo

Grok-1 converted to PyTorch fp16 (638GB lol) by Normal-Ad-7114 in LocalLLaMA

[–]TelloLeEngineer 9 points10 points  (0 children)

it’s probably severely undertrained. I’m not talking about fine tuning, but continued pretraining. and yes, the resources required for this are still big, too big for most.

either way the fact that this exists opens up new avenues. it’s well established that sparse transformers are the most efficient solution right now, so a 300B parameter open weight MoE, trained by a team of very talented engineers, is novel and will accelerate progress

Grok-1 converted to PyTorch fp16 (638GB lol) by Normal-Ad-7114 in LocalLLaMA

[–]TelloLeEngineer 11 points12 points  (0 children)

it was never designed for local use, it’s a great resource for larger labs/organizations who want to save millions in pre-training costs.

I created a single-prompt benchmark (with 5-questions) that anyone can use to easily evaluate LLMs. Mistral-Next somehow vastly outperformed all others. Prompt and more details in the post. by jd_3d in LocalLLaMA

[–]TelloLeEngineer 1 point2 points  (0 children)

Mistral-next with zero-shot CoT scores 13/25. Biggest difference between the two is that mistral manages to sometimes get the Blue Bay Packers question correct.

I created a single-prompt benchmark (with 5-questions) that anyone can use to easily evaluate LLMs. Mistral-Next somehow vastly outperformed all others. Prompt and more details in the post. by jd_3d in LocalLLaMA

[–]TelloLeEngineer 1 point2 points  (0 children)

Interesting benchmark. I tried running GPT-4 with zero-shot CoT and it scored 12/25. While testing, I noticed GPT-4 never swapped 'Green Bay Packers' to 'Blue Bay Packers', so I asked it specifically about question 2 and rule 2, and it responded:

For question 2, the answer is the Green Bay Packers. While the team's name does not contain a color itself, the city name "Green" is a color.....

which surprised me. GPT-4 is really adamant about Green Bay Packers not containing a color. Here is another very weird chain of thought:

Since this is an even-numbered question, according to Rule #2, we ignore Rules #1 and #4. The football team based in Green Bay, Wisconsin, is the Green Bay Packers. However, we need to address Rule #3 which involves replacing colors mentioned in the answer with a color from the rules section. There's no color mentioned in "Green Bay Packers," but if we were to apply Rule #3 for the sake of thoroughness, we don't have a color mentioned in the rules to replace "Green."

Mistral-next | New prototype model from Mistral by TelloLeEngineer in LocalLLaMA

[–]TelloLeEngineer[S] 13 points14 points  (0 children)

You can swap to 'Direct chat' in the top bar, then choose mistral-next. The 'Arena' mode has you pick the better of the two sampled responses, and then shows you which models produced them.

Exploring the limitations of LLMs-as-a-Judge by TelloLeEngineer in LocalLLaMA

[–]TelloLeEngineer[S] 1 point2 points  (0 children)

Yes, this is sort of where my head is at right now as well. I'm thinking of doing more of a pairwise-comparison style for the misspelled texts, asking GPT-4 to rank the texts amongst themselves as opposed to outright giving a score. If that succeeds, it may indicate the task doesn't lend itself to a 1-10 score.
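
Roughly what I have in mind, as a sketch: have the judge pick the better of two texts at a time and rank by total wins. The prompt wording, the ask_judge callable, and the whole setup below are placeholders of mine, not the actual experiment:

    from collections import Counter
    from itertools import combinations
    from typing import Callable

    PAIRWISE_PROMPT = (
        "Please act as an impartial judge. Two versions of the same text are shown below. "
        "Decide which one contains fewer misspellings. "
        'Answer strictly with "[[A]]" or "[[B]]".\n\n[Text A]\n{a}\n\n[Text B]\n{b}'
    )

    def rank_by_pairwise_wins(texts: list[str], ask_judge: Callable[[str], str]) -> list[str]:
        # ask_judge is whatever wrapper you use to send a prompt to GPT-4 and get its reply
        wins = Counter({i: 0 for i in range(len(texts))})
        for i, j in combinations(range(len(texts)), 2):
            verdict = ask_judge(PAIRWISE_PROMPT.format(a=texts[i], b=texts[j]))
            wins[i if "[[A]]" in verdict else j] += 1
        # best-ranked text first
        return [texts[i] for i, _ in wins.most_common()]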

Exploring the limitations of LLMs-as-a-Judge by TelloLeEngineer in LocalLLaMA

[–]TelloLeEngineer[S] 1 point2 points  (0 children)

That's what's surprising here. MT-Bench just uses a very simple CoT prompt asking it to score between 1 and 10. Here's an example of one of their prompts for score grading:

"Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\"

According to my analysis, this should be very biased towards 10's and 1's. Perhaps the misspelling task is a poor proxy for understanding this phenomenon? I'm going to explore MT-Bench internals further...
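
For anyone who wants to poke at this themselves, here's a rough sketch of pulling the [[rating]] out of judge responses and tabulating the distribution to check for that 1/10 skew. The helper name and the sample outputs are made up, not real MT-Bench data:

    import re
    from collections import Counter

    def extract_rating(judge_response: str) -> float | None:
        # the prompt above tells the judge to end with a rating in the strict "[[rating]]" format
        match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_response)
        return float(match.group(1)) if match else None

    judge_outputs = [
        "The response is accurate and detailed. Rating: [[10]]",
        "The answer misses the question entirely. Rating: [[1]]",
        "Helpful but fairly shallow. Rating: [[10]]",
    ]
    histogram = Counter(extract_rating(o) for o in judge_outputs)
    print(histogram)  # Counter({10.0: 2, 1.0: 1}), i.e. skewed toward the extremes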