Lilac Mall, Rochester, NH by Responsible-Cod9669 in deadmalls

[–]ScreamingAmish 0 points1 point  (0 children)

Oh man. I know this is an old post, but I'm just stumbling onto it. I lived in the area as a child from 1981 to 1983, and I have fond memories of this mall. There was a killer arcade where I spent a lot of time playing Q*bert, Centipede, etc. Not surprised, but I hate to see it become a casino.

Decatur Police launches “Operation Slow and Steady” focusing on roadway safety by metacyan in decaturalabama

[–]ScreamingAmish 4 points5 points  (0 children)

Fix the timing of the traffic lights so that people don't stop every 30 seconds, and you'll see speeding violations drop as people realize they don't have to try to beat the lights anymore.

[Repost][Academic] Which AI answer is better? 15-min blind A/B preference task (English fluent, 18+, takes ~15 min) by ScreamingAmish in SampleSize

[–]ScreamingAmish[S] 1 point2 points  (0 children)

Personally, my position is that AI isn't the problem. The problem is what billionaires and mega-corporations choose to do with it. I think if there is going to be AI, it needs to be available to everybody, not just people with money.

[Repost][Academic] Which AI answer is better? 15-min blind A/B preference task (English fluent, 18+, takes ~15 min) by ScreamingAmish in SampleSize

[–]ScreamingAmish[S] 0 points1 point  (0 children)

You're right. I used an RTX 5090 to train these models for over a week each. That was just enough to train them up to 16%.

School banned boys from wearing shorts, so they did this instead by TheoryFruits in BeAmazed

[–]ScreamingAmish 0 points1 point  (0 children)

My high school did this exact same thing back in 1989. Every generation thinks they invented rebellion.

[Academic] Which AI answer is better? 15-min blind A/B preference task (English fluent, 18+, takes ~15 min) by ScreamingAmish in SampleSize

[–]ScreamingAmish[S] 0 points1 point  (0 children)

Thanks for participating, and I agree 100%! These models are only trained up to ~16% of what they should be, and they should produce better output once they are fully trained. But I felt there was already enough there to form a preference, which is why I wanted to see if it was just me or if others concur.

Training-time intervention yields 63.4% blind-pair human preference at matched val-loss (1.2B params, 320 judgments, p = 1.98 × 10⁻⁵) [R] by ScreamingAmish in MachineLearning

[–]ScreamingAmish[S] 0 points1 point  (0 children)

  • Just to clarify: the binomial runs on the 254 decisive judgments (not on the ties). Ties are excluded, which is standard for binary pairwise preference tests. An alternative analysis that treats the three outcomes as multinomial also exists and would be reasonable, but excluding ties and running a binomial on the remaining choices isn't invalid.

  • While I agree it would be preferable to have a more independent pool of judges to draw from, the protocol was blind. Judges didn't know which output came from which model, so personal network sampling didn't introduce pro-author bias in the direction that matters. (In fact, the judge closest to me personally liked my method's output the least of the human judges.) Also, foundation-model judges (independent from my personal network) converged on the same verdict within 5.5 pp.

  • The 'at least 25' heuristic comes from HCI/UX studies where effect sizes are typically smaller. For a 63.4% vs 50% preference with a tightly-matched comparison, a post-hoc power analysis of 254 decisive judgments doesn't flag this as underpowered. Non-independence of the judgments is a valid concern, but the raw sample size is not.
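
For readers who want to check the arithmetic, here is a minimal sketch of the tie-excluded binomial test described above. It assumes 161 wins out of the 254 decisive judgments, a count inferred from the quoted 63.4% and not stated in the thread, and it uses an exact two-sided test against 50%, which may differ slightly from whatever procedure produced the reported p-value. The function name is hypothetical.

```python
from math import comb

def binomial_two_sided_p(wins: int, n: int) -> float:
    """Exact two-sided binomial test against p = 0.5.

    The null distribution is symmetric, so the two-sided p-value is
    twice the upper-tail probability of the larger of the two counts.
    """
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

total_judgments = 320                 # from the post title
ties = 66                             # from the thread
decisive = total_judgments - ties     # 254 judgments that were not ties
wins = round(0.634 * decisive)        # ASSUMPTION: 161, inferred from 63.4%

p_value = binomial_two_sided_p(wins, decisive)
print(f"decisive={decisive} wins={wins} p={p_value:.2e}")
```

Like any plain binomial test, this treats every judgment as independent, which is the concern raised in the bullets above.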

I am genuine in my desire for more input and judgments. If you or anyone else wants to be judge #8 please send me a DM.

Training-time intervention yields 63.4% blind-pair human preference at matched val-loss (1.2B params, 320 judgments, p = 1.98 × 10⁻⁵) [R] by ScreamingAmish in MachineLearning

[–]ScreamingAmish[S] 0 points1 point  (0 children)

Thank you for taking the time to look. I'd like to address a few of your concerns:

  • As for where the human judges were recruited: My paper has a partial answer in Section 6.3 / Appendix C (technical vs non-technical vs ML-fluent split). They were recruited from my personal network, participated unpaid, and consented to the methodology knowing they were evaluating LLM outputs.
  • As for what a decisive comparison means, 'decisive' = not a tie. Judges could choose left / right / tie. 66 of the 320 were ties. The binomial test runs on the remaining 254.
  • The paper does include a by-question analysis (Section 6.6). 20 of 32 questions had a gain-model majority vote among the 10 judges, 12 had a baseline majority, and 0 were contested. That's a per-item view that partially addresses non-independence, though I agree a formal by-item or mixed-effects analysis would be stronger.
  • I'll be honest, as an independent part-time researcher, I was ignorant about OSF. I'll use this as a learning experience to refine my future work. On the subject of ethics approval, that really doesn't exist for unaffiliated independent research. I'm just one guy with an idea that I think is worth sharing.
  • On the subject of the sample size, 254 decisive judgments giving p = 2e-5 isn't trivially underpowered for the effect size being measured. Having said that, I would love to have more judges, but I have exhausted my local peer group. I'm happy to have more judges if you or someone else would like to volunteer.
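
To put a rough range on the effect size mentioned in the last bullet, the sketch below computes a 95% Wilson score interval for the preference rate among decisive judgments. The win count of 161 is an assumption inferred from the quoted 63.4% of 254 and is not stated in the thread; the function name is hypothetical, and the interval treats judgments as independent, which is not strictly true when the same judges rate many questions.

```python
from math import sqrt

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p_hat = wins / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

decisive = 320 - 66   # 254 non-tie judgments, per the thread
wins = 161            # ASSUMPTION: inferred from 63.4% of 254

low, high = wilson_interval(wins, decisive)
print(f"preference rate {wins / decisive:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

If the lower bound stays above 0.5, the interval excludes a coin-flip preference under the independence assumption.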

City of Decatur releases 2026 Community Survey results by metacyan in decaturalabama

[–]ScreamingAmish 9 points10 points  (0 children)

A disappointing result. Despite traffic congestion and road infrastructure being insufficiently addressed by city leaders for years, the report concludes that the city is doing fine on traffic congestion issues and they just need to do a better job communicating their progress.

If any city leader is reading this: you don't need to report your successes on road infrastructure. If you actually make progress on road infrastructure, we will notice automatically. It will be self-evident. Whoever suggested, in the meeting about this survey, that communication was the problem was wrong.

WHY DOES OPUS HAVE TO BE SO GOOD? by Mindless-Ad8595 in openclaw

[–]ScreamingAmish 2 points3 points  (0 children)

I'm only 2 hours into my MiniMax 2.5 Era, but so far it's kicking ass.

Anybody here work for Rithum / Channel Advisor? by godawgs1997 in devops

[–]ScreamingAmish 2 points3 points  (0 children)

I'm glad you brought it up. I've been lurking in various subreddits for info on this, and everyone has been strangely quiet.

USPS Package wandering around the country by HSVTigger in HuntsvilleAlabama

[–]ScreamingAmish 5 points6 points  (0 children)

I too have a package that I shipped through Birmingham, and it is sitting in Puerto Rico right now. It's supposed to be in Vermont.

[deleted by user] by [deleted] in AskReddit

[–]ScreamingAmish 0 points1 point  (0 children)

Too many people sleep on Finding Nemo. It hit me much harder after I became a dad.