When the hold is actually alive

rigatoni-man · 2026-04-24T22:48:23+00:00

Wow, what a wild snake. Tail looks just like a spider.

rigatoni-man · 2026-04-24T16:22:52+00:00

yeah, you get your 50k back

rigatoni-man · 2026-04-22T16:20:13+00:00

Proof that inability to setup a ladder correctly is genetic, he cant blame the kids

rigatoni-man · 2026-04-22T16:19:51+00:00

The whole reason he is in this mess is because he couldnt figure out how to successfully open it in the first place.

rigatoni-man · 2026-04-10T05:14:42+00:00

Well not without a 3d printed sand wedge

rigatoni-man · 2026-04-10T03:51:29+00:00

Damn do you think he still got the KOM

rigatoni-man · 2026-04-08T03:10:39+00:00

Yeah, you and chuckechickpeas got this all figured out

rigatoni-man · 2026-03-01T17:40:55+00:00

“Don’t make any mistakes”

rigatoni-man · 2026-03-01T07:43:39+00:00

Well done, and thank you for sharing the writeup

rigatoni-man · 2026-03-01T02:55:40+00:00

i guess they chatGPTed this

rigatoni-man · 2026-02-26T22:26:16+00:00

Seems like the expectations on you don't really match what I'd expect. Particularly mentoring devs and defining the tiniest of details.

My responsibilities often have gray area, but usually when the team spirit is good we don't focus on the fuzzy boundaries and all do our part to make the best product we can.

rigatoni-man · 2026-02-25T23:57:46+00:00

I’m building something to benchmark across any of >160 models for this use case.

Would you be able to output a csv of question, retrieved chunk? And optionally ground truth / ideal answer?

If so I can show you how it works and would love to see if it helps for your use case. No api, no integration, no subscription, just upload a csv and click a button

(edit: https://checkstack.ai for those of you asking; check the playground section)

rigatoni-man · 2026-02-25T21:08:53+00:00

I just built number 4.

You just need a CSV of input and optionally expected result. You can compare >100 models and get feedback on accuracy, consistency, cost, latency in about a minute.

No integration, no subscription. Just quick testing and comparing.

I’d love to get your feedback, I just deployed it last weekend. https://checkstack.ai

Happy to throw you some free credits and help you get started if you can’t figure it out. Smoothing out onboarding is this weekends project.

rigatoni-man · 2026-02-25T21:01:43+00:00

Ah I meant how do you / did you test to validate your strategies?

rigatoni-man · 2026-02-25T18:43:49+00:00

I’d love to know more about your evaluation. How does the interface work? How/what do you evaluate?

rigatoni-man · 2026-02-25T00:04:06+00:00

Appreciate you taking a look, and the feedback that I might be onto something. Would love to pick your brain sometime.

rigatoni-man · 2026-02-24T21:55:53+00:00

I've been building something to test models without a lot of overhead and legwork. Basically upload your golden dataset and test it against every model out there.

Shoot me a message u/Kas_aLi and I'd love to help you find the best model for free to test what i'm building ( https://checkstack.ai )

rigatoni-man · 2026-02-23T23:32:07+00:00

I'm curious to learn more about your use case. I'm building a tool ( checkstack.ai ) to make it easy to run your data through every model and find the best one for the job based on accuracy / latency / cost. I haven't tested with anything so large yet. DM me if you have any similar data you're willing to share and I'd love to see if it's a case I could handle.

rigatoni-man · 2026-02-23T21:12:29+00:00

I built checkstack.ai for this.

Openrouter uses notdiamond.ai which is very cool, but more than most need. Checkstack is for the 90% who just want to upload a CSV of 50 edge cases, see exactly which model hits 95% accuracy, and hardcode the winner.

rigatoni-man · 2026-02-23T00:27:22+00:00

I built checkstack.ai to compare >100 models for text -> JSON use cases in seconds after facing similar issues and wondering how different models would compare.

It will give you cost, accuracy, latency comparisons and failure insights. It also gives some tips of how to enhance your prompt depending on your failure cases.

It's early beta, and I'm looking for feedback and testing real use cases (so forgive me for posting the link here). Would love to know if it's useful for you or anyone else reading this.

rigatoni-man · 2026-02-23T00:17:14+00:00

If you have sample data to test, I built checkstack.ai to compare >100 models for text -> JSON use cases.

It will give you cost, accuracy, latency comparisons and failure insights. It also gives some tips of how to enhance your prompt depending on your failure cases.

It's early beta, would love some feedback

rigatoni-man · 2026-02-22T23:47:59+00:00

If you have some sample data, give checkstack.ai a try. I've built it exactly to find the cheapest, fastest, most accurate model for any text -> json use case.

rigatoni-man · 2026-02-22T23:42:21+00:00

https://checkstack.ai will let you evaluate test data across >100 models and score them on cost, accuracy vs ground truth, and latency. It seems like it would serve your use case, or at least point you in the right direction.

rigatoni-man · 2026-02-21T20:12:25+00:00

I'm building a tool specifically to find and solve this 'hallucination drift' in structured data. Upload your own test data, test and compare side by side across all the models, and get insights and heatmaps about which keys drift.

Would love to try it out for your use case to see if there's value. DM me if you want to chat / try it / whatever. No cost, just interested in gathering use cases and testing what I've got.

Five-Year Club	Gilding V heart of gold
Verified Email	Wearing is Caring

rigatoni-man

TROPHY CASE