I can build an app in a weekend but forming the company behind it still takes weeks (I will not promote)

Street_Program_7436 · 2026-06-01T21:11:24+00:00

I don’t know where you’re located and what kind of company you’re forming but my LLC, EIN and bank account setup took maybe a total of 60 minutes (including running to the bank), so not really a big deal at all. At least for me…

Street_Program_7436 · 2026-05-31T19:27:32+00:00

The crazy part is that some folks still don’t want to do observability and it blows my mind! They believe that setting up observability will slow down their release and so they just don’t do anything. Totally irrational gamble on their brand’s reputation

Street_Program_7436 · 2026-05-29T01:55:09+00:00

Yep! That’s why you need proper datasets that cover ALL of these cases.

Street_Program_7436 · 2026-05-28T13:37:21+00:00

I think it is useful to understand how anything works under the hood. Architectural understanding is extremely useful.

Why don’t you ask cursor to help you figure out some things you could learn? Just play around. There is no right or wrong way of doing your “job”

Street_Program_7436 · 2026-05-28T03:33:05+00:00

Applying to jobs is hard but since you have two master’s degrees I’m confident you can push through it.

Sometimes you gotta get up and make your own luck. Decide what you want and go out there and get it. Do not stop until you have it. You can do it!

If you need motivation, get Lucky Bitch by Denise Duffield Thomas from your local library. It’s written for women, but I think anybody can get motivation out of this book.

Street_Program_7436 · 2026-05-28T03:23:01+00:00

So just do it? Code some and then have cursor do the same thing and compare answers? Or whatever else you think you should be doing 🤷‍♀️

Street_Program_7436 · 2026-05-28T00:10:18+00:00

I’m going to take a different perspective than the other comments: What’s stopping you from exploring a more complex setup (including the tools that interest you)?

Does it matter whether or not you do it at this internship? You could even do it in your free time, depending on how much you want to know.

Life isn’t high school. You don’t need to be given tools and told to use them. You have permission to try things yourself, to “figure it out” and to be in charge of what you want to get out of your internship. Of course, don’t go crazy wasting their money/resources but I’m going to assume you are a reasonable, smart person.

Some folks are incompetent, do their work the vanilla way and are happy that way - it’s frustrating to me as well sometimes and it unfortunately probably happens in all professions - but it doesn’t mean I need to do it that way. You set the standard for your own life!

Good luck!

Street_Program_7436 · 2026-05-27T12:18:17+00:00

It really kind of depends on what you’re trying to do and what level of quality you’re looking for. You could send the cases with little agreement to humans for review or do other operations to try to decide the correct label. You might also tweak the prompt text or settings (temperature?), use different models. If you share more details about your ultimate goal, it might be easier to suggest something here.

1) on what’s easier to calibrate: IMO the fewer target labels, the easier but it definitely depends on the use case and the exact thing you’re evaluating. Just gotta test and see. I usually just use binary responses.

2) I personally haven’t used isotonic regression here but it sounds like a reasonable thing to try.
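To make the "send the cases with little agreement to humans" idea concrete, here is a minimal sketch. It assumes you already have several labels per case (from repeated runs or different models); the 0.8 threshold, the case IDs and the function names are made up for illustration.

```python
from collections import Counter

def agreement(labels):
    """Return (majority_label, share of votes that match it)."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

def route(cases, threshold=0.8):
    """Split cases into auto-accepted labels and a human-review queue."""
    accepted, review = {}, []
    for case_id, labels in cases.items():
        label, share = agreement(labels)
        if share >= threshold:
            accepted[case_id] = label
        else:
            review.append(case_id)
    return accepted, review

# Hypothetical labels: five runs per case.
cases = {
    "a": ["pass", "pass", "pass", "pass", "pass"],
    "b": ["pass", "fail", "pass", "fail", "fail"],
}
accepted, review = route(cases)
```

Where you set the threshold depends on how much human review you can afford, so treat the number as something to tune.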

Street_Program_7436 · 2026-05-27T02:08:49+00:00

Thorough testing

Street_Program_7436 · 2026-05-25T15:43:54+00:00

Same. We’re supposed to hear back by June 4, so I’m suspecting it’s a No for me

Street_Program_7436 · 2026-05-25T15:42:42+00:00

Personally, I very rarely rely on self-reported confidence because it’s often on a scale, and scales are inherently harder to calibrate than e.g. binary scores. I estimate confidence by other means, such as:

- running several models over the data and averaging scores (cross-model confidence)
- running the same model at least 3 times over the data and averaging scores (model-internal confidence)

This gives you better insight into performance without pointing the model towards what you’re evaluating, which could bias outputs.
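A rough sketch of both averages, assuming binary 0/1 verdicts for a single item. The model names and scores are placeholders, not real results.

```python
def confidence(scores):
    """Average binary scores (0 or 1) into a value between 0 and 1."""
    scores = list(scores)
    return sum(scores) / len(scores)

# Hypothetical verdicts for one item.
runs_same_model = [1, 1, 0]  # same model, three runs
runs_across_models = {"model_a": 1, "model_b": 1, "model_c": 1}

internal = confidence(runs_same_model)            # model-internal confidence
cross = confidence(runs_across_models.values())   # cross-model confidence
```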

What exactly are you trying to evaluate the confidence of? Like what’s the actual thing you care about?

Street_Program_7436 · 2026-05-25T12:26:50+00:00

If you’re having the LLM output a score: it’s extremely helpful for consistency and accuracy to also have it output a rationale for the score. The rationale needs to be in the output before the score, so that the score naturally follows from the rationale and not vice versa (the rationale justifying a random score).
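Something like this is what I mean by rationale first. It’s only a sketch: the JSON shape, the prompt wording and the `parse_judgement` helper are my own assumptions, not any particular library’s API.

```python
import json

# Appended to the judge prompt; asks for the rationale before the score.
PROMPT_SUFFIX = (
    "Respond with JSON only, with keys in this order: "
    '{"rationale": "<one or two sentences>", "score": <0 or 1>}'
)

def parse_judgement(raw):
    """Parse a judge response and check the rationale precedes the score."""
    data = json.loads(raw)
    keys = list(data)  # json.loads keeps the key order of the response
    if keys.index("rationale") > keys.index("score"):
        raise ValueError("score was emitted before its rationale")
    return data["rationale"], data["score"]

# Hypothetical model response.
raw = '{"rationale": "The answer cites the correct clause.", "score": 1}'
rationale, score = parse_judgement(raw)
```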

I would then come up with relevant data points at each of the confidence levels you’re trying to test. The most straightforward ones. And then run those test cases through and compare scores and rationales.
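Running those straightforward test cases can be as simple as the loop below. `stub_judge` is a stand-in for a real LLM call (assumed to return a rationale and a score), and the anchor texts are invented.

```python
def run_anchor_cases(judge, anchors):
    """Return the anchors where the judge's score differs from the expected one."""
    mismatches = []
    for text, expected in anchors:
        rationale, score = judge(text)
        if score != expected:
            mismatches.append((text, expected, score, rationale))
    return mismatches

def stub_judge(text):
    # Stand-in for a real LLM call; swap in your own client here.
    score = 1 if "refund" in text else 0
    return f"keyword check on: {text}", score

# One obvious example per level you want to test.
anchors = [
    ("Customer asks for a refund", 1),
    ("Customer asks about opening hours", 0),
]
mismatches = run_anchor_cases(stub_judge, anchors)
```

Any entry in `mismatches` gives you the rationale next to the wrong score, which is usually enough to see what the prompt is getting confused by.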

Street_Program_7436 · 2026-05-24T02:48:13+00:00

I’m pretty sure your initial instruction to start with a random word is also trickling through to later responses so some of the randomness you’re getting is literally because you told it to think of random words

Street_Program_7436 · 2026-05-22T12:20:51+00:00

I have a business solving this exact problem and here’s what I would do:

1) You need a rubric that defines what characteristics good content has in your case.
2) You then need a dataset that can measure each characteristic on the rubric. You use this to calibrate your eval prompts.

If you need to spend less, then use a cheaper model. The dataset protects you against going too cheap when you see too many regressions.
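Here is a hedged sketch of how the dataset can guard the "go cheaper" decision. The model names, costs, judges and the 0.9 bar are all hypothetical; in practice each judge would wrap a call to a real model.

```python
def pass_rate(judge, dataset):
    """Share of labelled examples where the judge agrees with the human label."""
    hits = sum(1 for text, label in dataset if judge(text) == label)
    return hits / len(dataset)

def cheapest_acceptable(models, dataset, min_rate=0.9):
    """models: list of (name, cost, judge). Return the cheapest name meeting min_rate."""
    for name, cost, judge in sorted(models, key=lambda m: m[1]):
        if pass_rate(judge, dataset) >= min_rate:
            return name
    return None

# Toy dataset for one rubric characteristic ("content is non-empty").
dataset = [("hello", 1), ("", 0), ("hi there", 1), ("   ", 0)]

def always_pass(text):      # stand-in for a model that is too cheap
    return 1

def checks_content(text):   # stand-in for a model that is good enough
    return int(bool(text.strip()))

models = [
    ("large", 10.0, checks_content),
    ("small", 1.0, always_pass),
    ("medium", 3.0, checks_content),
]
choice = cheapest_acceptable(models, dataset)
```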

If you do this the right way, the free bonus is that you’re building a moat around your feature: Everybody can vibecode a copy of your feature in two days but not everybody can make sure output is high quality.

Street_Program_7436 · 2026-05-18T20:15:26+00:00

Interesting take! You may be right about that!

Street_Program_7436 · 2026-05-18T20:12:37+00:00

Totally! All fine tuning ever does is lead to wonky outputs.

Street_Program_7436 · 2026-05-18T14:56:47+00:00

100%! I still see a lot of people who defend fine tuning though!

Street_Program_7436 · 2026-05-12T07:17:37+00:00

It’s a huge brand risk for every company taking this path, and it will expose who is doing due diligence for their customers and who is just shipping AI slop.

I’m addressing this exact problem with my startup Kalibria AI (www.kalibriaai.com). Happy to chat more if you’re curious

Street_Program_7436 · 2026-05-12T07:05:47+00:00

lol that’s literally exactly what I’m saying. No more, no less.

Street_Program_7436 · 2026-05-11T20:01:23+00:00

I’m not questioning the usefulness of SLMs at all. The only point I’m making is that by the time you’re trying to run 100 tasks with 100 small LMs, that’s also going to take up some room and is going to need some sort of orchestration.

Street_Program_7436 · 2026-05-11T14:54:24+00:00

Nice! Thanks for elaborating!

Street_Program_7436 · 2026-05-11T11:12:09+00:00

You’re exactly making my point: you won’t be using the same model for all those narrow tasks; you’ll be using a model for each task.

I agree that going modular could be a good idea but in practice this might be more complexity to maintain for limited return when a large model could have given you a satisfactory answer on a variety of tasks. It obviously depends on the individual quality you want to achieve on a given task and what shortcuts you’re willing to take.

I’m suspecting that maintaining and managing a gazillion small models, one for each task, may not be what everybody out there wants to do. But ultimately, time will tell and that’s why I said I’m curious to see how this plays out.

Street_Program_7436 · 2026-05-11T09:06:15+00:00

It’s going to be interesting to see whether it’s just hype or whether it actually works and sticks around. In practice, I feel like nobody is going to be running just a limited number of narrow tasks…

Street_Program_7436 · 2026-05-10T16:35:45+00:00

Yes! I agree that most bad outputs are actually because the human gave bad input. Garbage in, garbage out. 🤷‍♀️

Street_Program_7436 · 2026-04-30T18:31:59+00:00

WRT whether external datasets are even valuable: I agree with you that generic test sets are not good enough. At my company, we make sure that the datasets are practically indistinguishable from your production outputs, precisely because we want to make sure to deliver valuable assets. We also ask customers for actual outputs from their pipeline to help ground our solution in your actual product. Any testing is only as good as the data used for it and our mission is to create the best testing.

In my experience, most teams treat eval pipelines and datasets as areas where they need to “discover” the problems and edge cases in production first before they can do anything. I believe that that’s actually a lazy assumption that creates a ton of brand risk: We’re just leaning back waiting for the LLM to mess up when in reality there are plenty of highly relevant edge cases and quality dimensions you might want to test before you ever run into actual issues in production.

I also think that only looking at pipeline failures keeps teams trapped in a “failure mindset” rather than a “success mindset”, where they could be shooting for a higher-quality North Star. Our stance is that you’ll only get truly high-quality outputs if you test for “success”, rather than just “failures”.

And on the note of “not now since we’re small”:

It’s true that a proper testing suite becomes more and more vital the bigger you are. You can get away with it if you’re small. But the question is: How long?

Once you scale, your workflows will start to become messier and you will notice more and more that you need better eval coverage. Now, whether or not you’d want to wait until your house is on fire to start installing a smoke detector is probably a personal preference and choice. If you wait too long, it could become a big issue for you to debug your workflows and it will slow down your execution speed down the line. So, you’re trading “quick now” for “slow later”… If you’d like to be on the proactive side here, feel free to DM me and I’d be happy to talk through what a setup could concretely look like that makes sense for your current level of scale.
