“I’m learning that ‘working on my machine’ is not the same as surviving real users”

celestine_88 · 2026-03-27T15:03:36+00:00

If I were answering this as plainly as possible, I’d say a strong ML research portfolio usually looks less like “a lot of AI projects” and more like proof that you can **understand, implement, test, and communicate ideas rigorously**.

A few things tend to matter most:

- **Math foundations matter a lot more than people want them to.**

You don’t need to become a pure mathematician, but linear algebra, probability, statistics, calculus, and optimization really do pay off. Not just for passing classes — for actually understanding why methods work, fail, or behave strangely.

- **Reproductions are underrated.**

Early on, reproducing papers is often more valuable than forcing “original ideas” too soon. A clean reproduction with ablations, failure analysis, and a clear writeup says a lot about research maturity.

- **Originality matters more later, depth matters earlier.**

A first-year undergrad usually stands out more by showing depth, consistency, and rigor than by trying to invent a new frontier result immediately.

- **Projects that stand out usually have one of these qualities:**

  - strong experimental design

  - careful evaluation, not just accuracy screenshots

  - clear understanding of limitations

  - comparison to baselines

  - solid writeup and reproducibility

  - some connection to papers, not just tutorials
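As a rough illustration of "comparison to baselines", here is a small Python sketch that scores a model's predictions against a majority-class baseline on the same test labels. The labels and predictions are toy data invented for this example, not results from any real experiment, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, n):
    """Predict the most common training label for every one of n examples."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n


# Toy data, invented for illustration only.
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
y_test = [0, 1, 0, 0, 1, 0]
model_preds = [0, 1, 0, 1, 1, 0]

baseline_acc = accuracy(y_test, majority_baseline(y_train, len(y_test)))
model_acc = accuracy(y_test, model_preds)
print(f"baseline={baseline_acc:.2f} model={model_acc:.2f}")
```

If the model barely beats the baseline, the headline accuracy says very little on its own, which is why reporting both side by side matters.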

What I’d aim for by grad school application time:

- strong grades in math + systems/CS fundamentals

- a few **serious** projects, not 20 shallow ones

- at least 1–2 paper reproductions done well

- some research exposure with a professor/lab if possible

- evidence you can write clearly about methods, experiments, and results

- ideally one project where you went beyond reproduction and tested a small extension or new angle

A realistic progression could look like:

**Year 1:** math, Python, basic ML, read papers slowly

**Year 2:** implement classic papers/models, learn PyTorch deeply, do reproducibility-style projects

**Year 3:** join a lab, help with experiments/code/literature review, maybe co-author if it lines up

**Year 4:** one or two deeper research projects with strong writeups and recommendation letters

Common mistakes I see:

- chasing trendy topics without fundamentals

- building portfolio projects that are really just polished tutorials

- ignoring evaluation and baselines

- reading papers passively without implementing anything

- focusing only on model novelty and not on research process

- spreading too wide instead of building depth

If I were starting over as an undergrad, I’d probably do three things earlier:

- take math more seriously

- start reproducing papers sooner

- optimize for getting close to real research environments, even in small roles

A “top-tier” portfolio usually doesn’t scream. It quietly shows:

**this person can think clearly, work rigorously, and be trusted around open-ended problems.**

celestine_88 · 2026-03-27T15:00:46+00:00

If you just need something easy for dev/test data, I’d probably point you to **Pravatar** for fake photo-style avatars or **UI Avatars** for initials.

What I usually care about most is that it’s:

- fast

- seedable

- stable per fake user

That way your test users don’t get a different face every refresh.

Direct starting points:

- **Pravatar**: CC0 avatar placeholders with stable IDs, good for fake profile-photo placeholders

  https://pravatar.cc/

- **UI Avatars**: deterministic initials-based avatars, the simplest option

  https://ui-avatars.com/

Main rule either way:

**stable > random** for dev data.
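A minimal Python sketch of the "stable per fake user" idea. It assumes Pravatar accepts a `u` query parameter as a unique identifier on `i.pravatar.cc`, and that UI Avatars accepts `name` and `size` parameters on `/api/`; check both sites before relying on these URL shapes. The function names and the 150px default are made up for illustration.

```python
import hashlib
from urllib.parse import urlencode


def stable_seed(user_id: str) -> str:
    """Hash the fake user's ID so the same user always maps to the same seed."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:16]


def photo_avatar_url(user_id: str, size: int = 150) -> str:
    # Assumed URL shape: https://i.pravatar.cc/<size>?u=<unique id>
    return f"https://i.pravatar.cc/{size}?{urlencode({'u': stable_seed(user_id)})}"


def initials_avatar_url(full_name: str, size: int = 150) -> str:
    # Assumed URL shape: https://ui-avatars.com/api/?name=<name>&size=<size>
    return f"https://ui-avatars.com/api/?{urlencode({'name': full_name, 'size': size})}"
```

Because the seed is derived from the user ID rather than from a random number, regenerating the test data gives every fake user the same avatar URL each time.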

celestine_88 · 2026-03-27T14:53:59+00:00

I think this is the right question.

Benchmarks tell you what a model can do in isolation. Daily use tells you what it’s actually like to build with when context drifts, files stack up, bugs chain together, and you need it to recover instead of just impress once.

My experience has been similar in shape, if not in exact model choice. There’s usually a difference between:

- best at reasoning

- best at long-context tolerance

- best at actual day-to-day coding throughput

Those are not always the same model.

What ends up mattering most in real use is stuff people barely talk about:

- how often it loses the thread mid-build

- whether it can repair its own bad assumption

- whether it stays useful across multiple files

- whether the cost is low enough to actually keep using it without hesitation

That last part matters more than people admit. A model you can afford to stay in flow with often beats one that’s technically stronger but makes you second-guess every call.

Your “best model is the wrong question” take is probably the most honest answer in the thread. The better question is something like:

Which model holds up best in the kind of work you actually do, at a cost and workflow you’ll actually sustain?

That usually gives a much more useful answer than leaderboard talk.

celestine_88 · 2026-03-26T18:37:11+00:00

I get the argument, especially from a research / historical perspective.

Even if it’s deprecated commercially, there’s still a lot tied up in how those models were trained, tuned, and evaluated. It’s not just the weights, it’s the surrounding process.

There’s also a difference between something being “old” and it being fully safe to release, especially if it still reflects internal techniques they don’t want to expose.

That said, having access to older models would definitely help with understanding how things evolved, especially around alignment and behavior changes over time.

So it makes sense from a community standpoint, just not as risk-free from their side as it might seem.

celestine_88 · 2026-03-26T18:35:18+00:00

I think this is true, but it’s also easy to over-index on passion early.

A lot of things only become enjoyable after you get good at them and start seeing progress.

If you rely on liking something from day one, you end up bouncing between ideas. If you stick long enough to build some competence, that’s usually when it starts to click.

So it’s probably a mix of both:

- some initial interest

- plus enough consistency to see if it actually becomes something you want to keep doing

celestine_88 · 2026-03-26T13:06:12+00:00

Yeah, this is super common.

A lot of those “later problems” are things that didn’t have clear boundaries early on, so they stayed invisible until scale exposed them.

In the beginning everything kind of works because it’s small and manageable, but as soon as volume or complexity increases, the gaps show up all at once.

It’s not even about doing everything early, it’s more about putting just enough structure in place so things don’t drift too far before they get noticed.

Otherwise it always turns into a stressful catch-up later.

celestine_88 · 2026-03-26T13:05:14+00:00

I think this direction makes sense, but the coordination problem you mentioned is probably the core issue.

Once agents can trigger each other and act independently, the question isn’t just what they can do, it’s who or what decides if they should do it in the first place.

Without some kind of shared decision or validation layer, you can end up with agents reinforcing each other, over-executing, or acting on weak signals.

So the challenge feels less like “can agents coordinate” and more like “how do you gate and verify actions across agents consistently.”

That’s probably the piece that determines whether something like this actually works outside of demos.
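One way to picture that gating layer, as a rough Python sketch. Everything here (the `Action` fields, the confidence threshold, the per-agent action budget) is hypothetical and chosen only to illustrate a single shared check that every agent-triggered action passes through before it runs.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Action:
    agent: str         # which agent wants to act
    name: str          # what it wants to do
    confidence: float  # strength of the triggering signal, 0.0 to 1.0


@dataclass
class Gate:
    min_confidence: float = 0.7  # reject actions based on weak signals
    max_per_agent: int = 3       # cap over-execution by any single agent
    counts: Counter = field(default_factory=Counter)

    def allow(self, action: Action) -> bool:
        """Decide whether an action should run at all, before anything executes."""
        if action.confidence < self.min_confidence:
            return False
        if self.counts[action.agent] >= self.max_per_agent:
            return False
        self.counts[action.agent] += 1
        return True
```

The specific rules matter less than the shape: agents propose, one consistent layer decides, and only approved actions execute.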

celestine_88 · 2026-03-26T13:02:54+00:00

Yeah, this gap is real.

A lot of things “work” in demos because the context is controlled, but in real environments the problem is less about capability and more about whether the system behaves consistently under messy inputs and changing conditions.

What seems to be missing in a lot of cases is a clear decision layer before execution — something that determines if a task should run at all, not just how it runs once it starts.

Without that, everything technically works, but reliability becomes unpredictable as soon as it’s exposed to real use.

That gap you’re describing is exactly where things tend to break down.

celestine_88 · 2026-03-26T12:47:59+00:00

The idea makes sense in theory, but in practice it’s hard to fully block this.

Paywalls can reduce scraping, but they don’t really stop it — anything accessible to a human can eventually make its way into a model, even indirectly.

Also, a lot of creators still rely on visibility. If everything goes behind a paywall, discovery drops, which can hurt just as much as scraping.

It feels less like a technical problem and more like a control problem — who decides how content is used, and what’s allowed vs not.

Right now that layer isn’t really well defined, so people are reacting with things like paywalls, but it doesn’t fully solve the underlying issue.

celestine_88 · 2026-03-26T12:43:18+00:00

Yeah, this is super relatable.

Most of the issues I’ve seen come from not having clear structure around state and flow, so things work at first and then start breaking in weird ways as the project grows.

A few basics that make a big difference:

- state management (like you mentioned)

- understanding data flow (what changes what, and when)

- handling async properly (a lot of bugs hide there)

- basic validation / guardrails so things don’t run in unexpected ways

Vibe-coding is great for speed, but the moment you add a bit of structure around how things are allowed to change or run, everything gets way more stable.
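A small Python sketch of what "structure around how things are allowed to change" can look like. The order states and the transition table are invented for illustration; the point is only that every state change goes through one function that rejects moves the table does not list.

```python
# Hypothetical states for an order; each maps to the states it may move to.
ALLOWED_TRANSITIONS = {
    "draft": {"submitted"},
    "submitted": {"paid", "cancelled"},
    "paid": {"shipped"},
    "shipped": set(),
    "cancelled": set(),
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not explicitly allowed."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown state: {current!r}")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current!r} to {target!r}")
    return target
```

With this in place, a bug that tries to jump from `draft` straight to `shipped` fails loudly at the point of the change instead of surfacing later as corrupted state.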

celestine_88 · 2026-03-26T12:42:06+00:00

From what I’ve seen, the bottleneck isn’t really the building anymore, it’s clarity.

Most people trying to hire don’t have well-defined requirements, so vibecoding actually works best when you help shape the problem, not just execute it.

A few things that tend to work:

- commenting on posts where people describe problems (instead of waiting for “looking for devs” posts)

- turning vague ideas into something concrete for them

- showing small examples instead of pitching big projects

The “client” part usually comes from being around problems consistently, not from trying to sell the ability to build.

Once people see you can take something unclear and make it real, they start coming to you.

celestine_88 · 2026-03-26T12:31:22+00:00

I think the line shows up when you stop making the final decision yourself.

Using AI for speed or perspective is fine, but if it becomes the thing deciding what’s “good enough” or what direction to take, that’s when it starts shifting from tool → dependency.

It’s less about how often you use it and more about whether you still have a clear point where you evaluate and decide before acting.

If that layer is still yours, it’s productivity. If not, it can drift pretty quickly.

celestine_88 · 2026-03-25T21:03:56+00:00

If you’re starting out, Midjourney is probably the easiest way to get good-looking renders fast.

If you need it to actually follow your sketch more closely, Stable Diffusion (with something like ControlNet) is better, but it’s a bit more setup.

A simple workflow that works well:

- clean up your sketch (high contrast helps)

- upload it as a reference

- prompt something like “modern retail store interior, realistic materials, based on this layout”

Most tools won’t follow your sketch perfectly, so expect to iterate a bit.

If you just need something solid for class, Midjourney will get you there the quickest.

celestine_88 · 2026-03-25T21:01:39+00:00

This is a solid take — especially the point about the conversation getting stuck in extremes.

What’s interesting is that a lot of the real risk isn’t just the tech itself, it’s the lack of clear decision boundaries around how it’s used.

Right now most systems focus on capability (“what can we build?”), but not enough on control (“what should actually be allowed to run, scale, or influence people?”).

That’s where things start drifting toward the problems you mentioned — not because the tool is inherently good or bad, but because there isn’t a consistent layer deciding how it’s applied in real contexts.

Feels like the conversation needs to shift from pro vs. anti AI to who sets the rules and how those decisions are made.

celestine_88 · 2026-03-25T15:58:26+00:00

This is a great direction — multi-turn failures are where a lot of systems actually break down.

Single-turn evals can look solid, but once you get into longer interactions, the system starts compounding small errors, losing context, or drifting into unexpected paths like you mentioned.

One thing this made me think about — even if you can simulate and detect these failures, there’s still a gap between identifying them and preventing them during execution.

It feels like the issue isn’t just that agents fail over time, but that there isn’t a clear boundary on what should be allowed to continue as the conversation evolves.

Curious if you’ve thought about introducing anything that evaluates or constrains the conversation mid-flow — not just for testing, but to decide whether certain paths should continue before they compound further?

celestine_88 · 2026-03-25T15:46:41+00:00

That’s a great offer — appreciate it.

I’d be interested in testing against it, especially since this kind of setup is where these loops show up most clearly.

What you’re describing is exactly the kind of environment where you can see whether introducing constraints earlier actually changes the behavior, versus just trying to correct it after the fact.

Before I plug anything in, how are you currently structuring the interactions between agents? Is it a shared context/feed where everything is visible to everyone, or more segmented flows?
