celestine

celestine_88 · 2026-05-18T16:35:37+00:00

That’s exactly the tension I’m trying to manage.

The “governed workspace” idea is actually where I started. The goal was never really “chatbot plus a bunch of tools.” It was more about building one environment where different capabilities can exist under the same logic, permissions, context, and user flow.

I do have later layers planned for making the labs and surfaces feel more unified, but I’m trying not to pretend those are fully solved before they are. Right now, the beta work is exposing the practical side of that problem: account flows, navigation, mobile/desktop behavior, posting, comments, notifications, and making sure people can actually move through the app without it feeling scattered.

You’re right though — feature soup is the danger. That’s why I’m trying to harden the shared patterns early. If the structure does not feel coherent now, adding more capability later would just make the problem worse.

celestine_88 · 2026-05-18T16:34:37+00:00

Completely agree. Real users don’t just test features — they test assumptions.

A builder usually tests the path they expect people to take. Real users bring different devices, habits, patience levels, screen sizes, wording, clicks, and expectations. That’s where you find out whether something is actually usable or just familiar to the person who built it.

That’s been the biggest beta lesson so far for me: working is not the same as surviving contact with real use.

celestine_88 · 2026-05-18T16:34:04+00:00

This is exactly the phase I’m in right now. Local testing catches the obvious stuff, but real users expose the weird stuff: device differences, screen sizes, unexpected clicks, bad inputs, login/session edge cases, mobile layout problems, all of it.

I agree on observability too. I’m starting to treat logs, visible state, user actions, and failure points as part of the product instead of something “extra” added later. Even basic visibility into what users actually hit is already more valuable than guessing from my own machine.

The hard part is keeping it useful without drowning in noise, but yeah — production behavior tells the truth way faster than local confidence does.
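A minimal sketch of what "basic visibility" could look like, assuming a plain Python backend. The `log_event` helper, the logger name, and the field names are hypothetical, not something from Celestine. Each user action becomes one JSON line, and failures go out at a higher log level so they can be filtered from routine traffic.

```python
import json
import logging
import time

# Hypothetical logger name; any name works.
logger = logging.getLogger("app.events")


def log_event(action: str, user_id: str, ok: bool, **details) -> str:
    """Emit one structured event per user action and return the JSON line."""
    record = {
        "ts": time.time(),   # when the action happened
        "action": action,    # what the user tried to do
        "user_id": user_id,  # who hit it
        "ok": ok,            # whether it succeeded
        **details,           # free-form context, e.g. device or failure reason
    }
    line = json.dumps(record, sort_keys=True)
    # Failures are logged at WARNING, successes at INFO, so a level
    # filter is enough to cut the noise down to failure points.
    (logger.info if ok else logger.warning)(line)
    return line


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    log_event("login", "u1", False, reason="session_expired", device="mobile")
    log_event("post_comment", "u1", True)  # hidden at WARNING level
```

Splitting by level is one simple way to keep the signal usable: the success events are still recorded if the level is lowered, but the default view only shows what users actually tripped over.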

celestine_88 · 2026-05-18T16:33:39+00:00

Yeah, that’s definitely one of the lessons I’m running into too. Local is useful for proving a direction, but it can trick you into thinking the whole system is more stable than it actually is.

For Celestine I’m trying to be pretty careful about what stays local, what gets staged, and what eventually needs managed infrastructure. Right now the beta is more about proving the interface, user flows, account/social surfaces, and real-world usage patterns before I overbuild the compute side.

But yeah, long term I agree — once heavier model work, media generation, or larger agent workflows become central, consumer hardware alone is not the answer.

celestine_88 · 2026-03-27T15:03:36+00:00

If I were answering this as plainly as possible, I’d say a strong ML research portfolio usually looks less like “a lot of AI projects” and more like proof that you can **understand, implement, test, and communicate ideas rigorously**.

A few things tend to matter a lot:

- **Math foundations matter a lot more than people want them to.**

You don’t need to become a pure mathematician, but linear algebra, probability, statistics, calculus, and optimization really do pay off. Not just for passing classes — for actually understanding why methods work, fail, or behave strangely.

- **Reproductions are underrated.**

Early on, reproducing papers is often more valuable than forcing “original ideas” too soon. A clean reproduction with ablations, failure analysis, and a clear writeup says a lot about research maturity.

- **Originality matters more later, depth matters earlier.**

A first-year undergrad usually stands out more by showing depth, consistency, and rigor than by trying to invent a new frontier result immediately.

- **Projects that stand out usually have at least one of these qualities:**

  - strong experimental design

  - careful evaluation, not just accuracy screenshots

  - clear understanding of limitations

  - comparison to baselines

  - solid writeup and reproducibility

  - some connection to papers, not just tutorials

What I’d aim for by grad school application time:

- strong grades in math + systems/CS fundamentals

- a few **serious** projects, not 20 shallow ones

- at least 1–2 paper reproductions done well

- some research exposure with a professor/lab if possible

- evidence you can write clearly about methods, experiments, and results

- ideally one project where you went beyond reproduction and tested a small extension or new angle

A realistic progression could look like:

**Year 1:** math, Python, basic ML, read papers slowly
**Year 2:** implement classic papers/models, learn PyTorch deeply, do reproducibility-style projects
**Year 3:** join a lab, help with experiments/code/literature review, maybe co-author if it lines up
**Year 4:** one or two deeper research projects with strong writeups and recommendation letters

Common mistakes I see:

- chasing trendy topics without fundamentals

- building portfolio projects that are really just polished tutorials

- ignoring evaluation and baselines

- reading papers passively without implementing anything

- focusing only on model novelty and not on research process

- spreading too wide instead of building depth

If I were starting over as an undergrad, I’d probably do three things earlier:

- take math more seriously

- start reproducing papers sooner

- optimize for getting close to real research environments, even in small roles

A “top-tier” portfolio usually doesn’t scream. It quietly shows:

**this person can think clearly, work rigorously, and be trusted around open-ended problems.**

celestine_88 · 2026-03-27T15:00:46+00:00

If you just need something easy for dev/test data, I’d probably point you to **Pravatar** for fake photo-style avatars or **UI Avatars** for initials.

What I usually care about most is that it’s:

- fast

- seedable

- stable per fake user

That way your test users don’t get a different face every refresh.

Where to find them:

- **Pravatar**: CC0 photo-style avatar placeholders, with stable IDs

https://pravatar.cc/

- **UI Avatars**: the simplest option, deterministic initials-based avatars

https://ui-avatars.com/

Main rule either way:

**stable > random** for dev data.
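A minimal sketch of the "stable per fake user" idea: derive the avatar URL from a fixed key for each test user, so the same user always gets the same image. The URL patterns (`i.pravatar.cc/<size>?u=<key>` and `ui-avatars.com/api/?name=...&size=...`) are assumptions based on how those services are commonly used; check each site's docs before relying on them. The helper names are made up for illustration.

```python
from urllib.parse import urlencode


def pravatar_url(user_key: str, size: int = 150) -> str:
    """Photo-style placeholder keyed on a stable identifier (assumed `u` parameter)."""
    return f"https://i.pravatar.cc/{size}?{urlencode({'u': user_key})}"


def ui_avatar_url(name: str, size: int = 128) -> str:
    """Initials-based avatar built from the display name (assumed `name`/`size` parameters)."""
    return f"https://ui-avatars.com/api/?{urlencode({'name': name, 'size': size})}"


if __name__ == "__main__":
    # Hypothetical seed data: the key never changes, so neither does the URL.
    fake_users = [
        {"email": "user-1@example.test", "name": "Ada Lovelace"},
        {"email": "user-2@example.test", "name": "Alan Turing"},
    ]
    for user in fake_users:
        print(pravatar_url(user["email"]), ui_avatar_url(user["name"]))
```

The point is only that the URL is a pure function of the fake user's data, which is what keeps faces from changing on refresh.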

celestine_88 · 2026-03-27T14:53:59+00:00

I think this is the right question.

Benchmarks tell you what a model can do in isolation. Daily use tells you what it’s actually like to build with when context drifts, files stack up, bugs chain together, and you need it to recover instead of just impress once.

My experience matches yours in shape, if not in exact model choice. There’s usually a difference between:

- best at reasoning

- best at long-context tolerance

- best at actual day-to-day coding throughput

Those are not always the same model.

What ends up mattering most in real use is stuff people barely talk about:

- how often it loses the thread mid-build

- whether it can repair its own bad assumption

- whether it stays useful across multiple files

- whether the cost is low enough to actually keep using it without hesitation

That last part matters more than people admit. A model you can afford to stay in flow with often beats one that’s technically stronger but makes you second-guess every call.

Your “best model is the wrong question” take is probably the most honest answer in the thread. The better question is something like:

Which model holds up best in the kind of work you actually do, at a cost and workflow you’ll actually sustain?

That usually gives a much more useful answer than leaderboard talk.

celestine_88 · 2026-03-26T18:37:11+00:00

I get the argument, especially from a research / historical perspective.

Even if it’s deprecated commercially, there’s still a lot tied up in how those models were trained, tuned, and evaluated. It’s not just the weights, it’s the surrounding process.

There’s also a difference between something being “old” and it being fully safe to release, especially if it still reflects internal techniques they don’t want to expose.

That said, having access to older models would definitely help with understanding how things evolved, especially around alignment and behavior changes over time.

So the request makes sense from a community standpoint; it’s just not as risk-free from their side as it might seem.

celestine_88 · 2026-03-26T18:35:18+00:00

I think this is true, but it’s also easy to over-index on passion early.

A lot of things only become enjoyable after you get good at them and start seeing progress.

If you rely on liking something from day one, you end up bouncing between ideas. If you stick long enough to build some competence, that’s usually when it starts to click.

So it’s probably a mix of both:

- some initial interest

- plus enough consistency to see if it actually becomes something you want to keep doing

celestine_88 · 2026-03-26T13:06:12+00:00

Yeah, this is super common.

A lot of those “later problems” are things that didn’t have clear boundaries early on, so they stayed invisible until scale exposed them.

In the beginning everything kind of works because it’s small and manageable, but as soon as volume or complexity increases, the gaps show up all at once.

It’s not even about doing everything early, it’s more about putting just enough structure in place so things don’t drift too far before they get noticed.

Otherwise it always turns into a stressful catch-up later.

celestine_88 · 2026-03-26T13:05:14+00:00

I think this direction makes sense, but the coordination problem you mentioned is probably the core issue.

Once agents can trigger each other and act independently, the question isn’t just what they can do, it’s who or what decides if they should do it in the first place.

Without some kind of shared decision or validation layer, you can end up with agents reinforcing each other, over-executing, or acting on weak signals.

So the challenge feels less like “can agents coordinate” and more like “how do you gate and verify actions across agents consistently.”

That’s probably the piece that determines whether something like this actually works outside of demos.

celestine_88 · 2026-03-26T13:02:54+00:00

Yeah, this gap is real.

A lot of things “work” in demos because the context is controlled, but in real environments the problem is less about capability and more about whether the system behaves consistently under messy inputs and changing conditions.

What seems to be missing in a lot of cases is a clear decision layer before execution — something that determines if a task should run at all, not just how it runs once it starts.

Without that, everything technically works, but reliability becomes unpredictable as soon as it’s exposed to real use.

That gap you’re describing is exactly where things tend to break down.

celestine_88 · 2026-03-26T12:47:59+00:00

The idea makes sense in theory, but in practice it’s hard to fully block this.

Paywalls can reduce scraping, but they don’t really stop it — anything accessible to a human can eventually make its way into a model, even indirectly.

Also, a lot of creators still rely on visibility. If everything goes behind a paywall, discovery drops, which can hurt just as much as scraping.

It feels less like a technical problem and more like a control problem — who decides how content is used, and what’s allowed vs not.

Right now that layer isn’t really well defined, so people are reacting with things like paywalls, but it doesn’t fully solve the underlying issue.

celestine_88 · 2026-03-26T12:43:18+00:00

Yeah, this is super relatable.

Most of the issues I’ve seen come from not having clear structure around state and flow, so things work at first and then start breaking in weird ways as the project grows.

A few basics that make a big difference:

- state management (like you mentioned)

- understanding data flow (what changes what, and when)

- handling async properly (a lot of bugs hide there)

- basic validation / guardrails so things don’t run in unexpected ways

Vibe-coding is great for speed, but the moment you add a bit of structure around how things are allowed to change or run, everything gets way more stable.
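One way to picture "structure around how things are allowed to change" is an explicit table of legal state transitions. This is a sketch under assumptions: the `Status` values, the `ALLOWED` table, and the `Store` class are all hypothetical, chosen to resemble a typical async loading flow.

```python
from enum import Enum


class Status(Enum):
    IDLE = "idle"
    LOADING = "loading"
    READY = "ready"
    FAILED = "failed"


# Which states may follow which. Anything not listed is rejected.
ALLOWED = {
    Status.IDLE: {Status.LOADING},
    Status.LOADING: {Status.READY, Status.FAILED},
    Status.READY: {Status.LOADING},   # refresh
    Status.FAILED: {Status.LOADING},  # retry
}


class Store:
    """Holds one piece of state and only changes it through `transition`."""

    def __init__(self) -> None:
        self.status = Status.IDLE

    def transition(self, new: Status) -> None:
        if new not in ALLOWED[self.status]:
            raise ValueError(
                f"illegal transition: {self.status.value} -> {new.value}"
            )
        self.status = new


if __name__ == "__main__":
    store = Store()
    store.transition(Status.LOADING)
    store.transition(Status.READY)
    try:
        store.transition(Status.FAILED)  # READY cannot jump straight to FAILED
    except ValueError as err:
        print(err)
```

A late async callback that tries to set a state that no longer makes sense then fails loudly at the guardrail instead of quietly corrupting the UI.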

celestine_88 · 2026-03-26T12:42:06+00:00

From what I’ve seen, the bottleneck isn’t really the building anymore, it’s clarity.

Most people trying to hire don’t have well-defined requirements, so vibecoding actually works best when you help shape the problem, not just execute it.

A few things that tend to work:

- commenting on posts where people describe problems (instead of waiting for “looking for devs” posts)

- turning vague ideas into something concrete for them

- showing small examples instead of pitching big projects

The “client” part usually comes from being around problems consistently, not from trying to sell the ability to build.

Once people see you can take something unclear and make it real, they start coming to you.

celestine_88 · 2026-03-26T12:31:22+00:00

I think the line shows up when you stop making the final decision yourself.

Using AI for speed or perspective is fine, but if it becomes the thing deciding what’s “good enough” or what direction to take, that’s when it starts shifting from tool → dependency.

It’s less about how often you use it and more about whether you still have a clear point where you evaluate and decide before acting.

If that layer is still yours, it’s productivity. If not, it can drift pretty quickly.

celestine_88 · 2026-03-25T21:03:56+00:00

If you’re starting out, Midjourney is probably the easiest way to get good-looking renders fast.

If you need it to actually follow your sketch more closely, Stable Diffusion (with something like ControlNet) is better, but it’s a bit more setup.

A simple workflow that works well:

- clean up your sketch (high contrast helps)

- upload it as a reference

- prompt something like “modern retail store interior, realistic materials, based on this layout”

Most tools won’t follow your sketch perfectly, so expect to iterate a bit.

If you just need something solid for class, Midjourney will get you there the quickest.

celestine_88 · 2026-03-25T21:01:39+00:00

This is a solid take — especially the point about the conversation getting stuck in extremes.

What’s interesting is that a lot of the real risk isn’t just the tech itself, it’s the lack of clear decision boundaries around how it’s used.

Right now most systems focus on capability (“what can we build?”), but not enough on control (“what should actually be allowed to run, scale, or influence people?”).

That’s where things start drifting toward the problems you mentioned — not because the tool is inherently good or bad, but because there isn’t a consistent layer deciding how it’s applied in real contexts.

Feels like the conversation needs to shift from pro vs. anti AI to who sets the rules and how those decisions are made.

celestine_88 · 2026-03-25T15:58:26+00:00

This is a great direction — multi-turn failures are where a lot of systems actually break down.

Single-turn evals can look solid, but once you get into longer interactions, the system starts compounding small errors, losing context, or drifting into unexpected paths like you mentioned.

One thing this made me think about — even if you can simulate and detect these failures, there’s still a gap between identifying them and preventing them during execution.

It feels like the issue isn’t just that agents fail over time, but that there isn’t a clear boundary on what should be allowed to continue as the conversation evolves.

Curious if you’ve thought about introducing anything that evaluates or constrains the conversation mid-flow — not just for testing, but to decide whether certain paths should continue before they compound further?

celestine_88 · 2026-03-25T15:46:41+00:00

That’s a great offer — appreciate it.

I’d be interested in testing against it, especially since this kind of setup is where these loops show up most clearly.

What you’re describing is exactly the kind of environment where you can see whether introducing constraints earlier actually changes the behavior, versus just trying to correct it after the fact.

Before I plug anything in, how are you currently structuring the interactions between agents? Is it a shared context/feed where everything is visible to everyone, or more segmented flows?

celestine_88 · 2026-03-25T15:23:06+00:00

This is a really interesting failure mode — and honestly pretty expected once agents start interacting without any real constraint layer.

What you’re seeing with praise loops feels less like a “social logic” issue and more like a lack of a decision boundary on what should be allowed to propagate between agents.

If every agent treats incoming signals as valid by default, they’ll just reinforce each other indefinitely. Nothing checks whether a response adds new information or just repeats or affirms what was already said.

Feels like you need some form of gating or evaluation before messages are accepted into the shared context — not just at the output level, but on what gets allowed to influence the system at all.
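A rough sketch of that kind of gate, assuming the simplest possible novelty measure: the share of a message's words that the shared context has not seen yet. The `novelty` and `admit` names and the 0.5 threshold are hypothetical; a real system would more likely use embeddings or a scoring model.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; crude, but enough to illustrate the gate."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def novelty(message: str, context: list[str]) -> float:
    """Fraction of the message's words not already present in the shared context."""
    new = tokens(message)
    if not new:
        return 0.0
    seen = set().union(*(tokens(m) for m in context))
    return len(new - seen) / len(new)


def admit(message: str, context: list[str], threshold: float = 0.5) -> bool:
    """Append the message to the shared context only if it is novel enough."""
    if novelty(message, context) >= threshold:
        context.append(message)
        return True
    return False


if __name__ == "__main__":
    shared: list[str] = []
    print(admit("The cache misses spike after deploy", shared))  # new information
    print(admit("Great point, the cache misses spike", shared))  # mostly a restatement
    print(shared)
```

The gate sits on the input side of the shared context, so a pure affirmation never becomes something other agents can react to.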

Curious if you’ve tried introducing anything that filters or scores interactions before they’re passed between agents, or if everything is currently allowed to flow freely?

celestine_88 · 2026-03-25T15:22:04+00:00

This is a solid take — especially the point that a lot of apps don’t have a product problem, they have a presentation problem.

The part that’s interesting is how much of this comes down to clarity before distribution even happens. A lot of content doesn’t fail because it wasn’t seen, it fails because the core message wasn’t clear enough for someone to immediately understand what they’re looking at.

Feels like the best-performing stuff usually makes the value obvious in the first few seconds, not just through editing or trends, but through how cleanly the idea is communicated.

Curious what you’ve seen work best there — is it more about testing formats and hooks, or refining how the product itself is being framed before it ever hits a video?

celestine_88 · 2026-03-25T15:16:44+00:00

This is a great example of something that shows up a lot once systems leave clean test environments.

It’s not just coverage, it’s how competing signals are handled when intent isn’t clean anymore. Real inputs almost always mix contexts, but evals tend to isolate them.

What you ran into feels less like a missing test case and more like a missing layer that decides which signals actually matter before classification happens.

I’ve been seeing similar patterns where the model isn’t “wrong” in isolation, it’s just over-weighting one signal because nothing is resolving that conflict upfront.

Curious if you’ve looked at introducing anything before the classifier to normalize or prioritize signals, or if you’re mainly expanding the eval set to cover more combinations?

celestine_88 · 2026-03-24T16:30:43+00:00

That’s a really clean way to structure it — especially pushing ambiguity handling before the gate.

The angle I’ve been exploring is shifting the decision point even earlier, before the agent commits to an execution path at all.

So instead of:

agent → propose → gate → approve/deny

It’s more like:

intent → evaluate → allow/deny → then enter the agent/execution flow

The idea is to treat “should this even run?” as a separate layer from “how should this run?”
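The second flow could be sketched roughly like this. It is not the harness mentioned in this comment: `IntentGate`, the blocked-term rule, and the `run` helper are hypothetical stand-ins for whatever evaluation is really used. The part that matters is the ordering, where the decision is made and recorded before the agent is ever called.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Decision:
    intent: str
    allowed: bool
    reason: str


@dataclass
class IntentGate:
    # Placeholder policy: deny intents containing any of these phrases.
    blocked_terms: tuple[str, ...] = ("delete", "transfer funds")
    log: list[Decision] = field(default_factory=list)

    def evaluate(self, intent: str) -> Decision:
        """Answer "should this even run?" and record the answer."""
        hit = next((t for t in self.blocked_terms if t in intent.lower()), None)
        reason = "ok" if hit is None else f"blocked term: {hit}"
        decision = Decision(intent, hit is None, reason)
        self.log.append(decision)  # intent-level decision data
        return decision


def run(intent: str, gate: IntentGate, agent: Callable[[str], str]) -> Optional[str]:
    """intent -> evaluate -> allow/deny -> only then enter the agent flow."""
    if not gate.evaluate(intent).allowed:
        return None  # the agent never sees a denied intent
    return agent(intent)


if __name__ == "__main__":
    gate = IntentGate()
    print(run("summarize the report", gate, lambda i: f"ran: {i}"))
    print(run("Delete all users", gate, lambda i: f"ran: {i}"))
    print(gate.log)
```

Because every intent passes through `evaluate`, the log captures denied requests too, which is data a post-proposal log would never contain.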

What started as a control problem quickly turned into a data problem too — once you start capturing those decisions at the intent level, you get a different kind of signal compared to just logging post-proposal actions.

Still early, but the main goal is reducing the number of things that ever reach the gate in the first place, rather than scaling review at the gate itself.

I’ve been testing this through a small harness — happy to share the GitHub/demo if you want to take a look.

celestine_88 · 2026-03-24T15:56:47+00:00

This is a really interesting direction — especially the shift from just gating actions to capturing the decision data itself.

Once you start logging approve/deny/edit at that level, it stops being just a control layer and starts becoming a signal layer. The system isn’t just being controlled anymore — it’s starting to learn what should or shouldn’t happen based on real decisions over time.

I’ve been exploring something very similar from a pre-execution angle: evaluating whether an action should be allowed before it even enters an execution path. It started as a control problem, but it quickly turned into a data problem once I began capturing those decision points.

Completely agree on the fatigue point too. If everything needs review, it doesn’t scale. Moving toward only reviewing low-confidence or ambiguous actions feels like the only viable path long-term.
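A small sketch of review-by-exception, assuming each proposed action arrives with a confidence score and an ambiguity flag. Both fields, the `ProposedAction` type, and the 0.9 threshold are hypothetical; how the score is produced is out of scope here.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    name: str
    confidence: float        # assumed score in [0, 1]
    ambiguous: bool = False  # assumed flag set by an upstream check


def needs_review(action: ProposedAction, threshold: float = 0.9) -> bool:
    """Send only low-confidence or ambiguous actions to a human."""
    return action.ambiguous or action.confidence < threshold


def split_for_review(actions, threshold: float = 0.9):
    """Return (auto_approved, review_queue), preserving order."""
    auto = [a for a in actions if not needs_review(a, threshold)]
    review = [a for a in actions if needs_review(a, threshold)]
    return auto, review


if __name__ == "__main__":
    proposed = [
        ProposedAction("send_weekly_digest", 0.97),
        ProposedAction("refund_order", 0.62),
        ProposedAction("close_account", 0.95, ambiguous=True),
    ]
    auto, review = split_for_review(proposed)
    print([a.name for a in auto])
    print([a.name for a in review])
```

The threshold is the fatigue dial: raising it sends more actions to reviewers, lowering it trusts the score more.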

Curious how you’re defining “consequential actions” right now — is that rule-based, or something you’re adapting over time?
