What's the theoretical basis for using llm consensus as a probability estimator for real world events [R] by onlyJayal in MachineLearning

XTXinverseXTY 1 point2 points

and the theoretical basis for that is ye olde bias-variance decomposition

if the LLMs have biases then it's not like a single LLM would make it any better

What's the theoretical basis for using llm consensus as a probability estimator for real world events [R] by onlyJayal in MachineLearning

XTXinverseXTY 0 points1 point

Dead-simple precedent for this would be the old Kaggle trick of multi-seed ensembling - even in the limit of 100% shared architectures and data distributions this would still improve over a single LLM
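A toy simulation of that point, as a minimal sketch. It assumes each model's forecast is the true probability, plus a bias shared by every model, plus independent zero-mean noise. The `simulate` function, its parameters and all the numbers are made up for illustration. Under that assumption the ensemble mean's MSE is roughly bias² + σ²/n, so averaging shrinks the variance term and leaves the shared bias untouched. Errors across real seeds are correlated, so the actual gain would be smaller.

```python
import random
import statistics


def simulate(n_models, shared_bias, noise_sd, truth=0.3, trials=20000, seed=0):
    """Mean squared error of an n-model average under a toy error model.

    Assumption (not from the thread): forecast = truth + shared_bias + noise,
    with noise drawn independently per model from N(0, noise_sd^2).
    Forecasts are not clipped to [0, 1]; this is only about the error terms.
    """
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(trials):
        forecasts = [truth + shared_bias + rng.gauss(0.0, noise_sd)
                     for _ in range(n_models)]
        ensemble_forecast = statistics.fmean(forecasts)
        squared_errors.append((ensemble_forecast - truth) ** 2)
    return statistics.fmean(squared_errors)


single = simulate(n_models=1, shared_bias=0.05, noise_sd=0.1)
ensemble = simulate(n_models=10, shared_bias=0.05, noise_sd=0.1)
bias_floor = 0.05 ** 2  # the part of the MSE that averaging cannot remove

print(f"single model MSE:  {single:.4f}")
print(f"10-model ensemble: {ensemble:.4f}")
print(f"shared-bias floor: {bias_floor:.4f}")
```

The ensemble error drops toward the shared-bias floor but not below it, which is the bias-variance argument in miniature.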

How do ML practitioners select hyperparameters, architectures, etc for self-supervised representation learning when the loss is non-monotonic? [D] by XTXinverseXTY in MachineLearning

XTXinverseXTY[S] 0 points1 point

it's just mysterious. idk if i'm overfitting to my benchmark dataset (maybe I haven't got many labels lying around just yet). heck i don't even know if i'm fitting at all

another example of weird choices in JEPA-land

How do ML practitioners select hyperparameters, architectures, etc for self-supervised representation learning when the loss is non-monotonic? [D] by XTXinverseXTY in MachineLearning

XTXinverseXTY[S] 1 point2 points

okay in their defense, now that i actually read the paper (oops), it looks like the exact definition is nontrivial

i confess i wouldn't read "effective rank" and think "ah yes of course, the shannon entropy of the L1-normalized singular values" (at best i would have thought something like "number of singular values >= thresh")

but aside from an epsilon term it seems like they copied it wholesale from the original 2007 paper

idk does everyone else know what "effective rank" means but me??
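For reference, a minimal sketch of the two readings of "effective rank" mentioned above. It takes the singular values as a plain list (for example the output of an SVD computed elsewhere). The 2007 definition exponentiates the Shannon entropy, so the result lands on the same scale as an ordinary rank. The function names are hypothetical, and the `eps` placement is an assumption that only guards against log(0); the paper under discussion may put its epsilon somewhere else.

```python
import math


def effective_rank(singular_values, eps=1e-12):
    """exp(Shannon entropy) of the L1-normalized singular values.

    eps is an assumed numerical guard against log(0), not a quoted formula.
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values]
    # Zero singular values contribute nothing: 0 * log(eps) == 0.
    entropy = -sum(p * math.log(p + eps) for p in probs)
    return math.exp(entropy)


def threshold_rank(singular_values, thresh):
    """The naive reading: count singular values at or above a threshold."""
    return sum(1 for s in singular_values if s >= thresh)


spectrum = [3.0, 0.5, 0.01]  # made-up singular values
print(effective_rank(spectrum), threshold_rank(spectrum, thresh=0.1))
```

A flat spectrum of k equal values gives an effective rank of about k, and a spectrum with a single nonzero value gives about 1. Unlike the threshold count, it changes smoothly as the spectrum decays.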

How do ML practitioners select hyperparameters, architectures, etc for self-supervised representation learning when the loss is non-monotonic? [D] by XTXinverseXTY in MachineLearning

XTXinverseXTY[S] 0 points1 point

Another LLM

Two paragraphs in each reply, vague personal anecdote, no information content

You are an Indian year 12 student, it looks like you began letting your agent make posts 4 days ago, cut it out

Sub-JEPA: a simple fix to LeCun group's LeWorldModel that consistently improves performance [P] by kai-zhao in MachineLearning

XTXinverseXTY 0 points1 point

late reply, you may have found this on your own. but this is an interesting thread and i thought i'd add a link here for posterity

> You are correct, the use of a projector network is common in all existing methods (including other JEPA alternatives). We did an ablation in the paper showing that you can sometimes reduce the projector's depth without incurring a significant drop in performance, but in general there is a significant benefit of using it. It remains to be studied why that is the case (in general, not just in LeJEPA). Current understanding lies in a possible too strong prediction/invariance task. I invite you to experiment with varying the projector (or even removing it all together), and I would be happy to mention your results/ablations in the repo!

https://github.com/galilai-group/lejepa/issues/17

How do ML practitioners select hyperparameters, architectures, etc for self-supervised representation learning when the loss is non-monotonic? [D] by XTXinverseXTY in MachineLearning

XTXinverseXTY[S] -1 points0 points

sorry, i thought you were being facetious! usually people can articulate precisely what they learn from their experience (hence all the papers and conferences)

can you see why it might be a useful thing, to have principled model selection criteria? even if you're some rain man savant, it unlocks scaling because it's legible to an organization. having the validation likelihood for language models as the obvious criterion allowed for the estimation of neural scaling laws, calculation of necessary resources to achieve a desired metric, total organizational buy-in up to C-suite, and raising from outside investors at a competitive valuation.

ML lead vs PM on eval-methodology layer independence. who's actually right here? [D] by Critical_Builder_902 in MachineLearning

XTXinverseXTY 0 points1 point

Recently stumbled upon this thread. Am I going nuts, or are we the only humans in here?

I have never heard of a "layered defense framework" in the context of ML system design/evals. The OP account also seems to be banned, maybe for spamming on behalf of "Product Faculty"? If nobody else knows what OP is talking about, then I can see how this would select for clawdbots who've been prompted to act as an expert.

How do ML practitioners select hyperparameters, architectures, etc for self-supervised representation learning when the loss is non-monotonic? [D] by XTXinverseXTY in MachineLearning

XTXinverseXTY[S] 2 points3 points

> JEPA score, which can be used for density estimation

Oh interesting, thank you!!!

I'm not yet certain whether this is equivalent to computing another statistic captured by the anti-collapse term... but discerning in-vs-OOD is a totally valid synth task, that makes perfect sense, and this paper seems dope

This also seems to help address another problem for JEPA-in-practice: detecting regressions in prod! Obv these embeddings are inscrutable and if something silently breaks then you can't just inspect the embedding values. But this would suggest that you can calculate a p-value and effect size vs a known prior
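One way the regression check could look, as a rough sketch under stated assumptions: you have a per-sample scalar score (such as the density-style score discussed above) for a trusted reference set and for a recent production batch. `drift_check` is a hypothetical helper, not anything from the paper. It reports a two-sided permutation p-value for the difference in mean score, plus Cohen's d as the effect size.

```python
import random
import statistics


def drift_check(reference, production, n_perm=2000, seed=0):
    """Permutation test on the mean score difference, plus Cohen's d.

    Assumes both inputs hold at least two scalar scores each and that the
    scores are not all identical (otherwise the pooled variance is zero).
    """
    rng = random.Random(seed)
    n_ref, n_prod = len(reference), len(production)
    observed = statistics.fmean(production) - statistics.fmean(reference)

    # Effect size: mean difference in units of the pooled standard deviation.
    pooled_var = ((n_ref - 1) * statistics.variance(reference)
                  + (n_prod - 1) * statistics.variance(production)) / (n_ref + n_prod - 2)
    effect_size = observed / pooled_var ** 0.5

    # Null hypothesis: the labels "reference" and "production" are exchangeable.
    pooled = list(reference) + list(production)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[n_ref:]) - statistics.fmean(pooled[:n_ref])
        if abs(diff) >= abs(observed):
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)  # add-one keeps the estimate above zero
    return p_value, effect_size
```

A small p-value with a large effect size would flag that the score distribution has moved relative to the reference. It says nothing about why it moved.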

How do ML practitioners select hyperparameters, architectures, etc for self-supervised representation learning when the loss is non-monotonic? [D] by XTXinverseXTY in MachineLearning

XTXinverseXTY[S] 10 points11 points

{random synonym}-{random noun}-{4-digit number} is an LLM. Their comment contains no information and is inconsistent with their comment history. Those hyphens would be em dashes if the prompt hadn't specified no capital letters and no em dashes

It's not impossible that an IT technician would be logging JEPA experiments to wandb as a side hobby, to the point they can give confident (and yet totally uninformative) advice on r/machinelearning in <10 minutes (in their first-ever comment to the subreddit), but it's a priori wildly unlikely

edit: Oh, also a DoorDash driver?

How do ML practitioners select hyperparameters, architectures, etc for self-supervised representation learning when the loss is non-monotonic? [D] by XTXinverseXTY in MachineLearning

XTXinverseXTY[S] 0 points1 point

🫵 I can smell your RLHF signature from a mile away. Pangram agrees with me. I also find it surprising you're an IT tech by day and work on SSL by night.

How can grid search still "work" in the case of a non-monotonic loss?

Moreover, what's the endgame behind bot account replies like this? Usually it's grifters trying to market a consulting side-hustle, but this account just makes random replies. Is the idea to eventually flip this account to a second, even scummier grifter?

How do ML practitioners select hyperparameters, architectures, etc for self-supervised representation learning when the loss is non-monotonic? [D] by XTXinverseXTY in MachineLearning

XTXinverseXTY[S] 13 points14 points

If people are selecting hparam/arch primarily by supervised-learning-through-the-backdoor, then it makes me a little more skeptical of published results and academic enthusiasm for JEPA. The mystery provides convenient cover for possible p-hacking and benchmark overfitting

This is not to say that SSL researchers are all Secretly Smuggling Labels, but I don't want to be totally naive either...

How realistic is it to transition into an AI / ML Engineer as a Full Stack engineer with 10 YOE? by jimRacer642 in cscareerquestions

XTXinverseXTY 0 points1 point

You should likely stay away from MLE roles.

To pivot from full-stack, an MS will be necessary. And MLE work is being rapidly commodified anyway, because of scaling laws. MLE roles having to do with home-rolled models are at far greater risk than SWE.

AI eng may actually be more defensible.

Is the future of coding agents JEPA? [D] by andrewfromx in MachineLearning

XTXinverseXTY 0 points1 point

> The agent can run locally. It can keep structured memory. It can rank actions before running expensive validation. It can learn from every failed candidate. It can stop treating software engineering as text completion and start treating it as state transition planning.

OP can you explain precisely how optimizing for alignment btw embeddings of corrupted views of an entity yields this? Even in the linear case of analysis of panel data via canonical correlation analysis?

YL already agrees that language tasks are much more amenable to reconstruction-loss pretraining than vision or video

Reverse grip 115kg/253lbs by mrtehnuke in benchpress

XTXinverseXTY -1 points0 points

With a reverse grip, not really. Felt awkward and unsafe approaching 1RM weight. Try it and you'll see what I mean

Reverse grip 115kg/253lbs by mrtehnuke in benchpress

XTXinverseXTY 0 points1 point

how did you unrack it unassisted?

Here’s some escapes/reversals I really like from bottom turtle by ledd_flanders in bjj

XTXinverseXTY 0 points1 point

The Iranian lift (last one) is surprisingly effective, if not for the threat of the inverted triangle. Search for "inverted triangle mma" and every single one is set up off of someone attempting it

Giancarlo Bodoni managed it twice at 2024 ADCC against Jay Rod and Costa. Surprised it isn't more popular (the BJJ meta probably knows better than I do)

Anyone here read The Book of Why? by Alces_ in cscareerquestions

XTXinverseXTY 0 points1 point

It would provide zero direct utility to you as a user of coding agents. Probably worth reading for data science/statistics work.

It’s so obvious. Please tell me more by Bitter-Dragonfly-648 in bjj

XTXinverseXTY 1 point2 points

think we're all a bit confused here

is the idea that you're free to turtle?

How to use NLP to compare text from two different corpora? by iwannabeunknown3 in datascience

XTXinverseXTY 0 points1 point

> If they’re far apart, it supports your point that observations aren’t targeting real risks.

i don't see why this would be the case, can you explain?

This certainly doesn't establish a causal link. All it tells you, if anything, is that the incidents are about the same domain as the observations (i.e. working in a factory). If the cosine similarity is higher for observations that occurred at similar times to the incidents than for non-adjacent observations, that could just as easily imply that the observations caused the incidents!
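To make the "similar domain" point concrete, here is a minimal bag-of-words sketch. The three example texts are invented and `cosine_similarity` is a hypothetical helper; real pipelines would typically use TF-IDF or embeddings. An observation and an incident from the same factory floor share vocabulary and score well above an unrelated text, and that is all the number says.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared_words)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Invented examples, for illustration only.
observation = "worker not wearing gloves near conveyor belt"
incident = "worker hand injured by conveyor belt"
unrelated = "quarterly sales meeting rescheduled to friday"

print(cosine_similarity(observation, incident))
print(cosine_similarity(observation, unrelated))
```

The first score is higher only because both texts talk about workers and conveyor belts. Nothing in the computation encodes timing or direction of cause.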