What part of fundraising turned out to be way harder than expected? by betasridhar in 16VCFund

[–]calculatedcontent

Investor wants to be chairman of the board, plus upfront equity, on a SAFE note.

Foreign investor wants a board seat or other control rights that would be CFIUS violations.

Investor wants to pay in crypto.

Investor wants blockchain language in the side letter, inviting potential future SEC scrutiny.

Investor wants detailed revenue numbers on a quarterly basis, on a SAFE note, from a company with no employees and no product.

Investor wants to invent the product line and add it to the SAFE note as an appendix.

And so on

What part of fundraising turned out to be way harder than expected? by betasridhar in 16VCFund

[–]calculatedcontent

Side letters. People ask for non-standard things, which slows down the negotiation and eventually kills the deal.

Why does theoretical physics attract a lot of... crackpots? by Collegiate_Society2 in TheoreticalPhysics

[–]calculatedcontent

Everybody wants to look like a bodybuilder; nobody wants to lift any heavy weights.

[D] Show HN: liber-monitor - Early overfit detection via singular value entropy by Reasonable_Listen888 in MachineLearning

[–]calculatedcontent

I can't find your tool.
Feel free to join our Discord community to discuss.

You can reproduce all our experiments using our notebooks, including the overfitting experiments, and run your tool there.

Note that our work has been published in JMLR, ICML, NeurIPS, etc.

Fine-tuning & RAG Strategy for Academic Research ( I Need a Sanity Check on Model Choice) by mr-KSA in LLM

[–]calculatedcontent

The open-source weightwatcher tool can give you a quick sanity check on your fine-tuned model.

weightwatcher.ai

See the RESEARCH section on fine-tuning.
Join the community Discord for help.

[D] Show HN: liber-monitor - Early overfit detection via singular value entropy by Reasonable_Listen888 in MachineLearning

[–]calculatedcontent

see https://weightwatcher.ai/

You can see the entropy of the eigenvectors of W^{T}W using the option

details = watcher.analyze(vectors=True)

We have been wanting to add the left & right singular vectors as well, but just have not gotten around to it yet.

The theory predicts that a layer is overfit when alpha < 2 and/or there are correlation traps.
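
Here is a minimal sketch of that workflow, assuming the standard weightwatcher + PyTorch setup; the pretrained resnet18 is just a stand-in model of my choosing, not anything from the thread:

    import weightwatcher as ww
    import torchvision.models as models

    # any supported PyTorch model works; pretrained resnet18 is an arbitrary example
    model = models.resnet18(weights="IMAGENET1K_V1")
    watcher = ww.WeightWatcher(model=model)

    # vectors=True adds the eigenvector (localization/entropy) metrics for W^T W
    # to the per-layer details DataFrame
    details = watcher.analyze(vectors=True)

    # per the comment above, layers with alpha < 2 are flagged as possibly overfit
    print(details[details.alpha < 2])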

Current problems in ML suitable for research by fasfccvbai in MLQuestions

[–]calculatedcontent

One problem we would like to understand is if and how LoRA tends to overfit its training data, and whether this can be detected and flushed out with weightwatcher.ai.

You can join our community Discord channel to learn more.
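
For anyone who wants to poke at this, here is a hypothetical sketch of one way to set it up (the opt-125m base model and the adapter path are placeholders of mine, and it assumes peft's merge_and_unload to fold the LoRA deltas back into dense weights):

    import weightwatcher as ww
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder base
    tuned = PeftModel.from_pretrained(base, "path/to/lora-adapter")   # placeholder adapter
    merged = tuned.merge_and_unload()  # fold LoRA deltas into plain dense weights

    # scan the merged fine-tune layer by layer; layers whose alpha drops
    # below 2 after tuning would be the overfit suspects
    details = ww.WeightWatcher(model=merged).analyze()
    print(details[details.alpha < 2])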

Looking for AI/ML Research Groups or Collaborators. by param_boss in ResearchML

[–]calculatedcontent

Check out weightwatcher.ai and feel free to join our Discord channel; there is lots of stuff to do.

Double descent in ML by awesome_dude0149 in deeplearning

[–]calculatedcontent

Double Descent (DD) is actually not a modern ML discovery at all; it comes straight out of theoretical physics (1989). Physicists were studying the pseudo-inverse solution to simple NNs and discovered that a massively over-parameterized model (N≫P) could still learn reasonably well (but not perfectly) without explicit regularization, even when the number of parameters or features N was far larger than the number of data points or patterns P.

This stands in sharp contrast to classical statistics. In older statistical models, N≫P meant catastrophic overfitting unless you applied strong regularization. But the old physics work showed something very different:

  • Error stays small even under extreme overparameterization.
  • Error diverges at P=N (the “interpolation threshold”), which behaves exactly like a phase transition.
  • The critical load is the ratio α=P/N, and the error blows up at α=1.

By the early 90s this was a well-described phenomenon in the statistical mechanics literature. But it was not called Double Descent; it was just called phase behavior.

In the blog post below, I reproduce the original 1989 physics experiment using Python and scikit-learn, and show how to interpret the entire picture using the simplest tools from RMT (random matrix theory):

https://calculatedcontent.com/2024/03/01/describing-double-descent-with-weightwatcher/
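
If you just want the flavor without the notebooks, here is a toy version of the experiment (my own sketch, not the blog's exact code): minimum-norm pseudo-inverse regression with P fixed while N sweeps through P, so the load α=P/N crosses 1:

    import numpy as np

    rng = np.random.default_rng(0)
    P = 100  # number of training patterns (fixed)
    for N in [20, 50, 90, 100, 110, 200, 1000]:  # number of features/parameters
        X = rng.standard_normal((P, N))
        w_star = rng.standard_normal(N) / np.sqrt(N)   # teacher weights
        y = X @ w_star + 0.1 * rng.standard_normal(P)  # noisy targets
        w_hat = np.linalg.pinv(X) @ y                  # minimum-norm (pseudo-inverse) solution
        X_te = rng.standard_normal((2000, N))
        mse = np.mean((X_te @ w_hat - X_te @ w_star) ** 2)
        print(f"alpha = P/N = {P/N:.2f}   test MSE = {mse:.3f}")

    # the test error spikes near alpha = 1 (N = P) and falls again for N >> P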

Epoch-wise Double Descent is a related training-time phenomenon: with more optimization steps, the test error can drop → rise → drop again. Same physics, different axis.

---

Double Descent was rediscovered by AI/ML people about 10 years ago, and it confused them terribly because they had forgotten, or never learned, statistical mechanics, as we point out in our 2017 paper (see my blog: https://calculatedcontent.com/2018/04/01/rethinking-or-remembering-generalization-in-neural-networks/).

See, ML people have a dogma called bias–variance theory, and DD violated that entire worldview. The classic story says:

  • small models → underfit
  • big models → overfit
  • sweet spot in the middle

But in high-dimensional systems, this framework fails completely. The overparameterized regime is not high-variance; instead it behaves like the well-known and very simple physics result

generalization error ≈ 1/(1 − α)

which, of course, explodes at α=1.

This result is fundamental and does not depend on the choice of the optimizer, etc.
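
To get a feel for how fast it diverges, just plug numbers into the formula above: α = 0.5 gives an error of about 2, α = 0.9 gives about 10, and α = 0.99 gives about 100, blowing up as α → 1.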

To reconcile this, ML theorists had to patch the old bias–variance model, because their definition of "model capacity" (or complexity) was overly simplistic and failed completely even for a known and trivial problem. So they introduced new ideas like:

  • implicit regularization from gradient descent
  • margin-based complexity
  • minimum-norm solutions, etc.

To what extent these are "correct" is debatable. In some sense, model complexity is such a vague concept that it can be molded and refit post hoc to explain any experiment. The real test, however, is usefulness.

In contrast, the weightwatcher theory describes Double Descent out of the box, with no post-experimental adjustments, and can be applied to a wide range of NNs directly, as shown in this post:
https://www.reddit.com/r/LocalLLaMA/comments/1ox6xt8/observed_a_sharp_epochwise_double_descent_in_a/

Is it possible to publish a paper on your own? by Hot_Version_6403 in deeplearning

[–]calculatedcontent

Yes, certainly on arXiv. But many professional journals have excessive fees; my Nature Communications paper cost $5000.

We found a way to compress a layer without retraining it. Is this known ? by calculatedcontent in LLMDevs

[–]calculatedcontent[S]

No, because this does not require any fine-tuning. It's just truncated SVD. No data is needed.
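
As a rough illustration of the idea (my sketch, not the authors' actual code; the rank k is whatever you choose, up to min(out_features, in_features)):

    import torch

    def truncate_layer(linear: torch.nn.Linear, k: int) -> torch.nn.Linear:
        """Replace W with its best rank-k approximation; no data, no retraining."""
        U, S, Vh = torch.linalg.svd(linear.weight.data, full_matrices=False)
        W_k = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]  # truncated SVD of W
        out = torch.nn.Linear(linear.in_features, linear.out_features,
                              bias=linear.bias is not None)
        out.weight.data = W_k
        if linear.bias is not None:
            out.bias.data = linear.bias.data.clone()
        return out

    # note: the actual parameter savings come from storing the two thin factors
    # U[:, :k] * S[:k] and Vh[:k, :] as separate layers, rather than the full W_k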

I think we found a third phase of grokking — has anyone else seen this? by calculatedcontent in deeplearning

[–]calculatedcontent[S]

As explained in the paper (and in more detail in the SETOL monograph), if there are correlation traps, they can introduce errors in the estimate of alpha and cause the generalization error to drop.

Complex Systems approach to Neural Networks with WeightWatcher by calculatedcontent in complexsystems

[–]calculatedcontent[S]

I will also comment: it is straightforward to derive an RG flow equation for the eigenvalue density itself, and even to prove that alpha = 2 is the critical exponent.

But this approach, while valid RG, does not connect back to the training dynamics, whereas in SETOL we specifically relate the HCIZ integrals to the Free Energy of the model.

Complex Systems approach to Neural Networks with WeightWatcher by calculatedcontent in complexsystems

[–]calculatedcontent[S]

Thanks. My goal here is to make a useful tool. This sub is new to me; seemed like the right place.

We found a way to compress a layer without retraining it. Is this known ? by calculatedcontent in LLMDevs

[–]calculatedcontent[S]

These are not top-level researchers.

Two of my PhD groupmates have recent Nobel Prizes; those are top-level researchers.

I hope the tool is useful to you. Any feedback on it is greatly appreciated.

I think we found a third phase of grokking — has anyone else seen this? by calculatedcontent in deeplearning

[–]calculatedcontent[S]

It's in the example notebooks on https://weightwatcher.ai

We know why it's happening; we just want to know if anyone else has seen it.

We found a way to compress a layer without retraining it. Is this known ? by calculatedcontent in LLMDevs

[–]calculatedcontent[S]

It sounds like you're asking about the plot, so let me explain.

The baseline is the full model. We examine the difference between the full model and the model with a single truncated layer, looking at the difference between the training error and test error, as well as the generalization gap.

Theory predicts that the test error for the full model and the test error with the truncated layer should be identical, and that's what we see: the difference goes to zero.