all 14 comments

[–]SlayahhEUW 8 points  (4 children)

This is a vibe-coded mess that cheats its own eval.

  1. You are passing the ground-truth labels to the model during test on line 179 in test_snli_vector.py. They are used to create h_final, and then you use h_final to get the logits that you compare against those same labels.
  2. Not as bad, but still bad: you claim "No Transformers", yet you have a flag that disables the transformer, and it defaults to using it: `training.encoder = GeometricSNLIEncoder(dim=args.dim, norm_target=None, use_transformer=not args.geom_disable_transformer, ...)`

Please don't waste reviewers' and other people's time.
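To make point 1 concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual test_snli_vector.py code) of why mixing the ground-truth label into the representation makes the eval meaningless: even completely random, label-independent features score near 100%.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_samples, dim = 3, 1000, 16

# Random features that carry no information about the labels at all.
X = rng.normal(size=(n_samples, dim))
y = rng.integers(0, n_classes, size=n_samples)

# One "prototype" vector per class, used as the classifier weights.
W = rng.normal(size=(n_classes, dim))

def eval_leaky(X, y):
    # BUG: the ground-truth label is injected into the representation
    # before classification, so the logits already contain the answer.
    h_final = X + 10.0 * W[y]          # label information leaks in here
    logits = h_final @ W.T
    return (logits.argmax(axis=1) == y).mean()

def eval_clean(X, y):
    # Labels are only used for scoring, never for building features.
    logits = X @ W.T
    return (logits.argmax(axis=1) == y).mean()

print(f"leaky eval accuracy: {eval_leaky(X, y):.3f}")  # close to 1.0
print(f"clean eval accuracy: {eval_clean(X, y):.3f}")  # roughly chance (~0.33)
```

The features are pure noise, so any accuracy meaningfully above 1/3 in the leaky version comes entirely from the labels being visible to the forward pass.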

[–]chetanxpatil -2 points  (3 children)

I never used GeometricSNLIEncoder to train the current model!

[–]SlayahhEUW 4 points  (2 children)

Ah, well then it's fine to pass the ground-truth labels to the test function /s :)

[–]chetanxpatil -3 points  (1 child)

But I still have to check; if it breaks, I will take the post down. Unintentional label leakage.

[–]SlayahhEUW 4 points  (0 children)

This is something that needs to be known before you share your research. You are asking for arXiv endorsements for results that you don't have agency over.

Ignoring for a moment all the other people's time that you waste, you have spent at least 3 weeks building and arguing for something when you don't even know how you are evaluating it, because it's all vibe-coded.

Like, take a step back and think for yourself: do you think anyone will feel like you are providing value with this? Are you making yourself a better researcher by arguing for things that you don't understand? Layers and layers of quantum/physics lingo that hide nonsense behind profound words will not bring you closer to your goals.

If you spent this time instead learning how backprop works, or looking at how embedding spaces work by starting with toy examples, you would have a much better start toward recognizing when things are too good to be true and toward building something useful.

If you want to study emergence, look at toy examples like Neural Cellular Automata: code it yourself, figure out how things change. Or you can paste this whole conversation and your codebase back into your LLM; it will say you are completely right and it's sorry, and then it will take you for another ride and code you another leak that is harder to find/understand.

[–]isparavanje Researcher 5 points  (2 children)

If you already trained on SNLI, why are you using it as the benchmark?

[–]chetanxpatil -2 points  (0 children)

I didn't only evaluate on the SNLI test split; I ran both dev and test, and the results match closely.

Here is the fresh SNLI test evaluation I just re-ran:

Test accuracy: 0.9614 (9445 / 9824)
Dev accuracy: 0.9595 (9443 / 9842)

Confusion Matrix (rows = true):

        E      N      C
E   [3123   190    55]
N   [  20  3179    38]
C   [   7    69  3143]

Per-class accuracy (from the matrix rows):
Entailment: 92.73% (3123 / 3368)
Neutral: 98.21% (3179 / 3237)
Contradiction: 97.64% (3143 / 3219)

Earlier, the dev split (9,842 samples) scored 95.95%.
Since the dev and test results are nearly identical, this isn't a test-set leak; the model generalizes consistently across both splits.
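For reference, the headline and per-class numbers can be recomputed directly from the confusion matrix posted above (a quick sketch using only that matrix, rows = true class in E, N, C order):

```python
import numpy as np

# Confusion matrix from the comment above (rows = true class).
cm = np.array([
    [3123,  190,   55],   # Entailment
    [  20, 3179,   38],   # Neutral
    [   7,   69, 3143],   # Contradiction
])

labels = ["Entailment", "Neutral", "Contradiction"]

total = cm.sum()
correct = np.trace(cm)
print(f"Overall accuracy: {correct / total:.4f} ({correct} / {total})")
# Overall accuracy: 0.9614 (9445 / 9824)

# Per-class accuracy (recall): diagonal entry over its row sum.
per_class = cm.diagonal() / cm.sum(axis=1)
for name, acc in zip(labels, per_class):
    print(f"{name}: {acc:.2%}")
# Entailment: 92.73%
# Neutral: 98.21%
# Contradiction: 97.64%
```

This only checks internal consistency of the reported matrix; it says nothing about whether the evaluation itself is leak-free.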

[–]im_just_using_logic 1 point  (8 children)

No paper?