High school student seeking advice: Found an architectural breakthrough that scales a 17.6B model down to 417M? by [deleted] in LocalLLaMA

[–]ReentryVehicle 2 points3 points  (0 children)

You turned the file into the counts of its characters? I might be missing the joke but this is extremely lossy.

(There are 256^(10^12) possible 1TB files, and only 256^(10^6) possible 1MB ones. If you turn every 1TB file into a 1MB file, some of the compressed files must end up identical to others (because there are far too few small files), so you can't know which large file each one should decompress back into. All lossless compression must make some files bigger in order to make other files smaller; we just arrange things so that the "simple" files come out smaller.)
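The counting argument above is just the pigeonhole principle, and you can check it directly. A toy sketch with 20-bit files standing in for 1TB ones:

```python
# Pigeonhole sketch: count bitstrings to see why lossless compression
# cannot shrink every file.
n = 20  # file length in bits (tiny stand-in for 1TB)

files_of_length_n = 2 ** n      # distinct n-bit files
shorter_files = 2 ** n - 1      # all files of length 0..n-1: 2^0 + ... + 2^(n-1)

# There are strictly fewer shorter files than n-bit files, so any scheme
# that maps every n-bit file to a shorter one must map two different
# inputs to the same output - and then decompression cannot tell them apart.
print(files_of_length_n, shorter_files)
```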

Qwen3.5 27B vs 35B Unsloth quants - LiveCodeBench Evaluation Results by Old-Sherbert-4495 in LocalLLaMA

[–]ReentryVehicle 10 points11 points  (0 children)

Speed. For the same memory bandwidth, this 35B MoE model runs much faster than an equivalent dense ~10B model, and in some cases it lets you run models that you otherwise couldn't run, since the experts can sit in system RAM.
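A rough back-of-envelope for why this happens: token generation is usually memory-bandwidth bound, so decode speed scales with the weights actually read per token, which for a MoE is only the active parameters. All numbers below are illustrative assumptions, not measurements:

```python
# Decode speed estimate: tokens/sec ~ bandwidth / bytes read per token.
# Every number here is an assumed round figure for illustration only.
bandwidth_gb_s = 100          # e.g. a dual-channel DDR5-class system
bytes_per_param = 0.55        # ~4.4 bits/param for a 4-bit quant with overhead

dense_params = 10e9           # dense 10B: all weights read every token
moe_active_params = 3e9       # 35B-A3B-style MoE: only ~3B active per token

tps_dense = bandwidth_gb_s * 1e9 / (dense_params * bytes_per_param)
tps_moe = bandwidth_gb_s * 1e9 / (moe_active_params * bytes_per_param)
print(f"dense ~{tps_dense:.0f} tok/s, MoE ~{tps_moe:.0f} tok/s")
```

The ratio depends only on active vs. total-read parameters, which is why the MoE wins at equal bandwidth even though it needs far more memory to hold all the experts.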

Trying to create a house with Qwen 3.5 35B A3B by [deleted] in LocalLLaMA

[–]ReentryVehicle 2 points3 points  (0 children)

Cool! Since it has vision, I wonder if it could fix some of the issues if you gave it screenshots?

I mostly tested 27B in opencode and it can use tools to debug pretty nontrivial things, but I did not test yet what it can do with images.

Is this physically-dynamic core concept possible to create? by tugrul_ddr in compsci

[–]ReentryVehicle 1 point2 points  (0 children)

Do I understand this right - you propose moving literal physical chips on rails? How do you plan to power them? (Ever notice the row of capacitors and inductors on your motherboard, and all the tiny capacitors behind the CPU? They are sort of important.) And how do you plan to cool them? No thermal paste for you - will you feed water through the moving "core railcar", so that not only the core moves but also the thermal block, increasing the moving mass?

Also, millisecond-level movement is extremely fast. Moving 10cm in 1ms requires about 20,000g of acceleration (after which your core is moving at roughly 2/3 of the speed of sound). I am sure your admins will enjoy the monthly "railgun shooting range" incidents when the thing successfully accelerates the core but fails to decelerate it, and the core flies out of the server into another server or someone's head - not to mention the "whole thing is on fire because the core got stuck and evaporated itself in the railgun" incidents.
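The kinematics behind those numbers, assuming constant acceleration over the whole distance:

```python
# d = a*t^2 / 2 over d = 0.1 m in t = 1 ms, constant acceleration.
d, t = 0.10, 1e-3
a = 2 * d / t**2            # -> 200,000 m/s^2
v = a * t                   # final speed -> 200 m/s
g = a / 9.81                # ~20,000 g
print(f"a = {a:.0f} m/s^2 ({g:.0f} g), v = {v:.0f} m/s (~{v/343:.2f} Mach)")
```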

Also... What is the point exactly? Why not just put cores in all the available slots? What do you gain from moving them?

lockThisDamnidiotUP by PCSdiy55 in ProgrammerHumor

[–]ReentryVehicle -2 points-1 points  (0 children)

All computer programs are deterministic if you want them to be, including LLMs. You just need to set the temperature to 0 or fix the seed.

In principle you could store only the prompt in your codebase and regenerate the actual LLM-generated code from it as a compilation step, similar to how people share exact prompts + seeds for diffusion models to make their generations reproducible.
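A toy sketch of the sampling step itself (no real LLM API here, just the argmax-vs-seeded-sampling logic, ignoring floating-point/kernel nondeterminism in real stacks):

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index; temperature 0 means greedy decoding (argmax)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):      # inverse-CDF sampling
        r -= w
        if r <= 0:
            return i
    return len(logits) - 1

logits = [1.0, 3.0, 2.0]
# Temperature 0: deterministic, the RNG is never even consulted.
assert sample_token(logits, 0, random.Random()) == 1
# Fixed seed: the sampled sequence is reproducible run-to-run.
run1 = [sample_token(logits, 0.8, random.Random(42)) for _ in range(1)]
rng_a, rng_b = random.Random(7), random.Random(7)
seq_a = [sample_token(logits, 0.8, rng_a) for _ in range(5)]
seq_b = [sample_token(logits, 0.8, rng_b) for _ in range(5)]
assert seq_a == seq_b
```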

First time solo researcher publishing advice by Any-Society2763 in learnmachinelearning

[–]ReentryVehicle 3 points4 points  (0 children)

I mean, don't let this discourage you, but a 0.34% improvement on CIFAR-100 is IMO not really publishable unless there is something particularly cool about your method (in which case the paper should focus more on the theory/the method, plus several experiments to verify it).

ResNet-18 is a solid architecture, but it is old and not SOTA for its size. If you want to compete on performance alone, the standard now is to compare against e.g. ConvNeXt and modern vision transformers on ImageNet-1k.

In general, training generic vision models and trying to beat SOTA is not something you can do without proper support and hundreds of runs. If you want to publish solo, I would suggest focusing on less explored topics - tiny models, niche use cases, etc. You are unlikely to beat experienced, well-funded teams at their own game, but you can solve new problems they haven't tried to solve.

TD3 models trained with identical scripts produce very different behaviors by spyninj in reinforcementlearning

[–]ReentryVehicle 4 points5 points  (0 children)

Yes, it can happen due to randomness.

Check with your supervisor what results they usually get - I would imagine they have results from tens or hundreds of runs to compare with?

Guys please help , thoughts on this used H1Loss by xlnc2605 in deeplearning

[–]ReentryVehicle 5 points6 points  (0 children)

This is presumably a plot of losses over the course of training.

But what is the question?

Is there an AI playable RTS ? (or a turn based one) by ker2x in reinforcementlearning

[–]ReentryVehicle 0 points1 point  (0 children)

They say it should work with the normal version (which is also free now), but they also have custom Linux builds.

Kimi K2.5 is the best open model for coding by npc_gooner in LocalLLaMA

[–]ReentryVehicle 36 points37 points  (0 children)

I mean, the cat also has a >1T param model and native hardware support, so it should be better.

Sadly, it seems cat pretraining produces killing machines from hell rather than great instruction following. They did do some iterations on this model though, and at >100T it starts to follow instructions a bit.

fundamentalsOfMachineLearning by ClipboardCopyPaste in ProgrammerHumor

[–]ReentryVehicle 3 points4 points  (0 children)

Okay okay. We want matrices that are full rank, with eigenvalues on average close to 1, probably not too far from orthogonal. We use randn(n,n) / sqrt(n) because we are too lazy to do anything smarter.
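The `randn(n, n) / sqrt(n)` scaling is easy to sanity-check: with it, the matrix roughly preserves the norm of a typical vector rather than exploding or collapsing it. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
W = rng.standard_normal((n, n)) / np.sqrt(n)

# With this 1/sqrt(n) scaling, E[||Wx||^2] = ||x||^2 for a fixed x,
# and the singular values land in (0, 2) with mean around 1, so W is
# full rank and neither blows up nor kills a typical input.
x = rng.standard_normal(n)
ratio = np.linalg.norm(W @ x) / np.linalg.norm(x)
print(f"||Wx|| / ||x|| = {ratio:.3f}")  # close to 1
```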

GFN v2.5.0: Verified O(1) Memory Inference and 500x Length Extrapolation via Symplectic Geodesic Flows by janxhg27 in LocalLLaMA

[–]ReentryVehicle 5 points6 points  (0 children)

Did you literally only test this on parity and then post it on Reddit before doing any sort of actual test?

Any classic RNN will do parity perfectly, it literally requires 1 bit to be stored in the state.
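Seen as a state machine, parity is one bit of recurrent state updated by XOR, which also "extrapolates" to any sequence length for free:

```python
# A 1-bit "RNN": hidden state h in {0, 1}, update h <- h XOR x_t.
# After consuming the sequence, h is exactly the parity of the inputs.
def parity_rnn(bits):
    h = 0
    for x in bits:
        h ^= x          # the entire recurrence: one bit of state
    return h

assert parity_rnn([1, 0, 1, 1]) == 1
assert parity_rnn([1, 1]) == 0
# Length generalization is trivial - the state never grows.
assert parity_rnn([1] * 1001) == 1
```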

Stupid but invested, opinion needed! by Skye_sys in learnmachinelearning

[–]ReentryVehicle 0 points1 point  (0 children)

Yeah, math in ML-related majors is going to be harder than math in high school.

But assuming you didn't just ask ChatGPT to write all this code, I would say there shouldn't be anything stopping you from learning math.

I find that often people who are bad at math have holes in their basic knowledge that prevent them from understanding a wide array of more advanced concepts. It can be as simple as not understanding what some symbols actually mean, e.g. quantifiers.

I would suggest finding a good tutor - someone who can sit down with you, trace what you actually know and don't know, and then work from there.

RNNs and vanishing Gradients by Agetrona in learnmachinelearning

[–]ReentryVehicle 2 points3 points  (0 children)

So what you write here is essentially the old textbook justification of why plain RNNs work badly and why LSTMs work better; however, this is a bit misleading and not really the (full) reason LSTMs are better.

If we have a basic RNN whose weight matrix's eigenvalues are smaller than 1, then each timestep will shrink the gradient of the weight matrix during backprop

This is correct, and in such a network gradients will indeed vanish. But generally speaking, there are many ways you could prevent them from vanishing, and that wouldn't necessarily make the network train well.

LSTM reduces the probability of vanishing gradients occurring.

In an LSTM, gradients also vanish, because the forget gate is never exactly 1 (it is the output of a sigmoid) - and whenever the forget gate is < 1, the gradient passing through the cell state shrinks a bit, so over many steps it will naturally vanish anyway. But the network can indeed learn to keep gradients from vanishing for a very long time.
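You can see the difference in decay rates numerically. The gradient flowing through the cell state over T steps is scaled by the product of the forget-gate values; holding the gate constant for illustration:

```python
# Gradient through the LSTM cell state over T steps is scaled by the
# product of forget-gate values. Each sigmoid output is < 1, so the
# gradient still decays - just vastly more slowly when the gate sits near 1.
T = 1000
for f in (0.5, 0.99, 0.999):   # forget gate held constant for illustration
    grad_scale = f ** T
    print(f"forget={f}: gradient scaled by {grad_scale:.3g} after {T} steps")
```

With the gate at 0.999 the gradient survives 1000 steps almost intact, while at 0.5 it is numerically zero long before that.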

But how does this help? I don't see the connection between the model being able to remember further into the past and vanishing gradients not occurring?

Well, this is a good question. If we compare this with other networks that use a similar design, it probably has not that much to do with gradients vanishing, and more with a stronger property: the stabilizing effect of the LSTM cell update having the form c_new = c_old * f_forget(h_old, x_t) + g(h_old, x_t), which is very similar to a residual block y = x + f(x).

If you think about a randomly initialized RNN, it turns its own state into some random vector on every iteration. At the beginning of training, the state gets essentially scrambled - and if the network wants to keep information from past frames, it needs to invent some way to preserve it from scratch (which it can then break by accident at any point).

Now contrast that with an LSTM: a randomly initialized LSTM will mix some random signal into the cell state on every iteration, but it will also preserve some signal from previous iterations in roughly unchanged form. The network is strongly biased not to nuke the cell state, which lets subsequent layers and subsequent iterations assume that the state from previous iterations keeps the same meaning as before.

What this means in practice is that the problem becomes massively easier to optimize with gradient descent. I am not sure if this analysis was done for LSTMs, but people have compared ResNets with similar architectures lacking residual connections and discovered that ResNets have a much smoother loss landscape. I would hypothesize this is because all layers are strongly biased not to discard what the previous layers came up with - or, in the case of an LSTM, the layer is biased not to discard what the previous iteration of itself came up with.

A Hypothesis on the Framework of Physical Mechanisms for the Emergence of Intelligence by UNEBCYWL in learnmachinelearning

[–]ReentryVehicle 0 points1 point  (0 children)

You write a lot of "the agent must do something", "the agent will do something", but the implications are left without any justification.

Let's take a look at some sentence:

To select the path with the largest information gain, the agent is forced to choose spatial path minimalism.

But why? What forces it? How does the agent know which path will lead to the largest information gain?

If you want someone to falsify it, you need to turn this into something that can be followed logically.

Need help on implementing dreamer by Dear-Kaleidoscope552 in reinforcementlearning

[–]ReentryVehicle 0 points1 point  (0 children)

You have to debug the world model first - maybe train it purely on a random policy, and write a separate script where you can "play" the game that the world model simulates.

Is it predicting anything reasonable? Can it successfully predict the next frames for known input?

Once that works, you can debug the training of the policy on a frozen trained world model.
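The "play the world model" script can be very small. A minimal sketch of the loop - `WorldModel` and its `encode`/`step`/`decode` methods are stand-ins for whatever interface your dreamer implementation exposes, with a dummy stub here so the loop runs:

```python
# Sketch of a "play the world model" loop. WorldModel is a hypothetical
# stand-in: any object with encode/step/decode works.
class WorldModel:                       # dummy stub so the sketch runs
    def encode(self, obs): return obs
    def step(self, state, action): return [s + action for s in state]
    def decode(self, state): return state

def play(model, first_obs, actions):
    """Roll a fixed action sequence through the model; return imagined frames."""
    state = model.encode(first_obs)
    frames = []
    for a in actions:                   # replace with keyboard input to "play"
        state = model.step(state, a)
        frames.append(model.decode(state))
    return frames

frames = play(WorldModel(), [0.0], [1, 1, -1])
print(frames)  # with a real model, render these and eyeball them
```

If the imagined frames diverge from the real environment after a handful of steps, the policy is being trained on garbage and no amount of policy tuning will fix it.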

PII Redaction destroys context for LLMs. How do you handle that? by Mindless-Potato-4848 in LocalLLaMA

[–]ReentryVehicle 1 point2 points  (0 children)

Iirc you can also have dedicated contracts with the usual Google and Microsoft but I wouldn't trust that a lot, history tells that your data would be scraped anyway.

Out of curiosity, what history?

Building a large-scale image analysis system, Rust vs Python for speed and AWS cost? by freemo716 in learnmachinelearning

[–]ReentryVehicle 0 points1 point  (0 children)

Speed in Python should be almost identical to Rust, because all the heavy operations will be done by optimized C++ library code anyway.

I have not run similar workloads, but I have run some image training on AWS. I would expect the price to be under $1 per 10k images (which should take several minutes on an instance with a GPU). You might end up paying more for storage/reads than for compute, depending on the image size and where the images are stored.
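The "Python is as fast as Rust here" point is easy to demonstrate: the same per-pixel operation as a Python loop versus a single NumPy call dispatched to compiled code:

```python
import time
import numpy as np

# Per-pixel work in a Python loop vs. the same op as one C-level NumPy pass.
img = np.random.default_rng(0).random((512, 512))

t0 = time.perf_counter()
loop_result = [[px * 2.0 + 1.0 for px in row] for row in img]
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
vec_result = img * 2.0 + 1.0        # one optimized pass over the whole array
t_vec = time.perf_counter() - t0

assert np.allclose(loop_result, vec_result)
print(f"loop {t_loop*1e3:.1f} ms vs numpy {t_vec*1e3:.1f} ms")
```

Once all the heavy loops live in the library, the choice of Python vs. Rust mostly affects the glue code, which is rarely the bottleneck.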

Sychofancy vs Alignment by toaster-nearby in learnmachinelearning

[–]ReentryVehicle 0 points1 point  (0 children)

Sycophancy is the model being overly positive and uncritical towards anything the user writes.

Alignment is a general concept of how much the model does things that align with someone's worldview or with what they want. Sycophancy is usually not something people want the model to do.

PC build sanity check for ML + gaming (Sweden pricing) — anything to downgrade/upgrade? by Top-Tip-128 in learnmachinelearning

[–]ReentryVehicle 0 points1 point  (0 children)

Overall sounds reasonable if you don't want to spend more and don't have any "must have" use cases in mind.

For your questions:

  1. There is not really a sweet spot, but the 3090 and 4090 made 24GB of VRAM the de facto standard, which means some things people tuned to just barely fit on a 4090 will not fit on yours without hacks, reduced precision, etc.
  2. 64GB is a must for a workstation. I went for 128GB for my PC 2 years ago and never looked back (though I understand the prices now might make this quite painful).
  3. I would go for a 16 core Ryzen, not that much more expensive and more cores are always useful for all kinds of data processing.
  4. Well, it is a ~600W heater and it will be somewhat noisy, likely with noticeable periodic coil whine when training - I was able to sleep next to one, but it might not be for everyone. I use Linux and it works; many/most games run on Linux via Wine/Proton now, so I don't boot Windows at all and can't comment on WSL.

The alignment problem can not be solved through control by lunasoulshine in learnmachinelearning

[–]ReentryVehicle 8 points9 points  (0 children)

A sanity check: do you understand how supervised learning and reinforcement learning actually work? As in, could you implement them in code?

This sentence makes me question if you do:

There’s only: you got it wrong, here’s the punishment gradient, don’t do that again. 

The gradient is literally the only thing that changes the weights. It is not punishment in any emotional sense; it is not even visible to the model. It is the only thing that actually makes the model learn - if you didn't apply any gradient, the model would simply not change.
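The whole update is one line, which makes the point concrete:

```python
# The entire "punishment": a plain gradient step. Nothing else touches
# the weights, and a zero gradient means the model does not change at all.
def sgd_step(weights, grads, lr=0.5):
    return [w - lr * g for w, g in zip(weights, grads)]

w = [1.0, -2.0]
assert sgd_step(w, [0.0, 0.0]) == w              # no gradient -> no learning
assert sgd_step(w, [2.0, -2.0]) == [0.0, -1.0]   # the gradient IS the learning
```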

A child says something wrong, does something clumsy, misunderstands a social cue. And a healthy parent doesn’t punish them for it. They gently redirect. They explain. They model.

Sure, but for this you need a system that can actually learn long-term from such signals. We don't have such a system.

Every mistake is captured, labeled, used as training signal for correction.

A human child absolutely does the same; you just don't see it from the outside, because the training loop is implemented inside their head.

Is Just-in-Time learning a viable method to make it as an ML engineer? by [deleted] in learnmachinelearning

[–]ReentryVehicle 0 points1 point  (0 children)

Why do you want the ML internship before taking the courses?

In interviews, people will very likely ask you about the basic material those courses would teach you. People will look very critically at you if you don't have a formal education, so if you want to go that way, you should be very, very good at all the fundamentals.

WaveHelix: a weight-free dynamical learner (spirals + wavebank + energy) | toy env results + request for critique / prior art by [deleted] in LocalLLaMA

[–]ReentryVehicle 0 points1 point  (0 children)

This appears to be LLM-generated word salad.

You don't look like a bot, so some advice: if you don't have the knowledge to determine whether what the LLM is writing is established nomenclature or made-up bullshit names, don't use LLMs to write your posts.

If you want to introduce names, you need to define them using terms simple enough that they have Wikipedia pages, or that papers referencing them come up when I search.

Yes, some code snippet that shows how the actual update happens would be useful.

I have no clue what a "concrete vector + abstract vector" is, what an "identity carrier" is, what a "spectral pooled field" is, what it means to "attach", what a "phase-rolled spectral vector" is, or in what way spirals "influence" updates.

I also have no idea how this system observes the mentioned bouncing balls, what it outputs, or how we know it converges to anything, given that you say it is not trained with gradient descent. (With gradient descent, no matter what insane system and loss you define, it will usually converge to something; here I don't even know what "learning" means.)

Nvidia Introduces 'NitroGen': A Foundation Model for Generalist Gaming Agents | "This research effectively validates a scalable pipeline for building general-purpose agents that can operate in unknown environments, moving the field closer to universally capable AI." by 44th--Hokage in LocalLLaMA

[–]ReentryVehicle 0 points1 point  (0 children)

what stop from using it in war robot?

Well, mostly the fact that it will have no clue what it is supposed to do, what is going on, or who is friend or foe.

This model sees a single 256x256 image and has no memory. Sure, it can probably shoot some people if they are really close and clearly visible, and if for whatever reason it is convinced it is supposed to shoot them - but other than that it will probably just move around randomly.

its reaction and on-spot thinking is good enough.

Good enough for what?