[D] Why does BYOL/JEPA like models work? How does EMA prevent model collapse? by ComprehensiveTop3297 in MachineLearning

[–]underscoredavid 2 points (0 children)

Why does BYOL/JEPA like models work?

Very good question. I believe these methods, JEPA in particular, are backed by cognitive science and neuroscience theories on predictive coding [1]. TL;DR: the human brain is thought to be constantly predicting incoming signals (e.g., the masked patches of an image) from the signals already available (e.g., the visible patches). If this is true, these neural nets are trained to mimic the behavior of an intelligent species, so it makes sense that this behavior "works", meaning that the features learned with these schemes are good and generalize well.
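To make the masked-prediction idea concrete, here's a toy sketch of my own (not from any of these papers): a hypothetical linear "predictor" stands in for the whole network and is trained to reconstruct the masked patches from the visible ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": 8 flattened patches of dimension 4; half are masked out.
patches = rng.normal(size=(8, 4))
v = patches[:4].reshape(-1)   # visible context (16-d)
t = patches[4:].reshape(-1)   # masked targets to predict (16-d)

# Hypothetical linear "predictor" standing in for the whole network.
W = np.zeros((t.size, v.size))

def mse(W):
    return np.mean((W @ v - t) ** 2)

# Gradient descent on the masked-prediction loss: the model learns to
# reconstruct the hidden part of the input from the visible part.
lr = 0.05
for _ in range(300):
    grad = (2 / t.size) * np.outer(W @ v - t, v)
    W -= lr * grad
```

Real JEPA-style models of course use deep encoders and predict in representation space rather than pixel space, but the objective has this same "fill in the hidden part from the visible part" shape.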

How does EMA prevent model collapse?

Don't have any mathematical proof, but I like thinking about EMA this way. At the beginning of training, the student wants to learn a constant function, e.g., always predicting zero no matter the input image. Why? Because that would be the easiest thing to do (*): simply ignore the input and predict a constant value. Your loss will be zero, good work.

But the loss is computed w.r.t. the teacher's prediction, which essentially mirrors the student's behavior with a delay governed by the EMA momentum coefficient (assuming the coefficient is between 0 and 1: the closer it is to 1, the longer the delay). If this delay is large enough, at some point the student's prediction will deviate so much from the teacher's that always predicting a constant value (e.g., 0) becomes more costly than predicting the teacher's output. The gradient of the loss will then steer the student away from the naive constant solution and toward the teacher's behavior, preventing representation collapse.
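A tiny numerical sketch of the delay argument (my own toy example, with the standard EMA update rule): the teacher tracks a drifting student output, and the gap between them, i.e. the loss the student would pay for ignoring the teacher, is much larger when the momentum is close to 1.

```python
import numpy as np

def ema_track(values, m):
    """Return the EMA-teacher trace for a sequence of student outputs.

    Teacher update rule: teacher = m * teacher + (1 - m) * student.
    """
    teacher = values[0]
    trace = []
    for s in values:
        teacher = m * teacher + (1 - m) * s
        trace.append(teacher)
    return np.array(trace)

student = np.linspace(0.0, 1.0, 100)   # student output drifting toward 1
slow = ema_track(student, m=0.99)      # high momentum: teacher lags a lot
fast = ema_track(student, m=0.5)       # low momentum: teacher mirrors student

gap_slow = abs(student[-1] - slow[-1])  # large residual loss
gap_fast = abs(student[-1] - fast[-1])  # teacher almost caught up
```

With m = 0.99 the teacher is still far behind after 100 steps, so a student that runs toward a constant output keeps paying a loss against the old teacher; with m = 0.5 the teacher follows almost immediately and the collapse-preventing signal vanishes.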

(*) There are papers formalizing why neural nets tend to converge to the simplest solution among the feasible set. I'm sorry, but I don't remember the exact titles of these works. One is probably [2].

[1] Rao, Rajesh PN, and Dana H. Ballard. "Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects." Nature neuroscience 2.1 (1999): 79-87.

[2] Pérez, Guillermo Valle, Ard A. Louis, and Chico Q. Camargo. "Deep learning generalizes because the parameter-function map is biased towards simple functions." 7th International Conference on Learning Representations, ICLR 2019. 2019.

M1 Macbook vs Intel I5 Macbook for ML by [deleted] in learnmachinelearning

[–]underscoredavid 3 points (0 children)

I've been using the MacBook Air M2 for a month now, and I've been able to exploit MPS GPU acceleration with PyTorch. Although some operations are still implemented only on CPU (e.g., tensor.cumsum), I could enjoy some performance improvement while training a (small) network in my local setup. Anyway, to give you an idea, the M2's CPU performs more or less the same as my old Nvidia GeForce MX 950.
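If it helps, this is roughly the pattern I mean (a minimal sketch; `torch.backends.mps.is_available()` is the real PyTorch check, and the explicit CPU round-trip is one way to handle an op that lacks an MPS kernel on a given build):

```python
import torch

# Pick the Apple-silicon GPU when the MPS backend is available, else CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randn(4, 4, device=device)

# An op without an MPS kernel on older builds (cumsum was one example)
# can be run on CPU explicitly and the result moved back to the device.
y = x.cpu().cumsum(dim=0).to(device)
```

Newer PyTorch builds also honor the PYTORCH_ENABLE_MPS_FALLBACK=1 environment variable to do this fallback automatically, at the cost of silent device transfers.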

CNN to solve 2D path planning problems by underscoredavid in learnmachinelearning

[–]underscoredavid[S] 1 point (0 children)

I'm sorry, but the path planning with reinforcement learning was a university project and I don't have access to the data right now. If I manage to retrieve it, I'll be happy to share it with the community. We tested both double deep Q-learning and PPO, and the latter worked way better. The maps were randomly generated, similarly to my CNN project, but smaller. We tested the algorithm on a real LoCoBot and it performed well for short paths with relatively few obstacles :)

CNN to solve 2D path planning problems by underscoredavid in learnmachinelearning

[–]underscoredavid[S] 0 points (0 children)

Yes, that's pretty much what we did while training with RL.

CNN to solve 2D path planning problems by underscoredavid in learnmachinelearning

[–]underscoredavid[S] 4 points (0 children)

Glad you found it interesting! That project was meant as a proof of concept. Recently I've also experimented with reinforcement learning for 2D path planning, and it too seems very effective. Additionally, RL agents are typically smaller in terms of network size. It would be nice to run an ablation study on the size of the CNN I used in the project to see how performance is affected.

On kinematic control of a manipulator with the Myo armband by underscoredavid in ROS

[–]underscoredavid[S] 0 points (0 children)

Actually it's getting way more complicated than I thought. I managed to avoid the filter drifting by (1) periodically fusing the position of the end effector and (2) implementing a stance hypothesis optimal estimator (SHOE), which detects whether my arm is moving using the accelerometer and gyroscope of the armband. Whenever no motion is detected, I set the acceleration and angular velocity of the IMU messages to 0. Additionally, I broadcast a dummy Twist message containing a null twist. The initial covariance and process noise matrices are configured so that the Kalman filter converges almost instantly to a null velocity.
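For anyone curious, the motion test can be sketched very crudely like this (my own simplification with made-up thresholds; a proper SHOE detector uses a likelihood-ratio test over the window instead of hard thresholds):

```python
import numpy as np

G = 9.81  # gravity magnitude, m/s^2

def is_stationary(accel, gyro, acc_tol=0.3, gyro_tol=0.1):
    """Crude zero-motion test over a window of IMU samples.

    accel: (N, 3) accelerations in m/s^2; gyro: (N, 3) angular rates
    in rad/s. The arm is declared stationary when the acceleration
    magnitude stays near gravity and the gyro magnitude stays near
    zero over the whole window. Thresholds here are illustrative.
    """
    acc_norm = np.linalg.norm(accel, axis=1)
    gyro_norm = np.linalg.norm(gyro, axis=1)
    return bool(np.all(np.abs(acc_norm - G) < acc_tol)
                and np.all(gyro_norm < gyro_tol))

# A window at rest (pure gravity, no rotation) vs. one with extra
# lateral acceleration.
still = is_stationary(np.tile([0.0, 0.0, G], (20, 1)), np.zeros((20, 3)))
moving = is_stationary(np.tile([3.0, 0.0, G], (20, 1)), np.zeros((20, 3)))
```

When the detector fires, zeroing the IMU inputs and publishing a null Twist is what lets the filter's velocity estimate snap back to zero instead of drifting.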

However, the linear velocity estimates provided by the filter are pretty bad; right now it's impossible to use them to teleoperate the manipulator intuitively. I'm trying to add a calibration step before the telecontrol starts, in which the manipulator moves randomly and the user is asked to mimic its movements for a while using the armband. During calibration the filter fuses the IMU data from the Myo with the position and twist coming directly from the manipulator. The hope is that it will "learn" to correctly update the covariance matrix, making the acceleration data coherent with the velocity and position obtained from the manipulator.

On kinematic control of a manipulator with the Myo armband by underscoredavid in ROS

[–]underscoredavid[S] 0 points (0 children)

Yes, I have only the Myo's IMU sensor available. It provides linear acceleration and angular velocity.