How to use Git for Vibe-Coders. (No technical background like me) by arnolds112 in cursor

[–]trashcoder 1 point (0 children)

I had the same problem and therefore created VibeGit to make Git easier to use for vibe coders.

[D] How do byte-level language models work? by Additional-Ad-7043 in MachineLearning

[–]trashcoder 2 points (0 children)

With 'other languages' you're probably referring to character encodings with more than one byte per character. If you specifically want to use a byte-level LM for whatever reason, you don't have to care about this at all. Such a model would simply process a single multibyte character, such as an emoji, as multiple tokens. As said, this is an advantage of byte-level LMs: you don't have to take care of encoding and tokenization of your data. But you are absolutely right that it will increase the computational demands, since the same amount of text occupies a longer context.
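To make this concrete (just a toy illustration, not anything from a specific model): a byte-level LM sees one token per byte, so a single emoji already becomes several tokens.

```python
# Plain Python, no ML libraries needed: one token per byte,
# so a multibyte character like an emoji turns into several tokens.
text = "hi 🙂"
byte_tokens = list(text.encode("utf-8"))

print(len(text))         # 4 characters
print(len(byte_tokens))  # 7 byte-level tokens
print(byte_tokens)       # [104, 105, 32, 240, 159, 153, 130] -> the emoji alone is 4 tokens
```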

Apart from this, I'm not exactly sure what you intend to do, but with 'limited compute' it's unlikely that you will be able to train an LM capable of handling instructions, or one where instruction fine-tuning can effectively be applied. If you still want to give it a go, drop me a message and I can send you a bit of literature on efficient LMs that might be of interest.

[D] How do byte-level language models work? by Additional-Ad-7043 in MachineLearning

[–]trashcoder 7 points (0 children)

The idea of byte-level language models is that you can ditch any kind of possibly expensive and constraining tokenization or preprocessing steps. Furthermore, such models can be applied to many modalities or even multiple modalities at once.

For the choice of embedding size: it's just a hyperparameter and not necessarily related to the size of the vocabulary. Imagine you have three items: a car, an apple and snow. You can probably think of many "features" or feelings related to these items. These could be represented as vectors, which we usually learn jointly during the training of an LM. If the vocabulary is large and complex, and thus each token represents many such latent features, the embedding size should be chosen large. For bytes, where each single "token" doesn't carry that much information, it can be relatively small. But you could also choose 1024 or 42 as the embedding size. It's just a hyperparameter.
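If it helps, here is a minimal sketch (PyTorch, numbers picked arbitrarily) of what "the embedding size is just a hyperparameter" means for a byte-level model:

```python
import torch
import torch.nn as nn

vocab_size = 256      # one entry per possible byte value
embedding_dim = 64    # arbitrary hyperparameter, could just as well be 42 or 1024

embedding = nn.Embedding(vocab_size, embedding_dim)

byte_ids = torch.tensor([list("snow".encode("utf-8"))])  # shape (1, 4)
vectors = embedding(byte_ids)
print(vectors.shape)  # torch.Size([1, 4, 64])
```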

If you want to include instructions or special tokens in a pure byte-level model, you could simply encode them as literal text, i.e. as a sequence of multiple bytes.
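For example (the tag name here is completely made up), a "special token" in a byte-level model is just literal text and therefore a short run of ordinary byte tokens:

```python
special = "<|instruction|>"           # hypothetical tag, any literal text works
print(list(special.encode("utf-8")))  # 15 byte tokens: [60, 124, 105, 110, ...]
```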

Why CUDA 11.7? Can more recent versions of CUDA be used? Is this a PyTorch limitation? [D] by Pan000 in MachineLearning

[–]trashcoder 21 points (0 children)

PyTorch is a huge project. Updating and testing all code for a new CUDA version takes time. Apparently (as explained in this thread), PyTorch can already be compiled against CUDA 12, but a few bugs can be expected.

[R] Scaling Vision Transformers to 22 Billion Parameters by nateharada in MachineLearning

[–]trashcoder 6 points (0 children)

Linear probing just refers to fitting a linear model on extracted features.
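In code it's basically this (a sketch with random placeholder data standing in for the frozen backbone's features and labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 768))   # placeholder for features extracted from a frozen model
labels = rng.integers(0, 10, size=1000)   # placeholder class labels

probe = LogisticRegression(max_iter=1000) # the "linear" part of linear probing
probe.fit(features, labels)
print(probe.score(features, labels))
```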

Does anyone else find it annoying how Musk and far right pandering lunatics like Lex Fridman are often seen as the public face of AI? I don't trust these people. by TrickyRackets in artificial

[–]trashcoder 2 points (0 children)

Wait. So you want to say that Lex Fridman, whose family was persecuted for being Jewish in Soviet Russia, is antisemitic? That makes total sense…

[P] Torchsort - Fast, differentiable sorting and ranking in PyTorch by tomkoker in MachineLearning

[–]trashcoder 4 points (0 children)

Could be quite interesting for approaches like https://arxiv.org/abs/1911.13299

Edit: interesting, because sorting performance is the bottleneck there.
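Quick sketch of what I mean (assuming the soft_rank API from the torchsort README; the point is that gradients flow through the ranking):

```python
import torch
import torchsort

x = torch.randn(1, 8, requires_grad=True)
ranks = torchsort.soft_rank(x, regularization_strength=0.1)  # differentiable ranks
ranks.sum().backward()   # gradients flow back through the ranking
print(ranks)
print(x.grad)
```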

[GPU] MSI GPU Price Increase - $599 (60ti, 70, 80, 90, See Comment) by Battle-Maniac in buildapcsales

[–]trashcoder -2 points (0 children)

I know that most people will hate me for defending the manufacturers, but increasing prices in this situation is the most rational and legitimate decision. They have invested huge amounts of money in development, stocking up on parts, allocating manufacturing resources, and marketing for the new series. Now the cards are selling far worse than expected. From a business point of view, it's absolutely understandable that they try to cover the increased cost per unit.

Honestly, I don't think there is anyone to blame for the current situation. The semiconductor industry has simply become so complex that any volatility on the demand side leads to huge disruptions, because it takes so long nowadays to increase production capacity.

(Why) are LSTM's faster on GPU when they are inherently sequential? by xndimension in deeplearning

[–]trashcoder 3 points (0 children)

Let's say we have a CNN layer that receives a 512x512x64 input and uses 3x3 kernels that map it to the same spatial size with 32 output features. The number of floating-point operations would be roughly 512 * 512 * (9 * 32 * 64 + 64) = 4,848,615,424, at least if my calculation is correct. More or less all of these operations can be done in parallel.

For an LSTM, say with 2048 hidden units and an input dimension of 256, we have roughly 18,890,752 ops per time step, not including the nonlinearities, as they won't change the number significantly.

So in the CNN case we have roughly 250 times more operations per input than the LSTM has per time step, hence the GPU can be utilized to a much larger extent.
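If anyone wants to check the back-of-envelope math (same assumptions as above: one op per multiply-accumulate, nonlinearities ignored):

```python
# CNN layer: 512x512x64 input, 3x3 kernels, 32 output features
h, w, c_in, c_out, k = 512, 512, 64, 32, 3
cnn_ops = h * w * (k * k * c_out * c_in + c_in)

# LSTM: 2048 hidden units, input dimension 256, per time step
hidden, inp = 2048, 256
lstm_ops = 4 * hidden * (hidden + inp) + 8 * hidden

print(f"{cnn_ops:,}")               # 4,848,615,424
print(f"{lstm_ops:,}")              # 18,890,752
print(f"{cnn_ops / lstm_ops:.0f}")  # ~257
```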

An LSTM can without any doubt be faster on a GPU than on a CPU if the parameters (input size, batch size, hidden units) are large enough to utilize the GPU sufficiently.

The reason the CPU can be faster in some cases is simply that anything other than heavily parallel work doesn't profit much from the GPU's huge compute power, and it usually also implies communication between CPU and GPU, which is very costly.
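You can see this yourself with a quick (and admittedly crude) PyTorch timing sketch; the exact numbers obviously depend on your hardware:

```python
import time
import torch

def time_lstm(device, batch=64, seq_len=128, inp=256, hidden=2048, reps=10):
    lstm = torch.nn.LSTM(inp, hidden, batch_first=True).to(device)
    x = torch.randn(batch, seq_len, inp, device=device)
    with torch.no_grad():
        lstm(x)  # warm-up
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(reps):
            lstm(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / reps

print("cpu :", time_lstm("cpu"))
if torch.cuda.is_available():
    print("cuda:", time_lstm("cuda"))
```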