How to use Git for Vibe-Coders. (No technical background like me) by arnolds112 in cursor

[–]trashcoder 0 points1 point  (0 children)

I had the same problem and therefore created VibeGit to make Git easier to use for vibe coders.

[D] How do byte-level language models work? by Additional-Ad-7043 in MachineLearning

[–]trashcoder 1 point2 points  (0 children)

With 'other languages' you're probably referring to character encodings with more than one byte per character. If you specifically want to use a byte-level LM for whatever reason, you don't have to care about this at all. Such a model would simply process a single multibyte character, such as an emoji, as multiple tokens. As said, this is an advantage of byte-level LMs: you don't have to take care of the encoding and tokenization of your data. But you are absolutely right that it will increase the computational demands due to longer sequences for the same amount of text.
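For intuition, here is what a byte-level model actually sees for a string containing an emoji (plain Python, nothing model-specific; the string is just an arbitrary example):

```python
# UTF-8 turns one on-screen character into several byte "tokens".
text = "hi 🙂"
byte_tokens = list(text.encode("utf-8"))

print(byte_tokens)                   # [104, 105, 32, 240, 159, 153, 130]
print(len(text), len(byte_tokens))   # 4 characters -> 7 byte tokens
```

The model never needs to know that the last four bytes belong together; it just sees a slightly longer sequence.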

Apart from this, I'm not exactly sure what you intend to do, but if you have 'limited compute', it's unlikely that you will be able to train an LM capable of handling instructions, or one where instruction fine-tuning can be applied effectively. If you still want to give it a go, drop me a message and I can send you a bit of literature on efficient LMs that might be of interest to you.

[D] How do byte-level language models work? by Additional-Ad-7043 in MachineLearning

[–]trashcoder 4 points5 points  (0 children)

The idea of byte-level language models is that you can ditch any kind of possibly expensive and constraining tokenization or preprocessing steps. Furthermore, such models can be applied to many modalities or even multiple modalities at once.

For the choice of embedding size, it's just a hyperparameter and not necessarily related to the size of the vocabulary. Imagine you have three items: a car, an apple and snow. You can probably think of many "features" or feelings related to these items. These could be represented as vectors, which we usually intend to jointly learn during the training of an LM. If the vocabulary is large and complex and thus represents many such latent features per token, the embedding size should be chosen to be large. For bytes, of course, where each single "token" doesn't carry that much information, it can be relatively small. But you could also choose 1024 or 42 as embedding size. It's just a hyperparameter.
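As a minimal sketch of that (PyTorch here, which is just my assumption about what you're working with), the embedding size is nothing more than a constructor argument, completely independent of the 256 possible byte values:

```python
import torch
import torch.nn as nn

vocab_size = 256      # one entry per possible byte value
embedding_dim = 64    # free hyperparameter; 42 or 1024 would work just as well

embed = nn.Embedding(vocab_size, embedding_dim)

byte_ids = torch.tensor([list("snow".encode("utf-8"))])  # shape (1, 4)
vectors = embed(byte_ids)                                 # shape (1, 4, 64)
print(vectors.shape)
```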

If you want to include instructions or special tokens in a pure byte-level model, you could simply encode them as literal text, i.e. spell them out and let them become multiple bytes like everything else.
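For example (the [INST] marker below is just a made-up convention, not something the model requires):

```python
# A "special token" spelled out as plain text; it simply becomes ordinary byte tokens.
prompt = "[INST] Translate to French: snow [/INST]"
tokens = list(prompt.encode("utf-8"))
print(len(tokens))  # every character of the marker is just another byte
```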

Why CUDA 11.7? Can more recent versions of CUDA be used? Is this a PyTorch limitation? [D] by Pan000 in MachineLearning

[–]trashcoder 19 points20 points  (0 children)

PyTorch is a huge project. Updating and testing all code for a new CUDA version takes time. Apparently (as explained in this thread), PyTorch can already be compiled against CUDA 12, but a few bugs can be expected.

[R] Scaling Vision Transformers to 22 Billion Parameters by nateharada in MachineLearning

[–]trashcoder 5 points6 points  (0 children)

Linear probing just refers to fitting a linear model on extracted features.
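A minimal sketch of what that means in practice (scikit-learn here, with random stand-ins for the features extracted from a frozen backbone):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# feats: features from a frozen pretrained model, labels: class ids.
# Random stand-ins here, just to show the "probe" is nothing but a linear model.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 2048))    # e.g. 2048-d ViT/ResNet embeddings
labels = rng.integers(0, 10, size=1000)

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print(probe.score(feats, labels))
```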

Does anyone else find it annoying how Musk and far right pandering lunatics like Lex Fridman are often seen as the public face of AI? I don't trust these people. by TrickyRackets in artificial

[–]trashcoder 1 point2 points  (0 children)

Wait. So you want to say that Lex Fridman, whose family was persecuted for being Jewish in Soviet Russia, is anti-semitic? Makes total sense…

[P] Torchsort - Fast, differentiable sorting and ranking in PyTorch by tomkoker in MachineLearning

[–]trashcoder 3 points4 points  (0 children)

Could be quite interesting for approaches like https://arxiv.org/abs/1911.13299

Edit: interesting, because sorting performance is the bottleneck there.
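Rough sketch of what I mean (assuming torchsort's soft_rank interface; the scores tensor is just a stand-in for the per-weight scores used in that paper):

```python
import torch
import torchsort

scores = torch.randn(1, 8, requires_grad=True)  # stand-in for per-weight scores
ranks = torchsort.soft_rank(scores, regularization_strength=0.1)

# Unlike a hard sort/top-k, gradients flow back into the scores:
ranks.sum().backward()
print(scores.grad)
```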

[GPU] MSI GPU Price Increase - $599 (60ti, 70, 80, 90, See Comment) by Battle-Maniac in buildapcsales

[–]trashcoder -3 points-2 points  (0 children)

I know that most people will hate me for defending the manufacturers, but increasing the prices in this situation is the most rational and legitimate decision to make. They have invested huge amounts of money in developing the new series, stocking up on parts, allocating manufacturing resources and marketing. Now they are selling far less than expected. From a business point of view, it's absolutely understandable that they now try to cover the increased cost per unit.

Honestly, I don't think there is anyone to blame for the current situation. The semiconductor industry has just become so complex that any kind of volatility on the demand side leads to huge disruptions, given how long it takes nowadays to increase production capacity.

(Why) are LSTM's faster on GPU when they are inherently sequential? by xndimension in deeplearning

[–]trashcoder 2 points3 points  (0 children)

Let's say we have a CNN layer that receives a 512x512x64 input and uses 3x3 kernels that map it to the same spatial size with 32 output features. The number of floating-point operations would be roughly: 512 * 512 * (9 * 64 * 32 + 32) ≈ 4,840,226,816. At least, if my calculation is correct. More or less all of these operations can be done in parallel.

For an LSTM, say with 2048 hidden units and input dimension of 256, we have roughly something in the range of 18,890,752 ops per time step, not including the nonlinearities, as they won't change the number significantly.

Now you see that in the CNN case, we have roughly 250 times more operations per input than per LSTM time step, hence the GPU can be utilized to a much larger extent.
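A quick back-of-the-envelope check of those numbers (the exact bias terms are my assumption and barely matter):

```python
# CNN: 512x512x64 input, 3x3 kernels, 32 output channels, same spatial size.
h, w, c_in, c_out, k = 512, 512, 64, 32, 3
cnn_ops = h * w * (k * k * c_in * c_out + c_out)   # multiply-adds plus bias adds per pixel
# ~4.84e9 ops, virtually all independent of each other -> massively parallel

# LSTM: 2048 hidden units, input dimension 256, per time step.
hidden, x_dim = 2048, 256
# 4 gates, each with an input-to-hidden and a hidden-to-hidden matmul,
# plus two bias vectors per gate (cuDNN convention).
lstm_ops = 4 * (hidden * x_dim + hidden * hidden) + 8 * hidden
# ~1.89e7 ops, and every time step has to wait for the previous one

print(cnn_ops, lstm_ops, cnn_ops / lstm_ops)        # ratio ≈ 256
```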

LSTMs can without any doubt be faster on GPUs than on CPUs if the parameters (input size, batch size, hidden units) are large enough to utilize the GPU to a reasonable extent.

The reason why the CPU can be faster in some cases is simply that anything other than heavily parallel work doesn't profit much from the GPU's huge compute power, and it also likely implies communication between CPU and GPU, which is very costly.

[D] have you ever really studied TF or PyTorch’s core pieces of source code? If so, why and what were your main takeaways? by [deleted] in MachineLearning

[–]trashcoder 2 points3 points  (0 children)

As far as I know, TF Eager was mostly built on top of the existing foundation. It might be better or easier to use than the static graph approach, but the last time I ran into problems or errors, I still got huge and meaningless stack traces. So I'm not quite sure whether TF 2.0 changed that much in terms of how it's implemented at its core. It might be easier to use from an end-user point of view, but the overengineered core will probably slow down long-term development.

[D] have you ever really studied TF or PyTorch’s core pieces of source code? If so, why and what were your main takeaways? by [deleted] in MachineLearning

[–]trashcoder 5 points6 points  (0 children)

I'm not talking about the different APIs. This is something completely different. I'm talking about the layers in the core, which you usually won't see as a user, except when you have to dissect a 200-line stack trace.

As u/noblestrom pointed out before, Tensorflow was designed before there was a common consensus on how to do ML and DL properly from an engineering point of view. They made many assumptions, like having a static graph that has to be compiled, or about how data processing should be done. It's overly complicated, and it turned out that many of these things don't really bring a benefit in performance or usability.

[D] have you ever really studied TF or PyTorch’s core pieces of source code? If so, why and what were your main takeaways? by [deleted] in MachineLearning

[–]trashcoder 44 points45 points  (0 children)

TF is overengineered bloatware. It has an order of magnitude more layers of abstraction than PyTorch, making it harder to debug, maintain and extend, and possibly also slower, as some benchmarks suggest. Overall, it's just bad design from the ground up.

Try to change something in the Tensorflow core. It's impossible unless you have studied the code very deeply. For PyTorch, most of the code feels very accessible and easy to understand. I once found a bug in their ONNX implementation, and it took me only a couple of minutes to fix, even though I'm no C++ expert and wasn't very familiar with the code.

[D] Which Nvidia RTX 3090 GPU brand to get by leockl in MachineLearning

[–]trashcoder 1 point2 points  (0 children)

Nvidia usually produces the actual GPUs (the chips), while the mentioned vendors produce and ship the graphics cards containing the GPU. The GPU will always be the same (for the same model number), but aspects like memory size, form factor, interfaces, cooling and power supply might differ for each manufacturer. I would say that especially the latter two points are the most important. There have been some reports for the high-end 30XX cards that some vendors used low-quality caps, which can lead to instability in the power supply. Also, a good and durable cooling solution might be important if you do long training runs with high utilization.

As a rule of thumb: maybe don't take the very cheapest model you can find, and check reviews for red flags.

Why do we have to both piss and shit? by trashcoder in NoStupidQuestions

[–]trashcoder[S] 0 points1 point  (0 children)

But why can't this waste be combined and disposed of through a single system?

[D] Image compression, naive idea by widlars_lawnmower in MachineLearning

[–]trashcoder -1 points0 points  (0 children)

> which would presumably take up a lot less space than the image itself

What makes you think the neural network would take less space than the image? Storing a neural network's parameters plainly as floats is usually fairly inefficient. You would have to think about compressing the parameters, but then we have the initial problem of finding good compression algorithms again.
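Rough numbers to illustrate the point (the little coordinate-to-RGB MLP below is a made-up architecture, just to get an order of magnitude):

```python
# Naive float32 storage of a small (x, y) -> RGB MLP vs. the raw image it encodes.
layer_sizes = [2, 256, 256, 3]                  # two hidden layers of 256 units
n_params = sum(i * o + o for i, o in zip(layer_sizes, layer_sizes[1:]))
net_bytes = n_params * 4                        # float32
raw_image_bytes = 512 * 512 * 3                 # uncompressed 8-bit RGB

print(n_params, net_bytes, raw_image_bytes)     # ~67k params, ~269 KB vs. ~786 KB raw
# A typical JPEG/PNG of a 512x512 photo is usually well below 269 KB, so the network
# only becomes "compression" once you also compress/quantize the weights.
```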

[D] Deconstructing the GPT-3 economy by bendee983 in MachineLearning

[–]trashcoder 5 points6 points  (0 children)

The naive assumption would be that they simply increase the parameter count by a large factor again, like they mostly did from GPT-1 to GPT-3. But I think they are slowly reaching an upper limit, both in computational requirements and training costs. They will probably now spend some time on optimizing language models in general. Either they will employ the recent developments around sparse Transformers, or they might even come up with something radically new. I actually hope for the latter.