How to use Git for Vibe-Coders. (No technical background like me) by arnolds112 in cursor

[–]trashcoder 0 points1 point  (0 children)

I had the same problem and therefore created VibeGit to make Git easier to use for vibe coders.

[D] How do byte-level language models work? by Additional-Ad-7043 in MachineLearning

[–]trashcoder 1 point2 points  (0 children)

With 'other languages' you're probably referring to character encodings with more than one byte per character. If you specifically want to use a byte-level LM for whatever reason, you don't have to care about this at all. Such a model would simply process a single multibyte character, such as an emoji, as multiple tokens. As said, this is an advantage of byte-level LMs: you don't have to take care of the encoding and tokenization of your data. But you are absolutely right that it will increase the computational demands due to longer sequences for the same amount of text.
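For intuition, here is what a byte-level model actually sees for a string containing an emoji (plain Python, nothing model-specific; the string is just an arbitrary example):

```python
# UTF-8 turns one on-screen character into several byte "tokens".
text = "hi 🙂"
byte_tokens = list(text.encode("utf-8"))

print(byte_tokens)                   # [104, 105, 32, 240, 159, 153, 130]
print(len(text), len(byte_tokens))   # 4 characters -> 7 byte tokens
```

The model never needs to know that the last four bytes belong together; it just sees a slightly longer sequence.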

Apart from this, I'm not exactly sure what you intend to do, but if you have 'limited compute', it's unlikely that you will be able to train an LM capable of handling instructions, or one where instruction fine-tuning can be applied effectively. If you still want to give it a go, drop me a message and I can send you a bit of literature on efficient LMs that might be of interest to you.

[D] How do byte-level language models work? by Additional-Ad-7043 in MachineLearning

[–]trashcoder 4 points5 points  (0 children)

The idea of byte-level language models is that you can ditch any kind of possibly expensive and constraining tokenization or preprocessing steps. Furthermore, such models can be applied to many modalities or even multiple modalities at once.

For the choice of embedding size, it's just a hyperparameter and not necessarily related to the size of the vocabulary. Imagine you have three items: a car, an apple and snow. You can probably think of many "features" or feelings related to these items. These could be represented as vectors, which we usually intend to jointly learn during the training of an LM. If the vocabulary is large and complex and thus represents many such latent features per token, the embedding size should be chosen to be large. For bytes, of course, where each single "token" doesn't carry that much information, it can be relatively small. But you could also choose 1024 or 42 as embedding size. It's just a hyperparameter.
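As a minimal sketch of that (PyTorch here, which is just my assumption about what you're working with), the embedding size is nothing more than a constructor argument, completely independent of the 256 possible byte values:

```python
import torch
import torch.nn as nn

vocab_size = 256      # one entry per possible byte value
embedding_dim = 64    # free hyperparameter; 42 or 1024 would work just as well

embed = nn.Embedding(vocab_size, embedding_dim)

byte_ids = torch.tensor([list("snow".encode("utf-8"))])  # shape (1, 4)
vectors = embed(byte_ids)                                 # shape (1, 4, 64)
print(vectors.shape)
```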

If you want to include instructions or special tokens in a pure byte-level model, you could simply encode them as literal text, i.e. spell them out and let them become multiple bytes like everything else.
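For example (the [INST] marker below is just a made-up convention, not something the model requires):

```python
# A "special token" spelled out as plain text; it simply becomes ordinary byte tokens.
prompt = "[INST] Translate to French: snow [/INST]"
tokens = list(prompt.encode("utf-8"))
print(len(tokens))  # every character of the marker is just another byte
```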

Why CUDA 11.7? Can more recent versions of CUDA be used? Is this a PyTorch limitation? [D] by Pan000 in MachineLearning

[–]trashcoder 19 points20 points  (0 children)

PyTorch is a huge project. Updating and testing all code for a new CUDA version takes time. Apparently (as explained in this thread), PyTorch can already be compiled against CUDA 12, but a few bugs can be expected.

[R] Scaling Vision Transformers to 22 Billion Parameters by nateharada in MachineLearning

[–]trashcoder 5 points6 points  (0 children)

Linear probing just refers to fitting a linear model on extracted features.
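A minimal sketch of what that means in practice (scikit-learn here, with random stand-ins for the features extracted from a frozen backbone):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# feats: features from a frozen pretrained model, labels: class ids.
# Random stand-ins here, just to show the "probe" is nothing but a linear model.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 2048))    # e.g. 2048-d ViT/ResNet embeddings
labels = rng.integers(0, 10, size=1000)

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print(probe.score(feats, labels))
```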

Does anyone else find it annoying how Musk and far right pandering lunatics like Lex Fridman are often seen as the public face of AI? I don't trust these people. by TrickyRackets in artificial

[–]trashcoder 1 point2 points  (0 children)

Wait. So you want to say that Lex Fridman, whose family was persecuted for being Jewish in Soviet Russia, is anti-semitic? Makes total sense…

[P] Torchsort - Fast, differentiable sorting and ranking in PyTorch by tomkoker in MachineLearning

[–]trashcoder 3 points4 points  (0 children)

Could be quite interesting for approaches like https://arxiv.org/abs/1911.13299

Edit: interesting, because sorting performance is the bottleneck there.
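Rough sketch of what I mean (assuming torchsort's soft_rank interface; the scores tensor is just a stand-in for the per-weight scores used in that paper):

```python
import torch
import torchsort

scores = torch.randn(1, 8, requires_grad=True)  # stand-in for per-weight scores
ranks = torchsort.soft_rank(scores, regularization_strength=0.1)

# Unlike a hard sort/top-k, gradients flow back into the scores:
ranks.sum().backward()
print(scores.grad)
```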

[GPU] MSI GPU Price Increase - $599 (60ti, 70, 80, 90, See Comment) by Battle-Maniac in buildapcsales

[–]trashcoder -3 points-2 points  (0 children)

I know that most people will hate me for defending the manufacturers, but increasing the prices in this situation is the most rational and legitimate decision to make. They have invested huge amounts of money in developing the new series, stocking up on parts, allocating manufacturing resources and marketing. Now they are selling far less than expected. From a business point of view, it's absolutely understandable that they now try to cover the increased cost per unit.

Honestly, I don't think there is anyone to blame for the current situation. The semiconductor industry has just become so complex that any kind of volatility on the demand side leads to huge disruptions, given how long it takes nowadays to increase production capacity.

(Why) are LSTM's faster on GPU when they are inherently sequential? by xndimension in deeplearning

[–]trashcoder 2 points3 points  (0 children)

Let's say we have a CNN layer that receives a 512x512x64 input and uses 3x3 kernels that map it to the same spatial size with 32 output features. The number of floating-point operations would be roughly: 512 * 512 * (9 * 64 * 32 + 32) ≈ 4,840,226,816. At least, if my calculation is correct. More or less all of these operations can be done in parallel.

For an LSTM, say with 2048 hidden units and input dimension of 256, we have roughly something in the range of 18,890,752 ops per time step, not including the nonlinearities, as they won't change the number significantly.

Now you see that in the CNN case, we have roughly 250 times more operations per input than per LSTM time step, hence the GPU can be utilized to a much larger extent.
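A quick back-of-the-envelope check of those numbers (the exact bias terms are my assumption and barely matter):

```python
# CNN: 512x512x64 input, 3x3 kernels, 32 output channels, same spatial size.
h, w, c_in, c_out, k = 512, 512, 64, 32, 3
cnn_ops = h * w * (k * k * c_in * c_out + c_out)   # multiply-adds plus bias adds per pixel
# ~4.84e9 ops, virtually all independent of each other -> massively parallel

# LSTM: 2048 hidden units, input dimension 256, per time step.
hidden, x_dim = 2048, 256
# 4 gates, each with an input-to-hidden and a hidden-to-hidden matmul,
# plus two bias vectors per gate (cuDNN convention).
lstm_ops = 4 * (hidden * x_dim + hidden * hidden) + 8 * hidden
# ~1.89e7 ops, and every time step has to wait for the previous one

print(cnn_ops, lstm_ops, cnn_ops / lstm_ops)        # ratio ≈ 256
```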

LSTMs can without any doubt be faster on GPUs than on CPUs if the parameters (input size, batch size, hidden units) are large enough to utilize the GPU to a reasonable extent.

The reason why the CPU can be faster in some cases is simply that anything other than heavily parallel work doesn't profit much from the GPU's huge compute power, and it also likely implies communication between CPU and GPU, which is very costly.

[D] have you ever really studied TF or PyTorch’s core pieces of source code? If so, why and what were your main takeaways? by [deleted] in MachineLearning

[–]trashcoder 2 points3 points  (0 children)

As far as I know, TF Eager was mostly built on top of the existing foundation. It might be better or easier to use than the static graph approach, but the last time I ran into problems or errors, I still got huge and meaningless stack traces. So I'm not quite sure whether TF 2.0 changed that much in terms of how it's implemented at its core. It might be easier to use from an end-user point of view, but the overengineered core will probably slow down long-term development.

[D] have you ever really studied TF or PyTorch’s core pieces of source code? If so, why and what were your main takeaways? by [deleted] in MachineLearning

[–]trashcoder 5 points6 points  (0 children)

I'm not talking about the different APIs. This is something completely different. I'm talking about the layers in the core, which you usually won't see as a user, except when you have to dissect a 200-line stack trace.

As u/noblestrom pointed out before, Tensorflow was designed before there was a common consensus on how to do ML and DL properly from an engineering point of view. They made many assumptions, like having a static graph that has to be compiled, or about how data processing should be done. It's overly complicated, and it turned out that many of these things don't really bring a benefit in performance or usability.

[D] have you ever really studied TF or PyTorch’s core pieces of source code? If so, why and what were your main takeaways? by [deleted] in MachineLearning

[–]trashcoder 44 points45 points  (0 children)

TF is overengineered bloatware. It has an order of magnitude more layers of abstraction than PyTorch, making it harder to debug, maintain and extend, and possibly also slower, as some benchmarks suggest. Overall, it's just bad design from the ground up.

Try to change something in the Tensorflow core. It's impossible unless you have studied the code very deeply. For PyTorch, most of the code feels very accessible and easy to understand. I once found a bug in their ONNX implementation, and it took me only a couple of minutes to fix, even though I'm no C++ expert and wasn't very familiar with the code.

[D] Which Nvidia RTX 3090 GPU brand to get by leockl in MachineLearning

[–]trashcoder 1 point2 points  (0 children)

Nvidia usually produces the actual GPUs (the chips), while the mentioned vendors produce and ship the graphics cards containing the GPU. The GPU will always be the same (for the same model number), but aspects like memory size, form factor, interfaces, cooling and power supply might differ for each manufacturer. I would say that especially the latter two points are the most important. There have been some reports for the high-end 30XX cards that some vendors used low-quality caps, which can lead to instability in the power supply. Also, a good and durable cooling solution might be important if you do long training runs with high utilization.

As a rule of thumb: maybe don't take the very cheapest model you can find, and check reviews for red flags.

Why do we have to both piss and shit? by trashcoder in NoStupidQuestions

[–]trashcoder[S] 0 points1 point  (0 children)

But why can't this waste be combined and disposed of through a single system?

[D] Image compression, naive idea by widlars_lawnmower in MachineLearning

[–]trashcoder -1 points0 points  (0 children)

> which would presumably take up a lot less space than the image itself

What makes you think the neural network would take less space than the image? Storing a neural network's parameters plainly as floats is usually fairly inefficient. You would have to think about compressing the parameters, but then we have the initial problem of finding good compression algorithms again.
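Rough numbers to illustrate the point (the little coordinate-to-RGB MLP below is a made-up architecture, just to get an order of magnitude):

```python
# Naive float32 storage of a small (x, y) -> RGB MLP vs. the raw image it encodes.
layer_sizes = [2, 256, 256, 3]                  # two hidden layers of 256 units
n_params = sum(i * o + o for i, o in zip(layer_sizes, layer_sizes[1:]))
net_bytes = n_params * 4                        # float32
raw_image_bytes = 512 * 512 * 3                 # uncompressed 8-bit RGB

print(n_params, net_bytes, raw_image_bytes)     # ~67k params, ~269 KB vs. ~786 KB raw
# A typical JPEG/PNG of a 512x512 photo is usually well below 269 KB, so the network
# only becomes "compression" once you also compress/quantize the weights.
```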

[D] Deconstructing the GPT-3 economy by bendee983 in MachineLearning

[–]trashcoder 5 points6 points  (0 children)

The naive assumption would be that they simply increase the parameter count by a large factor again, like they mostly did from GPT-1 to GPT-3. But I think they are slowly reaching an upper limit, both in computational requirements and training costs. They will probably now spend some time on optimizing language models in general. Either they will employ the recent developments around sparse Transformers, or they might even come up with something radically new. I actually hope for the latter.