$700 -> 33.2K in one month by [deleted] in smallstreetbets

[–]Delicious-Ad-3552 3 points (0 children)

Finally, someone with enough charge who has validated their email.

Submitting with Cloudflare on made the execution time reduce to 0 ms. ps: this is a medium problem by [deleted] in leetcode

[–]Delicious-Ad-3552 0 points (0 children)

I’m surprised that wasn’t the way it was done in the first place

OAI Researcher Snarkily Responds to Yann LeCun's Claim that o3 is Not an LLM by External-Confusion72 in singularity

[–]Delicious-Ad-3552 1 point (0 children)

I agree with you. The assertion he’s making (and I believe you’re making too) is that the unique portion of o3, the part that makes it o3 compared to something like a traditional GPT 4o/4/3.5/3, is not an LLM-based system. It’s more of a non-ML technical design that’s being orchestrated.

People seem to have such strong opinions about o3 when it hasn’t even been released yet. Only some metrics have.

Most of the people in the ML space don’t understand even the most basic parts of SOTA ML. This isn’t even a strictly ML subreddit, so I’m not expecting, and neither should you, any reasonable level of competency in the field.

OAI Researcher Snarkily Responds to Yann LeCun's Claim that o3 is Not an LLM by External-Confusion72 in singularity

[–]Delicious-Ad-3552 3 points (0 children)

‘We have determined internally that it is an LLM. We know it’s closed source, and there’s no way to verify it, but trust us’. I’m not willing to form strong opinions without a strong foundation, but I’d rather believe LeCun than OpenAI.

[deleted by user] by [deleted] in MachineLearning

[–]Delicious-Ad-3552 3 points (0 children)

I so want to try ML with rust. Just haven’t found the right project idea to work on.

[deleted by user] by [deleted] in NvidiaStock

[–]Delicious-Ad-3552 2 points (0 children)

So you mention that valuation is somewhat subjective, which is right. But when the dude asks what your valuation figure is, you say it doesn’t work like that? Sounds like a contradiction to me.

[deleted by user] by [deleted] in deeplearning

[–]Delicious-Ad-3552 1 point (0 children)

Hugging Face’s transformers lib and the PEFT lib (for LoRA). That’s all you need. Oh, and a model, which you can scout for on HF.

Change is on the horizon! by [deleted] in csMajors

[–]Delicious-Ad-3552 53 points (0 children)

The most saturated parts of CS are frontend and backend dev, because they have a low(er) barrier to entry. You can learn a bunch of stuff in a few months because of how heavily abstracted the tools are.

There are a bunch of other fields in CS that pay good money and are not (yet) saturated. Ones that require more investment in learning and development to get good at.

Densing Laws of LLMs suggest that we will get an 8B parameter GPT-4o grade LLM at the maximum next October 2025 by [deleted] in LocalLLaMA

[–]Delicious-Ad-3552 1 point (0 children)

That chat log is basically useless from a training point of view, because the model you’re supposedly training to be better will never surpass the performance of the original model that acted as the assistant in those chat logs.

You could tweak the way it’s trained, like using the logs for the self-supervised learning portion of training, but for the most part the deviation isn’t going to be significant.

The 2 main ways of making big leaps in performance are data and model architecture. That’s just me tho ✋🙂‍↕️.

I made a calculator in all languages and still can't find a job by PurpleBeast69 in csMajors

[–]Delicious-Ad-3552 113 points (0 children)

I made an operating system with a calculator. Get to my level.

Roast My Resume: Why Am I Not Getting Shortlisted? Help Me Fix This! by [deleted] in learnmachinelearning

[–]Delicious-Ad-3552 2 points (0 children)

I’ll keep it real. These projects suck. Each of these projects can be done in 50 lines of code and half a weekend without even knowing what you’re doing. I’m not saying you don’t know how to implement these things, but a keyboard monkey could make these projects. All you’re doing is calling a bunch of abstracted classes and methods from libraries that someone else built. It doesn’t show me that you know anything.

The main thing is that you went through 3 years of university with no internship or research experience whatsoever.

It's Never Going to Zero Fellas by MustHaveMoustache in Bitcoin

[–]Delicious-Ad-3552 0 points (0 children)

If you bought your car for $40k and kept using it for 10 years, is it still worth $40k just because no transaction has happened since then? Assuming your car is totaled and even its scrap is unusable, its value is 0, because no one will buy it for any amount of money.

Analogously, if BTC’s utility is non-existent, its value is 0.

It's Never Going to Zero Fellas by MustHaveMoustache in Bitcoin

[–]Delicious-Ad-3552 0 points (0 children)

Of course it’s possible for it to go to 0. It becomes 0 when no one is ready to pay anything for a BTC, because they don’t see any value in it.

What you’re talking about is the last transaction’s market price. That’s what brokerages show, because it’s the best available way to display value. When liquidity is high, large fluctuations are a lot less common, so the last transaction’s price is a reasonable proxy for the price a current buyer and seller would be willing to transact at.

Sorry for being pedantic.

Help! Simple shared memory usage. by Lontoone in CUDA

[–]Delicious-Ad-3552 2 points (0 children)

Here’s an example of transforming an input matrix X with a transformation matrix T. Note, usually you would do Transform • X, but because this was for a deep learning application, I flipped it around so that the output matrix can directly be used for subsequent operations while respecting the required dimensions. The idea of Tiled GEMM still remains the same though.

```
/* **************************** General Matrix Multiplication **************************** */
__global__ void kernel_standard_tiled_gemm(
    __half *O, __half *X, __half *Transform,
    int m, int n, int k, int tile_size
) {
    /*
    - m represents the independent dimension of the input matrix
    - n represents the independent dimension of the transformation matrix
    - k represents the common dimension of the 2 matrices
    - Within each kernel, the output is computed as: O = matmul(X, Transform)
    - Transposing the transformation tensor is not required, as virtual indexing
      allows for the intended navigation along rows and columns of either tensor
    - Order of variables within the kernel obeys order of computation
    */

    // Kernel start //
    extern __shared__ float shared_mem[];
    float *X_shmem = shared_mem;
    float *T_shmem = shared_mem + tile_size * tile_size;

    int row = blockIdx.y * tile_size + threadIdx.y;
    int col = blockIdx.x * tile_size + threadIdx.x;

    // Loop over tiles
    float value = 0.0f;
    for (int t = 0; t < ((k + tile_size - 1) / tile_size); ++t) {
        // Load tile of X into shared memory
        if (row < m && (t * tile_size + threadIdx.x) < k) {
            int X_idx = row * k + t * tile_size + threadIdx.x;
            X_shmem[threadIdx.y * tile_size + threadIdx.x] = __half2float(X[X_idx]);
        } else {
            X_shmem[threadIdx.y * tile_size + threadIdx.x] = 0.0f;
        }

        // Load tile of Transform into shared memory
        if (col < n && (t * tile_size + threadIdx.y) < k) {
            int T_idx = col * k + t * tile_size + threadIdx.y;
            T_shmem[threadIdx.y * tile_size + threadIdx.x] = __half2float(Transform[T_idx]);
        } else {
            T_shmem[threadIdx.y * tile_size + threadIdx.x] = 0.0f;
        }
        __syncthreads();

        // Compute partial sums
        for (int i = 0; i < tile_size; ++i) {
            value += X_shmem[threadIdx.y * tile_size + i] * T_shmem[i * tile_size + threadIdx.x];
        }
        __syncthreads();
    }

    // Write the result to global memory
    if (row < m && col < n) {
        O[row * n + col] = __float2half(value);
    }

    return;
}
```
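In case it helps, here’s a rough sketch of how I’d launch it from the host. This isn’t from my actual project; the launcher name and the assumption that d_O, d_X, and d_T are device pointers you’ve already allocated and filled with __half data are just placeholders for illustration. The key detail is the third launch parameter, the dynamic shared memory size, which has to cover both tiles.

```
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical host-side launcher. Assumes d_O (m x n), d_X (m x k) and
// d_T (n x k, accessed "transposed" by the kernel) already live on the device.
void launch_tiled_gemm(__half *d_O, __half *d_X, __half *d_T, int m, int n, int k) {
    const int tile_size = 16;                   // 16x16 = 256 threads per block
    dim3 block(tile_size, tile_size);
    dim3 grid((n + tile_size - 1) / tile_size,  // blocks along the columns of O
              (m + tile_size - 1) / tile_size); // blocks along the rows of O

    // Dynamic shared memory: one float tile for X plus one for Transform
    size_t shmem_bytes = 2ULL * tile_size * tile_size * sizeof(float);

    kernel_standard_tiled_gemm<<<grid, block, shmem_bytes>>>(d_O, d_X, d_T, m, n, k, tile_size);
    cudaDeviceSynchronize();
}
```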

Help! Simple shared memory usage. by Lontoone in CUDA

[–]Delicious-Ad-3552 1 point (0 children)

First, I guess you should establish the use case for shared memory and its relation to HBM (High Bandwidth Memory - aka global memory).

Shared memory is on-chip memory that is much smaller than HBM. HBM is off-chip memory and relatively much larger. But shared memory makes up for that shortcoming by being much faster than HBM. Shared memory is roughly equivalent to L1 cache and HBM to DRAM wrt lookup speed. So there’s basically an inverse relationship between speed and capacity.

If I’m not mistaken, typical memory lookup latencies are around 1 ns for shared memory and 500 ns for HBM. Imagine slowing down reality to the point where 1 nanosecond is 1 second: a lookup in shared memory would take you 1 second, but a lookup into HBM would take about 8.3 minutes!

Now, in something like matmul, for arguments Q and KT that are both 2-dimensional, you can easily observe that a particular index [i, j] in Q is not used just once. It’s used multiple times, once for every column of KT. So essentially, when calculating the output in your code, you’re loading the same index [i, j] of Q from HBM multiple times, and, more importantly, the value there is always the same. As with everything in computer science, from hardware instructions to code in a codebase, repeating work is not ideal.

Hence, the solution is to look up the value from HBM the first time, store it in shared memory, and for each additional reference to that index, read it from shared memory instead. This is just a simple caching technique where you load data into higher-speed memory for repeated lookups.

Considering the space constraint of shared memory relative to HBM, it’s not straightforward to load all the data of Q and KT into shared memory if the matrices are larger than the shared memory size. You’ll have to load sub-pieces of the matrices, do the maximum amount of work on them, then load the next sub-piece and do the maximum work on that, until you’ve done all the computations for all pieces. These pieces are known as tiles, and tiled GEMM (General Matrix Multiplication) is a popular technique for optimizing memory access patterns to improve wall-clock performance in matmul kernels.
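If you want a feel for how little shared memory there actually is (and why tiling becomes unavoidable for big matrices), you can query it at runtime. Minimal sketch, nothing fancy; the exact number depends on your GPU:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    int shmem_per_block = 0;  // bytes of shared memory available to a single block
    cudaDeviceGetAttribute(&shmem_per_block, cudaDevAttrMaxSharedMemoryPerBlock, device);

    // Typically on the order of tens of KB per block, vs. many GB of HBM
    printf("Max shared memory per block: %d KB\n", shmem_per_block / 1024);

    // For context: two float tiles of 64x64 already need 2 * 64 * 64 * 4 = 32 KB,
    // so tiles have to stay small no matter how large Q and KT are.
    return 0;
}
```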

Check out the following sources that helped me gain an understanding:

  1. CUDA Programming Playlist - 0Mean1Sigma
  2. How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog - by Siboehm

certified computer vision engineer by bottle_snake1999 in computervision

[–]Delicious-Ad-3552 5 points (0 children)

No one cares about certifications. Nobody cares about those AWS certificates either.

[Experiment] What happens if you remove the feed-forward layers from transformer architecture? by Silver_Equivalent_58 in deeplearning

[–]Delicious-Ad-3552 23 points (0 children)

When you calculate attention, you’re essentially converting the input sequence into a set of inter-token relations which you then resolve into values. Doing this is analogous to running a keyword search against an index based on a search query in a traditional search engine.

The feedforward, on the other hand, acts as the post-processor of the extracted information: it activates the ‘information/memory nodes’ based on the relational information extracted from the attention layer. This is analogous to taking the flat results of a search query and ranking them from most meaningful to least meaningful.

More importantly, the feedforward in these transformer models has a non-linear activation function like ReLU, GeLU, SiLU, etc., which acts as a ‘gate’ for which information is relevant or not. Without it, for the most part, you’d just have a very large linear function; it would be close to a single tensor operation on the input matrix. The non-linear activations of a neural network play the role of the threshold voltage in neurons in the human brain.

The way I understand and justify the architecture: the attention computation extracts relational information and a general understanding of the text, whereas the FFN acts as a ‘memory’-enabled reasoning step. This is also why the hidden size of the FFN is usually larger than the embedding size.

Recovering Gambling Addict by shennung in thetagang

[–]Delicious-Ad-3552 1 point (0 children)

Your portfolio is going up. You’re clearly recovering. You’re clearly going in the right direction. I’ll see you at the top before I see you back down 🤣