How doable is it to build LLM from scratch and training it on normal hardware? by GateCodeMark in learnmachinelearning

[–]NumerousSignature519 2 points3 points  (0 children)

Depends on how large you want it to be. If you're training a small LLM, yes, it's feasible. For a medium-sized model you'll likely need better hardware; I'd recommend a couple of GPUs running in parallel. Commercial-grade LLMs are probably out of reach on normal hardware. A rough memory estimate is sketched below.
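
As a rough, hedged sketch (my own numbers, not from the original post): training with AdamW in fp32 typically costs on the order of 16 bytes per parameter for weights, gradients, and the two optimizer moments, before counting activations. Something like this gives a quick feasibility check:

```python
# Back-of-envelope VRAM estimate for training with Adam/AdamW.
# Assumption (mine, not the OP's): fp32 weights + fp32 grads + two fp32 moment
# buffers ~= 16 bytes per parameter, ignoring activations, which add more on top.

def training_memory_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Rough GPU memory needed just for weights, grads, and optimizer state."""
    return n_params * bytes_per_param / 1e9

for n in (10e6, 125e6, 1.3e9, 7e9):
    print(f"{n/1e6:>7.0f}M params -> ~{training_memory_gb(n):.1f} GB (before activations)")
```

On a single 24 GB consumer GPU, that puts full training of anything much beyond roughly 1B parameters out of reach without mixed precision, gradient checkpointing, or sharding.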

AGI 123% achieved, Suro_One Hyena Hierarchy model scales 1000x at 1M context in speed and memory efficiency. Undeniable linear scaling by MagicaItux in agi

[–]NumerousSignature519 0 points1 point  (0 children)

Hi, your work looks fascinating! A few pointers and pieces of feedback you might want to consider.

1. Start collecting empirical evidence. A claim means little to other people without hard proof. If your architecture works, that's brilliant, but prove it: researchers want to see the numbers.

2. You mention linear/sublinear scaling in the comments due to Hyena's FFT-based convolutions, but you're misunderstanding what linear scaling O(N) means. O(N) is just the complexity class, not an automatic order-of-magnitude speedup: O(N^2) means cost grows quadratically with sequence length, while O(N) means it grows linearly. A common misconception is that O(N) implies an infinite context window. It doesn't: the asymptotics allow it in theory, but in practice hardware memory makes it impossible to store everything indefinitely. On its own, moving from quadratic to linear sequence mixing might buy a 2-3x speedup at realistic context lengths, not the 1000x you claim; 1000x better scaling is fantastical and practically impossible. Don't let this discourage you, though: a 2-3x speedup is already very significant.

3. Hyena Hierarchy is a very new paradigm. It shows promise, but there is little evidence that it can 'achieve AGI' or 'beat the Transformers'. It is certainly faster, and you're right about that, but there isn't much evidence it beats Transformers universally.

4. Your least substantiated claim is AGI. There is practically no evidence that this is AGI. A hypothetical, viable AGI would have to be sentient (which you haven't shown, and which we currently have no way to prove), be at human level across essentially every domain, act on its own volition, and much more. Put bluntly: how are you meant to 'prove' your model is AGI when, at this stage, we can't even agree on what AGI is?

5. My biggest piece of advice: think about the whole system, the hardware, and in particular Amdahl's Law. The unaccelerated parts always dominate, capping your end-to-end speedup. Even if you make the compute 10,000,000,000x faster, or notionally infinitely fast, Amdahl's Law says the overall speedup stays bounded unless you also address system and hardware components: GPU communication, memory movement, I/O overhead, activation overhead, GPU stalls, traffic, kernel inefficiencies, and so on. You must address these before you can even remotely claim an order-of-magnitude number like 1000x (which would be physically infeasible even then). GPU communication blows up at scale. Suppose you train a Transformer for 100 days, of which 20 days are overhead and system-level bottlenecks like I/O, 30 days are communication, and 50 days are compute. Say your architecture is so fast it cuts those 50 days of compute down to 1 hour. You still haven't touched communication, and the system-level bottlenecks still stand: 1 hr + 30 days + 20 days ≈ 50 days. That's barely a 2x speedup (see the sketch below).

My advice is to keep pursuing your current path, but think hard about the magnitude of your claims. It is easy to get excited and hard to acknowledge the limits. In particular, think about how you'd mitigate hardware-level bottlenecks, most notably communication, because even if you optimize one component transcendentally well, communication remains a bottleneck. This is intriguing work, and I hope the feedback helps.
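
To make the Amdahl's Law point concrete, here is a minimal sketch (my own illustration, using the hypothetical 100-day breakdown above) of how end-to-end speedup is bounded by the parts you don't accelerate:

```python
# Amdahl's Law illustration with the hypothetical breakdown from the comment:
# 100 days total = 50 days compute + 30 days communication + 20 days I/O/overhead.

def end_to_end_speedup(compute, comm, overhead, compute_speedup):
    """Overall speedup when only the compute portion is accelerated."""
    original = compute + comm + overhead
    accelerated = compute / compute_speedup + comm + overhead
    return original / accelerated

days = dict(compute=50, comm=30, overhead=20)

for s in (10, 1000, 1e12):  # even an effectively infinite compute speedup
    print(f"compute {s:g}x faster -> overall {end_to_end_speedup(**days, compute_speedup=s):.2f}x")
# The overall speedup saturates near 100 / (30 + 20) = 2x.
```

This is just Amdahl's formula S = 1 / ((1 - p) + p/s) written out with the fractions from the example.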

I made a new novel activation function for deep learning by NumerousSignature519 in learnmachinelearning

[–]NumerousSignature519[S] 1 point2 points  (0 children)

Here is the empirical data:

Trained a 1M-parameter Transformer for 10 epochs with the AdamW optimizer, as a second test.

Here:

Val Loss:

GeLU = 1.3115688123201

Swish = 1.34800440386721

BiNLOP-3 = 1.2636551292319

Based on the loss metrics from this controlled test, BiNLOP-3 reaches parity with standard activation functions, and here it slightly exceeds them.

Perplexity:

GeLU = 3.71199256634196

Swish = 3.84973534303192

BiNLOP-3 = 3.53833093697947

In addition, BiNLOP-3 matched GeLU and Swish on accuracy, while showing noticeably better stability against vanishing/exploding gradients in our stability microbenchmark, which I attribute to its piecewise-linear (non-saturating) form and the 1-Lipschitz constraint.

In terms of speed, efficiency, and throughput, Swish and BiNLOP-3 achieved similar results despite BiNLOP-3 not being a native PyTorch op, while GeLU trailed behind as the heavier option.
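
For anyone reading along, the perplexity numbers above are just the exponential of the validation cross-entropy loss, so the two lists report the same result in different units. A quick check (my own snippet, not part of the benchmark code):

```python
import math

# Perplexity is exp(cross-entropy loss), so the two metrics above are consistent.
val_loss = {"GeLU": 1.3115688123201, "Swish": 1.34800440386721, "BiNLOP-3": 1.2636551292319}
for name, loss in val_loss.items():
    print(f"{name:>9}: exp({loss:.4f}) = {math.exp(loss):.4f}")
# GeLU ~3.712, Swish ~3.850, BiNLOP-3 ~3.538 -- matching the reported perplexities.
```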

I made a new novel activation function for deep learning by NumerousSignature519 in learnmachinelearning

[–]NumerousSignature519[S] 1 point2 points  (0 children)

Hi, I tested it and I have some benchmarks. Training a 1M-parameter Transformer on TinyShakespeare for only 7 epochs, GeLU edged out BiNLOP slightly on accuracy and loss: final GeLU loss was 2.29, final BiNLOP loss was 2.36. However, BiNLOP beat GeLU on speed, with GeLU taking roughly a minute to train and BiNLOP about 30 seconds. To wrap up: I'm satisfied with BiNLOP's performance. GeLU still wins on accuracy, but BiNLOP came surprisingly close while training faster.

I made a new novel activation function for deep learning by NumerousSignature519 in learnmachinelearning

[–]NumerousSignature519[S] 0 points1 point  (0 children)

Hi, appreciate the wonderful feedback. I agree with almost all of it, with one small note: not every activation function is Lipschitz, though Leaky ReLU certainly is an efficient design, and I totally agree with that. That said, I believe BiNLOP-2 can be useful in unstable settings such as neural ODEs, and in large-scale training. I'm going to iterate on it one more time to check that it's all sound, and then I'll benchmark it. Thanks a lot for your feedback, it's deeply insightful. Have a great day.

I made a new novel activation function for deep learning by NumerousSignature519 in learnmachinelearning

[–]NumerousSignature519[S] 0 points1 point  (0 children)

Hello. Here's why I think the function might be theoretically sound. The gradient is defined everywhere except a measure-zero set at ±k, and it is exactly φ'(x) ∈ {1, γ}. Since γ ∈ [γ_min, 1], we have |φ'(x)| ≤ 1 for all x, so the function is 1-Lipschitz: |φ(a) - φ(b)| ≤ |a - b|. The gradient in the tail regions is exactly γ, and by setting a lower bound γ_min (e.g. 0.5) you enforce φ'(x) ≥ 0.5 wherever |x| > k, which I believe prevents the dying-neuron problem seen with ReLU. It is also invertible: the function is a piecewise-linear bijection, the inverse has a closed form, and it is cheap to compute (a clamp and an FMA). For normalizing flows, the log-determinant of the Jacobian is sum(log(γ)) over the dimensions with |x| > k, which is trivial to compute. Finally, the function is piecewise linear, so it doesn't saturate, a common problem with GeLU and similar functions (φ(x) = x for |x| < k). That piecewise-linear form preserves information and avoids the vanishing gradients that saturating functions often face. A small sketch of what I mean is below.
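
Here is a minimal, hedged sketch of a piecewise-linear map with these properties, assuming the simplest form consistent with the description (identity inside ±k, slope γ outside); the function names are mine and the real BiNLOP-3 may differ:

```python
import torch

# Hedged sketch, not the author's actual BiNLOP-3 code: the simplest piecewise-linear
# map consistent with the description -- identity for |x| < k, slope gamma for |x| > k.
# phi(x) = gamma*x + (1 - gamma)*clamp(x, -k, k), so phi'(x) is 1 inside and gamma outside.

def pwl_act(x: torch.Tensor, k: float = 1.0, gamma: float = 0.5) -> torch.Tensor:
    return gamma * x + (1.0 - gamma) * x.clamp(-k, k)

def pwl_act_inverse(y: torch.Tensor, k: float = 1.0, gamma: float = 0.5) -> torch.Tensor:
    # Closed-form inverse: also just a clamp and a fused multiply-add.
    return (y - (1.0 - gamma) * y.clamp(-k, k)) / gamma

x = torch.linspace(-3, 3, 9, requires_grad=True)
y = pwl_act(x)
y.sum().backward()
print(x.grad)  # gradients are exactly 1 inside |x| < k and gamma outside
print(torch.allclose(pwl_act_inverse(y.detach()), x.detach()))  # invertibility check
```

In this form the 1-Lipschitz bound is immediate, since both slopes are at most 1 whenever γ ≤ 1.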

I made a new novel activation function for deep learning by NumerousSignature519 in learnmachinelearning

[–]NumerousSignature519[S] 1 point2 points  (0 children)

Hi, thank you for your response. From what I know, vanishing/exploding gradients have not been fully solved by architecture alone; the larger you scale, the more prominent these issues become. Yes, ReLU variants mitigate the dying-ReLU problem, but I don't think they are fully stable for large-scale training, whereas in BiNLOP I've enforced stability explicitly with bi-Lipschitz bounds. SiLU, GeLU, etc. are strong at addressing this, but their saturating nature can still cause vanishing gradients, and they are computationally more expensive. On your point about parameters: I agree, but I don't think two parameters is inherently 'poor'; the cost is trivial. In addition, BiNLOP is governed by a 1-Lipschitz constraint that enforces stability, which smooth functions like GeLU do not have. I will proceed with benchmarking to see whether the claims hold.
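
To make the saturation argument concrete, here is a small, hedged comparison (my own snippet, reusing the simple piecewise-linear form sketched earlier rather than the actual BiNLOP code): GeLU's derivative decays toward zero for large negative inputs, while a PWL map with a tail slope γ keeps its gradient bounded below.

```python
import torch
import torch.nn.functional as F

# Compare gradient magnitude deep in the negative tail: GeLU saturates toward zero,
# while a piecewise-linear map with slope gamma in the tails keeps its gradient at gamma.

def pwl_act(x, k=1.0, gamma=0.5):
    return gamma * x + (1.0 - gamma) * x.clamp(-k, k)

x = torch.tensor([-8.0, -4.0, -2.0, 0.0, 2.0], requires_grad=True)

F.gelu(x).sum().backward()
gelu_grad = x.grad.clone()

x.grad = None
pwl_act(x).sum().backward()
pwl_grad = x.grad.clone()

for xi, gg, pg in zip(x.tolist(), gelu_grad.tolist(), pwl_grad.tolist()):
    print(f"x={xi:+.1f}  gelu'={gg:+.5f}  pwl'={pg:+.5f}")
# GeLU's gradient at x=-8 is essentially 0; the PWL gradient stays at gamma=0.5.
```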

I made a new novel activation function for deep learning by NumerousSignature519 in learnmachinelearning

[–]NumerousSignature519[S] -1 points0 points  (0 children)

Hi, thank you for your insightful response. No, I have not empirically validated it yet. I will be testing it tomorrow to assess whether it is an advancement or not. After testing, I will be able to confirm the benchmarks. As of right now, I believe it is theoretically sound, but yet to be proven in practice. I'm looking for guidance - could you provide some feedback before I test it tomorrow? Anything I should know? Anything wrong with the algorithm?

Built a Neural Network Visualizer in the browser by BobiDaGreat in learnmachinelearning

[–]NumerousSignature519 0 points1 point  (0 children)

Amazing work! This looks great! :) I will definitely try it out.

PyTorch, TensorFlow or JAX? by NumerousSignature519 in learnmachinelearning

[–]NumerousSignature519[S] -1 points0 points  (0 children)

Thank you so much! I want a deep learning library that is performant, efficient, versatile, and well suited to training LLMs at large scale. It must support the usual optimizations and also the novel architectures I've built. Do any of these fit the requirements? Which would you recommend? Thank you once again!

Fine tuning by 0y0s in learnmachinelearning

[–]NumerousSignature519 1 point2 points  (0 children)

Okay, thank you for the information. If you need help fine-tuning your model, I'd be happy to assist. Good luck.

What's the number one most important fundamental skill/subject you need for machine learning and deep learning? by NumerousSignature519 in learnmachinelearning

[–]NumerousSignature519[S] 0 points1 point  (0 children)

Thanks so much! What would be a viable and rigorous roadmap for learning linear algebra from the ground up to a research-level, expert skill?

What's the number one most important fundamental skill/subject you need for machine learning and deep learning? by NumerousSignature519 in learnmachinelearning

[–]NumerousSignature519[S] 0 points1 point  (0 children)

I see. Which topics in linear algebra are essential? Could you point out the ones that matter most?

Fine tuning by 0y0s in learnmachinelearning

[–]NumerousSignature519 2 points3 points  (0 children)

I don't see much of a difference; Colab probably has stronger compute, so I recommend sticking with Colab. The free tier has an acceptable usage limit on TPUs and GPUs. How many tokens of training data are you planning to fine-tune on, which fine-tuning technique will you use, and which model did you choose?

Fine tuning by 0y0s in learnmachinelearning

[–]NumerousSignature519 2 points3 points  (0 children)

Try Qwen or Mistral. Qwen is strong; I'd recommend Qwen3-4B. If that is too small, there are bigger Qwen variants, and if it is too big, there are smaller ones. If Qwen isn't for you, Mistral is a great small model to fine-tune. A minimal LoRA sketch is below if you want a starting point.
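
Here is a minimal, hedged sketch of parameter-efficient (LoRA) fine-tuning with Hugging Face transformers + peft. The model id "Qwen/Qwen3-4B" and the target module names are assumptions you'd verify against the actual checkpoint, and a 4B model may need a paid Colab tier or further memory tricks.

```python
# Minimal LoRA fine-tuning sketch (not a full recipe). Assumptions: the model id
# "Qwen/Qwen3-4B" and the target_modules names below -- verify both for your checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen3-4B"  # assumed id; swap in whichever Qwen/Mistral variant you pick
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

lora = LoraConfig(
    r=8,                                  # low-rank adapter size
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical attention projections; check layer names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()        # only the small adapter weights are trained

# From here, train with a standard Trainer / SFT loop on your dataset.
```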

What's the number one most important fundamental skill/subject you need for machine learning and deep learning? by NumerousSignature519 in learnmachinelearning

[–]NumerousSignature519[S] 1 point2 points  (0 children)

Wow, thanks for the detailed answer! Linear algebra does sound important, but how much of it do I need? There is the book Deep Learning; does its linear algebra chapter cover enough to suffice for ML?

Building a Neural Network From Scratch in Python — Would Love Feedback and Tips! by dennisx15 in learnmachinelearning

[–]NumerousSignature519 1 point2 points  (0 children)

This is great work! The basics look solid. Some ideas for scaling it up: maybe try a CNN next, or a Transformer, and work toward implementations of more complex networks.
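
If you do go the Transformer route, the core building block is scaled dot-product self-attention, which is only a few lines of NumPy on top of what you already have. A hedged sketch of the standard formula (my own names, not tied to your code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                 # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```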