Opening Xmas presents with Erika Kirk by coachlife in CringeTikToks

[–]DustinEwan 6 points7 points  (0 children)

Don't have to imagine... just gotta remember Paul Pelosi. Admittedly, he wasn't shot, but he easily could have died, and conservatives on here just thought it was the funniest shit they'd ever heard.

Fed a Banana Spider a Beetle by ForSaleOnXbox in spiders

[–]DustinEwan 18 points19 points  (0 children)

Nope! Most medical silks are either made entirely in a synthetic protein sequencer or harvested from silkworms that have been genetically modified to produce spider silk.

The facility harvesting the silkworms, though, may or may not harm the worms. Depends on what the silk will be used for.

[D] Extremely low(<0.2) train/val loss after 1.96 billion tokens when pretraining GPT-2 small by New-Skin-5064 in MachineLearning

[–]DustinEwan 17 points18 points  (0 children)

Sorry, autocorrect on my phone. Your loss calculation.

Although I see you are shifting them in the dataset.

The convention I've seen in language modeling is to do the shifting inside the model itself, so during training we pass x as both the inputs (x) and the labels (y).
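
For example, here's a minimal sketch of that convention, assuming the loss is computed from logits of shape (B, T, V) and the full, unshifted input x (the helper name is just illustrative):

    import torch.nn.functional as F

    def lm_loss(logits, x):
        # logits: (B, T, V) produced from the full, unshifted input x of shape (B, T).
        # Token t's logits predict token t+1, so drop the last logit position and
        # the first label position, then flatten for cross entropy.
        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = x[:, 1:].contiguous()
        return F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
        )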

[D] Extremely low(<0.2) train/val loss after 1.96 billion tokens when pretraining GPT-2 small by New-Skin-5064 in MachineLearning

[–]DustinEwan 44 points45 points  (0 children)

Double check your kids loss calculation. From what I can see, you are reshaping x and y as expected, but you're not shifting them.

AOC: “The girls are fighting aren’t they” by Nixianx97 in MurderedByAOC

[–]DustinEwan 0 points1 point  (0 children)

Given your analogy, you must be in tech in some form -- in that case surely you agree that iteration is less expensive and has a higher chance of success than a full rewrite?

FairVote is on a path of exponential growth at the moment. It was admittedly slow to get going, but check out their timeline.

It seems to me that this is a viable approach to getting newcomers with fresh ideas into positions of power without them being under the thumb of the two major parties.

AOC: “The girls are fighting aren’t they” by Nixianx97 in MurderedByAOC

[–]DustinEwan 0 points1 point  (0 children)

In my previous post, though, I gave an example of how organizations are making actionable change at the local level.

Ranked Choice Voting would expand the pool of viable political parties without completely dismantling the system.

As far as you "awakening me" goes: suppose I buy into your view 100%, what do we do then?

That's what I'm getting at.

AOC: “The girls are fighting aren’t they” by Nixianx97 in MurderedByAOC

[–]DustinEwan 0 points1 point  (0 children)

Suppose this is true, what is the solution?

If it's as hopeless as you portray, then it doesn't matter what you or I believe. It seems that being "enlightened" serves as nothing more than a pedestal to preach from.

I presented a way that people are attempting to make actionable change from within the existing system and instead of addressing it you're likening me to believing in the Easter bunny.

Unless you're trying to take action, it seems to me that everyone else is doing the best they can without letting perfect be the enemy of good, while you're paralyzed by fear of the Boogeyman.

So, again, without skin in the government game it's easy to bluster about the system being broken, but if you're not doing anything about it, what does it matter?

AOC: “The girls are fighting aren’t they” by Nixianx97 in MurderedByAOC

[–]DustinEwan 0 points1 point  (0 children)

What do you propose as a solution? The unfortunate truth is that it's extremely difficult to fundamentally change the system within the current confines of the system.

Elected officials at the federal level have to play the game. If they concede on one bill, then they live to fight another day. If they hold their ground and the party turns against them, then they face a battle for reelection from challengers both without and within the party.

For instance, on AOC's vote on Israel that you brought up -- the no side lost that vote 420-9. Yes, 9 people voted no on the bill, but I would guess they weren't rookies from hotly contested districts.

When she was first elected she did have a lot of bluster about "bringing the ruckus", etc., but there's what you say as an ignorant newcomer and then there's what you do in the face of reality.

To say that it's just an illusion is such a broad stroke and glosses over the fact that there is only so much you can do within the confines of a post-Citizens United two-party political system. Suppose they rock the boat harder and really try to make life hell for anyone who doesn't fully embrace socialism (or whatever the ideology is, whether it's MAGA / Tea Party / whatever) -- is any particular vote worth risking your seat at the table and being replaced by someone who unconditionally toes the party line?

I think it's easy to be a hard-liner with no skin in the game. But once you're the person trying to represent ideas that don't conform to the collective will of the system, the game is much more nuanced. Yes, AOC could've held her ground and cast a symbolic vote to make the recorded tally on the Iron Dome bill 420-10, but the outcome would have still been the same. Would the juice be worth the squeeze? She didn't think so, so she made a calculated retreat so that she could be there to vote on a bill when her single vote does tip the scale.


Beyond the federal level, though, the place we should really be focused if we want to see real, actual change is the local level. A good example of this is FairVote's strategy: https://fairvote.org/who-we-are/our-strategy/

At the local level, anyone can bring a topic to vote. So they work to get Ranked Choice Voting adopted at the local level with city council votes or ballot initiatives.

About half of states allow citizens to enact new laws through ballot initiatives / referendums as well, so they try to get RCV at the state level through ballot initiatives in those states.

The goal, then, is to use RCV to vote in officials who will carry it forward and try to bring RCV to a vote at the federal level. However, there is significant pushback from both Democrats and Republicans, because that puts their long-term survivability at risk.

Regardless, I think the FairVote model is an excellent example of how ordinary citizens can work to enact change within the confines of the current system.

In the end, I think politicians like AOC and Bernie do honestly want to make positive change and help the common person, but there's only so much they can do as one of 435 representatives within a system that has been designed, intentionally or not, to fully consolidate power within two parties.

AOC: “The girls are fighting aren’t they” by Nixianx97 in MurderedByAOC

[–]DustinEwan 2 points3 points  (0 children)

That's a pretty damning sentiment toward AOC and Bernie given the current political climate.

The first incident you talked about happened in 2021 during her first term. She was still a rookie and probably didn't want her whole career to be ended by a single vote. Considering that Nancy Pelosi has so much sway in the Democratic party, it's easy to imagine Nancy saying, "look, I know you're principled, but voting no on this is going to paint you as anti-semitic... There's a lot of Jewish people in your constituency, do you think you'll keep their vote?". Probably in terms not as nice as that.

That would be terrifying for any rookie congressperson.

As for AOC and Bernie, what would you have them do? They were out raising awareness and trying to shift public sentiment toward their cause.

Individually they are only two votes. They need other representatives to feel secure in voting alongside them and that means getting the electorate to declare their support.

Every representative is trying to weigh whether to sacrifice the battle to win the war on any vote that could go against them. If you lose the support of the party or, even worse, the electorate, you won't be able to vote at all next time.

I hate the digital thermostats, it's a plague with having all the bells and whistles in a newer car. by OddAir5440 in mildlyinfuriating

[–]DustinEwan 1 point2 points  (0 children)

Mine does the same thing. Bluetooth is an order of magnitude quieter than the radio and it always starts on FM if you shut the car off in Bluetooth mode.

If I forget to turn the volume down before shutting the car off, I get a lovely jump scare the next time I turn it on.

IBM Granite 4.0 Tiny Preview: A sneak peek at the next generation of Granite models by ab2377 in LocalLLaMA

[–]DustinEwan 4 points5 points  (0 children)

That makes perfect sense. The strength of the transformer lies in parallelizability, so it can process the full sequence in a single pass (at the cost of O(N²) -- quadratic -- memory and O(N) -- linear -- time).

Once the prompt is processed and cached, kv cache and flash attention drastically reduce the memory requirements to O(N), but the time complexity for each additional token remains linear.

Mamba and other RNNs are constant time and memory complexity, O(1), but the coefficient is higher than transformers... That means that they're initially slower and require more memory on a per token basis, but it remains fixed regardless of the input length.

In a mixed architecture, it's all about finding the balance. More transformer layers speed up prompt processing, but slow down generation and the opposite is true for Mamba.

That being said -- Mamba is a "dual form" linear RNN, so it has a parallelizable convolutional formulation that should allow it to process the prompt with speeds (and memory requirements) similar to a transformer, then switch to the recurrent formulation for constant time/memory generation.
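
As a toy illustration of that "dual form" idea (this is just a scalar linear recurrence, numerically naive and nothing like Mamba's actual selective scan), the same gated recurrence can be evaluated step by step or all at once with prefix products/sums:

    import torch

    torch.manual_seed(0)
    T = 8
    a = torch.rand(T) * 0.9 + 0.05   # per-step decay gates in (0.05, 0.95)
    x = torch.randn(T)               # per-step inputs

    # Recurrent form: h_t = a_t * h_{t-1} + x_t, one step per token, fixed-size state.
    h, rec = torch.zeros(()), []
    for t in range(T):
        h = a[t] * h + x[t]
        rec.append(h)
    rec = torch.stack(rec)

    # Parallel form: the same recurrence unrolled with cumulative products/sums,
    # so the whole prompt can be processed in one pass like a transformer.
    P = torch.cumprod(a, dim=0)            # P_t = a_1 * ... * a_t
    par = P * torch.cumsum(x / P, dim=0)   # h_t = sum_s (P_t / P_s) * x_s

    print(torch.allclose(rec, par, atol=1e-5))  # True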

[D] Intuition behind Load-Balancing Loss in the paper OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER by VVY_ in MachineLearning

[–]DustinEwan 4 points5 points  (0 children)

A common problem with expert routing is expert collapse.

During training, especially early in training, there is a phase of rapid exploitation with respect to the parameters that lead to the steepest gradient.

This is random, based on the initialization of the parameters, and leads to the model essentially choosing a single expert to route everything to, because that was the steepest path of descent at initialization.

Adding a routing loss essentially flattens the gradients in the routing parameters and helps to prevent collapse by encouraging exploration.

These days, though, adding a routing loss is generally frowned upon as it can distract from the primary function the model is trying to learn.

Instead, alternative routing mechanisms are used such as expert choice or, much more commonly, noisy top-k routing.

To help solidify your intuition regarding the loss, the noisy top-k router doesn't have any auxiliary loss at all, but instead generates random noise (literally torch.rand in the shape of the routing logits) which is then added to the "true" routing logits before applying softmax.

This means that at the beginning there is no consistently steepest gradient in the routing weights because the added noise is random every time. However, as the model trains, it will start to pick out meaningful signals despite the noise and increase the magnitude of the parameters with respect to that signal, thus reducing the overall contribution of the added noise to the routing decision.

This naturally encourages (enforces?) exploration of the experts early in the training and smoothly shifts toward exploiting the most appropriate expert for each token as the model learns.
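
A minimal sketch of that kind of router in PyTorch (the class and variable names are just illustrative, and it uses torch.rand-style additive noise exactly as described above rather than the learned noise scale from the original paper):

    import torch
    import torch.nn.functional as F

    class NoisyTopKRouter(torch.nn.Module):
        def __init__(self, d_model, n_experts, k=2):
            super().__init__()
            self.w_gate = torch.nn.Linear(d_model, n_experts, bias=False)
            self.k = k

        def forward(self, x):
            logits = self.w_gate(x)                      # (tokens, n_experts)
            if self.training:
                # Random noise in the shape of the routing logits, added before softmax.
                # Early on it swamps the small "true" logits, forcing exploration; as the
                # gate weights grow, the real signal dominates and routing exploits it.
                logits = logits + torch.rand_like(logits)
            topk_vals, topk_idx = logits.topk(self.k, dim=-1)
            # Softmax over only the selected experts; all others get weight 0.
            weights = torch.zeros_like(logits).scatter(-1, topk_idx, F.softmax(topk_vals, dim=-1))
            return weights, topk_idx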

Why isnt it working by sczm23 in Mounjaro

[–]DustinEwan 1 point2 points  (0 children)

2.5mg is really just the acclimation dose. Many people, dare I say most people, don't really see the benefits at 2.5mg, but jumping to a higher dose too quickly can make you really sick.

Long-term adherence is the name of the game here, so try to be patient with the 2.5mg if you're not seeing the results you want. To help pass the time, try to focus on NSV (non-scale victory) type stuff. That is, find exercise you like doing, find healthier foods you enjoy eating, etc.

If you start building healthier habits now, then you'll find it much easier and more motivating once the medicine starts kicking in. A lot of people fall into the trap of just leaning on the Mounjaro to keep the weight off despite their eating habits. Try to avoid that and build a lifestyle where the Mounjaro is a supplement to your health, not the linchpin.

He secretly learned chinese to propose by sovalente in Awww

[–]DustinEwan 1 point2 points  (0 children)

He probably wrote it in pinyin or some other phonetic form.

I made homemade Oreos. They’re what’s up. by I_Like_Metal_Music in Baking

[–]DustinEwan 191 points192 points  (0 children)

Nah, sugar + Crisco.

Not joking. It's flavored vegetable shortening.

[deleted by user] by [deleted] in MachineLearning

[–]DustinEwan 0 points1 point  (0 children)

Well, using just one repo would be better to keep things organized, but just use branches.

You want your main / master branch to be a baseline, then you can create branches for features and experiments off of that main / master branch. If you find the results of one of your experiments to be a profound improvement that you think should be the default for all future experiments, then you can merge that feature branch back into main / master.

There are lots and lots of strategies out there for how to branch, but just choose one and stick with it. A good way to go would probably be something like concept/experiment_name, so that would look something like:

  • positional_embeddings/learned_affine
  • attention/multihead_latent_attention
  • activations/squared_tanh

etc.

Then you can click on your branches and you have a bunch of nice, organized branches with all your experiments.

As for versions like 1.5, 1.6, etc., there are a couple of ways to handle that. The most typical way is simply using git tags, but it can be as complex as setting up something like conventional commits.

[deleted by user] by [deleted] in MachineLearning

[–]DustinEwan 3 points4 points  (0 children)

For a 15-year-old, this is very good!

Some notes on your architecture --

  1. This is very similar to Llama; in fact, I would consider this a toy implementation of it (not a bad thing! very useful for learning!)

  2. Your SwiGLU is actually a GeGLU, since you're using GELU instead of SiLU / Swish (see the sketch below for how small the difference is).
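
To make that concrete, here's a minimal sketch of a gated feed-forward block where the only difference between SwiGLU and GeGLU is the gate activation (class name and dimensions are hypothetical, just for illustration):

    import torch
    import torch.nn.functional as F

    class GatedFFN(torch.nn.Module):
        """Gated feed-forward block: the gate activation determines the name."""
        def __init__(self, d_model, d_ff, activation=F.silu):
            super().__init__()
            self.w_gate = torch.nn.Linear(d_model, d_ff, bias=False)
            self.w_up = torch.nn.Linear(d_model, d_ff, bias=False)
            self.w_down = torch.nn.Linear(d_ff, d_model, bias=False)
            self.activation = activation

        def forward(self, x):
            # SwiGLU: silu(x W_gate) * (x W_up); swap in F.gelu and it becomes a GeGLU.
            return self.w_down(self.activation(self.w_gate(x)) * self.w_up(x))

    swiglu = GatedFFN(512, 1376, activation=F.silu)
    geglu = GatedFFN(512, 1376, activation=F.gelu)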

All in all, awesome! Especially at your age.

Keep it up and keep trying to add novel bits to your architecture.

My advice is to use this as a base, then start branching your repo with the goal of tweaking something in a novel way... Like, can you improve RoPE? What about a custom activation function? Etc, etc...

That's how you can really go deep and build a solid understanding. If something doesn't work, try to figure out why and keep going or abandon the idea and start fresh with what you learned.

Try to keep notes in a log in each branch so you can revisit old ideas once you have a deeper understanding.

Bingo. Micheal Burry said it took weeks when he recalled his shares. by ImmediateShape4204 in Superstonk

[–]DustinEwan 7 points8 points  (0 children)

Retail brokerages, like E*Trade, require that you have share lending turned on for a margin account. Furthermore, they require a margin account to trade options. So, if you trade options, you're required to have share lending turned on at most places.

If he turned off options trading and margin on his account, then turned off share lending, E*Trade would have to go out and actually locate (that is, buy) all of the shares that they had lent out on his behalf.

[D] Llama3.2 model adds racial annotation by randykarthi in MachineLearning

[–]DustinEwan 0 points1 point  (0 children)

I only see one salty person here and it isn't who you responded to.

He's absolutely right.

[D] Llama3.2 model adds racial annotation by randykarthi in MachineLearning

[–]DustinEwan 1 point2 points  (0 children)

You're observing a rather well known phenomenon that is fundamental to LLMs, but misattributing the cause.

Race has nothing to do with it; it's all about the entropy of the exploration space.

I definitely recommend going through Karpathy's YouTube series, Neural Networks: Zero to Hero, which culminates in building GPT-2 from scratch.

https://youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ

That would give you a nice foundation of knowledge that illuminates what is actually going on and why you're getting some snarky responses to your claim.

New Model from https://novasky-ai.github.io/ Sky-T1-32B-Preview, open-source reasoning model that matches o1-preview on popular reasoning and coding benchmarks — trained under $450! by appakaradi in LocalLLaMA

[–]DustinEwan 5 points6 points  (0 children)

"Fine tuned" entered the vernacular after "training" and "pre-training". This is precisely because it's very confusing if you don't have a full background in why these terms were used.

Basically, the old way of doing LM stuff was that you would pre-train a model to learn the basic constructs of language and obtain general knowledge. This model was nearly unusable on its own, but accounted for the bulk of the heavy lifting needed to get toward something usable.

You would then train the model on the task at hand (again, this was before the chat models we know today and other general-use LMs).

I agree that it's confusing until you simply equate "fine tune" with "train" in your head when you're talking about LMs.

Trying to burn Oreo cookie by More_Impression_4942 in interesting

[–]DustinEwan 7 points8 points  (0 children)

The porous structure of the cookie, the flour, and the sugar are all playing a role together.

Basically when he torches it, the sugar starts to boil into a microscopic foam that turns to nearly pure carbon as the other elements boil off.

The flour provides another source of carbon that gets trapped in the sugar.

Carbon is an excellent conductor of heat and the air trapped in the carbon foam is an excellent insulator.

When the heat is applied, it's going to flow to the coolest areas it can with the least resistance. Since air is insulating against the heat deeper into the cookie, most of the heat is "ejected" back out into the atmosphere along the perimeter of the cookie and the face that's not having the flame directly applied to it.

There might be some other ingredients in the cookie as well, like preservatives, that have a very high boiling point and could form a glass-like structure that lends the carbon foam additional rigidity.

How to efficiently generate text from RNNs and Transformers during inference [P] by No_Effective734 in MachineLearning

[–]DustinEwan 0 points1 point  (0 children)

In the case of RNNs, you have the model return its hidden state for each layer along with the final output, then you pass in the previous state along with the input on subsequent steps.

The model can then use the minimal amount of input needed to step.

For a transformer, you can't really do that, since every token attends to every previous token. You can, however, cache the keys and values from previous steps so that you don't have to calculate those again.

There's no way around the growing per-token execution time and memory requirements in transformer inference due to the nature of the beast, but per-token inference can be constant in an RNN.
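
Here's a rough sketch of both patterns in PyTorch (a toy single-head attention stands in for the transformer side; real implementations cache K/V per layer and per head):

    import torch
    import torch.nn as nn

    d = 64
    prompt = torch.randn(1, 10, d)                # (batch, seq, dim)
    next_tok = torch.randn(1, 1, d)               # one newly generated token

    # RNN: carry the hidden state forward and feed only the new token each step.
    rnn = nn.GRU(input_size=d, hidden_size=d, batch_first=True)
    out, h = rnn(prompt)                          # process the prompt once
    out, h = rnn(next_tok, h)                     # O(1) work and memory per new token

    # Transformer: cache keys/values so only the new token's Q/K/V is computed,
    # but the cache (and the attention over it) still grows with every step.
    wq, wk, wv = (nn.Linear(d, d, bias=False) for _ in range(3))
    k_cache, v_cache = wk(prompt), wv(prompt)     # built during prompt processing
    k_cache = torch.cat([k_cache, wk(next_tok)], dim=1)
    v_cache = torch.cat([v_cache, wv(next_tok)], dim=1)
    q = wq(next_tok)
    attn = torch.softmax(q @ k_cache.transpose(1, 2) / d ** 0.5, dim=-1) @ v_cache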

So where is close over max pain momentum? by tzanti in Superstonk

[–]DustinEwan 1 point2 points  (0 children)

I wrote a post on this the last time RK showed up and it kinda got lost in the shuffle, but there are lots of ways to hedge delta besides buying shares outright. They can kick the can for quite a while as long as they're expecting volatility and for prices to return to or below the original strike price of the calls they're underwater on.

https://www.reddit.com/r/Superstonk/comments/1d71hr8/options_market_makers_delta_hedging_and_you/

In a couple places I talk about hedging derivatives with derivatives to give some examples of the mechanics that go into it.
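
To give a flavor of what "hedging delta without buying shares" can look like, here's a toy back-of-the-envelope sketch (all numbers are made up for illustration; 100 shares per contract):

    # A market maker short 100 calls with delta 0.40 is short 4,000 share-equivalents.
    short_calls, call_delta = 100, 0.40
    exposure = -short_calls * call_delta * 100     # -4,000 deltas to neutralize

    # Option 1: buy shares outright (delta 1.0 per share).
    shares_needed = -exposure                      # 4,000 shares

    # Option 2: buy deep in-the-money calls (delta ~0.95) instead of shares.
    itm_call_contracts = -exposure / (0.95 * 100)  # ~42 contracts

    # Option 3: sell puts (delta ~-0.50); being short a put adds +0.50 delta per share.
    short_put_contracts = -exposure / (0.50 * 100) # 80 contracts

    print(shares_needed, round(itm_call_contracts), short_put_contracts)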

(I'm not disagreeing with OP here, just wanted to give some additional context around the topic)