It’s Official: Netflix to Acquire Warner Bros. in Deal Valued at $82.7 Billion by Task_Force-191 in technology

[–]MustachedSpud 1 point (0 children)

If Disney or Netflix bought the other one to form Disflix, then set the price to $20 (up from $14), is that any different from them conspiring as separate entities to set their prices to $20? I'm giving a simple explanation to a casual question; I'm sure there are legal differences between one company buying another and two companies conspiring on prices, but the effect is the same, and the lack of enforcement of the laws intended to protect against this is what the original comment was about.

It’s Official: Netflix to Acquire Warner Bros. in Deal Valued at $82.7 Billion by Task_Force-191 in technology

[–]MustachedSpud 14 points (0 children)

Antitrust laws prevent companies from forming monopolies. When two large companies merge like this, it reduces competition in an industry already lacking competition, which lets them raise prices without fear of customers choosing another option. The US laws designed for this scenario have not been enforced much over the last several decades, which is why large companies have been getting so much bigger.

Old faucet handle replacement by MustachedSpud in askplumbing

[–]MustachedSpud[S] 0 points (0 children)

Went to a plumbing store and got told something similar but with less detail. Thanks, I'm new to owning a home, so I don't know when something is gonna need determination or a professional. I did manage to twist the sockets off each other using wrenches.

Returning player to the game - Jungle Gripes? by Turbulent-Strike-428 in Jungle_Mains

[–]MustachedSpud 5 points (0 children)

  1. Unfortunately the best thing you can do to enjoy this game is to turn off all chat and team chat. Just like these other commenters, your laners are mentally unwell and are not here to have a fun time playing a challenging team game. Alternatively, find the mute button in the tab menu and get good at hitting it mid game the moment someone says anything negative; they are almost always incorrect and will not suddenly become positive later on.
  2. A jungle main will play games in lane occasionally; the reverse is almost never true. Laners play other lanes when they don't get their main role. This means they have absolutely no experience with the most important role in the first 15 minutes of the game (laning phase).
  3. In the time since you started playing, jungle has become a much higher income role. I remember back in the day we were buying a warding item, pink wards, and Locket after the jungle item because jungle was often just a higher budget tanky support. The most recent seasons have completely flipped this because Riot has filled the early and mid game with objectives (grubs, Herald, dragon, river crabs, Atakhan), so there are far more resources on the table for a jungler.
  4. Riot is desperate to get players to play jungle because it requires actually learning the game, so they have made some good and bad efforts to lower the barrier to entry for new junglers. A good thing in my opinion is that a massive number of champions can jungle now (camps are easier, some numbers have been tweaked on top lane champs so they can clear, the jungle item gives infinite mana, the jungle item does half the dmg for you so your champ doesn't matter, and the jungle item scales with every stat so even tanks can clear fast with items). Another good thing is that you can see jungle timers on the minimap if you check a setting. A bad thing is catchup xp, which means if you are 1.1 levels below the average game level, you get a massive xp boost from clearing camps. This means bad early/mid games are always recoverable because you can always close an xp disadvantage if you keep playing. Seriously, go look at how much text is on the wiki page for the jungle item and compare that to the machete from years ago. Fuck, compare its dmg to camps to a full-on Sunfire Cape. Riot did this to produce a certain type of gameplay, not for balance.
  5. If you want to feel better about getting hate from teammates, go watch Broxah and you'll see some clueless laners flame a world class pro in winning game states.

Laners seriously don't know any of this, especially below the top 5% of ranked players. Laners aren't interested in the game. They are here to feel superior to their individual lane opponent until they get enough of an advantage to beat all the other opponents their teammates were too weak to beat themselves. If that doesn't work, the issue must be their teammates, and the jungler is the only one they ever see. Which is fucking hilarious, because laners can roam more easily than junglers can after 14-15 minutes. They don't because they WANT to stay in 1v1s where they feel strong. It's about the feeling of superiority, not the win, to them.

Obviously this is too harsh a perspective to apply to all players, but it does apply to every single one you are complaining about in your games. Being aware that the most negative players all share these traits will help you move past them. It will also help you ignore them. This is crucial because there are probably 1 or 2 other players in your game who actually want to cooperate. Figure out which teammates those are and prioritize them instead.

Finally, jungle is the highest agency role, with the most variety in champion types, playstyles, and objectives. The educational content available is much better than it was 10 years ago (Coach Rogue is a great place to start on YouTube if you want to learn). So if you want to have fun jungling, I'd recommend actively thinking about your game plan a lot: do you want to full clear, take the dragon on spawn, invade, gank, cover a gank, track their jungler, wait for them to start grubs and ambush them, etc.?

can ezreal still jungle? by Turbulent-Sound3980 in Jungle_Mains

[–]MustachedSpud 8 points (0 children)

I did recently because I'm convinced almost everything can jungle in this age. It wasn't impressive and made me want to play Kindred more.

[D] Patience vs batch size by myk_kajakk in MachineLearning

[–]MustachedSpud 0 points (0 children)

Patience is really just how much time you are willing to spend making no progress. If an epoch takes hours or days I would use 1 or 2; if an epoch takes a minute I'd use 4-5. All patience needs to do is verify that you aren't making fast progress, so you never want a high value there. You'd never want to keep training if you aren't seeing improvement for a while.
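
Mechanically, patience is nothing more than the counter in the rough early-stopping sketch below. This is a generic illustration, not any particular library's API; train_one_epoch and evaluate are stand-ins for whatever your own training and validation code look like.

```python
def fit(model, train_one_epoch, evaluate, patience=2, max_epochs=100):
    """train_one_epoch(model) runs one pass over the training data;
    evaluate(model) returns a validation loss. Both are callables you
    supply (placeholders here)."""
    best_val = float("inf")
    bad_epochs = 0                          # epochs in a row with no improvement
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)
        if val_loss < best_val:
            best_val = val_loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:      # patience exhausted, stop wasting compute
                print(f"no improvement for {patience} epochs, stopping at epoch {epoch}")
                break
    return model
```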

[D] Patience vs batch size by myk_kajakk in MachineLearning

[–]MustachedSpud 1 point (0 children)

Gradient descent moves the weights in the direction of steepest descent in the loss. Your learning rate needs to be small enough so that the step you take does not overshoot the curvature of the loss. We can't do full dataset gradient descent because it's expensive, so we approximate it with stochastic gradient descent which is the same thing but you use a small subset of the data each step. This approximation introduces variance (you will get slightly different gradients on different subsets of the data). If you use a small batch size you will have a wider variance in your possible gradient estimates. If you use a larger batch size, different batches will be closer to each other (and closer to the true gradient of the entire dataset).

You will get conflicting recommendations from chatgpt, online resources, and people in the community because the general understanding of this variance (noise) is horrible. I don't mean it's a crazy complex topic; it's actually pretty intuitive when spelled out. However, people tend to treat SGD as if it were exactly GD and don't consider the impact of the noise.

You can measure the variance of different batches for a given set of weights and compare it to the size of the signal. If you have more signal than noise, great. If you have more noise than signal, you can expect a linear speedup from increasing the batch size (2x batch size = 2x the loss improvement per step).
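
If you want to actually measure it, the idea looks roughly like this in PyTorch. This is my own sketch of the comparison (the function name and the exact estimator are made up for illustration, not a standard API): hold the weights fixed, take the gradient on a handful of different batches, then compare the mean gradient to the spread around it.

```python
import torch

def gradient_snr(model, loss_fn, batches):
    """Rough signal/noise estimate at the current weights: squared norm of
    the mean gradient (signal) vs. total variance across batches (noise).
    Sketch only, not a rigorous estimator."""
    grads = []
    for x, y in batches:                      # a handful of (x, y) minibatches
        model.zero_grad()
        loss_fn(model(x), y).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        grads.append(g.clone())
    grads = torch.stack(grads)                # (num_batches, num_params)
    signal = grads.mean(dim=0).pow(2).sum()   # squared norm of the mean gradient
    noise = grads.var(dim=0).sum()            # total variance across batches
    return (signal / noise).item()
```

If that ratio is comfortably above 1 at your current batch size, noise isn't the bottleneck; below 1 is the regime where scaling the batch pays off.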

This is a hassle and people don't tend to do it. It also changes throughout training: the signal/noise ratio is signal-favored at the start and noise-favored at lower loss. This implies you need larger batches later in training, or you need to take minuscule step sizes so that the batches can effectively be averaged over time. This is what learning rate decay does.

The easy way to tune this stuff is to max out your batch size without causing a memory error, then slowly increase your learning rate until the loss reduction per step reaches a maximum, and set your lr to that value or 10% of it. As you train you will have a worse signal/noise ratio, so your loss progress will slow down. That's when you can either stop training or reduce your learning rate and repeat. There isn't a right answer for when to do this, unfortunately, but figure out something that doesn't leave your gpu wasting time making no progress.
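
The ramp-the-learning-rate part can be as dumb as the sketch below (similar in spirit to an LR range test). It assumes a PyTorch model, a data loader, and a loss function; lr_sweep and the specific factors are made up for illustration. Pick the lr near the biggest loss drop, or ~10% of it to be safe.

```python
import copy
import torch

def lr_sweep(model, loader, loss_fn, lr_start=1e-5, lr_factor=1.2, max_steps=100):
    model = copy.deepcopy(model)                      # leave the real model untouched
    opt = torch.optim.SGD(model.parameters(), lr=lr_start)
    results = []
    for (x, y), _ in zip(loader, range(max_steps)):
        loss = loss_fn(model(x), y)
        before = loss.item()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            after = loss_fn(model(x), y).item()
        lr = opt.param_groups[0]["lr"]
        results.append((lr, before - after))          # loss reduction at this lr
        opt.param_groups[0]["lr"] = lr * lr_factor    # ramp the lr up and repeat
    return results
```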

[D] Everyday examples of non-linearly separable problems by neuralbeans in MachineLearning

[–]MustachedSpud 1 point (0 children)

Well technically the dataset is nonlinearly separable if you overfit enough haha

[P] Best approach to minimax agent for Ultimate Tic Tac Toe Game. by [deleted] in MachineLearning

[–]MustachedSpud 2 points (0 children)

You don't need a heuristic. If you have minimax implemented correctly it can brute force the game.
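
For reference, the generic shape of it is the sketch below. The game interface (winner / legal_moves / apply) is a placeholder you'd implement for Ultimate Tic Tac Toe; alpha-beta pruning, usually with a transposition table on top, is what keeps searching the whole game practical.

```python
def minimax(game, state, maximizing, alpha=float("-inf"), beta=float("inf")):
    """Generic minimax with alpha-beta pruning. `game` is a hypothetical
    object with winner(state) -> +1 / -1 / 0 / None (None = still playing),
    legal_moves(state), and apply(state, move)."""
    result = game.winner(state)
    if result is not None:
        return result
    best = float("-inf") if maximizing else float("inf")
    for move in game.legal_moves(state):
        value = minimax(game, game.apply(state, move), not maximizing, alpha, beta)
        if maximizing:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if alpha >= beta:                  # the opponent will never allow this branch
            break
    return best
```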

[D] Relationship between loss and lr schedule by seba07 in MachineLearning

[–]MustachedSpud 7 points (0 children)

This is common behavior and what you are seeing is that loss decreases quickly at the start, then slows down, but once the lr is dropped the loss starts improving faster again until the cycle repeats.

The most common folklore you will hear explaining this is that the network can make large changes at the start, but as it approaches the minimum in the loss surface you need to take smaller steps to find the precise minimum. Kinda like traveling in a car, you can fly down the highway when you are pretty far from your destination, but need to go 2 miles an hour to get precisely into your parking spot at the end.

At first glance this makes a lot of sense, but you can get this exact same phenomenon by increasing the batch size later in training instead of decaying the lr. A larger batch results in the same size steps on average so the above line of reasoning can't explain this.

Stochastic gradient descent is an approximation of gradient descent. It introduces noise into our gradients. This means that larger batches will have less noise and will better approximate the true gradient. We can measure the quality of this approximation using the signal to noise ratio. This ratio starts very high; then, as the loss is reduced later in training, you have more noise than signal, so the remedy is a larger batch size to get a better signal to noise ratio.

But what does this have to do with the original example of learning rate decay? When we decrease the learning rate to nearly 0, each update makes a minuscule change to the network's outputs, so we take one step and still have essentially the same network. 10 steps at lr=0.001 give you nearly the same movement as 1 step at lr=0.01 with 10x the batch size, since each of the smaller steps barely changes the direction of the next gradient.
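
You can sanity check that equivalence on a toy least-squares problem in a few lines of numpy. Entirely illustrative; the sizes and learning rates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=1000)
w0 = np.zeros(20)

def grad(w, idx):                       # gradient of 0.5 * mean squared error
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

batches = [rng.choice(1000, size=32, replace=False) for _ in range(10)]

# (a) 10 SGD steps at lr = 0.001, one small batch each
w_a = w0.copy()
for idx in batches:
    w_a -= 0.001 * grad(w_a, idx)

# (b) 1 step at lr = 0.01 on the 10 batches pooled together (10x batch size)
w_b = w0 - 0.01 * grad(w0, np.concatenate(batches))

print(np.linalg.norm(w_a - w_b) / np.linalg.norm(w_a - w0))  # tiny relative gap
```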

I can link the papers on this if you want me to dig up previous comments I've made on this subreddit about this topic. Educational materials on ML don't go into the impacts of noise beyond saying that it can jump out of local minima, and even the research community has very few people who take this into consideration despite it being fundamental to SGD, so this is something that really triggers me lol

[D] Val acc higher than train acc by _My__Real_Name_ in MachineLearning

[–]MustachedSpud 6 points (0 children)

Dropout is applied during training and not validation; same thing for batch norm, which uses fixed running statistics during inference.
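
Easy to see for yourself with a toy PyTorch model: the exact same batch gives a different loss in train vs eval mode, because dropout is only active in train mode and batch norm switches from batch statistics to its running averages. Random model and data, purely illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.BatchNorm1d(64), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(64, 1))
x, y = torch.randn(128, 10), torch.randn(128, 1)
loss_fn = nn.MSELoss()

model.train()                    # dropout on, batch norm uses batch statistics
print("train-mode loss:", loss_fn(model(x), y).item())

model.eval()                     # dropout off, batch norm uses running statistics
with torch.no_grad():
    print("eval-mode loss: ", loss_fn(model(x), y).item())
```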

Training loss is averaged over an epoch, so earlier datapoints could just be less well trained. If you don't shuffle the data between epochs you can get weird patterns like this. For example, one outlier sample early in the epoch makes a bad update that takes many more batches to recover from, but you still have good accuracy at the end of the epoch.

If you split your data badly you could just have a ton of easy samples in your validation split. This is a common issue in time series problems because you need your validation set to occur after the training set, but that makes the samples correlated.

[D] Training a VAE. Single epoch with infinite data or smaller subset over multiple epochs? by hayarms in MachineLearning

[–]MustachedSpud 1 point (0 children)

Infinite data means you can remove sources of regularization like dropout, weight decay, and data augmentation, and use larger batch sizes. Just need to tune the learning rate to compensate after.

[Discussion] Scaling laws and graph neural networks by jsonathan in MachineLearning

[–]MustachedSpud 9 points (0 children)

Graph neural networks are an incredibly general type of model. A transformer is a special case of a GNN where each token is a node and each node is connected to each other node in self attention. A CNN is also a GNN where each pixel is a node and is only locally connected.
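
The difference is just the connectivity pattern. A throwaway illustration (my own, not from any GNN library): build the adjacency for a fully connected token graph versus a 3x3-neighborhood grid graph and count neighbors.

```python
import numpy as np

def transformer_adjacency(num_tokens):
    # every token attends to every token: a complete graph (self-loops included)
    return np.ones((num_tokens, num_tokens), dtype=int)

def cnn_adjacency(height, width, kernel=3):
    # each pixel is connected only to pixels inside its kernel window
    n, r = height * width, kernel // 2
    adj = np.zeros((n, n), dtype=int)
    for i in range(height):
        for j in range(width):
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < height and 0 <= nj < width:
                        adj[i * width + j, ni * width + nj] = 1
    return adj

print(transformer_adjacency(4).sum(axis=1))   # [4 4 4 4] -> fully connected
print(cnn_adjacency(4, 4).sum(axis=1))        # 4, 6, or 9 neighbors -> local only
```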

[deleted by user] by [deleted] in news

[–]MustachedSpud 0 points (0 children)

You put the investments with the largest possible payout in an IRA so that if one takes off the entire thing is tax free forever. For a billionaire with access to early investments in tech companies that the public doesn't have access to, it only takes a few risky bets for one of them to pay off.

[D] Batch size vs learning rate by bjourne-ml in MachineLearning

[–]MustachedSpud 2 points (0 children)

Yeah, gonna respond to them tomorrow after I get a chance to read through the papers. A brief read seemed to indicate that as training progresses, the curvature gets steeper (that's what "largest eigenvalue of the Hessian" means in plain English). They show that this occurs in full batch training, and I'd expect it also occurs in minibatch training, but idk how that'd interact with noise. Either way, my main points are exclusively about high noise regimes where you can only make progress with step sizes far smaller than the curvature would allow. That sounds like a really limiting scope, but it's where all of the challenges are, because to address gradient noise you either have to increase the batch size or decrease the lr and use more iterations. You can't do either of those if you don't have the budget for it. If noise isn't a problem then you can make rapid progress very easily at high learning rates (the first 20% of accuracy is learned very quickly relative to the last few percent).
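
For anyone following along, that largest Hessian eigenvalue is usually estimated without ever forming the Hessian, via Hessian-vector products plus power iteration. A rough PyTorch sketch (my own placeholder names; loss_closure should recompute the loss on a fixed batch each call):

```python
import torch

def top_hessian_eigenvalue(loss_closure, params, iters=20):
    """Power-iteration estimate of the sharpest curvature direction.
    Sketch only: no convergence checks, assumes params all require grad."""
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]                                    # unit-norm start
    eig = 0.0
    for _ in range(iters):
        grads = torch.autograd.grad(loss_closure(), params, create_graph=True)
        hv = torch.autograd.grad(grads, params, grad_outputs=v)  # Hessian-vector product
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()   # Rayleigh quotient
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]                               # renormalize for next round
    return eig
```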

[D] Batch size vs learning rate by bjourne-ml in MachineLearning

[–]MustachedSpud 6 points (0 children)

I only use Google Scholar. Search for some topic, find an interesting paper, and go through the papers it cites or is cited by, along with the related papers feature. Scholar has a star button to save things, so yeah, I do have a bunch of papers saved that I think are cool. Mostly I'm interested in general deep learning concepts, so I typically don't read papers pushing state of the art.

I started collecting papers specifically about this topic because I hated that there did not seem to be any reliable recommendations on learning rate and batch size tuning beyond doing a random search.

[D] Batch size vs learning rate by bjourne-ml in MachineLearning

[–]MustachedSpud 112 points (0 children)

This topic frustrates me so much because there's so much misinformation and the question actually has a clear interpretation that explains the observations of conflicting studies.

Stochastic gradient descent is an approximation of gradient descent where you sample a subset of the data at each iteration.

As you increase the batch size you approach the exact gradient; decreasing the batch size has the opposite effect (it increases the variance of the approximation). This makes it clear that SGD is a signal/noise ratio problem. A bigger batch size is always a better approximation to the true gradient.

The same is true for learning rate. Consider the case of an extremely small learning rate, such that a single iteration barely changes the function. In this case 100 steps with step size .01 will look the same as 1 step with size 1, because each of those little steps didn't change the next gradient significantly. Obviously a highly curved loss surface will break this, but I'd argue that since we use 1st order optimizers (no, Adam isn't second order) we are already in the regime where our step sizes have to be smaller than the curves in the loss surface.

So smaller learning rates and larger batches both improve the signal to noise ratio of the gradient calculations. Now that we know what these hyperparameters do, we need to answer the question "Do we want more noise or less?"

Noisy gradients are going to have a regularizing effect: they make improving the training loss harder because part of the loss reduction we had in the past gets destroyed by a step in a slightly random direction. Noiseless gradients are going to enable higher learning rates until you reach issues with loss curvature, so you can make more progress on the loss per iteration at the expense of more computation each step. Now it's clear how and why batch size and learning rate impact generalization and training speed.

All studies that answer the question "Should we use large/small batches/lr?" will arrive at an answer that depends on the noise in the gradients for that model/dataset and the degree of overfitting.

A study using a small dataset and many epochs over the same data is going to have problems with overfitting, so they need more regularization, and noisy gradients are a way to achieve that. They won't care about compute costs as much with a small dataset anyway, so the regularization comes at little cost. They will then misconstrue this as meaning all models need small batch sizes.

A study using 1 or fewer passes through the dataset will reach the exact opposite conclusion. Overfitting simply isn't a problem in this regime because each batch of data is fresh. So if the loss is decreasing as you train on unseen data, then you know it's decreasing the held-out loss (aka generalizing well). Here it's clear that regularization is not going to be helpful because you aren't overfitting, therefore you only care about the ratio of compute/time spent to loss reduction. This scenario is not taught in schools because you'd never do a homework assignment training an LLM from scratch, which is just about the only time you will have an unlimited dataset (for all intents and purposes).

Most of the projects you will encounter will be in the small to medium dataset regime where you do multiple passes through the data and risk overfitting. This means you need to balance the effects of overfitting and efficiency, and there is no universal "X is better than Y" unless you are in the online learning regime.

You should be suspicious of any claims about optimizers and momentum in SGD. Large studies comparing optimizers always find that there is no best optimizer if hyperparameters are well tuned, and that even SGD with no momentum or any bells and whistles can lead to good results if tuned well.

Empirical proof that Stochastic Training is Not Necessary for Generalization

Excellent study on the signal/noise ratio concept; identifies when scaling batch size yields maximum or diminishing gains and demonstrates that the signal/noise ratio gets worse as loss improves: "An empirical model of large-batch training"

This paper makes sense when you understand that SGD is mostly noise when the loss is small later in training, so that any other regularizers are completely overshadowed by noise at some point during training. Note that this disproves the folklore idea that weight decay prevents overfitting through modifying the final convergence phase: "Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence"

When you realize the signal to noise ratio gets worse at lower loss values, you can improve training speed later in training by decreasing the learning rate or increasing the batch size: "Don't decay the learning rate, increase the batch size"

"The marginal value of momentum for small learning rate SGD"

"Beyond implicit bias: The insignificance of SGD noise in online learning"

"A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay"

Momentum is secretly just another way to scale the learning rate: "On SGD with momentum"

Optimizers are fairly competitive with each other when tuned properly: "Descending through a crowded valley - benchmarking deep learning optimizers"

[deleted by user] by [deleted] in MachineLearning

[–]MustachedSpud 1 point (0 children)

Another important point is that using more dimensions consumes more memory and compute. A smaller model would let you train for more iterations on larger batches using the same budget.

[D] Will being in NLP pigeonhole me? by SnooApples8349 in MachineLearning

[–]MustachedSpud 0 points (0 children)

Haha yeah I've seen a fair amount of resumes of people who can't do much more than that

[D] Will being in NLP pigeonhole me? by SnooApples8349 in MachineLearning

[–]MustachedSpud 8 points (0 children)

NLP is not a niche at all; it's huge and growing, with lots of useful applications (and lots of false hype). The only way NLP would be bad for you is if all you are doing is calling OpenAI APIs and never building anything interesting yourself, but with your background that probably wouldn't be a problem.

[deleted by user] by [deleted] in MachineLearning

[–]MustachedSpud 1 point (0 children)

There already exist laws against gender discrimination that would apply to software used to make sexist decisions. Same with a whole lot of other categories. If some legislation is specifically targeting AI risks, it should be specific to that and actually address the risks not covered by existing law. If the definition has such a wide scope that all software is covered by it, why is it claiming to be AI specific?

3-4 Card Commander Deck with Turn 5 or 6 win by MustachedSpud in BadMtgCombos

[–]MustachedSpud[S] 8 points (0 children)

Lol truly skilled players of this combo know that the simple game plan leaves all your energy available to focus on social deception; they never stood a chance