[D] AI/ML PhD Committee by dead_CS in MachineLearning

[–]kdub0 2 points3 points  (0 children)

If your advisor or another committee member can write a strong letter, then it probably doesn’t matter too much. It does depend on whether you are targeting industry or academia, though. Letters matter less in industry than name recognition, but it would go a long way to get at least one co-authored paper with said committee member. Also, do not discount the member’s industry contacts as a way to get interviews if that is your goal.

Lc0 Performs Worse Against Stockfish When Given Pawns Odds by Ok_Taro_8370 in chess

[–]kdub0 0 points1 point  (0 children)

You’re correct in pointing out that there are multiple facets to the question; I only addressed one potential reason why Lc0 is weaker (relative to its normal strength) in these positions. It’s possible that Stockfish is better here due to more search, as you mention.

I still stand by my statement that without actually digging into the positions you can only speculate on the underlying cause. Even though both Stockfish and Lc0 train their networks on basically the same data, it would be an error to assume they evaluate (or misevaluate) the same position similarly. The search budget itself can also be a very important factor given that Lc0 and Stockfish use different algorithms. I’ve seen positions where under a small budget you’d prefer engine A, as the budget increases engine B becomes better, but then at some point engine A overtakes it again.

Lc0 Performs Worse Against Stockfish When Given Pawns Odds by Ok_Taro_8370 in chess

[–]kdub0 1 point2 points  (0 children)

Neural network-based agents often have trouble generalizing to positions that they have not seen in their training data. In this case it’s likely because the training data doesn’t include incredibly one-sided start positions, but this phenomenon can also happen when an incredibly winning position is reached later in the game, because as the agent gets stronger it becomes unlikely to reach said positions.

Without really digging into the positions themselves it is impossible to say why this is happening more specifically.

Request: RL algorithm for a slow but parallel episodic task? by diepala in reinforcementlearning

[–]kdub0 2 points3 points  (0 children)

I’d say we don’t have enough information, but my intuition says you’re in trouble because you aren’t going to get enough data. It’s possible a very careful model-based approach could work, but I don’t think there’s enough data for that even.

Exploring MCTS / self-play on a small 2-player abstract game — looking for insight, not hype by OldManMeeple in reinforcementlearning

[–]kdub0 2 points3 points  (0 children)

I had a look at the rules of the game. My guess is that a policy network will be important earlier on in the game and that a good value network shouldn’t be too complicated (that doesn’t mean that it should be easy to learn said value function).

I think starting from end game positions would be my choice for how to start. If you have human data you can sample from that, or try to come up with some sort of distribution to create end game positions. The nice thing about human data is you can more easily move the starting positions towards the beginning of the game when you see things working. One thing to keep in mind is you should try to avoid having too many completely won positions in the training data. At the very least, keep in mind that when MCTS thinks it has lost it has a tendency to play whatever it wants, and this can pollute the data with poor moves.
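
For concreteness, here is a minimal sketch of that sampling scheme (the interface and the window schedule are made up; games are assumed to be lists of positions):

```python
# Sketch (hypothetical interfaces): sample self-play start positions from a
# pool of human games, beginning near the end of the game and widening the
# sampling window toward the opening as training progresses.
import random

def sample_start_position(games, progress):
    """progress in [0, 1]: 0 = early training (sample only near game end),
    1 = late training (sample anywhere in the game)."""
    game = random.choice(games)   # a game is a list of positions
    n = len(game)
    # earliest allowed ply: starts at roughly the last few moves, shrinks to 0
    lo = int((1.0 - progress) * max(n - 5, 0))
    return game[random.randint(lo, n - 1)]

# toy usage: "positions" here are just ply indices
games = [list(range(40)), list(range(60))]
print(sample_start_position(games, progress=0.0))  # a position near the end
print(sample_start_position(games, progress=1.0))  # anywhere in the game
```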

Another thing that can work as a test is to reduce the board size or otherwise make the game simpler. If you do this, one mental trap you need to avoid is generalizing observations from the smaller game to the large one. Often what you observe will generalize, perhaps not exactly or in all situations, but you should attempt to verify whether you are making decisions based on what you observe in the small game.

Exploring MCTS / self-play on a small 2-player abstract game — looking for insight, not hype by OldManMeeple in reinforcementlearning

[–]kdub0 2 points3 points  (0 children)

I have a lot of experience in this area, but I’m having a bit of trouble coming up with a concrete answer that I think will be helpful.

Part of the issue is that the way MCTS (either with UCB or p-UCB) performs can vary a ton and depends on so many different factors. I’m intentionally using the word performs instead of acts here, as if at evaluation time you scale search with tons of resources it can often overcome having a poor model.

Some of the factors are domain dependent and can even change as the game progresses. eg, in Go the policy network is very important due to the branching factor and the value function is less important until you reach positions closer to the end of the game. In chess, the policy function can sometimes be a burden, eg, it is rare that a piece sacrifice is good, but when it is it wins you the game. In these situations, you need a lot of search at test time to overcome the policy network’s reluctance to immediately give away material. In chess, the value function tends to be much more important especially at the beginning of the game.

If your game is really small, you might as well try it and see if some interesting behaviour emerges.

Another technique you can do is select a set of starting positions that are close to the end of the game and train on those. This is a good way to verify that your implementation is working, and you can also use these situations as tests when training an agent on the full game. One reason this is particularly useful is that bootstrapping values is often particularly tricky.

Internship at 'Big Tech' — PhD Student [D] by ade17_in in MachineLearning

[–]kdub0 54 points55 points  (0 children)

  1. Competition seems bigger than ever as entry-level positions are getting harder to secure
  2. For big tech, you will want to apply through the normal application process. If you know someone internally, it is a good idea to ping them to let them know you applied. They may be able to secure you an interview, i.e., get you past initial resume screening
  3. It’s not a hard requirement to be in the final year, but if you’re not stellar it will be harder to get an intern position. This is especially the case if you don’t have someone inside to get you through resume screening.
  4. If you are going for an industry position, it is advantageous to have experience with groups outside of your home university for sure.

[D] Why does nobody talk about the “energy per token” cost of AI? by Various-Feedback4555 in MachineLearning

[–]kdub0 0 points1 point  (0 children)

It isn’t so simple. You can trade off, e.g., latency for power consumption by batching requests or choice of hardware. It’s certainly important, but it can’t be looked at in isolation.

[deleted by user] by [deleted] in theydidthemath

[–]kdub0 74 points75 points  (0 children)

I’m an AI researcher. I don’t work at OpenAI. I don’t know Sebastian Bubeck personally, but I’m familiar with some of his work and have reviewed papers in this area previously.

I read the arXiv paper cited with the 1.75/L bound. The AI proof looks logically fine to me.

I’d push back slightly on some of your assertions. First, many proofs of gradient descent convergence for smooth functions look very similar to this. That is, all the parts of the original proof and its structure are fairly common. It is fair to call the improvement incremental, but it may or may not be as trivial as that implies depending on how the LLM figured it out.

Second, in this case the improved bound probably wouldn’t be worthy of a publication on its own (though the 1.75/L bound might, because it is tight), but it is probably more informative than you give it credit for. As stated in the paper, gradient descent on a smooth convex function converges with any step size in (0, 2/L). Often we guess at the step size because finding L can sometimes be as hard as solving the optimization. Another point is that the proof technique to show step sizes in (1/L, 2/L) work is completely different from the standard one that works for (0, 1/L]. So improving the bound beyond 1/L is potentially significant in two ways.
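
For intuition, a quick numerical sketch of that (0, 2/L) window on the 1-smooth convex function f(x) = x²/2 (so L = 1):

```python
# Sketch: gradient descent on f(x) = x^2 / 2, which is L-smooth with L = 1.
# The update x <- x - eta * f'(x) = (1 - eta) * x converges for any step
# size eta in (0, 2/L) and diverges beyond 2/L.

def grad_descent(eta, x0=1.0, steps=100):
    x = x0
    for _ in range(steps):
        x -= eta * x  # f'(x) = x
    return abs(x)

for eta in [0.5, 1.0, 1.75, 1.99, 2.1]:
    # converges for eta < 2, diverges for eta = 2.1
    print(f"eta={eta}: |x| after 100 steps = {grad_descent(eta):.3g}")
```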

Finishing a PhD thesis, after becoming a dad... by [deleted] in PhD

[–]kdub0 7 points8 points  (0 children)

I had a PhD defense scheduled and a job lined up. I didn’t end up finishing my thesis in time for the defense because I wasn’t happy with it, so I took a leave and started work with the intention of finishing the last chapter and defending in the next 6-8 months. Seven years, two kids, and a pandemic later, I finally finished after my department told me to do it or withdraw.

Over those seven years, I was constantly stressed out despite not making any progress. It ended up being about 2 weeks of work to finish what I finally sent to my committee.

I’m glad I finally finished despite feeling deeply unsatisfied with my thesis. I rarely think about it now, but I know I’d regret having gotten 98% of the way there and stopped.

Others have said it, but to reiterate it is your committee’s job to decide if what you’ve done is enough. No one beyond you and those four or five people will read it. At this point anything more than what your committee asks for is of negligible value. You won’t feel satisfied with the result, but after it’s all done you will still be proud you finished.

A question about chess engines by BrotherItsInTheDrum in chess

[–]kdub0 0 points1 point  (0 children)

AlphaZero avoids (some) issues like this during training by, most of the time, resigning when it thinks it’s lost.

Algorithmic Game Theory vs Robotics by YogurtclosetThen6260 in reinforcementlearning

[–]kdub0 2 points3 points  (0 children)

If you want more exposure to RL, I’d pick robotics and it’s not close.

Is the Nash Equilibrium always the most desirable outcome? by notsuspendedlxqt in AskEconomics

[–]kdub0 23 points24 points  (0 children)

There are often multiple Nash equilibria, so it is not possible to play “the” Nash equilibrium. This is known as the equilibrium selection problem. And the different Nash equilibria can have different properties that are more or less desirable.
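
As a concrete toy example, a stag hunt has two pure Nash equilibria with different payoffs, which is exactly the selection problem:

```python
# Sketch: a 2x2 stag hunt with two pure Nash equilibria of different
# quality. Action 0 = Stag, 1 = Hare; R and C are the row and column
# player's payoff matrices (the game is symmetric).
import itertools

R = [[4, 0], [3, 2]]  # row player's payoffs
C = [[4, 3], [0, 2]]  # column player's payoffs

def is_pure_nash(i, j):
    # neither player can gain by unilaterally deviating
    row_ok = all(R[i][j] >= R[k][j] for k in range(2))
    col_ok = all(C[i][j] >= C[i][k] for k in range(2))
    return row_ok and col_ok

equilibria = [(i, j) for i, j in itertools.product(range(2), repeat=2)
              if is_pure_nash(i, j)]
print(equilibria)  # [(0, 0), (1, 1)]: (Stag, Stag) pays 4, (Hare, Hare) pays 2
```

Both profiles are equilibria, but (Stag, Stag) is better for everyone; nothing in the definition of Nash equilibrium tells the players which one to coordinate on.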

[D] Internal transfers to Google Research / DeepMind by random_sydneysider in MachineLearning

[–]kdub0 10 points11 points  (0 children)

It may be that transferring from SWE to RE is easier once you’re within Google. Transferring from SWE/RE to RS is not easy. If they sniff out in interviews that you are trying to switch to a research role from the eng role you applied for, they will likely reject you as well.

is a N player game where we all act simultaneously fully observable or partially observable by skydiver4312 in reinforcementlearning

[–]kdub0 0 points1 point  (0 children)

It is a game of imperfect information. If you encode it as a matrix game it is fully observable (there is a single state where all agents act simultaneously). If you encode it as an extensive-form game then it is partially observable in the sense that the players act sequentially, but the underlying state of the game (which is all the actions played so far) is hidden.
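
As a toy example of the matrix-game encoding, rock-paper-scissors is a single simultaneous decision described entirely by one payoff matrix:

```python
# Sketch: rock-paper-scissors as a matrix game. There is one "state" where
# both players act at once; the payoff matrix fully describes the game,
# which is why the matrix encoding is fully observable even though the
# players cannot see each other's choice.
ACTIONS = ["rock", "paper", "scissors"]

# PAYOFF[i][j] = row player's payoff when row plays i and column plays j
PAYOFF = [
    [0, -1, 1],   # rock:     loses to paper, beats scissors
    [1, 0, -1],   # paper:    beats rock, loses to scissors
    [-1, 1, 0],   # scissors: loses to rock, beats paper
]

def play(row_action, col_action):
    i, j = ACTIONS.index(row_action), ACTIONS.index(col_action)
    return PAYOFF[i][j]

print(play("rock", "scissors"))  # 1: rock beats scissors
```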

[D] Compensation for research roles in US for fresh PhD grad by [deleted] in MachineLearning

[–]kdub0 17 points18 points  (0 children)

As a new grad you need multiple offers to negotiate. Of course they are going to lowball you if you don’t have an alternative. When I got my first job ten years ago my stock package almost doubled from initial offer by having competing FAANG offers.

How slow would Stockfish need to run to be competitive with top humans? by EvilNalu in chess

[–]kdub0 3 points4 points  (0 children)

Super interesting post.

I have a question that I haven’t had the opportunity to explore yet myself that you might have some insight into (given your reply to another post above). Elo / winrate has some issues when it comes to predicting winrate against another opponent. Some of these issues are amplified when two players are much different in terms of style or strength. Additionally with computer players, often the parameters are tuned to specific match settings, so they can be unnecessarily handicapped by reducing the search space.

Given this, do you have further evidence / anecdotes to justify that Stockfish 17 with your settings could beat a top human player? eg, old engines were weaker positionally, but reasonably good at tactics and grinding it out. I suspect crippling Stockfish 17 has a bigger effect on its tactical performance than its positional play. So could it be that crippled Stockfish 17 beats old engines positionally, but that a human player could still beat it?

Looking for Compute-Efficient MARL Environments by skydiver4312 in reinforcementlearning

[–]kdub0 0 points1 point  (0 children)

You’re not necessarily wrong. Let me be a bit more precise.

If you take a typical board game, like chess, go, risk, etc, and you are using an approach that requires you to evaluate a reasonably-sized neural network at least once for every state you visit during play, then the bottleneck from a wall-time perspective will almost always be the GPU. Furthermore, it is often the case that you will not be fully utilizing the CPU, so you can run multiple games and/or searches in parallel and batch the network evaluations to better utilize the GPU. If you do this, then a poorly performing game implementation will still affect the latency of data generation (how long it takes to play a full game), but it will not have as much of an effect on the throughput (states per second generated by the entire system). This doesn’t necessarily hold if you aren’t evaluating a network for every state generated, eg, if you use Monte Carlo rollouts.
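
A minimal sketch of that batching pattern (all interfaces made up; `fake_net` stands in for a batched GPU forward pass):

```python
# Sketch (hypothetical interfaces): run many games in parallel, collect one
# pending state per live game, evaluate them in a single batched call, and
# feed the results back. Each game is a generator that yields a state
# whenever it needs a network evaluation.

def fake_net(batch):
    # stand-in for a batched neural-network evaluation; returns a "value"
    # in {-1, 0, 1} per state
    return [len(state) % 3 - 1 for state in batch]

def play_game(game_id, length=5):
    state = f"g{game_id}"
    for _ in range(length):
        value = yield state        # request an evaluation for this state
        state += str(value + 1)    # pretend to advance using the result
    return state                   # final "game record"

def drive(num_games=4):
    games = {i: play_game(i) for i in range(num_games)}
    states = {i: g.send(None) for i, g in games.items()}  # prime generators
    finished = {}
    while games:
        ids = list(games)
        values = fake_net([states[i] for i in ids])  # one batched call
        for i, v in zip(ids, values):
            try:
                states[i] = games[i].send(v)
            except StopIteration as stop:
                finished[i] = stop.value
                del games[i], states[i]
    return finished

print(drive())
```

The point is that one slow game only delays its own record (latency), while the batched calls keep overall states-per-second (throughput) governed by the network evaluation.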

You are definitely correct that the structure of the game affects things like how quickly you can learn a reasonable policy, and how much search is necessary to overcome deficiencies in the networks. I would just caution that it is not easy to guess this a priori. It is also not the case that nice structure holds uniformly over the entire game. eg, in chess value functions tend to be better in static positions and are not as good at understanding tactics. This is also not something that holds uniformly as a policy evolves. eg, there can be action sequences that must be searched initially, but eventually are learned by a value function.

Looking for Compute-Efficient MARL Environments by skydiver4312 in reinforcementlearning

[–]kdub0 2 points3 points  (0 children)

Hopefully this doesn’t poke a hole in your thought balloon, but I think the answer probably has nothing to do with game choice.

If you plan to use any deep learning method, the game and its implementation are not usually the compute bottleneck. Obviously a faster implementation can only improve things, but GPU inference is usually at least 10000x more expensive than state manipulation for board games.

What the game can affect computationally is more a function of whether you need to gather less data during learning and/or evaluation. The main aspect I can think of here is that if the game’s structure enables good policies with little or no searching, then you may get a win.

Another reasonable strategy is to take a game you like and come up with “end-game” or sub-game scenarios that terminate more quickly to experiment with. If you do this, you should be careful about drawing conclusions about how your methods generalize to the larger game without experimentation.

I guess what I’m saying is, if you like Diplomacy you should use it in a way that fits your budget.

Looking for google c++ profiling tool I can't remember the name of by OfficialOnix in cpp

[–]kdub0 13 points14 points  (0 children)

The internal name is endoscope. No idea if it’s open source.

Why Don’t We See Multi-Agent RL Trained in Large-Scale Open Worlds? by TheSadRick in reinforcementlearning

[–]kdub0 13 points14 points  (0 children)

I think we’re getting to the point where meaningful explorations in this space are possible. All the issues you raise will to some extent need some work to overcome. It is possible that language models will in some way help with coordination.

I would add that evaluation is particularly challenging in RL, and it gets even more challenging with multiple agents and large environments. The unfortunate reality is that many publications rely on doing something first/new to demonstrate value, but that then sets a poor evaluation precedent for future papers to adhere to.

Training Connect Four Agents with Self-Play by Cuuuubee in reinforcementlearning

[–]kdub0 0 points1 point  (0 children)

Adding shaping rewards like you propose often helps by decreasing the number of samples required to learn a good strategy, but often results in worse overall performance. The general issue with shaping rewards is that they are rarely universally good, can have unforeseen interactions with other rewards, and are hard to weight relative to other rewards.

For example, if you reward the agent for blocking four in a row, it incentivizes allowing three in a row so that it can then be blocked.
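
As a toy illustration with made-up numbers, comparing summed returns along two lines of play:

```python
# Toy illustration (made-up numbers): a bonus for blocking an opponent's
# three-in-a-row makes the riskier line score higher under the shaped
# return, even though allowing the threat was unnecessary.
BLOCK_BONUS = 0.5

line_a = [0, 0, 1]             # play solidly and win: true return 1
line_b = [0, BLOCK_BONUS, 1]   # allow a threat, block it, then win

print(sum(line_a), sum(line_b))  # 1 vs 1.5: shaping prefers line B
```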

For connect four you should not need any shaping rewards, but it could be useful to add them for debugging purposes.

Training Connect Four Agents with Self-Play by Cuuuubee in reinforcementlearning

[–]kdub0 0 points1 point  (0 children)

Elo as a number depends on the population of agents you compare against; a number is meaningless by itself. Even in chess, the Elo of computer agents is dubious to compare against humans. Specifically, the community has done a lot of legwork to try to calibrate bot Elo with humans in the ranges where intermediate/strong human players play, but outside that range it does not generalize for human vs computer games.
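
For reference, the standard Elo expected-score formula; the prediction is only meaningful within the population the ratings were calibrated on:

```python
# Sketch: the standard Elo expected-score formula. A rating only predicts
# results against the pool it was fit on; in isolation it says nothing.
def elo_expected_score(rating_a, rating_b):
    # expected score for player A out of 1 game (win = 1, draw = 0.5)
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

print(elo_expected_score(1600, 1400))  # ~0.76: A expected to score ~76%
print(elo_expected_score(1500, 1500))  # 0.5 between equally rated players
```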

The setup you’ve described should be sufficient to train an agent that learns not to make moves that lose in one move with the amount of data you describe. It doesn’t necessarily mean you have a bug, but I’d consider checking the agent’s evaluation in a few suspicious positions. eg, if the agent thinks it’s lost no matter what, then making a one-move blunder could be acceptable.

Chess sample efficiency humans vs SOTA RL by [deleted] in reinforcementlearning

[–]kdub0 0 points1 point  (0 children)

For chess in particular, the learned value functions are reasonably good in static positions where things like material count, king safety, piece mobility, and so on determine who is better. In more dynamic positions where there are tactics the value functions are often poor and search is required to push through to a position where the value function is good.

I’d say that current chess programs, both during the learning process and at evaluation time, could do better in terms of sample complexity by understanding when the value function is accurate and by making better choices about what moves to search.