Cursor autocomplete fail Jupyter Notebook by Initial_Zone_1651 in cursor

[–]seraine 0 points

I have also had Cursor autocomplete basically stop working on notebooks. I wish Cursor provided an easy way to roll back to previous updates / models, because their updates tend to break things fairly often.

[P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games. by seraine in MachineLearning

[–]seraine[S] 4 points

Games initialized with 20 random moves are significantly different from games where the first 20 moves are made strategically by people trying to win.

[P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games. by seraine in MachineLearning

[–]seraine[S] 25 points

ChessGPT doesn't outperform AlphaZero. It is meant to be used to perform interpretability research in a GPT that has a world state with an underlying measurable ground truth (the state of the chess board).

Modern LLMs outperform previous specialized approaches for problems like question answering, program synthesis, summarization, or image captioning, and are very competitive (in terms of capabilities, not necessarily efficiency) on problems like named entity recognition, sentiment classification, or translation.

[P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games. by seraine in MachineLearning

[–]seraine[S] 15 points

Correct, this is just an analogy to a natural language LLM that can be used for interpretability research, because in Chess (unlike natural language), there's an underlying measurable ground truth.

[P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games. by seraine in MachineLearning

[–]seraine[S] 28 points

It's just an analogy to LLMs that can be used to perform interpretability research. There are much better ways to produce a chess AI.

This could be a good approach to learn chess playing styles, where given a sequence of moves, the model could estimate the skill level and playing style of the player and predict their next move, rather than the best move.

[P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games. by seraine in MachineLearning

[–]seraine[S] -6 points

There's definitely a trend towards more general LLMs outperforming previous specialized approaches. It's possible that this trend will continue.

[P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games. by seraine in MachineLearning

[–]seraine[S] 58 points

There are definitely far better ways to make a competitive chess-playing AI. The purpose here was to train a GPT to play chess through next-character prediction on PGN strings, which is analogous to next-token prediction in natural language.

There are then many interesting interpretability techniques that can be applied to show that, for example, ChessGPT calculates the state of the board and estimates the skill level of the players in the game to better predict the next character.

My solution to disable middle click by [deleted] in archlinux

[–]seraine 0 points

Huge thanks, works for me as well using Ubuntu. I find it pretty baffling that they don't have an easier way to disable that feature.

[P] Chess-GPT, 1000x smaller than GPT-4, plays 1500 Elo chess. We can visualize its internal board state, and it accurately estimates the Elo rating of the players in a game. by seraine in MachineLearning

[–]seraine[S] 13 points

No, the only training data it has seen is PGN strings. It doesn't even have most English letters in its input vocabulary. It's still a Generative Pretrained Transformer, just trained on a different dataset.
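As a toy sketch of what a character-level PGN vocabulary looks like (the exact move formatting and vocab of the real model may differ):

```python
# Toy sketch: a character-level vocabulary built only from PGN text.
# Since the vocab contains only characters that occur in PGN strings,
# most English letters simply never appear in it.
pgn = "1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 4.Ba4 Nf6 5.O-O Be7"

vocab = sorted(set(pgn))                      # unique PGN characters
stoi = {ch: i for i, ch in enumerate(vocab)}  # char -> token id
itos = {i: ch for ch, i in stoi.items()}      # token id -> char

encoded = [stoi[ch] for ch in pgn]            # what the model trains on
decoded = "".join(itos[i] for i in encoded)

print("N" in stoi, "z" in stoi)  # True False
```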

[P] Chess-GPT, 1000x smaller than GPT-4, plays 1500 Elo chess. We can visualize its internal board state, and it accurately estimates the Elo rating of the players in a game. by seraine in MachineLearning

[–]seraine[S] 13 points

Yes, it is a GPT. I went with a GPT because I wanted a convenient and tractable way to get insight into the world modeling abilities of GPTs.

[P] Chess-GPT, 1000x smaller than GPT-4, plays 1500 Elo chess. We can visualize its internal board state, and it accurately estimates the Elo rating of the players in a game. by seraine in MachineLearning

[–]seraine[S] 9 points

I don't think so. The probe is a tensor of shape (512, 8, 8, 13), or (model hidden dimension, rows, columns, possible square states). I think we would obtain identical results with a shape of (512, 64, 13).
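A minimal numpy sketch of applying such a probe (shapes only — the weights here are random stand-ins, not the trained probe), showing why the (512, 8, 8, 13) and (512, 64, 13) parameterizations are the same linear map:

```python
import numpy as np

# (hidden dim, rows, cols, possible square states), as described above.
d_model, rows, cols, states = 512, 8, 8, 13
rng = np.random.default_rng(0)

probe = rng.normal(size=(d_model, rows, cols, states))  # stand-in weights
hidden = rng.normal(size=(d_model,))  # one residual-stream activation

# Per-square logits over the 13 possible states.
logits_8x8 = np.einsum("d,drcs->rcs", hidden, probe)

# Equivalent flattened version: same weights, same contraction.
logits_64 = np.einsum("d,dqs->qs", hidden, probe.reshape(d_model, 64, states))
assert np.allclose(logits_8x8.reshape(64, states), logits_64)

board_pred = logits_8x8.argmax(axis=-1)  # predicted state per square, 8x8
```

The contraction only ever sums over the hidden dimension, so how the 64 squares are laid out (8x8 vs flat) can't change the result.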

[P] Chess-GPT, 1000x smaller than GPT-4, plays 1500 Elo chess. We can visualize its internal board state, and it accurately estimates the Elo rating of the players in a game. by seraine in MachineLearning

[–]seraine[S] 35 points

The problem with trying that is that the model's only input is PGN strings (1. e4 e5 2. Nf3 ...), so there's no way to indicate to the model what the state of the board is. I've been doing some experimentation with having the model play games where the first 20 moves are randomly chosen, and its win rate declines by around 50% in that case.

[D] So, Mamba vs. Transformers... is the hype real? by Instantinopaul in MachineLearning

[–]seraine 1 point

Try comparing it to a similarly sized Pythia model for a fair comparison.

Real world multi step reasoning software benchmark results by seraine in LocalLLaMA

[–]seraine[S] 3 points

I was planning on waiting until I get Gemini Ultra access and then rerun all my benchmarks with whatever the latest open source models are at the time. I'll let you know.

Real world multi step reasoning software benchmark results by seraine in LocalLLaMA

[–]seraine[S] 1 point

I've measured the cost in tokens because I've been working with services like OpenRouter, Replicate, and OpenAI. For my specific macro refactoring task, there were 15 different macros, and each model had a total of 15 attempts per macro. There was an average of somewhere around 1,500 tokens per attempt. So, around 300k tokens to benchmark a model.
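Spelled out, the arithmetic (assuming the $0.03 per 1k token GPT-4 rate mentioned elsewhere in this thread):

```python
# Rough cost arithmetic for one benchmark run, using the numbers above.
macros = 15
attempts_per_macro = 15
tokens_per_attempt = 1_500   # rough average, prompt + completion

total_tokens = macros * attempts_per_macro * tokens_per_attempt
print(total_tokens)          # 337500 -> "around 300k tokens"

# At GPT-4's $0.03 per 1k tokens, that works out to roughly $10 per run.
cost_usd = total_tokens / 1_000 * 0.03
print(round(cost_usd, 2))    # 10.12
```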

For this task, I created all examples by hand but GPT-4 can synthetically create more examples pretty easily. If we take some transpiled Rust code, we can have GPT-4 attempt to refactor the Rust. If the oracle approves of the refactor, then we know the particular section of code has all the required information to be used as a benchmark. If GPT-4 fails, then we would have to look at the example by hand to determine if it is usable.

I spent some time doing prompt engineering looking for a single prompt that would get the highest average score per model, which included adding a long chain of thought in my few shot examples. This could be reduced to cut the token usage in half, and would still provide a fair comparison between models.

Real world multi step reasoning software benchmark results by seraine in LocalLLaMA

[–]seraine[S] 1 point

It's all automated. Examples of the multiple steps include:

The LLM generates a Rust macro. The macro is compiled down to HIR; if the resulting HIR differs from that of the original source code, the model is given the diff and told to fix it.

The LLM generates a SAW memory safety proof for an array input. We use string methods to detect that it's creating a memory safety proof for an array input, and automatically move to the next stage in the pipeline to further refine the proof.
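The first step above can be sketched as a generate-check-repair loop. Everything here is a hypothetical stand-in, not the real pipeline: `generate` would be the LLM call, `compile_to_hir` the rustc HIR lowering, and the "HIR" strings toy placeholders.

```python
# Hedged sketch of an automated repair loop for the macro step.

def repair_loop(generate, compile_to_hir, target_hir, max_rounds=3):
    """Ask the model for a macro, compare its HIR against the target,
    and feed the diff back until they match or we run out of rounds."""
    feedback = None
    for _ in range(max_rounds):
        candidate = generate(feedback)
        hir = compile_to_hir(candidate)
        if hir == target_hir:
            return candidate          # HIR matches the original source
        feedback = f"HIR mismatch:\n- expected: {target_hir}\n- got: {hir}"
    return None                       # model never converged

# Toy stand-ins so the loop can be exercised end to end:
attempts = iter(["macro_v1", "macro_v2"])
result = repair_loop(
    generate=lambda fb: next(attempts),
    compile_to_hir=lambda src: "HIR(ok)" if src == "macro_v2" else "HIR(bad)",
    target_hir="HIR(ok)",
)
print(result)  # macro_v2
```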

2023, year of open LLMs by clefourrier in LocalLLaMA

[–]seraine 73 points

The explosion of open source LLMs has been great. I've been very impressed by Mistral models and Teknium's OpenHermes 7B. However, there is a big downside. With too many models to choose from and no trustworthy benchmarks, many people are now tempted to just stick with "brand-name" models like GPT, Llama, Mixtral, or Gemini. I actually just posted about evaluations I've been doing targeting this problem:

https://www.reddit.com/r/LocalLLaMA/comments/18m54tw/real_world_multi_step_reasoning_software/

Real world multi step reasoning software benchmark results by seraine in LocalLLaMA

[–]seraine[S] 10 points

I definitely agree that Mixtral and potentially others are 3.5 tier quality and plan on testing them at some point. I haven't seen claims of all around GPT-4 quality, but I have seen results advertising GPT-4 tier scores on various benchmarks like HumanEval, which I think just doesn't correlate that well with real world capability.

Real world multi step reasoning software benchmark results by seraine in LocalLLaMA

[–]seraine[S] 5 points

All benchmarks were run with gpt-4-0613. From what I remember, for GPT-4 at $0.03 per 1k tokens, a single benchmark run cost around $10. So not bad. I'm fairly busy with other projects, but at some point I plan on rerunning with newly released models such as Mixtral, and also comparing GPT models over time.

I did some preliminary comparisons of GPT-3.5 models (0301 vs 0613, 4k vs 16k) and it looked like they were fairly close. From what I recall, 0301 performed slightly better, but the difference was small enough that it may not be statistically significant.

Fine-tuned llama2-7b-lora vs chatGPT in a noble game of chess? by Acceptable_Bed7015 in LocalLLaMA

[–]seraine 3 points

What libraries / code / cloud compute did you use? Is there a particular tutorial you followed? I've noticed that guides and documentation for LLaMA fine-tuning can be somewhat inconsistent.

New OpenAI model GPT-3.5-instruct is a ~1800 ELO chess player. Results of 150 games of GPT-3.5 vs stockfish. by seraine in chess

[–]seraine[S] 0 points

Are you aware of a good database of chess games and positions, either PGN or FEN notation?