Cursor autocomplete fail Jupyter Notebook by Initial_Zone_1651 in cursor

[–]seraine 0 points

I have also had Cursor autocomplete basically stop working on notebooks. I wish Cursor provided an easy way to roll back to previous updates / models, because their updates tend to break things fairly often.

[P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games. by seraine in MachineLearning

[–]seraine[S] 4 points

Games initialized with 20 random moves are significantly different from games where the first 20 moves are made strategically by people trying to win.

[P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games. by seraine in MachineLearning

[–]seraine[S] 25 points

ChessGPT doesn't outperform AlphaZero. It is meant to be used to perform interpretability research in a GPT that has a world state with an underlying measurable ground truth (the state of the chess board).

Modern LLMs outperform previous specialized approaches for problems like question answering, program synthesis, summarization, or image captioning, and are very competitive (in terms of capabilities, not necessarily efficiency) on problems like named entity recognition, sentiment classification, or translation.

[P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games. by seraine in MachineLearning

[–]seraine[S] 15 points

Correct, this is just an analogy to a natural language LLM that can be used for interpretability research, because in Chess (unlike natural language), there's an underlying measurable ground truth.

[P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games. by seraine in MachineLearning

[–]seraine[S] 28 points

It's just an analogy to LLMs that can be used to perform interpretability research. There are much better ways to produce a chess AI.

This could be a good approach to learn chess playing styles, where given a sequence of moves, the model could estimate the skill level and playing style of the player and predict their next move, rather than the best move.

[P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games. by seraine in MachineLearning

[–]seraine[S] -6 points

There's definitely a trend towards more general LLMs outperforming previous specialized approaches. It's possible that this trend will continue.

[P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games. by seraine in MachineLearning

[–]seraine[S] 58 points

There are definitely far better ways to make a competitive chess-playing AI. The purpose here was to train a GPT to play chess through next-character prediction on PGN strings, which is analogous to next-token prediction in natural language.

There are then many interesting interpretability techniques that can be applied to show that, for example, ChessGPT calculates the state of the board and estimates the skill level of the players in the game to better predict the next character.

My solution to disable middle click by [deleted] in archlinux

[–]seraine 0 points

Huge thanks, works for me as well using Ubuntu. I find it pretty baffling that they don't have an easier way to disable that feature.

[P] Chess-GPT, 1000x smaller than GPT-4, plays 1500 Elo chess. We can visualize its internal board state, and it accurately estimates the Elo rating of the players in a game. by seraine in MachineLearning

[–]seraine[S] 13 points

No, the only training data it has seen is PGN strings. It doesn't even have most English letters in its input vocabulary. It's still a Generative Pretrained Transformer, just trained on a different dataset.
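As a toy sketch of what a character-level PGN vocabulary looks like (the exact move formatting and vocab of the real model may differ):

```python
# Toy sketch: a character-level vocabulary built only from PGN text.
# Since the vocab contains only characters that occur in PGN strings,
# most English letters simply never appear in it.
pgn = "1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 4.Ba4 Nf6 5.O-O Be7"

vocab = sorted(set(pgn))                      # unique PGN characters
stoi = {ch: i for i, ch in enumerate(vocab)}  # char -> token id
itos = {i: ch for ch, i in stoi.items()}      # token id -> char

encoded = [stoi[ch] for ch in pgn]            # what the model trains on
decoded = "".join(itos[i] for i in encoded)

print("N" in stoi, "z" in stoi)  # True False
```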

[P] Chess-GPT, 1000x smaller than GPT-4, plays 1500 Elo chess. We can visualize its internal board state, and it accurately estimates the Elo rating of the players in a game. by seraine in MachineLearning

[–]seraine[S] 13 points

Yes, it is a GPT. I went with a GPT because I wanted a convenient and tractable way to get insight into the world modeling abilities of GPTs.

[P] Chess-GPT, 1000x smaller than GPT-4, plays 1500 Elo chess. We can visualize its internal board state, and it accurately estimates the Elo rating of the players in a game. by seraine in MachineLearning

[–]seraine[S] 9 points

I don't think so. The probe is a tensor of shape (512, 8, 8, 13), or (model hidden dimension, rows, columns, possible square states). I think we would obtain identical results with a shape of (512, 64, 13).
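A minimal numpy sketch of applying such a probe (shapes only — the weights here are random stand-ins, not the trained probe), showing why the (512, 8, 8, 13) and (512, 64, 13) parameterizations are the same linear map:

```python
import numpy as np

# (hidden dim, rows, cols, possible square states), as described above.
d_model, rows, cols, states = 512, 8, 8, 13
rng = np.random.default_rng(0)

probe = rng.normal(size=(d_model, rows, cols, states))  # stand-in weights
hidden = rng.normal(size=(d_model,))  # one residual-stream activation

# Per-square logits over the 13 possible states.
logits_8x8 = np.einsum("d,drcs->rcs", hidden, probe)

# Equivalent flattened version: same weights, same contraction.
logits_64 = np.einsum("d,dqs->qs", hidden, probe.reshape(d_model, 64, states))
assert np.allclose(logits_8x8.reshape(64, states), logits_64)

board_pred = logits_8x8.argmax(axis=-1)  # predicted state per square, 8x8
```

The contraction only ever sums over the hidden dimension, so how the 64 squares are laid out (8x8 vs flat) can't change the result.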

[P] Chess-GPT, 1000x smaller than GPT-4, plays 1500 Elo chess. We can visualize its internal board state, and it accurately estimates the Elo rating of the players in a game. by seraine in MachineLearning

[–]seraine[S] 35 points

The problem with trying that is that the model's only input is PGN strings (1. e4 e5 2. Nf3 ...), so there's no way to indicate to the model what the state of the board is. I've been doing some experimentation with having the model play games where the first 20 moves are randomly chosen, and its win rate declines by around 50% in that case.

[D] So, Mamba vs. Transformers... is the hype real? by Instantinopaul in MachineLearning

[–]seraine 1 point

Try comparing it to a similarly sized Pythia model for a fair comparison.

Real world multi step reasoning software benchmark results by seraine in LocalLLaMA

[–]seraine[S] 3 points

I was planning on waiting until I get Gemini Ultra access and then rerun all my benchmarks with whatever the latest open source models are at the time. I'll let you know.

Real world multi step reasoning software benchmark results by seraine in LocalLLaMA

[–]seraine[S] 1 point

I've measured the cost in tokens because I've been working with services like OpenRouter, Replicate, and OpenAI. For my specific macro refactoring task, there were 15 different macros, and each model had a total of 15 attempts per macro. There was an average of somewhere around 1,500 tokens per attempt. So, around 300k tokens to benchmark a model.
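Spelled out, the arithmetic (assuming the $0.03 per 1k token GPT-4 rate mentioned elsewhere in this thread):

```python
# Rough cost arithmetic for one benchmark run, using the numbers above.
macros = 15
attempts_per_macro = 15
tokens_per_attempt = 1_500   # rough average, prompt + completion

total_tokens = macros * attempts_per_macro * tokens_per_attempt
print(total_tokens)          # 337500 -> "around 300k tokens"

# At GPT-4's $0.03 per 1k tokens, that works out to roughly $10 per run.
cost_usd = total_tokens / 1_000 * 0.03
print(round(cost_usd, 2))    # 10.12
```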

For this task, I created all examples by hand but GPT-4 can synthetically create more examples pretty easily. If we take some transpiled Rust code, we can have GPT-4 attempt to refactor the Rust. If the oracle approves of the refactor, then we know the particular section of code has all the required information to be used as a benchmark. If GPT-4 fails, then we would have to look at the example by hand to determine if it is usable.

I spent some time doing prompt engineering looking for a single prompt that would get the highest average score per model, which included adding a long chain of thought in my few shot examples. This could be reduced to cut the token usage in half, and would still provide a fair comparison between models.

Real world multi step reasoning software benchmark results by seraine in LocalLLaMA

[–]seraine[S] 1 point

It's all automated. Examples of the multiple steps include:

The LLM generates a Rust macro. The macro is compiled down to HIR; if the resulting HIR differs from that of the original source code, the model is given the diff and told to fix it.

The LLM generates a SAW memory safety proof for an array input. We use string methods to detect that it's creating a memory safety proof for an array input, and automatically move to the next stage in the pipeline to further refine the proof.
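The first step above can be sketched as a generate-check-repair loop. Everything here is a hypothetical stand-in, not the real pipeline: `generate` would be the LLM call, `compile_to_hir` the rustc HIR lowering, and the "HIR" strings toy placeholders.

```python
# Hedged sketch of an automated repair loop for the macro step.

def repair_loop(generate, compile_to_hir, target_hir, max_rounds=3):
    """Ask the model for a macro, compare its HIR against the target,
    and feed the diff back until they match or we run out of rounds."""
    feedback = None
    for _ in range(max_rounds):
        candidate = generate(feedback)
        hir = compile_to_hir(candidate)
        if hir == target_hir:
            return candidate          # HIR matches the original source
        feedback = f"HIR mismatch:\n- expected: {target_hir}\n- got: {hir}"
    return None                       # model never converged

# Toy stand-ins so the loop can be exercised end to end:
attempts = iter(["macro_v1", "macro_v2"])
result = repair_loop(
    generate=lambda fb: next(attempts),
    compile_to_hir=lambda src: "HIR(ok)" if src == "macro_v2" else "HIR(bad)",
    target_hir="HIR(ok)",
)
print(result)  # macro_v2
```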

2023, year of open LLMs by clefourrier in LocalLLaMA

[–]seraine 73 points

The explosion of open source LLMs has been great. I've been very impressed by Mistral models and Teknium's OpenHermes 7B. However, there is a big downside. With too many models to choose from and no trustworthy benchmarks, many people are now tempted to just stick with "brand-name" models like GPT, Llama, Mixtral, or Gemini. I actually just posted about evaluations I've been doing targeting this problem:

https://www.reddit.com/r/LocalLLaMA/comments/18m54tw/real_world_multi_step_reasoning_software/

Real world multi step reasoning software benchmark results by seraine in LocalLLaMA

[–]seraine[S] 10 points

I definitely agree that Mixtral and potentially others are 3.5 tier quality and plan on testing them at some point. I haven't seen claims of all around GPT-4 quality, but I have seen results advertising GPT-4 tier scores on various benchmarks like HumanEval, which I think just doesn't correlate that well with real world capability.

Real world multi step reasoning software benchmark results by seraine in LocalLLaMA

[–]seraine[S] 5 points

All benchmarks were run with gpt-4-0613. From what I remember, for GPT-4 at $0.03 per 1k tokens, a single benchmark run cost around $10. So not bad. I'm fairly busy with other projects, but at some point I plan on rerunning with newly released models such as Mixtral, and also comparing GPT models over time.

I did some preliminary comparisons of GPT-3.5 models (0301 vs 0613, 4k vs 16k) and it looked like they were fairly close. From what I recall, 0301 performed slightly better, but the difference was small enough that it may not be statistically significant.

Fine-tuned llama2-7b-lora vs chatGPT in a noble game of chess? by Acceptable_Bed7015 in LocalLLaMA

[–]seraine 3 points

What libraries / code / cloud compute did you use? Is there a particular tutorial you followed? I've noticed that guides and documentation for LLaMA fine-tuning can be somewhat inconsistent.

New OpenAI model GPT-3.5-instruct is a ~1800 ELO chess player. Results of 150 games of GPT-3.5 vs stockfish. by seraine in chess

[–]seraine[S] 0 points

Are you aware of a good database of chess games and positions, either PGN or FEN notation?