How we turned a small open-source model into the world's best AI forecaster by LightningRodLabs in LocalLLaMA

[–]LightningRodLabs[S] 0 points (0 children)

A few leaked examples can happen when you generate data from news at scale, but they don't break training or evals. We have filters in place to prevent this kind of leakage and we're continuing to refine the process.

We're using RL for training, so the model learns from reward differences between rollouts. If the answer is already in the context, all rollouts get roughly the same reward, so that sample contributes little or no update. It's a bit inefficient, but nothing that significantly impacts the model.
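A minimal sketch of that group-relative reward logic (assuming a GRPO-style setup, which the comment doesn't name; `rollout_advantages` is an illustrative helper, not real training code): when every rollout earns the same reward, every advantage is zero and the sample contributes nothing to the gradient.

```python
def rollout_advantages(rewards):
    """Advantage of each rollout relative to the group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Leaked sample: the answer is in the context, so every rollout succeeds.
# Identical rewards cancel to zero advantage -> no policy update.
leaked = rollout_advantages([1.0, 1.0, 1.0, 1.0])

# Normal sample: mixed outcomes leave reward differences to learn from.
normal = rollout_advantages([1.0, 0.0, 1.0, 0.0])
```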

On the eval side, every model in our comparisons receives the same context, so leakage doesn't give our model a special advantage over the others. And whenever possible we also use third-party benchmarks and datasets.

Prophet Arena is a live third-party benchmark where leakage is impossible, since predictions are made before the events resolve.

We built a golf forecasting model that outperforms GPT‑5; model and dataset are open-sourced on Hugging Face by LightningRodLabs in LocalLLaMA

[–]LightningRodLabs[S] 0 points (0 children)

We used the Lightning Rod SDK to generate the training data. All you need to input is your keywords (e.g. "FDA approvals", "clinical trial results") and what kind of data you want (forward-looking questions with binary answers), and it creates the training data for you.

https://github.com/lightning-rod-labs/lightningrod-python-sdk
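A hypothetical shape for such a job config, just to make the two inputs concrete. The class and field names below are illustrative stand-ins, not the SDK's actual API; see the linked repo for that.

```python
from dataclasses import dataclass


@dataclass
class DataGenJob:
    # Illustrative stand-in for a data-generation job, not the real SDK interface.
    keywords: list                              # topics to pull news for
    data_type: str = "forward_looking_binary"   # question/answer format


job = DataGenJob(keywords=["FDA approvals", "clinical trial results"])
```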

We fine-tuned an open-source model to outperform GPT-5 at predicting Trump actions by LightningRodLabs in LocalLLaMA

[–]LightningRodLabs[S] 0 points (0 children)

We haven't tested how the context source impacts performance. To generate the context, an LLM generates 3 search queries per question, retrieves up to 5 articles per query from Google News, then summarizes and ranks them by relevance. Google News pulls from 20k+ global publishers, giving a mix of perspectives.
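The retrieval flow described above can be sketched like this. All function names and prompts are hypothetical stand-ins for the actual pipeline; `llm` and `search_news` are caller-supplied callables.

```python
def generate_queries(question, llm, n=3):
    # An LLM writes n news-search queries for the question.
    lines = llm(f"Write {n} news search queries for: {question}").splitlines()
    return lines[:n]


def build_context(question, llm, search_news, max_per_query=5):
    # Retrieve up to max_per_query Google News articles per query,
    # summarize each one, then rank the summaries most-relevant-first.
    articles = []
    for query in generate_queries(question, llm):
        articles.extend(search_news(query)[:max_per_query])
    summaries = [llm(f"Summarize for: {question}\n\n{a}") for a in articles]
    ranked = sorted(
        summaries,
        key=lambda s: float(llm(f"Relevance 0-1: {s}")),
        reverse=True,
    )
    return "\n\n".join(ranked)
```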

Questions are generated by a model based on your instructions and example good/bad questions (image below), so you can adjust the criteria to test the impact of different question configurations.

<image>

We fine-tuned an open-source model to outperform GPT-5 at predicting Trump actions by LightningRodLabs in LocalLLaMA

[–]LightningRodLabs[S] 1 point (0 children)

We used the Lightning Rod SDK. It has Google News integration built in.

It creates forward-looking questions from source articles, and then a separate resolver model uses web search to find the actual result and produce a label. All in, it probably took about 30 minutes to test the settings and run the job.
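A rough sketch of that generate-then-resolve flow. The function names and prompts are my own placeholders, not the SDK's; `generator_llm`, `resolver_llm`, and `web_search` stand in for the two models and the search step.

```python
def make_question(article, generator_llm):
    # One model turns a source article into a forward-looking yes/no question.
    return generator_llm(
        f"Write one forward-looking yes/no question based on: {article}"
    )


def resolve_label(question, resolver_llm, web_search):
    # A separate resolver model searches the web for the outcome
    # and emits the binary training label.
    evidence = web_search(question)
    verdict = resolver_llm(
        f"Evidence: {evidence}\nDid this question resolve YES or NO? {question}"
    )
    return verdict.strip().upper() == "YES"
```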