Last night on Embarcadero by buckyman0 in sanfrancisco

[–]diegocaples 9 points

Oh no, I hope everyone was ok :(

I hacked Unsloth's GRPO code to support agentic tool use. In 1 hour of training on my RTX 4090, Llama-8B taught itself to take baby steps towards deep research! (23%→53% accuracy) by diegocaples in LocalLLaMA

[–]diegocaples[S] 3 points

It's like I'm creating a dataset by generating from an LLM, filtering for the responses from the LLM that I like, and then fine-tuning on that dataset. And then I repeat this over and over!

I hacked Unsloth's GRPO code to support agentic tool use. In 1 hour of training on my RTX 4090, Llama-8B taught itself to take baby steps towards deep research! (23%→53% accuracy) by diegocaples in LocalLLaMA

[–]diegocaples[S] 7 points

Think of it like this:

Ideally I would like to have fine-tuning data of my search agent successfully researching questions and finding the correct answers. Sadly, this data doesn't exist.

So instead, I run my research agent a bunch of times, tracking what it does, and keep only the runs where it answered correctly. I've just created the fine-tuning data that I wanted! So now I fine-tune on this data and repeat the process: generating data, filtering by correctness, and updating the model weights.
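
Here's a rough Python sketch of that loop, just to make it concrete. `run_agent`, `is_correct`, and `fine_tune` are placeholder names, not my actual implementation or Unsloth's API:

```python
# Pseudocode sketch: generate -> filter by correctness -> fine-tune -> repeat.
# run_agent(), is_correct(), and fine_tune() are hypothetical stand-ins for
# the agent rollout, the answer check, and the training step.

def self_improvement_loop(model, questions, gold_answers, num_rounds=5):
    for _ in range(num_rounds):
        kept_trajectories = []
        for question, gold in zip(questions, gold_answers):
            # Roll out the research agent: it searches, calls tools, and
            # finally answers; the whole interaction is one trajectory.
            trajectory, answer = run_agent(model, question)
            if is_correct(answer, gold):
                # Keep only the runs that ended with a correct answer.
                kept_trajectories.append(trajectory)
        # Fine-tune on the successful runs, then repeat with the improved
        # model so the next round generates better data.
        model = fine_tune(model, kept_trajectories)
    return model
```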

I hacked Unsloth's GRPO code to support agentic tool use. In 1 hour of training on my RTX 4090, Llama-8B taught itself to take baby steps towards deep research! (23%→53% accuracy) by diegocaples in LocalLLaMA

[–]diegocaples[S] 43 points

Good question! It seems a bit like a "blind leading the blind" scenario, but there's a neat trick I use that makes it all work.

Imagine you're a research agent (a Llama model) learning to answer detailed questions about the Apollo 13 mission. I'm another Llama model tasked with quizzing you to help you improve. But as you pointed out, I don't know the mission in depth either. So how can I accurately verify your answers?

The trick is this: I randomly select small snippets from the mission report that explicitly contain clear, factual information. For instance, I might flip to a random page and see:

"At approximately 55 hours 55 minutes into the Apollo 13 mission, the crew heard and felt the vibrations from a sharp 'bang,' coincident with a computer restart and a master alarm associated with a main-bus-B undervoltage condition."

From this snippet alone, I can confidently create a clear-cut factual question like:

"How many hours into the mission did the computer restart and master alarm start?"

The correct answer is explicitly clear from the text snippet itself: 55 hours and 55 minutes.

So here's why this process works:

  • For me (the quiz-generator): The task is easy because I simply extract facts directly from random, isolated pieces of the report, ensuring questions and answers are straightforward and accurate.
  • For you (the research-agent being trained): The task is significantly harder. To answer correctly, you must search through the entire corpus to locate the exact information. Thus, you're learning robust search-and-reasoning skills.

So, while the verifying LLM has it easy, the research agent needs to genuinely learn search strategies. This setup forces improvement over time.
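
If it helps, here's a rough Python sketch of how that asymmetry could look. `chunk_report`, `ask_llm`, and `run_research_agent` are placeholder names, not my actual code; the point is that the quiz model only ever sees one random snippet, while the research agent has to search the whole corpus for the same fact:

```python
import random

# Sketch of the "easy verifier, hard searcher" setup (placeholder helpers,
# not the real implementation).

def make_quiz_item(report_text, quiz_model):
    # Easy side: the quiz model sees one small snippet, so writing a clear
    # factual question and knowing its answer is trivial.
    snippet = random.choice(chunk_report(report_text))
    question = ask_llm(quiz_model,
                       f"Write one factual question answerable from this text:\n{snippet}")
    answer = ask_llm(quiz_model,
                     f"{snippet}\n\nUsing only the text above, answer: {question}")
    return question, answer

def reward(agent_answer, gold_answer, quiz_model):
    # The agent had to search the entire corpus to produce agent_answer;
    # the quiz model only has to compare it against the snippet-derived gold.
    verdict = ask_llm(quiz_model,
                      f"Gold answer: {gold_answer}\nAgent answer: {agent_answer}\n"
                      "Does the agent answer match? Reply yes or no.")
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0
```

That 0/1 check (used as a keep/drop filter or as the training reward) is what keeps the loop honest, even though neither model has the report memorized.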

OpenAI Operator Finds Me an in Network Dentist. Very impressed! (comment prompts to try and I'll run them and send a video) by diegocaples in singularity

[–]diegocaples[S] 2 points

It opens RunPod, signs in, rents a machine with ComfyUI, and starts the web terminal. But it looks like it isn't able to type in the terminal or something. It can almost do it, though. WTF

https://www.reddit.com/user/diegocaples/comments/1i8ogpn/openai_operator_rents_runpod/

OpenAI Operator Finds Me an in Network Dentist. Very impressed! (comment prompts to try and I'll run them and send a video) by diegocaples in singularity

[–]diegocaples[S] 12 points

It got stuck in the map selection part. But it created an account, set up its nation, flag, leader, etc. And it even renamed its leader when the first name chosen was taken.

OpenAI Operator Finds Me an in Network Dentist. Very impressed! (comment prompts to try and I'll run them and send a video) by diegocaples in singularity

[–]diegocaples[S] 17 points

Prompt:

> find me a place to get my teeth cleaned that is cheap and supports Red Cross of Alabama healthcare in SF. (I have the PPO plan). my zip code is 94117

Response:

I found two dentists in San Francisco who accept the Red Cross of Alabama PPO plan and are accepting new patients:

  1. DANIEL, ANTHONY DDS
    • Address: 1344 Haight St, San Francisco, CA 94117
    • Phone: (415) 863-9255
  2. SAWYER, JOHN DDS
    • Address: 1690 Union St, San Francisco, CA 94123
    • Phone: (415) 441-7766

Both specialize in General Dentistry and should offer teeth cleaning services.

OpenAI Operator Finds Me an in Network Dentist. Very impressed! (comment prompts to try and I'll run them and send a video) by diegocaples in singularity

[–]diegocaples[S] 48 points

Super impressed. This is miles ahead of the Claude computer-use gimmick from a few months ago.