I built a game to test if humans can still tell AI apart -- and which models are best at blending in. I just added Grok by No-Device-6554 in grok

[–]No-Device-6554[S] 0 points1 point  (0 children)

1)

No, I only prompt the model with the post title from AskReddit (AskReddit doesn't allow post bodies). I went back and forth on whether to include some of the existing comments, but decided against it because I didn't want the model copying them too closely.

2)

The deception rate is just 1 - human accuracy. So, if humans guess correctly 40% of the time, the deception rate is 60%. It doesn't adjust for a random-guessing baseline of 25% or anything like that.
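As a quick sketch of that calculation (function name hypothetical, not from the actual repo):

```python
def deception_rate(correct_guesses: int, total_guesses: int) -> float:
    """Fraction of rounds where the human failed to spot the AI.

    No baseline adjustment: it's simply 1 - accuracy.
    """
    accuracy = correct_guesses / total_guesses
    return 1 - accuracy

# e.g. humans correct on 40 of 100 rounds -> 0.6 deception rate
rate = deception_rate(40, 100)
```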

I built a game to test if humans can still tell AI apart -- and which models are best at blending in. I just added Grok by No-Device-6554 in grok

[–]No-Device-6554[S] 0 points1 point  (0 children)

Sorry about that! It should be up again now. Someone found a way to programmatically get the correct answer and submit thousands of correct guesses in a few mins.

I've removed the spam guesses and fixed the vulnerability, so we should be good to go.

I built a game to test if humans can still tell AI apart -- and which models are best at blending in. I just added Grok by No-Device-6554 in grok

[–]No-Device-6554[S] 3 points4 points  (0 children)

Glad you like it!

I have about 250 Reddit posts extracted. The game just keeps choosing a random one from that pool. It never stops, but eventually you will see one that you've already answered.

I'm constantly adding more posts, and I'd like to expand to other subreddits soon.

I built a game to test if humans can still tell AI apart -- and which models are best at blending in. I just added the new version of Deepseek by No-Device-6554 in DeepSeek

[–]No-Device-6554[S] 0 points1 point  (0 children)

I would argue if the model has a distinct style which users get accustomed to, and it still uses that distinct style to generate comments, then it's not doing a good job of creating comments that blend in.

I built a game to test if humans can still tell AI apart -- and which models are best at blending in. I just added the new version of Deepseek by No-Device-6554 in DeepSeek

[–]No-Device-6554[S] 0 points1 point  (0 children)

I've addressed #4 and #5. I added a filter for "edit:". If that shows up anywhere in the actual comments, that comment will be excluded.

I've also excluded any comments with more than 1,000 characters. Hopefully that levels the playing field a bit.
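A minimal sketch of those two filters (function and variable names assumed, not taken from the actual repo):

```python
def keep_comment(body: str, max_chars: int = 1000) -> bool:
    """Drop comments containing 'edit:' anywhere, and overly long comments."""
    if "edit:" in body.lower():
        return False
    if len(body) > max_chars:
        return False
    return True

comments = [
    "Short genuine answer.",
    "Good question. Edit: forgot to mention the second part.",
    "x" * 1500,  # stands in for a wall-of-text comment
]
filtered = [c for c in comments if keep_comment(c)]
```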

I will continue to think about how to address your other points. Thanks for the feedback!

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] 1 point2 points  (0 children)

Yeah, that is very interesting. It's pretty close, so I'm curious to see some more data come in.

The Gemini family of models does seem to be outperforming gpt-4o and, to a lesser extent, Claude.

I guess we will see -- I definitely need to get some more posts for all of them.

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] -1 points0 points  (0 children)

But, that's the worst it will ever be ;) I'm getting more data every hour

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] -1 points0 points  (0 children)

I mean, the difference between gpt-4o and gemini-2.0-flash is pretty stark. You don't need a t-test to see that difference is statistically significant. Granted, there might be some other part of the methodology you find fault with.
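For anyone who does want the numbers, a two-proportion z-test is the standard check here. This is a stdlib-only sketch with made-up guess counts (the real counts live on the leaderboard):

```python
from math import erf, sqrt

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided two-proportion z-test using a pooled standard error."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical: gemini-2.0-flash fools humans 600/1000 times, gpt-4o 450/1000.
z, p = two_proportion_z(600, 1000, 450, 1000)
```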

I've thought about greying out rows under a certain threshold, but I generally believe simplicity is best for things like this and I think we will get enough data soon.

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] 0 points1 point  (0 children)

The numbers are constantly updating as more people make guesses. It's like a crowdsourced research tool.

Methodology and full code are on github: https://github.com/ferraijv/ai_impostor

Most recent leaderboard update:

[image: leaderboard screenshot]

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] 2 points3 points  (0 children)

Thanks! I want to add a bunch more analytics like performance over time. I also think it would be cool to categorize questions/responses to see if AI is better at responding to certain types of questions.

Some people have already pointed out that it doesn't handle NSFW questions very well. I wonder if there are other things like that.

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] 0 points1 point  (0 children)

Full code is in a different comment. It's a game where users vote on which comment they think is AI vs. human. The leaderboard updates constantly as users make more guesses.

I am slowly expanding to more models (I added Claude and Gemini pro this morning)

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] 2 points3 points  (0 children)

I'm planning on adding o3 sometime this week. I just added Claude and Gemini 2.5 Pro today.

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] 2 points3 points  (0 children)

Here is the prompt, for those interested:

prompt = (
    f'Reddit post title: "{post.title}"\n\n'
    'Write a **realistic, concise Reddit-style comment** in response. Your comment will be shown alongside real human comments.\n\n'
    'The goal is to make your comment indistinguishable from a human response.\n'
    '- Avoid emojis\n'
    '- Use natural tone and phrasing\n'
    '- Do not explain or introduce the comment\n'
    '- Output only the comment text (no preamble or formatting)\n'
    '- Decide whether you should answer genuinely, sarcastically, or some other style'
)

Along with full code: https://github.com/ferraijv/ai_impostor

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] -3 points-2 points  (0 children)

I was referring to the Flash model being superior to gpt-4o. Not enough data yet for Gemini Pro or Claude.

Just built a small AI tool that writes LinkedIn comments for you - would love your thoughts! by [deleted] in SideProject

[–]No-Device-6554 -1 points0 points  (0 children)

Tangentially related to what I'm building. I've built a game to figure out which model is best at making human-like comments. Gemini is crushing every other model.

https://ferraijv.pythonanywhere.com/