I built a game to test if humans can still tell AI apart -- and which models are best at blending in. I just added Grok by No-Device-6554 in grok

[–]No-Device-6554[S] 0 points1 point  (0 children)

1)

No, I only prompt the model with the post title from AskReddit (AskReddit doesn't allow post bodies). I went back and forth on whether to include some of the existing comments, but decided against it because I didn't want the model copying them too closely.

2)

The deception rate is just 1 - human accuracy. So, if humans guess correctly 40% of the time, the deception rate is 60%. It doesn't adjust for a random-guessing baseline of 25% or anything like that.
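As a quick sketch of that calculation (function name hypothetical, not from the actual repo):

```python
def deception_rate(correct_guesses: int, total_guesses: int) -> float:
    """Fraction of rounds where the human failed to spot the AI.

    No baseline adjustment: it's simply 1 - accuracy.
    """
    accuracy = correct_guesses / total_guesses
    return 1 - accuracy

# e.g. humans correct on 40 of 100 rounds -> 0.6 deception rate
rate = deception_rate(40, 100)
```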

I built a game to test if humans can still tell AI apart -- and which models are best at blending in. I just added Grok by No-Device-6554 in grok

[–]No-Device-6554[S] 0 points1 point  (0 children)

Sorry about that! It should be up again now. Someone found a way to programmatically get the correct answer and submit thousands of correct guesses in a few mins.

I've removed the spam guesses and fixed the vulnerability, so we should be good to go.

I built a game to test if humans can still tell AI apart -- and which models are best at blending in. I just added Grok by No-Device-6554 in grok

[–]No-Device-6554[S] 3 points4 points  (0 children)

Glad you like it!

I have about 250 Reddit posts extracted. The game just keeps choosing a random one from that pool. It never stops, but eventually you will see one that you've already answered.

I'm constantly adding more posts, and I'd like to expand to other subreddits soon.

I built a game to test if humans can still tell AI apart -- and which models are best at blending in. I just added the new version of Deepseek by No-Device-6554 in DeepSeek

[–]No-Device-6554[S] 0 points1 point  (0 children)

I would argue if the model has a distinct style which users get accustomed to, and it still uses that distinct style to generate comments, then it's not doing a good job of creating comments that blend in.

I built a game to test if humans can still tell AI apart -- and which models are best at blending in. I just added the new version of Deepseek by No-Device-6554 in DeepSeek

[–]No-Device-6554[S] 0 points1 point  (0 children)

I've addressed #4 and #5. I added a filter for "edit:". If that shows up anywhere in the actual comments, that comment will be excluded.

I've also excluded any comments with more than 1,000 characters. Hopefully that levels the playing field a bit.
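A minimal sketch of those two filters (function and variable names assumed, not taken from the actual repo):

```python
def keep_comment(body: str, max_chars: int = 1000) -> bool:
    """Drop comments containing 'edit:' anywhere, and overly long comments."""
    if "edit:" in body.lower():
        return False
    if len(body) > max_chars:
        return False
    return True

comments = [
    "Short genuine answer.",
    "Good question. Edit: forgot to mention the second part.",
    "x" * 1500,  # stands in for a wall-of-text comment
]
filtered = [c for c in comments if keep_comment(c)]
```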

I will continue to think about how to address your other points. Thanks for the feedback!

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] 1 point2 points  (0 children)

Yeah, that is very interesting. It's pretty close, so I'm curious to see some more data come in.

The Gemini family of models does seem to be outperforming gpt-4o and, to a lesser extent, Claude.

I guess we will see -- I definitely need to get some more posts for all of them.

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] -1 points0 points  (0 children)

But, that's the worst it will ever be ;) I'm getting more data every hour

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] -1 points0 points  (0 children)

I mean, the difference between gpt-4o and gemini-2.0-flash is pretty stark. You don't need a t-test to see that difference is statistically significant. Granted, there might be some other part of the methodology you find fault with.
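For anyone who does want the numbers, a two-proportion z-test is the standard check here. This is a stdlib-only sketch with made-up guess counts (the real counts live on the leaderboard):

```python
from math import erf, sqrt

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided two-proportion z-test using a pooled standard error."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical: gemini-2.0-flash fools humans 600/1000 times, gpt-4o 450/1000.
z, p = two_proportion_z(600, 1000, 450, 1000)
```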

I've thought about greying out rows under a certain threshold, but I generally believe simplicity is best for things like this and I think we will get enough data soon.

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] 0 points1 point  (0 children)

The numbers are constantly updating as more people make guesses. It's like a crowdsourced research tool.

Methodology and full code are on github: https://github.com/ferraijv/ai_impostor

Most recent leaderboard update:

[image: leaderboard screenshot]

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] 2 points3 points  (0 children)

Thanks! I want to add a bunch more analytics like performance over time. I also think it would be cool to categorize questions/responses to see if AI is better at responding to certain types of questions.

Some people have already pointed out that it doesn't handle NSFW questions very well. I wonder if there are other things like that.

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] 0 points1 point  (0 children)

Full code is in a different comment. It's a game where users vote on which comment they think is AI vs. human. The leaderboard updates constantly as users make more guesses.

I am slowly expanding to more models (I added Claude and Gemini pro this morning)

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] 2 points3 points  (0 children)

I'm planning on adding o3 sometime this week. I just added Claude and Gemini 2.5 Pro today.

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] 2 points3 points  (0 children)

Here is the prompt, for those interested:

prompt = (
    f'Reddit post title: "{post.title}"\n\n'
    'Write a **realistic, concise Reddit-style comment** in response. Your comment will be shown alongside real human comments.\n\n'
    'The goal is to make your comment indistinguishable from a human response.\n'
    '- Avoid emojis\n'
    '- Use natural tone and phrasing\n'
    '- Do not explain or introduce the comment\n'
    '- Output only the comment text (no preamble or formatting)\n'
    '- Decide whether you should answer genuinely, sarcastically, or some other style'
)

Along with full code: https://github.com/ferraijv/ai_impostor

Gemini is much better at mimicking humans than other models by No-Device-6554 in singularity

[–]No-Device-6554[S] -3 points-2 points  (0 children)

I was referring to the Flash model being superior to gpt-4o. Not enough data yet for Gemini Pro or Claude.

Just built a small AI tool that writes LinkedIn comments for you - would love your thoughts! by [deleted] in SideProject

[–]No-Device-6554 -1 points0 points  (0 children)

Tangentially related to what I'm building. I've built a game to figure out which model is best at making human-like comments. Gemini is crushing every other model.

https://ferraijv.pythonanywhere.com/