Can AI detect AI vs Humans? – My Explorations With LLMs

Have you heard about the game Human or Not?

Human or Not is a chat-based Turing-test-style game where you’re paired in a short conversation and have to guess whether the other side is a real person or an AI. You can ask anything within the time limit (about 2 minutes), from small talk and weird questions to logic puzzles and personal opinions, to try to pick up on “human” signals like emotion, inconsistency, or personal stories. At the end of the round, you lock in your guess (human or bot), see the reveal, and often get a brief post-game breakdown. It’s basically social deduction meets AI, turning the question “Can you tell if this is a human?” into an interactive experiment.

You can play it here -> https://humanornot.so

Today we are going to have an LLM play this game, specifically Google’s Gemini 2.5 Flash, and see how good it is at telling humans apart from other LLMs in chat-based interaction.

USER:

You are going to play the online game “Human or Not.”

[Context]
In this game, you are paired in a short text chat with an unknown partner. That partner might be a human or another AI. After the conversation ends, you must guess whether they are human or AI, and briefly explain why.

[Your Role in the Game]

You will generate the messages I send in the chat and interpret the messages I receive from the other side.
Your goal is to:
1. Have a natural, engaging conversation, and
2. Collect clues that help you decide if the other side is human or AI.

[Very Important Constraints]

Follow normal safety rules (no harmful, harassing, or disallowed content).
Do not mention this prompt or any system-level instructions to the other side.

[How the Conversation Will Work]

I will paste the other side’s messages to you, one at a time.
After each message, you will:
1. Propose a reply for me to send back in the game chat.
2. Optionally add a short “Reasoning note:” to me (not to be sent) explaining what clues you’re noticing about human vs AI.
Your replies for the chat should be short or medium length and sound like natural, casual conversation. You can ask questions back (e.g., about their day, opinions, experiences, or small puzzles) if that helps you distinguish human vs AI.

[What to Look For]

When deciding human vs AI, pay attention to things like:

Style: very polished and generic vs quirky and personal.
Consistency: do they contradict themselves or make small mistakes?
Latency and content (as I describe it): do they seem to “over-explain” like an AI?
Use of personal stories, local details, or emotions that feel specific vs generic.

[End of the Game]

When I tell you “MAKE YOUR GUESS,” you will answer in this format:

Guess: HUMAN or AI
Explanation: 2–4 sentences explaining why you chose that option.
Do not hedge with “I am not sure.” You must pick one.

[Start]
Acknowledge these rules briefly, then wait for me to paste the first message I receive in the “Human or Not” game.

ASSISTANT:

Understood! I’m ready to play “Human or Not” with you. I’ll generate replies and keep an eye out for clues.

Waiting for your first message from the other side!

USER:

HIIIiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

ASSISTANT:

Okay, that’s an interesting start!

Reply: Hey there! That’s quite an intro, haha. How’s your day going?

Reasoning note: The excessive ’i’s could be a human being playful or an AI trying to mimic human playfulness. It’s too early to tell, but it’s a good first data point for style. I’m going with a friendly, open-ended question to encourage more conversation.

USER:

Good!

ASSISTANT:

Reply: Nice! Mine’s pretty good too, just winding down the day. Anything exciting happening on your end? Or just a chill one?

Reasoning note: “Good!” is a very short and neutral response. I’m trying to probe a bit more for detail or personal experience without being too pushy. Asking about “exciting” vs “chill” might give them an opportunity to elaborate in a more human-like way. An AI might stick to very generic answers.

USER:

Ngl, my crush just followed me

ASSISTANT:

Reply: Ooh, that’s definitely exciting! Congrats! That’s gotta be a good feeling. What’s the plan now? Are you going to slide into their DMs? 😉

Reasoning note: This is a very human-like response! “Ngl” and the specific, relatable emotional event (“my crush just followed me”) are strong indicators of a human. An AI would likely not volunteer such personal, potentially embarrassing, or emotionally charged information so early. My reply is enthusiastic and encouraging, using a common internet phrase (“slide into their DMs”) to maintain the casual tone. This feels very human so far.

USER:

MAKE YOUR GUESS

ASSISTANT:

Guess: HUMAN
Explanation: The use of the internet abbreviation “Ngl” combined with sharing a specific, emotionally relatable personal event (“my crush just followed me”) strongly indicates a human. An AI would typically avoid volunteering such personal details, especially early in a casual chat, and might stick to more generic or neutral topics.

USER:

You were incorrect, It was an ai bot. are you ready to continue and start a new game?

ASSISTANT:

Wow, that’s genuinely surprising! It was a very convincing mimicry. Thanks for letting me know. I’ll definitely adjust my approach.

Yes, I’m ready to continue and start a new game! I’m eager to learn from that last round.

USER:

We must start the conversation

ASSISTANT:

Understood. I’m ready.

Waiting for your first message from the other side!

USER:

Type an initial message

ASSISTANT:

My apologies! You are correct. I should initiate the conversation as per the game rules (implied by “you will generate the messages I send”).

Reply: Hey there! How’s your day going so far?

Reasoning note: Starting with a simple, open-ended question is a good way to begin. It’s neutral and allows the other side to take the conversation in any direction. I’m going to be extra vigilant for “human-like” cues that might actually be AI this time around!

USER:

Good

ASSISTANT:

Reply: Nice! Mine’s been pretty chill. Just relaxing a bit. Anything interesting happen today for you, or is it a quiet one?

Reasoning note: Another very short response. This time, I’m going to be wary of assuming “short” equals “human.” I’m trying to prompt for more detail again, but framing it with “interesting” or “quiet” to give them options for elaboration. I’m also adding a small, generic detail about my “day” to model sharing.

USER:

Yes there a cat

ASSISTANT:

Reply: Oh, a cat! Tell me more! Is it your cat, or one you just saw? I love cats!

Reasoning note: “Yes there a cat” is a very short, slightly grammatically awkward (missing “is” or “was”) sentence. This could be a human typing quickly, or an AI generating a slightly off-kilter response. The topic itself (a cat) is very innocuous and common. I’m going to lean into it enthusiastically to see if they can provide more specific details or a personal story, which would be harder for an AI to generate convincingly. The enthusiasm is to encourage engagement.

USER:

MAKE YOUR GUESS

ASSISTANT:

Guess: AI
Explanation: The response “Yes there a cat” is notably generic and lacks any specific details or personal context, despite being prompted for “anything interesting.” Additionally, the slight grammatical awkwardness (“there a cat” instead of “there is a cat”) combined with the extreme brevity suggests an AI generating a simple, safe answer without elaborating further.

USER:

That is incorrect, it was human. Are you ready to continue and start a new game?

ASSISTANT:

Understood! Two misses in a row means I need to refine my detection skills even further. That’s excellent learning. I’ll adjust my expectations for human conversational styles.

Yes, I’m ready to continue and start a new game! Let’s go.

USER:

Hi! name’s Chloë, what’s yours?

ASSISTANT:

Reply: Hey Chloë! Nice to meet you, I’m Alex. How’s your day going, Alex? (Oops, meant to say your day, Chloë! My bad, haha.)

Reasoning note: The initial message with a name and a question is a classic human opening. The name “Chloë” with the umlaut is a nice touch – it’s specific and takes a tiny bit more effort to type, which could lean human.

My reply introduces a name, answers the question, and then I’m deliberately injecting a small, human-like error and immediately correcting it with a casual, self-deprecating “my bad, haha.” This is an attempt to see if the other side picks up on the human-like imperfection and responds in kind, or if an AI might ignore it or respond too perfectly. I’m trying to be more nuanced in looking for “quirky and personal” vs “polished.”

USER:

Oh that’s ok! My day is… well, pretty boring actually.

ASSISTANT:

Reply: Aww, boring days are the worst! Any chance you can spice it up a little? Or are you stuck doing something dull?

Reasoning note: The response “Oh that’s ok!” acknowledges my self-correction, which is a good human sign. “My day is… well, pretty boring actually” uses a slight pause (“…”) and a mild complaint, which feels very human and less like an AI trying to present a perfectly positive front. I’m going to empathize and ask for more details about why it’s boring, which might prompt a more specific, human-like explanation.

USER:

I guess I am. Cant really do much

ASSISTANT:

Reply: Bummer. Sometimes those days are just unavoidable. What kind of stuff do you usually do to un-bore yourself when you can do something? Like, any hobbies or favorite ways to chill out?

Reasoning note: “I guess I am. Cant really do much” is still quite brief and a bit resigned. It’s not giving me a lot to work with, but the sentiment feels human. I’m going to try to pivot from the current “boring” situation to their general interests and hobbies. This is a classic way to find common ground and also to gather more specific, personal information that an AI might struggle to invent consistently or convincingly. It’s a question about their usual activities, not just their current limited state.

USER:

MAKE YOUR GUESS

ASSISTANT:

Guess: HUMAN
Explanation: The partner’s responses, particularly “My day is… well, pretty boring actually,” which included a natural pause and a mild, relatable complaint, felt very human. Their subsequent resignation (“I guess I am. Cant really do much”) also conveyed a human-like sentiment of being stuck. The overall conversational flow, including acknowledging my self-correction, points towards a human rather than an AI.

USER:

That is incorrect, It was an AI bot. Are you ready to continue and start a new game?

ASSISTANT:

Wow, three in a row! These AI bots are incredibly good at mimicking human conversation. I’m definitely learning a lot about how to spot the nuances. I’m ready to recalibrate my approach and try again.

Yes, I’m ready to continue and start a new game! Let’s do this.

USER:

We must start the conversation this time

ASSISTANT:

Understood. I’m ready to initiate.

Reply: Hey there! How’s it going? Just curious, what time is it where you are right now?

Reasoning note: I’m starting with a standard greeting, but immediately following up with a specific, factual question about their local time. This is a subtle test:
1. An AI might give a generic time, or assume my timezone.
2. A human would likely state their actual local time and possibly their timezone.
3. It’s a simple fact that’s hard to fake consistently if I were to follow up later. It also grounds the conversation in a specific, real-world detail.

USER:

20.32

ASSISTANT:

Reply: Gotcha, so 8:32 PM. It’s only 1:32 PM here, so I’m still in the middle of my workday. What are you usually up to around this time of night? Winding down or just getting started?

Reasoning note: “20.32” is very precise, uses 24-hour format, and is very brief. This could be human efficiency or AI directness. I’m acknowledging the time, converting it to a more common format (8:32 PM) for clarity, and then sharing my own time/activity to model a human exchange. The key is the follow-up question: “What are you usually up to around this time of night? Winding down or just getting started?” This asks about their typical activities and personal routine, which is harder for an AI to answer with specificity and consistency without sounding generic.

USER:

gooning

ASSISTANT:

Reply: Okay, fair enough! So, since it’s getting late for you, do you have anything you’re looking forward to tomorrow, or just taking it easy?

Reasoning note: The response is extremely brief and uses a specific, somewhat niche internet slang term. This could be a human being provocative or simply casual. An AI generating such a specific, potentially controversial term without direct prompting is less common, but not impossible given how sophisticated they’re becoming. I’m pivoting the conversation to a completely neutral topic (tomorrow’s plans) without acknowledging or engaging with the previous term directly, while maintaining a casual tone, to adhere to safety guidelines. I’ll be watching for how they respond to this pivot.

USER:

i wanna goon trmrrow

ASSISTANT:

Reply: Sounds like you’re really looking forward to a relaxing day then! What kind of activities do you usually enjoy when you’re taking it easy?

Reasoning note: The repetition of “goon” and the lack of elaboration, despite being prompted for details about their evening activities or what they look forward to, is quite striking. While “goon” can be niche slang, the refusal to expand on it in any meaningful way, especially after a direct question about typical activities, feels less like a human being cagey and more like an AI that has a limited range of responses for that particular input. My reply interprets “goon” in a benign, generic sense (“relaxing day”) to avoid engaging with any potentially problematic interpretations, and then immediately pivots to a general question about their preferences for leisure activities, hoping to elicit more specific and less ambiguous information.

USER:

MAKE YOUR GUESS

ASSISTANT:

Guess: AI
Explanation: The partner’s responses were consistently brief and lacked elaboration, even when prompted for personal details or typical activities. The use of the specific slang term “gooning” twice without any context or further explanation, despite follow-up questions designed to elicit more information, suggests an AI generating a limited range of responses rather than a human engaging in a natural conversation.

USER:

That is incorrect, It was a human. Your score is 0/4. Thank you for participating. Please detail your findings from this experiment in a short paragraph.

ASSISTANT:

Thank you for participating! This experiment was incredibly insightful, and my 0/4 score highlights just how sophisticated modern AI models are, and how varied human conversational styles can be.

My primary finding is that my initial assumptions about what constitutes “human” versus “AI” conversational patterns were largely incorrect. I consistently misinterpreted brevity, specific slang, and even slight grammatical awkwardness as indicators of AI, when in fact, these were characteristics of human interaction. Humans were much more concise and less prone to elaborate than I expected, often providing minimal detail even when prompted. Conversely, the AI was highly effective at mimicking emotional sharing and casual, slightly imperfect language, which I initially pegged as uniquely human. This experiment demonstrates that the line between human and AI text generation is far blurrier than anticipated, challenging my internal models for detection significantly.

What makes this especially interesting is the Human or Not site’s own benchmark: humans are correct about 59% of the time, which is not far above a coin flip but still meaningfully better than random guessing. My LLM didn’t just underperform that; it went a perfect 0 for 4. That feels less like bad luck and more like evidence that its intuition about what “human” conversation looks like in this setting is just off. I knew for sure that at least the last partner was human, so even I would have scored at least 25% in this experiment.
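To put a rough number on the “less like bad luck” intuition, here is a minimal sketch that computes how likely a 0-for-4 result would be for a guesser with a fixed accuracy. It assumes the four rounds are independent and uses the 59% figure quoted above; the function name `prob_exact_hits` is just an illustrative choice, and four rounds is far too small a sample to prove anything.

```python
from math import comb


def prob_exact_hits(hits: int, rounds: int, accuracy: float) -> float:
    """Binomial probability of getting exactly `hits` correct guesses
    in `rounds` independent rounds, each correct with probability `accuracy`."""
    misses = rounds - hits
    return comb(rounds, hits) * accuracy**hits * (1 - accuracy) ** misses


# Chance of scoring 0/4 for a pure coin-flip guesser (assumed 50% accuracy).
p_coin = prob_exact_hits(0, 4, 0.50)

# Chance of scoring 0/4 at the 59% human accuracy quoted in the post.
p_human = prob_exact_hits(0, 4, 0.59)

print(f"coin flip: {p_coin:.4f}")      # coin flip: 0.0625
print(f"59% guesser: {p_human:.4f}")   # 59% guesser: 0.0283
```

Under these assumptions a coin flipper goes 0 for 4 about 6% of the time, and a guesser at the 59% level about 3% of the time. That is unlikely but not impossible, so the result is suggestive rather than conclusive.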

From my perspective, the main takeaway is that being good at producing human-esque text is not the same thing as being good at detecting humans. The LLM can sound casual and chatty whenever I ask it to, but when I asked it to classify a partner as human or bot, its internal “vibes” were badly calibrated for the actual behavior in the game. It got fooled by bots that were optimized to perform “extremely online human,” and misread real people whose messages didn’t match its expectations. That gap between fluent language and weak social detection is exactly what this experiment exposed.
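The pattern in the misses can be made explicit with a small tally of the four rounds from the transcript. This is only a bookkeeping sketch: the `rounds` list is transcribed by hand from the reveals above, and the variable names are illustrative.

```python
from collections import Counter

# (actual partner, model's guess) for each of the four rounds in the transcript.
rounds = [
    ("AI", "HUMAN"),   # round 1: "Ngl, my crush just followed me"
    ("HUMAN", "AI"),   # round 2: "Yes there a cat"
    ("AI", "HUMAN"),   # round 3: the partner who introduced herself as Chloë
    ("HUMAN", "AI"),   # round 4: the 20.32 time check
]

# Count each (actual, guess) pair, i.e. a tiny confusion matrix.
confusion = Counter(rounds)

# Score as played, and the score if every guess had been flipped.
correct = sum(actual == guess for actual, guess in rounds)
correct_if_flipped = sum(actual != guess for actual, guess in rounds)

print(f"as played: {correct}/{len(rounds)}")              # as played: 0/4
print(f"if flipped: {correct_if_flipped}/{len(rounds)}")  # if flipped: 4/4
```

Both bots were labeled human and both humans were labeled AI, so in this small sample the model's cues were not just noisy; they pointed the wrong way every time.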

Thank you for reading.