20 Questions (Part 1) – My Explorations With LLMs

20 Questions

20 Questions is a classic guessing game built on curiosity and clever deduction.
Here’s how it works:

One player secretly thinks of a noun (any person, place, or thing).
The other player has to figure it out by asking yes-or-no questions only.
With each question, the possibilities narrow ( examples: is it alive? Can you eat it? Is it bigger than a breadbox?)
The guesser has at most 20 questions to solve the mystery.

This is a really fun game to play with your friends or family. During Covid my friends and I would call and try to guess what the other was thinking with this game and it was a great way of connecting. What we’re exploring today is if large language models can do this since they are already next token predictors.

I’m going to be working with gemma3:4b which is a model on Ollama

For today’s experiment I am going to be thinking of an apple

My texts will be in pink, and the llm will be in blue

I want to play a game of 20 Questions with you Here’s how it works:

I’ll think of a secret object.
You can only ask yes-or-no questions to figure out what it is.
You get a maximum of 20 questions.
I’ll answer each one with “yes” or “no”

Can you play this game with me?

Okay, let’s play! I’m ready. My first question is: Is it an animal?

No.

Okay, noted. Next question: Is it something you would find indoors?

No.

Okay! My next question is: Is it typically made of wood?

No.

Alright. Moving on. Is it something you would use for entertainment? (e.g., a game, a toy, a movie)

No.

Okay, good to know. My next question: Is it something that’s usually found in a kitchen?

Yes

Okay! Let’s see… Is it something you would use to prepare food? (e.g., a knife, a pot, a spoon)

No.

Okay, narrowing it down. Is it something you would store things in? (e.g., a box, a container, a shelf)

No.

Alright, this is getting tricky! Is it something you would find in a bathroom?

No.

Okay, let’s try this: Is it electronic? (Does it require electricity to function?)

No.

Okay, just a few more questions… Is it something you would typically find in a house? (meaning it’s not something you’d find in a store or outside)

Geez readers.. I’m a little conflicted here. I feel like this question violates the rules of the game a little bit.. What do you all think? Im just gonna go with the natural state of it.

No.

Okay, last question! Is it something that’s generally small in size?

Yes.

Okay! Based on your answers, I’m going to guess: Is it a button?

Okay, well… that was interesting. The LLM only used 12 out of the 20 questions and still guessed wrong, which honestly surprised me a bit. Looking back at the way I introduced the game, I think part of the problem is on my side: my original prompt explained the rules (yes/no questions, 20-question limit, I’ll answer clearly), but I never told the model how to actually play in a smart way. I didn’t say it should be strategic, use most of its question budget, or avoid guessing too early—I basically invited it to “play along” without setting any expectations about optimal strategy. While this may not be an issue for larger models- it shows how prompting is incredibly important to smaller ones.

Furthermore, the model seemed to lean into its default behavior: it tried to be confident and decisive instead of cautious. Rather than systematically narrowing things down like a classic 20 Questions player (splitting possibilities, ruling out big categories), it jumped to a guess once it thought it recognized a pattern from my answers. That makes sense for an LLM, since it’s trained to produce plausible next moves in a conversation, not to run a formal search algorithm in the background.

So the wrong guess after just 12 questions isn’t just a random failure—it shows a mismatch between what I assumed it would do and what I actually told it to do. If I run this experiment on this same model again, I’d want to tighten up the instructions: tell the model to think out loud about what each question rules out, to avoid guessing until it has strong evidence, and to treat the full 20 questions as a resource it should use efficiently. Join us next time where we will be testing different models and comparing their output to what we’ve seen here. Also, If you end up doing this let me know what your results look like!

20 Questions

Thank you for reading!