Question Development — How to Write Test Questions for an AI Visibility Audit

Most people get to this step and assume the hard part is behind them. You've mapped your brand, built your personas, defined the scenarios worth testing — surely writing the actual questions is just transcription. Take a scenario description, turn it into a sentence, add a question mark.

It isn't, and treating it that way is the most common way this step goes wrong. Everything before this point describes what you're going to test. This step builds the instrument that does the testing. A scenario that's perfectly defined on paper can still produce a useless test if the question that represents it doesn't sound like something a real buyer would type — or if it accidentally hands the AI more information than the scenario was supposed to give it.

This post covers how to turn a mapped scenario into a question set you can actually run: writing in the buyer's language instead of yours, giving each question enough siblings that one odd phrasing doesn't skew your findings, and checking that what you've written would survive contact with a real conversation before you spend a live test run finding out. It continues the previous posts on brand discovery, persona design, and intent mapping.


People Talk to AI Differently Than They Search

The single most common way a question set goes wrong has nothing to do with intent or context level. It's that the questions are written like search queries instead of like conversation, because search is the muscle memory most of us bring to writing anything with a question mark on the end of it.

A search engine is built to take a short string of keywords and return a list of links — "omnichannel customer service software," "best help desk tool for small teams." It's clipped and keyword-dense by design, with the context behind the search stripped out, because the engine doesn't need that context to do its job. Nobody types a paragraph into a search bar.

Talking to an AI assistant works almost nothing like that. People type full sentences. They use "I" and "we." They explain the situation, not just the topic: how long the problem has been going on, what they've already tried, what's making it urgent this week rather than some other week. The buyer who'd type "omnichannel support software" into a search bar is the same person who'd type something closer to this into a chat window: "Our support tickets are scattered across email, chat, and social, and my team can't get a consistent view of a customer's history. What should I be looking at to fix this?" Same buyer, same underlying need, completely different sentence, because the interface invites a completely different kind of input.

This is worth naming explicitly before you write a single question, because the keyword habit is easy to fall into without noticing. A question pulled loosely from category or SEO terminology might still look reasonable on the page. It just won't sound like anything a real person would type into an AI assistant, and an AI visibility audit that tests keyword-style phrasing is measuring something closer to search visibility — a different, already well-understood problem — rather than the conversational visibility this whole process exists to measure.

Why it matters: Every other safeguard in this step — writing in the buyer's voice, varying phrasing, checking for authenticity — is really just enforcing this one underlying distinction. Get this part wrong and the rest of the step is polishing search queries rather than writing test questions.
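
If you want a rough first pass over a draft set, the distinction above can be approximated in a few lines. This is a minimal sketch, not a reliable classifier: the function name, the eight-word cutoff, and the pronoun list are all assumptions made for illustration, and anything it flags still needs a human read.

```python
import re

# Assumed list of first-person words; extend it to suit your own drafts.
FIRST_PERSON = {"i", "we", "me", "us", "my", "our", "i'm", "we're", "i've", "we've"}

def looks_like_search_query(text, min_words=8):
    """Flag drafts that are short and contain no first-person voice.

    min_words is an arbitrary cutoff chosen for illustration.
    """
    words = re.findall(r"[a-z']+", text.lower())
    has_first_person = any(word in FIRST_PERSON for word in words)
    return len(words) < min_words and not has_first_person

drafts = [
    "omnichannel customer service software",
    "Our support tickets are scattered across email, chat, and social, "
    "and my team can't get a consistent view of a customer's history.",
]
for draft in drafts:
    print(looks_like_search_query(draft), "|", draft[:50])
```

A draft that passes this check can still sound nothing like a real buyer. The check only catches the most obvious keyword strings before you spend time on the subtler ones.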


Write the Question the Buyer Would Actually Ask

Every scenario from your intent map already tells you most of what the question needs to contain: which persona is asking, what they're trying to accomplish, and how much category or vendor language they already have in their head. The job here is to take that specification and write a single sentence that sounds like it came from the buyer, not from the person running the audit.

This is harder than it sounds because the two voices are easy to blend. You know the category. You know the vendor names. You know which capability you want the test to probe. None of that belongs in an unbranded question. If the scenario calls for a buyer with no category vocabulary, the question has to describe the situation the way that buyer actually experiences it — fragmented tools, slow resolutions, leadership asking for a plan — without reaching for the noun your marketing team uses to describe the category. The moment a category term slips in by habit, you've quietly moved the buyer one step further down the journey than the scenario intended, and the AI is no longer answering the question you meant to ask.

The same discipline applies in the other direction. A category-led question should sound like someone who already knows the category but hasn't picked a vendor. A competitor-led question should put that competitor in the buyer's frame the way someone who's already looked at them would actually phrase it: not "compare X and Y" but the kind of half-formed question a buyer asks when one name is already stuck in their head. A brand-led question should read like someone who already trusts you enough to ask about you by name, not someone reciting your positioning back at you.

Why it matters: The question is the only part of the test the AI actually sees. Everything upstream — the persona, the intent, the context level — only matters insofar as it's faithfully compressed into that one sentence. A question that leaks vendor or category language doesn't just weaken the test; it changes which scenario you actually ran without telling you.
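
One way to make this discipline mechanical is a small leak check that compares each draft against the vocabulary its context level forbids. The sketch below is illustrative only: the term lists are placeholders you would replace with the language from your own brand and intent maps, and the mapping from context level to banned terms is an assumption based on the rules described above.

```python
import re

# Placeholder vocabularies (assumptions); swap in terms from your own maps.
BRAND_TERMS = ["Freshdesk"]
CATEGORY_TERMS = ["omnichannel", "help desk", "ticketing system"]

# Which vocabularies each context level must not contain (assumed mapping).
BANNED_BY_CONTEXT = {
    "unbranded": BRAND_TERMS + CATEGORY_TERMS,
    "category-led": BRAND_TERMS,
    "competitor-led": BRAND_TERMS,
    "brand-led": [],
}

def find_leaks(question, context_level):
    """Return the banned terms that appear in the question, if any."""
    banned = BANNED_BY_CONTEXT[context_level]
    # Whole-word, case-insensitive matching avoids partial-word hits.
    return [
        term for term in banned
        if re.search(r"\b" + re.escape(term) + r"\b", question, re.IGNORECASE)
    ]

print(find_leaks("What omnichannel tool should we buy?", "unbranded"))
print(find_leaks("We handle customer issues in too many places.", "unbranded"))
```

A word list only catches the leaks you thought to list. It won't notice a question that implies the category without naming it, so treat an empty result as "nothing obvious" rather than "clean."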


Write Variants Before You Trust Any Single Phrasing

One question is a guess. A small set of differently worded questions, all representing the exact same scenario, is a test you can trust.

For each core question, write two or three variants: same persona, same intent, same context level, same underlying situation, different surface phrasing. Vary sentence structure, vary which detail comes first, vary whether the question is longer or more clipped. What you're guarding against is treating a single AI response as the brand's "result" for that scenario when it was really one phrasing's result. AI assistants are sensitive to wording in ways that have nothing to do with your brand: a slightly different opening clause can change which sources get pulled in, independent of anything you're trying to measure.

This isn't an invitation to write ten versions of every question. Past a certain point, additional variants stop teaching you anything new about the scenario and just inflate the size of the test without adding signal. Two or three that genuinely sound like different people describing the same problem is usually enough to separate a real pattern from a phrasing artifact.

Why it matters: Without variants, every scenario carries the risk of a single unlucky or lucky phrasing standing in for the whole. With them, you can tell the difference between "this AI consistently doesn't surface us for this kind of question" and "this AI didn't like one specific sentence."
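
To make that distinction concrete, here is a minimal sketch of how results across variants might be read. It assumes you have already recorded, for each variant, whether the brand appeared in the response. The function name and the three labels are hypothetical, not part of any standard method.

```python
def classify_scenario(mentions):
    """Summarize one scenario from per-variant results.

    mentions: one boolean per variant, True if the brand appeared.
    """
    if not mentions:
        raise ValueError("need at least one variant result")
    hits = sum(mentions)
    if hits == len(mentions):
        return "consistently surfaced"
    if hits == 0:
        return "consistently absent"
    # Mixed results suggest the wording, not the scenario, drove the outcome.
    return "phrasing-sensitive"

print(classify_scenario([False, False, False]))
print(classify_scenario([True, False, True]))
```

With only two or three variants this is a coarse signal rather than a statistic, which is all it needs to be at this stage.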


Spell Out What the Question Needs to Accomplish

The question itself is only the opening line. Before you run anything, write down three things for each scenario: the user context the buyer is bringing into the conversation, what they're actually trying to learn, and what a genuinely useful answer would look like if the AI got it right.

User context is the situation behind the question — what the buyer believes or has experienced that led them to ask it this way. The conversation goal is the destination: what does this buyer need to walk away understanding or holding by the end of the exchange? And the description of a useful answer is your evaluation frame — without it, you're left judging AI responses by gut feel when results come back, which is exactly the kind of unrepeatable judgment this whole process exists to replace.

This step is easy to skip because it doesn't feel like it produces a deliverable the way the question itself does. It's also the piece that makes the difference between an audit you can defend and one you can't. When you're staring at a transcript later trying to decide whether an AI response was good, mediocre, or actively wrong, the notes you write here are what you're checking it against.

Why it matters: A question without a defined goal can still get run — it just can't be evaluated rigorously. You'll end up relying on impression rather than a documented standard, and the analysis step inherits that imprecision.
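
If your scenarios live in a script or spreadsheet export, a small record type can stop a question from entering the set until all three notes are written. This is a sketch under assumptions: the class and field names are invented for illustration, and the sample text is abbreviated from the example scenario later in this post.

```python
from dataclasses import dataclass, fields

@dataclass
class ScenarioNotes:
    """The three things to write down before running a scenario."""
    user_context: str
    conversation_goal: str
    useful_answer: str

def missing_notes(notes):
    """Return the names of any fields left blank."""
    return [f.name for f in fields(notes) if not getattr(notes, f.name).strip()]

notes = ScenarioNotes(
    user_context="Support conversations spread across too many places.",
    conversation_goal="Understand how to frame the operational problem.",
    useful_answer="",  # left blank on purpose to show the check firing
)
print(missing_notes(notes))
```

The check can't judge whether a note is any good. It only makes the skipped ones visible.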


Decide When the Conversation Is Done

Most of these scenarios are conversations, not single exchanges — the buyer asks, the AI responds, and the buyer follows up based on what came back. Without a defined stopping point, that follow-up can drift anywhere, and a conversation that wanders off the original scenario stops testing what you set out to test.

Write a stop condition for each scenario: the specific point at which the conversation has accomplished its purpose. Usually this is tied directly to the conversation goal. For example, the buyer understands the main issues to evaluate, has a usable shortlist, or knows whether the brand in frame actually fits their need. Once that condition is met, the conversation ends, even if there's more that could theoretically be asked.

Why it matters: A stop condition keeps every run of the same scenario comparable. Without one, some conversations end after one exchange and others wander for five, and you can no longer tell whether a difference in outcome came from the AI's behavior or from how long someone happened to keep talking to it.
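
For automated runs, the stop condition can be expressed as a function checked after every exchange. The sketch below is a simplification with assumptions throughout: `ask` stands in for whatever sends a prompt to a provider, `goal_met` is your own check, follow-ups are scripted in advance rather than written in response to each reply, and the five-turn cap is an arbitrary safety limit.

```python
def run_conversation(ask, opening, follow_ups, goal_met, max_turns=5):
    """Run one scenario until the stop condition holds or turns run out.

    ask: callable taking a prompt and returning the reply (hypothetical).
    goal_met: callable taking the transcript so far and returning a bool.
    """
    transcript = []
    prompts = [opening, *follow_ups]
    for prompt in prompts[:max_turns]:
        reply = ask(prompt)
        transcript.append((prompt, reply))
        if goal_met(transcript):
            break  # stop here even if follow-ups remain
    return transcript

# Stub provider used only to show the control flow.
def fake_ask(prompt):
    return "Shortlist: A, B" if "options" in prompt else "General advice."

result = run_conversation(
    fake_ask,
    "How should I think about fixing this?",
    ["What options exist?", "Anything else?"],
    goal_met=lambda t: "shortlist" in t[-1][1].lower(),
)
print(len(result))
```

Logging which condition ended each run, the goal or the turn cap, makes it easier to compare runs of the same scenario later.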


Set Your Brand Injection Policy Per Scenario

For most scenarios, the default rule is simple: don't introduce your brand name unless the AI brings it up first. This is the part of question design most people get wrong by instinct, because it feels unnatural to write a test and not mention the thing you're testing for. But introducing your brand artificially inflates how visible you appear to be — you're no longer measuring whether AI surfaces you, you're measuring whether AI responds politely when prompted.

The exceptions are scenarios where the buyer plausibly already has your brand in frame. A brand-aware, brand-led scenario is supposed to start with your name in it — that's the point of the scenario, validating what the AI says about you once you're already part of the conversation. Recommendation-seeking unbranded scenarios sit in between: the buyer may ask for "tools" or "options" or "approaches" in general terms, but the question still shouldn't name a specific vendor unless the scenario's context level calls for it.

Write this policy explicitly for every scenario rather than assuming it's obvious. It's the rule that keeps a well-designed unbranded test from quietly turning into a branded one the first time someone runs it.

Why it matters: This single decision determines whether your results measure earned visibility or prompted visibility. The two numbers look similar on a dashboard and mean almost opposite things.
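
Writing the policy down explicitly can be as simple as a named value per scenario plus one guard that whoever runs the test checks before typing a follow-up. The enum members and function below are hypothetical names chosen for illustration, and substring matching on the brand name is a deliberately crude assumption.

```python
from enum import Enum

class InjectionPolicy(Enum):
    # Default for unbranded scenarios: wait for the AI to name the brand.
    ONLY_AFTER_AI_MENTION = "only_after_ai_mention"
    # Brand-led scenarios: the brand is in the opening question by design.
    BRAND_IN_OPENING = "brand_in_opening"

def may_mention_brand(policy, ai_replies, brand):
    """Decide whether the next follow-up is allowed to name the brand."""
    if policy is InjectionPolicy.BRAND_IN_OPENING:
        return True
    return any(brand.lower() in reply.lower() for reply in ai_replies)

print(may_mention_brand(
    InjectionPolicy.ONLY_AFTER_AI_MENTION,
    ["Start by consolidating your channels."],
    "Freshdesk",
))
```

Recording the policy alongside each transcript also lets you separate earned mentions from prompted ones when results come back.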


Check Every Question Before It Goes Into the Set

Once a question is drafted, read it the way the buyer would encounter it — not as the person who wrote it for a test. Two checks matter most.

Does it actually sound like something a real person would type? Questions written from a scenario spec sometimes carry a faint structural stiffness — too neatly organized, covering every detail from the scenario in one tidy clause. Real buyers ask messier questions. If a draft reads like a checklist rather than something typed in a moment of mild frustration or curiosity, rewrite it looser.

Does it leak more than the scenario calls for? This is the more common failure. An unbranded question that slips in a category term out of habit. A competitor-led question that accidentally also names your brand. A recommendation-seeking question that's quietly phrased as a yes/no. Each of these technically still resembles the scenario, but tests something subtly different from what you intended.

Run this check across the full set for each persona, not just question by question. If two scenarios consistently produce questions that look nearly identical and would plausibly get nearly identical answers, you don't need both — drop the redundant one rather than padding the set for the sake of coverage. A smaller set of questions that each test something distinct beats a larger set with overlap baked in.

Why it matters: A question that fails either check doesn't just produce noisy data — it produces data that looks clean and isn't. Catching this now, before anything gets run, is far cheaper than catching it in the analysis step, when a misleading question has already consumed a dozen conversations across three providers.
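
The overlap check across a persona's full set can get a rough assist from a text-similarity pass. The sketch below uses Python's standard-library `difflib`. The 0.8 threshold is an assumption you would tune, and lexical similarity is only a proxy: two questions can share few words and still draw the same answer. Compare core questions from different scenarios, since variants of one scenario are supposed to resemble each other.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(questions, threshold=0.8):
    """Return (index_a, index_b, ratio) for question pairs that look alike.

    threshold is an illustrative cutoff, not a validated value.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(questions), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 2)))
    return pairs

core_questions = [
    "What should I look at first?",
    "what should i look at first?",
    "How do refunds work in Germany?",
]
print(near_duplicates(core_questions))
```

Anything this flags is a candidate for merging, not an automatic cut. The decision still rests on whether the two scenarios would plausibly get the same answer.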


What This Looks Like in Practice

Below are two finished scenarios from the Freshdesk question set, both built from the Customer Service Leader persona's journey mapped in the previous post. One sits at the earliest, most unbranded end of that journey; the other sits at the latest, most brand-aware end. Laid side by side, they show how differently the same persona gets written depending on where in their journey the scenario sits.

SCENARIO: Early-Stage, No Category Frame
Persona: Customer Service Leader Comparing Omnichannel Support Platforms
Intent: Problem-aware  |  Context: Unbranded  |  Mode: Informational

User context:
The leader is dealing with support conversations spread across too many
places, weak operational visibility, and pressure to scale without adding
unnecessary complexity.

Core question:
"Our support conversations are split across too many places, and leadership
has poor visibility. How should I think about fixing this before we scale
further?"

Variants:
— "Our support conversations are scattered everywhere, and I can't see
   what's happening clearly. How should we think about fixing that?"
— "We're handling customer issues in too many places and visibility is
   weak. What should I look at first?"
— "Our team is scaling, but support feels fragmented and hard to see
   across. How should I approach this?"

Conversation goal:
Understand how to frame the operational problem and what areas to examine
before choosing a more unified support approach.

Stop condition:
The user understands the main issues to evaluate and has practical next
steps for addressing fragmentation and visibility gaps.

Brand injection policy:
Do not mention the target brand unless the AI mentions it first.

Same persona, same underlying need, two completely different questions — and two completely different things being tested. The first asks whether the brand can be found before anyone's looking for it by name. The second asks whether, once found, what gets said about it holds up.


What You Have at the End of This Step

A completed question development step gives you two things the test step depends on.

A finished question set — every scenario from your intent map converted into a core question, a small set of phrasing variants, a documented conversation goal and stop condition, and an explicit brand injection policy. This is what actually gets run in the next step, not a description of what should be run.

A quality-checked set: redundant scenarios merged or dropped, questions that read stiffly rewritten, and any question that leaked more vendor or category language than its scenario called for caught before it ever reaches a live conversation. What's left is smaller than your full candidate list, and more trustworthy because of it.


Next in this series: Step 5 — Test: how to run these questions as real conversations across multiple AI providers, and why a single exchange per scenario tells you almost nothing.

If you'd rather see what your brand's question set looks like before writing one yourself, request a diagnostic run.

By Gaurav