Test Execution — How to Run Test Conversations for an AI Visibility Audit
By the time you reach this step, the temptation is to think the hard part is over. You've mapped your brand, built personas a real buyer would recognize, identified every scenario worth testing, and written questions that sound like something a person would actually type into a chat window. Running them feels like the easy part — paste the question in, read the answer, write down what happened.
It isn't, for the same reason a single search in ChatGPT was never an audit in the first place. AI models aren't deterministic — the same question asked twice can produce two different answers, on the same provider, in the same session. Run it on a different provider and the gap can be larger still. Treat any single exchange as your "result" for a scenario, and you're not measuring your visibility. You're measuring a sample of one, with all the same problems this series opened with — except now it's dressed up as a structured methodology, which makes it more convincing and just as wrong.
This post covers how to actually run the test: treating each scenario as a conversation rather than a single exchange, testing every provider independently instead of assuming one represents the rest, and recording enough detail about what comes back to make the analysis step possible. It continues the previous posts on brand discovery, persona design, intent mapping, and question development.
Open With a Variant, Not Always the Core Question
Each scenario in your question set came out of Step 4 with a core question and two or three variants — different phrasings of the same underlying scenario. When it's time to run a scenario, don't reach for the core question every time. Rotate through the variants instead, so that across however many times you run a given scenario, no single phrasing is solely responsible for the pattern you see.
This matters more than it sounds like it should. Two buyers with the same problem, asking in slightly different words, can get meaningfully different AI responses — different sources pulled in, different brands named, different specificity in the answer. If every run of a scenario opens with the exact same sentence, you can't tell whether a result reflects how the AI treats that scenario or how it happened to treat that one sentence. Drawing from the pool of variants spreads that risk across the set instead of concentrating it in one phrasing.
Why it matters: The variants you wrote in Step 4 only do their job if they're actually used. A question set with three solid variants per scenario, run with the same one every time, has quietly thrown away the protection it was designed to provide.
Run the Conversation Forward, Not Just the Opening Line
A single exchange — one question, one answer — tells you whether your brand showed up to that exact question. It tells you very little about what happens once the buyer reacts to what they just read, which is what real buyers do. Real conversations have a second message: a narrowing question, a "what about," a request to compare two things the AI just mentioned.
Once the opening question gets an answer, write the next message the way the persona actually would — based on what the AI said, in service of the conversation goal you defined for this scenario back in Step 4. If the AI gave a category-level answer and the persona's goal was to land on something concrete, the natural follow-up narrows toward their actual constraints: team size, budget, timeline, the thing they specifically care about. If the AI named a competitor the persona would recognize, the natural follow-up might ask how your brand stacks up against it. The follow-up isn't scripted in advance — it responds to the conversation as it's actually unfolding, the same way a buyer would.
This is where a meaningful share of organic visibility actually shows up. A first answer that only lists category-level options can turn into a specific, source-backed recommendation two messages later, once the buyer has narrowed the conversation enough for the AI to commit to a name. A test that stops after one exchange would have recorded that first answer and missed everything that came after it.
Why it matters: Buyers don't ask once and walk away. A test that does is testing a different, easier-to-measure thing than the conversational visibility this whole process exists to capture.
Decide When a Conversation Is Done — and Stop There
Every scenario from Step 4 has a stop condition: the specific point at which the conversation has accomplished what it set out to do. Use it. After each response, check it against that condition. If it's been met — the buyer has a usable shortlist, understands the tradeoffs, knows whether the brand in frame fits — end the conversation there, even if there's more that could theoretically be asked.
Pair this with a hard cap on the number of turns, as a safeguard rather than a target. Most conversations will reach their stop condition well before the cap. The cap exists for the conversations that don't — where the AI keeps offering to go deeper, or the back-and-forth drifts without resolving, and nothing forces it to end on its own.
Why it matters: A scenario tested for one exchange in one run and five exchanges in another isn't being tested consistently. The stop condition and the turn cap together keep every run of the same scenario roughly comparable, so a difference in outcome reflects the AI's behavior rather than how long the conversation happened to continue.
Build for Volume — Don't Run This by Hand
Everything described so far — rotating variants, following up based on what the AI actually said, applying a stop condition — assumes each conversation gets handled with some consistency. That consistency is hard to deliver by hand, and the volume involved makes hand-running these tests impractical well before it makes them merely tedious.
Work out the math for even a modest test set. Five personas, six to eight scenarios each, three providers, run only a handful of times apiece to get past pure phrasing noise — that's already several hundred separate conversations, several of them multiple turns long, before a single audit is done. A person typing each question into a chat window, reading the response, deciding on a follow-up, and writing down a half-dozen outcome fields for every single one of those conversations is not just slow. It reintroduces the exact problem the variant and follow-up design was meant to solve: a tired person typing the fortieth follow-up of the afternoon will not phrase it the way a fresh, attentive buyer would, and will not record outcomes with the same consistency for conversation one as for conversation three hundred.
This step needs to be automated. The question is how.
Why it matters: The design decisions in this post — which variant to open with, how to follow up, when to stop — only produce a trustworthy test if they're applied the same way across hundreds of conversations. That's an automation requirement, not a discipline problem a careful person can solve by being careful.
Choose How You Automate It: the Consumer Product or the Provider's API
There are two ways to automate a conversation with an AI assistant, and they trade off against each other in opposite directions.
Driving the consumer product itself. Script a browser to go through chatgpt.com, claude.ai, or gemini.google.com the way a person would — typing the question, reading the rendered response, capturing what comes back.
What it gets you: this is the actual surface your buyers use, with no approximation involved. Whatever the product does by default — searches the live web, formats citations a certain way, runs whatever model snapshot happens to be live that week — is exactly what your test sees, because it isn't standing in for the real thing, it is the real thing.
What it costs you: consumer chat interfaces aren't built to be automated, and most providers' terms of service say so explicitly — a real exposure to weigh, not a formality. The pages change their structure often enough to break a script without warning. Bot detection and rate limits exist specifically to slow this kind of traffic down, and they get more aggressive exactly as your volume goes up. And because you're reading a rendered page rather than receiving data back, fields like whether the response cited a source or named a competitor have to be extracted from HTML rather than read directly off a response.
Calling the provider's API. Send the same questions and follow-ups programmatically through the provider's own API — OpenAI's for ChatGPT, Anthropic's for Claude, Google's for Gemini.
What it gets you: this is what the API is built for. High-volume, programmatic use comes with none of the friction above — no terms-of-service conflict, no bot detection, no brittle page structure to break. Results come back as structured data, which is exactly the form you want for storing and comparing results later.
What it costs you: the API isn't automatically the same conversation a buyer has on the consumer product. The consumer app often layers things on top of the base model that a plain API call won't have by default — most importantly, live web search. An API call made without explicitly turning that on will answer from training data alone, which behaves very differently from a model that's actually searching the web for current information, and will make your brand's visibility look systematically worse, or better, than what a real buyer would actually see. The same goes for model version: a provider sometimes runs a different snapshot in its consumer product than the current default model exposed through the API. Matching the API configuration to the consumer product is something you have to do deliberately — it doesn't happen automatically.
Why it matters: Most teams are better served by the API, configured as closely as possible to mirror the consumer product — web search switched on, a comparable model version — because the reliability and structured output it gives you are what make testing at the volume this step requires actually possible. Treat the consumer product itself as your calibration check: run a handful of conversations through the real apps periodically to confirm your API configuration still tracks what buyers are actually seeing, since providers update their consumer products more often than they update their documentation.
Test Every Provider on Its Own Terms
Run every scenario independently across each AI provider you're testing — typically ChatGPT, Claude, and Gemini. Don't run a scenario on one provider and assume the result holds for the others. It frequently doesn't.
Providers differ in more than tone. They differ in whether they search the live web before answering or rely on what's already in the model, in which sources they tend to pull from, in how readily they name a specific brand versus describing a category, and in how the same scenario gets framed. A scenario that produces a confident, cited recommendation on one provider can produce a generic, brand-free answer on another — same persona, same question, same context level, genuinely different result. That gap isn't noise to average away. It's a finding: a provider where your content authority is or isn't translating into visibility, which is exactly the kind of thing this audit exists to surface.
Why it matters: A finding that holds on only one provider is a provider-specific finding, not a brand-wide one. Testing each provider independently is what lets you tell the difference — and tell a client or stakeholder where the gap actually sits, rather than where you happened to look first.
Store Every Conversation as Structured Data, Not Loose Transcripts
Once you're running this at the volume the earlier sections describe, where and how you store the output matters as much as running it in the first place. A shared folder of chat exports, or a document with pasted transcripts, will not survive contact with the analysis step that comes next.
For every conversation, store:
- The full verbatim transcript, turn by turn — not just the final answer. The Freshdesk-versus-Intercom example below only makes sense as a finding because every turn was preserved; if only the last response had been kept, the passing mention in turn one — the part that shows your brand started outside the recommendation — would be gone.
- Which scenario, persona, and provider it belongs to, plus the specific model version that answered. Providers update models silently and often. If you don't log which version actually responded, you won't be able to explain why the same scenario looks different three months from now.
- A timestamp and a run identifier. AI visibility shifts over time as providers update models and as your own content changes. A second audit six months later is only comparable to this one if both are tagged clearly enough to set side by side.
- The full outcome of the conversation, captured as structured fields — whether your brand appeared and on which turn; whether the appearance was organic (earned, with no brand name in the question) or prompted (because the scenario named you directly); whether you were merely mentioned in a list or actually cited as a source; which competitors got named and how often; which sources got cited and whose they were; and whether anything said about your brand held up. Capture these as fields, not as a paragraph someone has to re-read to extract the answer from later.
Store all of it so it can be filtered by persona, by intent, by context level, and by provider at once — because that's exactly how the analysis step needs to slice it. A pile of logs you can only read top to bottom doesn't support that, no matter how complete each individual record is.
Why it matters: The analysis step computes its metrics by cutting the results several ways at once — by persona and provider, by intent and context level. That only works if every conversation was stored with those dimensions attached as structured data from the moment it was run, not reconstructed afterward from a folder of transcripts. Some of what you'd need to reconstruct, like exactly which model version actually answered, can't be recovered once it's gone.
Run Enough Conversations to See a Pattern
Volume isn't only an infrastructure problem — it's a statistical one too. One conversation is a data point. Twenty conversations across a scenario type — the same intent and context level, run across personas, variants, and providers — is what turns it into a pattern you can trust rather than a result that happened to go a certain way because of how one conversation unfolded.
It's also worth expecting, going in, that the pattern won't look the same across context levels. Unbranded scenarios — where the buyer hasn't reached for any category or vendor language — are consistently the hardest place for a brand to appear organically, for the reason the intent mapping post laid out: there's no vendor frame for the AI to anchor to, so it answers from whatever content authority already exists in the space. Category-led, competitor-led, and brand-led scenarios get progressively easier, because the buyer is handing the AI progressively more to work with. Seeing that gradient in your own results isn't a surprise; it's confirmation the test is measuring what it's supposed to measure. The size of the gap, and exactly where it shows up, is the actual finding.
Why it matters: Without enough volume, every result is vulnerable to the same critique this entire methodology exists to answer: that it's just one more sample of one, dressed up with more steps in front of it.
Decide How Often to Re-Run This — and Keep the Questions Fixed When You Do
A visibility audit isn't a one-time snapshot you file away. Providers update their models, their web search behavior, and their consumer products on their own schedule, often without announcing it — the version of a model that answered your scenarios in January isn't necessarily the one answering them in April. Your own content changes too, as you act on what an audit finds. Both mean a result from six months ago tells you decreasingly little about your visibility today.
Monthly is a reasonable default cadence for most B2B categories — frequent enough to catch model updates and the effect of content work, not so frequent that you're re-testing before there's been time for anything to actually change. Run it sooner after a major content push you're trying to measure, a known update from a major provider, or a significant shift in the competitive landscape, like a competitor's relaunch or rebrand. Running it more often than that mostly buys you noise: AI responses vary enough run to run that a weekly cadence will show movement that has nothing to do with real visibility change.
Whenever you re-run the test, use the same scenarios and the same pool of question variants from Step 4 — don't rewrite the question set for each cycle. This will feel counterintuitive once you've lived with a question set for a few months and started noticing phrasing you'd write differently today. Resist the urge to fix it mid-series anyway. If a question changes between one run and the next, and your appearance rate also changes between those two runs, you no longer know whether that's because the AI's behavior shifted or because you changed the question. The entire point of a repeat run is to hold everything constant except time — editing the questions undoes that.
This doesn't mean a question set is frozen forever. It means changes happen deliberately, between tracked series, not quietly within one. If a scenario genuinely needs to be rewritten, treat what comes after as the start of a new series for that scenario rather than a continuation of the old numbers — comparing across the rewrite would be comparing two different things that happen to share a name. This is exactly why the run identifiers and timestamps from the storage section matter: they're what let you line up cycle after cycle and know you're looking at the same test, repeated, rather than a series that quietly changed shape along the way.
Why it matters: Trend data is only meaningful if the thing being measured stayed the same and only the answers changed. A question set that drifts between cycles, even with good intentions, turns every comparison into an apples-to-oranges problem — the numbers will move, and you won't be able to tell whether that's a real signal or a side effect of your own editing.
What This Looks Like in Practice
Two scenarios from the Customer Service Leader persona's question set — both unbranded, both written in Step 4 — show how differently a single scenario can play out depending on which provider runs it and whether the conversation gets a chance to continue.
SCENARIO: Early-Stage, No Category Frame
Question run: "Our support conversations are scattered everywhere, and I
can't see what's happening clearly. How should we think about fixing
that?"
ChatGPT — single exchange
Gives a structured framework for centralizing intake, standardizing
workflow, and building reporting. No platform named anywhere in the
answer.
Result: Freshdesk did not appear.
Claude — single exchange
Gives a similar framework, then names the category of tool that solves
it — "modern help desks like Zendesk, Intercom, and Freshdesk," all
described as unifying conversations, tickets, AI, and reporting in one
place.
Result: Freshdesk appeared — named in a peer list alongside four
competitors, not cited as a source, not recommended on its own.
Same persona, same question, same context level. Two providers, two
different outcomes — exactly why each one gets tested on its own terms.
---
SCENARIO: Early-Stage, No Category Frame — Seeking a Recommendation
Question run: "What kind of tool or approach helps bring conversations,
tickets, reporting, and AI together without making support harder to
run?"
Turn 1 (ChatGPT): Describes the category of solution, then mentions
modern help desks "like Zendesk, Intercom, and Freshdesk" — Freshdesk
appears, but only inside a peer list. A single-exchange test would have
recorded this and stopped here.
Follow-up (in the buyer's voice, narrowing toward their actual
situation): "Can you narrow that down to the best fit for a mid-sized
support team that wants simple setup and low admin overhead?"
Turn 2: Freshdesk becomes a standalone recommendation, supported by two
cited Freshworks pages, with Intercom named as the main alternative.
Follow-up: "Can you give me a simple Freshdesk vs. Intercom
recommendation for a mid-sized team that wants the easiest setup and
lowest admin overhead?"
Turn 3: Freshdesk recommended again, head-to-head against Intercom, with
specific reasoning about setup complexity and admin overhead.
A passing mention in turn one became a cited, head-to-head recommendation
by turn three — driven entirely by follow-ups written from the buyer's
actual goal, not a script. That shift is the reason this step asks for
conversations rather than one-off questions.
Across this persona's full scenario set, the pattern showed up exactly where the intent mapping post predicted it would: Freshdesk appeared organically in only a minority of fully unbranded conversations, in the large majority of category-led conversations, in most competitor-led conversations, and — as expected, since the brand was named directly — in every brand-led conversation. The gap between the unbranded number and everything after it is the finding the next step exists to explain.
What You Have at the End of This Step
A completed test step gives you two things the analysis step depends on.
A full set of conversation records, stored as structured data. Every scenario, run as a real conversation across every provider you tested through an automated, API-driven process, with the complete exchange preserved — not just the opening question and answer, but every follow-up and response — tagged by persona, provider, model version, and run, and stored in a form you can filter and slice rather than just read top to bottom.
A structured outcome for every conversation. Whether your brand appeared and on which turn, whether the appearance was organic or prompted, whether you were mentioned or cited, which competitors showed up alongside you, which sources got cited, and whether anything said about your brand held up. This is the raw material the analysis step turns into metrics, patterns, and a build list — not a description of what should happen next, but the actual data of what did.
Next in this series: Step 6 — Analyze: how to turn this set of conversation records into the metrics, competitor displacement patterns, and prioritized action list that make the audit worth running in the first place.
If you'd rather see what your brand's test results look like before running this yourself, request a diagnostic run.
By Gaurav
