AI Visibility·Baz Furby·7 min read

Most AI Visibility Tools Are Running the Wrong Prompts

Single-prompt AI visibility checks give you a false picture of where your brand stands. Here's what accurate measurement actually requires — and how we do it.


Here's how most AI visibility tools work: they send one or two prompts to ChatGPT — something like "what are the best SEO tools?" — check whether your brand name appears in the response, and report back a score. Green if mentioned. Red if not.

That's not a measurement. That's a coin flip dressed up in a dashboard.

I've spent a lot of time thinking about what accurate AI visibility measurement actually requires, and the more I dug into how AI models generate responses, the more obvious it became that single-prompt checks produce data that's actively misleading. A brand can look invisible on a one-prompt test while appearing consistently across dozens of real user queries. The reverse is also true: a brand can get lucky on a generic prompt and look like it's in great shape while missing the majority of conversations that matter.

We just rolled out a significant update to how Surfaceable measures AI visibility. Here's the thinking behind it.

Why Single Prompts Lie

AI models are non-deterministic. Ask ChatGPT the same question twice and you'll often get a different set of recommended tools. The temperature settings, the context window, the exact phrasing — all of it shifts the output. A single prompt gives you a single sample from a distribution. It tells you almost nothing about your average presence.
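
To make the distribution point concrete, here's a minimal sketch of a repeat-run presence check. It assumes the official `openai` Python SDK (v1+) with an API key in the environment; the naive substring match and the model name are illustrative choices, not how Surfaceable implements this.

```python
# Re-run the same prompt N times and count how often a brand appears.
# Minimal sketch: assumes the `openai` SDK (>=1.0) and OPENAI_API_KEY
# set in the environment; the substring check is deliberately naive.
from openai import OpenAI

client = OpenAI()

def presence_rate(prompt: str, brand: str, runs: int = 20) -> float:
    """Fraction of completions that mention `brand` at least once."""
    hits = 0
    for _ in range(runs):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.choices[0].message.content or ""
        hits += brand.lower() in text.lower()
    return hits / runs

print(presence_rate("What are the best SEO tools?", "Surfaceable"))
```

Run this twice and the two numbers will usually differ. That run-to-run variance is exactly what a single-prompt check throws away.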

But that's actually the smaller problem. The bigger one is prompt selection bias.

"What are the best SEO tools?" is not how real users ask questions. Real users ask things like:

  • "I'm a freelance SEO consultant, what tools should I use to track how my clients' brands appear in AI search?"
  • "What's a cheaper alternative to Semrush that also does AI visibility?"
  • "How do I track if my SaaS is getting mentioned by Perplexity?"
  • "What tools do agencies use to monitor brand mentions across AI platforms?"

Each of these prompts has a different intent, a different implied user, and a different evaluation framework baked in. A tool that shows up for "best SEO tools" might not show up for any of those — and those are the queries where buying decisions actually happen.

Single-prompt tools test one lottery ticket. They don't tell you anything about your underlying presence across the full space of relevant queries.

The Prompt Phrasing Problem

There's a well-documented phenomenon in AI research where small changes in prompt phrasing produce large changes in output. A study from Stanford last year showed that semantically identical questions phrased differently could shift brand recommendation rates by over 30 percentage points for the same model.

This matters enormously for measurement. If your tool is running "best AI visibility tools" and your brand appears, you might still have a 0% presence rate on "AI search monitoring platforms for agencies": a query that maps to the same intent and the same purchase decision, and one your actual ICP is probably typing more often.

The only way to get a meaningful number is to test across enough prompt variants — different phrasings, different buyer angles, different specificity levels — that you're sampling the distribution rather than cherry-picking a point on it.
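
To put numbers on "enough variants", here's the back-of-the-envelope arithmetic, using the standard Wilson score interval for a binomial proportion. Pure standard library; the hit counts are made up for illustration, not Surfaceable data.

```python
# 95% Wilson score confidence interval for an observed presence rate.
# Shows how little one sample tells you versus a few hundred.
from math import sqrt

def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion at confidence z."""
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_interval(1, 1))     # one prompt, one mention: ~(0.21, 1.00)
print(wilson_interval(43, 200))  # 200 variants, 43 hits:  ~(0.16, 0.28)
```

One mentioned-in-one-prompt result is consistent with anything from a 21% to a 100% true presence rate. Two hundred variants pin the same brand down to a range you can actually act on.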

What Platform Selection Actually Does to Your Score

Most tools test one platform. Some test two. The split between ChatGPT and Perplexity alone can be dramatic: we've seen brands with 60%+ presence rates on one and under 10% on the other for essentially the same queries. The two platforms have completely different citation patterns, different training emphases, and different tendencies around which sources they pull from.

If your AI visibility score comes entirely from ChatGPT, you have no idea how you're performing on Perplexity — which, depending on your category, might be where more of your target audience is actually asking questions. Perplexity users skew technical and research-oriented. If you're a developer tool, a SaaS for technical teams, or a B2B product with a sophisticated buyer, Perplexity might be the more important platform to optimise for.

Platform-specific breakdowns aren't a nice-to-have. They're essential for knowing where to actually focus your content and schema efforts.

Our Approach: Hundreds of Variants, Five Platforms, Daily

The update we've just rolled out changes how Surfaceable generates and runs visibility checks in three ways.

First, prompt generation is now category-aware. When you set up a project, Surfaceable identifies your category and ICP from your homepage and domain. The prompt bank it generates for your site is built around the actual questions your target buyer would ask — not a generic "best tools" template. A SaaS for marketing agencies gets different prompts than a local business or an ecommerce brand, because the queries that lead to discovery are different.

Second, we run prompt variants at volume. Instead of one or two prompts per platform, Surfaceable now runs hundreds of variants across different phrasings, buyer contexts, specificity levels, and query structures. Some are direct ("what tool tracks AI brand mentions?"), some are comparison-framed ("Semrush vs alternatives for AI search"), some are problem-first ("my brand isn't showing up in ChatGPT, what should I do?"). This is a sample of your real presence distribution, not a point estimate.
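
As a hypothetical sketch of what crossing query frames with buyer contexts looks like, here's the shape of the idea. The frame names, templates, and contexts below are illustrative stand-ins echoing the three example framings above, not Surfaceable's actual generator.

```python
# Cross a few query frames with a few buyer contexts to build a prompt
# bank. Hypothetical templates, for illustration only.
from itertools import product

FRAMES = {
    "direct":     "What tool {task}?",
    "comparison": "{incumbent} vs alternatives for {task_noun}",
    "problem":    "My brand isn't showing up in {platform}, what should I do?",
}

CONTEXTS = [
    {"task": "tracks AI brand mentions", "task_noun": "AI search tracking",
     "incumbent": "Semrush", "platform": "ChatGPT"},
    {"task": "monitors Perplexity citations", "task_noun": "citation monitoring",
     "incumbent": "Ahrefs", "platform": "Perplexity"},
]

# str.format ignores unused keys, so every context works with every frame.
prompt_bank = [tpl.format(**ctx) for (_, tpl), ctx in product(FRAMES.items(), CONTEXTS)]
for prompt in prompt_bank:
    print(prompt)
```

Three frames times two contexts is six prompts; scale both axes and you reach hundreds of variants quickly.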

Third, results are broken down per platform. Your visibility score now shows separate presence rates for ChatGPT, Claude, Gemini, Perplexity, and Grok — not a blended average that hides where you're strong and where you're invisible. If you're at 65% on ChatGPT and 8% on Perplexity, that's a different strategic problem than being at 35% everywhere. Knowing the split tells you where to direct content effort.
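
The aggregation itself is simple; the point is not collapsing it into one blended number. A minimal sketch, assuming each check is stored as a (platform, mentioned) pair:

```python
# Group raw check results by platform instead of averaging across them.
from collections import defaultdict

results = [  # (platform, brand_mentioned) pairs, illustrative data
    ("chatgpt", True), ("chatgpt", True), ("chatgpt", False),
    ("perplexity", False), ("perplexity", False), ("perplexity", True),
]

by_platform: dict[str, list[bool]] = defaultdict(list)
for platform, mentioned in results:
    by_platform[platform].append(mentioned)

for platform, hits in sorted(by_platform.items()):
    print(f"{platform}: {sum(hits) / len(hits):.0%} presence "
          f"({sum(hits)}/{len(hits)} checks)")
```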

The daily run cadence hasn't changed — Surfaceable still refreshes your scores every day so you can track trend direction, not just a static snapshot. But now the underlying data behind those scores is significantly more robust.

What This Actually Looks Like in Practice

We ran a before/after comparison on a sample of existing Surfaceable accounts when we rolled this out. The single-prompt scores and the multi-variant scores for the same brands diverged significantly — sometimes by 20+ percentage points in either direction.

Brands that looked strong on single-prompt tests often had presence concentrated on one generic query; expanded to 200 variants, their true presence across the query space was much lower. Brands that looked invisible on a one-shot test often had real visibility on more specific, longer-tail queries: they were being cited in the right conversations, just not in the broad ones the old test was sampling.

Neither of those situations is obvious from a single-prompt tool. Both of them matter for how you should be spending your content and schema effort.

The Uncomfortable Implication for the Industry

If you're using a tool that runs one or two prompts and presents you with a pass/fail badge, you're optimising for looking good on that specific test rather than for actual presence in the queries that drive discovery and pipeline.

This is a version of the same problem that plagued early SEO — tools that measured what was easy to measure rather than what actually mattered. Rankings for a handful of head terms looked great while long-tail traffic was mediocre. Exact-match keyword density looked optimised while semantic relevance was weak.

AI visibility measurement is early. Most of the tools in the space right now are running the equivalent of checking rank position for two keywords and calling it a visibility audit. The brands that build their strategy on that data will be optimising for the wrong thing.

The right mental model for AI visibility is share of voice across a realistic sample of the queries your buyers are actually asking — segmented by platform, tracked over time. That's what tells you whether your investment in content, schema, and brand authority is working.
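
One way to operationalise share of voice, assuming you've already extracted brand mentions from each sampled answer (the extraction step and the data here are illustrative):

```python
# Share of voice: of all brand mentions the sampled answers produced,
# what fraction belonged to each brand. Mention lists are illustrative.
from collections import Counter

answer_mentions = [  # brands extracted from each sampled answer
    ["Semrush", "Surfaceable"], ["Ahrefs"], ["Surfaceable"], [],
]

counts = Counter(brand for mentions in answer_mentions for brand in mentions)
total = sum(counts.values())
share_of_voice = {brand: n / total for brand, n in counts.items()}
print(share_of_voice)  # {'Semrush': 0.25, 'Surfaceable': 0.5, 'Ahrefs': 0.25}
```

Segment the same computation by platform and track it daily, and you have a trend line instead of a badge.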

Surfaceable now measures that. If you're on the platform and haven't seen your updated scores yet, they'll be in your dashboard by the next daily run. If you're not, the free tier gives you enough to see where you actually stand.

