AI Mode Query Fan-Out Analyzer
A Streamlit app that scores your content against the way Google AI Mode actually reformulates queries. Paste a URL and a seed query. The tool generates up to 20 realistic query variants, embeds each passage on the page, and tells you which passages are covered and which are gaps.
Why I Built It
Clients kept asking the same question: "Is this page going to show up in AI Overviews?" The honest answer used to be, "Publish it and find out in a month." That's expensive. Content takes weeks to produce, and waiting for AI to re-crawl and decide costs real money.
AI Search doesn't rank URLs the way classic SEO did — it pulls passages. A page either has passages that match the retrieval query's intent or it doesn't. If I could simulate the retrieval before publishing, I could tell you which passages to keep, rewrite, or add.
That's what this does.
The Nine Query Variant Types
When a user types a query into AI Mode, the model doesn't just answer that one string — it internally reformulates. A seed query fans out into a family of related queries, and the system pulls passages that satisfy any of them. To simulate that, I prompt Gemini to generate up to 20 variants across nine types. The model picks the types that make sense for the seed — not every type fires for every query.
The first variant is always the original query, exactly as typed.
- Equivalent. Rephrasings of the same question. "did roger moore drive an aston martin" → "what car did roger moore drive"
- Follow-up. Logical next questions that build on the original. "did da vinci paint mona lisa" → "who commissioned da vinci to paint mona lisa"
- Conversational follow-up. How people actually talk to AI Mode after getting a first answer. The topic stays in the query for semantic match. "solar panels" → "are solar panels worth it?" / "how long do solar panels last?"
- Generalization. Broader version of the question. "best Italian restaurants in Manhattan" → "best restaurants in New York City"
- Specification. More detailed or specific version. "climate change" → "climate change effects on coastal cities"
- Canonicalization. Slang or informal phrasing turned into standard terms. "how to get rid of belly fat fast" → "abdominal fat reduction methods"
- Entailment. Consequences, prerequisites, or implied facts. "solar panel installation" → "solar panel maintenance requirements"
- Clarification. Disambiguation when the seed query has multiple meanings. "apple" → "apple fruit nutrition" or "apple iphone features"
- Related entity. Closely related people, concepts, or products. "iPhone 15 features" → "smartphone comparison 2024"
I force at least two or three conversational follow-ups in every run, because that's where AI Mode actually lives. Static keyword SEO gets you the equivalent and specification buckets. AI Search is where the other seven types matter.
How It Works Under the Hood
Six stages. The first three are setup. The last three are the analysis.
1. Scrape the Page
Three scraping modes, in order of strength:
- Zyte API. The default. Handles JavaScript rendering, bot protection, and the sites that block everything else. Requires a Zyte key.
- Selenium with stealth. Headless Chrome with `selenium-stealth` applied. Works on most sites, slower to run locally.
- Plain `requests`. Fastest. Fails on anything that needs JS or blocks non-browser traffic.
2. Chunk Into Passages
Two granularities:
- Passage-based (default). Walks the DOM and treats semantic blocks — paragraphs, list items, heading sections — as individual passages. Mirrors how retrieval actually works.
- Sentence-based. Every sentence is a unit. Higher noise, finer-grained coverage.
Passage-based mode also supports a sliding sentence-overlap window, so a passage bleeds a sentence or two into its neighbors — useful when a single retrieval-worthy idea crosses a paragraph break.
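The sliding-overlap idea can be sketched in a few lines. This is a minimal illustration, not the app's actual chunker: `split_sentences` here is a naive stand-in (the app uses nltk for sentence splitting), and the one-sentence default window is an assumption.

```python
from typing import List

def split_sentences(text: str) -> List[str]:
    # Naive splitter for illustration; the real app uses nltk's tokenizer.
    return [s.strip().rstrip(".") + "." for s in text.split(". ") if s.strip()]

def chunk_with_overlap(paragraphs: List[str], overlap: int = 1) -> List[str]:
    """Each passage keeps `overlap` trailing sentences from the previous
    block, so an idea that crosses a paragraph break stays in one unit."""
    passages: List[str] = []
    carry: List[str] = []
    for para in paragraphs:
        sents = split_sentences(para)
        passages.append(" ".join(carry + sents))
        carry = sents[-overlap:] if overlap else []
    return passages
```

With `overlap=1`, the second passage starts with the last sentence of the first, which is exactly the "bleed into neighbors" behavior described above.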
3. Generate the Query Fan-Out
Gemini receives the prompt described above with the seed query. It returns a Python-parseable list of query strings. Nothing fancy — no chain of thought, no voting. It's a one-shot call.
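Because the model returns a Python-parseable list, the post-processing is a single `ast.literal_eval`. A hedged sketch of that parsing step (the actual Gemini call is omitted; the fence-stripping and dedup logic here are assumptions about the kind of cleanup such a one-shot call needs):

```python
import ast

def parse_fanout(raw: str, seed: str, max_variants: int = 20) -> list:
    """Parse the model's list-of-strings reply into the final fan-out."""
    # Strip markdown fences the model sometimes wraps around the list.
    cleaned = (raw.strip()
                  .removeprefix("```python").removeprefix("```")
                  .removesuffix("```").strip())
    variants = ast.literal_eval(cleaned)  # e.g. '["q1", "q2"]' -> list
    if not isinstance(variants, list):
        raise ValueError("model did not return a list")
    # The first variant is always the seed query, exactly as typed.
    deduped = [seed] + [q for q in variants if q != seed]
    return deduped[:max_variants]
```

`ast.literal_eval` only evaluates literals, so a malformed or malicious reply raises instead of executing anything.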
4. Embed Everything
Eight embedding models available. Pick one for the whole run:
- Local (free, CPU). `all-mpnet-base-v2` (quality, default), `all-MiniLM-L6-v2` (speed), `all-distilroberta-v1` (balanced), `mixedbread-ai/mxbai-embed-large-v1` (large, slow on CPU).
- OpenAI. `text-embedding-3-small`, `text-embedding-3-large`.
- Gemini. `embedding-001`, `embedding-2-preview` (multimodal).
Embeddings are cached by SHA-256 of (model name + text), so re-runs on the same content don't pay the API cost twice.
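The cache key is simple enough to show directly. A minimal sketch of the hashing scheme described above; the in-memory dict and the `embed_fn` callback are illustrative assumptions (the storage backend could be anything):

```python
import hashlib

def cache_key(model_name: str, text: str) -> str:
    """SHA-256 over model name + text: same passage, same model -> same key."""
    payload = f"{model_name}:{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

_cache: dict = {}

def embed_cached(model_name: str, text: str, embed_fn):
    key = cache_key(model_name, text)
    if key not in _cache:          # only pay the API cost on a miss
        _cache[key] = embed_fn(text)
    return _cache[key]
```

Keying on model name as well as text matters: the same passage embedded with two different models must produce two cache entries, since the vectors aren't interchangeable.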
5. Compute Pairwise Cosine Similarity
Every passage is compared to every query. For a page with 40 passages and 7 queries, that's 280 comparisons — each a dot product of normalized vectors. Fast even on CPU.
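The whole comparison collapses to one matrix multiply: L2-normalize both sides, and a dot product is exactly cosine similarity. A sketch (shapes assumed: passages `(P, d)`, queries `(Q, d)`):

```python
import numpy as np

def similarity_matrix(passages: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Return a (P, Q) matrix where entry [i, j] is cos(passage i, query j)."""
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    return p @ q.T
```

For 40 passages and 7 queries this is a 40×768 by 768×7 matmul, which is why the step is fast even on CPU.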
6. Highlight and Score
Three bands, keyed to empirical thresholds I've used in client work:
- ≥ 0.75 — strong match. Covered.
- 0.60 – 0.75 — borderline. Worth rewriting for a tighter match.
- < 0.60 — gap. Either the content isn't there, or the phrasing is too far from the query language.
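The banding itself is a two-branch function over those thresholds. A minimal sketch (the label strings are illustrative; the thresholds are the ones above):

```python
def band(score: float) -> str:
    """Map a cosine score to one of the three coverage bands."""
    if score >= 0.75:
        return "strong"      # covered
    if score >= 0.60:
        return "borderline"  # worth rewriting for a tighter match
    return "gap"             # content missing, or phrased too far from the query
```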
The UI renders the source page's HTML with passages color-coded inline, plus a ranked table of passages × queries.
Inputs and Outputs
Inputs. A seed query. Number of variants (3–20, seven is the sweet spot). Input mode (URL list, pasted text, or persona-prompt ranking). Scraping method. Analysis granularity. Embedding model.
Outputs. Inline highlighted HTML of the source page (green / amber / red passages you can read in place). Ranked passage × query table. A gap report for passages below 0.60 against every query. Optional Gemini-generated SEO recommendations. And a prompt-ranking mode if you're choosing between candidate prompts for an AI app.
Stack
Python, Streamlit, no database. The whole thing runs in a single process: sentence-transformers for local embeddings, google-genai for query generation and Gemini embeddings, openai for the OpenAI embedding family, huggingface_hub for gated-model auth, scikit-learn for cosine similarity, plotly for the visualizations, beautifulsoup4 + trafilatura + selenium-stealth for scraping, nltk for sentence splitting.
Deployed on Posit Connect Cloud.
What It Doesn't Do
Worth being direct here:
- It's not a prediction. Cosine similarity between your passages and a fan-out query family is a proxy for retrievability. It doesn't guarantee AI Mode will cite you — citation depends on authority signals, recency, and dozens of things this tool can't see.
- The fan-out is Gemini's guess at what AI Mode generates, not the actual fan-out. Google doesn't publish that. Treating the output as directionally correct is fine. Treating it as ground truth is not.
- Embedding choice matters a lot. The same passage scored with MPNet vs. OpenAI 3-large can land in different bands. Pick one model and stick with it for a project so the scores are comparable.
- It doesn't fix your page for you. It shows you the gaps. You (or I) still have to write the passages that close them.
When to Use It
- Before publishing. Run a draft through it, fix the red passages, publish.
- Competitive audit. Run your page and your top-three competitors against the same seed query. The one with the most green passages is probably the one AI Mode is reaching for.
- Content refresh. Old page under-performing? Check which passages have degraded coverage for the queries you care about now.
- Prompt engineering. Use the persona-ranking mode to pick the best prompt for an AI-driven feature in your product.
Try It Now
QueryDrift
The commercial counterpart to the Fan-Out Analyzer. Built with Grant Simmons (ex-Homes.com, The Search Agency). QueryDrift ingests your Google Search Console data, clusters every query in semantic space, and tracks how your topic focus drifts over time — the signal SEO teams lose when AI Overviews start eating clicks.
One score, one cluster map, the topics you own — and the ones slipping away.
Proprietary SEO Tools
These tools are proprietary and used exclusively on client engagements — not shipped as products. Summaries of what each one does:
Taxonomy Tool
A Next.js and TypeScript application that generates hierarchical e-commerce taxonomies from real site data. It ingests Screaming Frog crawls, Google Search Console queries, GA4 sessions, and Semrush keyword data, then uses Gemini to produce a category tree with meta titles, slugs, and JSON-LD BreadcrumbList markup — scoped to defined customer personas and ready for CMS integration.
AI Search Simulator
A Streamlit application that loads a site, or competitor sites, into a Qdrant vector database using EmbeddingGemma or Gemini embeddings. Once indexed, the collection can be queried the way an AI retrieval system would, content gap audits can be run against a sitemap, and internal-linking suggestions can be generated across the full embedding space. Scraping is handled via Zyte; entity detection via Google Cloud Natural Language.
Media Mix Modeling
A custom Bayesian Media Mix Modeling application, built in-house on similar principles to Google Meridian rather than on top of it. Users upload weekly or daily media spend and revenue, select their channels (paid search, paid social, display, video, affiliate), and the model returns channel attribution with credible intervals, diminishing-returns response curves, budget-allocation optimization, and what-if scenario planning. Designed for the conversation that starts with defending a marketing budget in front of a CFO.
Entity Gap Analysis
A Streamlit tool that extracts entities from both client content and competitor content using Google Cloud Natural Language, scores them against target queries, and renders a relationship graph via networkx. The output: a ranked list of entities competitors are using that the client is not — weighted by query relevance. The result feeds directly into content strategy decisions.
Working With Me on These
The Fan-Out Analyzer is free to use. QueryDrift has a free tier. The harder part is interpreting the scores and writing the passages that close the gaps — that's what the AI SEO consulting service is for. If you want me to run these against your site and return a rewrite list, start a conversation.