Introduction
Your site ranks on Google. Your Core Web Vitals are green. Your schema validates. And yet when someone asks ChatGPT, Claude, Perplexity, or Google AI Overviews about your industry, your brand never appears. That gap between classical SEO and what AI search engines actually cite is the problem Generative Engine Optimization (GEO) exists to solve.
GEO is not a replacement for SEO. It is a parallel discipline that focuses on how large language models retrieve, interpret and cite content rather than how search crawlers index and rank it. The techniques overlap in places, diverge in others, and the field moves faster than any published standard. This guide collects what actually works in practice, what does not, and the order in which a competent engineering team should tackle each lever.
Why AI visibility is no longer optional
By the first quarter of 2026, AI-mediated traffic is a measurable and growing share of qualified referrals for most information-heavy sites. ChatGPT Search is embedded in the iOS address bar for users with the app installed. Perplexity has moved from power-user tool to mainstream product. Claude reads the open web through its search tool. Google AI Overviews appear on a large share of informational queries in English-speaking markets and are rolling out to more languages each quarter.
The economic consequence is blunt. Even when a prospect ultimately buys, signs up or books a consultation through classical search, their shortlist was often filtered by an LLM upstream. If your name never enters that shortlist you do not get to compete. Classical SEO still matters, because it feeds the same crawl infrastructure that many AI systems use, but SEO alone is no longer sufficient.
AI or LLMs: a quick word on terminology
Both terms appear in practice and the choice matters less than consistency. “AI search” is the broader user-facing term. It covers chatbots, retrieval-augmented generation tools, AI Overviews, and hybrid systems. “LLMs” refers specifically to the underlying language models. Throughout this guide we use “AI” for the discovery surface and “LLMs” for the technology behind it. The optimization target is the same.
The AI crawler ecosystem
Before touching any code, know who you are optimizing for. As of April 2026 the main user agents are:
- GPTBot, used by OpenAI for training and offline retrieval.
- OAI-SearchBot, used by ChatGPT live search.
- ChatGPT-User, on-demand fetches triggered by a user prompt.
- ClaudeBot and Claude-User, used by Anthropic.
- PerplexityBot and Perplexity-User, used by Perplexity AI.
- Google-Extended, the opt-out control for Gemini training.
- CCBot, Common Crawl, which feeds many smaller LLMs.
- Applebot-Extended, the opt-out control for Apple Intelligence training.
- Bytespider, operated by ByteDance.
- Meta-ExternalAgent, used by Meta AI.
None of these execute JavaScript. All of them respect robots.txt. Most identify themselves honestly. A subset fetches content on behalf of a user the moment that user presses send, which collapses the crawl-to-citation cycle from days to seconds.
What does not work
The GEO space is full of folklore. Most of it has no empirical support.
- Custom meta tags such as <meta name="ai-content-url"> or <meta name="llms"> have no known implementation in any shipping LLM product.
- Files like /.well-known/ai.txt and /ai.txt have competing proposals and no adoption.
- HTML comments aimed at bots are stripped by every mainstream crawler before processing.
- Human-or-AI toggle buttons require a click, and bots do not click.
- User-agent sniffing to serve different content to LLMs violates Google’s cloaking policy and can trigger manual actions.
- Dedicated “AI info pages” show no differential treatment in citation behaviour.
- Pure JSON-LD and Schema.org markup is read by Microsoft Copilot through Bing and still influences traditional SEO, but multiple controlled tests show that ChatGPT, Claude and Perplexity largely ignore structured data during answer synthesis.
The pattern is always the same. Someone proposes a spec, writes a blog post, and other blog posts cite that one. Before adopting any GEO technique, ask whether there is evidence of actual consumption or only evidence of proposal.
Six techniques that actually work
Ordered by impact for a typical content-heavy site.
1. Audit robots.txt first
Nothing else matters if you are accidentally blocking the crawlers. Many sites inherited aggressive disallow rules from the 2023 panic about AI training. The decision to allow or block each crawler is yours to make, but it should be explicit. A reasonable default for a business that wants AI visibility:
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: CCBot
Allow: /
If you want search visibility in ChatGPT but do not want your content used for model training, allow OAI-SearchBot and ChatGPT-User while disallowing GPTBot. OpenAI documents this split. Apple, Google and Anthropic offer similar controls.
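Expressed in robots.txt, that split looks like this (following OpenAI's documented user agents; adjust to your own policy):

```
# Allow ChatGPT live search and on-demand user fetches
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Opt out of model training
User-agent: GPTBot
Disallow: /
```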
2. Serve Markdown siblings for every page
This is the highest-leverage technical change on the list. Expose a clean Markdown version of every page at the same URL with .md appended, for example /blog/post and /blog/post.md. The Markdown version drops navigation, footers, analytics snippets, cookie banners and everything else that inflates token counts without adding informational value.
Independent measurements across content sites consistently report token reductions in the 70 to 85 percent range when Markdown replaces rendered HTML. A 15,000-token blog post typically becomes a 3,000-token one. That matters because when an LLM fetches your page to answer a user prompt it has a finite context budget. Smaller, cleaner content fits more fully into that budget and quotes more faithfully.
On Astro, Next.js or any static-first framework, generating .md endpoints from the same content collection that feeds the HTML view is a morning of work. The Markdown should include the page title, publication date, author, a short summary, the body, and clearly marked citations.
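A minimal sketch of that generator, assuming a content collection where each entry exposes frontmatter plus a raw Markdown body. The `Entry` shape and `buildMarkdownSibling` name are illustrative, not a framework API:

```typescript
// Illustrative entry shape; content frameworks expose something similar
// (frontmatter data plus the raw Markdown body).
interface Entry {
  title: string;
  date: string; // ISO publication date
  author: string;
  summary: string;
  body: string; // raw Markdown, no layout chrome
}

// Build the clean .md sibling: title, metadata, summary, then the body.
// Navigation, footers, analytics and banners never enter this representation.
function buildMarkdownSibling(entry: Entry): string {
  return [
    `# ${entry.title}`,
    ``,
    `Published: ${entry.date} · Author: ${entry.author}`,
    ``,
    `> ${entry.summary}`,
    ``,
    entry.body,
  ].join("\n");
}
```

In Astro this function would back a custom endpoint such as src/pages/blog/[slug].md.ts; in Next.js, a route handler responding with Content-Type: text/markdown.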
3. Advertise the Markdown version
An LLM crawler arriving at your HTML page needs to discover that a Markdown sibling exists. Two complementary mechanisms handle the two classes of client.
HTML head:
<link rel="alternate" type="text/markdown" href="/blog/post.md" />
HTTP response header:
Link: </blog/post.md>; rel="alternate"; type="text/markdown"
The HTML tag reaches parsers that read the DOM. The HTTP header reaches headless agents that issue a HEAD or GET and skip markup parsing entirely. Cost of deployment is one line in your layout plus one entry in your CDN headers config. Benefit: crawlers no longer have to guess your URL pattern.
4. Content negotiation on Accept: text/markdown
HTTP content negotiation has been in the standard since 1997. When a client sends Accept: text/markdown, return the Markdown representation from the same URL. Pair with Vary: Accept so CDNs cache each representation correctly. Claude Code, Cursor and several research agents already send this header by default. Content negotiation is the most likely long-term standard because it requires no new specifications and reuses the existing HTTP stack. On Cloudflare Workers the implementation is under twenty lines.
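The core of that Worker can be reduced to one pure routing decision; a sketch, with an illustrative function name, that ignores q-values and treats any listed text/markdown as a preference:

```typescript
// Return the path of the representation to serve for a given Accept header.
// Pair every response with `Vary: Accept` so CDNs cache each form separately.
function resolveRepresentation(pathname: string, accept: string | null): string {
  const wantsMarkdown =
    accept !== null &&
    accept
      .split(",")
      .some((part) => part.trim().split(";")[0].toLowerCase() === "text/markdown");
  // Only rewrite HTML routes; an explicit .md request is already unambiguous.
  return wantsMarkdown && !pathname.endsWith(".md") ? `${pathname}.md` : pathname;
}
```

A Worker's fetch handler would call this on the incoming URL and proxy the resolved path; full negotiation per RFC 9110 would also weigh q-values, which this sketch omits.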
5. Publish llms.txt and llms-full.txt
Two files at the site root do the work.
/llms.txt is a curated Markdown index. It lists your most important pages grouped by theme with one-line descriptions. Think of it as a README for an LLM that has been asked about your site.
/llms-full.txt concatenates the full Markdown content of your core pages into a single file. Analytics across multiple publishers show it receives substantially more LLM traffic than the short index does. Generate both at build time from the same content collection you already maintain and regenerate on every deploy. No major LLM provider has formally committed to reading these files, but they show up in server logs often enough to justify the setup time.
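Both files can be assembled at build time from the same collection; a sketch, where the `Page` shape and section grouping are assumptions about your content model:

```typescript
interface Page {
  title: string;
  url: string;
  description: string;
  section: string; // e.g. "Guides", "Reference"
  markdown: string; // full Markdown body, used for llms-full.txt
}

// /llms.txt: a curated Markdown index, grouped by section.
function buildLlmsTxt(siteName: string, pages: Page[]): string {
  const sections: Record<string, Page[]> = {};
  for (const p of pages) {
    if (!sections[p.section]) sections[p.section] = [];
    sections[p.section].push(p);
  }
  const lines = [`# ${siteName}`, ``];
  for (const section of Object.keys(sections)) {
    lines.push(`## ${section}`, ``);
    for (const p of sections[section]) {
      lines.push(`- [${p.title}](${p.url}): ${p.description}`);
    }
    lines.push(``);
  }
  return lines.join("\n");
}

// /llms-full.txt: the full Markdown content of the core pages, concatenated.
function buildLlmsFullTxt(pages: Page[]): string {
  return pages.map((p) => p.markdown).join("\n\n---\n\n");
}
```

Wire both into the build step so every deploy regenerates them alongside the HTML.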
6. Invest in visible content quality
This is the largest and most overlooked lever. LLMs synthesize answers by weighting candidate sentences from retrieved pages. Sentences that contain concrete, attributable information score higher than vague claims. “Studies show that AI visibility matters” is nearly useless. “Independent testing in 2026 found that direct quotations from named experts raise citation rates by roughly 43 percent” is what gets cited.
Three specific moves consistently raise AI citation rates on content that was previously under-cited:
- Direct quotations from named experts.
- Verifiable statistics with a clearly attributed source.
- Authoritative outbound citations to primary sources.
Every one of these is a visible content signal. None of them are hidden. For a WordPress agency, a SaaS company or any business with industry-specific expertise, the practical translation is: stop writing generic listicles, start writing articles that name specific versions, specific dates, specific people and specific numbers. Cite your sources inline. Quote your own team members by name. Include the year and the version of every tool you reference. This is good journalism and it is also GEO.
A combined SEO and GEO checklist for 2026
Crawlability and indexing:
- robots.txt permits both the classical search crawlers and the AI user agents you want to be visible in.
- XML sitemap is fresh and submitted to Google Search Console and Bing Webmaster Tools.
- Internal linking follows a clear topical structure. Hub and spoke, or topical maps.
- Canonical tags are correct on duplicate or parameterised URLs.
On-page work that still matters for humans and LLM retrieval alike:
- Descriptive, keyword-informed page titles and H1s in sentence case.
- One H1 per page, logical H2 and H3 hierarchy.
- Meta descriptions written for humans, not stuffed.
- Schema.org types that match the content, including Article, FAQPage, Product, Organization, Person and HowTo.
- Image alt text that describes the image, not the keyword.
Core Web Vitals. AI crawlers do not run JavaScript, but classical search still does:
- LCP under 2.5 seconds on mobile.
- CLS under 0.1.
- INP under 200 milliseconds.
GEO-specific:
- Markdown sibling endpoints for every content page.
- <link rel="alternate" type="text/markdown"> in every layout.
- Link: HTTP header at the CDN level.
- Content negotiation on Accept: text/markdown.
- /llms.txt and /llms-full.txt at the site root, regenerated on every deploy.
- Visible citations, quotations and statistics within the content itself.
- Named entities, including people, products, companies, versions and dates, used consistently.
Measurement:
- Server-side logging of User-Agent and Referer at the edge.
- A dashboard segmenting AI crawler traffic from human and classical search traffic.
- Monthly test prompts in ChatGPT, Claude, Perplexity and Gemini to check brand mention patterns.
A pragmatic implementation order
For a site that has none of this yet, work in this sequence. Each step is independently valuable.
1. Audit and fix robots.txt. Nothing else works without this.
2. Add <link rel="alternate" type="text/markdown"> pointing at an eventual Markdown endpoint. You can ship this before the endpoint exists.
3. Build the Markdown rendering pipeline. Start with your highest-traffic content type and expand.
4. Add the Link: response header and Accept: text/markdown content negotiation at your CDN.
5. Generate /llms.txt and /llms-full.txt from your content collection at build time.
6. Rework your flagship articles to include named entities, direct quotes and cited statistics. This is ongoing editorial work and has the highest long-term impact.
7. Instrument server-side analytics for AI crawler traffic.
Steps one through five are pure engineering and can land in a single sprint. Step six is editorial and compounds over quarters. Step seven tells you whether any of it worked.
How to measure AI visibility
Traditional analytics will not see most AI crawler traffic. AI crawlers do not execute JavaScript, so GA4 and Plausible miss them entirely. You need server-side logging.
Capture at the edge: full User-Agent string, Referer header, requested path, HTTP status returned and response size. Segment by user-agent pattern such as GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Bytespider, CCBot, Applebot-Extended and Google-Extended to build a weekly dashboard.
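Segmentation can be as simple as substring matching at the edge; a sketch in which the bucket names are illustrative and the pattern list should grow as new crawlers appear:

```typescript
// Known AI user-agent substrings mapped to reporting buckets.
const AI_CRAWLER_PATTERNS: [string, string][] = [
  ["GPTBot", "openai-training"],
  ["OAI-SearchBot", "openai-search"],
  ["ChatGPT-User", "openai-user"],
  ["ClaudeBot", "anthropic"],
  ["Claude-User", "anthropic-user"],
  ["PerplexityBot", "perplexity"],
  ["Perplexity-User", "perplexity-user"],
  ["Google-Extended", "google-ai"],
  ["Applebot-Extended", "apple-ai"],
  ["Bytespider", "bytedance"],
  ["CCBot", "common-crawl"],
  ["Meta-ExternalAgent", "meta-ai"],
];

// Classify a request's User-Agent into an AI-crawler bucket, or null for
// human and classical-search traffic.
function classifyUserAgent(ua: string | null): string | null {
  if (!ua) return null;
  for (const [needle, bucket] of AI_CRAWLER_PATTERNS) {
    if (ua.includes(needle)) return bucket;
  }
  return null;
}
```

Log the bucket next to path, status and response size, and the weekly dashboard is a group-by away.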
Complement this with brand mention tracking. Once a week, run a fixed set of prompts across ChatGPT, Claude, Perplexity and Gemini, and record whether your brand is cited, in what position, and with what quote. Tools such as Profound, Peec AI and Otterly automate this work. A disciplined spreadsheet works equally well.
If a crawler fetches your Markdown endpoint but your brand never appears in the citation set, the content is reachable but not competitive. Go back to the editorial lever.
Common objections and honest answers
Will serving Markdown cannibalize my SEO traffic? No. Google indexes the HTML canonical. The Markdown sibling is a secondary representation advertised through rel="alternate", which tells Google not to treat it as duplicate content.
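The two declarations sit side by side in the same head; the canonical keeps Google on the HTML page while the alternate advertises the sibling (URLs here are placeholders):

```html
<link rel="canonical" href="https://example.com/blog/post" />
<link rel="alternate" type="text/markdown" href="/blog/post.md" />
```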
Is this all going to get abused? LLMs evaluate candidate sources on perceived trust signals, not on file format. Publishing Markdown does not make thin content citable. The investment protects you from the downside of being unreadable, without guaranteeing the upside of being cited.
Should I block AI crawlers to protect my content? It is a business decision, not a technical one. If your revenue depends on people reaching your site directly, blocking training crawlers while allowing retrieval crawlers is a reasonable middle ground. If your revenue depends on discovery, block nothing.
How long until I see results? The engineering layer takes effect within days, as soon as crawlers refetch. The editorial layer compounds over quarters. Expect measurable changes in AI citations within four to eight weeks of shipping both layers together.
The durable strategy
The GEO space is unstable. Every month brings a new proposed standard, a new crawler or a new shift in how one of the major LLMs weights citations. The durable strategy is not to chase each trend but to invest in the layer that every retrieval system needs: clean, well-structured, citation-rich content delivered in a format that is cheap to ingest.
The technical plumbing, namely Markdown siblings, content negotiation and llms.txt, is table stakes. The editorial work, namely named entities, direct quotes, verifiable statistics and authoritative citations, is the moat. Competitors can copy your infrastructure in a week. They cannot copy five years of substantive writing by named experts at your company.
Build the plumbing once. Then spend the rest of your time writing things worth citing.