How LLMs Really Read Your Website – And What Can Go Wrong
ChatGPT, Gemini, Perplexity and Claude have become the gatekeepers of the web. Sites that are not AI-readable simply do not exist for millions of users. We explain the 10 decisive factors – from Schema.org through robots.txt and llms.txt to hreflang – and how language models react to them.
Search engine optimization is nothing new. But since large language models (LLMs) became the primary information source for hundreds of millions of people, the rules have fundamentally changed. It is no longer just about landing on page 1 of Google – it is about whether an AI assistant actually understands your content, correctly classifies it, and ultimately cites it.
RobotCheck.coffee checks websites against a living rulebook that is updated weekly from the official documentation of the AI providers. This article explains the ten decisive factors in depth – ordered by their weight in the score: what is behind it technically, why it matters for LLMs – and what concrete step you can take immediately.
Schema.org Markup
What is this?
Schema.org is a collaborative vocabulary for structured data, founded by Google, Microsoft, Yahoo and Yandex. With JSON-LD, Microdata or RDFa you give HTML elements machine-readable meaning: article, product, FAQ, person.
Schema types like Article, BreadcrumbList, FAQPage, HowTo, Product or LocalBusiness are the semantic grammar of the modern web. No other factor weighs more heavily in the RobotCheck score.
Why LLMs care about it
Language models are excellent at processing text – but they have to interpret. Schema.org takes that work off their hands. When an LLM sees a page with correct Article schema, it knows instantly: author, date, publisher, main content – without guessing.
FAQPage markup is particularly valuable: questions and answers arrive as ready-made pairs that an LLM can extract and quote directly, much as search engines use them for featured snippets.
- No schema → model must infer context from prose (more error-prone)
- author field present → stronger E-E-A-T signals (Experience, Expertise, Authoritativeness, Trustworthiness)
- datePublished + dateModified → enables temporal classification
- Product pages with aggregateRating → shopping-capable AI answers
- Malformed JSON-LD → validator errors reduce crawler trust
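As a minimal sketch of what Article markup can look like, the following Python snippet builds a JSON-LD script tag. The helper name article_jsonld and all sample values are illustrative assumptions, not part of RobotCheck or of any provider's tooling. Generating the JSON with a serializer instead of by hand is one way to avoid the malformed JSON-LD mentioned above.

```python
import json
from datetime import date


def article_jsonld(headline: str, author: str, publisher: str,
                   published: date, modified: date) -> str:
    """Return a <script> tag containing Article markup as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "publisher": {"@type": "Organization", "name": publisher},
        # ISO 8601 dates make temporal classification possible.
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
    }
    # json.dumps always emits syntactically valid JSON; escaping "</"
    # keeps a stray "</script>" inside a value from closing the tag early.
    body = json.dumps(data, indent=2, ensure_ascii=False).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical sample values for illustration only.
print(article_jsonld("How LLMs Read Your Website", "Jane Doe",
                     "Example Publisher", date(2024, 1, 2), date(2024, 3, 4)))
```

The resulting tag would typically be placed in the head of the page it describes.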
robots.txt
What is this?
The robots.txt is a text file in the root directory of your domain. It speaks a language dating back to 1994 – older than Google – yet more relevant than ever. It tells crawlers which areas they may visit and which they may not.
Why LLMs care about it
Every major AI company has its own crawlers, often several with different jobs: ClaudeBot (Anthropic, training), Claude-SearchBot (Anthropic, search), GPTBot (OpenAI), PerplexityBot (Perplexity). Google-Extended is a special case: it is a robots.txt token that controls whether content may be used for Gemini, not a separate crawler. According to the providers, these bots respect robots.txt, so anyone who wants their content visible to AI systems must not lock them out.
The most common mistake: developers block all bots wholesale with User-agent: * / Disallow: / – inadvertently excluding every LLM crawler too. The result: the AI simply does not know the content. The second most common mistake: allowing only one of a provider's bots and forgetting the others – then the page makes it into training, but not into the AI assistant's live search results.
A robots.txt without explicit LLM bot rules is like a shop window with drawn curtains. – RobotCheck Analysis
- Blocked ClaudeBot/GPTBot → content does not appear in AI answers
- Only the training bot allowed, search bot forgotten → invisible in live answers
- Explicitly allowed paths → higher indexing probability
- Missing Sitemap: directive → crawlers do not find the sitemap automatically
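The two mistakes described above can be checked locally with Python's standard urllib.robotparser module. The snippet is a sketch: the sample robots.txt, the example.com URLs and the helper names are assumptions for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything is blocked except one training crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: ClaudeBot
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

# User-agent tokens named in this article.
AI_BOTS = ["ClaudeBot", "Claude-SearchBot", "GPTBot",
           "Google-Extended", "PerplexityBot"]


def _parse(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def ai_bot_access(robots_txt: str, url: str) -> dict[str, bool]:
    """Map each AI user agent to whether it may fetch the given URL."""
    parser = _parse(robots_txt)
    return {bot: parser.can_fetch(bot, url) for bot in AI_BOTS}


def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the URLs listed in Sitemap: directives (empty if none)."""
    return _parse(robots_txt).site_maps() or []


for bot, allowed in ai_bot_access(ROBOTS_TXT, "https://example.com/blog/post").items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

With the sample file, only ClaudeBot is allowed. Claude-SearchBot falls under the wildcard block, which is exactly the "training bot allowed, search bot forgotten" situation.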
llms.txt
What is this?
The llms.txt is the youngest standard on this list – a proposal by Jeremy Howard (Answer.AI) that is spreading rapidly. Like robots.txt it sits in the root directory and is a curated table of contents in Markdown format: what does this website offer, what are the most important pages, where do AI systems find the essence?
While robots.txt says where crawlers may go, llms.txt says what they will find there – in a form LLMs can process directly: an H1 title, a concise summary as a blockquote, annotated link lists to the core pages.
Why LLMs care about it
LLM systems work with a limited context window. Instead of crawling hundreds of subpages and weighing them up themselves, an llms.txt lets them jump straight to the substance. Anthropic, Perplexity and a growing number of tools already parse the file.
The standard is young and evolving – which is exactly why RobotCheck checks llms.txt against the current specification at llmstxt.org, not against a frozen snapshot. What is optional today may be recommended next month.
- No llms.txt → AI has to guess the site structure itself
- H1 + blockquote summary present → instant understanding of the offering
- Annotated links to core pages → AI cites the right pages
- Relative instead of absolute URLs → links unusable for external systems
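The checks in this list can be approximated with a few lines of Python. This is a rough sketch based only on the elements named above (H1 title, blockquote summary, annotated absolute links). It is not a validator for the full llmstxt.org specification, and the function name and sample content are made up for illustration.

```python
import re

# Markdown links of the form [label](url)
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def check_llms_txt(text: str) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("# "):
        problems.append("first non-empty line is not an H1 title")
    if not any(line.startswith(">") for line in lines):
        problems.append("no blockquote summary")
    links = LINK.findall(text)
    if not links:
        problems.append("no links to core pages")
    for _label, url in links:
        # Relative URLs cannot be resolved by a system that only has the file.
        if not url.startswith(("https://", "http://")):
            problems.append(f"relative URL: {url}")
    return problems


# Hypothetical llms.txt content for illustration.
SAMPLE = """# Example Shop

> Handmade ceramics, shipped across Europe.

## Core pages

- [Catalog](https://example.com/catalog): all products with prices
"""
print(check_llms_txt(SAMPLE))
```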
Sitemap XML
What is this?
An XML sitemap is the table of contents of your website – machine-readable, structured, with metadata about every URL: change date, frequency, priority. It is the most direct way to tell a crawler: here is everything I have.
Why LLMs care about it
Crawler-based AI systems (especially Perplexity, which crawls live) actively use sitemaps to discover new content. If the sitemap is missing or outdated, new blog posts or product pages are simply overlooked.
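Particularly important: the lastmod value. AI answer engines tend to favor fresh content, especially for time-sensitive queries. A correctly set date signals currency and increases the chance of being cited; it should reflect the last real content change, not the time the sitemap was generated.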
- Missing sitemap → crawlers discover content only via internal links
- Outdated lastmod data → content is classified as stale
- Image sitemap present → multimodal models index images better
- Sitemap index → enables selective crawling of specific sections
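A sitemap with lastmod values can be generated with the Python standard library. The sketch below assumes you already know each page's last real modification date; the function name, the URLs and the dates are placeholders. It leaves out changefreq and priority to stay short.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: dict[str, date]) -> bytes:
    """Build a sitemap; pages maps absolute URL -> date of last content change."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, lastmod in sorted(pages.items()):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # W3C date format (YYYY-MM-DD), as used by the sitemap protocol.
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical pages for illustration.
xml_bytes = build_sitemap({
    "https://example.com/": date(2024, 5, 1),
    "https://example.com/blog/new-post": date(2024, 6, 12),
})
print(xml_bytes.decode("utf-8"))
```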
Open Graph Meta Tags
What is this?
Open Graph (OG) was originally developed by Facebook to display links in social networks. Today OG tags are standard across all platforms and AI tools. The most important tags: og:title, og:description, og:image, og:type.
Why LLMs care about it
When an LLM system fetches a URL, it reads the head section first. OG tags are a fast, reliable source for title and summary – a precise og:description is often the basis for source previews in AI answers, and usually more accurate than the body text.
Perplexity in particular uses OG data heavily for source links. A missing og:description leads to weak source previews – and therefore fewer clicks. Open Graph does not make a page AI-readable, but its absence can make an otherwise good page look weaker.
- Missing OG tags → title and description are guessed from the DOM
- Unique og:image → boosts social sharing and multimodal visibility
- og:type: article with article:published_time → more precise temporal classification
- Twitter card tags as well → full coverage of all major platforms
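To see which of the four core tags a page is missing, a small parser built on Python's html.parser is enough. This is an illustrative sketch: the class and function names are invented here, and it only looks at meta elements that use the property attribute.

```python
from html.parser import HTMLParser

REQUIRED = ("og:title", "og:description", "og:image", "og:type")


class OpenGraphParser(HTMLParser):
    """Collect og:* and article:* meta properties from an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith(("og:", "article:")) and attr.get("content"):
            self.tags[prop] = attr["content"]


def missing_og_tags(html: str) -> list[str]:
    """Return the required Open Graph tags that are absent or empty."""
    parser = OpenGraphParser()
    parser.feed(html)
    return [name for name in REQUIRED if name not in parser.tags]


# Hypothetical head section with two of the four tags.
HEAD = """<head>
<meta property="og:title" content="How LLMs Read Your Website">
<meta property="og:type" content="article">
</head>"""
print(missing_og_tags(HEAD))
```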
HTML Structure & Semantics
What is this?
HTML5 brought semantic elements: article, section, nav, header, footer, main, aside. They describe not only the appearance but the meaning of the content.
In practice this is the most common weak point of existing websites: many sites were built with div and span, without any semantic structure.
Why LLMs care about it
For a human, bad HTML is a visual problem. For an LLM it is a conceptual one. Without semantic tags the model has to guess what the main content is – and what is navigation, advertising or footer.
Modern LLM crawlers use main as the primary content container. If that tag is missing, the entire body is processed – including menu, cookie banner and footer links.
Wrap the primary content in main, make sure you have exactly one h1 per page, and use article for self-contained pieces of content.
- No main → crawler processes the entire body (incl. navigation, footer)
- Multiple h1 → page topic unclear to the model
- Flat heading hierarchy → article structure not recognizable
- Nested div soup → higher error rate in content extraction
- Correct landmark roles → assistive technologies AND AI benefit equally
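The structural rules above lend themselves to a simple automated check. The following sketch counts main and h1 elements and reports skipped heading levels. The skipped-level rule is a common lint convention added here as an assumption, and the names StructureAudit and audit_structure are hypothetical.

```python
from html.parser import HTMLParser


class StructureAudit(HTMLParser):
    """Record how many main elements and which heading levels a page has."""

    def __init__(self):
        super().__init__()
        self.main_count = 0
        self.headings = []  # heading levels in document order, e.g. [1, 2, 2, 3]

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_count += 1
        elif len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.headings.append(int(tag[1]))


def audit_structure(html: str) -> list[str]:
    """Return a list of structural problems; empty means none were found."""
    audit = StructureAudit()
    audit.feed(html)
    problems = []
    if audit.main_count != 1:
        problems.append(f"expected exactly one <main>, found {audit.main_count}")
    h1_count = audit.headings.count(1)
    if h1_count != 1:
        problems.append(f"expected exactly one <h1>, found {h1_count}")
    for prev, cur in zip(audit.headings, audit.headings[1:]):
        if cur > prev + 1:
            problems.append(f"heading level jumps from h{prev} to h{cur}")
    return problems


# Hypothetical "div soup" page for illustration.
print(audit_structure("<body><div><h1>A</h1><h1>B</h1><h4>C</h4></div></body>"))
```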
Alt Text for Images
What is this?
The alt attribute in the img tag is a text description of an image – developed primarily for screen readers. An empty alt signals: this image is decorative. A meaningful alt text describes the content and context of the image.
Why LLMs care about it
Multimodal models can see images directly, but when crawling text sources LLMs rely almost exclusively on alt text. An image without alt text is simply invisible to a text-based crawling system.
Well-written alt text improves a page's semantic field: it adds keywords that fit the topic naturally – without keyword stuffing.
- Missing alt text → images are ignored during text crawling; content gaps appear
- Descriptive alt text → expanded semantic field of the page
- Keyword-stuffed alt text → may be flagged as spam, possible negative impact
- Alt text for infographics → critical: data in graphics otherwise inaccessible to LLMs
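Finding images without alt text is easy to automate. The sketch below separates images with no alt attribute at all from images with an intentionally empty one, since the two mean different things. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class AltAudit(HTMLParser):
    """Sort img elements by the state of their alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []     # no alt attribute at all
        self.decorative = []  # alt present but empty: marked as decorative

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src") or "(no src)"
        if "alt" not in attr:
            self.missing.append(src)
        elif not (attr["alt"] or "").strip():
            self.decorative.append(src)


def audit_alt_text(html: str) -> tuple[list[str], list[str]]:
    """Return (images missing alt, images with empty alt)."""
    audit = AltAudit()
    audit.feed(html)
    return audit.missing, audit.decorative


# Hypothetical markup for illustration.
PAGE = ('<img src="chart.png">'
        '<img src="divider.png" alt="">'
        '<img src="sales.png" alt="Bar chart of quarterly sales">')
print(audit_alt_text(PAGE))
```

Images in the first list are the ones a text-based crawler cannot interpret. The second list is worth a manual look to confirm the images really are decorative.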
Performance & Core Web Vitals
What is this?
Performance means: how fast and stable does your page load? With the Core Web Vitals (LCP, CLS, INP), Google has defined concrete metrics. LCP measures loading speed, CLS visual stability, and INP responsiveness to interaction; INP replaced the older FID metric in March 2024.
Why LLMs care about it
For LLM crawlers performance has a different dimension: headless browsers and HTTP clients do not wait forever. Pages that take longer than 5–10 seconds to load are aborted by some crawlers. JavaScript-heavy apps without server-side rendering are invisible to many crawlers.
In addition: Google uses performance as a ranking signal. Where AI answer engines draw on search engine results as a base, performance feeds indirectly into AI visibility.
- Purely client-side JS rendering → content often not indexed
- Slow TTFB → crawler timeout more likely
- Large, uncompressed images → lower crawl frequency
- Good Core Web Vitals → indirect boost through better Google ranking
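One rough way to spot purely client-side rendering is to look at how much visible text the raw HTML contains before any JavaScript runs. The heuristic below is only a sketch: the 200-character threshold is an arbitrary assumption, the names are hypothetical, and it says nothing about load time or Core Web Vitals.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that is present in the HTML without executing scripts."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_client_rendered(html: str, min_chars: int = 200) -> bool:
    """True if the raw HTML carries less visible text than min_chars."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars


# Hypothetical single-page-app shell: the content only appears after JS runs.
SHELL = ('<html><head><title>App</title><script>window.boot = 1;</script></head>'
         '<body><div id="root"></div><script src="/app.js"></script></body></html>')
print(looks_client_rendered(SHELL))
```

Feeding the function the response body of a plain HTTP request (without a headless browser) approximates what a non-rendering crawler would see.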
Canonical Tags
What is this?
A canonical tag tells crawlers: this is the authoritative, original version of this page. It is the antidote to duplicate content – the same content across different URLs.
Why LLMs care about it
LLMs are trained on crawl datasets. If the same content is reachable under five different URLs, it is crawled five times, and none of the copies gets full authority. Canonical tags concentrate those signals on a single URL.
Especially important for blogs with tags, categories and archive pages: without canonicals, every archive page dilutes the authority of the original article.
- Missing canonical → authority spreads across duplicates
- Self-referencing canonical → correct practice, confirms originality
- Canonical on HTTPS → eliminates the HTTP/HTTPS duplicate problem
- Pagination without canonical → pages 2, 3, ... treated as separate documents
- Hreflang + canonical combined → multilingual sites correctly structured
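A quick way to inspect a page's canonical tag is to extract it and compare it with the URL the page was fetched from. The sketch below reports facts rather than verdicts, because a canonical that points elsewhere is correct for a duplicate URL. The names and example URLs are assumptions, and the comparison ignores only a trailing slash.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class CanonicalParser(HTMLParser):
    """Collect the href of every link element with rel=canonical."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonical_report(html: str, page_url: str) -> dict:
    """Summarize the canonical tag of a page fetched from page_url."""
    parser = CanonicalParser()
    parser.feed(html)
    # More than one canonical is ambiguous, so it is treated like none.
    href = parser.canonicals[0] if len(parser.canonicals) == 1 else None
    return {
        "count": len(parser.canonicals),
        "canonical": href,
        "https": href is not None and urlsplit(href).scheme == "https",
        "self_referencing": href is not None
        and href.rstrip("/") == page_url.rstrip("/"),
    }


# Hypothetical page reached through a tracking URL.
HEAD = '<head><link rel="canonical" href="https://example.com/post/"></head>'
print(canonical_report(HEAD, "https://example.com/post?utm_source=newsletter"))
```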
hreflang
What is this?
The hreflang attribute tells crawlers which language and region a page is intended for – and where its counterparts in other languages live. It is written as <link rel="alternate" hreflang="en" href="..."> in the head, one line per language version, plus ideally an hreflang="x-default" as fallback.
Why LLMs care about it
Multilingual sites without hreflang confuse crawlers: they cannot reliably tell that the German and English versions are the same page – and in the worst case treat them as duplicates that dilute each other's authority. With correct hreflang, an AI system knows which language version to serve a user.
The x-default entry is the safety anchor: it tells the system which version to show when none of the listed languages matches the user's request. Without it, crawlers guess – usually to the disadvantage of smaller languages.
- No hreflang tags → language versions look like competing duplicates
- At least two languages linked → clear signal for an international site
- hreflang='x-default' set → reliable fallback for unknown locales
- hreflang + canonical combined → each language version keeps its own authority
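The first three points of this list can be checked mechanically. The following sketch collects the hreflang links from a page's head and reports a missing x-default or fewer than two language versions. The names and URLs are illustrative assumptions, and the check does not verify that the linked pages reference each other in return.

```python
from html.parser import HTMLParser


class HreflangParser(HTMLParser):
    """Collect link rel=alternate elements that carry an hreflang value."""

    def __init__(self):
        super().__init__()
        self.alternates = {}  # hreflang value -> href

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "alternate" in rel and attr.get("hreflang"):
            self.alternates[attr["hreflang"].lower()] = attr.get("href")


def check_hreflang(html: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    parser = HreflangParser()
    parser.feed(html)
    problems = []
    languages = set(parser.alternates) - {"x-default"}
    if len(languages) < 2:
        problems.append("fewer than two language versions linked")
    if "x-default" not in parser.alternates:
        problems.append("no x-default fallback")
    return problems


# Hypothetical head section of a bilingual site.
HEAD = ('<link rel="alternate" hreflang="en" href="https://example.com/en/">'
        '<link rel="alternate" hreflang="de" href="https://example.com/de/">'
        '<link rel="alternate" hreflang="x-default" href="https://example.com/">')
print(check_hreflang(HEAD))
```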
Summary: The 10 Factors at a Glance
| # | Factor | Importance for LLMs |
|---|---|---|
| 1 | Schema.org | Semantic interpretation – strongest factor |
| 2 | robots.txt | Access control for all AI crawlers |
| 3 | llms.txt | Curated table of contents for AI systems |
| 4 | Sitemap XML | Complete content discovery |
| 5 | Open Graph | Fast preview in the head section |
| 6 | HTML structure | Content extraction and context |
| 7 | Alt text | Image understanding without multimodal processing |
| 8 | Performance | Crawler reachability and ranking |
| 9 | Canonical tags | Authority concentration |
| 10 | hreflang | Correct mapping of language versions |
The AI readability of a website is no longer a luxury – it is a basic prerequisite for visibility in the new era of information search. And it is a moving target: the rules RobotCheck checks against are updated weekly from the official sources of the AI providers.
Test your own website at robotcheck.coffee/check – free, no signup, with instant results.