What is machine readability?

Machine readability describes how well AI tools and search engines can understand the content of a website. It covers robots.txt, sitemap, Schema.org, Open Graph, HTML structure, alt text, performance and canonical tags.

Is RobotCheck.coffee free?

Yes, RobotCheck.coffee is completely free and requires no signup. Enter a domain and get the machine readability score right away.

Which factors does RobotCheck.coffee check?

RobotCheck.coffee analyzes 8 factors: robots.txt, XML sitemap, Schema.org markup, Open Graph tags, HTML structure, alt text for images, loading speed and canonical tags.

Why is AI readability important?

ChatGPT, Perplexity, Gemini and other LLM-based tools crawl websites much like search engines do. A poorly structured website is ignored or misunderstood by AI tools, which costs visibility in AI-driven search.


Deep Dive · LLM & SEO · 2026

How LLMs Really Read Your Website – And What Can Go Wrong

ChatGPT, Gemini, Perplexity and Claude have become the gatekeepers of the web. Those who are not AI-readable simply do not exist for millions of users. We explain the 10 decisive factors – from Schema.org through robots.txt and llms.txt to hreflang – and how language models react to them.

By Andreas · May 2026 · 12 min read

Search engine optimization is nothing new. But since large language models (LLMs) became the primary information source for hundreds of millions of people, the rules have fundamentally changed. It is no longer just about landing on page 1 of Google – it is about whether an AI assistant actually understands your content, correctly classifies it, and ultimately cites it.

RobotCheck.coffee checks websites against a living rulebook that is updated weekly from the official documentation of the AI providers. This article explains the ten decisive factors in depth – ordered by their weight in the score: what is behind it technically, why it matters for LLMs – and what concrete step you can take immediately.

#01

Schema.org Markup

What is this?

Schema.org is a collaborative vocabulary for structured data, founded by Google, Microsoft, Yahoo and Yandex. With JSON-LD, Microdata or RDFa you give HTML elements machine-readable meaning: article, product, FAQ, person.

Schema types like Article, BreadcrumbList, FAQPage, HowTo, Product or LocalBusiness are the semantic grammar of the modern web. No other factor weighs more heavily in the RobotCheck score.

Why LLMs care about it

Language models are excellent at processing text – but they have to interpret. Schema.org takes that work off their hands. When an LLM sees a page with correct Article schema, it knows instantly: author, date, publisher, main content – without guessing.

FAQPage markup is especially valuable: LLMs can extract the question-and-answer pairs directly instead of reconstructing them from prose, much as search engines use the same markup for rich results.

🤖 How LLMs react

No schema → model must infer context from prose (more error-prone)
author field present → better E-E-A-T rating (Experience, Expertise, Authoritativeness, Trustworthiness)
datePublished + dateModified → enables temporal classification
Product pages with aggregateRating → shopping-capable AI answers
Malformed JSON-LD → validator errors reduce crawler trust
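
To make the Article case concrete, here is a minimal Python sketch that builds a JSON-LD block and checks it for the fields named above. The headline, names and dates are placeholders, and REQUIRED_ARTICLE_FIELDS is a checklist assumed for this example, not a list mandated by Schema.org or used by RobotCheck.

```python
import json

# Illustrative checklist only (an assumption of this sketch).
REQUIRED_ARTICLE_FIELDS = ("headline", "author", "datePublished", "dateModified")


def build_article_jsonld(headline, author_name, publisher_name, published, modified):
    """Return a schema.org Article description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "publisher": {"@type": "Organization", "name": publisher_name},
        "datePublished": published,
        "dateModified": modified,
    }


def missing_fields(raw_jsonld):
    """List checklist fields that are absent or empty.

    Assumes a single top-level JSON object.
    Raises json.JSONDecodeError if the JSON-LD is malformed.
    """
    data = json.loads(raw_jsonld)
    return [field for field in REQUIRED_ARTICLE_FIELDS if not data.get(field)]


# Placeholder values, not taken from a real page.
article = build_article_jsonld(
    "Example headline", "Jane Doe", "Example Publisher", "2026-01-15", "2026-02-01"
)
script_tag = '<script type="application/ld+json">' + json.dumps(article) + "</script>"
```

The resulting script_tag string is what would go into the page head. Running missing_fields over the JSON-LD before publishing catches both empty fields and syntax errors early.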

#02

robots.txt

What is this?

The robots.txt is a text file in the root directory of your domain. It speaks a language dating back to 1994 – older than Google – yet more relevant than ever. It tells crawlers which areas they may visit and which they may not.

Why LLMs care about it

Every major AI company has its own crawlers or user-agent tokens, and often several with different jobs: ClaudeBot (Anthropic, training), Claude-SearchBot (Anthropic, search), GPTBot (OpenAI), Google-Extended (Google's robots.txt control token for Gemini), PerplexityBot. Their operators state that they respect robots.txt, so anyone who wants their content visible to AI systems must not lock these bots out.

The most common mistake: developers block all bots wholesale with User-agent: * / Disallow: / – inadvertently excluding every LLM crawler too. The result: the AI simply does not know the content. The second most common mistake: allowing only one of a provider's bots and forgetting the others – then the page makes it into training, but not into the AI assistant's live search results.

A robots.txt without explicit LLM bot rules is like a shop window with drawn curtains. – RobotCheck Analysis

🤖 How LLMs react

Blocked ClaudeBot/GPTBot → content does not appear in AI answers
Only the training bot allowed, search bot forgotten → invisible in live answers
Explicitly allowed paths → higher indexing probability
Missing Sitemap: directive → crawlers do not find the sitemap automatically
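
The mistakes described above can be checked with Python's standard urllib.robotparser. The sketch below is only an approximation: the bot list is copied from this article, example.com is a placeholder, and real crawlers may match user agents differently than the standard-library parser does.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens as named in this article.
AI_BOTS = ["ClaudeBot", "Claude-SearchBot", "GPTBot", "Google-Extended", "PerplexityBot"]


def audit_robots(robots_txt, url="https://example.com/", bots=AI_BOTS):
    """Return (blocked_bots, sitemap_urls) for a robots.txt given as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [bot for bot in bots if not parser.can_fetch(bot, url)]
    sitemaps = parser.site_maps() or []  # site_maps() needs Python 3.8+
    return blocked, sitemaps


# The wholesale block: shuts out every crawler, AI bots included.
WHOLESALE_BLOCK = """\
User-agent: *
Disallow: /
"""

# Only the training bot is allowed; the search bot is forgotten.
TRAINING_ONLY = """\
User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /
"""

# Explicit rules for the AI bots plus a Sitemap: directive.
EXPLICIT_ALLOW = """\
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: GPTBot
User-agent: Google-Extended
User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""
```

Calling audit_robots(TRAINING_ONLY) reports Claude-SearchBot among the blocked bots, which is exactly the "training yes, live search no" situation.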

#03

llms.txt

What is this?

The llms.txt is the youngest standard on this list – a proposal by Jeremy Howard (Answer.AI) that is spreading rapidly. Like robots.txt it sits in the root directory and is a curated table of contents in Markdown format: what does this website offer, what are the most important pages, where do AI systems find the essence?

While robots.txt says where crawlers may go, llms.txt says what they will find there – in a form LLMs can process directly: an H1 title, a concise summary as a blockquote, annotated link lists to the core pages.

Why LLMs care about it

LLM systems work with a limited context window. Instead of crawling hundreds of subpages and weighing them up themselves, an llms.txt lets them jump straight to the substance. Anthropic, Perplexity and a growing number of tools already parse the file.

The standard is young and evolving – which is exactly why RobotCheck checks llms.txt against the current specification at llmstxt.org, not against a frozen snapshot. What is optional today may be recommended next month.

🤖 How LLMs react

No llms.txt → AI has to guess the site structure itself
H1 + blockquote summary present → instant understanding of the offering
Annotated links to core pages → AI cites the right pages
Relative instead of absolute URLs → links unusable for external systems
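
A rough structural check for the points above might look like the following Python sketch. It tests only what this section mentions (H1 title, blockquote summary, links with absolute URLs). It does not implement the full specification at llmstxt.org, and the sample content is invented.

```python
import re

# Markdown links of the form [label](url)
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def check_llms_txt(text):
    """Return a list of problems; an empty list means this sketch found none."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    problems = []
    if not lines or not lines[0].startswith("# "):
        problems.append("first line is not an H1 title")
    if not any(line.startswith("> ") for line in lines):
        problems.append("no blockquote summary")
    urls = [match.group(2) for line in lines for match in LINK.finditer(line)]
    if not urls:
        problems.append("no links to core pages")
    for url in urls:
        if not url.startswith(("https://", "http://")):
            problems.append(f"relative URL: {url}")
    return problems


# Invented sample file for a hypothetical shop.
GOOD = """\
# Example Shop

> Hand-roasted coffee, shipped across Europe.

## Core pages

- [Catalogue](https://example.com/catalogue): all current roasts
- [Shipping](https://example.com/shipping): prices and delivery times
"""

BAD = """\
Example Shop

- [Catalogue](/catalogue): all current roasts
"""
```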

#04

Sitemap XML

What is this?

An XML sitemap is the table of contents of your website – machine-readable, structured, with metadata about every URL: change date, frequency, priority. It is the most direct way to tell a crawler: here is everything I have.

Why LLMs care about it

Crawler-based AI systems (especially Perplexity, which crawls live) actively use sitemaps to discover new content. If the sitemap is missing or outdated, new blog posts or product pages are simply overlooked.

Particularly important: the lastmod value. LLMs prefer fresh content. A correctly set date signals currency and increases the chance of being cited in time-sensitive queries.

For WordPress: plugins like Yoast SEO or RankMath generate sitemaps automatically. For static sites (Astro, Next.js) there are sitemap plugins for every framework.

🤖 How LLMs react

Missing sitemap → crawlers discover content only via internal links
Outdated lastmod data → content is classified as stale
Image sitemap present → multimodal models index images better
Sitemap index → enables selective crawling of specific sections
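
To spot missing or outdated lastmod values, a sitemap can be scanned with Python's standard XML parser. This is a sketch under assumptions: the 365-day threshold is arbitrary, the URLs are placeholders, and lastmod is expected to start with a full YYYY-MM-DD date.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def stale_urls(sitemap_xml, today, max_age_days=365):
    """Return (url, reason) pairs for entries whose lastmod is missing or old."""
    root = ET.fromstring(sitemap_xml)
    cutoff = today - timedelta(days=max_age_days)
    findings = []
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if not lastmod:
            findings.append((loc, "missing lastmod"))
        # Only the date part is compared; a time suffix is ignored.
        elif date.fromisoformat(lastmod[:10]) < cutoff:
            findings.append((loc, "lastmod older than cutoff"))
    return findings


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/new</loc><lastmod>2026-04-30</lastmod></url>
  <url><loc>https://example.com/old</loc><lastmod>2023-01-10T08:00:00+00:00</lastmod></url>
  <url><loc>https://example.com/no-date</loc></url>
</urlset>"""

report = stale_urls(SAMPLE, today=date(2026, 5, 15))
```

Passing today explicitly keeps the check reproducible; in a real script date.today() would be the natural argument.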

#05

Open Graph Meta Tags

What is this?

Open Graph (OG) was originally developed by Facebook to display links in social networks. Today OG tags are standard across all platforms and AI tools. The most important tags: og:title, og:description, og:image, og:type.

Why LLMs care about it

When an LLM system fetches a URL, it reads the head section first. OG tags are a fast, reliable source for title and summary – a precise og:description is often the basis for source previews in AI answers, and usually more accurate than the body text.

Perplexity in particular uses OG data heavily for source links. A missing og:description leads to weak source previews – and therefore fewer clicks. Open Graph does not make a page AI-readable, but its absence can make an otherwise good page look weaker.

🤖 How LLMs react

Missing OG tags → title and description are guessed from the DOM
Unique og:image → boosts social sharing and multimodal visibility
og:type: article with article:published_time → more precise temporal classification
Twitter card tags as well → full coverage of all major platforms
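
A short Python sketch shows how a fetcher can read OG tags from the head and which of the four tags named above are missing. The RECOMMENDED tuple mirrors this article's list; the HTML snippets are invented.

```python
from html.parser import HTMLParser

# The four tags this article calls most important.
RECOMMENDED = ("og:title", "og:description", "og:image", "og:type")


class OpenGraphCollector(HTMLParser):
    """Collects <meta property="og:..."> and <meta property="article:..."> tags."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith(("og:", "article:")) and attr.get("content"):
            self.tags[prop] = attr["content"]


def missing_og_tags(html):
    """Return the recommended OG tags that the page does not set."""
    collector = OpenGraphCollector()
    collector.feed(html)
    return [name for name in RECOMMENDED if name not in collector.tags]


COMPLETE_HEAD = """
<head>
  <meta property="og:title" content="Example title">
  <meta property="og:description" content="One precise sentence about the page.">
  <meta property="og:image" content="https://example.com/cover.png">
  <meta property="og:type" content="article">
  <meta property="article:published_time" content="2026-01-15T09:00:00+00:00">
</head>
"""

PARTIAL_HEAD = '<head><meta property="og:title" content="Example title" /></head>'
```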

#06

HTML Structure & Semantics

What is this?

HTML5 brought semantic elements: article, section, nav, header, footer, main, aside. They describe not only the appearance but the meaning of the content.

In practice this is the most common weak point of existing websites: many sites were built with div and span, without any semantic structure.

Why LLMs care about it

For a human, bad HTML is a visual problem. For an LLM it is a conceptual one. Without semantic tags the model has to guess what the main content is – and what is navigation, advertising or footer.

Modern LLM crawlers use main as the primary content container. If that tag is missing, the entire body is processed – including menu, cookie banner and footer links.

Fastest fix: wrap your main content in main, make sure you have exactly one h1 per page, and use article for self-contained pieces of content.

🤖 How LLMs react

No main → crawler processes the entire body (incl. navigation, footer)
Multiple h1 → page topic unclear to the model
Flat heading hierarchy → article structure not recognizable
Nested div soup → higher error rate in content extraction
Correct landmark roles → assistive technologies AND AI benefit equally
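
The three structural rules from this section (one main, exactly one h1, no skipped heading levels) can be checked with a few lines of Python. This is a simplified sketch that looks only at tag names; it is not how any particular crawler extracts content.

```python
from html.parser import HTMLParser


class StructureAudit(HTMLParser):
    """Counts <main> elements and records heading levels in document order."""

    def __init__(self):
        super().__init__()
        self.main_count = 0
        self.heading_levels = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_count += 1
        elif len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.heading_levels.append(int(tag[1]))


def audit_structure(html):
    """Return a list of structural problems found in the markup."""
    audit = StructureAudit()
    audit.feed(html)
    problems = []
    if audit.main_count != 1:
        problems.append(f"expected one <main>, found {audit.main_count}")
    h1_count = audit.heading_levels.count(1)
    if h1_count != 1:
        problems.append(f"expected one <h1>, found {h1_count}")
    # A heading may go at most one level deeper than its predecessor.
    for prev, cur in zip(audit.heading_levels, audit.heading_levels[1:]):
        if cur > prev + 1:
            problems.append(f"heading jumps from h{prev} to h{cur}")
    return problems


SEMANTIC = "<body><nav>Menu</nav><main><article><h1>Title</h1><h2>Part</h2><h3>Detail</h3></article></main></body>"
DIV_SOUP = "<body><div><h1>Title</h1><h1>Another</h1><h4>Detail</h4></div></body>"
```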

#07

Alt Text for Images

What is this?

The alt attribute in the img tag is a text description of an image – developed primarily for screen readers. An empty alt signals: this image is decorative. A meaningful alt text describes the content and context of the image.

Why LLMs care about it

Multimodal models can see images directly, but when crawling text sources LLMs rely almost exclusively on alt text. An image without alt text is simply invisible to a text-based crawling system.

Well-written alt text improves a page's semantic field: it adds keywords that fit the topic naturally – without keyword stuffing.

🤖 How LLMs react

Missing alt text → images are ignored during text crawling; content gaps appear
Descriptive alt text → expanded semantic field of the page
Keyword-stuffed alt text → flagged as spam, possible negative impact
Alt text for infographics → critical: data in graphics otherwise inaccessible to LLMs
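
The distinction between a missing alt attribute, an intentionally empty one and a descriptive one can be audited as in the following Python sketch. File names and alt texts are invented for illustration.

```python
from html.parser import HTMLParser


class AltAudit(HTMLParser):
    """Sorts <img> tags by the state of their alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []     # no alt attribute at all
        self.decorative = []  # alt="" marks the image as decorative
        self.described = {}   # src -> alt text

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src") or ""
        if "alt" not in attr:
            self.missing.append(src)
        elif not (attr["alt"] or "").strip():
            self.decorative.append(src)
        else:
            self.described[src] = attr["alt"].strip()


def audit_alt(html):
    """Parse the markup and return the filled-in AltAudit instance."""
    audit = AltAudit()
    audit.feed(html)
    return audit


PAGE = """
<img src="chart.png">
<img src="divider.svg" alt="">
<img src="roast.jpg" alt="Bar chart of roast levels by sales share">
"""

result = audit_alt(PAGE)
```

Only the entries in result.missing need attention: decorative images with an empty alt are correct as they are.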

#08

Performance & Core Web Vitals

What is this?

Performance means: how fast and stable does your page load? With the Core Web Vitals (LCP, CLS, INP) Google has defined concrete metrics. LCP measures loading speed, CLS visual stability, and INP interactivity (INP replaced the older FID metric in March 2024).

Why LLMs care about it

For LLM crawlers performance has a different dimension: headless browsers and HTTP clients do not wait forever. Pages that take longer than 5–10 seconds to load are aborted by some crawlers. JavaScript-heavy apps without server-side rendering are invisible to many crawlers.

In addition, Google uses performance as a ranking signal. To the extent that AI answer engines such as Perplexity build on search-engine results, performance feeds indirectly into AI visibility.

Server-Side Rendering (SSR) or Static Site Generation (SSG) are the most effective levers for LLM crawlers. Next.js, Astro, Nuxt – all support SSR out of the box.

🤖 How LLMs react

Purely client-side JS rendering → content often not indexed
Slow TTFB → crawler timeout more likely
Large, uncompressed images → lower crawl frequency
Good Core Web Vitals → indirect boost through better Google ranking
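
Whether a page depends on client-side rendering can be estimated by looking at the raw HTML the server sends, before any JavaScript runs. The following Python sketch counts the words outside script, style and template elements. The 50-word threshold is an arbitrary assumption, and the heuristic is deliberately crude.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that is present in the markup without running JavaScript."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags carry no text and must not open a skip region

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_client_rendered(raw_html, min_words=50):
    """True if the raw HTML contains fewer than min_words words of text."""
    parser = VisibleText()
    parser.feed(raw_html)
    words = sum(len(chunk.split()) for chunk in parser.chunks)
    return words < min_words


# A typical single-page-app shell: the content arrives only via JavaScript.
SPA_SHELL = (
    '<html><head><title>App</title><script src="/bundle.js"></script></head>'
    '<body><div id="root"></div><script>window.__DATA__ = {};</script></body></html>'
)

# A server-rendered page: the text is already in the response.
SSR_PAGE = "<html><body><main><p>" + "word " * 60 + "</p></main></body></html>"
```

A page flagged by looks_client_rendered is a candidate for SSR or SSG, since a crawler that does not execute JavaScript sees little more than the empty shell.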

#09

Canonical Tags

What is this?

A canonical tag tells crawlers: this is the authoritative, original version of this page. It is the antidote to duplicate content – the same content across different URLs.

Why LLMs care about it

LLMs train on crawl datasets. If the same information exists on 5 different URLs, it is crawled 5 times, but no single URL gets full authority. Canonical tags consolidate the link signals (the "link juice") on one URL.

Especially important for blogs with tags, categories and archive pages: without canonicals, every archive page dilutes the authority of the original article.

🤖 How LLMs react

Missing canonical → authority spreads across duplicates
Self-referencing canonical → correct practice, confirms originality
Canonical on HTTPS → eliminates the HTTP/HTTPS duplicate problem
Pagination without canonical → pages 2, 3, ... treated as separate documents
Hreflang + canonical combined → multilingual sites correctly structured
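
The cases in the list above can be told apart with a small Python check. This sketch compares host and path only and ignores query strings, which is a simplification; the URLs are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical">."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and attr.get("href"):
            self.canonicals.append(attr["href"])


def check_canonical(html, page_url):
    """Classify the canonical setup of a page as a short status string."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonicals:
        return "missing canonical"
    if len(finder.canonicals) > 1:
        return "multiple canonicals"
    target = urlsplit(finder.canonicals[0])
    if target.scheme != "https":
        return "canonical is not an absolute HTTPS URL"
    page = urlsplit(page_url)
    same = (target.netloc.lower(), target.path.rstrip("/")) == (
        page.netloc.lower(),
        page.path.rstrip("/"),
    )
    return "self-referencing" if same else f"points to {finder.canonicals[0]}"


HEAD = '<head><link rel="canonical" href="https://example.com/blog/post"></head>'
```

Called with the article URL plus a tracking parameter, check_canonical(HEAD, "https://example.com/blog/post?utm_source=x") still reports "self-referencing". For an archive page carrying the same tag it reports the article as the target.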

#10

hreflang

What is this?

The hreflang attribute tells crawlers which language and region a page is intended for – and where its counterparts in other languages live. It is written as <link rel="alternate" hreflang="en" href="..."> in the head, one line per language version, plus ideally an hreflang="x-default" as fallback.

Why LLMs care about it

Multilingual sites without hreflang confuse crawlers: they cannot reliably tell that the German and English versions are the same page – and in the worst case treat them as duplicates that dilute each other's authority. With correct hreflang, an AI system knows which language version to serve a user.

The x-default entry is the safety anchor: it tells the system which version to show when none of the listed languages matches the user's request. Without it, crawlers guess – usually to the disadvantage of smaller languages.

🤖 How LLMs react

No hreflang tags → language versions look like competing duplicates
At least two languages linked → clear signal for an international site
hreflang="x-default" set → reliable fallback for unknown locales
hreflang + canonical combined → each language version keeps its own authority
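
The first three points of this list translate into a simple Python check on the head section. The sketch does not verify that the language versions link back to each other, and the URLs are placeholders.

```python
from html.parser import HTMLParser


class HreflangCollector(HTMLParser):
    """Collects <link rel="alternate" hreflang="..."> entries as code -> URL."""

    def __init__(self):
        super().__init__()
        self.alternates = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "alternate" in rel and attr.get("hreflang") and attr.get("href"):
            self.alternates[attr["hreflang"].lower()] = attr["href"]


def check_hreflang(html):
    """Return a list of hreflang problems; empty means none were found."""
    collector = HreflangCollector()
    collector.feed(html)
    languages = [code for code in collector.alternates if code != "x-default"]
    problems = []
    if len(languages) < 2:
        problems.append("fewer than two language versions linked")
    if "x-default" not in collector.alternates:
        problems.append("no x-default fallback")
    return problems


MULTILINGUAL_HEAD = """
<head>
  <link rel="alternate" hreflang="en" href="https://example.com/en/">
  <link rel="alternate" hreflang="de" href="https://example.com/de/">
  <link rel="alternate" hreflang="x-default" href="https://example.com/">
</head>
"""

SINGLE_HEAD = '<head><link rel="alternate" hreflang="en" href="https://example.com/en/"></head>'
```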

Summary: The 10 Factors at a Glance

#	Factor	Importance for LLMs
1	Schema.org	Semantic interpretation – strongest factor
2	robots.txt	Access control for all AI crawlers
3	llms.txt	Curated table of contents for AI systems
4	Sitemap XML	Complete content discovery
5	Open Graph	Fast preview in the head section
6	HTML structure	Content extraction and context
7	Alt text	Image understanding without multimodal processing
8	Performance	Crawler reachability and ranking
9	Canonical tags	Authority concentration
10	hreflang	Correct mapping of language versions

The AI readability of a website is no longer a luxury; it is a basic prerequisite for visibility in the new era of information search. And it is a moving target: the rules RobotCheck checks against are updated weekly from the official sources of the AI providers.

Test your own website at robotcheck.coffee/check – free, no signup, with instant results.
