We built agent-bench, an open-source tool that scores websites on how well they work with AI agents. Then we pointed it at 20 of the most popular sites on the internet. The highest score was 59%. The average was 38%. Nobody is ready.
agent-bench runs six categories of static analysis against a website, each measuring a different dimension of agent-friendliness: API discoverability, authentication, documentation (including support for the llms.txt standard), HTML structure, error handling, and token cost.

Each check produces a score from 0 to 1. The overall score is a weighted average. No LLMs are involved in the scoring — it's purely structural analysis, which means it's fast, free, and reproducible.
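The scoring model can be sketched in a few lines. The six category weights below are illustrative assumptions for the sketch, not agent-bench's actual weights:

```python
# Sketch of the scoring model described above. Weights are
# illustrative assumptions; agent-bench's real weights may differ.
WEIGHTS = {"api": 0.25, "auth": 0.15, "docs": 0.15,
           "structure": 0.15, "errors": 0.15, "cost": 0.15}

def overall(scores: dict) -> float:
    """Weighted average of per-category scores, each in [0, 1]."""
    return sum(WEIGHTS[cat] * scores[cat] for cat in WEIGHTS)

perfect = {cat: 1.0 for cat in WEIGHTS}
print(round(overall(perfect), 2))  # a perfect site scores 1.0
```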
| Site | Score | API | Auth | Docs | Structure | Errors | Cost |
|---|---|---|---|---|---|---|---|
| api.github.com | 59% | 45% | 82% | 0% | 60% | 80% | 100% |
| github.com | 50% | 55% | 70% | 36% | 61% | 46% | 35% |
| httpbin.org | 46% | 25% | 82% | 16% | 53% | 43% | 75% |
| wikipedia.org | 42% | 7% | 82% | 0% | 52% | 23% | 100% |
| docs.python.org | 42% | 5% | 82% | 40% | 55% | 20% | 70% |
| twitch.tv | 39% | 5% | 62% | 70% | 49% | 26% | 45% |
| stripe.com | 39% | 17% | 62% | 60% | 56% | 23% | 30% |
| spotify.com | 37% | 5% | 60% | 60% | 65% | 26% | 25% |
| news.ycombinator.com | 36% | 5% | 77% | 16% | 57% | 33% | 50% |
| medium.com | 35% | 5% | 60% | 12% | 46% | 23% | 75% |
| figma.com | 35% | 5% | 67% | 52% | 44% | 23% | 40% |
| linear.app | 34% | 5% | 77% | 60% | 24% | 36% | 40% |
| shopify.com | 34% | 5% | 77% | 76% | 33% | 33% | 20% |
| stackoverflow.com | 33% | 5% | 60% | 0% | 46% | 23% | 75% |
| amazon.com | 33% | 5% | 60% | 8% | 46% | 20% | 70% |
| vercel.com | 33% | 5% | 72% | 60% | 44% | 13% | 30% |
| notion.so | 33% | 5% | 77% | 60% | 41% | 20% | 25% |
| twitter.com | 31% | 5% | 55% | 34% | 46% | 16% | 45% |
| discord.com | 29% | 5% | 65% | 36% | 43% | 20% | 30% |
| reddit.com | 29% | 5% | 62% | 4% | 42% | 43% | 45% |
The highest score — GitHub's API at 59% — is the only one that even approaches a passing grade. And that's the API endpoint, not the website. The average across all 20 sites is 38%. (Updated: after fixing false positives where SPA sites were getting inflated API scores, most sites scored even lower than our initial run.)
You'd expect sites built by and for developers — GitHub, Vercel, Linear, Stripe — to score highest. They do edge out consumer sites, but not by much. Stripe, the company famous for best-in-class API design, scored 39%. The stripe.com marketing site lacks an OpenAPI spec at the root, returns soft 404s, and ships heavy JavaScript bundles that cost agents hundreds of thousands of tokens to parse.
The lesson: having a great API product doesn't mean your website is agent-friendly. These are different problems.
15 out of 20 sites return 200 OK for nonexistent pages. This is the single most common failure across the board. When an agent navigates to a bad URL, it gets a full HTML page back with a 200 status and has to figure out on its own that the content doesn't exist.
This is a solved problem. Return 404. Return it with a structured JSON body if you can. It costs nothing to implement and it's the first thing that breaks for autonomous agents.
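A hard 404 with a structured body really is a few lines of code. Here's a minimal sketch using only Python's standard library; the routes and the error-body shape are hypothetical, not a prescribed format:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

KNOWN_PATHS = {"/", "/docs"}  # hypothetical routes for this sketch

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path not in KNOWN_PATHS:
            # A real 404 status plus a structured JSON body, instead
            # of a soft 200 wrapping an HTML "not found" page.
            body = json.dumps({"error": "not_found", "path": self.path}).encode()
            self.send_response(404)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # keep the sketch quiet
        pass

# To serve: HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```

An agent hitting a bad URL now gets an unambiguous status code and a machine-readable reason, with no HTML parsing required.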
The cost check measures how many tokens an agent would burn just to read a page. The numbers are staggering:
At Claude Sonnet pricing ($3/M input tokens), loading a single Linear page costs $1.69 in tokens. An agent that needs to navigate five pages to complete a task would spend over $8 just on reading — before it does anything.
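The arithmetic is straightforward. At that price, $1.69 implies a page weighing roughly 563K tokens — an inference from the figures above, not a number taken from agent-bench's output:

```python
PRICE_PER_MTOK = 3.00  # Claude Sonnet input price cited above, $ per 1M tokens

def read_cost(tokens: int) -> float:
    """Dollar cost for an agent just to ingest a page of this many tokens."""
    return tokens * PRICE_PER_MTOK / 1_000_000

# $1.69 per page at $3/M works back to a ~563K-token page
print(round(read_cost(563_000), 2))  # → 1.69
```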
Signal-to-noise ratios tell the rest of the story. Most sites scored 0-1% — meaning 99% of what the agent receives is JavaScript bundles, CSS classes, tracking scripts, and framework boilerplate. The actual content the agent needs is buried in noise.
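As a rough illustration of that ratio, visible text divided by raw payload size can be computed with the standard library. This is a simplified sketch, not agent-bench's exact metric:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def signal_to_noise(html: str) -> float:
    """Fraction of the raw payload that is human-visible text."""
    p = _TextExtractor()
    p.feed(html)
    text = "".join(p.chunks).strip()
    return len(text) / max(len(html), 1)

page = "<html><script>var a=1;var b=2;</script><main>Hi</main></html>"
print(round(signal_to_noise(page), 2))  # even this tiny page is mostly noise
```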
The llms.txt standard is only a few months old, but 8 out of 20 sites already have one. GitHub, Vercel, Linear, Twitch, Spotify, and Twitter are among the sites serving an llms.txt file — a plain-text summary of the site designed for LLMs to consume instead of parsing raw HTML.
This is probably the single highest-impact change a site can make. Instead of forcing an agent to parse 500K tokens of HTML/JS, give it a 10K-token text file that describes what the site does and how to use it.
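For illustration, a minimal llms.txt might look like the following. The shape (an H1 title, a short blockquote summary, then sections of annotated links) follows the llms.txt proposal; the site and URLs here are invented:

```markdown
# Acme Widgets

> Acme sells modular widgets. Docs live at /docs; the JSON API is at api.acme.example.

## Docs
- [API reference](https://acme.example/docs/api): endpoints, auth, and rate limits
- [Quickstart](https://acme.example/docs/quickstart): create a widget in five calls
```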
Surprisingly, authentication scored highest on average. Most sites don't actively block bots on their public pages (CAPTCHAs are usually reserved for login flows), and several expose OAuth discovery endpoints that an agent could use for machine-to-machine auth. The bar is low, but at least it's not hostile.
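Such a discovery probe is simple to sketch. The well-known path below comes from RFC 8414 (OAuth authorization-server metadata); whether a given site actually serves it is exactly what this kind of check tests:

```python
import json
import urllib.request

WELL_KNOWN = "/.well-known/oauth-authorization-server"  # RFC 8414 metadata path

def metadata_url(base_url: str) -> str:
    return base_url.rstrip("/") + WELL_KNOWN

def fetch_oauth_metadata(base_url: str):
    """Return the server's OAuth metadata dict, or None if it isn't exposed."""
    try:
        with urllib.request.urlopen(metadata_url(base_url), timeout=5) as resp:
            return json.load(resp)
    except OSError:  # URLError/HTTPError both subclass OSError
        return None

print(metadata_url("https://example.com"))
```

A site that answers this request with valid JSON gives agents a machine-readable map to its token endpoints, with no human in the loop.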
Based on the data, here are the highest-leverage things a site can do to become more agent-friendly, roughly in order of effort:
1. **Stop returning 200 OK for pages that don't exist.** This is a one-line fix in most frameworks.
2. **Serve an llms.txt.** Write a plain-text description of your site and serve it at /llms.txt. Takes an hour. Saves agents millions of tokens.
3. **Send rate-limit headers.** `X-RateLimit-Remaining` and `Retry-After` tell agents when to back off instead of guessing. Almost nobody does this on their marketing site.
4. **Use semantic HTML.** `<nav>`, `<main>`, `<article>`. Agents use these to understand page structure without vision models.
5. **Ship `data-testid` attributes.** Stable selectors give agents reliable targets that won't break when you redesign. You probably already have them in your test suite — ship them to production.

agent-bench is open source and free to run. Static analysis doesn't call any LLMs — it just makes HTTP requests and analyzes the responses.
```bash
pip install git+https://github.com/LightLayer-dev/agent-bench.git

# Score any website
agent-bench analyze https://your-site.com

# Get a detailed HTML report
agent-bench analyze https://your-site.com --format html -o report.html

# See what kind of site it is and what tasks agents would try
agent-bench classify https://your-site.com
```
We're also building live agent run benchmarks — where real AI agents attempt real tasks on real websites, measuring success rates, costs, and step counts across different models and frameworks. That's coming soon.
The web wasn't built for agents. But it can be rebuilt — one 404 at a time.