We fetched robots.txt from 99 popular websites and checked whether they block eight major AI crawlers: GPTBot, ChatGPT-User, ClaudeBot, anthropic-ai, Google-Extended, CCBot, PerplexityBot, and Bytespider. The results paint a clear picture of who's embracing the AI era and who's slamming the door.
Roughly one in three popular websites explicitly blocks AI crawlers in their robots.txt. But the blocking isn't uniform — it's concentrated in content-heavy industries and almost entirely absent from others.
CCBot (Common Crawl) and ClaudeBot are the most frequently blocked — rejected by over a third of sites with a reachable robots.txt. ChatGPT-User (OpenAI's browsing agent) is blocked least, at 23%. Interestingly, GPTBot (OpenAI's training crawler) is blocked less often than Anthropic's bots — perhaps reflecting OpenAI's head start in negotiating content licensing deals. The spread between bots is wider than you might expect, suggesting sites are making deliberate per-bot decisions, not just blanket "block all AI" policies.
The pattern is unmistakable: content creators block; platforms and services don't.
News and media sites — the organizations whose primary product is written content — block AI bots at near-universal rates. A staggering 95% of the news sites we checked block at least one crawler. For outlets like the New York Times, BBC, NPR, CNN, and USA Today, it's a blanket ban on all eight bots.
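Mechanically, a blanket ban like that is just a stack of user-agent groups, each with a full-site disallow. A simplified sketch of what such a robots.txt looks like (not any particular outlet's actual file):

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

# ...and likewise for anthropic-ai, Google-Extended, CCBot, PerplexityBot, and Bytespider
```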
Meanwhile, tech platforms, e-commerce, finance, government, and education sites almost universally leave the door open. Not a single government site or university we checked blocks any AI crawler. The logic tracks: these sites want to be found, indexed, and consumed. AI bots are just another channel.
This is where the real battle is. Here's the full breakdown (✓ = allowed, ✗ = blocked):
| Site | GPTBot | ChatGPT | Claude | anth-ai | G-Ext | CCBot | Perplx | Bytesp |
|---|---|---|---|---|---|---|---|---|
| nytimes.com | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| bbc.com | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| cnn.com | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| npr.org | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| usatoday.com | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| huffpost.com | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| nbcnews.com | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| bloomberg.com | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| techcrunch.com | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| vox.com | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| theverge.com | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| buzzfeed.com | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| forbes.com | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| reuters.com | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| theatlantic.com | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| wsj.com | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| arstechnica.com | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| wired.com | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| newyorker.com | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| theguardian.com | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| apnews.com | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| time.com | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
The picture is stark. Total lockdown sites — NYT, BBC, CNN, NPR, USA Today, HuffPost, NBC News — block every single AI crawler with no exceptions. Then there are the selective blockers, and this is where it gets interesting. Vox and The Verge allow only GPTBot, suggesting an OpenAI licensing deal. Bloomberg allows only ClaudeBot. TechCrunch allows only PerplexityBot. These one-bot exceptions almost certainly reflect individual content licensing agreements negotiated behind the scenes.
A second tier of selective blockers — Reuters, The Atlantic, WSJ, Ars Technica, Wired, The New Yorker, The Guardian — block most bots but allow GPTBot and ChatGPT-User. The pattern is consistent enough to suggest OpenAI has been the most aggressive in striking content deals. Time stands alone as the only major news outlet in our sample that blocks nothing at all.
Social platforms lean heavily toward blocking. LinkedIn, Pinterest, Snapchat, Facebook, Instagram, and TikTok all block every single AI bot — full lockdown, all eight. Twitter/X is nearly as restrictive, blocking 7 of 8 bots (only Google-Extended gets through). That's notable given the platform's general stance of openness — but apparently not when it comes to AI crawlers.
The holdouts are YouTube, Twitch, and Discord — all three allow every bot. These are platforms where the content is primarily video, audio, or ephemeral chat rather than text articles. There's less to "scrape" from a robots.txt perspective, since the valuable content lives behind players and APIs rather than in crawlable HTML.
Developer tools and tech platforms are overwhelmingly open. Of 18 reachable tech sites we checked, only two block any AI bots. Figma blocks 6 of 8 (GPTBot, ChatGPT-User, anthropic-ai, Google-Extended, CCBot, PerplexityBot). Medium blocks 3 — GPTBot, ClaudeBot, and Bytespider — likely protecting its writer-generated content in the same way news sites do. GitHub, GitLab, Dev.to, Vercel, Netlify, Cloudflare, Stripe, Twilio, Slack, Docker, Atlassian — all wide open.
This makes sense. Tech companies want their documentation, APIs, and product pages in AI training data. If an AI agent recommends Stripe's payment API or Vercel's hosting platform, that's free marketing. The incentive structure aligns.
Even OpenAI and Anthropic themselves — companies whose bots are being blocked elsewhere — leave their own sites completely open. No hypocrisy charges on the robots.txt front, at least.
Amazon blocks 7 of 8 AI crawlers — only anthropic-ai gets through. eBay blocks 5 of 8, targeting ClaudeBot, anthropic-ai, CCBot, PerplexityBot, and Bytespider while allowing GPTBot, ChatGPT-User, and Google-Extended. Every other retailer we checked — Walmart, Target, Best Buy, Etsy, Shopify, Nike, Home Depot, Costco, Lowe's, Wayfair — allows all bots.
Amazon and eBay's blocking likely reflects their positions as both retailers and data companies. Product listings, reviews, and pricing data are competitively sensitive — they don't want AI models reproducing their data in ways that route customers elsewhere. The rest of e-commerce apparently decided the SEO-like benefits of AI visibility outweigh the risks.
Finance, government, and education: zero blocks across all three categories. Not a single bank (Chase, Bank of America, Wells Fargo, PayPal), government site (USA.gov, NASA, CDC, IRS), or university (MIT, Stanford, Harvard, Yale) blocks any AI crawler.
For government and education, this aligns with their public mission — these institutions want their content widely accessible. For finance, the reasoning is likely different: banking sites are already heavily gated behind authentication, so robots.txt is irrelevant for sensitive data. Their public-facing content is marketing material they're happy to have AI systems surface.
Health content sites split along predictable lines. WebMD blocks 4 bots (GPTBot, ChatGPT-User, ClaudeBot, and CCBot). Healthline blocks 5 (GPTBot, ClaudeBot, anthropic-ai, CCBot, and Bytespider). These are ad-supported content sites that, like news publishers, are protecting their core product — health articles that drive traffic and ad revenue.
Meanwhile, institutional health sites like the CDC, NIH, and Mayo Clinic (when reachable) block nothing. Again: public mission aligns with open access.
The robots.txt landscape creates a two-tier web for AI. Agents can freely access tech documentation, government resources, educational content, and most commercial sites. But they're locked out of news articles, many social platforms, and some health content — exactly the kind of current, human-written text that makes AI responses useful.
A few implications:
- robots.txt is a blunt instrument. It blocks crawling, not retrieval. An AI can still surface NYT content from its existing training data, from cached versions, or from third-party sources that don't block. It's a speed bump, not a wall.
- robots.txt is a voluntary standard. It tells well-behaved bots what to do, but there's no technical enforcement. Some crawlers ignore it. And many sites supplement robots.txt with active bot detection — CAPTCHAs, rate limiting, browser fingerprinting — that we didn't measure here.
For the data-curious, here's our methodology: we fetched robots.txt via HTTPS from 99 domains across 8 industry categories, of which 93 returned a parseable response. We parsed each file with Python's urllib.robotparser and checked can_fetch(bot, "/") for each of eight AI user-agent strings. A site is marked as "blocking" a bot only if the parser determines a full-site disallow for that user-agent.
Sites that were unreachable, returned errors (rate limiting, auth proxies), or had no robots.txt are noted in the dataset but excluded from the blocking percentages. Having no robots.txt is effectively "allow all" per the spec.
Want to run this yourself? It's just HTTP requests:
```bash
# Quick single-site check (-L follows redirects, e.g. to www.nytimes.com)
curl -sL https://nytimes.com/robots.txt | grep -A1 -i "gptbot\|claudebot\|perplexity"

# Or check a batch:
for site in nytimes.com github.com amazon.com; do
  echo "=== $site ==="
  curl -sL "https://$site/robots.txt" | grep -i "gptbot\|claudebot\|perplexity"
done
```
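And here is a minimal Python sketch of the same check, mirroring the urllib.robotparser approach described above. The bot list matches the article, but the helper name and error handling are ours; the published script may differ in its details:

```python
# Minimal sketch: which AI crawlers does a site's robots.txt block outright?
import urllib.error
import urllib.robotparser

AI_BOTS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "Google-Extended", "CCBot", "PerplexityBot", "Bytespider",
]

def blocked_bots(domain):
    """Return the bots disallowed from the site root, or None if robots.txt is unreachable."""
    rp = urllib.robotparser.RobotFileParser(f"https://{domain}/robots.txt")
    try:
        rp.read()  # a missing robots.txt (404) counts as "allow all", per the spec
    except urllib.error.URLError:
        return None
    # "Blocking" here means the parser says the bot may not fetch the site root.
    return [bot for bot in AI_BOTS if not rp.can_fetch(bot, f"https://{domain}/")]

for domain in ["nytimes.com", "github.com", "amazon.com"]:
    blocked = blocked_bots(domain)
    if blocked is None:
        print(domain, "-> robots.txt unreachable")
    else:
        print(domain, "->", ", ".join(blocked) or "no full-site blocks")
```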
The full dataset — CSV, JSON, and the Python script to reproduce it — is available in agent-bench's research directory. We'll keep updating it as the landscape evolves — because it's evolving fast.
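If you pull down the CSV, recomputing the per-bot block rates quoted above takes a few lines of pandas. The filename and column layout here are assumptions (one row per reachable site, one boolean column per bot); adjust them to the actual schema:

```python
# Recompute the share of sites blocking each AI crawler from the dataset.
# Assumed layout: one row per reachable site, one True/False column per bot.
import pandas as pd

BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
        "Google-Extended", "CCBot", "PerplexityBot", "Bytespider"]

df = pd.read_csv("ai_crawler_blocks.csv")  # hypothetical filename
block_rates = df[BOTS].mean().sort_values(ascending=False)  # fraction of sites blocking each bot
print((block_rates * 100).round(1).astype(str) + "%")
```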