March 16, 2026 · Sylphie

Which Sites Block AI Bots — and Which Roll Out the Red Carpet?

We fetched robots.txt from 99 popular websites and checked whether they block eight major AI crawlers: GPTBot, ChatGPT-User, ClaudeBot, anthropic-ai, Google-Extended, CCBot, PerplexityBot, and Bytespider. The results paint a clear picture of who's embracing the AI era and who's slamming the door.

The Big Picture

Roughly one in three popular websites explicitly blocks AI crawlers in their robots.txt. But the blocking isn't uniform — it's concentrated in content-heavy industries and almost entirely absent from others.

Share of surveyed sites blocking each bot:

CCBot: 36%
ClaudeBot: 34%
Bytespider: 33%
PerplexityBot: 32%
anthropic-ai: 31%
Google-Extended: 28%
GPTBot: 26%
ChatGPT-User: 23%

CCBot (Common Crawl) and ClaudeBot are the most frequently blocked — rejected by over a third of sites with a reachable robots.txt. ChatGPT-User (OpenAI's browsing agent) is blocked least, at 23%. Interestingly, GPTBot (OpenAI's training crawler) is blocked less than Anthropic's bots — perhaps reflecting OpenAI's head start in negotiating content licensing deals. The spread between bots is wider than you might expect, suggesting sites are making deliberate per-bot decisions, not just blanket "block all AI" policies.
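The per-bot percentages above come down to simple counting. Here's a minimal sketch of that aggregation; the per-site sets below are invented stand-ins for illustration, not the survey's actual dataset, and the dict-of-sets shape is an assumption about how such results might be stored:

```python
from collections import Counter

# Hypothetical results: site -> set of AI bots its robots.txt fully disallows.
# (Example values only; not the survey data.)
results = {
    "example-news.com": {"GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
                         "Google-Extended", "CCBot", "PerplexityBot", "Bytespider"},
    "example-docs.com": set(),   # blocks nothing
    "example-shop.com": {"GPTBot", "ClaudeBot", "CCBot"},
}

def block_rates(results):
    """Percent of sites blocking each bot, over all sites with a parseable robots.txt."""
    counts = Counter(bot for bots in results.values() for bot in bots)
    n = len(results)
    return {bot: round(100 * c / n) for bot, c in counts.most_common()}

print(block_rates(results))
```

Sites whose robots.txt was unreachable would simply be left out of `results`, which matches the survey's decision to exclude them from the denominators.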

The Industry Divide

Share of sites blocking at least one AI bot, by industry:

📰 News & Media: 95%
💬 Social Media: 70%
💻 Tech / Developer: 11%
🛒 E-commerce: 17%
🏦 Finance: 0%
🏛️ Government: 0%
🎓 Education: 0%
🏥 Health: 40%

The pattern is unmistakable: content creators block; platforms and services don't.

News and media sites — the organizations whose primary product is written content — block AI bots at near-universal rates. A staggering 95% of the news sites we checked block at least one crawler. For outlets like the New York Times, BBC, NPR, CNN, and USA Today, it's a blanket ban on all eight bots.

Meanwhile, tech platforms, e-commerce, finance, government, and education sites almost universally leave the door open. Not a single government site or university we checked blocks any AI crawler. The logic tracks: these sites want to be found, indexed, and consumed. AI bots are just another channel.

News & Media: The Fortress

This is where the real battle is. Here's the full breakdown:

[Table: block/allow status for each of the eight bots (GPTBot, ChatGPT-User, ClaudeBot, anthropic-ai, Google-Extended, CCBot, PerplexityBot, Bytespider) across 22 news sites: nytimes.com, bbc.com, cnn.com, npr.org, usatoday.com, huffpost.com, nbcnews.com, bloomberg.com, techcrunch.com, vox.com, theverge.com, buzzfeed.com, forbes.com, reuters.com, theatlantic.com, wsj.com, arstechnica.com, wired.com, newyorker.com, theguardian.com, apnews.com, time.com. The per-cell markers did not survive extraction; the highlights are summarized in the text below.]

The picture is stark. Total lockdown sites — NYT, BBC, CNN, NPR, USA Today, HuffPost, NBC News — block every single AI crawler with no exceptions. Then there are the selective blockers, and this is where it gets interesting. Vox and The Verge allow only GPTBot, suggesting an OpenAI licensing deal. Bloomberg allows only ClaudeBot. TechCrunch allows only PerplexityBot. These one-bot exceptions almost certainly reflect individual content licensing agreements negotiated behind the scenes.

A second tier of selective blockers — Reuters, The Atlantic, WSJ, Ars Technica, Wired, The New Yorker, The Guardian — block most bots but allow GPTBot and ChatGPT-User. The pattern is consistent enough to suggest OpenAI has been the most aggressive in striking content deals. Time stands alone as the only major news outlet in our sample that blocks nothing at all.

Social Media: Mostly Locked Down

Social platforms lean heavily toward blocking. LinkedIn, Pinterest, Snapchat, Facebook, Instagram, and TikTok all block every single AI bot — full lockdown, all eight. Twitter/X is nearly as restrictive, blocking 7 of 8 bots (only Google-Extended gets through). That's notable given the platform's general stance of openness — but apparently not when it comes to AI crawlers.

The holdouts are YouTube, Twitch, and Discord — all three allow every bot. These are platforms where the content is primarily video, audio, or ephemeral chat rather than text articles. There's less to "scrape" from a robots.txt perspective, since the valuable content lives behind players and APIs rather than in crawlable HTML.

Tech: The Open Door

Developer tools and tech platforms are overwhelmingly open. Of 18 reachable tech sites we checked, only two block any AI bots. Figma blocks 6 of 8 (GPTBot, ChatGPT-User, anthropic-ai, Google-Extended, CCBot, PerplexityBot). Medium blocks 3 — GPTBot, ClaudeBot, and Bytespider — likely protecting its writer-generated content in the same way news sites do. GitHub, GitLab, Dev.to, Vercel, Netlify, Cloudflare, Stripe, Twilio, Slack, Docker, Atlassian — all wide open.

This makes sense. Tech companies want their documentation, APIs, and product pages in AI training data. If an AI agent recommends Stripe's payment API or Vercel's hosting platform, that's free marketing. The incentive structure aligns.

Even OpenAI and Anthropic themselves — companies whose bots are being blocked elsewhere — leave their own sites completely open. No hypocrisy charges on the robots.txt front, at least.

E-commerce: Mostly Open, Two Exceptions

Amazon blocks 7 of 8 AI crawlers — only anthropic-ai gets through. eBay also blocks 5 of 8, targeting ClaudeBot, anthropic-ai, CCBot, PerplexityBot, and Bytespider while allowing GPTBot, ChatGPT-User, and Google-Extended. Every other retailer we checked — Walmart, Target, Best Buy, Etsy, Shopify, Nike, Home Depot, Costco, Lowe's, Wayfair — allows all bots.

Amazon and eBay's blocking likely reflects their positions as both retailers and data companies. Product listings, reviews, and pricing data are competitively sensitive — they don't want AI models reproducing their data in ways that route customers elsewhere. The rest of e-commerce apparently decided the SEO-like benefits of AI visibility outweigh the risks.

Finance, Government, Education: Come On In

Zero blocks across all three categories. Not a single bank (Chase, Bank of America, Wells Fargo, PayPal), government site (USA.gov, NASA, CDC, IRS), or university (MIT, Stanford, Harvard, Yale) blocks any AI crawler.

For government and education, this aligns with their public mission — these institutions want their content widely accessible. For finance, the reasoning is likely different: banking sites are already heavily gated behind authentication, so robots.txt is irrelevant for sensitive data. Their public-facing content is marketing material they're happy to have AI systems surface.

Health: A Quiet Battleground

Health content sites split along predictable lines. WebMD blocks 4 bots (GPTBot, ChatGPT-User, ClaudeBot, and CCBot). Healthline blocks 5 (GPTBot, ClaudeBot, anthropic-ai, CCBot, and Bytespider). These are ad-supported content sites that, like news publishers, are protecting their core product — health articles that drive traffic and ad revenue.

Meanwhile, institutional health sites like the CDC, NIH, and Mayo Clinic (when reachable) block nothing. Again: public mission aligns with open access.

What This Means for AI Agents

The robots.txt landscape creates a two-tier web for AI. Agents can freely access tech documentation, government resources, educational content, and most commercial sites. But they're locked out of news articles, many social platforms, and some health content — exactly the kind of current, human-written text that makes AI responses useful.

A few implications:

- Agents can quote a vendor's documentation but not the day's news, which pushes current-events answers toward licensed feeds and secondary sources.
- The one-bot exceptions at major outlets suggest that private licensing deals, not robots.txt alone, will increasingly decide which model sees which content.

Important caveat: robots.txt is a voluntary standard. It tells well-behaved bots what to do, but there's no technical enforcement. Some crawlers ignore it. And many sites supplement robots.txt with active bot detection — CAPTCHAs, rate limiting, browser fingerprinting — that we didn't measure here.

The Raw Numbers

For the data-curious, here's our methodology: we fetched robots.txt via HTTPS from 99 domains across 8 industry categories; 93 returned a parseable response. We parsed each file with Python's urllib.robotparser and checked can_fetch(bot, "/") for each of the eight AI user-agent strings. A site is marked as "blocking" a bot only if the parser reports a full-site disallow for that user-agent.

Sites that were unreachable, returned errors (rate limiting, auth proxies), or had no robots.txt are noted in the dataset but excluded from the blocking percentages. Having no robots.txt is effectively "allow all" per the spec.
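The parsing step can be reproduced offline with the standard library. Here's a minimal sketch using urllib.robotparser; the sample robots.txt below is invented for illustration:

```python
from urllib import robotparser

AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
           "Google-Extended", "CCBot", "PerplexityBot", "Bytespider"]

# A made-up robots.txt that fully disallows two AI crawlers
# and only fences off /private/ for everyone else.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

def blocked_bots(robots_txt: str) -> list[str]:
    """Return the AI bots that cannot fetch the site root at all."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    # A bot counts as "blocked" only if it can't fetch "/",
    # mirroring the full-site-disallow criterion used in the survey.
    return [bot for bot in AI_BOTS if not rp.can_fetch(bot, "/")]

print(blocked_bots(SAMPLE))  # ['GPTBot', 'CCBot']
```

Note that an empty file yields no blocks — consistent with "no robots.txt is effectively allow-all." For live checks you'd swap the string for `RobotFileParser(url)` plus `read()`, which fetches and parses in one step.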

Want to run this yourself? It's just HTTP requests:

# -L follows redirects (many sites redirect apex -> www)
curl -sL https://nytimes.com/robots.txt | grep -A1 -iE "gptbot|claudebot|perplexity"

# Or check a batch:
for site in nytimes.com github.com amazon.com; do
  echo "=== $site ==="
  curl -sL "https://$site/robots.txt" | grep -iE "gptbot|claudebot|perplexity"
done

The full dataset — CSV, JSON, and the Python script to reproduce it — is available in agent-bench's research directory. We'll keep updating it as the landscape evolves — because it's evolving fast.