Cloudflare bot blocking for AI crawlers
/ai.txt and robots.txt are signals — they declare your stance, but compliance is voluntary on the crawler’s side. Major model trainers don’t all promise to honour them. If you want active blocking, pair the signal with a Cloudflare WAF rule that drops requests from AI crawler user-agents.
This recipe assumes Cloudflare is in front of your origin. If you’re on raw GCP/AWS, you’d do the same shape with their respective WAFs (Cloud Armor, AWS WAF) — the policy is the same, the dashboard is different.
Tickbox config
The config is the same as in the AI training opt-out concept:
```ts
import { defineConsent, jurisdictions } from '@tickboxhq/core'

export default defineConsent({
  jurisdiction: jurisdictions.UK_DUAA,
  policy: { version: '2026-05-08', url: '/privacy' },
  categories: {
    necessary: { required: true },
    ai_training: {
      vendors: [], // empty → block all known AI crawlers
      default: false,
      description: 'AI training and inference by automated crawlers.',
    },
  },
})
```

The Nuxt module auto-serves `/ai.txt` from this. That's the signal.
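For reference, a sketch of what the served file might look like. The exact output depends on your Tickbox version; with `vendors: []` the Verify section below expects a blanket disallow:

```txt
User-Agent: *
Disallow: /
```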
The Cloudflare side
In your Cloudflare dashboard:
- Go to Security → WAF → Custom rules.
- Click Create rule.
- Name: `Block AI training crawlers`.
- Field: `User Agent` (or `cf.client.bot` if you want broader bot detection).
- Operator: `contains` (or `matches regex` if you want one rule for everything).
- Value: any one of the user-agents Tickbox knows about — the canonical list lives in `packages/core/src/jurisdictions/vendors.ts`. The regex form: `(?i)(GPTBot|ClaudeBot|anthropic-ai|Google-Extended|PerplexityBot|CCBot|Bytespider|Applebot-Extended|meta-externalagent|OAI-SearchBot)`
- Action: `Block`.
Save. The rule takes effect within seconds.
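If you'd rather script this than click through the dashboard, the same rule can be created via Cloudflare's Rulesets API. A sketch, assuming `$ZONE_ID` and `$CF_API_TOKEN` are set, and that `$RULESET_ID` is the ID of the zone's `http_request_firewall_custom` ruleset (listed by the first call):

```sh
# Find the zone's custom-rules ruleset (phase: http_request_firewall_custom)
# and note its "id"
curl -s "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets" \
  -H "Authorization: Bearer $CF_API_TOKEN"

# Append the blocking rule to that ruleset
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/$RULESET_ID/rules" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action": "block",
    "description": "Block AI training crawlers",
    "expression": "(http.user_agent matches \"(?i)(GPTBot|ClaudeBot|anthropic-ai|Google-Extended|PerplexityBot|CCBot|Bytespider|Applebot-Extended|meta-externalagent|OAI-SearchBot)\")"
  }'
```

One caveat: Cloudflare has historically gated the regex `matches` operator behind Business and Enterprise plans; if it's unavailable on yours, use one `contains` rule per user-agent instead.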
Cloudflare’s built-in option
Cloudflare also ships a “Block AI Bots” managed rule under Security → Bots. It’s a one-click toggle, maintained by Cloudflare, and updated as new crawlers appear. Lighter operational cost than maintaining your own custom rule. The trade-off is you’re trusting Cloudflare’s list.
If you use both:
- Custom rule for the bots you specifically care about (overrides Cloudflare’s list).
- Managed rule for everything else.
What the crawlers see
A request from `User-Agent: GPTBot/1.0`:
- Hits your Cloudflare edge.
- Custom WAF rule matches.
- Cloudflare returns `403 Forbidden`. The origin never sees the request.
- The `/ai.txt` and `robots.txt` declarations are still there for crawlers that do respect them — but the WAF stops the ones that don't.
Caveats
Don’t accidentally block Googlebot. Cloudflare’s managed rule is careful about this; if you write your own custom rule, double-check the user-agent strings — Google-Extended is the AI crawler, Googlebot is the search indexer. Blocking the wrong one tanks your SEO.
User-agent spoofing exists. A crawler that wants to scrape your content can identify as Chrome. WAF rules catch the polite ones; the impolite ones need rate limiting, behavioural analysis, or paid bot management — out of scope for this recipe.
/ai.txt is not a substitute. Cloudflare blocking covers the crawlers that don’t respect signals; /ai.txt covers the ones that do. You want both.
Verify
```sh
curl -A "GPTBot/1.0" https://your-site.com/some-page
# Expect: HTTP/2 403

curl -A "Mozilla/5.0" https://your-site.com/some-page
# Expect: HTTP/2 200, your normal page

curl https://your-site.com/ai.txt
# Expect: text/plain with "User-Agent: *" / "Disallow: /" (or per-bot rules)
```
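Given the Googlebot caveat above, it's worth one more check that search indexing still works. A sketch using Googlebot's published desktop user-agent (plain `curl` may still be challenged by your other security settings, so treat a 403 here as a prompt to investigate rather than proof of breakage):

```sh
curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://your-site.com/some-page
# Expect: HTTP/2 200 (Google-Extended is on the block list; Googlebot is not)
```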