Should you allow GPTBot and AI training crawlers?

AI crawlers come in two kinds, training and search, and each bot can be allowed or blocked on its own. This post works through that per-bot decision for service businesses.

TL;DR

  • AI crawlers split into two kinds with independent settings: training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) and search crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot).
  • Rutgers Business School and Wharton found publishers who blocked AI crawlers lost 23.1% of monthly visits and 13.9% of human browsing without reliably reducing AI citations.
  • 79% of top news publishers block at least one training bot. 46% block Google-Extended, per a January 2026 Press Gazette analysis.
  • Service businesses should usually allow training crawlers because articles are a signpost, not the product. Publishers, paid newsletters, and proprietary research should usually block.
  • Decide whether your content is the product or the signpost. If you block, write per-bot rules in robots.txt that block only the training crawlers, and leave the search crawlers alone.

There is a deal on the table.

You let the AI crawlers read everything you publish. In return, your content might be quoted by ChatGPT, Claude, and Perplexity — which sends readers and future customers your way.

But the deal has a quieter clause. Some of those crawlers do not visit to quote you. They visit to train the next AI model on what you wrote. Your content becomes part of the AI model’s memory, with no credit and no payment.

You can accept the whole deal. You can reject it entirely. You can sign only half.

Most small business owners have never been told there was a choice.

Why do I have to make this choice at all?

Because AI crawlers are not one thing.

A single blanket rule — allow everything, or block everything — treats two different kinds of visitor as the same. One kind trains the AI model. The other kind fetches your page when a reader asks a live question and credits you in the answer.

Allowing both is the easy default. Blocking both is the cautious default. Neither is actually correct.

The two kinds have different costs and different benefits. Your decision is about values as much as technology.

What is a training crawler, actually?

A program that reads your pages and feeds the text into the next version of an AI model.

GPTBot trains OpenAI’s models. ClaudeBot trains Anthropic’s. Google-Extended trains Gemini. CCBot feeds Common Crawl, a dataset that almost every major AI model has used at some point.

These are separate from the search crawlers. OAI-SearchBot, Claude-SearchBot, PerplexityBot — those fetch content at query time. They credit the source. They send readers.

The settings are independent. You can allow the search crawlers and block the training ones. Or the reverse. Or neither.

That independence is the reason this decision is even possible.

What does blocking training crawlers actually cost you?

More than most owners expect.

Researchers at Rutgers Business School and The Wharton School analyzed publishers who blocked AI crawlers via robots.txt. The study was dated December 31, 2025.

Blocked sites saw total monthly visits drop 23.1 percent. Human-only browsing dropped 13.9 percent.

The same study reported something quieter. Citation rates did not reliably fall alongside the traffic. The research coverage put it plainly: blocking appears to reduce traffic without reliably reducing AI citation rates.

That is the operational picture the owner has to look at. Block, and your traffic goes down. Block, and your citations do not clearly go down.

For some businesses the trade is still worth it. For others it is not.

What does allowing training crawlers actually cost you?

Your content feeds AI model training without credit or payment.

Today’s AI models will often quote you when a reader asks a direct question. Tomorrow’s AI models will know what you said without needing to quote you.

Ongoing lawsuits from major publishers against OpenAI and Anthropic are trying to define where all of this ends. The courts have not finished. The crawlers are reading you in the meantime.

For most small businesses, this is a cost worth paying for the visibility it buys. For a few, it is not.

The question is which camp you belong to.

If you are a publisher, what do you usually do?

You usually block.

In January 2026, 79 percent of top news websites were blocking at least one training bot — GPTBot, ClaudeBot, or CCBot. Only 46 percent were blocking Google-Extended, which is the same kind of decision against a different counterparty.

The math is simple for publishers. Your articles are your product.

Feeding the product into AI model training is giving it away. The traffic cost is the price of keeping it.

Publishers who can price their journalism, their research, or their analysis have the strongest case for blocking. The content is the business. The business is the content.

If you are a service business, what do you usually do?

You usually allow.

Your articles are not the product. They are how a reader finds out that you exist.

A small accountant in a small city. A WordPress consultant. A family clinic. A local bakery with a blog.

For these businesses, the traffic cost of blocking is a real cost with no matching benefit.

Being quoted in a ChatGPT answer is how a new customer arrives. A framework from a 2026 industry analysis names this directly: businesses that depend on organic discovery should allow training crawlers unless a specific reason makes the trade unfair.

For most service businesses, no such reason exists. Block in spite of that and you risk seeing AI recommend your competitors when customers ask about your category.

If you are a specialist or creator, what do you usually do?

You decide case by case.

A clinician whose long-form articles establish reputation usually allows.

A researcher whose reports are sold usually blocks.

A consultant whose free blog attracts inquiries usually allows.

A writer whose paid newsletter is the product usually blocks.

The question to ask yourself is plain. Is the content the product, or a signpost to the product? Product means block. Signpost means allow.

Most specialists can answer that question in thirty seconds once it is asked directly.

How do I actually do this in my robots.txt?

You write separate rules per bot.

If you are blocking training, disallow GPTBot. Also ClaudeBot. Also Google-Extended. Also CCBot.

Leave the search crawlers alone. OAI-SearchBot. Claude-SearchBot. Claude-User. PerplexityBot. The standard Googlebot.

Each bot reads the section addressed to its own name. The settings are independent, which is the whole point.

Block the ones you mean to block. Leave the rest alone.
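
Here is a minimal sketch of what those per-bot rules can look like, assuming you have decided to block the four training crawlers named in this post and nothing else. It uses Python only to print the robots.txt text and to check it with the standard library's robots.txt parser. The function name and the example.com URL are made up for illustration, and a real crawler may read the file slightly differently than Python's parser does.

```python
from urllib.robotparser import RobotFileParser

# Training crawlers named in this post; blocked in this sketch.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]
# Search and retrieval crawlers named in this post; left alone.
SEARCH_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "Claude-User",
               "PerplexityBot", "Googlebot"]


def build_robots_txt(blocked_bots):
    """Return robots.txt text with one 'Disallow: /' group per blocked bot."""
    groups = [f"User-agent: {bot}\nDisallow: /\n" for bot in blocked_bots]
    # Every bot not named above falls through to this allow-everything group.
    groups.append("User-agent: *\nDisallow:\n")
    return "\n".join(groups)


robots_txt = build_robots_txt(TRAINING_BOTS)
print(robots_txt)

# Check the rules with Python's own parser (one interpretation among several).
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

page = "https://example.com/blog/some-article"  # hypothetical URL
for bot in TRAINING_BOTS + SEARCH_BOTS:
    verdict = "allowed" if parser.can_fetch(bot, page) else "blocked"
    print(f"{bot}: {verdict}")
```

The printed text is what would go in the robots.txt file itself. Each blocked bot gets its own two-line section, and the search crawlers are never mentioned, so they fall under the open default at the bottom.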

If you are not sure how to write the file, many WordPress SEO plugins will let you edit robots.txt from the admin area. So will most hosts.

Other questions worth answering

What does the 79-percent figure from top news sites mean for small operators?

It doesn’t translate cleanly. No public data covers how often small businesses specifically block training crawlers. ALM Corp’s January 2026 data analysis sampled top news websites.

Reading that 79 percent figure as a universal rate would misread the source. Small operators should treat the publisher rate as context for one camp, not a base rate for theirs.

Did the Rutgers research quantify citation rates for sites that closed the door?

No. The ppc.land April 2026 coverage of the Rutgers and Wharton research dated December 2025 quantified traffic drops. Total visits fell 23.1 percent. Human-only browsing fell 13.9 percent.

The citation-rate claim sat at the headline level, asserted but not given a specific percentage in that piece. Treat the citation half as directional, not as a measured figure.

What happens when ChatGPT or Perplexity fetches my page in response to a reader’s live query?

Three agents handle real-time retrieval — ChatGPT-User, Claude-User, and Perplexity-User. They fire when a reader asks a live question. This category sits separate from search indexing and AI model feeding.

Per ALM Corp’s February 2026 framework analysis, retrieval directives are on their own axis. Whether to leave them open is a separate call. Assume retrieval reaches your page regardless of what your file says.

How do I verify my new directives are being honored once they go live?

Three signals tell you whether the rules took. Server logs show which agent strings have hit the site. Free monitoring tools surface the same data without log access.

Per Cogni Blog’s March 2026 decision guide, each agent setting operates on its own axis. The third signal is a live test. Ask ChatGPT about a freshly updated page. If the answer is accurate and recent, retrieval is reaching you.
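
If you do have log access, the server-log check can be sketched in a few lines. This assumes your host writes logs in the common "combined" format, where the User-Agent is the last quoted field on each line, and it assumes each bot's name appears somewhere in that field. The sample lines and their user-agent strings are invented for illustration, not copied from real traffic. Google-Extended is left out on purpose: as far as I know it is a robots.txt control token, not a name that shows up in request logs.

```python
import re
from collections import Counter

# Agent names mentioned in this post. Matching is a plain, case-insensitive
# substring test on the User-Agent field, which is an assumption.
AI_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "OAI-SearchBot",
             "Claude-SearchBot", "PerplexityBot", "ChatGPT-User",
             "Claude-User", "Perplexity-User"]

# In the combined log format, the User-Agent is the last quoted field.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines):
    """Count requests per AI agent name found in each line's User-Agent."""
    hits = Counter()
    for line in log_lines:
        match = USER_AGENT_FIELD.search(line)
        if not match:
            continue  # line is not in the expected format; skip it
        user_agent = match.group(1).lower()
        for agent in AI_AGENTS:
            if agent.lower() in user_agent:
                hits[agent] += 1
    return hits


# Made-up sample lines; with a real file you would pass open("access.log")
# instead, where "access.log" is a hypothetical path.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Mar/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [10/Mar/2026:10:01:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.7 - - [10/Mar/2026:10:02:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/125.0"',
]

hits = count_ai_hits(SAMPLE_LOG)
for agent, count in hits.most_common():
    print(agent, count)
```

A training bot that keeps appearing in the counts weeks after you blocked it is worth a closer look. A search bot that has vanished is, too.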

Which AI bots should you actually block?

Decide which camp you are in.

A publisher blocks training. A service business allows. A specialist or creator checks whether the content is the product or the signpost.

Write the rules once. Block the training bots if that is the call. Leave the search bots alone.

Then test. Ask an AI to summarize one of your pages. If the answer is still accurate and recent, the search crawlers are still reading you — and your settings are working as intended. If you want a step-by-step routine, how to check AI citations covers the same test for any business and any AI engine.

Last week I wrote about the four doors that can block every AI crawler at once. That piece was about the doors. This one is about whether you want any of the visitors through them in the first place.

The two questions are related. They are not the same.

If you want a second pair of eyes on which camp you are in and what your robots.txt should actually say, you can contact me. I will ask about your business, listen, and tell you which deal I think is fair. No pitch. No sign-up.
