As a wise playwright once said, to block or bot to block, that is the question. If Shakespeare were alive today, would he have been proud that Google’s AI chatbot began life named after the epithet he is commonly known by? Sadly, we shall never know. However, what we do know is that AI bots are crawling websites, and not everyone is on board with it. In this blog, we’re breaking down the whys and why-nots of blocking AI crawlers on your website.
You may have been scaremongered into thinking that AI is going to steal all your data and repurpose it without permission. For many businesses, this may have been enough to convince them to block AI bots from crawling their site entirely.
In reality, bots like Googlebot crawl your website regularly to understand the content being added, changed and removed from your site to aid search engines, so what’s so different about an AI bot doing the same?
What are crawlers?
Crawlers are essentially tools, typically used by search engines like Google or Bing, to examine your website and index data from it, such as the content you’ve created and details about your company. This ensures your website shows up accurately in search results. Crawlers are how search engines like Google discover and comprehend your site, so the concept of a crawler isn’t new. As a website owner, you can decide which parts of your website you want crawlers to view and index in search results by using robots.txt files.
AI crawlers employ the same technology, but instead of merely indexing your website data, they scrutinise the information on your site and can use it to train their own technology: Large Language Models.
What is the purpose of AI crawlers?
AI crawlers are used to train the Large Language Models (LLMs) that power chatbots like ChatGPT. LLMs are trained on vast quantities of data, terabytes upon terabytes of text, and draw on that to answer your prompts.
It’s not just websites, either: LLMs are also trained on databases, documents and images to build their knowledge. Data is to an LLM what a buffet is to humans.
What AI crawlers are out there?
AI crawlers include:
- ChatGPT-User. This is utilised by ChatGPT when a user on GPT-4 directs the bot to your site in a prompt like “tell me how many times [SITE URL] mentions AI”.
- GPTBot. This crawler collects data from your site to use as training data for OpenAI’s models.
- Google-Extended. This is the token Google uses to collect data for all their AI products, including Gemini (formerly known as Bard), their AI chatbot.
- Anthropic-AI. Anthropic has a variety of AI tools, including Claude, their AI chatbot, and their crawler gathers the data for this.
- CCBot. This is Common Crawl’s bot, and the Common Crawl dataset is what GPT-3 was trained on. It’s designed to make web data freely available to everyone, without any charges.
Why would you want to block AI crawlers?
You might decide to block AI crawlers, especially if you’re worried about your content being misrepresented, or if your site is under development.
1. Misrepresented content
When humans create content, we write with subtlety, and there may be cultural or business context that makes the writing understandable to a specific audience. When your content is taken out of that context and used to form part of an AI chatbot’s response, it will likely lose that subtlety, and the point your content made may be lost or misrepresented entirely.
For some companies, this is a risk they don’t want to take, and so they block AI crawlers to prevent this. For instance, if you were a medical company with specific advice related to one of your products, you wouldn’t want an AI to apply that advice out of context to an unrelated product or medical query.
2. Unwanted association
AI crawlers tend to take sections of information from various websites without always understanding the context of that information, so there is a risk that your information may be presented alongside sources your business doesn’t want to be associated with. If this is a concern, you may choose to block AI crawlers. This prevents your company from being mixed in with competitors, or those in your industry who may not uphold best practice. For companies where reputation management is critical, this can be a very strong argument.
3. Data scraping
It’s best practice to block any crawlers from viewing parts of your site you don’t want them to see. For example, you might have a staff wellbeing portal on your intranet or customer logins on your website. You don’t want these crawled as they contain personally identifiable information, something your customers or employees definitely don’t want an AI company to have! OpenAI says that GPTBot is “filtered to remove sources that require paywall access, are known to primarily aggregate personally identifiable information (PII), or have text that violates our policies.” Most websites will already have these areas blocked from search crawlers, so it’s worth speaking to your hosting provider or SEO team to see whether they can add AI crawlers to the same rules.
4. Spam generation
As technology evolves, so do cybercriminals. We’re now seeing increasingly sophisticated phishing emails and malicious links created with AI-generated content. By combining AI-powered chatbots like ChatGPT with data harvested from your site, malicious actors can craft spam emails that more closely imitate your employees or the company itself. This ultimately could lead to more successful phishing attempts, which can cause financial and reputational loss for your business.
How to block AI crawlers
1. robots.txt
Most sites will already have a robots.txt file in place; it’s simply a matter of updating it to exclude the pages you want to block AI crawlers from viewing. Doing this protects sensitive data and content that shouldn’t be public knowledge from being accessed. A robots.txt file done wrong can stop your site from being seen by Google and other search engines, so it’s best to proceed with caution here. If you have an SEO agency, check in with them before you do this, as they will be able to help you.
You can also ask crawlers to crawl at a certain speed, and you can block them from some, but not all, of your site, such as admin areas, while leaving the rest open. Different businesses will have varying reasons to block crawlers, or not to block them at all.
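To make this concrete, here’s a rough sketch of what such rules could look like. The user-agent tokens match the crawlers listed earlier in this blog; the /customer-login/ path and the Crawl-delay value are purely hypothetical placeholders you’d swap for your own site’s structure and needs.

```
# Block the AI training crawlers discussed above from the whole site
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

# All other crawlers stay welcome, but keep them out of a hypothetical
# customer login area and ask them to crawl gently (Crawl-delay is only
# honoured by some crawlers)
User-agent: *
Disallow: /customer-login/
Crawl-delay: 10
```

One point worth knowing: blocking Google-Extended only opts your content out of Google’s AI training; it doesn’t affect how Googlebot crawls and indexes your site for search.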
2. Web Application Firewall (WAF)
You can also use a WAF to block specific crawlers as well as any other unwanted traffic to your site, keeping it up and running for your customers without hindering their experience on your website.
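As a rough illustration of the same idea at the server level, here’s what a user-agent rule might look like in nginx. A commercial WAF will have its own interface for this kind of rule, the bot names are simply the ones discussed above, and Google-Extended isn’t listed because it’s a robots.txt token rather than a distinct user agent.

```
# Hypothetical nginx rule: refuse requests whose User-Agent matches known
# AI crawlers. Place inside the relevant server block and adjust the list
# to suit your own policy.
if ($http_user_agent ~* "(GPTBot|ChatGPT-User|CCBot|anthropic-ai)") {
    return 403;
}
```

Unlike robots.txt, which relies on crawlers choosing to obey it, a rule like this refuses the request outright, so it also catches bots that ignore your robots.txt, provided they identify themselves honestly.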
Is it worth blocking AI crawlers?
So, is it really worth blocking AI crawlers?
When considering whether to block ChatGPT and similar crawlers, there’s more to weigh up than the downsides alone.
In November 2023, ChatGPT hit 100 million users per week. With this figure likely to grow, that’s a great deal of brand visibility you’re missing out on if you refuse to embrace this technology.
LLMs are the future of search, that’s more than clear. Bing has already embraced AI in the form of Microsoft’s Copilot, and Google is hot on its heels, recently moving its own AI-powered search feature AI Overviews, previously called Search Generative Experience (SGE), into the main Google search results. This means that if you’ve ever relied on organic search or SEO for a portion of your business generation, blocking AI could seriously hamper your efforts, if not now, then in the near future.
There’s even a branch of SEO forming known as Generative Engine Optimisation (GEO) or Answer Engine Optimisation (AEO) that focuses on improving visibility on popular LLMs like ChatGPT. Again, this may be an emerging acquisition channel that you’re missing out on if you block AI crawlers.
You should also consider how effective trying to block LLMs from your site actually is. First, you must look beyond the big-name AIs: blocking ChatGPT alone won’t cut it. Large language models like this are trained on a range of different datasets, such as Wikipedia and Reddit.
One of the datasets most commonly used by LLMs (including ChatGPT) is Common Crawl, which is maintained by a non-profit organisation and crawls the entire web. So, if you’re genuinely determined to exclude your site from LLMs, you need to block Common Crawl’s bot (CCBot) as well as the better-known AI crawlers.
Granting access to your website content can assist in ensuring that your brand is accurately and favourably portrayed to ChatGPT users. Blocking it may actually have the opposite effect if you’re trying to avoid being misrepresented online.
To block or not to block, that is the question
Our advice is to not block AI crawlers unless you have a genuine reason to do so, such as those that we mentioned above. AI is less of a hindrance and more of an exploration, and restricting its exploration of your website may hold you back.
If your company is debating whether to block or not to block, why not get in touch with us? We’ve supported companies across the UK with their SEO including whether or not to block AI.