More and more, we humans rely on AI-driven tools to make our work easier. Applications like ChatGPT and DALL·E help us write text and create images. But do you actually want AI chatbots sifting through your content? I'll explain how, with a few simple tweaks, you can gain complete control over whether or not chatbots like ChatGPT are allowed to use it.
Open any tech blog today and you'll stumble across posts about AI tools like ChatGPT. Developments are rapid, but they sometimes cross ethical boundaries. Late last year, for example, The New York Times took legal action against OpenAI after discovering that ChatGPT was reusing its content.
Do you have a website with content you'd rather not share with ChatGPT? Then read on! If, from a marketing point of view, you actually want to be visible, see this article on how to be visible in ChatGPT instead.
How does ChatGPT get my content?
For years there have been bots on the Internet, scouring the Web in search of new content and updated pages. In the beginning these were mainly bots that indexed your pages for search engines like Google and Bing, but since the advent of AI tools, bots have taken on another purpose as well: gathering content.
This is done through a technique known as scraping. As a bot goes through your website, it identifies the most important content on each page. That content is indexed and stored in huge databases, where it is labeled and can be used to train AI models. After that, your words may just come out of the digital mouth of a chat tool.
Keep AI crawlers out with robots.txt
Keeping bots from scraping your website is relatively simple, because they (if all goes well*) respect your robots instructions. These instructions tell the "crawlers" (as these bots are also called) which pages on your website they may and may not scrape. You record all of this in a robots.txt file stored in the root of your website.
However, it is important to remember that if you block crawlers from your entire website, none of your pages will show up in the results of ChatGPT, for example. It may therefore be worthwhile to plan out which pages you do and do not want to block. Need help with that? Then contact me and we'll get started together.
* Almost everyone on the Internet respects robots.txt, but that is no guarantee that everyone plays by the rules. Want to know more about robots.txt? Then read this article from The Verge.
Here’s how to modify robots.txt
If you want to make changes to your robots.txt file, in many cases you will need access to your server’s file manager (for example, via an FTP client). If you use WordPress, you can also use an SEO plugin such as RankMath to modify your robots.txt file.
1. Open your robots.txt file and see which rules it already contains. The file typically consists of combinations of a bot (listed as User-agent) and the rules that bot must follow (listed as Disallow). A minimal example follows below.
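As an illustration, an existing robots.txt file often looks something like the snippet below. The /wp-admin/ rules shown here are the common WordPress defaults; your file may contain different paths:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php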
2. Determine which bots may and may not access your website. In addition to the usual crawlers from Google, Bing, and others, there are a number of recognizable AI crawlers that you may want to block. The most well-known at the moment are as follows (a combined example that blocks them all appears after this list):
- GPTBot – OpenAI’s crawler for ChatGPT, among other things. When you exclude this one, you automatically exclude ChatGPT-User as well, which also prevents your pages from showing up in ChatGPT’s search results.
- Google-Extended – Google’s crawler that the company uses to train their Gemini model. According to Google, excluding Google-Extended has no impact on your search rankings in Google or your visibility in SGE.
- FacebookBot – Facebook’s crawler that it uses to improve its language models.
- anthropic-ai and Claude-Web – the crawlers of AI company Anthropic.
- CCBot – the crawler of Common Crawl, which maintains a public repository of the Web.
- PiplBot – Pipl’s crawler, which collects documents to build a searchable index.
- Bingbot – Microsoft’s crawler, which it also uses to train its Bing AI model. Note that blocking Bingbot will also remove your pages from Bing’s regular search results.
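If you decide to block every crawler on this list, your robots.txt would contain a block like the sketch below. Only keep the entries for the bots you actually want to exclude:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PiplBot
Disallow: /

User-agent: Bingbot
Disallow: /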
3. Determine which pages should and should not be excluded. In many cases you will want to block a bot from your entire website; a simple / in the Disallow line of your robots.txt file then suffices. If you only want to exclude specific sections of your website, you will need more specific rules, as in the example below. You can read more about that here.
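For example, a sketch that keeps GPTBot out of one section and one file while leaving the rest of the site accessible. The paths here are made-up placeholders; replace them with the sections you want to protect:

User-agent: GPTBot
Disallow: /blog/
Disallow: /downloads/whitepaper.pdf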
4. Check your new rules. If you processed everything correctly, it should look something like this:
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /

5. Save your robots.txt file and upload it back to your server. It can sometimes take several days, but traffic from these crawlers should then stop on its own. If you still suspect that new content is ending up in the models, check your robots.txt file again after a while to make sure the rules are still set correctly.
Need help? Let me know
Not getting there, or want more advice on excluding AI crawlers from your website? Then get in touch with me – I’m happy to help!
Frequently asked questions

How does ChatGPT get my content?
Bots known as crawlers scrape the Web, including your website. The collected content is stored in huge databases and used to train AI models such as ChatGPT.

How can I prevent AI bots from scraping my site?
By adding rules to the robots.txt file on your server. These rules tell AI crawlers which pages they may and may not visit.

What is a robots.txt file?
A plain text file in the root of your website that tells crawlers which pages they can and cannot scrape.

Which AI crawlers can I block via robots.txt?
The best-known at the moment are GPTBot, Google-Extended, FacebookBot, anthropic-ai, Claude-Web, CCBot, PiplBot, and Bingbot.

How do I modify my robots.txt?
Open the file via your server’s file manager (or via an SEO plugin if you use WordPress), add a User-agent and Disallow rule for each bot you want to block, and upload the file back to your server.

Can I exclude only specific pages from AI crawlers?
Yes, you can. Instead of blocking your entire site, you can specify in your robots.txt file which specific parts you want to exclude by using more specific rules, as in the example below.
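As a minimal sketch, with a made-up path as placeholder:

User-agent: GPTBot
Disallow: /members-only/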


