
The key points:  👇

  • Control crawler access: The robots.txt file serves as a gatekeeper, providing specific instructions to search engines and AI bots about which parts of your site they are allowed to crawl.
  • Protect sensitive areas: Use the “disallow” directive to hide private or low-value folders – such as admin panels, staging sites, and internal search results – from public indexing.
  • Promote your sitemap: Always include a direct link to your XML sitemap within the file to ensure search engines can quickly find and index your most important pages.
  • Manage AI scrapers: You can use specific user-agent commands to block AI bots from training on your data while still allowing traditional search engines like Google to show your site in results.
  • Perform regular audits: Periodically testing your robots.txt is crucial to ensure you haven’t accidentally blocked vital content, which can lead to sudden and severe drops in search rankings.

Every technical SEO should know their way around the core principles of a robots.txt file. It sits there as the very first thing a crawler looks for when it hits a subdomain. Getting the basics spot on is utterly critical if you want to avoid pages showing up poorly in search results, or dropping out of the index entirely.

The landscape has shifted dramatically over the last couple of years with the aggressive arrival of Large Language Models (LLMs) and their extraordinarily hungry data-scraping bots. Writing a functional robots.txt file requires a solid understanding of traditional search engine crawlers and the newer, highly demanding AI agents currently chewing through the internet.

[Image: Patrick from SpongeBob vacuuming up Krabby Patties, standing in for hungry crawling bots.]

Yes, bots are THIS hungry. Learn to control them with robots.txt.

Let’s have a look at the rules, the quirks, and the modern realities of managing your crawl budget and privacy.

Location, location, location

Your robots.txt file absolutely should sit at the root of your subdomain. You have zero room for negotiation here. When a crawler hits your site, it strips the path from the URL, keeping just the protocol and hostname (everything up to the first forward slash after the domain), and appends /robots.txt to find its instructions. In practical terms, your setup should look something like this, depending on your domain scenario:

  • http://www.website1.com/robots.txt
  • http://website2.com/robots.txt
  • http://place.website3.com/robots.txt

Put it anywhere else, and crawlers aren’t guaranteed to find it. Some of the smarter bots might stumble across a misplaced file, but many will simply assume you have no robots.txt file on your site at all. They will then conclude they can access absolutely everything and go completely berserk, crawling every single inch of the site they can reach.

[Image: gif of James T. Kirk crawling around some rubble, because sometimes bot crawling gets out of control.]

Yes, “berserk crawling” is a thing, as demonstrated here by Captain James T. Kirk.

Now, this might be fine if you run a tiny, five-page brochure website. It becomes a massive SEO risk on a large ecommerce catalogue or an enterprise site where you desperately need to control crawler behaviour to ensure high-priority pages get indexed efficiently. Safer to not risk it IMO, just put the damn file in the right place.

The basic building blocks

You can create a robots.txt file in any basic text editor, right down to Notepad. A very basic, friendly robots.txt file will look something like this:

User-agent: *
Disallow:
Sitemap: http://www.website.com/sitemap.xml

The first line uses a wildcard asterisk to mean “any user agent” (or “any robot”). The blank disallow line means nothing on the site is restricted from crawling. The sitemap line clearly specifies the location of the XML sitemap index for the website, so the bot can hop straight onto it and start indexing from that clean list. It keeps everything nice, tidy, and efficient.

If you want to stop all bots from indexing content within certain folders, such as an area only accessible to logged-in users, the file just needs a minor adjustment.

User-agent: *
Disallow: /user-area/
Sitemap: http://www.website.com/sitemap.xml

You can easily keep robots out of a single page or a specific file by targeting the exact path.

User-agent: *
Disallow: /user-area/
Disallow: /assets/media/invoice-template.pdf
Sitemap: http://www.website.com/sitemap.xml

That at least takes care of the basic essentials.

Taming the LLM scrapers

The rules change slightly when we look at the bots operated by generative AI companies. Tools like ChatGPT, Claude, and Perplexity rely on massive web scrapers to build their training datasets. These bots, such as GPTBot, CCBot, or ClaudeBot, comb through your site specifically to extract training data rather than to rank your pages in a traditional search index.

Many site owners now actively choose to block these crawlers to protect their intellectual property. While this of course limits your chance of being mentioned in LLM results when people go to ask them things, many sectors, such as publishing, see it as essential anyway. In fact, recent data shows that a whopping 80% of the top news websites in the UK and US are now blocking AI training bots. So, depending on your sector, you may well want to join them and keep those plagiarising little clankers away from your content.

If you want to prevent, say, OpenAI, from using your content to train its models, you need to call out their specific user agent, for example:

User-agent: GPTBot
Disallow: /

You can stack these rules to block multiple AI crawlers while leaving the door wide open for (say) Googlebot and Bingbot to continue indexing your site for traditional search visibility. Just remember that the AI landscape moves rapidly, and new scraper bots (some more legitimate than others, and some more concerning) appear almost weekly. Keeping them blocked is likely to need regular auditing and updates to your robots.txt file.
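
As a rough sketch, a stacked setup might look like the example below. The user-agent tokens are the ones named above, but each vendor can change or add tokens over time, so verify them against current documentation before relying on this:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow:
Sitemap: http://www.website.com/sitemap.xml

Because a bot follows the most specific user-agent group that matches it, the named AI crawlers get the blanket ban while everything else, including Googlebot and Bingbot, falls back to the open wildcard group.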

Important notes on blocking behaviour

Blocking things in robots.txt does not prevent them from appearing in search engine results pages entirely. A disallow directive stops the bot from crawling the page, but if the page acquires internal or external links, the search engine still knows it exists. You will often see these pages appear in the SERP with a message stating that no information is available for the page.

For things like user areas or invoice templates, you probably care very little about outlier cases where they show up like this, provided the full content remains unindexed.

Brands highly sensitive to certain URLs or confidential files must take a different approach. To ensure these files never show up in a search engine in any shape or form, you must actually allow the bots to crawl them. The bot needs to crawl the asset so it can see the ‘noindex’ meta tag or the ‘X-Robots-Tag: noindex’ HTTP header. If you block the page in robots.txt, the bot never sees the noindex directive.
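
As a minimal sketch, using the invoice template from earlier as a hypothetical example: an HTML page would carry the meta tag in its <head>, while a non-HTML asset like a PDF needs the equivalent HTTP response header set at server level.

<meta name="robots" content="noindex">

X-Robots-Tag: noindex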

Also make sure that you don’t block assets necessary to render pages in a browser. Developers historically mass-blocked scripts and CSS folders to save crawl budget, but doing this today will result in grumpy error messages in Google Search Console and a direct negative impact on your organic visibility levels. Google announced this fundamental rendering change back in 2014, leaving you with zero excuses for getting it wrong today.
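
For instance, legacy rules along these lines (the folder names are hypothetical, but typical of older WordPress and asset setups) are exactly the sort of thing worth removing today:

User-agent: *
Disallow: /wp-includes/
Disallow: /assets/css/
Disallow: /assets/js/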

[Image: gif of John Travolta looking around in bafflement, much like a search bot that can’t find the rendering assets.]

Googlebot, probably: hey, where’d all the rendering assets go?

Technical quirks and specific rules

There are plenty of other technical elements you need to manage within a robots.txt file to keep things running smoothly.

Crawl delays are a legacy feature once used to throttle robot access on fragile servers. You have absolutely no reason to use them in a modern hosting setup, and most major bots entirely ignore crawl delay rules anyway.
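
For reference only, the legacy syntax looked like this, with the number representing a requested pause in seconds between fetches:

User-agent: *
Crawl-delay: 10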

Most robots will happily honour pattern-matching rules. You can use the asterisk as a wildcard to mean “any sequence of characters,” and the dollar sign to explicitly match the end of a URL string.
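
A quick illustrative sketch (the paths are hypothetical): the first rule below blocks any URL containing a sessionid query parameter, and the second blocks any URL ending in .pdf:

User-agent: *
Disallow: /*?sessionid=
Disallow: /*.pdf$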

The robots.txt file remains strictly case-sensitive across the board. Do not call the file robots.TXT, and make absolutely sure any rules you write case-match the specific URLs on your server.
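
For example, assuming the folder on your server is lowercase, the first rule below does its job while the second quietly matches nothing:

User-agent: *
Disallow: /user-area/
Disallow: /User-Area/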

Only one URL rule can exist per line. If you want to disallow three specific folders, you’re going to need to write three separate disallow lines.

Processing order requires careful attention. Some bots, including Google and Bing, use a “most specific rule first” principle, whereas standard processing order runs from top to bottom. If you’re jittery, just make sure you put any ‘Allow’ directives above your ‘Disallow’ directives to ensure the bot reads the exception before it reads the blanket ban.
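
A sketch of the pattern, assuming a hypothetical /user-area/ folder with a single public help page you still want crawled:

User-agent: *
Allow: /user-area/help/
Disallow: /user-area/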

Avoid using robots.txt to solve complex architectural problems. Blocking mobile websites from non-mobile bots, or trying to block duplication caused by messy faceted navigation, rarely ends well. Address those situations with canonical tags and proper server-side solutions rather than throwing sticky plaster rules into your text file.

You can easily add human-readable comments to your file by placing a hash symbol # at the very beginning of a line. This helps document your rules for anyone else who needs to unpick things later and will be wondering what mad crawl control spree you were on.
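
For example, a commented-up version of the earlier file might read:

# Keep bots out of the logged-in area
User-agent: *
Disallow: /user-area/

# Point crawlers at the sitemap index
Sitemap: http://www.website.com/sitemap.xml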

The security illusion

The robots.txt standard acts strictly as a polite directive, not an enforceable law. Malicious bots, email scrapers, and all kinds of aggressive data miners will generally ignore your file entirely in favour of ripping whatever they want from your site.

Be acutely aware that the robots.txt file is entirely public. Anyone on the planet can read your rules simply by navigating to the file in their browser. This makes it a terrible place to hide things.

Trying to use robots.txt for site security is a bit like putting a big sign up outside your Secret Government Base.

[Image: the Secret Government Base sign from South Park.]

Yeah, it’s kind of like that.

If you create a rule disallowing a folder called /secret-admin-login-area/, you have just handed every hacker a map directly to your sensitive pages. Never rely on this file to keep secure areas of your site hidden. Use appropriate server-side encryption and robust login protocols to protect your business.

That… kind of wraps it up, to be honest, and I can’t think of a witty conclusion paragraph. But if it all sounds a bit complicated, and you’d rather not get lost in a maze of directives and crawling chaos, I offer technical SEO audits and support without the baffling jargon. You’ll get solutions that make sense and actually work for your business, without needing to pump this article into your LLM of choice to figure out what the heck is actually going on in there.

written by


Ruth Attwood
SEO Consultant

With 10+ years in the search industry, Ruth helps brands navigate the complexities of modern organic search without the agency jargon. As the founder of Puglet Digital, she has delivered results for everything from niche luxury labels to major financial institutions and high growth startups, specialising in scalable growth strategies, international deployments, and technical hygiene, including site migrations. >> view on LinkedIn