Robots.txt Generator

Generate robots.txt with a visual editor. Presets for standard sites, WordPress, e-commerce, and AI crawler blocking. Add sitemaps and crawl delays.

What is Robots.txt Generator?

A Robots.txt Generator helps you create the robots.txt file—a plain-text file placed at the root of your website that instructs web crawlers (robots) which parts of your site they may access and which they should ignore. The format is defined by the Robots Exclusion Protocol and is honored by all major search engine bots, including Googlebot, Bingbot, and DuckDuckBot, as well as numerous other automated crawlers. Robots.txt is essential for crawl control: you can keep crawlers away from duplicate-content URLs (pagination parameters, sort filters, print versions), keep them out of staging areas and admin sections, manage crawl budget on large sites by steering Googlebot toward your most important pages, stop image or media directories from consuming crawl resources, and block AI training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) that scrape your content for large language model training data. The file must be placed at exactly https://yourdomain.com/robots.txt—a strict path requirement that many developers get wrong. Writing robots.txt manually requires knowing the correct syntax (User-agent, Disallow, Allow, Sitemap directives), which this visual generator handles automatically.
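
For example, a small robots.txt using all four directives might look like the sketch below; the /private/ paths and the sitemap URL are placeholders for illustration, not rules you should copy as-is.

    User-agent: *
    Disallow: /private/
    Allow: /private/downloads/

    Sitemap: https://yourdomain.com/sitemap.xml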

How to Use Robots.txt Generator

Start with one of the preset templates that match your site type: Standard Website (allow everything, just add sitemap), WordPress (disallow common WordPress admin paths), E-commerce (disallow cart, checkout, and account paths), or Block AI Crawlers (disallow GPTBot, ClaudeBot, Google-Extended, and other AI training bots). The preset populates the editor with appropriate rules. To customize: click 'Add Rule Group' to add a new User-agent block. Enter the user agent name (use * for all bots, or a specific bot name like Googlebot or GPTBot), then add Disallow and Allow rules for URL paths. A path of / disallows the entire site; /admin/ disallows any URL starting with /admin/. Add your sitemap URLs in the Sitemap section—they appear as Sitemap: https://... directives at the bottom. Set a Crawl-delay value (in seconds) if you want to limit the crawl rate for specific bots. The generated robots.txt appears in the preview panel. Click Copy or Download to get the file, then upload it to your web server's root directory.
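
The exact rules each preset emits depend on the generator, but an e-commerce output would typically look something like this (the paths and sitemap URL are illustrative):

    User-agent: *
    Disallow: /cart/
    Disallow: /checkout/
    Disallow: /account/

    Sitemap: https://yourdomain.com/sitemap.xml

Note that crawlers follow only the single group that best matches their user agent, so rules under User-agent: * are ignored by a bot that has its own group; repeat any shared rules inside each bot-specific group.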

FAQ

Where exactly should robots.txt be placed?

The robots.txt file must be accessible at the root domain: https://yourdomain.com/robots.txt. It cannot be in a subdirectory—https://yourdomain.com/blog/robots.txt will not work. If your site is on a subdomain, each subdomain needs its own robots.txt file (e.g., https://docs.yourdomain.com/robots.txt). After placing the file, verify it is accessible by visiting the URL directly in your browser, then confirm Google can read it using the robots.txt report in Google Search Console (under Settings).

Can I block AI crawlers from scraping my content?

Yes. Major AI companies have published their crawler user agent names: GPTBot (OpenAI/ChatGPT), ClaudeBot (Anthropic/Claude), Google-Extended (Google's AI training, separate from Googlebot), CCBot (Common Crawl—used by many LLMs), PerplexityBot (Perplexity AI), and Bytespider (ByteDance/TikTok). Add Disallow: / rules for each of these user agents, as shown below. Keep in mind that robots.txt compliance is voluntary: the major AI labs state that these crawlers respect it, but less reputable scrapers may ignore the file entirely.
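
A blocking section for the crawlers named above looks like this; add or remove user agents to match your own policy.

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /

    User-agent: Bytespider
    Disallow: /

According to Google, blocking Google-Extended does not affect normal Googlebot crawling or your ranking in Search; it only opts your content out of Google's AI training.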

Does Disallow in robots.txt prevent indexing?

Not completely. Disallow prevents crawling (Googlebot won't download the page), but if other sites link to a disallowed URL, Google can still index that URL based on the links alone—it just won't know the page content. For reliable protection against indexing, use the noindex meta tag in the page's HTML: <meta name='robots' content='noindex'>. This way, even if Googlebot crawls the page, it will not add it to search results. Do not combine Disallow and noindex on the same URL: if robots.txt blocks the page, Googlebot never fetches it and therefore never sees the noindex tag. For truly sensitive pages, require a login instead.

What is the difference between Disallow and Allow?

Disallow blocks access to a path, and Allow explicitly permits a path that a broader Disallow rule would otherwise block. When rules conflict, the rule with the longest matching path (the more specific one) takes precedence. For example: Disallow: /admin/ blocks all of /admin/, but Allow: /admin/public/ overrides the Disallow for that specific path. This pattern is common for WordPress sites that want to block /wp-admin/ but allow /wp-admin/admin-ajax.php (needed for some public AJAX functions), as shown below. Allow rules only make sense in combination with a more general Disallow rule.
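
Written out, the WordPress example above becomes:

    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php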

What is the Crawl-delay directive and should I use it?

Crawl-delay: N tells a crawler to wait N seconds between requests to your server. For example, Crawl-delay: 2 limits the bot to at most 30 requests per minute. This is useful for protecting servers with limited resources from being overwhelmed by aggressive crawlers. However, Googlebot ignores Crawl-delay entirely; Google adjusts its crawl rate automatically based on how your server responds (the manual crawl rate limiter in Search Console has been retired). Crawl-delay is respected by Bingbot, DuckDuckBot, and many other crawlers.
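
For example, to throttle Bingbot while leaving other crawlers untouched, you could add a group like the one below; the 2-second value is only an illustration, so pick a delay your server actually needs.

    User-agent: Bingbot
    Crawl-delay: 2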