robots.txt – Controlling Search Engine Crawling with Precision
The robots.txt file is a plain text file that must be located at the root of a host (e.g. https://www.example.com/robots.txt). It provides instructions to search engine crawlers on which parts of a site may or may not be crawled. It's a key component of technical SEO and follows the Robots Exclusion Protocol.
Search engines check this file before crawling a website to determine access permissions.
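To see this mechanic from a crawler's point of view, here is a minimal sketch using Python's standard-library urllib.robotparser. The domain and path are placeholders, and note that this parser applies the classic first-match rules rather than reproducing every nuance of Google's precedence logic:

import urllib.robotparser

# Download and parse the site's robots.txt (www.example.com is a placeholder)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask whether a given user agent may crawl a given URL;
# returns False if a matching Disallow rule applies
print(rp.can_fetch("Googlebot", "https://www.example.com/internal/"))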
Common use cases:
- Blocking access to internal or irrelevant sections (e.g. admin areas, filter URLs)
- Conserving crawl budget on large websites
- Preventing unintentional duplication issues
- Allowing/disallowing specific user agents (e.g. Googlebot, Bingbot) – see the per-agent example after this list
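Rules can be scoped to individual crawlers by naming them in the User-agent line. A minimal sketch – the paths here are hypothetical placeholders:

User-agent: Googlebot
Disallow: /search-results/

User-agent: *
Disallow: /admin/
Disallow: /cart/

A crawler obeys only the most specific group that matches its name, so in this sketch Googlebot follows its own block and ignores the wildcard rules entirely.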
Example:
User-agent: *
Disallow: /internal/
Allow: /internal/seo-checklist.pdf
This configuration blocks all crawlers from the /internal/ directory while still permitting the single PDF: when Allow and Disallow rules conflict, Google applies the most specific (longest) matching rule, so the Allow directive wins for that file.
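Most major crawlers, including Google and Bing, also support simple pattern matching, which is useful for the filter URLs mentioned above: the * wildcard matches any sequence of characters, and $ anchors the end of a URL. A hypothetical example – the parameter name is a placeholder:

User-agent: *
Disallow: /*?filter=
Disallow: /*.pdf$

The first rule blocks any URL containing ?filter=, regardless of the path before it; the second blocks all URLs ending in .pdf.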
Important Notes:
- robots.txt only controls crawling, not indexing – a blocked URL can still appear in search results (without a snippet) if other pages link to it.
- To prevent indexing, use a noindex meta tag on the page – which requires the page to be crawlable, since Google must fetch the page to see the tag (see the snippet after these notes).
- Misconfiguration can block important pages from search results entirely.
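As mentioned above, the standard way to keep a crawlable page out of the index is the robots meta tag; for non-HTML resources such as PDFs, the equivalent X-Robots-Tag HTTP response header does the same job:

<meta name="robots" content="noindex">

X-Robots-Tag: noindex

Both are documented Google directives – but remember that neither works if robots.txt prevents the page from being crawled in the first place.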
Note: Crawled pages are not guaranteed to be indexed – Google may decline to index them if it judges them low quality or redundant.