robots.txt – Controlling Search Engine Crawling with Precision

The robots.txt file is a plain text file located in the root directory of a website (e.g. https://example.com/robots.txt). It tells search engine crawlers which parts of a site they may or may not crawl. It’s a key component of technical SEO and follows the Robots Exclusion Protocol.

Search engines check this file before crawling a website to determine access permissions.

Common use cases:

  • Blocking access to internal or irrelevant sections (e.g. admin areas, filter URLs)
  • Conserving crawl budget on large websites
  • Preventing unintentional duplication issues
  • Allowing/disallowing specific user agents (e.g. Googlebot, Bingbot) – see the second example below

Example:

User-agent: *
Disallow: /internal/
Allow: /internal/seo-checklist.pdf

These rules block all crawlers from the /internal/ directory, except for one specific file.
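
Rules can also target individual crawlers by name: a crawler obeys the most specific group that matches its user agent and ignores all other groups. An illustrative sketch with placeholder paths:

User-agent: Googlebot
Disallow: /filter/

User-agent: *
Disallow: /filter/
Disallow: /search/

Here Googlebot may still crawl /search/, because it only follows its own group, while all other crawlers are blocked from both directories.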

Important Notes:

  • robots.txt only controls crawling, not indexing – a blocked URL can still appear in search results if other pages link to it.
  • To prevent indexing, use a noindex meta tag (<meta name="robots" content="noindex">) on the page – this only works if the page remains crawlable, because the crawler has to fetch the page to see the tag.
  • Misconfiguration can block important pages from search results entirely.
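
A single misplaced rule can take an entire site out of the crawl. This sketch, for example, blocks every crawler from every URL on the site:

User-agent: *
Disallow: /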

Note: Crawled pages are not guaranteed to be indexed – Google may choose to ignore them if they’re low quality or redundant.
