Robots.txt is a plain text file placed in the root directory of a website (for example, at https://example.com/robots.txt) that tells search engine crawlers which pages or sections they may or may not crawl. It is typically the first file Googlebot requests when it visits a site.
The file uses simple directives. Disallow: /admin/ tells crawlers not to crawl the admin section, while Allow: / permits crawling of everything. Different rules can be set for different crawlers by grouping directives under a User-agent line.
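
A minimal robots.txt might look like the following sketch; the hostname and paths are illustrative, not a recommendation for any particular site:

    # Rules for all crawlers
    User-agent: *
    Disallow: /admin/
    Allow: /

    # Stricter rules for one specific crawler
    User-agent: Googlebot-Image
    Disallow: /private-images/

    # Optional pointer to the XML sitemap
    Sitemap: https://example.com/sitemap.xml

A crawler reads the group whose User-agent line matches it most specifically; in this sketch, Googlebot-Image follows its own group rather than the general one.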
Robots.txt controls crawling, not indexing. This is a critical distinction. Blocking a URL in robots.txt prevents Googlebot from crawling it but does not prevent that URL from appearing in search results if other sites link to it. To prevent a page from being indexed, you need a noindex meta tag on the page itself, and for Google to see that tag the page must remain crawlable.

Common robots.txt mistakes include accidentally blocking CSS and JavaScript files that Google needs to render pages correctly, relying on robots.txt to hide staging environments instead of password-protecting them, and using robots.txt to keep pages out of search results when noindex is the correct tool.
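
As an illustrative sketch (the /assets/ and /admin/ paths are made up for the example), the first block below shows the kind of rule that causes rendering problems and the second shows a safer alternative; the noindex tag itself belongs in the page's HTML, not in robots.txt:

    # Problematic: blocks the CSS and JavaScript Googlebot needs to render pages
    User-agent: *
    Disallow: /assets/

    # Safer: block only what should not be crawled, keep rendering resources open
    User-agent: *
    Disallow: /admin/
    Allow: /assets/css/
    Allow: /assets/js/

    # To keep a page out of search results, put a meta tag in that page's HTML
    # and make sure the page is NOT blocked here, so Google can crawl and see it:
    # <meta name="robots" content="noindex">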
