SEO

Robots.txt

What is robots.txt? A text file that controls crawler access and helps prevent duplicate content issues.

What is robots.txt?

Robots.txt is a text file in your website's root directory that instructs search engine bots (crawlers) which parts of your website can be crawled and which cannot. It is an important component of Technical SEO that manages crawl budget, prevents duplicate content, and protects sensitive areas.

For B2B websites, robots.txt is critical to ensure that Google crawls the right pages with the right priority and doesn't waste time on admin pages or duplicates.

Robots.txt in B2B Context

B2B websites often have areas that Google should not crawl: admin panels, customer accounts, internal documentation, filter variations of product pages. Without robots.txt blocking these pages, Google could:

  • Index duplicate content that harms rankings
  • Waste crawl budget on unimportant pages
  • Index pages with privacy concerns

A strategic robots.txt optimizes crawl efficiency and prevents indexing problems.

Structure and Syntax of robots.txt

A robots.txt file follows this structure:

User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Disallow: /*?sort=
Allow: /important-admin-page/

User-agent: *
Disallow: /temp/

Sitemap: https://example.com/sitemap.xml

Explanation of components:

  • User-agent: Which bot follows these rules (* = all bots)
  • Disallow: Paths that should not be crawled
  • Allow: Exceptions within a disallowed path that may still be crawled (for Google, the more specific rule wins)
  • Sitemap: Optional reference to XML sitemap
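Rules like these can also be checked programmatically. A minimal sketch using Python's standard-library urllib.robotparser (the example.com URLs and paths are illustrative); note that this parser uses simple prefix matching and applies the first matching rule, so it does not replicate Google's wildcard or longest-match behavior:

```python
from urllib import robotparser

# Illustrative rules; Allow is listed first because urllib.robotparser
# applies the first matching rule, unlike Google's longest-match logic.
rules = """\
User-agent: *
Allow: /admin/help/
Disallow: /admin/
Disallow: /login/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/products/"))       # True: no rule matches
print(rp.can_fetch("*", "https://example.com/login/"))          # False: blocked
print(rp.can_fetch("*", "https://example.com/admin/help/faq"))  # True: Allow exception
```

This kind of check is useful in deployment pipelines, e.g. to catch a robots.txt change that would accidentally block key landing pages.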

Disallow vs. Noindex

A common mistake: people confuse Disallow in robots.txt with Noindex. This is dangerous:

  • Disallow in robots.txt: The page is not crawled, but it can still be indexed if other sites link to it. Use for pages that should not be crawled.
  • Noindex meta tag: The page is crawled but not indexed. Use for pages without ranking intent.

If you really want to remove a page from the index, you need Noindex, not Disallow. Disallow only means "don't crawl this"; it doesn't mean "don't index this". Worse, Google can only see a Noindex tag on pages it is allowed to crawl, so a Disallow can actually keep an unwanted page in the index.
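For reference, a Noindex directive lives on the page itself, not in robots.txt. It is placed in the page's head:

```
<meta name="robots" content="noindex">
```

The same directive can also be sent as an HTTP response header (X-Robots-Tag: noindex), which additionally works for non-HTML files such as PDFs. Either way, Google only sees it if the page is not blocked in robots.txt.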

Best practices for robots.txt

  • Root directory: File MUST be in root directory: /robots.txt (not /robots.txt.txt or /assets/robots.txt)
  • Text only: robots.txt is pure text file, not HTML or other formats
  • Keep it simple: Don't make it too complicated. Focus on important disallows
  • Review regularly: When website structure changes, update robots.txt
  • Case-sensitive: Paths are case-sensitive: /admin/ is not the same as /Admin/
  • Wildcards work in both rules: Google supports the wildcards * and $ in both Allow and Disallow, but not every crawler does, so test your patterns

Common robots.txt Entries for B2B Websites

Typical disallows for B2B pages:

  • /admin/: Admin panel
  • /login/: Login pages
  • /user-account/: Personal accounts
  • /temp/: Temporary pages
  • /drafts/: Drafts
  • /*?sort=: Filtered/sorted versions of product pages
  • /*?utm_: URLs with tracking parameters
  • /print/: Print versions

Crawl Budget Optimization

Crawl budget is the number of URLs Googlebot can and wants to crawl on your website within a given time. For larger websites, crawl budget is limited. With robots.txt you can steer Google toward the important pages:

User-agent: *
Disallow: /search-results/
Disallow: /tag-archive/
Disallow: /*?page=

Sitemap: https://example.com/sitemap.xml

This concentrates Google's crawl capacity on important content, not filter pages.

Parameter Handling

B2B websites often have URL parameters for filtering, sorting, or tracking:

  • /?sort=price: Sorted view (Disallow)
  • /?filter=category: Filtered view (Disallow)
  • /?utm_source=email: Tracking parameter (Disallow)

These should be blocked in robots.txt because they create duplicate content.
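Put together, the parameter rules above could look like this in robots.txt (the patterns are illustrative; adjust them to your actual URL structure):

```
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?utm_
```

For tracking parameters in particular, canonical tags are often the safer complement, since a blocked URL can still be indexed if it is linked externally.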

XML Sitemap in robots.txt

At the end of robots.txt you can specify the XML sitemap:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml

This is optional but recommended - it helps Google find your sitemap(s) faster.

Test and Validate robots.txt

Check your robots.txt regularly:

  • Google Search Console: The robots.txt report (under Settings) shows how Google fetched and interprets the file; it replaced the older robots.txt Tester
  • View directly in browser: Open https://example.com/robots.txt - the file is publicly accessible
  • Syntax validation: Third-party robots.txt validators can flag syntax errors
  • Check crawl impact: The Page indexing (formerly Coverage) report in Search Console shows which pages are blocked

Common robots.txt Mistakes

  • Blocking important pages: Accidentally blocking important pages with Disallow
  • Wrong path syntax: Paths must start with "/"
  • Too restrictive: Disallowing everything except individual pages is often not a good idea
  • Not updated: Still blocking old paths that no longer exist
  • robots.txt too large: Google only processes the first 500 KB of the file, so keep it well below that limit
  • Wrong User-Agent syntax: User-agent: Googlebot vs. User-agent: googlebot - both work but be consistent
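Several of these mistakes can be caught automatically. A minimal sketch of such checks in Python (the lint_robots helper is hypothetical, not a standard tool, and only covers the size and path-syntax rules above):

```python
def lint_robots(text: str) -> list[str]:
    """Run basic sanity checks on robots.txt content; return problems found."""
    problems = []
    # Google only processes the first 500 KB of the file.
    if len(text.encode("utf-8")) > 500 * 1024:
        problems.append("file exceeds 500 KB")
    for n, line in enumerate(text.splitlines(), 1):
        line = line.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {n}: missing ':'")
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        # Paths in Allow/Disallow must start with "/" (empty value = no rule).
        if field in ("disallow", "allow") and value and not value.startswith("/"):
            problems.append(f"line {n}: path should start with '/'")
    return problems
```

Running it over a file before deployment turns silent robots.txt typos into visible build failures.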

robots.txt and Privacy

robots.txt is NOT a security mechanism. Anyone can open /robots.txt and see which paths you're trying to block. So:

  • Don't use it to protect secret paths: Don't use robots.txt to hide sensitive URLs
  • Use additional authentication: Admin panels should be password-protected, not just blocked by robots.txt
  • HTTPS for sensitive pages: Personal data should be encrypted
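For example, password protection for an admin area at the web-server level might look like this (an nginx sketch; the location and credentials file are assumptions):

```
# nginx: require HTTP Basic Auth for the admin area,
# independent of what robots.txt says
location /admin/ {
    auth_basic "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;
}
```

This keeps the area inaccessible even to crawlers that ignore robots.txt entirely.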

robots.txt and Noindex Combination

Best practice combines robots.txt with other indexing controls - but remember that Google can only see a Noindex tag on pages it is allowed to crawl, so Disallow and Noindex on the same URL work against each other:

  • Page 2+ of pagination: Use Noindex (or canonical tags) and leave the pages crawlable, so Google can see the directive
  • Temporary pages: Use Noindex while the page exists; Disallow alone does not remove it from the index
  • Parameter URLs: Use Canonical Tags to consolidate duplicates; reserve Disallow for parameter variants with no SEO value

robots.txt as Part of Your SEO Strategy

robots.txt is a small but important piece of technical SEO. A well-configured robots.txt:

  • Prevents duplicate content problems
  • Optimizes crawl budget
  • Helps Google properly understand your website
  • Keeps crawlers away from unimportant pages

In Leadanic's SEO strategy, robots.txt auditing and optimization is a fixed part of our technical SEO process.

Sounds like a topic for you?

We analyze your situation and show concrete improvement potential. The consultation is free and non-binding.

Book Free Consultation