What is robots.txt?
Robots.txt is a plain-text file in your website's root directory that tells search engine bots (crawlers) which parts of your website they may crawl and which they may not. It is an important component of technical SEO: it manages crawl budget, reduces crawling of duplicate content, and keeps crawlers away from sensitive areas.
For B2B websites, robots.txt is critical to ensure that Google crawls the right pages with the right priority and doesn't waste time on admin pages or duplicates.
Robots.txt in B2B Context
B2B websites often have areas that Google should not crawl: admin panels, customer accounts, internal documentation, filter variations of product pages. Without robots.txt blocking these pages, Google could:
- Index duplicate content that harms rankings
- Waste crawl budget on unimportant pages
- Index pages with privacy concerns
A strategic robots.txt optimizes crawl efficiency and prevents indexing problems.
Structure and Syntax of robots.txt
A robots.txt file follows this structure:
```
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Disallow: /*?sort=
Allow: /important-admin-page/

User-agent: *
Disallow: /temp/

Sitemap: https://example.com/sitemap.xml
```
Explanation of components:
- User-agent: Which bot follows these rules (* = all bots)
- Disallow: Paths that should not be crawled
- Allow: Exceptions that may be crawled (for Google, the most specific, i.e. longest, matching rule wins, so an Allow can override a broader Disallow)
- Sitemap: Optional reference to XML sitemap
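Rules like these can be sanity-checked with Python's standard-library `urllib.robotparser`. Note its limits: it applies rules in file order rather than by longest match, and it does not implement Google's wildcard matching, so the wildcard line from the example is omitted here. A quick check, not a full Google emulation:

```python
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Allow: /important-admin-page/

User-agent: *
Disallow: /temp/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Googlebot is kept out of /admin/ ...
print(rp.can_fetch("Googlebot", "https://example.com/admin/settings"))         # False
# ... while the Allow rule permits this path
print(rp.can_fetch("Googlebot", "https://example.com/important-admin-page/"))  # True
# Every other bot falls under the catch-all (*) group
print(rp.can_fetch("SomeOtherBot", "https://example.com/temp/draft"))          # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/products"))            # True
```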
Disallow vs. Noindex
A common mistake: people confuse Disallow in robots.txt with Noindex. This is dangerous:
| Method | Impact | Use Case |
|---|---|---|
| Disallow in robots.txt | Page is not crawled (but can be indexed if linked externally) | Pages that should not be crawled |
| Noindex Meta Tag | Page is crawled but not indexed | Pages without ranking intent |
If you really want to remove a page from the index, you need noindex, not Disallow. Disallow only means "don't crawl this"; it does not mean "don't index this".
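The noindex directive lives on the page itself, not in robots.txt, for example as a meta tag:

```html
<!-- In the page's <head>: keeps this URL out of the index -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the same directive can be sent as an HTTP response header (`X-Robots-Tag: noindex`). In both cases the page must remain crawlable so that Google can actually see the directive.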
Best practices for robots.txt
- Root directory: File MUST be in root directory: /robots.txt (not /robots.txt.txt or /assets/robots.txt)
- Text only: robots.txt is pure text file, not HTML or other formats
- Keep it simple: Don't make it too complicated. Focus on important disallows
- Review regularly: When website structure changes, update robots.txt
- Case-sensitive: Paths are case-sensitive: /admin/ is not the same as /Admin/
- Wildcards: Google supports the * wildcard and the $ end-of-URL anchor in both Disallow and Allow rules; other crawlers may not, so test your patterns before relying on them
Common robots.txt Entries for B2B Websites
Typical disallows for B2B pages:
- /admin/: Admin panel
- /login/: Login pages
- /user-account/: Personal accounts
- /temp/: Temporary pages
- /drafts/: Drafts
- /*?sort=: Filtered/sorted versions of product pages
- /*?utm_: URLs with tracking parameters
- /print/: Print versions
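Put together, the entries above might look like this in a B2B site's robots.txt (the paths are illustrative; adapt them to your own URL structure):

```
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /user-account/
Disallow: /temp/
Disallow: /drafts/
Disallow: /print/
Disallow: /*?sort=
Disallow: /*?utm_

Sitemap: https://example.com/sitemap.xml
```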
Crawl Budget Optimization
Crawl budget is the amount of server resources Google allocates to crawl your website. For larger websites, crawl budget is limited. With robots.txt you can instruct Google to only crawl important pages:
```
User-agent: *
Disallow: /search-results/
Disallow: /tag-archive/
Disallow: /*?page=

Sitemap: https://example.com/sitemap.xml
```
This concentrates Google's crawl capacity on important content, not filter pages.
Parameter Handling
B2B websites often have URL parameters for filtering, sorting, or tracking:
- /?sort=price: Sorted view (Disallow)
- /?filter=category: Filtered view (Disallow)
- /?utm_source=email: Tracking parameter (Disallow)
Blocking these in robots.txt stops crawlers from wasting budget on near-duplicate views. For tracking parameters, canonical tags pointing to the clean URL are a common alternative, since they let Google consolidate ranking signals.
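Google-style pattern matching (the * wildcard and the $ anchor) is not implemented by every robots.txt library; Python's built-in `urllib.robotparser`, for instance, matches paths literally. A minimal sketch of Google-style matching, useful for sanity-checking patterns like the ones above (the function name `google_style_match` is my own, not a standard API):

```python
import re

def google_style_match(pattern: str, path: str) -> bool:
    """Check whether a robots.txt path pattern matches a URL path,
    using Google-style semantics: '*' matches any character sequence,
    a trailing '$' anchors the match to the end of the URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the escaped '*' back into '.*'
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.search(regex, path) is not None

# 'Disallow: /*?sort=' blocks any path carrying a sort parameter
print(google_style_match("/*?sort=", "/products?sort=price"))  # True
print(google_style_match("/*?sort=", "/products"))             # False
# '$' anchors the pattern: '/*.pdf$' matches only URLs ending in .pdf
print(google_style_match("/*.pdf$", "/whitepaper.pdf"))        # True
print(google_style_match("/*.pdf$", "/whitepaper.pdf?v=2"))    # False
```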
XML Sitemap in robots.txt
At the end of robots.txt you can specify the XML sitemap:
```
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
```
This is optional but recommended - it helps Google find your sitemap(s) faster.
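Python's `urllib.robotparser` (3.8+) can read these Sitemap lines back out via `site_maps()`, which is handy in a quick audit script:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /temp/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
""".splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none
print(rp.site_maps())
# ['https://example.com/sitemap.xml', 'https://example.com/sitemap-blog.xml']
```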
Test and Validate robots.txt
Check your robots.txt regularly:
- Google Search Console: the robots.txt report (under Settings) shows when Google last fetched the file and how it parsed it; it replaced the old "Robots.txt Tester"
- View directly in browser: https://example.com/robots.txt - you can open the file directly in your browser
- Syntax validation: third-party validators, such as the tools linked from robotstxt.org, flag syntax errors
- Check crawl impact: Search Console's Page indexing report shows which pages are blocked by robots.txt
Common robots.txt Mistakes
- Blocking important pages: Accidentally blocking important pages with Disallow
- Wrong path syntax: Paths must start with "/"
- Too restrictive: Disallowing everything except individual pages is often not a good idea
- Not updated: Still blocking old paths that no longer exist
- robots.txt too large: Google only processes the first 500 KiB of the file; keep it well under that limit
- User-Agent capitalization: User-agent: Googlebot and User-agent: googlebot both work because Google matches agent names case-insensitively, but be consistent
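Several of these mistakes (paths not starting with "/", unknown directives, oversized files) can be caught automatically. A rough lint sketch along the lines of the checks above; the function name `lint_robots_txt` and the set of known directives are my own choices, not a standard API:

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots_txt(text: str) -> list[str]:
    """Return a list of human-readable warnings for a robots.txt body."""
    warnings = []
    if len(text.encode("utf-8")) > 500 * 1024:
        warnings.append("file exceeds 500 KiB; Google ignores the rest")
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            warnings.append(f"line {lineno}: missing ':' separator")
            continue
        directive, _, value = line.partition(":")
        directive, value = directive.strip().lower(), value.strip()
        if directive not in KNOWN_DIRECTIVES:
            warnings.append(f"line {lineno}: unknown directive '{directive}'")
        elif directive in {"disallow", "allow"} and value and not value.startswith("/"):
            warnings.append(f"line {lineno}: path should start with '/'")
    return warnings

print(lint_robots_txt("User-agent: *\nDisallow: admin/\nNoindex: /x/"))
# → ["line 2: path should start with '/'", "line 3: unknown directive 'noindex'"]
```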
robots.txt and Privacy
robots.txt is NOT a security mechanism. Anyone can open /robots.txt and see which paths you're trying to block. So:
- Don't use it to protect secret paths: Don't use robots.txt to hide sensitive URLs
- Use additional authentication: Admin panels should be password-protected, not just blocked by robots.txt
- HTTPS for sensitive pages: Personal data should be encrypted
robots.txt and Noindex Combination
Be careful when combining robots.txt with on-page directives: Google only sees a noindex meta tag or a canonical tag if it is allowed to crawl the page. A URL that is disallowed in robots.txt can therefore remain in the index even though it carries noindex. In practice:
- Page 2+ of pagination: use noindex and keep the pages crawlable; add a Disallow only after they have dropped out of the index
- Temporary pages: noindex while the page is live; Disallow alone does not remove an already indexed page
- Parameter URLs: prefer canonical tags over Disallow, so Google can crawl the variants and consolidate signals on the clean URL
robots.txt as Part of Your SEO Strategy
robots.txt is a small but important piece of technical SEO. A well-configured robots.txt:
- Prevents duplicate content problems
- Optimizes crawl budget
- Helps Google properly understand your website
- Protects unimportant pages from being indexed
At Leadanic, a robots.txt audit and optimization is part of our technical SEO process.