What is robots.txt?
Robots.txt is a plain-text file in your website's root directory that tells search engine bots (crawlers) which parts of your website they may crawl and which they may not. It is an important component of technical SEO: it manages crawl budget, reduces crawling of duplicate content, and keeps crawlers away from sensitive areas.
For B2B websites, robots.txt is critical to ensure that Google crawls the right pages with the right priority and doesn't waste time on admin pages or duplicates.
Robots.txt in B2B Context
B2B websites often have areas that Google should not crawl: admin panels, customer accounts, internal documentation, filter variations of product pages. Without robots.txt blocking these pages, Google could:
- Index duplicate content that harms rankings
- Waste crawl budget on unimportant pages
- Index pages with privacy concerns
A strategic robots.txt optimizes crawl efficiency and prevents indexing problems.
Structure and Syntax of robots.txt
A robots.txt file follows this structure:
```
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Disallow: /*?sort=
Allow: /important-admin-page/

User-agent: *
Disallow: /temp/

Sitemap: https://example.com/sitemap.xml
```
Explanation of components:
- User-agent: Which bot follows these rules (* = all bots)
- Disallow: Paths that should not be crawled
- Allow: Exceptions that may be crawled (for Google, the most specific, i.e. longest, matching rule wins, so an Allow can override a broader Disallow)
- Sitemap: Optional reference to XML sitemap
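Rules like these can be sanity-checked with Python's standard-library `urllib.robotparser`. Note its limits: it applies rules in file order rather than by longest match, and it does not implement Google's wildcard matching, so the wildcard line from the example is omitted here. A quick check, not a full Google emulation:

```python
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Allow: /important-admin-page/

User-agent: *
Disallow: /temp/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Googlebot is kept out of /admin/ ...
print(rp.can_fetch("Googlebot", "https://example.com/admin/settings"))         # False
# ... while the Allow rule permits this path
print(rp.can_fetch("Googlebot", "https://example.com/important-admin-page/"))  # True
# Every other bot falls under the catch-all (*) group
print(rp.can_fetch("SomeOtherBot", "https://example.com/temp/draft"))          # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/products"))            # True
```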
Disallow vs. Noindex
A common mistake: people confuse Disallow in robots.txt with Noindex. This is dangerous:
| Method | Impact | Use Case |
|---|---|---|
| Disallow in robots.txt | Page is not crawled (but can be indexed if linked externally) | Pages that should not be crawled |
| Noindex Meta Tag | Page is crawled but not indexed | Pages without ranking intent |
If you really want to remove a page from the index, you need noindex, not Disallow. Disallow only means "don't crawl this"; it does not mean "don't index this".
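The noindex directive lives on the page itself, not in robots.txt, for example as a meta tag:

```html
<!-- In the page's <head>: keeps this URL out of the index -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the same directive can be sent as an HTTP response header (`X-Robots-Tag: noindex`). In both cases the page must remain crawlable so that Google can actually see the directive.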
Best practices for robots.txt
- Root directory: File MUST be in root directory: /robots.txt (not /robots.txt.txt or /assets/robots.txt)
- Text only: robots.txt is pure text file, not HTML or other formats
- Keep it simple: Don't make it too complicated. Focus on important disallows
- Review regularly: When website structure changes, update robots.txt
- Case-sensitive: Paths are case-sensitive: /admin/ is not the same as /Admin/
- Wildcards: Google supports the * wildcard and the $ end-of-URL anchor in both Disallow and Allow rules; other crawlers may not, so test your patterns before relying on them
Common robots.txt Entries for B2B Websites
Typical disallows for B2B pages:
- /admin/: Admin panel
- /login/: Login pages
- /user-account/: Personal accounts
- /temp/: Temporary pages
- /drafts/: Drafts
- /*?sort=: Filtered/sorted versions of product pages
- /*?utm_: URLs with tracking parameters
- /print/: Print versions
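Put together, the entries above might look like this in a B2B site's robots.txt (the paths are illustrative; adapt them to your own URL structure):

```
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /user-account/
Disallow: /temp/
Disallow: /drafts/
Disallow: /print/
Disallow: /*?sort=
Disallow: /*?utm_

Sitemap: https://example.com/sitemap.xml
```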
Crawl Budget Optimization
Crawl budget is the amount of server resources Google allocates to crawl your website. For larger websites, crawl budget is limited. With robots.txt you can instruct Google to only crawl important pages:
```
User-agent: *
Disallow: /search-results/
Disallow: /tag-archive/
Disallow: /*?page=

Sitemap: https://example.com/sitemap.xml
```
This concentrates Google's crawl capacity on important content, not filter pages.
Parameter Handling
B2B websites often have URL parameters for filtering, sorting, or tracking:
- /?sort=price: Sorted view (Disallow)
- /?filter=category: Filtered view (Disallow)
- /?utm_source=email: Tracking parameter (Disallow)
Blocking these in robots.txt stops crawlers from wasting budget on near-duplicate views. For tracking parameters, canonical tags pointing to the clean URL are a common alternative, since they let Google consolidate ranking signals.
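Google-style pattern matching (the * wildcard and the $ anchor) is not implemented by every robots.txt library; Python's built-in `urllib.robotparser`, for instance, matches paths literally. A minimal sketch of Google-style matching, useful for sanity-checking patterns like the ones above (the function name `google_style_match` is my own, not a standard API):

```python
import re

def google_style_match(pattern: str, path: str) -> bool:
    """Check whether a robots.txt path pattern matches a URL path,
    using Google-style semantics: '*' matches any character sequence,
    a trailing '$' anchors the match to the end of the URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the escaped '*' back into '.*'
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.search(regex, path) is not None

# 'Disallow: /*?sort=' blocks any path carrying a sort parameter
print(google_style_match("/*?sort=", "/products?sort=price"))  # True
print(google_style_match("/*?sort=", "/products"))             # False
# '$' anchors the pattern: '/*.pdf$' matches only URLs ending in .pdf
print(google_style_match("/*.pdf$", "/whitepaper.pdf"))        # True
print(google_style_match("/*.pdf$", "/whitepaper.pdf?v=2"))    # False
```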
XML Sitemap in robots.txt
At the end of robots.txt you can specify the XML sitemap:
```
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
```
This is optional but recommended - it helps Google find your sitemap(s) faster.
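Python's `urllib.robotparser` (3.8+) can read these Sitemap lines back out via `site_maps()`, which is handy in a quick audit script:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /temp/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
""".splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none
print(rp.site_maps())
# ['https://example.com/sitemap.xml', 'https://example.com/sitemap-blog.xml']
```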
Test and Validate robots.txt
Check your robots.txt regularly:
- Google Search Console: the robots.txt report (under Settings) shows when Google last fetched the file and how it parsed it; it replaced the old "Robots.txt Tester"
- View directly in browser: https://example.com/robots.txt - you can open the file directly in your browser
- Syntax validation: third-party validators, such as the tools linked from robotstxt.org, flag syntax errors
- Check crawl impact: Search Console's Page indexing report shows which pages are blocked by robots.txt
Common robots.txt Mistakes
- Blocking important pages: Accidentally blocking important pages with Disallow
- Wrong path syntax: Paths must start with "/"
- Too restrictive: Disallowing everything except individual pages is often not a good idea
- Not updated: Still blocking old paths that no longer exist
- robots.txt too large: Google only processes the first 500 KiB of the file; keep it well under that limit
- User-Agent capitalization: User-agent: Googlebot and User-agent: googlebot both work because Google matches agent names case-insensitively, but be consistent
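Several of these mistakes (paths not starting with "/", unknown directives, oversized files) can be caught automatically. A rough lint sketch along the lines of the checks above; the function name `lint_robots_txt` and the set of known directives are my own choices, not a standard API:

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots_txt(text: str) -> list[str]:
    """Return a list of human-readable warnings for a robots.txt body."""
    warnings = []
    if len(text.encode("utf-8")) > 500 * 1024:
        warnings.append("file exceeds 500 KiB; Google ignores the rest")
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            warnings.append(f"line {lineno}: missing ':' separator")
            continue
        directive, _, value = line.partition(":")
        directive, value = directive.strip().lower(), value.strip()
        if directive not in KNOWN_DIRECTIVES:
            warnings.append(f"line {lineno}: unknown directive '{directive}'")
        elif directive in {"disallow", "allow"} and value and not value.startswith("/"):
            warnings.append(f"line {lineno}: path should start with '/'")
    return warnings

print(lint_robots_txt("User-agent: *\nDisallow: admin/\nNoindex: /x/"))
# → ["line 2: path should start with '/'", "line 3: unknown directive 'noindex'"]
```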
robots.txt and Privacy
robots.txt is NOT a security mechanism. Anyone can open /robots.txt and see which paths you're trying to block. So:
- Don't use it to protect secret paths: Don't use robots.txt to hide sensitive URLs
- Use additional authentication: Admin panels should be password-protected, not just blocked by robots.txt
- HTTPS for sensitive pages: Personal data should be encrypted
robots.txt and Noindex Combination
Be careful when combining robots.txt with on-page directives: Google only sees a noindex meta tag or a canonical tag if it is allowed to crawl the page. A URL that is disallowed in robots.txt can therefore remain in the index even though it carries noindex. In practice:
- Page 2+ of pagination: use noindex and keep the pages crawlable; add a Disallow only after they have dropped out of the index
- Temporary pages: noindex while the page is live; Disallow alone does not remove an already indexed page
- Parameter URLs: prefer canonical tags over Disallow, so Google can crawl the variants and consolidate signals on the clean URL
robots.txt as Part of Your SEO Strategy
robots.txt is a small but important piece of technical SEO. A well-configured robots.txt:
- Prevents duplicate content problems
- Optimizes crawl budget
- Helps Google properly understand your website
- Protects unimportant pages from being indexed
At Leadanic, a robots.txt audit and optimization is part of our technical SEO process.