What is Indexing?
Indexing is the process by which Google adds web pages to its searchable index. It follows crawling: Google's crawler visits a page, analyzes its content, and then decides whether to store it in the index. A page that is not indexed effectively does not exist from Google's perspective and will never appear in search results. Indexing is step 1 of SEO: before any ranking is possible, the page must first be indexed.
Google uses automated crawlers (bots) that continuously traverse the web, following links and reading pages. When Google discovers your website (through links or a sitemap), it crawls it. If a page is crawlable (not blocked, reachable, valid HTML) and deemed worth storing, it gets indexed — but note that crawling alone does not guarantee indexing.
Indexing in B2B SEO Context
Indexing problems are surprisingly common in B2B websites:
- Large sites with thousands of pages: B2B companies often have hundreds or thousands of product pages, case studies, blog posts. If crawl budget is limited, Google may not crawl everything.
- Crawl budget management: Google allocates a budget for how much of your website to crawl. If many pages are unimportant, Google "wastes" budget on them. You must be strategic about which pages matter.
- Duplicate content and parameters: E-commerce and SaaS websites often have duplicate pages due to URL parameters (filters, sorting). Google must know which is the "primary" version.
- Gated content: If whitepapers or case studies are only available after login, Google cannot index them. This is often a conscious trade-off but with consequences.
- Sitemap architecture: Large B2B websites need structured sitemaps so Google doesn't miss important pages.
Indexing Status in Google Search Console
Google Search Console shows the indexing status of your website:
| Status | Meaning | Action |
|---|---|---|
| Indexed | Page is indexed and can appear in search results | Good! No action needed, but monitor rankings |
| Crawled but not indexed | Google crawled the page but chose not to index it (duplicate, low quality, etc.) | Investigate why: a meta-robots noindex tag? Duplicate content? A slow or thin page? |
| Excluded | Google has excluded the page from the index (due to robots.txt, noindex, canonical, parameter handling) | If you want this page indexed, remove the block. If not, leave as is. |
| Not found (404) | Google tried to crawl the page but got a 404 error | Either fix the page (restore URL) or remove the URL from the sitemap |
| Submitted and currently not indexed | Page was submitted in the sitemap but Google hasn't indexed it yet | Wait or request indexing via URL inspection. If it takes a long time, investigate why Google isn't indexing it. |
Diagnosing and Fixing Indexing Problems
If a page is not indexed, here's how to find the cause:
- URL Inspection in Google Search Console: Enter the URL. GSC shows the verdict ("Indexed"? "Crawled but not indexed"?) and, where available, the reason for exclusion.
- Common reasons why not indexed:
- Meta-robots noindex tag (check HTML)
- robots.txt block (check robots.txt)
- Canonical to another page (check canonical tags)
- Duplicate content (Google preferred another version)
- Poor mobile usability (page speed, rendering issues)
- Server errors (500, 503)
- Low quality content (Google rarely states this reason explicitly)
- Request Indexing: If a page is "crawled but not indexed", you can click the "Request Indexing" button in URL inspection. Google will re-crawl and re-evaluate the page.
- Remove Blocks: If the page is blocked by robots.txt or noindex, remove these blocks and resubmit in the sitemap.
- Improve Page Speed: If page speed is the problem, optimize (images, caching, CSS/JS minify).
- Check Crawl Errors: Google Search Console's Page indexing report (formerly called the Coverage report) shows crawl errors. Fix these (404s, redirect chains, server errors).
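The first two checks in the list above (a meta-robots noindex tag, a robots.txt block) can be automated. A minimal sketch in Python, assuming you have already fetched the page's HTML and the site's robots.txt body; the function names are illustrative, not part of any official tool:

```python
import re
from urllib import robotparser

def find_meta_robots(html: str) -> list[str]:
    """Return directives from any <meta name="robots"> tag, lowercased.

    Simplified: assumes the name attribute appears before content.
    """
    directives = []
    for tag in re.findall(r'<meta[^>]+name=["\']robots["\'][^>]*>', html, re.I):
        m = re.search(r'content=["\']([^"\']+)["\']', tag, re.I)
        if m:
            directives += [d.strip().lower() for d in m.group(1).split(",")]
    return directives

def blocked_by_robots(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """True if the given robots.txt body disallows user_agent from crawling url."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return not rp.can_fetch(user_agent, url)
```

Run these against any URL stuck in "crawled but not indexed": if `find_meta_robots` returns a list containing `"noindex"`, or `blocked_by_robots` returns `True`, you have found the block.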
Crawl Budget Management for Large Websites
Google allocates a budget for how often to crawl a website. Large websites with hundreds or thousands of pages must manage crawl budget:
- Remove Bloat: Unimportant pages (old filter pages, archives, duplicates) should be removed or disallowed in robots.txt. This gives Google more budget for important pages.
- Sitemap Priority: The XML sitemap spec allows a <priority> tag (1.0 for important pages, 0.5 or 0.3 for less important ones), but Google has stated that it ignores this value. Signal importance through internal linking and accurate <lastmod> dates instead.
- Lastmod Date: In the sitemap, update the <lastmod> date whenever a page actually changes. When <lastmod> is consistently accurate, Google uses it to prioritize recrawling changed pages instead of everything every time.
- Noindex for Thin Content: Archive pages, old variations, and very short content can get a noindex tag. Google still crawls noindexed pages at first, but tends to crawl them less often over time, preserving budget for valuable content.
- Internal Linking Strategy: Important pages should get lots of internal links. Google crawls pages that are heavily linked first.
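For reference, the noindex directive mentioned above can be delivered in the page's HTML:

```html
<!-- In the page's <head> -->
<meta name="robots" content="noindex">
```

The equivalent HTTP response header, useful for non-HTML files such as PDFs, is `X-Robots-Tag: noindex`.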
robots.txt and Indexing
The robots.txt file controls which pages Google is allowed to crawl:
- Disallow specific directories: If you don't want Google to crawl /admin/ or /temp/, add: Disallow: /admin/
- But Disallow != Noindex: robots.txt disallow means "don't crawl", but a disallowed page can still be indexed (without its content) if external links point to it. To keep a page out of the index, use a noindex meta tag or X-Robots-Tag header — and note that Google must be able to crawl the page in order to see that tag, so don't combine noindex with a robots.txt block.
- User-agent specificity: You can set different rules for different bots. For example, Disallow: / for bots that aren't Google, but allow for Googlebot.
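Putting the points above together, a hypothetical robots.txt that blocks all other bots site-wide while letting Googlebot crawl everything except /admin/ and /temp/ could look like this:

```
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /admin/
Disallow: /temp/
```

A crawler obeys only the most specific group that matches its user agent, so the blanket `Disallow: /` in the `*` group does not apply to Googlebot.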
XML Sitemap and Indexing
XML Sitemap is the best way to tell Google about new or updated pages:
- Create a sitemap: For large websites, create an XML sitemap with all important URLs.
- Submit in Google Search Console: GSC reports whether the sitemap is valid and how many of its submitted URLs are actually indexed.
- Update regularly: When new content is published, update the sitemap immediately. Google checks regularly.
- Sitemap index for large sites: A single sitemap file is limited to 50,000 URLs (and 50 MB uncompressed). Beyond that, split it into multiple sitemaps and create a sitemap index that references them.
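As a sketch of how the pieces above fit together, the following Python snippet generates sitemaps with <lastmod> dates and splits the URL list under the 50,000-URL-per-file limit, plus a sitemap index. Function names are illustrative, not part of any official tool:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
URL_LIMIT = 50_000  # per-file limit from the sitemap protocol

def build_sitemap(urls):
    """urls: iterable of (loc, lastmod) pairs -> sitemap XML string."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in urls:
        entry = ET.SubElement(root, "url")
        ET.SubElement(entry, "loc").text = loc
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")

def build_sitemap_index(sitemap_locs):
    """sitemap_locs: URLs of the individual sitemap files -> index XML string."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_locs:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = loc
    return ET.tostring(root, encoding="unicode")

def chunk(urls, size=URL_LIMIT):
    """Split a URL list into sitemap-sized chunks."""
    urls = list(urls)
    return [urls[i:i + size] for i in range(0, len(urls), size)]
```

For a site under 50,000 URLs, `build_sitemap` alone is enough; larger sites write one file per chunk and point `build_sitemap_index` at the resulting file URLs.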
Improving Indexing Speed
If new pages take forever to index (2-4 weeks), here are tips to speed it up:
- Use URL Inspection: After publishing, inspect the URL in GSC. This "hints" to Google that the page is new.
- Request Indexing: Click "Request Indexing" in URL inspection. This triggers immediate crawling.
- Internal Linking: Link to the new page from important existing pages (homepage, popular posts). Google follows internal links and crawls linked pages faster.
- Social Sharing: Sharing on social media doesn't directly help with indexing, but more traffic can help.
- Sitewide News Feed: A news or updates feed on the homepage of new posts helps Google discover them faster.
- Page Speed: Fast pages are crawled faster. If a new page is very slow, crawling gets delayed.
Monitoring Indexing Over Time
Indexing problems can emerge suddenly. Monitor regularly:
- Google Search Console Page indexing report (formerly Coverage): Check weekly. If the "indexed" count drops, investigate immediately.
- Total Indexed Pages Tracking: Track in a spreadsheet: how many pages should be indexed, how many actually are.
- Excluded Reasons Report: GSC shows why pages are excluded. Monitor for unexpected exclusions.
- New Crawl Errors: If suddenly many 404s or server errors appear, fix immediately. This can block crawling for other pages.
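A lightweight way to implement the "should be indexed vs. actually indexed" tracking above is to export the indexed-pages list from GSC as CSV and diff it against your sitemap's URL list. A minimal sketch — the "URL" column name is an assumption, so match it to your actual export:

```python
import csv
import io

def indexing_gap(sitemap_urls, gsc_csv_text, url_column="URL"):
    """Return sitemap URLs missing from a GSC indexed-pages CSV export.

    sitemap_urls: iterable of URLs you expect to be indexed.
    gsc_csv_text: the CSV export contents as a string.
    url_column:   name of the URL column in the export (assumption).
    """
    reader = csv.DictReader(io.StringIO(gsc_csv_text))
    indexed = {row[url_column] for row in reader}
    return sorted(set(sitemap_urls) - indexed)
```

Run this on a schedule and alert when the gap grows: a sudden jump usually means a deploy introduced a noindex tag, a robots.txt block, or a wave of server errors.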
Indexing is the foundation of SEO. If pages are not indexed, they will never rank, no matter how good the content is. Spend time with SEO fundamentals before attempting advanced ranking tactics.