Course → Module 7: Technical SEO Baseline
Session 1 of 7

Everything you have built so far (structured data, entity links, profile optimization) depends on one prerequisite: Google must be able to crawl your pages. If Googlebot cannot access your website, it cannot read your schema markup, follow your sameAs links, or index your entity pages. Your entity signals become invisible.

Crawlability is the most fundamental layer of technical SEO. Before optimizing anything else, you must ensure that search engines can physically reach and read the pages that carry your entity signals.

How Google Crawls the Web

Google discovers and processes web pages through a three-stage pipeline: crawling, rendering, and indexing. Crawling is the first stage, where Googlebot visits your URL and downloads the HTML. Rendering is the second stage, where Google executes JavaScript to produce the final page content. Indexing is the third stage, where Google processes the rendered content and stores it for search results.

```mermaid
flowchart LR
    A["URL Discovery<br/>(Sitemaps, Links, GSC)"] --> B["Crawl Queue<br/>(Prioritized by importance)"]
    B --> C["Googlebot Fetches HTML"]
    C --> D{"robots.txt<br/>allows?"}
    D -->|Blocked| E["Page NOT Crawled<br/>Entity signals invisible"]
    D -->|Allowed| F["HTML Downloaded"]
    F --> G{"JavaScript<br/>required?"}
    G -->|No| H["Content Available<br/>Immediately"]
    G -->|Yes| I["Render Queue<br/>(May be delayed)"]
    I --> J["JavaScript Executed<br/>Final DOM Produced"]
    H --> K["Indexing Pipeline"]
    J --> K
    K --> L{"noindex<br/>tag?"}
    L -->|Yes| M["Page NOT Indexed<br/>Entity signals lost"]
    L -->|No| N["Page Indexed<br/>Entity signals processed"]
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style E fill:#222221,stroke:#c47a5a,color:#ede9e3
    style M fill:#222221,stroke:#c47a5a,color:#ede9e3
    style N fill:#222221,stroke:#6b8f71,color:#ede9e3
```

There are multiple points in this pipeline where your entity signals can get blocked. A robots.txt file that disallows crawling means Googlebot never sees the page. A noindex tag means the page is crawled but never added to Google's index. JavaScript-dependent content means your entity signals are only available after rendering, which may be delayed.

Key concept: Crawlability is not just about whether Google can reach your homepage. It is about whether Google can reach every page that carries an entity signal: your About page, your Contact page, your service pages, and any page with schema markup.

robots.txt

The robots.txt file sits at the root of your domain (e.g., https://example.com/robots.txt) and tells search engines which parts of your site they may or may not crawl. It is a simple text file with directives like Allow and Disallow.

A properly configured robots.txt for entity authority should:

  1. Allow crawling of every entity-critical page: homepage, About, Contact, and service pages.
  2. Allow Googlebot to fetch CSS and JavaScript files so pages can render fully.
  3. Restrict only genuinely private areas, such as admin panels or internal search results.
  4. Reference your XML sitemap with a Sitemap directive.

A common mistake is blocking CSS or JS files with robots.txt. If Google cannot load your stylesheets and scripts, it cannot render your page properly, which means JavaScript-injected schema markup and content will not be processed.
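A minimal robots.txt that follows these principles might look like the sketch below. The paths and domain are placeholders for illustration, not a template to copy verbatim:

```
# Allow everything by default; block only genuinely private areas
User-agent: *
Disallow: /admin/
Disallow: /cart/

# Never block the resources Google needs to render pages
Allow: /css/
Allow: /js/

# Point crawlers at the sitemap that lists your entity pages
Sitemap: https://example.com/sitemap.xml
```

Note that Allow rules are only needed where a broader Disallow would otherwise catch those paths; in this sketch they are belt-and-suspenders.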

Common Crawl Blocks

| Crawl Block | How It Happens | Entity Impact | How to Fix |
| --- | --- | --- | --- |
| robots.txt Disallow on entity pages | Overly broad Disallow rules (e.g., Disallow: /about/) | Severe. Entity pages invisible to Google. | Audit robots.txt. Remove blocks on entity-critical pages. |
| noindex meta tag | Added by CMS settings, staging environment leak, or developer mistake | Severe. Page crawled but never indexed. | Check meta robots tag on every entity page. Remove noindex. |
| X-Robots-Tag: noindex header | Server configuration sends noindex in HTTP headers | Severe. Not visible in HTML source. | Check HTTP response headers using curl or browser dev tools. |
| JavaScript-only content | Entity info loaded via client-side JS, not in initial HTML | Moderate. Delayed processing, may be missed. | Ensure entity signals are in static HTML, not JS-dependent. |
| Login/paywall blocking | Content behind authentication that Googlebot cannot pass | Severe. Googlebot cannot authenticate. | Ensure entity pages are publicly accessible. |
| Blocked CSS/JS resources | robots.txt blocks /css/ or /js/ directories | Moderate. Google cannot render the page properly. | Allow Googlebot to access CSS and JS files. |
| Server errors (5xx) | Server returns 500 errors when Googlebot visits | Severe. Googlebot gives up after repeated failures. | Fix server issues. Monitor uptime. |
| Slow server response | Server takes 5+ seconds to respond | Moderate. Googlebot may abandon crawl. | Improve server response time to under 200ms. |
| Redirect chains | Page A redirects to B, which redirects to C, which redirects to D | Moderate. Googlebot may not follow long chains. | Reduce to a single redirect. Maximum 2 hops. |
| Orphaned pages | Entity pages not linked from any other page on your site | Moderate. Googlebot discovers pages via links. | Add internal links to all entity pages. Include in sitemap. |
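Two of these blocks, the X-Robots-Tag header and the meta noindex tag, are easy to miss by eye. The following stdlib-only Python sketch checks a URL for both; `audit_url`, `has_noindex_meta`, and the `crawl-audit` user agent are illustrative names, not a standard tool:

```python
import re
import urllib.request

def has_noindex_meta(html: str) -> bool:
    """True if the raw HTML carries a <meta name="robots" ... noindex> tag."""
    return bool(re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]*noindex', html, re.IGNORECASE))

def audit_url(url: str) -> list[str]:
    """Fetch one URL and report noindex signals in headers or HTML."""
    issues = []
    req = urllib.request.Request(url, headers={"User-Agent": "crawl-audit"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        # Block found in HTTP headers: invisible in "View Page Source"
        header = resp.headers.get("X-Robots-Tag", "")
        if "noindex" in header.lower():
            issues.append(f"X-Robots-Tag header: {header}")
        # Block found in the HTML itself
        if has_noindex_meta(resp.read().decode("utf-8", errors="replace")):
            issues.append("meta robots noindex tag in HTML")
    return issues
```

Run `audit_url` against each entity-critical page; an empty list means neither noindex signal was found. Note that this simple regex will not catch every markup variation, so treat it as a first pass, not a substitute for Search Console.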

JavaScript Rendering and Entity Signals

If your website uses a JavaScript framework (React, Angular, Vue), your entity signals may not be in the initial HTML that Googlebot downloads. They might only appear after JavaScript executes. This creates a problem.

Google does render JavaScript, but it does so in a separate queue. There can be a delay between the initial crawl and the render. During this delay, your entity signals are effectively invisible. For critical entity pages, this is a risk you should eliminate.

Static HTML and server-side rendered pages deliver entity signals in the very first response Googlebot downloads. Client-side JavaScript pages can sit in the render queue for hours or even days before Google processes their content. For entity signals, this delay matters: your schema markup, sameAs links, and entity descriptions should be in the initial HTML whenever possible.
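You can check this yourself: fetch the raw HTML without executing JavaScript (roughly what Googlebot gets on the first pass) and look for JSON-LD blocks. A stdlib-only sketch; `extract_json_ld` is an illustrative helper, not a standard API:

```python
import json
import re

def extract_json_ld(html: str) -> list:
    """Parse every JSON-LD block present in raw (pre-render) HTML."""
    pattern = r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>'
    found = []
    for block in re.findall(pattern, html, re.IGNORECASE | re.DOTALL):
        try:
            found.append(json.loads(block))
        except json.JSONDecodeError:
            pass  # malformed JSON-LD is itself a finding worth flagging
    return found

# If this returns [] for HTML fetched with urllib (which never runs
# JavaScript), your schema markup is injected client-side and will only
# reach Google after the render queue.
```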

How to Test Crawlability

Google Search Console provides the most authoritative crawlability testing tools:

  1. URL Inspection Tool: test any URL to see whether it is crawlable and indexed, and inspect the rendered HTML Google produced for it.
  2. Page Indexing report: lists pages excluded from the index and the reason (blocked by robots.txt, noindex, server error, redirect).
  3. robots.txt report: shows the robots.txt files Google has fetched for your site and any errors it found while parsing them.

You can also test manually by viewing your robots.txt file (https://yourdomain.com/robots.txt), checking HTTP response headers for noindex directives, and using "View Page Source" to verify that entity signals appear in the static HTML.
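The robots.txt check can also be scripted with Python's stdlib parser. The rules and paths below are made up for illustration; in practice you would load your live file with `set_url()` and `read()`:

```python
from urllib.robotparser import RobotFileParser

# Feed the parser a robots.txt body directly (for a live site, use
# rp.set_url("https://yourdomain.com/robots.txt") followed by rp.read())
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

# Entity-critical pages must come back as allowed
for path in ("/", "/about/", "/contact/", "/admin/login"):
    verdict = rp.can_fetch("Googlebot", "https://example.com" + path)
    print(path, "allowed" if verdict else "BLOCKED")
```

Here only /admin/login reports BLOCKED; if an entity-critical path ever does, that is the robots.txt Disallow block from the table above.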

Assignment

  1. Visit your robots.txt file (https://yourdomain.com/robots.txt). Review every Disallow rule. Does any rule block an entity-critical page (homepage, about, contact, services)?
  2. Use the URL Inspection Tool in Google Search Console to test your homepage, About page, and Contact page. For each, verify that the page is crawlable and that your schema markup appears in the rendered HTML.
  3. Check the HTTP response headers of your entity pages using browser developer tools (Network tab). Look for any X-Robots-Tag: noindex headers.
  4. View the source of your homepage (Ctrl+U or Cmd+U). Verify that your Organization schema markup appears in the raw HTML, not only after JavaScript execution.
  5. If you find any crawl blocks from the table above, fix them and re-test using the URL Inspection Tool.