Course → Module 7: Technical SEO Baseline
Session 1 of 7

Everything you have built so far (structured data, entity links, profile optimization) depends on one prerequisite: Google must be able to crawl your pages. If Googlebot cannot access your website, it cannot read your schema markup, follow your sameAs links, or index your entity pages. Your entity signals become invisible.

Crawlability is the most fundamental layer of technical SEO. Before optimizing anything else, you must ensure that search engines can physically reach and read the pages that carry your entity signals.

How Google Crawls the Web

Google discovers and processes web pages through a three-stage pipeline: crawling, rendering, and indexing. Crawling is the first stage, where Googlebot visits your URL and downloads the HTML. Rendering is the second stage, where Google executes JavaScript to produce the final page content. Indexing is the third stage, where Google processes the rendered content and stores it for search results.

```mermaid
flowchart LR
    A["URL Discovery<br/>(Sitemaps, Links, GSC)"] --> B["Crawl Queue<br/>(Prioritized by importance)"]
    B --> C["Googlebot Fetches HTML"]
    C --> D{"robots.txt<br/>allows?"}
    D -->|Blocked| E["Page NOT Crawled<br/>Entity signals invisible"]
    D -->|Allowed| F["HTML Downloaded"]
    F --> G{"JavaScript<br/>required?"}
    G -->|No| H["Content Available<br/>Immediately"]
    G -->|Yes| I["Render Queue<br/>(May be delayed)"]
    I --> J["JavaScript Executed<br/>Final DOM Produced"]
    H --> K["Indexing Pipeline"]
    J --> K
    K --> L{"noindex<br/>tag?"}
    L -->|Yes| M["Page NOT Indexed<br/>Entity signals lost"]
    L -->|No| N["Page Indexed<br/>Entity signals processed"]
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style E fill:#222221,stroke:#c47a5a,color:#ede9e3
    style M fill:#222221,stroke:#c47a5a,color:#ede9e3
    style N fill:#222221,stroke:#6b8f71,color:#ede9e3
```

There are multiple points in this pipeline where your entity signals can get blocked. A robots.txt file that disallows crawling means Googlebot never sees the page. A noindex tag means the page is crawled but never added to Google's index. JavaScript-dependent content means your entity signals are only available after rendering, which may be delayed.

Key concept: Crawlability is not just about whether Google can reach your homepage. It is about whether Google can reach every page that carries an entity signal: your About page, your Contact page, your service pages, and any page with schema markup.

robots.txt

The robots.txt file sits at the root of your domain (e.g., https://example.com/robots.txt) and tells search engines which parts of your site they may or may not crawl. It is a simple text file with directives like Allow and Disallow.

A properly configured robots.txt for entity authority should:

  1. Allow crawling of every entity-critical page: homepage, About, Contact, and service pages.
  2. Allow Googlebot to fetch CSS and JavaScript files so pages can render fully.
  3. Restrict only genuinely private areas, such as admin panels or internal search results.
  4. Reference your XML sitemap with a Sitemap directive.

A common mistake is blocking CSS or JS files with robots.txt. If Google cannot load your stylesheets and scripts, it cannot render your page properly, which means JavaScript-injected schema markup and content will not be processed.
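A minimal robots.txt that follows these principles might look like the sketch below. The paths and domain are placeholders for illustration, not a template to copy verbatim:

```
# Allow everything by default; block only genuinely private areas
User-agent: *
Disallow: /admin/
Disallow: /cart/

# Never block the resources Google needs to render pages
Allow: /css/
Allow: /js/

# Point crawlers at the sitemap that lists your entity pages
Sitemap: https://example.com/sitemap.xml
```

Note that Allow rules are only needed where a broader Disallow would otherwise catch those paths; in this sketch they are belt-and-suspenders.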

Common Crawl Blocks

| Crawl Block | How It Happens | Entity Impact | How to Fix |
| --- | --- | --- | --- |
| robots.txt Disallow on entity pages | Overly broad Disallow rules (e.g., Disallow: /about/) | Severe. Entity pages invisible to Google. | Audit robots.txt. Remove blocks on entity-critical pages. |
| noindex meta tag | Added by CMS settings, staging environment leak, or developer mistake | Severe. Page crawled but never indexed. | Check meta robots tag on every entity page. Remove noindex. |
| X-Robots-Tag: noindex header | Server configuration sends noindex in HTTP headers | Severe. Not visible in HTML source. | Check HTTP response headers using curl or browser dev tools. |
| JavaScript-only content | Entity info loaded via client-side JS, not in initial HTML | Moderate. Delayed processing, may be missed. | Ensure entity signals are in static HTML, not JS-dependent. |
| Login/paywall blocking | Content behind authentication that Googlebot cannot pass | Severe. Googlebot cannot authenticate. | Ensure entity pages are publicly accessible. |
| Blocked CSS/JS resources | robots.txt blocks /css/ or /js/ directories | Moderate. Google cannot render the page properly. | Allow Googlebot to access CSS and JS files. |
| Server errors (5xx) | Server returns 500 errors when Googlebot visits | Severe. Googlebot gives up after repeated failures. | Fix server issues. Monitor uptime. |
| Slow server response | Server takes 5+ seconds to respond | Moderate. Googlebot may abandon crawl. | Improve server response time to under 200ms. |
| Redirect chains | Page A redirects to B, which redirects to C, which redirects to D | Moderate. Googlebot may not follow long chains. | Reduce to a single redirect. Maximum 2 hops. |
| Orphaned pages | Entity pages not linked from any other page on your site | Moderate. Googlebot discovers pages via links. | Add internal links to all entity pages. Include in sitemap. |
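Two of these blocks, the X-Robots-Tag header and the meta noindex tag, are easy to miss by eye. The following stdlib-only Python sketch checks a URL for both; `audit_url`, `has_noindex_meta`, and the `crawl-audit` user agent are illustrative names, not a standard tool:

```python
import re
import urllib.request

def has_noindex_meta(html: str) -> bool:
    """True if the raw HTML carries a <meta name="robots" ... noindex> tag."""
    return bool(re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]*noindex', html, re.IGNORECASE))

def audit_url(url: str) -> list[str]:
    """Fetch one URL and report noindex signals in headers or HTML."""
    issues = []
    req = urllib.request.Request(url, headers={"User-Agent": "crawl-audit"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        # Block found in HTTP headers: invisible in "View Page Source"
        header = resp.headers.get("X-Robots-Tag", "")
        if "noindex" in header.lower():
            issues.append(f"X-Robots-Tag header: {header}")
        # Block found in the HTML itself
        if has_noindex_meta(resp.read().decode("utf-8", errors="replace")):
            issues.append("meta robots noindex tag in HTML")
    return issues
```

Run `audit_url` against each entity-critical page; an empty list means neither noindex signal was found. Note that this simple regex will not catch every markup variation, so treat it as a first pass, not a substitute for Search Console.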

JavaScript Rendering and Entity Signals

If your website uses a JavaScript framework (React, Angular, Vue), your entity signals may not be in the initial HTML that Googlebot downloads. They might only appear after JavaScript executes. This creates a problem.

Google does render JavaScript, but it does so in a separate queue. There can be a delay between the initial crawl and the render. During this delay, your entity signals are effectively invisible. For critical entity pages, this is a risk you should eliminate.

Static HTML and server-side rendered pages deliver entity signals in the very first response Googlebot downloads. Client-side JavaScript pages can sit in the render queue for hours or even days before Google processes their content. For entity signals, this delay matters: your schema markup, sameAs links, and entity descriptions should be in the initial HTML whenever possible.
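You can check this yourself: fetch the raw HTML without executing JavaScript (roughly what Googlebot gets on the first pass) and look for JSON-LD blocks. A stdlib-only sketch; `extract_json_ld` is an illustrative helper, not a standard API:

```python
import json
import re

def extract_json_ld(html: str) -> list:
    """Parse every JSON-LD block present in raw (pre-render) HTML."""
    pattern = r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>'
    found = []
    for block in re.findall(pattern, html, re.IGNORECASE | re.DOTALL):
        try:
            found.append(json.loads(block))
        except json.JSONDecodeError:
            pass  # malformed JSON-LD is itself a finding worth flagging
    return found

# If this returns [] for HTML fetched with urllib (which never runs
# JavaScript), your schema markup is injected client-side and will only
# reach Google after the render queue.
```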

How to Test Crawlability

Google Search Console provides the most authoritative crawlability testing tools:

  1. URL Inspection Tool: test any URL to see whether it is crawlable and indexed, and inspect the rendered HTML Google produced for it.
  2. Page Indexing report: lists pages excluded from the index and the reason (blocked by robots.txt, noindex, server error, redirect).
  3. robots.txt report: shows the robots.txt files Google has fetched for your site and any errors it found while parsing them.

You can also test manually by viewing your robots.txt file (https://yourdomain.com/robots.txt), checking HTTP response headers for noindex directives, and using "View Page Source" to verify that entity signals appear in the static HTML.
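The robots.txt check can also be scripted with Python's stdlib parser. The rules and paths below are made up for illustration; in practice you would load your live file with `set_url()` and `read()`:

```python
from urllib.robotparser import RobotFileParser

# Feed the parser a robots.txt body directly (for a live site, use
# rp.set_url("https://yourdomain.com/robots.txt") followed by rp.read())
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

# Entity-critical pages must come back as allowed
for path in ("/", "/about/", "/contact/", "/admin/login"):
    verdict = rp.can_fetch("Googlebot", "https://example.com" + path)
    print(path, "allowed" if verdict else "BLOCKED")
```

Here only /admin/login reports BLOCKED; if an entity-critical path ever does, that is the robots.txt Disallow block from the table above.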

Assignment

  1. Visit your robots.txt file (https://yourdomain.com/robots.txt). Review every Disallow rule. Does any rule block an entity-critical page (homepage, about, contact, services)?
  2. Use the URL Inspection Tool in Google Search Console to test your homepage, About page, and Contact page. For each, verify that the page is crawlable and that your schema markup appears in the rendered HTML.
  3. Check the HTTP response headers of your entity pages using browser developer tools (Network tab). Look for any X-Robots-Tag: noindex headers.
  4. View the source of your homepage (Ctrl+U or Cmd+U). Verify that your Organization schema markup appears in the raw HTML, not only after JavaScript execution.
  5. If you find any crawl blocks from the table above, fix them and re-test using the URL Inspection Tool.