When Google and other search engines index websites, they don’t execute JavaScript.
This seems to put SPAs at a tremendous disadvantage compared to a traditional website. Not being on Google could easily mean the death of a business, and this daunting pitfall could tempt the uninformed to abandon SPAs.
SPAs actually have an advantage over traditional websites in search engine optimization (SEO) because Google and others have recognized the challenge. They have created a mechanism that not only allows an SPA's dynamic pages to be indexed, but also lets developers optimize those pages specifically for crawlers. This section focuses on the biggest search engine, Google, but other large search engines such as Yahoo and Bing support the same mechanism.
9.1.1 How Google crawls an SPA
When Google indexes a traditional website, its web crawler (called a Googlebot) first scans and indexes the content of the top-level URI (for example, www.myhome.com).
Once this is complete, it then follows all of the links on that page and indexes those pages as well. It then follows the links on the subsequent pages, and so on. Eventually it indexes all the content on the site and associated domains.
When the Googlebot tries to index an SPA, all it sees in the HTML is a single empty container (usually an empty div or body tag), so there's nothing to index and no links to crawl, and it indexes the site accordingly (in the round, circular "folder" on the floor next to its desk).
If that were the end of the story, it would be the end of SPAs for many web applications and sites. Fortunately, Google and other search engines have recognized the importance of SPAs and provide tools that let developers supply search information to the crawler, which can be even better than what a traditional website offers.
The first key to making our SPA crawlable is to realize that our server can tell whether a request is being made by a crawler or by a person using a web browser, and respond accordingly. When our visitor is a person using a web browser, we respond as normal; for a crawler, we return a page optimized to show the crawler exactly what we want it to see, in a format the crawler can easily read.
For the home page of our site, what does a crawler-optimized page look like? It's probably our logo or other primary image we'd like appearing in search results, some SEO-optimized text explaining what the application does, and a list of HTML links to only those pages we want Google to index. What the page doesn't have is any CSS styling or complex HTML structure applied to it. Nor does it have any JavaScript, or links to areas of the site we don't want Google to index (like legal disclaimer pages or other pages we don't want people to enter through a Google search). Figure 9.1 shows how a page might be presented to a browser and to the crawler.
The links on the page aren't followed by the crawler the same way a person follows links, because we apply the special characters #! (pronounced hash bang) in our URI anchor component. For instance, if in our SPA a link to the user page looks like /index.htm#!page=user:id,123, the crawler would see the #! and know to look for a web page with the URI /index.htm?_escaped_fragment_=page=user:id,123. Knowing that the crawler will follow this pattern and look for this URI, we can program the server to respond to that request with an HTML snapshot of the page that would normally be rendered by JavaScript in the browser. That snapshot will be indexed by Google, but anyone clicking on our listing in Google search results will be taken to /index.htm#!page=user:id,123. The SPA JavaScript will take over from there and render the page as expected.
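To make the translation concrete, here is a minimal sketch of the mapping the crawler performs, written as a small JavaScript helper. The function name and variable are illustrative only; this isn't code the application needs, just a way to see the #! to _escaped_fragment_ rewrite in one place (the real crawler also URL-encodes the fragment value, which is omitted here for clarity):

function toCrawlerUri ( spa_uri ) {
  var parts = spa_uri.split( '#!' );
  // No #! present - nothing to translate
  if ( parts.length < 2 ) { return spa_uri; }
  // Replace the #! fragment with the _escaped_fragment_ query parameter
  return parts[ 0 ] + '?_escaped_fragment_=' + parts[ 1 ];
}

// toCrawlerUri( '/index.htm#!page=user:id,123' )
//   returns '/index.htm?_escaped_fragment_=page=user:id,123'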
This provides SPA developers with the opportunity to tailor their site specifically for Google and specifically for users. Instead of having to write text that's both legible and attractive to a person and understandable by a crawler, pages can be optimized for each without worrying about the other. The crawler's path through our site can be controlled, allowing us to direct people from Google search results to a specific set of entrance pages. This will require more work on the part of the engineer to develop, but it can have big pay-offs in terms of search result position and customer retention.
Figure 9.1 Client and crawler views of a home page
At the time of this writing, the Googlebot announces itself as a crawler to the server by making requests with a user-agent string of Googlebot/2.1 (+http://www.googlebot.com/bot.html). Our Node.js application can check for this user-agent string in the middleware and send back the crawler-optimized home page if the user-agent string matches. Otherwise, we can handle the request normally. Alternatively, we could hook it into our routing middleware as shown in listing 9.1:
Listing 9.1 Detect a Googlebot and serve alternative content in the routes.js file

...
// The HTML to be provided to the web crawler
var agent_text
  = 'Enter the modern single page web application (SPA). '
  + 'With the near universal availability of capable browsers and '
  + 'powerful hardware, we can push most of the web application to '
  + 'the browser, including HTML rendering, data, and business '
  + 'logic. The only time a client needs to communicate with the '
  + 'server is to authenticate or synchronize data. This means users '
  + 'get a fluid, comfortable experience whether they\'re surfing '
  + 'at their desk or using a phone app on a sketchy 3G connection.'
  + '<br><br>'
  + '<a href="/index.htm#page=home">Home</a><br>'
  + '<a href="/index.htm#page=about">About</a><br>'
  + '<a href="/index.htm#page=buynow">Buy Now!</a><br>'
  + '<a href="/index.htm#page=contact us">Contact Us</a><br>';

app.all( '*', function ( req, res, next ) {
  // Detect the Googlebot by looking at the user-agent string. Other
  // crawlers use different user-agent strings that, with some
  // research, can be targeted as well.
  if ( req.headers['user-agent']
    === 'Googlebot/2.1 (+http://www.googlebot.com/bot.html)'
  ) {
    // Crawler detected: set the contentType to HTML and send the
    // crawler text, bypassing the normal routing code
    res.contentType( 'html' );
    res.end( agent_text );
  }
  else {
    // The user agent isn't a crawler, so call next() to proceed to
    // the next route for normal processing
    next();
  }
});
...
This arrangement seems like it would be complicated to test, since we don't own a Googlebot. Google offers a service to do this for publicly available production websites as part of its Webmaster Tools (http://support.google.com/webmasters/bin/answer.py?hl=en&answer=158587), but an easier way to test is to spoof our user-agent string. This used to require some command-line hackery, but Chrome Developer Tools makes this as easy as clicking a button and checking a box (a scripted alternative appears after these steps):
1 Open the Chrome Developer Tools by clicking the button with three horizontal lines to the right of the Google Toolbar, and then selecting Tools from the menu and clicking on Developer Tools.
2 In the lower-right corner of the screen is a gear icon: click it to see advanced developer options such as disabling the cache and turning on logging of XmlHttpRequests.
3 In the second tab, labeled Overrides, click the check box next to the User Agent label and select from a drop-down of user agents ranging from Chrome to Firefox to IE to iPads, and more. The Googlebot agent isn't a default option. In order to use it, select Other and copy and paste the user-agent string into the provided input.
4 Now that tab is spoofing itself as a Googlebot, and when we open any URI on our site, we should see the crawler page.
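If we'd rather not involve the browser at all, the same check can be scripted from Node itself by sending a request with the Googlebot user-agent header. This is a minimal sketch, assuming the server is running locally on port 3000; adjust the host, port, and path to match your setup:

var http = require( 'http' );

http.get(
  { host    : 'localhost',
    port    : 3000,
    path    : '/index.htm',
    // Spoof the Googlebot user-agent string
    headers : {
      'user-agent' : 'Googlebot/2.1 (+http://www.googlebot.com/bot.html)'
    }
  },
  function ( res ) {
    var body = '';
    res.setEncoding( 'utf8' );
    res.on( 'data', function ( chunk ) { body += chunk; } );
    // If the routing from listing 9.1 is in place, this should print
    // the crawler text rather than the normal SPA shell
    res.on( 'end', function () { console.log( body ); } );
  }
);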
Obviously, different applications will have different needs with regard to what to do with web crawlers, but always returning the same single page to the Googlebot probably isn't enough. We'll also need to decide which pages we want to expose and provide ways for our application to map the _escaped_fragment_=key=value URI to the content we want to show. Whatever the case, this book should provide you with the tools to decide how best to abstract the crawler content for your application. You may want to get fancy and tie the server response into the front-end framework, but we usually take the simpler approach and create custom pages for the crawler, kept in a separate router file.
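One way to structure such a router file, complementing the user-agent check in listing 9.1, is to key the response on the _escaped_fragment_ value. The following is a minimal sketch rather than the book's implementation; the filename crawler_routes.js, the page keys, and the snapshot strings are hypothetical placeholders for the pages a real application would expose:

// crawler_routes.js - a sketch of a separate router file for crawler pages
// Map each _escaped_fragment_ value to its HTML snapshot
var snapshot_map = {
  'page=home'   : '<h1>Home</h1><p>SEO-optimized home page text...</p>',
  'page=about'  : '<h1>About</h1><p>SEO-optimized about page text...</p>',
  'page=buynow' : '<h1>Buy Now!</h1><p>SEO-optimized sales text...</p>'
};

module.exports = function ( app ) {
  app.get( '/index.htm', function ( req, res, next ) {
    // The crawler converts #!page=home into ?_escaped_fragment_=page=home,
    // which Express exposes as a query parameter
    var fragment = req.query._escaped_fragment_,
      snapshot   = fragment && snapshot_map[ fragment ];

    if ( snapshot ) {
      res.contentType( 'html' );
      res.end( snapshot );
      return;
    }
    // Not a crawler request, or a page we don't expose - fall through
    // to the normal SPA routes
    next();
  });
};

The module could then be wired up before the normal routes with something like require( './crawler_routes' )( app ); in the server startup file.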
There are also many more legitimate crawlers out there, so once we've adjusted our server for the Google crawler, we can expand our detection to include them as well.