Ajax Crawling for .NET

"The browser can execute JavaScript and produce content on the fly - the crawler cannot"


According to Google: "The browser can execute JavaScript and produce content on the fly - the crawler cannot" https://developers.google.com/webmasters/ajax-crawling/docs/learn-more

AJAX what?

A typical AJAX-driven JavaScript application does the heavy lifting of "injecting" HTML into the existing DOM. Think @RenderBody in MVC, but dynamically and without reloading the whole page.

This is certainly progress in the way web sites are built. When a new "page" is requested, the URL changes and a fragment is appended to the end, e.g. http://www.myexample.com/#/about (note the # in the URL). The JavaScript reads this fragment and fetches the HTML defined within the application. This process saves round trips to the server and makes it possible to produce richer, more responsive web sites. It also helps to decouple the UI from the server.

The Problem

The above process is great for web sites, but bad for search engine crawlers. And if it's bad for them, it's bad for you! As stated above, the crawler cannot execute JavaScript and produce content on the fly like your browser can. It needs a static "snapshot" of what you intend to have served to the browser when the URL http://www.myexample.com/#/about is requested.

We cannot, however, simply create MVC routes to serve snapshots based on the hash-containing URL, as we'd start mixing concerns pretty quickly.

A Solution

To tell search engine crawlers that the contents of the page at a URL are dynamically created, all you need to do is start using #! (hashbang) instead of #.

So

http://www.myexample.com/#/about

becomes

http://www.myexample.com/#!/about

This tells the crawler that the contents of the page at this address are dynamically inserted via AJAX. What the crawler does at this point is important: it replaces the #! with ?_escaped_fragment_=. This allows your server to identify that an HTML snapshot should be served rather than the dynamic, cool, AJAXy page.
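The rewrite the crawler performs can be sketched in plain C# (a hypothetical helper for illustration only; note that the real scheme also URL-encodes certain characters in the fragment, which this sketch ignores):

```csharp
using System;

static class EscapedFragment
{
    // Sketch of the crawler's rewrite: the pretty #! URL becomes a
    // plain query-string request that the server can recognise.
    public static string ToCrawlerUrl(string prettyUrl)
    {
        return prettyUrl.Replace("#!", "?_escaped_fragment_=");
    }
}
```

So http://www.myexample.com/#!/about is requested by the crawler as http://www.myexample.com/?_escaped_fragment_=/about.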

How do I handle this in ASP.NET MVC?

By using #! we have told the crawler that we have dynamic content on our page; the crawler then makes a request, but replaces the #! with ?_escaped_fragment_=. Using ASP.NET MVC we can handle this request quite easily.

ActionFilters to the rescue

First, create an action filter called AjaxCrawlable.

Within the action filter, check the request for _escaped_fragment_. If it doesn't exist, return early; we're not interested, so let the server serve the dynamic AJAX content.

Next, split up the query string value that ?_escaped_fragment_= has produced: var parts = request.QueryString[Fragment].Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries); This breaks up the fragment based on /.

You can now use the parts of the query string to identify what HTML content the server should serve.

If you're using /{controller}/{action}/{id} as a pattern, parts[0] will be your controller, parts[1] your action, and parts[2] your id.
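As a quick, self-contained sketch of that split (plain C#, no MVC required), a fragment following the /{controller}/{action}/{id} pattern breaks apart like so:

```csharp
using System;

class FragmentSplitExample
{
    static void Main()
    {
        // The value the server sees for ?_escaped_fragment_=/home/index/5
        var fragment = "/home/index/5";

        // RemoveEmptyEntries discards the empty string before the leading /
        var parts = fragment.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);

        Console.WriteLine(parts[0]); // controller: "home"
        Console.WriteLine(parts[1]); // action:     "index"
        Console.WriteLine(parts[2]); // id:         "5"
    }
}
```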

You should have something that looks a little like this:

public class AjaxCrawlableAttribute : ActionFilterAttribute
{
  // The query string key the crawler substitutes for #!
  private const string Fragment = "_escaped_fragment_";

  public override void OnActionExecuting(ActionExecutingContext filterContext)
  {
    var request = filterContext.RequestContext.HttpContext.Request;

    // No _escaped_fragment_ present: a normal browser request,
    // so let the dynamic AJAX content be served as usual.
    if (string.IsNullOrWhiteSpace(request.QueryString[Fragment]))
    {
      return;
    }

    // Break the fragment into route-like parts, e.g. "/about" -> ["about"]
    var parts = request.QueryString[Fragment].Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);

    if (parts.Length > 0)
    {
    ...
    }
  }
}

Now to insert the HTML snapshot. One solution that I use is to read the HTML file contents into an MvcHtmlString object and save it to a ViewBag property.

filterContext.Controller.ViewBag.HTMLPage =
  MvcHtmlString.Create(
    System.IO.File.ReadAllText(
      HttpContext.Current.Server.MapPath("~") + "pages/" + parts[0] + ".html",
      System.Text.Encoding.UTF8));
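A word of caution: parts[0] comes straight from the request, so a malicious URL could ask for something like "../web.config". One way to guard against this, sketched below with a hypothetical helper (not part of the filter above), is to strip any directory components before building the snapshot path:

```csharp
using System.IO;

static class SnapshotPath
{
    // Hypothetical helper: keeps only the file name portion of the
    // requested part, so a value like "../secret" cannot escape the
    // pages folder when the path is later mapped and read.
    public static string FileNameFor(string part)
    {
        return "pages/" + Path.GetFileName(part) + ".html";
    }
}
```

You would then pass SnapshotPath.FileNameFor(parts[0]) to Server.MapPath instead of concatenating the raw value.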

The Controller

In your controller add your action filter to any of your view methods that might contain dynamic content.

[AjaxCrawlable]
public ActionResult Index()
{
  return View();
}

The View

Then simply check whether the HTMLPage property is null; if it isn't, use it, otherwise render your dynamic content:

@if (ViewBag.HTMLPage != null)
{
    <!-- Snapshot -->
    @ViewBag.HTMLPage
}
else
{
    <!-- Your dynamic content here -->
}

Testing

Simply run your site with JavaScript disabled. When you want to test http://www.myexample.com/#!/about, replace the #! with ?_escaped_fragment_= and go to http://www.myexample.com/?_escaped_fragment_=/about instead. You should see your page displayed with the usually dynamically added content in place, ready to be crawled.

Resources