Tuesday, June 30, 2009

Does Googlebot overload your server?

Getting indexed by Google is vital for your business, but what do you do when Googlebot hits your website too hard?

The problem
For one of our websites, Googlebot accounts for 90% of the hits - and that is a lot of requests. The problem is that Googlebot comes in waves, sometimes sending 10 requests per second or more, at other times sitting quietly at around 1 request per second.

The server that runs the application should have no problem handling this, but if such a Googlebot wave overlaps certain batch jobs (at night), or even worse, peak traffic hours, things are not peachy at all. In other words, since this is a heavy database application, the extra load translates into slow, or even denied, service for our (human) visitors.

As you know, when things start to go bad for a database-driven website, chances are they're going to get worse. The more requests the application gets, the more time it takes to serve each one, and if the requests keep coming in at a constant or even increasing rate, this creates an exponentially growing queue which ultimately leads to swapping, huge loads, CPUs 99.9% in IO wait = denial of service. Such is the price of success - if you're not prepared for it, that is.
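
To see the shape of that spiral, here is a toy simulation of my own - not taken from our setup - where requests arrive at a constant rate and the per-request service time grows with the backlog, standing in for lock contention and swapping. The arrival rate, base service time, and contention factor are all made-up numbers chosen only to illustrate the feedback loop.

<?php
// toy model: constant arrival rate, service time degrades as the backlog grows
$arrivals = 15.0;  // requests arriving per second (constant) - made-up figure
$baseTime = 0.08;  // seconds per request on an idle server - made-up figure
$backlog  = 0.0;   // requests waiting in the queue

for ($t = 1; $t <= 8; $t++) {
    // assumption: each queued request slows all the others (contention, swap)
    $perRequest = $baseTime * (1 + $backlog / 20);
    $capacity   = 1.0 / $perRequest;               // requests served this second
    $backlog    = max(0.0, $backlog + $arrivals - $capacity);
    printf("t=%ds  backlog=%.1f requests\n", $t, $backlog);
}

Once the backlog dominates, capacity keeps shrinking while arrivals don't, which is why the queue bends upward instead of leveling off.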

The solution
Obviously you're going to want as much attention from Googlebot as you can get. So, if you're like me, you're definitely not going to go to Google's Webmaster Tools and tell it to give you a rest.

What you want is a way to tell Googlebot to give you a rest only when the server is naturally busy - with real traffic, with routine batch operations, or simply when Googlebot is giving you exaggerated love. If at any point your server has some spare processing power, and Google is willing to index your pages... well, you know where I'm going.

There are 2 key elements to the solution that we implemented: detecting Googlebot by its user agent, and checking the current server load before serving it.


// check if this request comes from Googlebot
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'googlebot') !== false) {
    // check the 1-minute load average; sys_getloadavg() avoids
    // fragile parsing of `uptime` output, which varies by system
    $load = sys_getloadavg();
    if ($load[0] > 3) {
        // tell Googlebot we are busy and ask it to retry in an hour
        header('HTTP/1.1 503 Service Temporarily Unavailable');
        header('Status: 503 Service Temporarily Unavailable');
        header('Retry-After: 3600');
        die();
    }
}

Simply put, our server has 4 CPUs - that usually means a load of up to 4 is OK. So we simply check whether this is a Googlebot visit, and if so, only serve the request if the server load is below 3. If the server is loaded, we simply tell this to Google, and ask it to check back later - in one hour, in this case.
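
If you'd rather not hard-code the threshold, here is a minimal sketch of deriving it from the CPU count. This assumes a Linux host where /proc/cpuinfo is readable, and the 0.75 headroom factor is an arbitrary choice of mine, not something from our setup.

<?php
// hypothetical sketch: derive the load threshold from the CPU count
// instead of hard-coding 3 (assumes Linux, where /proc/cpuinfo lists
// one "processor" line per logical CPU)
$info = file_get_contents('/proc/cpuinfo');
preg_match_all('/^processor\b/m', $info, $m);
$ncpu = count($m[0]);

// leave some headroom for real visitors; 0.75 is an arbitrary factor
$threshold = max(1, (int) floor($ncpu * 0.75));

$load = sys_getloadavg();
if ($load[0] > $threshold) {
    // send the same 503 + Retry-After response as above
}

On the 4-CPU box from this post that yields the same threshold of 3, but it adapts if the application ever moves to different hardware.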

2 comments:

  1. Hi, this looks interesting. Recently we have been having Googlebot cause server overloads.

    One question: my Apache/security setup is designed not to output errors like the 503. Would this script still work without the header portion? Essentially, our server outputs a blank page instead of this error when the check returns true. Might that just confuse Googlebot? It would in effect have Googlebot keep trying, but at least it would not overload the site?

  2. Well, if you give out a 200 answer with a blank page, that tells Google the URL holds a blank page.

    In reality, Googlebot might consider it a soft 404 and retry it.

    As far as I know, though, 503 is the correct answer to server overload.
