The problem
For one of our websites, Googlebot accounts for 90% of the hits - and that is a lot of requests. The problem is that Googlebot comes in waves, sometimes sending 10 requests per second or more, other times sitting quietly at around 1 request per second.
The server that runs the application should have no problem handling this, but if such a Googlebot wave overlaps certain batch jobs (at night), or even worse, peak traffic hours, things are not peachy at all. Because this is a database-heavy application, the overlap translates into slow, or even denied, service for our (human) visitors.
As you know, when things start to go bad for a database-driven website, chances are they're going to get worse. The more requests the application gets, the more time it takes to deliver on each, and if the requests keep coming in at a constant or even increasing rate, this creates a queue that grows faster and faster, which ultimately leads to swapping, huge loads, CPUs 99.9% in IO wait = denial of service. Such is the price of success - if you're not prepared for it, that is.
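To make the snowball effect a bit more concrete, here is a toy simulation. It is only a sketch: the function name simulateBacklog() and the slowdown rule (every 10 queued requests cost the server one request per second of capacity) are assumptions made up for illustration, not measurements from our server.

```php
<?php
// Toy model of a request queue under overload.
// All names and numbers here are hypothetical, for illustration only.

/**
 * Returns the number of queued requests at the end of each second.
 */
function simulateBacklog(int $arrivalsPerSec, int $baseCapacity, int $seconds): array
{
    $backlog = 0;
    $history = [];
    for ($t = 0; $t < $seconds; $t++) {
        // Assumed slowdown: every 10 queued requests reduce the
        // capacity by one request per second (never below 1).
        $capacity = max(1, $baseCapacity - intdiv($backlog, 10));
        // Whatever cannot be served this second stays in the queue.
        $backlog = max(0, $backlog + $arrivalsPerSec - $capacity);
        $history[] = $backlog;
    }
    return $history;
}

// Below capacity the queue stays empty.
$calm = simulateBacklog(10, 12, 5);
// Above capacity the queue grows, and the growth itself speeds up
// once the backlog starts eating into capacity.
$wave = simulateBacklog(15, 12, 5);
```

With these made-up numbers, 10 requests per second against a capacity of 12 never builds a queue, while 15 requests per second adds 3 queued requests each second at first and more once the slowdown kicks in.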
The solution
Obviously you're going to want to get as much attention from Googlebot as you can. So, if you're like me, you're definitely not going to go to Google's Webmaster Tools and tell it to give you a rest.
What you want is a way to tell Googlebot to give you a rest only when the server is naturally busy, whether with real traffic, with your routine operations, or simply because Googlebot is giving you exaggerated love. If at any point your server has some spare processing power, and Google is willing to index your pages... well, you know where I'm going.
There are 2 key elements to the solution that we implemented:
1. Server load (http://en.wikipedia.org/wiki/Load_(computing))
2. The 503 HTTP response code (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html)
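// Check whether this request comes from Googlebot.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'googlebot') !== false) {
    // Read the 1-minute load average from the output of uptime.
    // Matching the "load average:" label is safer than picking a fixed
    // word position, which shifts with the uptime and the user count.
    $data = (string) shell_exec('uptime');
    $load = 0;
    if (preg_match('/load averages?:\s*([0-9]+(?:[.,][0-9]+)?)/', $data, $m)) {
        // Some locales print a decimal comma instead of a point.
        $load = (float) str_replace(',', '.', $m[1]);
    }
    if ($load > 3) {
        // Too busy: ask Googlebot to come back in one hour.
        header('HTTP/1.1 503 Service Temporarily Unavailable');
        header('Status: 503 Service Temporarily Unavailable');
        header('Retry-After: 3600');
        die();
    }
}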
Simply put, our server has 4 CPUs - that usually means that a load of up to 4 is OK. So we simply check whether this is a Googlebot visit, and if so, only let it run when the server load is no higher than 3. If the server is loaded, we tell Google so with a 503 response and ask it to check back later - in one hour in this case (Retry-After: 3600).
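If you would rather not shell out at all, PHP's built-in sys_getloadavg() returns the same 1, 5 and 15 minute load averages directly. The variant below is a minimal sketch, not what we run in production: the helper name shouldDeferCrawler() and its parameters are hypothetical, and it assumes you want to fail open (serve the request) whenever the load cannot be read.

```php
<?php
// Sketch only: shouldDeferCrawler() is a hypothetical helper name.

/**
 * Decide whether a request should be answered with a 503.
 * Only Googlebot is ever deferred; everyone else is always served.
 */
function shouldDeferCrawler(string $userAgent, float $load, float $maxLoad): bool
{
    if (stripos($userAgent, 'googlebot') === false) {
        return false;
    }
    return $load > $maxLoad;
}

// sys_getloadavg() is not implemented on Windows and may return false,
// so guard both cases and treat an unknown load as 0 (fail open).
$averages = function_exists('sys_getloadavg') ? sys_getloadavg() : false;
$load = is_array($averages) ? (float) $averages[0] : 0.0;

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

// 3.0 is the same threshold as above, chosen for a 4 CPU server.
if (shouldDeferCrawler($userAgent, $load, 3.0)) {
    header('HTTP/1.1 503 Service Temporarily Unavailable');
    header('Retry-After: 3600');
    exit;
}
```

Keeping the decision in a small function makes the threshold easy to adjust for a server with a different number of CPUs, and easy to check without actually loading the machine.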