Tuesday, June 30, 2009

Does Googlebot overload your server?

Getting indexed by Google is vital for your business, but what do you do when Googlebot hits your website too hard?

The problem
For one of our websites, Googlebot accounts for 90% of the hits - and that is a lot of requests. The problem is that Googlebot comes in waves, sometimes sending 10 requests per second or more, other times sitting quietly at around 1 request per second.

The server that runs the application should have no problem handling this, but if such a Googlebot wave overlaps certain batch jobs (at night), or even worse, peak traffic hours, things are not peachy at all. Since this is a database-heavy application, that translates into slow or even denied service for our (human) visitors.

As you know, when things start to go bad for a database-driven website, chances are they're going to get worse. The more requests the application gets, the more time it takes to deliver each one, and if the requests keep coming in at a constant or even increasing rate, this creates an ever faster growing queue which ultimately leads to swapping, huge loads and CPUs at 99.9% I/O wait = denial of service. Such is the price of success - if you're not prepared for it, that is.
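To make the snowball effect concrete, here is a minimal sketch of a toy model. The function name `simulateBacklog` and the assumption that every queued request slows the server by 1% are purely illustrative, not measurements from our server.

```php
<?php
// Toy model, not a measurement: every queued request is assumed to slow
// the server down by $slowdown (1% by default).
function simulateBacklog($arrivalsPerSec, $baseCapacity, $seconds, $slowdown = 0.01)
{
    $backlog = 0.0;
    $history = array();
    for ($t = 0; $t < $seconds; $t++) {
        // Effective capacity shrinks as the backlog grows.
        $capacity = $baseCapacity / (1 + $slowdown * $backlog);
        $backlog = max(0.0, $backlog + $arrivalsPerSec - $capacity);
        $history[] = $backlog;
    }
    return $history;
}

$quiet = simulateBacklog(8, 10, 60);   // traffic below capacity
$wave  = simulateBacklog(12, 10, 60);  // a crawler wave on top of normal traffic

printf("quiet: backlog after 60s = %.0f\n", end($quiet));
printf("wave:  backlog after 60s = %.0f\n", end($wave));
```

In this model the quiet case never builds a backlog, while in the wave case the backlog grows a little faster every second, because each queued request makes the next one slower to serve.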

The solution
Obviously you're going to want to get as much attention from Googlebot as you can. So, if you're like me, you're definitely not going to go to Google's Webmaster Tools and tell it to give you a rest.

What you want is a way to tell Googlebot to give you a rest only when the server is naturally busy, either with real traffic or with your routine operations, or simply when Googlebot is giving you exaggerated love. If at any point your server has some spare processing power, and Google is willing to index your pages... well, you know where I'm going.

There are 2 key elements to the solution that we implemented:


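// Check whether this request comes from Googlebot.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'googlebot') !== false) {
    // Read the 1-minute load average. sys_getloadavg() avoids parsing the
    // output of `uptime`, whose layout varies (not available on Windows).
    $loadavg = sys_getloadavg();
    $load = ($loadavg !== false) ? $loadavg[0] : 0;
    if ($load > 3) {
        // Server is busy: ask Googlebot to come back in one hour.
        header('HTTP/1.1 503 Service Temporarily Unavailable');
        header('Status: 503 Service Temporarily Unavailable');
        header('Retry-After: 3600');
        die();
    }
}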

Simply put, our server has 4 CPUs - that usually means that a load of up to 4 is OK. So we simply check whether this is a Googlebot visit, and if so, only allow it to run if the server load is no higher than 3. If the server is loaded, we tell Google so with a 503 response, and ask it to check back later - in one hour in this case.

Thursday, June 25, 2009

Setting timeout for Zend_Feed_RSS

Simple, but not an obvious one: the timeout is set on the HTTP client that Zend_Feed uses.

Zend_Feed::setHttpClient(
    new Zend_Http_Client(
        null,
        array(
            'timeout' => 3 // seconds
        )
    )
);

Tuesday, February 3, 2009

If you don't link to your pages, why should someone else?

And why would Google keep them in its index, if you don't even think they're valuable enough to be linked to? Did you ever experience pages being hit by Googlebot, staying in the index for a few days and then being dropped for no apparent reason? Now you know why - or at least one main reason why...

I'm not talking about getting links from different domains, but each page on your site should have at least one internal link pointing to it. And at least one static route from your home page to each page.

What does this mean? Simply put, there should be one chain like the one below leading to each page, and this chain should be relatively stable:

Homepage -> pageA -> pageB -> pageInQuestion

And relatively short; some would say at most 2 clicks away from the home page. Now we all know that's not really possible for large sites without clogging your pages with links. But, even staying in line with Google's guideline of 100 links per page:

2 clicks away - 10,000 pages
3 clicks away - 1,000,000 pages
4 clicks away - you'll never get that many pages in Google anyway :)
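The numbers above are just powers of 100. A minimal sketch of the arithmetic, assuming every page carries 100 links and no two pages link to the same target (the helper name `reachablePages` is made up for this example):

```php
<?php
// Upper bound on pages sitting exactly $clicks clicks from the home page,
// assuming $linksPerPage distinct links on every page.
function reachablePages($linksPerPage, $clicks)
{
    return pow($linksPerPage, $clicks);
}

foreach (array(2, 3, 4) as $clicks) {
    echo $clicks, ' clicks away - ', number_format(reachablePages(100, $clicks)), " pages\n";
}
```

Real sites share navigation links between pages, so the actual figures will be lower than this upper bound.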

So you get my point - have whatever navigation structure works best for your visitors. Have a stable, pyramidal, sitemap-like structure for Google. And supplement it with XML sitemaps. This gives the best chances that Googlebot will find, index and keep your pages.

Don't forget to use Google's Webmaster Tools to check what incoming links (internal and external) each page receives.
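If you have your own list of internal links (from a crawl or from your database), you can also check the "static route from the home page" rule yourself. The following is a rough sketch using a breadth-first search; the `$links` array, the URLs in it and the function names `clickDepths` and `orphanPages` are hypothetical examples.

```php
<?php
// Breadth-first search from the home page: returns page => number of clicks.
function clickDepths(array $links, $home)
{
    $depth = array($home => 0);
    $queue = array($home);
    while ($queue) {
        $page = array_shift($queue);
        $targets = isset($links[$page]) ? $links[$page] : array();
        foreach ($targets as $target) {
            if (!isset($depth[$target])) {
                $depth[$target] = $depth[$page] + 1;
                $queue[] = $target;
            }
        }
    }
    return $depth;
}

// Pages that exist in the link data but cannot be reached from the home page.
function orphanPages(array $links, $home)
{
    $reachable = array_keys(clickDepths($links, $home));
    $all = array_keys($links);
    foreach ($links as $targets) {
        $all = array_merge($all, $targets);
    }
    return array_values(array_diff(array_unique($all), $reachable));
}

// Hypothetical link graph: page => pages it links to.
$links = array(
    '/'       => array('/a', '/b'),
    '/a'      => array('/a/1'),
    '/b'      => array(),
    '/a/1'    => array(),
    '/lonely' => array('/'),
);

$depths  = clickDepths($links, '/');
$orphans = orphanPages($links, '/');
```

Here `/a/1` ends up 2 clicks from the home page, and `/lonely` is reported as an orphan: it links out, but nothing links to it.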

Friday, January 30, 2009

Before you even consider Page Rank (PR)

If you and your competitor(s) are competing head-to-head for exactly the same keywords, and your domains are exactly the same age, and the landing pages for those keywords are exactly as well optimized... PR is what you need.

If this is not the case (and even if you believe it is, chances are 99% that it's not), then there's still a big to-do list that you can work through yourself for your site. Do all this, and there's a good chance you'll pull ahead of your competition.

After all, Google does advise that SEO is all about "common sense". Your valuable PR8 is worth nothing if Google did not index your pages, or if those pages are not associated with the right keywords. This is something that escapes way too many people these days.

Google doesn't like fat
Serve slim, fat-free pages! That way Googlebot can eat more (pages) without becoming bloated...

Your page should have the smallest footprint possible:
  • move your CSS and JavaScript out of your HTML pages and into external files
  • get rid of comments
  • use gzip for delivery
  • code your HTML so that it's as small and efficient as possible - BONUS - your pages will load faster and your users will love you for it
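To get a feel for what gzip buys you, here is a small sketch that compresses a sample page and compares sizes. It assumes PHP's zlib extension is available; the helper name `gzipSavings` and the sample markup are made up for the example.

```php
<?php
// Sketch: measure how much gzip shrinks a page (requires the zlib extension).
function gzipSavings($html)
{
    $compressed = gzencode($html, 6);
    return array(
        'original' => strlen($html),
        'gzipped'  => strlen($compressed),
    );
}

// Sample markup: repetitive HTML, as navigation and listings usually are.
$html = str_repeat("<li class=\"item\"><a href=\"/page\">Page</a></li>\n", 200);
$sizes = gzipSavings($html);
echo $sizes['original'], ' bytes -> ', $sizes['gzipped'], " bytes\n";

// In a real page you would not call gzencode() yourself. One common option
// is PHP's built-in output handler, which compresses the response only when
// the client advertises gzip support:
// ob_start('ob_gzhandler');
```

Compression can also be switched on in the web server instead of in PHP; which one fits best depends on your setup.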
Google is impatient
Don't leave Googlebot hanging around for you! If Googlebot gives you 15 minutes each day, how many pages can you give it?

Make sure your server replies in a timely fashion:
  • optimize your server code
  • optimize your database
  • get a better connection
  • get a dedicated server if you need it
  • make sure you stay online - you don't want Googlebot knocking on your door with no one to answer, do you?
Google doesn't know what blind-love is
So if Google does not understand your site, how can it love it?

Speak GoogleBot's language:
  • have a clear navigation structure
  • have SEO-optimized URLs
  • do not put out duplicated content
  • make sure that each URL leads to a unique page - if Googlebot downloads the same content from 10 different URLs, that's just wasted resources for Google and for you
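One way to approach the "one URL per page" point is to normalize every incoming URL to a single canonical form. The sketch below is only an illustration: the function name `canonicalUrl`, the list of ignored parameters and the example.com URLs are all assumptions, and the right rules depend on your own site.

```php
<?php
// Hypothetical canonicalization rules; adjust them to your own site.
function canonicalUrl($url, array $ignoredParams = array('sessionid', 'utm_source', 'ref'))
{
    $parts  = parse_url($url);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = strtolower($parts['host']);
    $path   = isset($parts['path']) ? $parts['path'] : '/';
    if ($path !== '/') {
        $path = rtrim($path, '/');   // treat /shoes/ and /shoes as the same page
    }

    $query = array();
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
        foreach ($ignoredParams as $name) {
            unset($query[$name]);    // drop parameters that do not change content
        }
        ksort($query);               // a=1&b=2 and b=2&a=1 become identical
    }

    $canonical = $scheme . '://' . $host . $path;
    if ($query) {
        $canonical .= '?' . http_build_query($query, '', '&');
    }
    return $canonical;
}

echo canonicalUrl('http://Example.com/shoes/?b=2&a=1&sessionid=xyz'), "\n";
```

If the requested URL differs from its canonical form, a 301 redirect to the canonical URL tells Googlebot which version to keep.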
Google is looking for ... keywords
Keyword optimization remains very important. Do not go overboard, do not do keyword stuffing, do not do subdomain spam - these don't work in the long run, not anymore. But:
  • use your keywords inside your copy
  • research your keywords using Google Insights for Search and other tools
  • be consistent with your keywords - don't use too many variations
  • put your copy as high up in your HTML code as possible (hint: CSS)
  • in fact put your keywords high up in your copy, URL, links, alt tags
Oh yes, if you want Google to listen to you, listen to it:

They say they're just guidelines... nobody forces you to listen to them... unless you want traffic from Google that is!

We're not done, more coming, suggestions welcome!