Restoring a WordPress site by scraping/crawling Google

I love challenges, but once in a while they tend to be way too big! During my Christmas holidays I accidentally wiped my home server. I wanted to do some LVM stuff online, remotely, without console access, through the ESXi console … and really thought that nothing could go wrong ;-). First assumption being wrong.

To make a long story short: I shot myself in the foot and was without a server for 3-4 days. When I got home again, I thought a simple reboot and some LVM magic would make everything all right. Second assumption being wrong.

So in the very end I had to reinstall my server from scratch. Luckily I back up my stuff, and so should you! It will save your butt some day!

It turned out that, for some bizarre reason, my database had not been dumped to CSV files. So in the end I came to these conclusions:

  • I lost my database.
  • I thus lost my WordPress blog.

:-(

But, loving challenges, I refused to let that be the end. I thought about using archive.org, but they did not have a recent crawl of my site.

I decided to crawl Google! Not as easy as it might sound, for a couple of reasons:

  • Google does not like being crawled ... at all. If Google's infinite number of computers discover that you are crawling them, your IP will be blocked from seeing their cached content.
  • When you enter keywords into Google you normally get thousands of links to follow. I needed one. The correct one! The one that was a cached version of my site.
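
Picking "the correct one" out of a page of results can be sketched roughly like this. It is Python rather than the Perl I actually used, and it rests on an assumption: that cached copies are linked from the host webcache.googleusercontent.com with a "q=cache:..." parameter. The function name pick_cached_link is made up for the illustration.

```python
from urllib.parse import urlparse, parse_qs

# Assumption: cached copies are linked from this host with a "q=cache:..." parameter.
CACHE_HOST = "webcache.googleusercontent.com"


def pick_cached_link(hrefs, site):
    """Return the first href that looks like a cached copy of `site`, else None."""
    for href in hrefs:
        parts = urlparse(href)
        if parts.netloc != CACHE_HOST:
            continue  # an ordinary result link, not a cached copy
        query = parse_qs(parts.query).get("q", [""])[0]
        if query.startswith("cache:") and site in query:
            return href
    return None
```

Feed it every link found on a result page plus your own domain; anything that is not a cached copy of that domain is ignored.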

So I fired up my editor and utilized the great WWW::Mechanize. I ended up with this script, which does all the hard work of scraping Google. It will take some time to complete: hours, or even days! It will get there though. If you try to speed things up it will take longer, as you will be blocked by Google when they detect that you are scraping them. Be warned. Been there. Tried that. Got blocked.
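
The "go slow or get blocked" idea boils down to something like the sketch below. It is not my original Perl script, just a minimal Python illustration. The fetch callable, the delay values and the retry count are all assumptions; pick numbers that match your own patience.

```python
import random
import time


def crawl_slowly(urls, fetch, min_delay=30.0, max_delay=120.0,
                 max_retries=3, sleep=time.sleep):
    """Fetch each URL with a random pause in between; back off when blocked.

    `fetch(url)` is a hypothetical callable returning the page body, or None
    when the request was refused (for example, a block page came back).
    """
    pages = {}
    for url in urls:
        delay = random.uniform(min_delay, max_delay)
        for attempt in range(max_retries):
            body = fetch(url)
            if body is not None:
                pages[url] = body
                break
            # Blocked: wait longer (doubling each time) before trying again.
            sleep(delay * 2 ** (attempt + 1))
        # Pause before the next URL, whether or not this one succeeded.
        sleep(delay)
    return pages
```

The random pause keeps the requests from arriving at a fixed rhythm, and the doubling wait after a refusal is the part that saves you from digging the hole deeper once you have been noticed.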

Having retrieved all of my old site through Google, I had to parse these pages and import them into WordPress. So again I fired up my editor and wrote this little script. For this to work, you need:

    • a clean WordPress installation with a hello world post
    • XMLRPC writing enabled in WordPress, as the script uses WordPress::XMLRPC.
    • the following in wp-config.php
      define( 'AUTOSAVE_INTERVAL',    3600 );     // autosave 1x per hour
      define( 'WP_POST_REVISIONS',    false );    // no revisions
      define( 'DISABLE_WP_CRON',      true );
      define( 'EMPTY_TRASH_DAYS',     7 );        // one week
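
The import step itself can be sketched as below. This is Python's standard xmlrpc.client calling WordPress's wp.newPost XML-RPC method, not the WordPress::XMLRPC Perl module my script used. The helper names build_post and import_post, the URL and the credentials are placeholders for the illustration, and the parsing of the cached pages is left out entirely.

```python
import xmlrpc.client
from datetime import datetime


def build_post(title, html, published):
    """Build the content struct for the wp.newPost XML-RPC call."""
    return {
        "post_type": "post",
        "post_status": "publish",
        "post_title": title,
        "post_content": html,
        # Keep the original publication date instead of "now".
        "post_date": xmlrpc.client.DateTime(published),
    }


def import_post(server, username, password, post, blog_id=1):
    """Send one post; `server` is an xmlrpc.client.ServerProxy (or a stand-in)."""
    return server.wp.newPost(blog_id, username, password, post)


# Usage (hypothetical URL and credentials):
# server = xmlrpc.client.ServerProxy("https://example.org/xmlrpc.php")
# post = build_post("Hello again", "<p>Recovered text</p>", datetime(2010, 1, 5, 12, 0))
# post_id = import_post(server, "admin", "secret", post)
```

Passing the server object in, rather than creating it inside the function, makes it easy to try the import against a stand-in before pointing it at the real blog.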

So in the end, what did I lose and what did I learn? I lost the comments on my site. Or more precisely: I have them, but I will postpone putting them back in until I get the time to fool around with coding again. And I learnt a lot about triple-checking my backups for all their pieces before doing storage-related work remotely without a proper console!
