Derik in Minnesota
Oct. 21st, 2009
02:43 pm - Remote Wiki Backup
I am not a trusting soul.
When the Wiki I contribute to crashed a few months ago, we discovered that our backups were not being performed as billed. That was a great sadness, and a scramble to back up ~27,000 pages out of web-caches, plus re-creating some very complicated templates from scratch. It was not fun.
Several months later… TFWiki is safely ensconced on a new host and humming along. Our new host’s backups have been tested and verified to work.
I still worry. I mean… it’s not like the backups are off-site. A fire would wipe us out. A properly robust backup system must allow users to download their own backups.
(Quite aside from the fact that our ostensible philosophical commitment when departing from Wikia should compel us to make reasonable backups available, so that others are free to leave us in turn.)
Complicating things… I don’t have database access, so any backup system must, perforce, run remotely.
With all that in mind, I’ve been noodling at a script to scrape the name of every page in every namespace from the wiki, then archive each page’s raw contents. No history, no user data, no IP addresses… just the essentials.
…it’s harder than you’d think. We’ve got plenty of articles with multibyte names (non-English characters) in addition to multibyte text. MySQL handles multibyte text easily enough… as does PHP if you beat it hard enough, but the default behavior of MySQL seems to be to deliver multibyte-encoded content as Latin-1, scrambling foreign characters. Getting all the ducks lined up has been a back-burner project for a couple of months.
(It was actually backburner since BEFORE the Bookworm Crash that wiped out TFWiki. I regret not having it worked out before then.)
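To make the scrambling concrete, here is a minimal sketch in Python (chosen only for illustration; the function names and the sample title are made up). It shows what happens when UTF-8 bytes are read back as Latin-1, and why the damage is reversible as long as no bytes were dropped along the way:

```python
def scramble(text):
    """Simulate the bug: UTF-8 bytes interpreted as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def unscramble(text):
    """Undo it: recover the original bytes, then decode them as UTF-8."""
    return text.encode("latin-1").decode("utf-8")


title = "Café 変形"            # hypothetical multibyte page title
broken = scramble(title)

print(len(title), len(broken))      # the scrambled form is longer
print(unscramble(broken) == title)  # True: nothing was actually lost
```

On the MySQL side, the usual cure is to tell the connection which character set to use (for example with `SET NAMES utf8`) rather than repairing text after the fact. Whether that is the right fix for any particular setup is an assumption.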
After some debugging, this is what I have:
- A script that scrapes the names/namespaces of all the pages currently on the wiki– that’s about 37,000 pages.
- A script that will then query the wiki for the raw (pre-render) text of these pages and store them in a database.
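The two steps above can be sketched against a stock MediaWiki install, assuming `api.php` is enabled. This is one possible approach, not necessarily how the scripts described here work; the base URL is a placeholder and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Placeholder base URL: substitute the wiki's real script path.
WIKI = "https://wiki.example.org/w"


def allpages_url(namespace, start_from=None, limit=500):
    """URL for one batch of page titles in a namespace (list=allpages)."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,
        "aplimit": limit,
        "format": "json",
    }
    if start_from is not None:
        params["apfrom"] = start_from  # title to resume enumerating from
    return WIKI + "/api.php?" + urlencode(params)


def raw_url(title):
    """URL for the raw (pre-render) wikitext of one page."""
    return WIKI + "/index.php?" + urlencode({"title": title, "action": "raw"})


def titles_from_response(body):
    """Pull the page titles out of one allpages JSON response."""
    data = json.loads(body)
    return [page["title"] for page in data["query"]["allpages"]]
```

`urlencode` percent-encodes titles as UTF-8, so multibyte page names survive the trip into the query string.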
This isn’t a great solution. Since I’m running on a remote web server, it means making 37,000 individual requests to the wiki. I could do this in minutes with database access, but since this is a live wiki, I’m throttling the requests to one every 10 seconds to avoid overloading the server, which means snapshotting all 37,000 pages will take… 10 days by my estimate. (The waits alone add up to 370,000 seconds, a bit over four days, before counting the time each request itself takes.)
There is no guarantee that the version I snapshot won’t be a vandalized page, reverted seconds after I take the picture. Or with templates… that I’m not grabbing an in-between version that doesn’t work, or one that’s incompatible with an inter-dependent template snapshotted later. (My solution was to hard-code the templates to be snapshotted first, and simply monitor recentChanges to make sure there were no edits to them while they were being scraped.)
And of course the results aren’t in an easily imported format– they’re BLOB fields in an associative database that doesn’t correspond to MediaWiki’s structure. You can’t really do much with them in this form.
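A throttled fetch loop of this kind might look like the sketch below. The `fetch_all` name and its parameters are hypothetical; `fetch` stands in for whatever actually retrieves a page, and `sleep` is injectable so the loop can be tried out without really waiting.

```python
import time


def fetch_all(titles, fetch, delay_seconds=10, sleep=time.sleep):
    """Fetch each title in turn, pausing between requests.

    `fetch` is any callable that takes a title and returns its raw text.
    """
    results = {}
    for index, title in enumerate(titles):
        if index > 0:
            sleep(delay_seconds)  # be polite to the live server
        results[title] = fetch(title)
    return results


# Dry run with a fake fetcher and a recorded (not real) sleep.
waits = []
pages = fetch_all(["A", "B", "C"], lambda t: "text of " + t, sleep=waits.append)
print(len(pages), waits)
```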
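The templates-first idea, plus the check against recent edits, can be sketched as two small functions. Both names are hypothetical, and how the list of recent changes is obtained is left open. The only MediaWiki-specific fact relied on is that the Template namespace is number 10.

```python
TEMPLATE_NS = 10  # MediaWiki's Template namespace


def snapshot_order(pages):
    """Put Template pages first, keeping the original order otherwise.

    `pages` is a list of (namespace_number, title) pairs.
    """
    # sorted() is stable, and False sorts before True.
    return sorted(pages, key=lambda page: page[0] != TEMPLATE_NS)


def edited_during(recent_changes, watched_titles, start, end):
    """Return the watched titles that were edited between start and end.

    `recent_changes` is a list of (title, timestamp) pairs.
    """
    watched = set(watched_titles)
    return sorted({title for title, when in recent_changes
                   if title in watched and start <= when <= end})
```

Anything `edited_during` returns for the template batch would need to be snapshotted again before moving on to ordinary pages.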
…except hold onto them. If something goes wrong… they’re not in the best format– but they’re archived with no character-encoding issues, in original wikitext. It would take some custom-coding, but the text would get back into the wiki with 0 loss.
Well, no loss… except the pages which have been edited since they had a snapshot taken.
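For a sense of what “archived with no character-encoding issues” can mean in practice, here is a minimal sketch that stores wikitext as raw UTF-8 bytes in BLOB columns. It uses SQLite from Python’s standard library as a stand-in for MySQL, and the table and column names are invented for the example.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) the snapshot archive."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS snapshot (
        namespace INTEGER NOT NULL,
        title     BLOB    NOT NULL,
        wikitext  BLOB    NOT NULL,
        taken_at  TEXT    NOT NULL,
        PRIMARY KEY (namespace, title))""")
    return db


def store(db, namespace, title, wikitext, taken_at):
    """Save one page as UTF-8 bytes, replacing any older snapshot of it."""
    db.execute("INSERT OR REPLACE INTO snapshot VALUES (?, ?, ?, ?)",
               (namespace, title.encode("utf-8"),
                wikitext.encode("utf-8"), taken_at))
    db.commit()


def load(db, namespace, title):
    """Return the stored wikitext for a page, or None if it is missing."""
    row = db.execute(
        "SELECT wikitext FROM snapshot WHERE namespace = ? AND title = ?",
        (namespace, title.encode("utf-8"))).fetchone()
    return None if row is None else row[0].decode("utf-8")
```

Because the database only ever sees bytes, it has no chance to re-interpret them under the wrong character set. Keeping `taken_at` alongside each page also makes it possible to tell which snapshots predate a later edit.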
In an ideal world… this script would be crontab’d and monitor its own progress and execution-time to adjust its own throttle, it would monitor recentChanges, and it would import the edits onto its own wiki– live-mirroring the other. Oh– and it’d do something about the images, which this doesn’t back up at all.
That goes on the backburner though. Next up for me is a total rewrite of the site’s bot, using some of what I learned here… there are some links it stubbornly refuses to fix, and I think that a proper systematic script will get it working better.
For now… I can hold onto this and be content. Whatever else may happen… I know the site will not be wiped out.
Bird in the hand.
