Derik in Minnesota

Oct. 21st, 2009

02:43 pm - Remote Wiki Backup

I am not a trusting soul.
When the Wiki I contribute to crashed a few months ago, we discovered that our backups were not being performed as billed. That was a great sadness, and a scramble to back up ~27,000 pages out of web-caches, plus re-creating some very complicated templates from scratch. It was not fun.

Several months later… TFWiki is safely ensconced in a new host and humming along. Our new host’s backups have been tested and verified to work.

I still worry. I mean… it’s not like the backups are off-site. A fire would wipe us out. A properly robust backup system must allow users to download their own backups.
(Quite aside from the fact that our ostensible philosophical commitment when departing from Wikia should compel us to make reasonable backups to allow others to leave from us.)
Complicating things… I don’t have database access, so any backup system must, perforce, run remotely.

With all that in mind, I’ve been noodling at a script that scrapes the name of every page in every namespace from the wiki, then archives each page’s raw contents. No history, no user data, no IP addresses… just the essentials.

…it’s harder than you’d think. We’ve got plenty of articles with multibyte names (non-English characters) in addition to multibyte text. MySQL handles multibyte text easily enough… as does PHP if you beat it hard enough, but the default behavior of MySQL seems to be to deliver multibyte-encoded content as latin-1, scrambling foreign characters. Getting all the ducks lined up has been a back-burner project for a couple months.
(It was actually backburner since BEFORE the Bookworm Crash that wiped out TFWiki. I regret not having it worked out before then.)
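
For anyone who hasn't been bitten by this: the scrambling is what happens when UTF-8 bytes get read back as latin-1. Here is a minimal sketch of the effect, in Python rather than the PHP/MySQL setup described above. The sample title is made up, and Python's latin-1 codec is only standing in for whatever the database connection was actually doing.

```python
# Illustration only: the sample title is invented, not a page from the wiki.
title = "Café Déjà Vu"

utf8_bytes = title.encode("utf-8")        # the bytes a UTF-8 store actually holds
scrambled = utf8_bytes.decode("latin-1")  # the same bytes misread as latin-1


def unscramble(text: str) -> str:
    """Undo a latin-1 misreading and recover the original UTF-8 text."""
    return text.encode("latin-1").decode("utf-8")


print(scrambled)              # each accented letter turns into two junk characters
print(unscramble(scrambled))  # the original title again
```

The misreading is reversible as long as nothing else mangles the bytes in between, which is why getting the encoding settled before archiving matters more than fixing it afterwards.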

After some debugging, this is what I have:

  • A script that scrapes the names/namespaces of all the pages currently on the wiki– that’s about 37,000 pages.
  • A script that will then query the wiki for the raw (pre-render) text of these pages and store them in a database.
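
The two steps above can be sketched as two kinds of request. This is a guess at the shape, not the actual script: it is in Python, the base URL is hypothetical, and it assumes the wiki exposes MediaWiki's standard `api.php` page listing and the `action=raw` view. Whether the real script used the API or scraped a listing page isn't stated.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real wiki's address and script path are not given.
BASE = "https://wiki.example.org"


def allpages_url(namespace: int, limit: int = 500) -> str:
    """URL asking the MediaWiki API for one batch of page titles in a namespace."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,
        "aplimit": limit,
        "format": "json",
    }
    return f"{BASE}/api.php?{urlencode(params)}"


def raw_url(title: str) -> str:
    """URL for a page's raw (pre-render) wikitext.

    urlencode percent-encodes multibyte titles as UTF-8.
    """
    return f"{BASE}/index.php?{urlencode({'title': title, 'action': 'raw'})}"


print(allpages_url(0))
print(raw_url("Café"))
```

Long namespaces come back in batches, so a real run would also have to follow the API's continuation marker until the list is exhausted.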

This isn’t a great solution. Since I’m running on a remote web-server, it means making 37,000 individual queries to the wiki’s server. I could do this in minutes with database access, but since this is a live wiki, I’m throttling the queries to one every 10 seconds to prevent overloading the server, which means snapshotting all 37,000 pages will take… 10 days.
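
The throttle itself is simple enough to sketch. This is a Python stand-in with made-up function names, not the script described here. At 37,000 pages and 10 seconds apiece, the waiting alone comes to a little over four days; request time, database writes, and anything else the real script did come on top of that.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def throttled_fetch(
    titles: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) pairs, pausing between requests to spare the live wiki."""
    for i, title in enumerate(titles):
        if i:  # no need to wait before the very first request
            sleep(delay_seconds)
        yield title, fetch(title)


def sleep_days(pages: int, delay_seconds: float) -> float:
    """Days spent purely waiting between requests (fetch and storage time excluded)."""
    return (pages - 1) * delay_seconds / 86_400


print(round(sleep_days(37_000, 10), 1))
```

Passing `sleep` in as a parameter is only there so the loop can be exercised without actually waiting.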
There is no guarantee that the version I snapshot won’t be a vandalized page, reverted seconds after I take a picture. Or, with templates… that I’m not grabbing an intermediate version that isn’t working, or one that’s incompatible with an inter-dependent template snapshotted later. (My solution was to hard-code the templates to be snapshotted first, and simply monitor the recentChanges to make sure there were no edits to them while they were being scraped.)
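
The templates-first workaround boils down to an ordering plus a check. A rough Python sketch under stated assumptions: pages are (namespace number, title) pairs, recent changes are (title, timestamp) pairs, and all sample names are invented. The only fixed fact is that 10 is MediaWiki's built-in Template namespace.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

TEMPLATE_NS = 10  # MediaWiki's built-in Template namespace number


def templates_first(pages: Iterable[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Order (namespace, title) pairs so every template is scraped before anything else."""
    return sorted(pages, key=lambda p: (p[0] != TEMPLATE_NS, p[0], p[1]))


def edited_during(
    changes: Iterable[Tuple[str, datetime]], start: datetime, end: datetime
) -> List[str]:
    """Titles from a recent-changes list that were edited inside the scrape window."""
    return sorted({title for title, when in changes if start <= when <= end})


pages = [(0, "Main Page"), (10, "Infobox"), (14, "Stubs"), (10, "Cite")]
print(templates_first(pages))
```

Any template that `edited_during` reports for the window would need to be scraped again before the snapshot can be trusted.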
And of course the results aren’t in an easily imported format– they’re BLOB fields in an associative database that doesn’t correspond to MediaWiki’s structure. You can’t really do much with them in this form.

…except hold onto them. If something goes wrong… they’re not in the best format– but they’re archived with no character-encoding issues, in original wikitext. It would take some custom-coding, but the text would get back into the wiki with 0 loss.
Well, no loss… except the pages which have been edited since they had a snapshot taken.

In an ideal world… this script would be crontab’d and monitor its own progress and execution-time to adjust its own throttle, it would monitor recentChanges, and it would import the edits onto its own wiki– live-mirroring the other. Oh– and it’d do something about the images, which this doesn’t back up at all.

That goes on the backburner though. Next up for me is a total rewrite of the site’s bot, using some of what I learned here… there are some links it stubbornly refuses to fix, and I think that a proper systematic script will get it working better.

For now… I can hold onto this and be content. Whatever else may happen… I know the site will not be wiped out.
Bird in the hand.

Dec. 29th, 2008

11:53 am - Mirroring a remote directory with PHP

PHP time, yay!

This is a project I created to fulfill a specific need; I wanted to mirror the contents of a directory. This was a publicly-accessible, listable directory on a corporate web site. It contained sub-directories, which in turn contained product images. My script anticipates that very rigid file structure-- if your directory is different you'll need to make /major/ adjustments to your code.

I'm saving this as "mirror.php". First the backbone:
( Messy kludgy code samples below the cut )
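
The samples behind the cut aren't reproduced on this page. As a stand-in, here is a rough sketch of the first step of mirroring a listable directory: splitting one listing page into sub-directories and files. It is not the original mirror.php. It is in Python, the class and function names are made up, and it assumes an Apache-style index page where sub-directory links end in a slash.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect every href found in the <a> tags of a directory-listing page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for name, v in attrs if name == "href" and v)


def split_listing(base_url: str, html: str) -> Tuple[List[str], List[str]]:
    """Return (subdirectory URLs, file URLs) for one listing page."""
    parser = LinkCollector()
    parser.feed(html)
    dirs: List[str] = []
    files: List[str] = []
    for href in parser.hrefs:
        if href.startswith(("?", "/", "..")) or "://" in href:
            continue  # column-sort links, parent directory, or links leaving this directory
        (dirs if href.endswith("/") else files).append(urljoin(base_url, href))
    return dirs, files
```

Mirroring is then a matter of calling this on each sub-directory in turn and downloading whatever lands in the file list.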

Jul. 23rd, 2007

07:48 pm - Debuggery (IP Borders, intermezzo)

"...now why PHP is insisting that the sin of 90 degrees is .999999682932?"
-Me
(Coding is fun.)
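
One possible explanation, which the post doesn't confirm: converting degrees to radians with pi cut short to 3.14 lands on exactly that number. A quick check in Python (the arithmetic works out the same in PHP):

```python
import math

truncated = math.sin(90 * 3.14 / 180)  # degrees to radians with pi rounded to 3.14
exact = math.sin(math.radians(90))     # conversion using full-precision pi

print(f"{truncated:.12f}")
print(exact)
```

Near 90 degrees the sine curve is almost flat, so an error of 0.0008 radians in the angle only shows up in the seventh decimal place of the result.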

Current Location: MCAD

Jul. 21st, 2007

05:33 am - Unreal Estate (IP Borders, part 1)

The 25 countries with the 'largest' IPv4 Footprint

      Country               Size      %
 1)   UNITED STATES        48.59  46.90
 2)   JAPAN                 6.11   5.89
 3)   AUSTRALIA             2.42   2.33
 4)   CHINA                 6.24   6.03
 5)   UNITED KINGDOM        3.95   3.81
 6)   GERMANY               4.31   4.16
 7)   FRANCE                3.08   2.97
 8)   CANADA                2.76   2.66
 9)   REPUBLIC OF KOREA     3.18   3.06
10)   NETHERLANDS           2.39   2.30
11)   ITALY                 2.04   1.97
12)   SPAIN                 1.26   1.22
13)   SWEDEN                1.24   1.19
14)   BRAZIL                1.15   1.11
15)   SWITZERLAND           1.13   1.09
16)   TAIWAN                1.10   1.06
17)   MEXICO                0.97   0.94
18)   RUSSIAN FEDERATION    0.89   0.86
19)   NORWAY                0.76   0.73
20)   FINLAND               0.73   0.71
21)   POLAND                0.69   0.67
22)   AUSTRIA               0.56   0.54
23)   DENMARK               0.53   0.51
24)   INDIA                 0.51   0.49
25)   BELGIUM               0.47   0.46

I'm doing some work for the Science Museum of Minnesota this summer- one of the projects involves visualizations of globalization; we wanted to show how the world is connected-- a map of the internet.

I'll be honest, it was a modest goal. Surely there has to be a map out there somewhere showing how the world would be rearranged if it were laid out the way IP addresses are... right? Just crib from that and make a pretty version!

Apparently not. There are lots of maps of the internet, some beautiful traceroute visualizations, network topology, even physical density of IP addresses (which is approaching a 1:1 correlation with human physical density- regardless of where you are in the world.)

These are some beautiful visualizations, and I love them all- but none of them are what we want, or what I so callously committed myself to. I don't even have a clue what 'if IP addresses determined geographical layout' means! I suppose I could just kludge something together and declare it fits, but this is for the NOAA's exhibit- I'd like it to have some actual scientific basis.

...which is how I found myself running regular expressions over a datafile containing IP address tuples and their corresponding countries. A couple hours re-teaching myself PHP classes (I know, I know, I never learned to run a database locally, I'm scratch-writing stats analysis on a web server...) yielded a list with some useful numbers, the top 25 of which I've listed above.

the names for the bytes of an IPv4 address
(which I just made up)

  255.     255.       255.     255
  Ambit    Windward   Dative   Squib
  <-- General            Specific -->

IP addresses are like phone numbers for every machine physically connected to the internet. (If you're wireless, the modem you're wirelessing into has an IP address.) Example: 52.149.33.204. That's 4 numbers ('bytes'), each valued between 0 and 255. I gave the IP bytes names while I was working on them; it amused me so. Put together there are about 4.3 billion combinations, which is actually a problem since there are already 6 billion people on Earth. IPv6, still in development, is expected to fix this potential 'crunch' with plenty of time to spare before the entire planet gets wired.
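
To make the four-byte layout concrete, here is a small Python sketch (the function names are made up; the byte names are the invented ones from this post) that splits an address into its named bytes and packs it into a single 32-bit number:

```python
from typing import Dict

# Byte names are this post's own coinage, broadest first.
BYTE_NAMES = ("Ambit", "Windward", "Dative", "Squib")

TOTAL_ADDRESSES = 256 ** 4  # every combination of four bytes


def name_bytes(address: str) -> Dict[str, int]:
    """Split a dotted-quad IPv4 address into its four named bytes."""
    parts = [int(p) for p in address.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"not an IPv4 address: {address!r}")
    return dict(zip(BYTE_NAMES, parts))


def to_int(address: str) -> int:
    """Pack the four bytes into one number, Ambit as the most significant byte."""
    value = 0
    for part in name_bytes(address).values():
        value = value * 256 + part
    return value


print(name_bytes("52.149.33.204"))
print(f"{TOTAL_ADDRESSES:,}")
```

Treating an address as one big number is what makes ranges of addresses easy to compare and count.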

IP addresses are a hierarchy. The final set of values is the narrowest, the first the broadest; like genus and species, an IP address pares the options down until only one remains.

For my purposes, I wanted to know about the Ambit, the broadest upper category of IP addresses. There are 256 Ambits, each representing about 16.7 million addresses. I want to know which Ambits 'belong' to which country, and how they're divided if they belong to multiple countries.
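
The per-Ambit bookkeeping might look something like this. It's a Python sketch, not the PHP actually used, and it assumes each tuple in the datafile boils down to (first address, last address, country) with the addresses as 32-bit integers. The country codes in the example are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

AMBIT_SIZE = 256 ** 3  # addresses sharing one first byte: 16,777,216


def ambit_shares(ranges: Iterable[Tuple[int, int, str]]) -> Dict[int, Dict[str, int]]:
    """Count, per first byte, how many addresses each country holds.

    Each range is (first address, last address, country), inclusive.
    """
    shares: Dict[int, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for start, end, country in ranges:
        # A range may spill across several Ambits, so clip it to each one in turn.
        for ambit in range(start // AMBIT_SIZE, end // AMBIT_SIZE + 1):
            lo = max(start, ambit * AMBIT_SIZE)
            hi = min(end, (ambit + 1) * AMBIT_SIZE - 1)
            shares[ambit][country] += hi - lo + 1
    return {ambit: dict(countries) for ambit, countries in shares.items()}


# Placeholder data: "AA" holds the first half of Ambit 3, "BB" the rest plus a sliver of Ambit 4.
example = [
    (3 * AMBIT_SIZE, 3 * AMBIT_SIZE + AMBIT_SIZE // 2 - 1, "AA"),
    (3 * AMBIT_SIZE + AMBIT_SIZE // 2, 4 * AMBIT_SIZE + 99, "BB"),
]
print(ambit_shares(example))
```

Summing a country's counts across all Ambits and dividing by the total address space gives the kind of percentage shown in the table.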

Note: This addresses IP allocation, not utilization. That's fine, because I'm treating it like a natural resource- tapped or not, which countries are 'rich' in IPs? And utilization is maximizing anyway. One of the reasons to change to the 16-byte IPv6 system is that with more than 300 undecillion addresses, vast swaths will be reserved for special purposes regardless of utilization. (An IP address then will convey context information beyond simple location on a network.)

bytes of an IPv6 address

Squib           255    1
Dative          255    2
Windward        255    3
Ambit           255    4
Capchaw         255    5
Demonstrative   255    6
Midwife         255    7
Remorse         255    8
Shenlong        255    9
Exoletus        255   10
Déshabillé      255   11
Xabungle        255   12
Egregore        255   13
Ill Solace      255   14
Gabwhacker      255   15
Bleed           255   16

Staring at numbers with lots of caffeine and no sleep is a dangerous thing. Somewhere along the way I decided that a group of adjacent IP addresses was a Theory, like a pride of lions, and I took the time to name all 16 bytes in the upcoming IPv6 protocol, which I inflict on you above. I gave three of them genders too, but you have to guess which ones.

So armed with this purpose, and a database of 76,000 IP tuples from around the world, I went to work, and did just that. I discarded block-allocations that appear unused, compensated for the unused-allocations I found mentioned other places... and ended up being able to account for 94% of all IP space. The fringes of IP allocation are constantly shifting- China never had a chunk of territory assigned to it; it's slowly nibbled a huge chunk of IP space out of blocks it shares with other countries. This chart is imperfect- the fudging to compensate for those allocated-but-largely-unused blocks makes it appear Japan only has 2/3 of an IP address per resident- they actually have ~1.1. On the one hand Japan is wireless, which makes IPs less important. On the other it's wired to the hilt- thus near-full-utilization.

The US, no surprise, has about half of all IPv4 territory worldwide. In the final map, that will translate as area. Clustering (continents?) and borders... are more problematic. There will have to be some sort of node-analysis of which countries 'share' a border based on their overlap, and how strong that overlap is. Ideally, with some artistic statistical smudges I'll be able to get the resulting 'map' to break up into 5-10 'continents' I can position via Photoshop to reflect what the statistics tell me about which nations are actually each other's 'neighbors'.
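
One way the border analysis could start, sketched in Python. The overlap measure here is an assumption, not something settled in this post: for every first-byte block two countries share, take the smaller of their two holdings and add those up. The input is a mapping from block number to per-country address counts, and the country codes are placeholders.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Tuple


def border_strengths(shares: Dict[int, Dict[str, int]]) -> Dict[Tuple[str, str], int]:
    """Score each pair of countries by the address space they hold in the same blocks.

    Assumed measure: within every shared block, the smaller of the two holdings.
    """
    strengths: Dict[Tuple[str, str], int] = defaultdict(int)
    for holdings in shares.values():
        for a, b in combinations(sorted(holdings), 2):
            strengths[(a, b)] += min(holdings[a], holdings[b])
    return dict(strengths)


# Placeholder holdings for two blocks.
shares = {
    3: {"AA": 100, "BB": 40},
    4: {"BB": 10, "CC": 70, "AA": 5},
}
print(border_strengths(shares))
```

Pairs with high scores would be candidates for sharing a 'border', and clusters of strongly linked countries would be candidates for 'continents'.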

So, yeah. China has 5% of the internet and 20% of the population. America has 46% (about 1/3 of which is unused) and 5% of the population.

Mappy mappy!

Current Mood: busy