like a ninja from heaven (deriksmith) wrote,

The last helicopter out of Wikia (filtering page text)

Second in my series about Decamping from Wikia... is the question of importing page text.

This entry has two parts: The Rant (where I talk about the problem) and The Tech Part (where I also rant, but interspersed with code samples).

(If this is slightly less coherent than usual... I'm drugged up to the gills after wrenching my neck over Labor Day. Just know that I feel great!)

The Rant

Quick note: I've been seeing messages recently that Wikia's pages-for-export archives are out of date. Wikia has apparently been having caching problems for about two weeks now, and I suspect that's at the root of the problem. When they get that cleared up, normal archiving will probably resume.
Their image archives, OTOH, are always out of date. Teletraan I's was last updated over a year ago. We inquired about this and one of their helpful (!) @wikias offered to stomp on the system until it produced an up-to-date archive for us, but also requested we not ask for it until we were ready, since there was Work Involved. (I can respect that.)
 :Update: M.mendel indicates this problem has indeed been fixed.

So, importing page text. That sounds huge doesn't it? Aren't wikis all about pages? What else is there?

Well, we're not talking about pages, we're talking about page text: the stuff you see in the <TEXTAREA> box when editing. It turns out Wikia gives us something extra:

Wikia has inserted an extra link back to itself in the exported text! Don't believe me? Check it out!
How obnoxious! That's at the bottom of every page! As of this writing (Aug 28, 2008) there are 6,854 articles on Teletraan I, the wiki I hail from. Does that mean that if we export their pages, Wikia gets more than six thousand new links back to itself?
No, actually. And there are 3 reasons why;

  • Though there are 6,854 articles on Teletraan I, there are actually more than 25,000 pages. Talk pages, category pages, image pages, image talk pages... it all adds up.

  • And while there are 27,000 pages on Teletraan I, Wikia doesn't just append this message to every page-- it appends it to every revision of every page. That's a quarter million links back to Wikia!

  • And finally, there are two links back to Wikia in each notice. So that's a smooth half a million links back to Wikia, ah ah ah!

And they're in history too, so even if we felt like picking them out of 25,000 pages by hand, we'd never be truly rid of them.

Does this mean Teletraan I is doomed to be shilling for Wikia even after we leave?
No, actually. And there are 3 reasons why;

  • We are not legally obligated to retain their silly ads. Wikia's content is released under GFDL, just like Wikipedia. Wikipedia doesn't have silly messages! (No Mediawiki install does, Wikia added it, custom.) And nothing in Wikia's Terms of use (archive) or extended non-binding policy (archive) requires we retain the links back to them.

    In fact, GFDL allows us to modify any versions of the pages on Teletraan I as we see fit (just as Wikia saw fit to do by adding these notices), including the history, as long as the involvement of the original authors is acknowledged. We are importing complete page histories with the author of each revision credited. Nowhere does GFDL require that we also credit the "web-host" for the content generated. (It'd be like crediting Geocities for Highlander fanfic.) We are not required to credit Wikia at all.

  • The Wikia notices are so poorly thought-out they break things. Though Namespace:10 (templates) was excluded, all other pages get the notice, which means, for example, the transcluded navigation used for Rodimus Prime's spirit guides ends up with the notice (and a bunch of ugly white space) repeated at the top of the article instead of the bottom.

    Simply wrapping the whole thing in a <NOINCLUDE> would have prevented the problem- but this is typical of Wikia... they think something through far enough to create a problem, but not far enough to prevent one.

  • Finally, and most important: I don't want them there. I am an egotist, so my wishes trump the other two concerns, and I shall have my way.

Put another way: Our GFDL is GTFO. And Wikia's BS ain't gonna stop us.

Normally I'd be wary of announcing weak points in someone's Terms of Service before the Airlift Out actually begins... but the user-generated article content was released to them under GFDL, which requires that all derived work and future versions also be GFDL. Wikia cannot change their Terms of Use to place their content under a more restrictive license without losing their own rights to use that content!
They're stuck up a legal tree that old walled-garden service providers are not... if a crowdsourcing community fumbles relations with its users, the users can now move somewhere else without having to re-build everything they left behind-- we can take it with us, which is both cool and totally necessary to the future of the internet. The web without open export is not the web-- it's just a fleeting SITE that will eventually be lost, even if it takes 20 years. 20 years of anything vanishing overnight makes an impact. Content has to be unshackled or sooner-or-later huge chunks of our collective knowledgebase will simply start winking out of existence.

The tech part

The Problem: Wikia has embedded unsightly spam (that can break things!) in their page exports! >:(

The Solution: Clean each revision at the point-of-import, a quarter million times.

Code changes live in:
=====CODE CHANGES======

=====NEW FILES=====

In includes/SpecialImport.php add a line to the beginning of the "importRevision()" function on line ~500.
if ( is_file('transition/tt1_revision_cleaner.php') ) include('transition/tt1_revision_cleaner.php');

The function will now look like this;
function importRevision( &$revision ) {
    // Optional hook: our cleaner runs only if the file is present.
    if ( is_file('transition/tt1_revision_cleaner.php') ) include('transition/tt1_revision_cleaner.php');
    $dbw =& wfGetDB( DB_MASTER );
    $dbw->deadlockLoop( array( &$revision, 'importOldRevision' ) );
    // (rest of the function is unchanged)

Strictly speaking, you could just stick the necessary code here, but I prefer to isolate it. Including the is_file test means you can just delete or rename the file when it's no longer useful (or is causing problems) and the MW code will continue to function without a hiccup.
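
If you want to convince yourself the guard behaves as described before touching MediaWiki, here is a minimal sketch you can run on its own. Everything in it is a stand-in: demo_import_revision() and the temp-file hook are made-up names for illustration, not MediaWiki code.

```php
<?php
// Stand-in for importRevision(): runs an optional hook file if it exists.
// The function name and $hookPath are hypothetical, for illustration only.
function demo_import_revision( $revision, $hookPath ) {
    // An included file executes in this function's scope, so it can see
    // and modify $revision directly.
    if ( is_file( $hookPath ) ) include( $hookPath );
    return $revision->text;
}

// Write a throwaway hook file that strips the marker "SPAM".
$hookPath = sys_get_temp_dir() . '/demo_revision_cleaner_' . uniqid() . '.php';
file_put_contents( $hookPath, '<?php $revision->text = str_replace( "SPAM", "", $revision->text );' );

$rev = new stdClass;
$rev->text = 'Hello SPAMworld';
$withHook = demo_import_revision( $rev, $hookPath );     // hook present: text is cleaned

unlink( $hookPath );      // "delete the file when you're done"
clearstatcache();         // make sure is_file() sees the deletion

$rev2 = new stdClass;
$rev2->text = 'Hello SPAMworld';
$withoutHook = demo_import_revision( $rev2, $hookPath ); // hook gone: text untouched, no error
```

With the hook file in place the text comes back cleaned; once the file is gone, the same call just passes the text through.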

In transition/tt1_revision_cleaner.php our code will look like this:

<?php
// Included from inside importRevision(), so $revision is the revision being imported.
$find = '<div id="wikia-credits"><br /><br /><small>From [ Teletraan I: The Transformers Wiki], a [ Wikia] wiki.</small></div>';
$revision->text = str_replace($find, '', $revision->text);

In this case, we're searching for Wikia's offending inclusion and simply replacing it with nothing, in every one of the quarter-million history states as they are imported. (If Wikia changes the text of its notice, this will no longer work, so just double-check before you actually perform an import and adjust accordingly.)
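
If you'd rather not depend on the exact wording of the notice, one alternative (not what I'm running, just a sketch) is to match on the wrapper div instead. This assumes the notice always lives in a <div id="wikia-credits"> with no nested divs inside it. The helper name tt1_strip_wikia_credits() and the example.com URLs are made up for the demo.

```php
<?php
// Hypothetical helper (not part of MediaWiki): strips any
// <div id="wikia-credits">...</div> block, whatever text it contains.
function tt1_strip_wikia_credits( $text ) {
    // s: let "." match newlines; i: ignore case; ".*?" stops at the first </div>.
    // Assumes the notice never contains a nested <div>.
    $cleaned = preg_replace( '#<div id="wikia-credits">.*?</div>#si', '', $text );
    // preg_replace() returns null on error; fall back to the untouched text.
    return $cleaned === null ? $text : $cleaned;
}

// Placeholder URLs: the real ones depend on the wiki being exported.
$sample = "Article body.\n"
    . '<div id="wikia-credits"><br /><br /><small>From [http://old-wiki.example.com/ Some Wiki], '
    . 'a [http://host.example.com/ Host] wiki.</small></div>';

$cleaned = tt1_strip_wikia_credits( $sample );
```

One caveat if you borrow this: the cleaner file gets include()d once per revision, so declaring a function inside it would blow up on the second revision with a "cannot redeclare" error. Either put the preg_replace() call straight in the cleaner file, or wrap the declaration in a function_exists() check.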

$revision is MediaWiki's WikiRevision object. Every history state is put into this object (and run through our cleaner) prior to importing.

Here is an outline of the object and its attributes. (This level of detail is completely unnecessary for the task, but maybe someone will find it useful.)
WikiRevision Object
    [title] => Title Object
            [mTextform] => Page Title
            [mUrlform] => Page_Title
            [mDbkeyform] => Page_Title
            [mNamespace] => 0        //Namespace #.
            [mInterwiki] => //Purpose obscure: seems to always be blank
            [mFragment] =>  //Purpose obscure: seems to always be blank
            [mArticleID] => 1484  //index# set by the importing software
            [mLatestID] =>  //Purpose obscure: seems to always be blank
            [mRestrictions] => Array
                    [edit] => Array
                            [0] => 

                    [move] => Array
                            [0] => 


            [mRestrictionsLoaded] => 1
            [mPrefixedText] => Page Title  //Includes namespace prefix.  EXAMPLE: Talk:Page Title
            [mDefaultNamespace] => 0
            [mWatched] => 

    [id] => 220508  //Revision ID# from the exporting wiki (set by the XML file)
    [timestamp] => 20080802211226
    [user] => 0     //I think this value is the internal User ID on my (importing) MediaWiki install.  In this case, since "User:John Bravo" does not exist on my wiki, it's 0.  (Which causes "John_Bravo"'s edits, when viewed in history, to direct to the Special:Contributions page, just like an IP user's.  If you import a user whose )
    [user_text] => John_Bravo
    [text] => [PAGE CONTENTS]  //This is probably the one you're looking for!
    [comment] => Revision Comment  //Obvious
    [minor] => 1  //Boolean
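
For dry-running a cleaner without a real import, a stand-in object with the same field names is enough. This is only a sketch: the stdClass objects below mirror a few fields from the dump above and are not MediaWiki's actual WikiRevision or Title classes, and tt1_describe_revision() is a made-up helper.

```php
<?php
// Stand-in objects mirroring a few fields from the dump above.
// NOT MediaWiki's WikiRevision/Title classes, just enough shape for a dry run.
$title = new stdClass;
$title->mNamespace    = 1;
$title->mPrefixedText = 'Talk:Page Title';

$revision = new stdClass;
$revision->title     = $title;
$revision->id        = 220508;
$revision->timestamp = '20080802211226';
$revision->user_text = 'John_Bravo';
$revision->text      = 'Some talk page text.';
$revision->comment   = 'Revision Comment';
$revision->minor     = 1;

// Hypothetical helper: one log line per revision, handy while testing a cleaner.
function tt1_describe_revision( $revision ) {
    return sprintf(
        '[ns %d] %s r%d by %s at %s (%d bytes)',
        $revision->title->mNamespace,
        $revision->title->mPrefixedText,
        $revision->id,
        $revision->user_text,
        $revision->timestamp,
        strlen( $revision->text )
    );
}

$line = tt1_describe_revision( $revision );
```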

That's pretty much it as far as clearing Wikia's export-spam goes. In Teletraan I's case, we do have one or two other things to tidy, though.

A couple users expressed concerns about non-wiki links. Instead of [[Lunchables Brigade]], we do have a few links to [], HTML-style links. They're few and far between, usually inserted by inexperienced users... but now they're a nuisance since they point to wikia's server, not ours. (We also have a few links pointing to 'last good' history states on template pages that suffer the same problem. Also CSS that points to

Basically, I'm adding the following code as well;
$revision->text = str_replace('', '', $revision->text);
$revision->text = str_replace('', '', $revision->text);
$revision->text = str_replace('', '', $revision->text);

You might want to apply separate fixes depending on what subdirectory your MW install is going to live in, images etc. As always, YMMV.
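
If you end up with more than a couple of these, a lookup table keeps them in one place. A sketch only: every hostname and path below is a placeholder (old-wiki.example.com, new-wiki.example.org), and tt1_rewrite_links() is a made-up helper, so substitute whatever your old and new installs actually use.

```php
<?php
// Placeholder values: the real old host, new host and paths depend on your wikis.
$tt1_link_map = array(
    'http://old-wiki.example.com/wiki/'     => 'http://new-wiki.example.org/wiki/',
    'http://old-wiki.example.com/index.php' => 'http://new-wiki.example.org/index.php',
    'http://images.old-wiki.example.com/'   => 'http://new-wiki.example.org/images/',
);

// Hypothetical helper: apply every old => new prefix swap in one pass.
function tt1_rewrite_links( $text, array $map ) {
    // strtr() with an array tries the longest keys first and never re-scans
    // text it has already replaced, so the rules cannot trip over each other.
    return strtr( $text, $map );
}

$before = 'See [http://old-wiki.example.com/wiki/Lunchables_Brigade the brigade].';
$after  = tt1_rewrite_links( $before, $tt1_link_map );
```

In the cleaner file itself you'd keep just the array and a single strtr() call on $revision->text.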

Finally- if you're exporting/importing just the MOST RECENT history state (instead of the entire history), you might want to link back to Wikia with, say, the comment field!
$revision->comment = "Originally from " . $revision->title->mUrlform . " :: " . $revision->comment;

(I don't want to do so, I'm just saying... you might want to.)
You probably don't want to do this if you're importing the complete history, because it would stick the link in every edit-comment in the article's history.
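
If you do go the comment route, it may be worth making the prefix idempotent, so that importing the same dump twice doesn't stack credits. A small sketch; tt1_credit_comment() is a made-up helper and the prefix wording is just the one from the line above.

```php
<?php
// Hypothetical helper: prefix an edit comment with its origin, at most once.
function tt1_credit_comment( $comment, $pageUrlform, $prefix = 'Originally from ' ) {
    $credit = $prefix . $pageUrlform . ' :: ';
    // Already credited (e.g. the same dump imported twice)? Leave it alone.
    if ( strpos( $comment, $credit ) === 0 ) {
        return $comment;
    }
    return $credit . $comment;
}

$once  = tt1_credit_comment( 'fixed typo', 'Page_Title' );
$twice = tt1_credit_comment( $once, 'Page_Title' );   // unchanged on the second pass
```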

Once you've completed your import you can delete this file- and things will operate as if you'd never made any code changes at all!

A word-
I think I've already made clear how much I disapprove of Wikia's export spam, but I just want to make this clear... though Wikia would doubtless claim it was targeted at sites that re-present their content... it's not.
Wikia's export spam was targeted at exporters, not re-presenters. Re-presenters don't use the export function (which can cause endless trouble with templates they don't have); they use &action=render, which yields a 'flattened' HTML version of the page in question.
This spam was targeted at people who wanted to export pages in order to leave Wikia: to make it harder to do so, and to force them to link back to Wikia if they did. It was coded to exclude Namespace:10 (Templates), but not forum pages, community portal pages, talk pages, or user pages? Wikia went out of its way to do this. And ain't nobody gonna be re-presenting my User Talk page.

But gosh, it sure makes it harder for us to leave, doesn't it? And when we do- why there's millions of links from us back to Wikia's near-identical content! Links that improve their Google ratings... and harm ours. (Google looks down on re-presented content.)

I leave you with this little nugget from Wikia's own Policy pages. It makes me laugh.
Wikia is not a link farm
     Wikis on Wikia should contain actual content, not just be a link repository pointing to other sites.
