Tags: html, xml, perl, cpan, wikitext

How to use MediaWiki::DumpFile to convert Wikipedia XML dump to HTML?


The documentation page for MediaWiki::DumpFile shows the following code:

  use MediaWiki::DumpFile;

  my $mw = MediaWiki::DumpFile->new;

  # iterate over the rows of a SQL dump file
  my $sql = $mw->sql($filename);  # from a file name
  $sql = $mw->sql(\*FH);          # or from an open file handle

  # iterate over the pages of an XML dump file
  my $pages = $mw->pages($filename);
  $pages = $mw->pages(\*FH);

  # faster than pages(), but yields only page titles and text
  my $fastpages = $mw->fastpages($filename);
  $fastpages = $mw->fastpages(\*FH);

  use MediaWiki::DumpFile::Compat;

  # compatibility interface mimicking the older Parse::MediaWikiDump
  my $pmwd = Parse::MediaWikiDump->new;

I'm completely new to Perl and don't know what to do with $fastpages to save all of the pages (as HTML or plain text, it doesn't matter) from an XML dump. Can you help me? And what is *FH?


Solution

  • I haven't used it, but the documentation for MediaWiki::DumpFile::FastPages has the following example for printing the title and text of each article in a dump file:

    use strict;
    use warnings;
    use MediaWiki::DumpFile::FastPages;
    
    my $pages = MediaWiki::DumpFile::FastPages->new($file);    # from a file name
    # my $pages = MediaWiki::DumpFile::FastPages->new(\*FH);   # or from a file handle
    
    # next() returns the title and wikitext of each page, and an
    # empty list once the end of the dump is reached
    while (my ($title, $text) = $pages->next) {
      print "Title: $title\n";
      print "Text: $text\n";
    }
    

    This will write everything to stdout. When you create the MediaWiki::DumpFile::FastPages object, you can pass either a file name, e.g.

    my $file = "/path/to/dump/file";
    my $pages = MediaWiki::DumpFile::FastPages->new($file);
    

    or a reference to a file handle, e.g.

    open FH, "<", "/path/to/dump/file" or die "Failed to open file: $!";
    my $pages = MediaWiki::DumpFile::FastPages->new(\*FH);
    

    To answer your other question: FH is a bareword file handle, and *FH is its typeglob, i.e. the symbol-table entry named FH. Writing \*FH takes a reference to that typeglob, which is the conventional way to pass a bareword file handle to a function.
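
    Note that the $text you get back is raw wikitext, not HTML; rendering it as HTML requires a separate wikitext formatter (for example, Text::MediawikiFormat on CPAN). If saving the raw text of each page to its own file is enough, the loop above only needs a file write per page. Below is a minimal sketch; the dump path and output directory are placeholders, and the title sanitization is deliberately crude, since titles can contain characters that are not valid in file names:

    use strict;
    use warnings;
    use MediaWiki::DumpFile::FastPages;
    
    my $dump   = 'wikipedia-dump.xml';   # placeholder: path to your dump
    my $outdir = 'out';                  # placeholder: output directory
    mkdir $outdir unless -d $outdir;
    
    my $pages = MediaWiki::DumpFile::FastPages->new($dump);
    
    while (my ($title, $text) = $pages->next) {
      # Titles may contain characters that are illegal in file names;
      # replace anything outside a safe set with an underscore
      (my $name = $title) =~ s/[^\w.-]/_/g;
    
      open my $out, '>:encoding(UTF-8)', "$outdir/$name.txt"
        or die "Cannot write $outdir/$name.txt: $!";
      print {$out} $text;
      close $out;
    }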