perl rdf odp

How to extract only URLs from Dmoz ODP file (in RDF)


I need only the URLs from the DMOZ/ODP dump, but the file is in RDF. How can I extract all of the URLs from it into a plain text file?

Does anyone know of a script that parses only the URLs out of an RDF file?


Solution

  • Maybe something like this then?

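    #!/usr/bin/perl
    use strict;
    use warnings;
    
    my $file = "kt-content.rdf.u8";
    my @urls;
    
    # Include $! so the reason for a failed open is reported.
    open(my $fh, "<", $file) or die "Unable to open $file: $!\n";
    
    # Scan the file one line at a time.
    while (my $line = <$fh>) {
        # Capture the URL from <ExternalPage about="..."> or <link r:resource="...">.
        if ($line =~ m/<(?:ExternalPage about|link r:resource)="([^"]+)"\/?>/) {
            push @urls, $1;
        }
    }
    
    close $fh;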
    

    And then print the contents of @urls to a text file.
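
    For the writing step, a minimal sketch might look like the following. It reuses the same pattern but writes each URL as soon as it is found rather than collecting everything in @urls first. The sub names, the command-line handling and the script name are all hypothetical, and skipping repeated URLs is an added assumption: the pattern matches both the link and the ExternalPage tags, so a URL that appears under both would otherwise be written twice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the URL found on one line of the dump, or an empty list if the
# line has no <ExternalPage about="..."> or <link r:resource="..."> tag.
sub url_from_line {
    my ($line) = @_;
    return $line =~ m/<(?:ExternalPage about|link r:resource)="([^"]+)"/ ? $1 : ();
}

# Read lines from $in and print each distinct URL to $out, one per line.
# Returns the number of URLs written.
# Assumption: duplicates are unwanted. The %seen hash holds every URL in
# memory, so drop it if the dump is very large and duplicates are acceptable.
sub write_urls {
    my ($in, $out) = @_;
    my %seen;
    my $count = 0;
    while (my $line = <$in>) {
        my ($url) = url_from_line($line);
        next unless defined $url;   # no URL on this line
        next if $seen{$url}++;      # already written
        print {$out} "$url\n";
        $count++;
    }
    return $count;
}

# Usage (script name is hypothetical):
#   perl extract_urls.pl kt-content.rdf.u8 urls.txt
# With no arguments nothing runs, so the subs above can be loaded and reused.
if (@ARGV == 2) {
    my ($file, $output) = @ARGV;
    open(my $in,  "<", $file)   or die "Unable to open $file: $!\n";
    open(my $out, ">", $output) or die "Unable to write $output: $!\n";
    my $count = write_urls($in, $out);
    close $in;
    close $out or die "Unable to close $output: $!\n";
    print "Wrote $count URLs to $output\n";
}
```

    This is still line-based pattern matching, not real RDF parsing, so it only works while each tag sits on a single line, as the original loop also assumes.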