
Removing duplicate triples from an RDF file


Should I remove duplicate triples from my RDF file? For example, I have these blocks within a file:

<http://Group/row1>
    vocab:regione Campania ;
    vocab:nome Napoli ;
    vocab:codice NA .

and

<http://Group/row1>
    vocab:nome Napoli ;
    vocab:codice NA .

Every triple in the second block also appears in the first block. Should the second block be removed from the file?


Solution

  • RDF is a graph-based representation, and a graph (in this sense) is a set of edges. Sets, by definition, don't have duplicate elements. Of course, a specific serialization of an RDF graph could write the same triple more than once, and there might be reasons that you would want to avoid that. As a note about terminology, each of your blocks is several triples rather than one. The first block is actually three triples:

    group:row1  vocab:codice   "NA" .
    group:row1  vocab:nome     "Napoli" .
    group:row1  vocab:regione  "Campania" .
    

    and the second block is actually two triples:

    group:row1  vocab:codice   "NA" .
    group:row1  vocab:nome     "Napoli" .
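
To make the set semantics concrete, here is a minimal Python sketch that models each block as a set of (subject, predicate, object) tuples. Plain strings stand in for real RDF terms, which is an assumption made purely for illustration; no RDF library is involved.

```python
# Model each block as a set of (subject, predicate, object) tuples.
# Plain strings stand in for IRIs and literals (illustrative assumption).
first_block = {
    ("group:row1", "vocab:regione", "Campania"),
    ("group:row1", "vocab:nome", "Napoli"),
    ("group:row1", "vocab:codice", "NA"),
}
second_block = {
    ("group:row1", "vocab:nome", "Napoli"),
    ("group:row1", "vocab:codice", "NA"),
}

# Merging the blocks is a set union, so repeated triples collapse.
graph = first_block | second_block

print(len(graph))                   # 3
print(second_block <= first_block)  # True: the second block adds nothing
print(graph == first_block)         # True
```

Because the second block is a subset of the first, the merged graph is exactly the first block.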
    

    At any rate: (i) it shouldn't actually be a problem that the same triples are represented multiple times in your data; (ii) if you want to remove the repetition, then reading in the graph (with just about any RDF processing tool) and writing it out again should give you a representation without duplicated information. For instance, suppose you have the following as data.rdf:

    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:group="http://stackoverflow.com/q/23241612/1281433/group/"
        xmlns:vocab="http://stackoverflow.com/q/23241612/1281433/vocab/">
      <rdf:Description rdf:about="http://stackoverflow.com/q/23241612/1281433/group/row1">
        <vocab:regione>Campania</vocab:regione>
        <vocab:nome>Napoli</vocab:nome>
        <vocab:codice>NA</vocab:codice>
      </rdf:Description>
      <rdf:Description rdf:about="http://stackoverflow.com/q/23241612/1281433/group/row1">
        <vocab:nome>Napoli</vocab:nome>
        <vocab:codice>NA</vocab:codice>
      </rdf:Description>
    </rdf:RDF>
    

    Here's what you get when you read it in with Jena's rdfcat and write it out again:

    $ rdfcat data.rdf
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:group="http://stackoverflow.com/q/23241612/1281433/group/"
        xmlns:vocab="http://stackoverflow.com/q/23241612/1281433/vocab/">
      <rdf:Description rdf:about="http://stackoverflow.com/q/23241612/1281433/group/row1">
        <vocab:regione>Campania</vocab:regione>
        <vocab:nome>Napoli</vocab:nome>
        <vocab:codice>NA</vocab:codice>
      </rdf:Description>
    </rdf:RDF>
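
The same read-then-write round trip can be sketched without Jena. The Python sketch below uses only the standard library's xml.etree.ElementTree and handles only the flat shape used in data.rdf above (rdf:Description elements with an rdf:about attribute and plain-literal property children). It is not a general RDF/XML parser, and the function names read_triples and to_ntriples are hypothetical. For real data, a proper RDF library is the safer choice.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Same content as data.rdf above, inlined so the sketch is self-contained.
DATA = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:vocab="http://stackoverflow.com/q/23241612/1281433/vocab/">
  <rdf:Description rdf:about="http://stackoverflow.com/q/23241612/1281433/group/row1">
    <vocab:regione>Campania</vocab:regione>
    <vocab:nome>Napoli</vocab:nome>
    <vocab:codice>NA</vocab:codice>
  </rdf:Description>
  <rdf:Description rdf:about="http://stackoverflow.com/q/23241612/1281433/group/row1">
    <vocab:nome>Napoli</vocab:nome>
    <vocab:codice>NA</vocab:codice>
  </rdf:Description>
</rdf:RDF>"""


def read_triples(rdf_xml):
    """Collect (subject, predicate, literal) triples into a set."""
    triples = set()
    root = ET.fromstring(rdf_xml)
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree names tags as "{namespace}local"; joining the two
            # parts gives the predicate IRI.
            namespace, local = prop.tag[1:].split("}")
            triples.add((subject, namespace + local, prop.text))
    return triples


def to_ntriples(triples):
    """Write one line per distinct triple, sorted for stable output.

    Literal escaping is omitted, which is fine for this sample data only.
    """
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in sorted(triples))


triples = read_triples(DATA)
print(len(triples))  # 3
print(to_ntriples(triples))
```

The input mentions five property elements, but the set holds three triples, so the serialized output contains each statement once.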