openrefinegrel

OpenRefine text transform unique() ignoring case


Is there a way to tell uniques() to ignore case?

I have a GREL that runs like

forEach(value.split(","),v,v.trim()).uniques().join(",")

This takes each value in the cell seperated by commas, and then spits out the unique value/s in that cell. Works great however if I have a cell that contains Paul, PAUL it will return both rather than just 'Paul'.

Is it possible to cast all the values temporarily to uppercase to compare and then return the first spelling you had of a unique?


Solution

  • Checking I understand the problem - for example if you start with:

    Paul,PAUL,Edward,edward

    Am I right to assume you want to end up with:

    Paul,Edward

    Assuming I've understood this, then I think the way I'd approach this is to create the unique values ignoring case, then look back to the original values and extract the first one that matches a particular key.

    So something like: Duplicate the data into a new column called 'keys' using "Edit column->Add column based on this column" with the GREL transformation:

    forEach(value.split(","),v,v.trim().toLowercase()).uniques().join(",")
    

    Now you have:

    | Col1                    | keys        |
    |-------------------------|-------------|
    | Paul,PAUL,Edward,edward | paul,edward |
    

    You can now iterate through the values in the keys column, and find the first value in Col1 which would convert to that key using the same transformation:. To do this on the 'key' column you can use the transformation:

    forEach(value.split(","),v,filter(cells["Col1"].value.split(","),w,w.trim().toLowercase()==v)[0]).join(",")
    

    That should leave you with

    | Col1                    | keys        |
    |-------------------------|-------------|
    | Paul,PAUL,Edward,edward | Paul,Edward |
    

    Of course it is worth noting that if your original data is in a different order you'd get a different final value - e.g. "PAUL,Paul,Edward,edward" would end up with "PAUL,Edward". It's possible to do some more work to improve this, but there are going to be limits.