openrefinemerging-data

In OpenRefine, how to import multiple selected columns at once from another project using GREL or Python


I have a second tsv file (Project2) with several columns out of which I want to import only a couple of selected ones into my existing project in OpenRefine. In some previous releases of OpenRefine, there was the option of "import from another project" but in recent releases this has apparently been decommissioned. I need a GREL or Python script to be able to do this operation for as many columns as I desire.

Project1

Column A Column B
id_1 value1
id_2 value2

Project2

Column C Column D Column E Column F
id_1 value_x value_z value_u
id_2 value_y value_v value_t

Expected merged project

Column A Column B Column E Column F
id_1 value1 value_z value_u
id_2 value2 value_v value_t

I found a solution here but it works for importing only one column.


Solution

  • There are currently several ways of merging data from different project in OpenRefine.

    The function in OpenRefine is to access data in other projects is called cross.

    Import a small number of columns

    So if you only want to "import" a small number of columns from another project you can use it like this:

    1. In "Project1" on "Column A" use "Edit column" => "Add column based on this column...". As name choose Column E and as GREL expression cell.cross("Project2", "Column C").cells["Column E"][0].value.
    2. Repeat step for "Column F", and so on...

    Import a bigger number of columns

    If you want to import a bigger number of columns you somehow have to make the names of these columns known to OpenRefine.

    1. In "Project1" on "Column B" use "Edit column" => "Add column based on this column...". As name choose Merge Columns and as GREL expression "Column E||Column F".
    2. On the new column "Merge Columns" use "Edit cells" => "Split multi-valued cells...". As separator choose ||.
    3. On "Column A" use "Edit cells => "Fill down".
    4. On "Column B" use "Edit cells" => "Fill down".
    5. On column "Merge Columns" use "Edit column" => "Add column based on this column...". As name choose "Merged Values" and as GREL expression row.cells["Column A"][0].value.cross("Project2", "Column C").cells[row.cells["Merge Columns"].value][0].value.
    6. On column "Merge Columns" use "Transpose" => "Columnize by key/value columns...". As key column choose "Merge Columns" and as value column choose "Merged Values".

    Update for OpenRefine 3.8

    Starting with OpenRefine 3.8 you can access the column names of another project (see GitHub Issues 5903 and 5633. With this we can import all the columns of the other project and then use Facets and Filters to remove the data we do not want to import.

    1. In "Project1" on "Column A" use "Transpose" => "Transpose cells across columns into rows...". As "From Column" use "Column B", for "Key column" use the value Keys and for "Value column" use the value Values.
    2. On the new column "Keys" use "Edit cells" => "Transform...". As expression use
    with("||", sep,
        with(row.cells["Column A"], id,
            if(isBlank(id), value,
                value + sep + with(id.cross("Project 2", "Column C")[0], mergeRow,
                    mergeRow.columnNames
                ).join(sep)
            )
        )
    )
    
    1. On the new column "Values" use "Edit cells" => "Transform...". As expression use
    with("||", sep,
        with(row.cells["Column A"], id,
            if(isBlank(id), value,
                value + sep + with(id.cross("Project 2", "Column C")[0], mergeRow,
                    forEach(mergeRow.columnNames, c,
                        mergeRow.cells[c].value
                    )
                ).join(sep)
            )
        )
    )
    
    1. On the column "Keys" use "Edit cells" => "Split multi-valued cells...".
    2. On the column "Values" use "Edit cells" => "Split multi-valued cells...".
    3. Use Text facets or filters to remove the data you don't want in the project.
    4. On "Column A" use "Edit cells" => "Fill down".
    5. On column "Keys" use "Transpose" => "Columnize by key/value columns...". As key column choose "Keys" and as value column choose "Values".

    More recipes on how to combine datasets can be found in the OpenRefine Wiki.