duplicatesdata-cleaningopenrefine

OpenRefine: How to create a unique row for each input in a column ( dilneated by comma)


I have a bunch of product data to clean prior to entry into a database that looks like this:

COL A COL B COL C... "N"
Option 1 A, B, C, D Option 1 attribute
Option 2 C, D, F Option 2 attribute
Option 3 D, J, Z Option 3 attribute

And I'd like for it to look like this with a unique row for every unique product option:

COL A COL B COL C... "N"
Option 1 A Option 1 attribute
Option 1 B Option 1 attribute
Option 1 C Option 1 attribute
Option 1 D Option 1 attribute
Option 2 C Option 2 attribute
Option 2 D Option 2 attribute
Option 2 F Option 2 attribute
Option 3 D Option 3 attribute
Option 3 J Option 3 attribute
Option 3 Z Option 3 attribute

I understand how I could do this with a python script, but I am already using OpenRefine, and I am hoping not to involve a whole new process to my data flow.

Is there an easy way to do this in OpenRefine? I am having a hard time finding a method or extensions for something like this.

Thanks!

EDIT

@magdmartin How can you fill down blank cells using delineated values from the first cell?

COL A COL B COL C... "N"
Option 1 A,B,C,D Option 1 attribute
Option 1 Option 1 attribute
Option 1 Option 1 attribute
Option 1 Option 1 attribute
Option 2 C,D,F Option 2 attribute
Option 2 Option 2 attribute
Option 2 Option 2 attribute
Option 3 D,J,Z Option 3 attribute
Option 3 Option 3 attribute
Option 3 Option 3 attribute

Turned into

COL A COL B COL C... "N"
Option 1 A Option 1 attribute
Option 1 B Option 1 attribute
Option 1 C Option 1 attribute
Option 1 D Option 1 attribute
Option 2 C Option 2 attribute
Option 2 D Option 2 attribute
Option 2 F Option 2 attribute
Option 3 D Option 3 attribute
Option 3 J Option 3 attribute
Option 3 Z Option 3 attribute

Thanks!


Solution

  • I recorded a video here walking through each options describe below here: https://youtu.be/3194zXoJtqI

    For this project, you will need to use two OpenRefine functions

    If you have a lot of columns you can use the All > Transform to speed up the process with the following expression row.record.cells[columnName].value[0]. The trick here is to fill down Col A last so we can keep the record mode when filling down other column (see screenshot below)

    enter image description here