regexopenrefinegoogle-refinegrel

Create column in OpenRefine using a logic test on an existing column?


I have an original data set with over 4,500 row entries representing land transactions. One of the existing columns is "Place," and in each cell value I've put the locality name usually as Some Neighborhood, but sometimes when there is a place I don't know, I'll add additional info from the notary records, e.g., Some Neighborhood, Some County or even as Some Neighborhood, Some County (nicknamed).

I want to create two new columns using the info from this one column. The first column would contain the upper-level administrative division, and the second column the administrative division above that -- kinda like zooming out.

I think I would need to create a new column based on existing column using a sort of logical test or if-if-if- test where I can enumerate the values of Some Neighborhood that belong to County 1 or County 2 or City 1, etc. Then, based on this new column, I'd create another that would have State 1, State 2, etc., presumably using the same sort of test.

How would I write this using regular expressions in Open Refine GREL? How can I string together multiple possible values that are found in the cells? I tried using cells and value.contains but I do not know how to string together multiples like this.

So again to recap, I want to script a way to create a new column that contains new cell values based on a column of existing values, something like

"if Street1,Street2, Street5, but not Street3, or Street4, then County1"

OR

"if Street1 or Street2 or Street5, then County1 AND THEN IF Street3 or Street4 then County2"


EDIT: Here's some of the data:

land sale   0.350   carreau 350 gourdes Bullet
land sale   1.000   carreau 700 gourdes Campèche
land sale   0.200   carreau 220 gourdes Bremont
land sale   0.500   carreau 150 gourdes Pierrette
land sale   5.000   carreau 225 gourdes Lagenivrée
land sale   0.125   carreau 200 gourdes Bullet
land sale   1.000   carreau 300 gourdes Tozin
land sale   0.125   carreau 100 gourdes Dufort
land sale   0.250   carreau 135 gourdes Charitte, Savann Brute
land sale   0.500   carreau 300 gourdes Ravines des Roches
land sale   0.500   carreau 80  gourdes Isidore (Nègre Libre)
land sale   0.500   carreau 215 gourdes Nordette (Boures)
land sale   0.250   carreau 200 gourdes Bullet (Morne Montègue)

And here's an example of the expected result with two new columns, let's say "Commune" and "Section":

land sale   0.350   carreau 350 gourdes Limonade    Bwadlans    Bullet
land sale   1.000   carreau 700 gourdes Limonade    Bwadlans    Campèche
land sale   0.200   carreau 220 gourdes Limonade    Bwadlans    Bremont
land sale   0.500   carreau 150 gourdes Limonade    Roucou      Pierrette
land sale   5.000   carreau 225 gourdes Limonade    Roucou      Lagenivrée
land sale   0.125   carreau 200 gourdes Limonade    Bwadlans    Bullet
land sale   1.000   carreau 300 gourdes Quart_Mor   Sablé       Tozin
land sale   0.125   carreau 100 gourdes Limonade    Bwadlans    Dufort
land sale   0.250   carreau 135 gourdes Limonade    Bwadlans    Charitte, Savann Brute
land sale   0.500   carreau 300 gourdes Limonade    Bwadlans    Ravines des Roches
land sale   0.500   carreau 80  gourdes Limonade    Bwadlans    Isidore (Nègre Libre)
land sale   0.500   carreau 215 gourdes Limonade    Bwadlans    Nordette (Boures)
land sale   0.250   carreau 200 gourdes Limonade    Bwadlans    Bullet (Morne Montègue)

Solution

  • There are a number of approaches and the one you choose may depend on how many values you have in the initial Place column.

    I recommend the first approach as a starting point

    Approach 1 - use facets:

    Approach 2 - use a lookup: If you have a mapping of Places -> Commune & Sections already, you can create this as a separate OpenRefine project, then use 'cross' to lookup a Commune/Section for each Place

    Approach 3 - use conditionals: This is I think the approach you asked for in the first place, but I don't think it is necessarily the best option

    Note that to use more than two conditions you have to nest 'or' statements e.g.: or(or(value=="Bullet",value=="Campéche"),value=="Bremont")

    Approaches 2 and 3 can be made easier by installing OpenRefine extensions which help with this process (recommend the VIB-BITS extension for Approach 2, and looking at the gokbutils extension with the 'inArray' function for Approach 3 - which would avoid complex 'or' statements)