openrefine clusterize

OpenRefine: text facet by counting


I have a huge file primarily composed of book metadata (author, title, date, url). I want to work with the author names, which are often repeated (one author can have hundreds of records), but only with the subset of authors that have more than X records.

For example, I have 200 records related to "William Shakespeare", but only 1 record for "John Black", etc. Since this is a classic power-law distribution, I have hundreds of thousands of authors, the majority of them with 1-2 records.

Using "Text facet" > "count" is not feasible, because my computer freezes.

Is there an expression that builds the text facet for only some records, based on how often their value occurs?


Solution

  • Create a custom text facet with the following GREL expression (replace COLUMN_NAME with your actual column name):

    facetCount(value, "value", "COLUMN_NAME") > 100

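    You can adjust the comparison; this example matches every value whose count is greater than 100. Because the expression returns a boolean, the facet offers "true" and "false" choices; select "true" to keep only the matching rows.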

    To display only exact count matches, use the double equals sign (==), like this:

    facetCount(value, "value", "COLUMN_NAME") == 100

    More details in this video and tutorial on faceting by facet count.
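To illustrate what the facetCount comparison computes, here is a minimal Python sketch that runs outside OpenRefine. It is not OpenRefine's implementation; the function name facet_count_filter and the sample data are hypothetical, and the threshold of 2 stands in for the 100 used above. For each row it counts how many rows share the same value and then applies the comparison, which is what yields the true/false choices in the facet.

```python
from collections import Counter


def facet_count_filter(values, threshold):
    """For each row, report whether its value occurs more than `threshold` times.

    Hypothetical helper mirroring: facetCount(value, "value", "COLUMN_NAME") > threshold
    """
    counts = Counter(values)  # one pass to count occurrences of each value
    return [counts[v] > threshold for v in values]


# Illustrative data only: three rows for one author, one row for another.
authors = ["William Shakespeare"] * 3 + ["John Black"]

flags = facet_count_filter(authors, 2)
kept = [a for a, keep in zip(authors, flags) if keep]

print(flags)
print(sorted(set(kept)))
```

Rows flagged True correspond to the rows that remain once "true" is selected in the custom facet.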