rdataframesubstringgsub

extracting the instances of numeric value in dataframe


In each of the dataframe, I have a row entry that looks like:

[{'degree': 0.1, 'name': 'Colins Dental Clinic', 'rating': 9, 'level': 'fourth'}]

It is a direct import from read.csv of a file.

I would like to create two columns from each entry that extracts two numeric values, from above, 0.1, and 9, which are degree and rating information of a dental clinic.

I tried substring and using sep="," with quotes, but I am just not getting things correctly.

How do I do it correctly?


Solution

  • It looks like JSON data not correctly formated (because of the single quotes).
    You didn't provide a sample of your dataframe but you can try something like this (2 options) :

    ### Packages
    library(dplyr)
    library(stringr)
    library(jsonlite)
    
    ### Data
    df=data.frame(col1="dummy",
                  col2="dummy",
                  col3="[{'degree': 0.1, 'name': 'Colins Dental Clinic', 'rating': 9, 'level': 'fourth'}]")
    
    ### Option 1 : Transform your column data into a valid JSON then extract data from it
    df1=df %>% mutate(col3=str_replace_all(col3,"'","\\\""),
                      col3=str_replace_all(col3,"(?<=: )None(?=[,}])","\\\"None\\\""),
                      degree=fromJSON(col3)$degree,
                      rating=fromJSON(col3)$rating)
    
    ### Option 2 : Use regex to get directly what you need (less secure)
    df2=df %>% mutate(degree=str_extract(col3,"(?<=degree': )\\d\\.\\d"),
                     rating=str_extract(col3,"(?<=rating': )\\d"))
    

    Output :

    Output