Tags: r, language-detection

Detecting and Retrieving text from a column based on language model in R


I am using googleLanguageR to automatically detect the language of a text column in a data frame. For a particular sentence, I do the following:

library(googleLanguageR)
gl_auth("credential.json")

gl_translate_detect(df[[45, 'text']])

where text is a column in the data frame df, 45 is the row number for which I want to detect the language, and "credential.json" is a private API key file for the Google Cloud API.

This gives me the corresponding detected language as output. However, I want to apply it to the entire text column, which contains a mix of English and German texts, and separate the rows by language.

I tried the following:

gl_translate_detect(df[['text']])

But this gives me:

Error in nchar(string) : invalid multibyte string, element 13

My idea is to feed the whole column as a corpus and detect the underlying language of each row of the data frame.
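(Note: the "invalid multibyte string" error usually indicates an encoding problem in one of the strings, here element 13, rather than a problem with the detection itself. A possible workaround, sketched below with base R's iconv, is to re-encode the column to valid UTF-8 first, dropping any bytes that cannot be converted:)

    # Re-encode the text column to UTF-8; sub = "" drops invalid bytes
    # so nchar() and downstream functions no longer fail.
    df$text <- iconv(df$text, from = "", to = "UTF-8", sub = "")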


Solution

  • gl_translate_detect may not be vectorized over a character vector. We can apply it row by row with rowwise:

    library(dplyr)
    df %>%
       rowwise %>%
       # extract the detected language code; fall back to NA on error
       mutate(out = tryCatch(gl_translate_detect(text)$language[1],
         error = function(e) NA_character_))
    

    Or use lapply to loop over each element of the 'text' column and apply the function:

    lapply(df$text, gl_translate_detect)
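
    Since the goal is to separate the English and German rows, the per-row results can then be used to split the data frame. The following is a sketch, assuming gl_translate_detect() returns a result with a language column containing ISO codes such as "en" and "de", and that authentication has already been done with gl_auth():

    # Detect the language of each row, falling back to NA on error
    df$lang <- vapply(df$text, function(x) {
      tryCatch(gl_translate_detect(x)$language[1],
               error = function(e) NA_character_)
    }, character(1))

    # Split into separate data frames by detected language
    english_df <- df[!is.na(df$lang) & df$lang == "en", ]
    german_df  <- df[!is.na(df$lang) & df$lang == "de", ]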