r, loops, readability, korpus

Calculate readability scores for several files with R


I would like to calculate readability scores in R 3.3.2 (RStudio 3.4 on Windows) using the koRpus package for several .txt files and save the results to Excel, SQLite3, or a text file. Right now I can only calculate the scores for a single file and print them to the console. I tried to extend the code with a loop over a directory, but it does not work correctly.

library(koRpus)
library(tm)

#Loop through files
path = "D://Reports"
out.file<-""
file.names <- dir(path, pattern =".txt")
for(i in 1:length(file.names)){
  file <- read.table(file.names[i],header=TRUE, sep=";", stringsAsFactors=FALSE)
  out.file <- rbind(out.file, file)
}

#Only one file
report <- tokenize(txt =file , format = "file", lang = "en")

#SMOG-Index
results_smog <- SMOG(report)
summary(results_smog)

#Flesch/Kincaid-Index
results_fleshkin <- flesch.kincaid(report)
summary(results_fleshkin)

#FOG-Index
results_fog<- FOG(report)
summary(results_fog)

Solution

  • I ran into the same problem and found your post while searching Stack Overflow for a solution. After some trial and error, I came up with the following code, which worked fine for me. I stripped out all the extra info. To find the index values of the scores I was looking for, I first ran it on one file and pulled the summary of the readability wrapper. That gives you a table with a bunch of different values. Match the row for the index you want with the raw column and you get the specific position to use in the code. There are lots of different indices to choose from.

    In the path directory, each document should be its own separate text file.

    #Path
    path="C:\\Users\\Philipp\\SkyDrive\\Documents\\Thesiswork\\ReadStats\\"
    
    #list text files (match only names ending in .txt) and show how many were found
    ll.files <- list.files(path = path, pattern = "\\.txt$", full.names = TRUE)
    length(ll.files)
    
    #set vectors
    SMOG.score.vec=rep(0.,length(ll.files))
    FleshKincaid.score.vec=rep(0.,length(ll.files))
    FOG.score.vec=rep(0.,length(ll.files))
    
    #loop through each file
    for (i in seq_along(ll.files)){
      #tokenize
      tagged.text <- koRpus::tokenize(ll.files[i], lang="en")
      #hyphenate the words, since some of the indices need syllable counts
      hyph.txt.en <- koRpus::hyphen(tagged.text)
      #Readability wrapper
      readbl.txt <- koRpus::readability(tagged.text, hyphen=hyph.txt.en, index="all")
      #Pull scores, convert to numeric, and update the vectors
      #(the row positions come from the summary table; check them for your koRpus version)
      scores <- summary(readbl.txt)
      SMOG.score.vec[i]=as.numeric(scores$raw[36]) #SMOG score
      FleshKincaid.score.vec[i]=as.numeric(scores$raw[11]) #Flesch Reading Ease score
      FOG.score.vec[i]=as.numeric(scores$raw[22]) #FOG score
      #progress message every 10 files
      if (i%%10==0)
        cat("finished",i,"\n")
    }
    
    #if you want a single combined csv
    df=cbind(FOG.score.vec,FleshKincaid.score.vec,SMOG.score.vec)
    colnames(df)=c("FOG", "Flesch Kincaid", "SMOG")
    #write.csv always writes the header itself; passing col.names is ignored with a warning
    write.csv(df,file=paste0(path,"Combo.csv"),row.names=FALSE)
    
    #if you want separate csvs (wrap each vector in a data frame to name its column)
    write.csv(data.frame(SMOG=SMOG.score.vec),file=paste0(path,"SMOG.csv"),row.names=FALSE)
    write.csv(data.frame(FOG=FOG.score.vec),file=paste0(path,"FOG.csv"),row.names=FALSE)
    write.csv(data.frame("Flesch Kincaid"=FleshKincaid.score.vec,check.names=FALSE),file=paste0(path,"FK.csv"),row.names=FALSE)
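
The hard-coded row positions (36, 11, 22) can shift between koRpus versions, and the score vectors lose track of which file each score came from. A possible variation, sketched here in base R only, looks scores up by index name and keeps one row per file. It assumes the summary table has an `index` column next to `raw`, which is worth checking on your own output. `pick_raw`, `score_files` and `toy_summary` are made-up names for illustration, and the numbers in the toy table are placeholders, not real scores.

```r
# Toy stand-in for summary(readbl.txt). The column names `index` and `raw`
# are an assumption about the koRpus summary table; the values are placeholders.
toy_summary <- data.frame(
  index = c("Flesch", "FOG", "SMOG"),
  raw   = c("61.2", "12.4", "11.8"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: raw score for a named index, or NA if the name is absent.
pick_raw <- function(tab, index_name) {
  hit <- which(tab$index == index_name)
  if (length(hit) == 0) return(NA_real_)
  as.numeric(tab$raw[hit[1]])
}

# Hypothetical helper: run score_fun on every file and collect one row per file.
# A file that raises an error gets NA scores instead of stopping the whole loop.
score_files <- function(files, score_fun, wanted = c("SMOG", "FOG", "Flesch")) {
  rows <- lapply(files, function(f) {
    tab <- tryCatch(score_fun(f), error = function(e) NULL)
    scores <- vapply(wanted, function(w) {
      if (is.null(tab)) NA_real_ else pick_raw(tab, w)
    }, numeric(1))
    as.data.frame(c(list(file = basename(f)), as.list(scores)),
                  stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Demo: the toy table stands in for a real koRpus run on each file.
demo <- score_files(c("a.txt", "b.txt"), function(f) toy_summary)
print(demo)
```

With koRpus, `score_fun` would be something like `function(f) summary(koRpus::readability(koRpus::tokenize(f, lang="en"), index="all"))`, and the names in `wanted` would have to match whatever labels your summary table actually shows. The resulting data frame can go straight into `write.csv(..., row.names=FALSE)`.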